
AI Scenario Production Environment Configuration Practice

Last updated: 2025-07-17 17:42:55

Overview

Data Accelerator Goose FileSystem (GooseFS) supports multiple deployment methods, including control plane deployment, TKE cluster deployment, and EMR cluster deployment. In AI scenarios, control plane deployment and TKE cluster deployment are typically used, together with a high-availability architecture to meet business continuity requirements.

The high-availability architecture is an active-standby design with multiple Master nodes. Only one node serves as the primary (Leader) and provides services externally, while the remaining Standby nodes keep the same file system state as the primary by replaying a shared journal. If the primary node fails or goes down, one of the Standby nodes is automatically elected to take over and continue serving requests. This eliminates the system's single point of failure and achieves overall high availability. GooseFS currently supports strongly consistent Master state synchronization through two mechanisms: Raft logs and ZooKeeper. In container scenarios, we recommend building the high-availability architecture on the Raft log mode. This document focuses on Raft-based high-availability deployment configurations and distinguishes between sequential read and random read scenarios.
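As a quick illustration of why Raft-based HA is typically deployed with an odd number of Masters (three in the configurations below): the cluster can elect a leader only while a majority (quorum) of Masters is alive. The sketch below is plain Python arithmetic, not a GooseFS API:

```python
# Raft majority math: a cluster of n masters needs floor(n/2) + 1 live
# nodes to elect a leader, so it tolerates n - quorum failed nodes.
def raft_fault_tolerance(n_masters: int) -> int:
    quorum = n_masters // 2 + 1
    return n_masters - quorum

for n in (1, 3, 4, 5):
    print(n, "masters tolerate", raft_fault_tolerance(n), "failure(s)")
```

Note that 3 masters tolerate 1 failure and 5 tolerate 2, while an even count adds nothing: 4 masters still tolerate only 1 failure, which is why odd Master counts are the norm.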

High-Availability Architecture Deployment Configuration Based on Raft (Sequential Read Scenario)

For sequential read scenarios, use the following recommended configuration; copy it into the goosefs-site.properties file to complete the high-availability setup:
goosefs.master.embedded.journal.addresses=<master1>:9202,<master2>:9202,<master3>:9202

goosefs.master.metastore=ROCKS
# Use HEAP when it is uncertain whether RocksDB is stable
goosefs.master.metastore.block=HEAP

# Size according to available memory
goosefs.master.metastore.inode.cache.max.size=10000000

# RocksDB data storage path
goosefs.master.metastore.dir=/meta-1/metastore

# Mount path for the root directory; place it in a secure directory to prevent accidental deletion
goosefs.master.mount.table.root.ufs=/meta-1/underFSStorage

# Raft log storage path
goosefs.master.journal.folder=/meta-1/journal

# Timeout for triggering Master failover; do not set it too low (JVM GC can cause failover oscillation) or too high (it delays recovery)
goosefs.master.embedded.journal.election.timeout=20s

# Strongly recommended to disable for large data volumes
goosefs.master.startup.block.integrity.check.enabled=false

# Checkpoint trigger interval; do not set it too small (frequent checkpoints prevent the node from participating in leader election while checkpointing) or too large (it slows service restart). Estimate it from the checkpoint loading duration.
goosefs.master.journal.checkpoint.period.entries=20000000
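To pick a value for goosefs.master.journal.checkpoint.period.entries, you can estimate how long a restart would take to replay the journal accumulated since the last checkpoint. The apply rate below is an illustrative assumption, not a measured GooseFS figure:

```python
# Rough replay-time estimate for a restarting master that must re-apply
# the journal entries written since the last checkpoint.
def replay_seconds(entries_since_checkpoint: int, apply_rate_per_sec: int) -> float:
    return entries_since_checkpoint / apply_rate_per_sec

# Assuming ~50,000 entries/s, 20M entries replay in about 400 s.
print(replay_seconds(20_000_000, 50_000))  # 400.0
```

If that restart duration is unacceptable for your workload, lower the period; if checkpointing too often interferes with leader election, raise it.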

# ACL authentication switch; set based on your scenario
goosefs.security.authorization.permission.enabled=false

# Recommended to enable; otherwise the hostname is used, and hostnames may be identical
goosefs.network.ip.address.used=true

# Worker properties
goosefs.worker.tieredstore.levels=1
goosefs.worker.tieredstore.level0.alias=HDD
goosefs.worker.tieredstore.level0.dirs.quota=7TB,7TB
goosefs.worker.tieredstore.level0.dirs.path=/data-1,/data-2

# Worker registration timeout on restart; increase it when the data volume is large
goosefs.worker.registry.get.timeout.ms=3600s

# Read data response timeout (default: 1h)
goosefs.user.streaming.data.timeout=60s

# Write location policy; the default LocalFirstPolicy may cause data imbalance
goosefs.user.block.write.location.policy.class=com.qcloud.cos.goosefs.client.block.policy.RoundRobinPolicy

# Affects distributedLoad speed; if impact on online reads is not a concern, set it to CPU count * 2
goosefs.job.worker.threadpool.size=50
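Since the block above is meant to be pasted into goosefs-site.properties, a small sanity check can catch a missing HA key before the cluster is restarted. This is a generic key=value properties parser written for illustration, not part of GooseFS tooling:

```python
# Parse simple key=value properties (ignoring blank lines and '#' comments)
# and report any Raft-HA-critical keys that are absent.
REQUIRED = {
    "goosefs.master.embedded.journal.addresses",
    "goosefs.master.journal.folder",
    "goosefs.master.metastore.dir",
}

def parse_properties(text: str) -> dict:
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def missing_ha_keys(text: str) -> set:
    return REQUIRED - parse_properties(text).keys()
```

For example, `missing_ha_keys(open("goosefs-site.properties").read())` returns an empty set when all three journal/metastore keys are present.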

High-Availability Architecture Deployment Configuration Based on Raft (Random Read Scenario)

For random read scenarios, use the following recommended configuration; copy it into the goosefs-site.properties file to complete the high-availability setup:
goosefs.master.embedded.journal.addresses=<master1>:9202,<master2>:9202,<master3>:9202

goosefs.master.metastore=ROCKS
# Use HEAP when it is uncertain whether RocksDB is stable
goosefs.master.metastore.block=HEAP

# Size according to available memory
goosefs.master.metastore.inode.cache.max.size=10000000

# RocksDB data storage path
goosefs.master.metastore.dir=/meta-1/metastore

# Mount path for the root directory; place it in a secure directory to prevent accidental deletion
goosefs.master.mount.table.root.ufs=/meta-1/underFSStorage

# Raft log storage path
goosefs.master.journal.folder=/meta-1/journal

# Timeout for triggering Master failover; do not set it too low (JVM GC can cause failover oscillation) or too high (it delays recovery)
goosefs.master.embedded.journal.election.timeout=20s

# Strongly recommended to disable for large data volumes
goosefs.master.startup.block.integrity.check.enabled=false

# Checkpoint trigger interval; do not set it too small (frequent checkpoints prevent the node from participating in leader election while checkpointing) or too large (it slows service restart). Estimate it from the checkpoint loading duration.
goosefs.master.journal.checkpoint.period.entries=20000000

# ACL authentication switch; set based on your scenario
goosefs.security.authorization.permission.enabled=false

# Recommended to enable; otherwise the hostname is used, and hostnames may be identical
goosefs.network.ip.address.used=true

# Worker properties
goosefs.worker.tieredstore.levels=1
goosefs.worker.tieredstore.level0.alias=HDD
goosefs.worker.tieredstore.level0.dirs.quota=7TB,7TB
goosefs.worker.tieredstore.level0.dirs.path=/data-1,/data-2

# Worker registration timeout on restart; increase it when the data volume is large
goosefs.worker.registry.get.timeout.ms=3600s

# Read data response timeout (default: 1h)
goosefs.user.streaming.data.timeout=60s

# Write location policy; the default LocalFirstPolicy may cause data imbalance
goosefs.user.block.write.location.policy.class=com.qcloud.cos.goosefs.client.block.policy.RoundRobinPolicy

# For random reads, reduce this value (default: 1MB) to limit read amplification
goosefs.user.streaming.reader.chunk.size.bytes=256KB
goosefs.user.local.reader.chunk.size.bytes=256KB

# Time to wait for the worker read stream to close. For many small-file reads or random reads, reduce this value (default: 5s) to avoid the performance drop caused by long-tail waits.
goosefs.user.streaming.reader.close.timeout=100ms
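The chunk-size reduction above matters because each random read fetches at least one chunk: for small records, read amplification is roughly the chunk size divided by the record size. The arithmetic below is a plain illustration of that ratio, not GooseFS internals:

```python
# Bytes transferred per logical read when every read pulls a whole chunk,
# expressed as a multiple of the requested record size.
def read_amplification(chunk_size: int, record_size: int) -> float:
    return chunk_size / record_size

KB = 1024
# A 4 KB random read with the default 1 MB chunk moves ~256x the data;
# with a 256 KB chunk, ~64x.
print(read_amplification(1024 * KB, 4 * KB))  # 256.0
print(read_amplification(256 * KB, 4 * KB))   # 64.0
```

Sequential workloads do not suffer this penalty (consecutive reads consume the whole chunk), which is why the sequential-read configuration leaves the chunk size at its default.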


