Using DataX to Sync Data Between Buckets with Metadata Acceleration Enabled

Last updated: 2024-03-25 16:04:01

Overview

DataX is an open-source offline data sync tool. It can efficiently sync data between various heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.
A COS bucket with metadata acceleration enabled can act as the HDFS service in a Hadoop system, providing Hadoop Compatible File System (HCFS) semantics-based access for your business.
This document describes how to use DataX to sync data between two buckets with metadata acceleration enabled.

Environment Dependencies

HADOOP-COS and the corresponding cos_api-bundle.
DataX version: DataX 3.0.

Download and Installation

Downloading hadoop-cos

Download hadoop-cos and cos_api-bundle of the corresponding version from GitHub.
Download chdfs-hadoop-plugin from GitHub.

Downloading DataX package

Download DataX from GitHub.
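
Once downloaded, extract the DataX package. A minimal sketch, assuming the DataX 3.0 tarball was saved as datax.tar.gz in the current directory (the file and directory names are illustrative):

# Extract DataX and confirm the hdfsreader/hdfswriter plugin library directories exist
tar -zxvf datax.tar.gz
ls datax/plugin/reader/hdfsreader/libs/ datax/plugin/writer/hdfswriter/libs/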

Installing hadoop-cos

After downloading the packages above, copy hadoop-cos-2.x.x-${version}.jar, cos_api-bundle-${version}.jar, and chdfs_hadoop_plugin_network-${version}.jar to both plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ under the extracted DataX directory.
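
For example, a minimal sketch of the copy step, assuming the three jars sit in the current directory and DataX was extracted to /usr/local/service/datax (the path used in the run command later in this document):

DATAX_HOME=/usr/local/service/datax
# Copy the COSN-related jars into both the hdfsreader and hdfswriter plugin libraries
for dir in plugin/reader/hdfsreader/libs plugin/writer/hdfswriter/libs; do
    cp hadoop-cos-*.jar cos_api-bundle-*.jar chdfs_hadoop_plugin_network-*.jar "${DATAX_HOME}/${dir}/"
done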

How to Use

Bucket configuration

Go to the bucket with metadata acceleration enabled and, under HDFS Permission Configuration, configure the VPC in which the DataX server runs.
Note:
Within that VPC, the source bucket must allow at least read requests, and the destination bucket must allow at least write requests.

DataX configuration

1. Modify the datax.py script

Open the bin/datax.py script in the extracted DataX directory and modify the CLASS_PATH variable in the script as follows:
CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)
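To confirm the change, you can print the variable from the extracted DataX directory:

grep -n "^CLASS_PATH" bin/datax.py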

2. Configure hdfsreader and hdfswriter in the configuration JSON file

A sample JSON file is shown below:
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/test/",
                    "defaultFS": "cosn://examplebucket1-1250000000/",
                    "column": ["*"],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.userinfo.secretId": "COS_SECRETID",
                        "fs.cosn.userinfo.secretKey": "COS_SECRETKEY",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000"
                    },
                    "fieldDelimiter": ","
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/",
                    "fileName": "hive.test",
                    "defaultFS": "cosn://examplebucket2-1250000000/",
                    "column": [
                        {"name": "col1", "type": "int"},
                        {"name": "col2", "type": "string"}
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.userinfo.secretId": "COS_SECRETID",
                        "fs.cosn.userinfo.secretKey": "COS_SECRETKEY",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000"
                    },
                    "fieldDelimiter": ",",
                    "writeMode": "append"
                }
            }
        }]
    }
}
Notes:
Configure hadoopConfig as required for cosn.
Set defaultFS to the COSN path, such as cosn://examplebucket-1250000000/.
Change fs.cosn.bucket.region and fs.cosn.trsf.fs.ofs.bucket.region to the bucket region, such as ap-guangzhou. For more information, see Regions and Access Endpoints.
For COS_SECRETID and COS_SECRETKEY, use your own COS key information.
Change fs.ofs.user.appid and fs.cosn.trsf.fs.ofs.user.appid to your appid.
Note:
fs.cosn.trsf.fs.ofs.bucket.region and fs.cosn.trsf.fs.ofs.user.appid have been removed in Hadoop-COS 8.1.7 and later, so pay attention to the version you are using. For other configuration items, see the HDFS reader and writer configuration documentation.

Migrating data

Save the configuration file as hdfs_job.json in the job directory and run the following command:
[root@172 /usr/local/service/datax]# python bin/datax.py job/hdfs_job.json
The resulting output is shown below:
2022-10-23 00:25:24.954 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%

[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.034s | 0.034s | 0.034s
PS Scavenge | 14 | 14 | 14 | 0.059s | 0.059s | 0.059s
2022-10-23 00:25:24.954 [job-0] INFO JobContainer - PerfTrace not enable!
2022-10-23 00:25:24.954 [job-0] INFO StandAloneJobContainerCommunicator - Total 1000003 records, 9322478 bytes | Speed 910.40KB/s, 100000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 1.000s | All Task WaitReaderTime 6.259s | Percentage 100.00%
2022-10-23 00:25:24.955 [job-0] INFO JobContainer -
Job start time : 2022-10-23 00:25:12
Job end time : 2022-10-23 00:25:24
Job duration : 12s
Average job traffic : 910.40 KB/s
Record write speed : 100000 records/s
Total number of read records : 1000003
Read/Write failure count : 0

Ranger and Kerberos Use Cases

In the Hadoop permission system, Kerberos is responsible for authentication and Ranger for authorization. After Ranger and Kerberos are enabled, you can still use DataX to connect to buckets with metadata acceleration enabled by following similar steps, but some additional operations and configurations are required.
1. A bucket with metadata acceleration enabled supports the COS Ranger service, which will be automatically installed when you purchase the Ranger and COS Ranger components in the EMR console. You can also install it by yourself as instructed in CHDFS Ranger Permission System Solution.
2. Copy cosn-ranger-interface-1.x.x-${version}.jar and hadoop-ranger-client-for-hadoop-${version}.jar to plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ in the extracted DataX path. Click here to download them.
3. Enter the bucket with metadata acceleration enabled, select Ranger authentication for HDFS Authentication Mode, and configure the Ranger address (not the COS Ranger address).
4. Configure hdfsreader and hdfswriter in the JSON configuration file.
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/test/",
                    "defaultFS": "cosn://examplebucket1-1250000000/",
                    "column": ["*"],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000",
                        "fs.cosn.credentials.provider": "org.apache.hadoop.fs.auth.RangerCredentialsProvider",
                        "qcloud.object.storage.zk.address": "172.16.0.30:2181",
                        "qcloud.object.storage.ranger.service.address": "172.16.0.30:9999",
                        "qcloud.object.storage.kerberos.principal": "hadoop/172.16.0.30@EMR-5IUR9VWW"
                    },
                    "haveKerberos": "true",
                    "kerberosKeytabFilePath": "/var/krb5kdc/emr.keytab",
                    "kerberosPrincipal": "hadoop/172.16.0.30@EMR-5IUR9VWW",
                    "fieldDelimiter": ","
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/",
                    "fileName": "hive.test",
                    "defaultFS": "cosn://examplebucket2-1250000000/",
                    "column": [
                        {"name": "col1", "type": "int"},
                        {"name": "col2", "type": "string"}
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000",
                        "fs.cosn.credentials.provider": "org.apache.hadoop.fs.auth.RangerCredentialsProvider",
                        "qcloud.object.storage.zk.address": "172.16.0.30:2181",
                        "qcloud.object.storage.ranger.service.address": "172.16.0.30:9999",
                        "qcloud.object.storage.kerberos.principal": "hadoop/172.16.0.30@EMR-5IUR9VWW"
                    },
                    "haveKerberos": "true",
                    "kerberosKeytabFilePath": "/var/krb5kdc/emr.keytab",
                    "kerberosPrincipal": "hadoop/172.16.0.30@EMR-5IUR9VWW",
                    "fieldDelimiter": ",",
                    "writeMode": "append"
                }
            }
        }]
    }
}
The new configuration items are as detailed below:
Set fs.cosn.credentials.provider to org.apache.hadoop.fs.auth.RangerCredentialsProvider to use Ranger for authorization.
Set qcloud.object.storage.zk.address to the ZooKeeper address.
Set qcloud.object.storage.ranger.service.address to the COS Ranger address.
Set haveKerberos to true.
Set qcloud.object.storage.kerberos.principal and kerberosPrincipal to the Kerberos authentication principal name (which can be read from core-site.xml in the EMR environment with Kerberos enabled).
Set kerberosKeytabFilePath to the absolute path of the keytab authentication file (which can be read from ranger-admin-site.xml in the EMR environment with Kerberos enabled).
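If you are unsure which principal the keytab contains, you can inspect it with klist (the same command referenced in the FAQs below), assuming the keytab path above:

# List the principals recorded in the keytab; pick one with an explicit IP
# (e.g. hadoop/172.16.0.30@EMR-5IUR9VWW) rather than _HOST, since DataX cannot resolve _HOST
klist -ket /var/krb5kdc/emr.keytab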

FAQs

What should I do if the java.io.IOException: Permission denied: no access groups bound to this mountPoint examplebucket2-1250000000, access denied or java.io.IOException: Permission denied: No access rules matched error is reported?

Check whether the IP address or IP range of the server is included in the VPC network configuration under HDFS Permission Configuration. For EMR, for example, the IP addresses of all nodes must be configured.

What should I do if the java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.cosn.ranger.client.RangerQcloudObjectStorageClientImpl not found error is reported?

Check whether cosn-ranger-interface-1.x.x-${version}.jar and hadoop-ranger-client-for-hadoop-${version}.jar have been copied to plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ in the extracted DataX path (click here to download them).

What should I do if the java.io.IOException: Login failure for hadoop/_HOST@EMR-5IUR9VWW from keytab /var/krb5kdc/emr.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user error is reported?

Check whether kerberosPrincipal and qcloud.object.storage.kerberos.principal are mistakenly set to hadoop/_HOST@EMR-5IUR9VWW instead of hadoop/172.16.0.30@EMR-5IUR9VWW. As DataX cannot resolve a _HOST domain name, you need to replace _HOST with an IP. You can run the klist -ket /var/krb5kdc/emr.keytab command to find an appropriate principal.

What should I do if the java.io.IOException: init fs.cosn.ranger.plugin.client.impl failed error is reported?

Check whether qcloud.object.storage.kerberos.principal is configured in hadoopConfig in the JSON file; if not, configure it.
