Using DataX to Sync Data Between Buckets with Metadata Acceleration Enabled

Last updated: 2024-03-25 16:04:01

Overview

DataX is an open-source offline data sync tool. It can efficiently sync data between various heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.
A COS bucket with metadata acceleration enabled can act as the HDFS service in a Hadoop system, providing Hadoop Compatible File System (HCFS) semantics-based access for your business.
This document describes how to use DataX to sync data between two buckets with metadata acceleration enabled.

Environment Dependencies

HADOOP-COS and the corresponding cos_api-bundle.
DataX version: DataX 3.0.

Download and Installation

Downloading hadoop-cos

Download hadoop-cos and cos_api-bundle of the corresponding version from GitHub.
Download chdfs-hadoop-plugin from GitHub.

Downloading DataX package

Download DataX from GitHub.
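
Once downloaded, extract the DataX package. A minimal sketch, assuming the DataX 3.0 tarball was saved as datax.tar.gz in the current directory (the file and directory names are illustrative):

# Extract DataX and confirm the hdfsreader/hdfswriter plugin library directories exist
tar -zxvf datax.tar.gz
ls datax/plugin/reader/hdfsreader/libs/ datax/plugin/writer/hdfswriter/libs/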

Installing hadoop-cos

After downloading the packages above, copy hadoop-cos-2.x.x-${version}.jar, cos_api-bundle-${version}.jar, and chdfs_hadoop_plugin_network-${version}.jar to both plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ under the extracted DataX directory.
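
For example, a minimal sketch of the copy step, assuming the three jars sit in the current directory and DataX was extracted to /usr/local/service/datax (the path used in the run command later in this document):

DATAX_HOME=/usr/local/service/datax
# Copy the COSN-related jars into both the hdfsreader and hdfswriter plugin libraries
for dir in plugin/reader/hdfsreader/libs plugin/writer/hdfswriter/libs; do
    cp hadoop-cos-*.jar cos_api-bundle-*.jar chdfs_hadoop_plugin_network-*.jar "${DATAX_HOME}/${dir}/"
done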

How to Use

Bucket configuration

Go to the bucket with metadata acceleration enabled and, under HDFS Permission Configuration, configure the VPC in which the DataX server runs.
Note:
Within that VPC, the source bucket must allow at least read requests, and the destination bucket must allow at least write requests.

DataX configuration

1. Modify the datax.py script

Open the bin/datax.py script in the extracted DataX directory and modify the CLASS_PATH variable in the script as follows:
CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)
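To confirm the change, you can print the variable from the extracted DataX directory:

grep -n "^CLASS_PATH" bin/datax.py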

2. Configure hdfsreader and hdfswriter in the configuration JSON file

A sample JSON file is shown below:
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/test/",
                    "defaultFS": "cosn://examplebucket1-1250000000/",
                    "column": ["*"],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.userinfo.secretId": "COS_SECRETID",
                        "fs.cosn.userinfo.secretKey": "COS_SECRETKEY",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000"
                    },
                    "fieldDelimiter": ","
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/",
                    "fileName": "hive.test",
                    "defaultFS": "cosn://examplebucket2-1250000000/",
                    "column": [
                        {"name": "col1", "type": "int"},
                        {"name": "col2", "type": "string"}
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.userinfo.secretId": "COS_SECRETID",
                        "fs.cosn.userinfo.secretKey": "COS_SECRETKEY",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000"
                    },
                    "fieldDelimiter": ",",
                    "writeMode": "append"
                }
            }
        }]
    }
}
Notes:
Configure hadoopConfig as required for cosn.
Set defaultFS to the COSN path, such as cosn://examplebucket-1250000000/.
Change fs.cosn.bucket.region and fs.cosn.trsf.fs.ofs.bucket.region to the bucket region, such as ap-guangzhou. For more information, see Regions and Access Endpoints.
For COS_SECRETID and COS_SECRETKEY, use your own COS key information.
Change fs.ofs.user.appid and fs.cosn.trsf.fs.ofs.user.appid to your appid.
Note:
fs.cosn.trsf.fs.ofs.bucket.region and fs.cosn.trsf.fs.ofs.user.appid have been removed in Hadoop-COS 8.1.7 and later, so pay attention to the version you are using. For other configuration items, see the HDFS reader and writer configuration documentation.

Migrating data

Save the configuration file as hdfs_job.json in the job directory and run the following command:
[root@172 /usr/local/service/datax]# python bin/datax.py job/hdfs_job.json
The resulting output is shown below:
2022-10-23 00:25:24.954 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%

[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.034s | 0.034s | 0.034s
PS Scavenge | 14 | 14 | 14 | 0.059s | 0.059s | 0.059s
2022-10-23 00:25:24.954 [job-0] INFO JobContainer - PerfTrace not enable!
2022-10-23 00:25:24.954 [job-0] INFO StandAloneJobContainerCommunicator - Total 1000003 records, 9322478 bytes | Speed 910.40KB/s, 100000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 1.000s | All Task WaitReaderTime 6.259s | Percentage 100.00%
2022-10-23 00:25:24.955 [job-0] INFO JobContainer -
Job start time : 2022-10-23 00:25:12
Job end time : 2022-10-23 00:25:24
Job duration : 12s
Average job traffic : 910.40 KB/s
Record write speed : 100000 records/s
Total number of read records : 1000003
Read/Write failure count : 0

Ranger and Kerberos Use Cases

In the Hadoop permission system, Kerberos is responsible for authentication and Ranger for authorization. After Ranger and Kerberos are enabled, you can still use DataX to connect to buckets with metadata acceleration enabled by following similar steps, but some additional operations and configurations are required.
1. A bucket with metadata acceleration enabled supports the COS Ranger service, which will be automatically installed when you purchase the Ranger and COS Ranger components in the EMR console. You can also install it by yourself as instructed in CHDFS Ranger Permission System Solution.
2. Copy cosn-ranger-interface-1.x.x-${version}.jar and hadoop-ranger-client-for-hadoop-${version}.jar to plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ in the extracted DataX path. Click here to download them.
3. Enter the bucket with metadata acceleration enabled, select Ranger authentication for HDFS Authentication Mode, and configure the Ranger address (not the COS Ranger address).
4. Configure hdfsreader and hdfswriter in the JSON configuration file.
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/test/",
                    "defaultFS": "cosn://examplebucket1-1250000000/",
                    "column": ["*"],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000",
                        "fs.cosn.credentials.provider": "org.apache.hadoop.fs.auth.RangerCredentialsProvider",
                        "qcloud.object.storage.zk.address": "172.16.0.30:2181",
                        "qcloud.object.storage.ranger.service.address": "172.16.0.30:9999",
                        "qcloud.object.storage.kerberos.principal": "hadoop/172.16.0.30@EMR-5IUR9VWW"
                    },
                    "haveKerberos": "true",
                    "kerberosKeytabFilePath": "/var/krb5kdc/emr.keytab",
                    "kerberosPrincipal": "hadoop/172.16.0.30@EMR-5IUR9VWW",
                    "fieldDelimiter": ","
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/",
                    "fileName": "hive.test",
                    "defaultFS": "cosn://examplebucket2-1250000000/",
                    "column": [
                        {"name": "col1", "type": "int"},
                        {"name": "col2", "type": "string"}
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.trsf.fs.ofs.bucket.region": "ap-guangzhou",
                        "fs.cosn.bucket.region": "ap-guangzhou",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.trsf.fs.ofs.tmp.cache.dir": "/tmp/",
                        "fs.cosn.trsf.fs.ofs.user.appid": "1250000000",
                        "fs.cosn.credentials.provider": "org.apache.hadoop.fs.auth.RangerCredentialsProvider",
                        "qcloud.object.storage.zk.address": "172.16.0.30:2181",
                        "qcloud.object.storage.ranger.service.address": "172.16.0.30:9999",
                        "qcloud.object.storage.kerberos.principal": "hadoop/172.16.0.30@EMR-5IUR9VWW"
                    },
                    "haveKerberos": "true",
                    "kerberosKeytabFilePath": "/var/krb5kdc/emr.keytab",
                    "kerberosPrincipal": "hadoop/172.16.0.30@EMR-5IUR9VWW",
                    "fieldDelimiter": ",",
                    "writeMode": "append"
                }
            }
        }]
    }
}
The new configuration items are as detailed below:
Set fs.cosn.credentials.provider to org.apache.hadoop.fs.auth.RangerCredentialsProvider to use Ranger for authorization.
Set qcloud.object.storage.zk.address to the ZooKeeper address.
Set qcloud.object.storage.ranger.service.address to the COS Ranger address.
Set haveKerberos to true.
Set qcloud.object.storage.kerberos.principal and kerberosPrincipal to the Kerberos authentication principal name (which can be read from core-site.xml in the EMR environment with Kerberos enabled).
Set kerberosKeytabFilePath to the absolute path of the keytab authentication file (which can be read from ranger-admin-site.xml in the EMR environment with Kerberos enabled).
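If you are unsure which principal the keytab contains, you can inspect it with klist (the same command referenced in the FAQs below), assuming the keytab path above:

# List the principals recorded in the keytab; pick one with an explicit IP
# (e.g. hadoop/172.16.0.30@EMR-5IUR9VWW) rather than _HOST, since DataX cannot resolve _HOST
klist -ket /var/krb5kdc/emr.keytab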

FAQs

What should I do if the java.io.IOException: Permission denied: no access groups bound to this mountPoint examplebucket2-1250000000, access denied or java.io.IOException: Permission denied: No access rules matched error is reported?

Check whether the IP address or IP range of the server is included in the VPC network configuration under HDFS Permission Configuration. For EMR, for example, the IP addresses of all nodes must be configured.

What should I do if the java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.cosn.ranger.client.RangerQcloudObjectStorageClientImpl not found error is reported?

Check whether cosn-ranger-interface-1.x.x-${version}.jar and hadoop-ranger-client-for-hadoop-${version}.jar have been copied to plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ in the extracted DataX path (click here to download them).

What should I do if the java.io.IOException: Login failure for hadoop/_HOST@EMR-5IUR9VWW from keytab /var/krb5kdc/emr.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user error is reported?

Check whether kerberosPrincipal and qcloud.object.storage.kerberos.principal are mistakenly set to hadoop/_HOST@EMR-5IUR9VWW instead of hadoop/172.16.0.30@EMR-5IUR9VWW. As DataX cannot resolve a _HOST domain name, you need to replace _HOST with an IP. You can run the klist -ket /var/krb5kdc/emr.keytab command to find an appropriate principal.

What should I do if the java.io.IOException: init fs.cosn.ranger.plugin.client.impl failed error is reported?

Check whether qcloud.object.storage.kerberos.principal is configured in hadoopConfig in the JSON file; if not, configure it.
