mxshift

This document describes mxshift, a data migration tool.

1 Description

mxshift currently supports the following features:

  • Full migration: migrates table DDL and data in parallel, one database at a time, from a Greenplum 4.3.X/5/6 or YMatrix cluster to a new YMatrix cluster.
  • Incremental migration: migrates, one table at a time, the DDL and data of tables newly added to the specified database in parallel from a Greenplum 5/6 or YMatrix cluster to a new YMatrix cluster.
  • Conditional migration: migrates only the data that matches a WHERE condition.
  • Supported data distribution policies: hash distribution, random distribution, replicated tables (REPLICATED), and master-only.

2 Quick start

The complete template file is named config.toml and can be generated by running mxshift --config.

The sections below provide sample templates for different scenarios. Use them as a reference, and edit the file (for example, with vim config.toml) to fit your environment.
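For reference, the template can be generated and then edited in place; this sketch assumes the mxshift binary is on your PATH (the output file name is arbitrary):

## Write the default template to a file, then edit it
$ mxshift --config > config.toml
$ vim config.toml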

2.1 Full migration

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ##       1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ##       2. Target database is un-reachable, and 'ddl.mode' is 'output'
        # db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun  5 2023 15:45:24"
        ## The installation directory of database
        install-dir="/usr/local/greenplum-db-6.7.1"
        [[database.source.hostname-to-ip]]
             ## The content within <> should be replaced with actual information and <> should be removed
             node-hostname="<mdw>" 
             node-ip="<127.0.0.1>"
        [[database.source.hostname-to-ip]]
             node-hostname="<sdw1>"
             node-ip="<127.0.0.2>"
        [[database.source.hostname-to-ip]]
             node-hostname="<sdw2>"
             node-ip="<127.0.0.3>"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ##       1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ##       2. Target database is un-reachable, and 'ddl.mode' is 'output'
        # db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun  5 2023 15:45:24"

[scope]
## Whether to disable data incremental migration, by default, it is true.
# disable-data-increment=true
disable-connector=false
## The compress method for transferring data, methods restricted to 0/gzip/lz4/zstd        
compress-method="lz4"
## mode for transferring data from source to target database, and value restricted to normal/dryrun/fetch/motion.
##dryrun for only executing ddl, not transferring data
##fetch for fetching data from source and abandon
##motion for fetching data from source, redistributing and finally abandon
mode="normal"
## Sql for select segment information from source database        
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
        [[scope.table-list]]
                schema="test_schema_1"
                name="table_001"
        [[scope.table-list]]
                schema="test_schema_2"
                name="table_002"
        [[scope.exclude-table-list]]
                schema="test_schema_3"
                name="table_003"
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]

[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false

[controller]
## By default, transfer work will start from the largest table. If set 'bothway' it will start from both the largest and the smallest table
both-way=true
## The number of table transferred at the same time
concurrency=3

[transfer]
## Verify the number of records of every table
verify=true
with-index=true
## Whether to disable transfer Data, by default, it is false.
# disabled-data-transfer=false
## Table with size under this threshold are categorized as "small",
## transfer between masters instead of segments to avoid creating external table.
# small-table-mb=32

[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## During the DDL transfer, whether to skip the transfer of resource queue or group, by default, it is true.
# skip-resource-queue-and-group=true
## During the DDL transfer, whether to skip the transfer of tablespace, by default, it is true.
# skip-table-space=true
        [[ddl.replace]]
        ## Only applicable for the case of migration from Greenplum to YMatrix
                category="role"
                [[ddl.replace.pairs]]
                        old="gpadmin"
                        new="mxadmin"
## Whether to disable ddl incremental migration, by default, it is true.
# disable-ddl-increment=true
## DDL transfer will stop and exit when executing SQL encounters errors, unless the type of DDL is set to 'ignore error'. Types of DDL are:
## "all","collation","default acl","extension","external table","function","index","language",
## "namespace","operator","resource","role","table","tablespace","type","view"
# ignore-error=[]

[verify]
## Whether to enable data verify, by default, it is false.
# enabled=false
## mode for verifying data consistency between source and target database,
## and value restricted to simple-count/column-count/length-sum/hash-bit-bloom.
## 'simple-count' for counting number of rows
## 'column-count' for counting not-null value of a certain column
## 'length-sum' for summing the length of some columns(e.g. column of type text, xml, json, varchar)
## 'hash-bit-bloom' for calculating the hash for one or more columns, and comparing its binary content
##  (except for columns of floating-point type, the result of which can be affected by software/hardware)
# mode="simple-count"
## Columns used in verification, will randomly select columns if not set
# columns=[]
## Maximum number of randomly selected columns used in verification
# max-column-count=3
# [[verify.table-list]]
#    schema="test_schema_1"
#    name="table_001"
# [[verify.exclude-table-list]]
#    schema="test_schema_3"
#    name="table_003"
# verify.schema-list=["test_schema_1"]
# verify.exclude-schema-list=["test_schema_5"]

# [[verify.override]]
#    mode="length-sum"
#    max-column-count=3
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_3"
#      name="table_003"
#    verify.override.schema-list=[]

# [[verify.override]]
#    mode="column-count"
#    max-column-count=2
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_1"
#      name="table_001"
#    verify.override.schema-list=[]

Once the TOML file is ready, run the following command to start mxshift for a full migration.

$ mxshift -c config_path_full.toml

2.2 Incremental migration

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ##       1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ##       2. Target database is un-reachable, and 'ddl.mode' is 'output'
        # db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun  5 2023 15:45:24"
        ## The installation directory of database
        install-dir="/usr/local/greenplum-db-6.7.1"
        [[database.source.hostname-to-ip]]
             ## The content within <> should be replaced with actual information and <> should be removed
             node-hostname="<mdw>" 
             node-ip="<127.0.0.1>"
        [[database.source.hostname-to-ip]]
             node-hostname="<sdw1>"
             node-ip="<127.0.0.2>"
        [[database.source.hostname-to-ip]]
             node-hostname="<sdw2>"
             node-ip="<127.0.0.3>"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ##       1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ##       2. Target database is un-reachable, and 'ddl.mode' is 'output'
        # db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun  5 2023 15:45:24"

[scope]
disable-connector=false
## The compress method for transferring data, methods restricted to 0/gzip/lz4/zstd        
compress-method="lz4"
## mode for transferring data from source to target database, and value restricted to normal/dryrun/fetch/motion.
##dryrun for only executing ddl, not transferring data
##fetch for fetching data from source and abandon
##motion for fetching data from source, redistributing and finally abandon
mode="normal"
## Sql for select segment information from source database        
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Whether to disable data incremental migration, by default, it is true.
disable-data-increment=false
        [[scope.table-list]]
                schema="test_schema_1"
                name="table_001"
        [[scope.table-list]]
                schema="test_schema_2"
                name="table_002"

        [[scope.exclude-table-list]]
                schema="test_schema_3"
                name="table_003"        
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]

[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false

[controller]
## By default, transfer work will start from the largest table. If set 'bothway' it will start from both the largest and the smallest table
both-way=true
## The number of table transferred at the same time
concurrency=3

[transfer]
## Verify the number of records of every table
verify=true
with-index=true
## Whether to disable transfer Data, by default, it is false.
# disabled-data-transfer=false
## Table with size under this threshold are categorized as "small",
## transfer between masters instead of segments to avoid creating external table.
# small-table-mb=32

[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## During the DDL transfer, whether to skip the transfer of resource queue or group, by default, it is true.
# skip-resource-queue-and-group=true
## During the DDL transfer, whether to skip the transfer of tablespace, by default, it is true.
# skip-table-space=true
        [[ddl.replace]]
        ## Only applicable for the case of migration from Greenplum to YMatrix
                category="role"
                [[ddl.replace.pairs]]
                        old="gpadmin"
                        new="mxadmin"
## Whether to disable ddl incremental migration, by default, it is true.
disable-ddl-increment=false
## DDL transfer will stop and exit when executing SQL encounters errors, unless the type of DDL is set to 'ignore error'. Types of DDL are:
## "all","collation","default acl","extension","external table","function","index","language",
## "namespace","operator","resource","role","table","tablespace","type","view"
# ignore-error=[]

[verify]
## Whether to enable data verify, by default, it is false.
# enabled=false
## mode for verifying data consistency between source and target database,
## and value restricted to simple-count/column-count/length-sum/hash-bit-bloom.
## 'simple-count' for counting number of rows
## 'column-count' for counting not-null value of a certain column
## 'length-sum' for summing the length of some columns(e.g. column of type text, xml, json, varchar)
## 'hash-bit-bloom' for calculating the hash for one or more columns, and comparing its binary content
##  (except for columns of floating-point type, the result of which can be affected by software/hardware)
# mode="simple-count"
## Columns used in verification, will randomly select columns if not set
# columns=[]
## Maximum number of randomly selected columns used in verification
# max-column-count=3
# [[verify.table-list]]
#    schema="test_schema_1"
#    name="table_001"
# [[verify.exclude-table-list]]
#    schema="test_schema_3"
#    name="table_003"
# verify.schema-list=["test_schema_1"]
# verify.exclude-schema-list=["test_schema_5"]

# [[verify.override]]
#    mode="length-sum"
#    max-column-count=3
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_3"
#      name="table_003"
#    verify.override.schema-list=[]

# [[verify.override]]
#    mode="column-count"
#    max-column-count=2
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_1"
#      name="table_001"
#    verify.override.schema-list=[]

Once the TOML file is ready, run the following command to start mxshift for an incremental migration.

$ mxshift -c config_path_incremental.toml

2.3 Conditional migration

Use a WHERE condition to migrate only the data that meets certain criteria, or to leave expired data behind:

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"
        ## The installation directory of database
        install-dir="/usr/local/greenplum-db-6.7.1"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"

[scope]
## The compress method for transferring data, methods restricted to 0/gzip/lz4/zstd        
compress-method="lz4"
## mode for transferring data from source to target database, and value restricted to normal/dryrun/fetch/motion.
##dryrun for only executing ddl, not transferring data
##fetch for fetching data from source and abandon
##motion for fetching data from source, redistributing and finally abandon
mode="normal"
## Sql for select segment information from source database        
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
        [[scope.table-list]]
                schema=......
                name=......
        [[scope.table-list]]
        ......

        [[scope.exclude-table-list]]
        ......      
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
disable-connector=false

[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false

[controller]
## By default, transfer work will start from the largest table. If set 'bothway' it will start from both the largest and the smallest table
both-way=true
## The number of table transferred at the same time
concurrency=3

[transfer]
## Verify the number of records of every table
verify=true
## Table with size under this threshold are categorized as "small",
## transfer between masters instead of segments to avoid creating external table.
# small-table-mb=32
[transfer.table-data-where-sql]
enabled=true
global="txdate >= '2022-10-01' AND batchnum >= 100000000"
[[transfer.table-data-where-sql.override]]
        where="abc > 10"
        [transfer.table-data-where-sql.override.table]
        schema="test_schema_1"
        name="table_001"
[[transfer.table-data-where-sql.override]]
        where="tag != 'aabbcc' AND ts > '2022-01-01'"
        [transfer.table-data-where-sql.override.table]
        schema="test_schema_2"
        name="another_table"

[verify]
## Whether to enable data verify, by default, it is false.
# enabled=false
## mode for verifying data consistency between source and target database,
## and value restricted to simple-count/column-count/length-sum/hash-bit-bloom.
## 'simple-count' for counting number of rows
## 'column-count' for counting not-null value of a certain column
## 'length-sum' for summing the length of some columns(e.g. column of type text, xml, json, varchar)
## 'hash-bit-bloom' for calculating the hash for one or more columns, and comparing its binary content
##  (except for columns of floating-point type, the result of which can be affected by software/hardware)
# mode="simple-count"
## Columns used in verification, will randomly select columns if not set
# columns=[]
## Maximum number of randomly selected columns used in verification
# max-column-count=3
# [[verify.table-list]]
#    schema="test_schema_1"
#    name="table_001"
# [[verify.exclude-table-list]]
#    schema="test_schema_3"
#    name="table_003"
# verify.schema-list=["test_schema_1"]
# verify.exclude-schema-list=["test_schema_5"]

# [[verify.override]]
#    mode="length-sum"
#    max-column-count=3
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_3"
#      name="table_003"
#    verify.override.schema-list=[]

# [[verify.override]]
#    mode="column-count"
#    max-column-count=2
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_1"
#      name="table_001"
#    verify.override.schema-list=[]

Once the TOML file is ready, run the following command to start mxshift for a conditional migration.

$ mxshift -c config_path_conditional.toml

2.4 Migrating replicated tables

When migrating an entire database, or all replicated tables under a schema, no extra configuration is needed in config_path.toml: mxshift automatically detects which tables are replicated and applies the appropriate migration strategy, preventing the same data from being migrated twice. You only need to configure the schema and name in scope.table-list.

   [[scope.table-list]]
                schema="public1"
                name="table_replicated1"
   [[scope.table-list]]
                schema="public2"
                name="table_replicated2"

Once the TOML file is ready, run the following command to start mxshift to migrate replicated tables.

$ mxshift -c config_path_replicated.toml

Note!
For the overall approach to data migration and the complete steps, see Data Migration.

3 Feature tips

3.1 Importing/exporting table metadata

For data migration, mxshift first queries the catalog for metadata about the tables to be migrated, and then runs the migration tasks.

As database usage grows, the catalog bloats and these queries take longer. To speed up debugging and shorten business downtime during migration, mxshift provides the following command-line options to export table metadata and reuse it in subsequent migrations:

## Export the metadata of the tables to be migrated to the file mxshift_meta_file, then exit
$ mxshift -c <迁移配置文件> --write-table-meta-file mxshift_meta_file

## Import the table metadata from the file mxshift_meta_file, then run the data migration
$ mxshift -c <迁移配置文件> --read-table-meta-file mxshift_meta_file

Note!
The exported table metadata is not meant for long-term use. Regenerate it whenever the set of tables to migrate changes or the table metadata itself changes.

3.2 Importing/exporting table DDL

Add the following to the mxshift configuration file:

[ddl]
enabled=true
file-path="/tmp/mxshift.sql"
mode="output"
  • file-path: the file path used to import/export DDL SQL; see the appendix below for details. Never modify or delete any comment that begins with -- DO NOT EDIT.
  • mode: the DDL processing mode. There are three modes, exec, output, and input:
    • exec: the default; execute the DDL directly on the target database
    • output: write the DDL to a file. mxshift exits once the export finishes and does not migrate data
    • input: read the DDL from a file
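For example, after exporting DDL with mode="output", the resulting file can be replayed on the target with a configuration like the following (the file path is illustrative and must point to a file exported by mxshift):

[ddl]
enabled=true
file-path="/tmp/mxshift.sql"
mode="input"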

If the target database cannot be reached in output mode, configure database.target.db-version as in the following example so that the export can still run:

[database]

    [database.target]
    ...
    ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
    ##         1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
    ##         2. Target database is un-reachable, and 'ddl.mode' is 'output'
    db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jul  5 2022 15:45:24"

If the source database cannot be reached in input mode, configure database.source.db-version and ddl.only-ddl as in the following example so that the import can still run:

[database]

    [database.source]
    ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
    ##         1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
    ##         2. Target database is un-reachable, and 'ddl.mode' is 'output'
    db-version="PostgreSQL 12 (MatrixDB 4.5.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jul  5 2022 15:45:24"

[ddl]
only-ddl=true

3.3 Dynamically adjusting concurrency

mxshift supports adjusting the concurrency while a migration is running, as follows:

  1. Edit controller.concurrency in the configuration file (the one used when starting mxshift)
[controller]
## The number of table transferred at the same time
concurrency=xxx
  2. Find the PID of the currently running mxshift process

    Run ps aux | grep mxshift and note the number in the second column, {{mxshift PID}}

  3. Send the SIGUSR2 signal to the mxshift process so that it reloads the configuration file and updates the concurrency

    Run kill -SIGUSR2 {{mxshift PID}}
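The lookup-and-signal steps above can be combined into a single command, assuming exactly one mxshift process is running (pgrep is used here in place of ps aux | grep):

$ kill -SIGUSR2 $(pgrep -x mxshift)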

3.4 Data consistency verification

3.4.1 Decoupling legacy data verification from data migration

In versions before v5.2.1, the [transfer] section had the options with-index and verify, which controlled whether to migrate indexes and whether to verify table row counts. However, both index migration and data verification (the legacy kind, which only checks row counts) were tied to data migration: neither could run on its own, only as part of a data migration run.

For more flexibility, starting with v5.2.1 index migration and legacy data verification can run independently. The [transfer] section gains a disabled-data-transfer option to skip data migration. Usage examples:

  1. To migrate data and indexes together:
[transfer]
## Whether to disable transfer Data, by default, it is false.
disabled-data-transfer=false
## Transfer indexes of the tables, by default, it is false.
with-index=true
  2. To migrate data and verify table row counts:
[transfer]
## Whether to disable transfer Data, by default, it is false.
disabled-data-transfer=false
## Verify the number of record of every table, by default, it is disabled.
verify=true
  3. To migrate data and indexes together, and verify table row counts:
[transfer]
## Whether to disable transfer Data, by default, it is false.
disabled-data-transfer=false
## Transfer indexes of the tables, by default, it is false.
with-index=true
## Verify the number of record of every table, by default, it is disabled.
verify=true
  4. To migrate data only, without migrating indexes or verifying row counts:
[transfer]
## Whether to disable transfer Data, by default, it is false.
disabled-data-transfer=false
## Transfer indexes of the tables, by default, it is false.
with-index=false
## Verify the number of record of every table, by default, it is disabled.
verify=false
  5. To migrate indexes only, without migrating data:
[transfer]
## Whether to disable transfer Data, by default, it is false.
disabled-data-transfer=true
## Transfer indexes of the tables, by default, it is false.
with-index=true
  6. To verify table row counts only, without migrating data:
[transfer]
## Whether to disable transfer Data, by default, it is false.
disabled-data-transfer=true
## Verify the number of record of every table, by default, it is disabled.
verify=true

3.4.2 Data verification in v5.2.1

Starting with v5.2.1, YMatrix mxshift supports new data verification algorithms, configured via the new [verify] section.

See the appendix below for the global configuration parameters, and the full-migration configuration file above for the complete [verify] section. This part focuses on configuring the verification scope.

[verify]
# [[verify.table-list]]
#    schema="test_schema_1"
#    name="table_001"
# [[verify.exclude-table-list]]
#    schema="test_schema_3"
#    name="table_003"
# verify.schema-list=["test_schema_1"]
# verify.exclude-schema-list=["test_schema_5"]

# [[verify.override]]
#    mode="length-sum"
#    max-column-count=3
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_3"
#      name="table_003"
#    verify.override.schema-list=[]

# [[verify.override]]
#    mode="column-count"
#    max-column-count=2
#    columns=[]
#    [[verify.override.table-list]]
#      schema="test_schema_1"
#      name="table_001"
#    verify.override.schema-list=[]

The verification scope is configured through four settings: verify.table-list, verify.exclude-table-list, verify.schema-list, and verify.exclude-schema-list.

Verification priority is as follows:

  1. If table-list specifies concrete tables, only those tables are verified.

  2. Otherwise, mxshift checks whether schema-list specifies any schemas.

  3. If schemas are specified, all tables under those schemas are verified.

  4. If no schemas are specified, all tables from the most recent transfer are verified.

  • verify.table-list: tables included in v5.2.1 data verification.
  • verify.exclude-table-list: tables excluded from v5.2.1 data verification.
  • verify.schema-list: schemas included in v5.2.1 data verification.
  • verify.exclude-schema-list: schemas excluded from v5.2.1 data verification.

The [verify.override] subsection overrides the global verification settings for the tables it specifies.
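Putting the scope and override settings together, a minimal uncommented [verify] section might look like this sketch (schema and table names are placeholders):

[verify]
enabled=true
mode="hash-bit-bloom"
max-column-count=3
        [[verify.table-list]]
                schema="test_schema_1"
                name="table_001"
        ## Override: verify this table with a plain row count instead
        [[verify.override]]
                mode="simple-count"
                [[verify.override.table-list]]
                        schema="test_schema_1"
                        name="table_001"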

Appendix: parameter reference

This section describes the parameters of the mxshift tool.

Configuration file

Using mxshift requires preparing a TOML file first. The configuration file parameters are described below:

Parameter  Description  Required
[database.source] section
--db-host  Hostname or IP address of the source cluster's master node. The source cluster is the cluster data is migrated out of
--db-port  Port of the source database master
--db-database  Name of the source database
--db-user  User name for the source database
--db-password  Password for the source database user
--install-dir  Installation directory of the database
--hostname-to-ip  Maps hostnames to IP addresses. When the source and target clusters share hostnames, this setting keeps routing correct
[database.target] section
--db-host  Hostname or IP address of the target cluster's master node. The target cluster is the cluster data is migrated into
--db-port  Port of the target database master
--db-database  Name of the target database
--db-user  User name for the target database
--db-password  Password for the target database user
[scope] section
--mode  Run mode. Allowed values: normal, dryrun, fetch, motion.
normal: the default, a regular data migration;
dryrun: verifies connectivity between the two databases using simple DDL, without reading or writing any data;
fetch: builds on dryrun by reading data from the source and sending it to the target segments, where it is discarded without processing; used to measure source read throughput and source-to-target network throughput;
motion: builds on fetch by having the target database parse the data and exchange it to the actual target segments; used to further test the target database's data exchange rate
--compress-method  Compression method: gzip / zstd / lz4 / 0. The default is 0, i.e. no compression.
If scope.disable-connector=true, enabling any compression algorithm requires that the mxadmin user on every target segment can SSH to every source segment without a password; after configuring this, test with dryrun
--table-list  List of schemas and names of the source tables; for partitioned tables, list the partitioned table itself. To migrate the whole database, leave this empty: mxshift migrates all tables of non-zero size, from largest to smallest
--exclude-table-list  List of schemas and names of tables to exclude
--schema-list  Schemas of the source database. Precedence: if table-list specifies concrete tables, only those tables are transferred; otherwise, if schema-list specifies schemas, all tables under those schemas are transferred; otherwise, all tables in the db-database of [database.source] are transferred
--exclude-schema-list  Schemas of the source database to exclude from data migration. If the same schema appears in both schema-list and exclude-schema-list, it is excluded
--disable-data-increment  Whether to disable incremental data migration. Defaults to true, i.e. disabled. When enabled, mxshift compares each source table's last modification time with its modification time at the previous mxshift run, and re-migrates only the tables modified in between. Note: the first time this feature is enabled, all tables specified in the configuration file are migrated
--disable-connector  Whether an allowlist entry is needed for every node of the target cluster. Defaults to false, i.e. only the target cluster's master needs to be allowlisted on the source cluster. Note: if the source cluster is Greenplum, only 4.3.17 and later are supported
[log] section
--log-level  Defaults to info, which prints normal progress plus errors and warnings. debug prints extra debugging output, intended only for development and troubleshooting
--no-color  Defaults to false; true disables colored output
[controller] section
--both-way  Defaults to false, which migrates the largest table first so that the bulk of the data finishes as early as possible. If both-way is set to true, migration proceeds from both the largest and the smallest table simultaneously
--concurrency  Number of tables migrated in parallel; defaults to 3, at most 10
[transfer] section
--verify  Verifies that row counts match before and after migration; defaults to false
--with-index  Whether to migrate indexes; defaults to false, i.e. no automatic migration. Note: a table created with the MARS2 storage engine only works once its index is created, so migrating a MARS2 table's DDL always migrates its indexes as well, regardless of this option
--disabled-data-transfer  Whether to skip data migration and only migrate indexes and verify row counts, decoupling migration from verification. Defaults to false, i.e. data is migrated
--small-table-mb  Tables below this size (in MB) are treated as small tables and get an optimized transfer path (small tables are sent directly through the masters instead of peer-to-peer between segments). Defaults to 32 MB. Set to -1 to disable this optimization
[ddl] section
--enabled  Whether to migrate table DDL; defaults to false. After starting, and before migrating table data, mxshift first migrates DDL one database at a time.
Database versions supported for DDL migration:
Greenplum 4.3.X/5/6 -> YMatrix 4/YMatrix 5;
YMatrix 4/5 -> YMatrix 4/5
--file-path  File path used to import/export DDL SQL. The file is created if it does not exist; if it contains other content, it is truncated before writing. The exported file may be inspected or edited, but never modify or delete any comment beginning with "-- DO NOT EDIT.", or the import may misbehave or fail
--mode  DDL processing mode: exec, output, or input. exec is the default and executes the DDL directly on the target database; output writes the DDL to a file, after which mxshift exits without migrating data; input reads the DDL from a file, and --file-path must point to a file exported by mxshift, or mxshift cannot recognize it. Note: if the target database cannot be reached in output mode, add --db-version under [database.target]; if the source database cannot be reached in input mode, set --only-ddl under [ddl] to true and add --db-version under [database.source]
--only-ddl  Whether to migrate only DDL. Defaults to false. When false, the data of the affected tables is migrated automatically after the DDL migration; when true, mxshift exits after the DDL migration without migrating table data
--skip-resource-queue-and-group  Whether to migrate resource queues and groups. When skip-resource-queue-and-group=true, resource groups and resource queues are skipped during DDL migration
--skip-table-space  Whether to migrate tablespaces. When skip-table-space=true, tablespaces are skipped during DDL migration
--disable-ddl-increment  Whether to disable incremental DDL migration. Defaults to true, i.e. disabled. When enabled, mxshift parses the catalogs of both the source and target databases and migrates only what the target is missing. This adds a parse-and-compare step for the target, so it runs slightly slower than when disabled
--ignore-error  Whether DDL migration may ignore specified kinds of errors and continue. Without this setting, any error while executing DDL stops the migration and exits; with it, errors from the listed kinds of DDL are only logged and do not interrupt execution.
Defaults to empty, i.e. no errors are ignored;
all covers every kind;
multiple kinds may be listed, comma-separated, e.g. ["extension","function"]
[verify] section
--enabled  Whether to enable v5.2.1 data verification (several modes are available); defaults to false
--mode  Verification mode; defaults to simple-count. mxshift provides four modes: simple-count, column-count, length-sum, and hash-bit-bloom. simple-count behaves like the legacy verification (transfer.verify) and compares table row counts; column-count compares the number of non-null values in a given column. Note that if a column specified in the configuration file has a NOT NULL constraint (the column must always hold a non-null value) or a default value (filled in automatically when a new row is inserted), another qualifying column is chosen at random; length-sum sums the lengths of one or more columns, which must be of a string type: text, xml, json, jsonb, jsonpath, varchar, varbit. If a specified column does not qualify, another qualifying column is chosen at random; hash-bit-bloom hashes one or more columns and compares characteristics of the hashes' binary content. The column's data type must be convertible to text with cast(). If a specified column does not qualify, another qualifying column is chosen at random
--columns  Columns to include in v5.2.1 data verification
--max-column-count  Maximum number of columns used in v5.2.1 data verification when columns are chosen at random; defaults to 3

If you want to perform conditional migration with a WHERE condition, the following parameters may be useful:

Section  Subsection  Parameter  Description  Required
[transfer] section  [transfer.table-data-where-sql] section  --enabled  If true, a WHERE condition filters the source data and only rows that satisfy it are migrated, enabling conditional migration. Disabled by default; set to true to enable
--global  A SQL WHERE expression that mxshift appends for every table in this migration; tables to which the WHERE expression does not apply are reported as SKIPPED and skipped automatically
[[transfer.table-data-where-sql.override]] section  --where  A WHERE sub-expression that takes precedence over the global WHERE expression, applied at table level to filter one specific table
[transfer.table-data-where-sql.override.table] section  --schema  Schema name of the table to migrate
--name  Name of the table to migrate

Command-line options

The mxshift command-line options are as follows:

Option  Description
--config  Prints the default contents of the configuration file
-c / --config-path  Path to the mxshift global configuration file
-a / --all  Runs the migration without asking for confirmation. Tables already completed and empty tables are skipped by default
-v / --version  Prints the mxshift version
-l / --show-progress  Prints the progress of the most recent migration. If no migration is running, shows the result of the last one; if one is running, shows its current progress. The STATUS values mean: Not start: outside the migration scope; Skip: skipped (already migrated, or an empty table); Processing/Interrupted: being migrated, or interrupted mid-migration; Complete: migration finished; Failed: migration failed
-R / --redo  Re-migrates tables even if they were already migrated, instead of skipping them (this clears all data of those tables in the target database and migrates them again)
--write-table-meta-file  Exports the metadata of the tables to be migrated to the file mxshift_meta_file, then exits
--read-table-meta-file  Imports the table metadata from the file mxshift_meta_file, then runs the data migration
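As a usage sketch, a non-interactive migration followed by a progress check might look like this (config.toml is a placeholder path):

## Start the migration without interactive confirmation
$ mxshift -c config.toml -a
## In another terminal, print the current migration progress
$ mxshift -c config.toml -l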