在线扩容

对于海量存储的分布式数据库，随着数据量的增大，已有集群容量与算力不够用是不可避免的问题，所以扩容操作不可或缺。MatrixDB可以做到在线扩容，包括：

不停机增加新节点
表数据重分布并行化
未重分布表与已重分布表可以进行连接操作

在线扩容分两个步骤：

增加新节点
数据重分布

1. 增加新节点

该环节为集群增加新的segment节点，并启动使用，但原有表数据还存储在老的节点上。

增加新节点首先要提供部署方案配置文件，该文件可以手动编写，也可以使用gpexpand工具自动生成，推荐使用自动生成方式。

1.1 生成扩容部署配置文件

执行gpexpand：

[mxadmin@mdw ~]$ gpexpand
......
Please refer to the Admin Guide for more information.

Would you like to initiate a new System Expansion Yy|Nn (default=N):

输入y回车：

> y

然后要求输入新增节点主机名，逗号分割：

Enter a comma separated list of new hosts you want
to add to your array.  Do not include interface hostnames.
**Enter a blank line to only add segments to existing hosts**[]:
> sdw3,sdw4

确认新增主机名网络联通，并且与原节点已建立互信。并确保已安装rsync

每个节点额外增加多少机器？这里输入的值是在原有节点部署segment数量的基础上增加多少segment。如果每台主机部署节点数与原主机相同，则直接输入0即可。

How many new primary segments per host do you want to add? (default=0):
> 0

Generating configuration file...

20211102:14:36:24:024562 gpexpand:mdw:mxadmin-[INFO]:-Generating input file...

Input configuration file was written to 'gpexpand_inputfile_20211102_143624'.

Please review the file and make sure that it is correct then re-run
with: gpexpand -i gpexpand_inputfile_20211102_143624

20211102:14:36:24:024562 gpexpand:mdw:mxadmin-[INFO]:-Exiting...

执行完毕会在当前目录生成gpexpand_inputfile_20211102_143624配置文件：

[mxadmin@mdw ~]$ cat gpexpand_inputfile_20211102_143624
sdw3|sdw3|7002|/home/mxadmin/gpdemo/datadirs/dbfast1/demoDataDir3|5|3|p
sdw3|sdw3|7003|/home/mxadmin/gpdemo/datadirs/dbfast2/demoDataDir4|6|4|p
sdw3|sdw3|7004|/home/mxadmin/gpdemo/datadirs/dbfast3/demoDataDir5|7|5|p
sdw4|sdw4|7002|/home/mxadmin/gpdemo/datadirs/dbfast1/demoDataDir6|8|6|p
sdw4|sdw4|7003|/home/mxadmin/gpdemo/datadirs/dbfast2/demoDataDir7|9|7|p
sdw4|sdw4|7004|/home/mxadmin/gpdemo/datadirs/dbfast3/demoDataDir8|10|8|p

配置文件中包含了新主机信息，以及新segment端口号、数据目录、dbid、角色等信息。这些信息与gp_segment_configuration表对应。该配置文件也可以按照该格式手动编写。

1.2 执行新增节点操作

准备好配置文件后，下面执行增加新节点操作：

[mxadmin@mdw ~]$ gpexpand -i gpexpand_inputfile_20211102_143624

......
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Unlocking catalog
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Unlocked catalog
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Creating expansion schema
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Populating gpexpand.status_detail with data from database postgres
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Populating gpexpand.status_detail with data from database template1
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Populating gpexpand.status_detail with data from database mxadmin
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-************************************************
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Initialization of the system expansion complete.
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-To begin table expansion onto the new segments
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-rerun gpexpand
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-************************************************
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Exiting...

执行成功后，查询gp_segment_configuration表，即可看到新节点已加入。

2. 数据重分布

在线扩容的第二步操作，是将表数据重分布到新节点上，因为新增加的segment节点上并没有数据。

MatrixDB支持的3种分布方式：随机、哈希、复制，都要按照规则同步到新segment节点上。其中，复制表依然要全量存储在新增加的segment节点上；哈希表则要按照规则做重分布；随机表因为数据分布无规律，不会做数据迁移，只是新插入的数据会按照概率落到新的segment节点上。

第一阶段执行完毕后，会在postgres数据库创建schema:gpexpand，包含了扩容状态信息：

postgres=# \d
                     List of relations
  Schema  |        Name        | Type  |  Owner  | Storage
----------+--------------------+-------+---------+---------
 gpexpand | expansion_progress | view  | mxadmin |
 gpexpand | status             | table | mxadmin | heap
 gpexpand | status_detail      | table | mxadmin | heap
(3 rows)

其中，status_detail表记录了所有需要做重分布的表及重分布状态：

postgres=# select * from gpexpand.status_detail ;
 dbname  |   fq_name   | table_oid | root_partition_name | rank | external_writable |   status    | expansion_started | expansion_finished | source_bytes
---------+-------------+-----------+---------------------+------+-------------------+-------------+-------------------+--------------------+--------------
 mxadmin | public.t2   |     16388 |                     |    2 | f                 | NOT STARTED |                   |                    |         8192
 mxadmin | public.t1   |     16385 |                     |    2 | f                 | NOT STARTED |                   |                    |        16384
 mxadmin | public.disk |     16391 | public.disk         |    2 | f                 | NOT STARTED |                   |                    |            0
(3 rows)

直接运行gpexpand即可执行表重分布：

[mxadmin@mdw ~]$ gpexpand
......
gpexpand:mdw:mxadmin-[INFO]:-Querying gpexpand schema for current expansion state
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Expanding mxadmin.public.t2
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Finished expanding mxadmin.public.t2
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Expanding mxadmin.public.t1
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Finished expanding mxadmin.public.t1
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Expanding mxadmin.public.disk
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Finished expanding mxadmin.public.disk
20211102:15:18:32:026291 gpexpand:mdw:mxadmin-[INFO]:-EXPANSION COMPLETED SUCCESSFULLY
20211102:15:18:32:026291 gpexpand:mdw:mxadmin-[INFO]:-Exiting...

表重分布完毕后再查询status_detail表，会看到状态已经变成了COMPLETED：

postgres=# select * from gpexpand.status_detail ;
 dbname  |   fq_name   | table_oid | root_partition_name | rank | external_writable |  status   |     expansion_started      |     expansion_finished     | source_bytes
---------+-------------+-----------+---------------------+------+-------------------+-----------+----------------------------+----------------------------+--------------
 mxadmin | public.t2   |     16388 |                     |    2 | f                 | COMPLETED | 2021-11-02 15:18:27.326247 | 2021-11-02 15:18:27.408379 |         8192
 mxadmin | public.t1   |     16385 |                     |    2 | f                 | COMPLETED | 2021-11-02 15:18:27.431481 | 2021-11-02 15:18:27.507591 |        16384
 mxadmin | public.disk |     16391 | public.disk         |    2 | f                 | COMPLETED | 2021-11-02 15:18:27.531727 | 2021-11-02 15:18:27.570559 |            0
(3 rows)

如果想增加重分布并发度，可以在执行gpexpand命令时加-B参数，默认值是16，上限是128，如：

[mxadmin@mdw ~]$ gpexpand -B 32

重分布完毕后，运行gpexpand -c来清理扩容过程的中间表：

[mxadmin@mdw ~]$ gpexpand -c
20211102:15:24:41:026524 gpexpand:mdw:mxadmin-[INFO]:-local Greenplum Version: 'postgres (MatrixDB) 5.0.0-enterprise~alpha (Greenplum Database) 7.0.0 build dev'
20211102:15:24:41:026524 gpexpand:mdw:mxadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 12 (MatrixDB 5.0.0-enterprise~alpha) (Greenplum Database 7.0.0 build dev) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit compiled on Oct 25 2021 15:24:16'
20211102:15:24:41:026524 gpexpand:mdw:mxadmin-[INFO]:-Querying gpexpand schema for current expansion state


Do you want to dump the gpexpand.status_detail table to file? Yy|Nn (default=Y):
> n
20211102:15:24:46:026524 gpexpand:mdw:mxadmin-[INFO]:-Removing gpexpand schema
20211102:15:24:46:026524 gpexpand:mdw:mxadmin-[INFO]:-Cleanup Finished.  exiting...

到此，扩容成功。

3. FAQ

扩容过程中新旧节点的catalog是一致的吗
- catalog始终是一致的，新节点上即使还没有数据，但是catalog也是存在的。
扩容过程中对于其他查询的影响
- 因为是在线扩容，期间不会中断正在运行的所有查询，只是在增加新节点的时候因为要锁catalog，所以在此期间不支持DDL，加完新节点立即释放catalog锁。
- 在数据重分布的时候因为加了最高级别的锁，会阻塞读操作。

gpexpand的详细使用方法请参考文档