YMatrix 运维常见问题
本文档介绍 YMatrix 运维的常见问题。
1 忘记密码
问题描述
使用 YMatrix 忘记密码。
解决方案
- 数据库密码忘记
给数据库用户重新设置一个新的密码。
=# alter user mxadmin with password 'you password';
- 图形化界面的登录密码忘记
查看
/etc/matrixdb/auth.conf
。# sudo cat /etc/matrixdb/auth.conf
2 ssh_exchange_identification: Connection closed by remote host
问题描述
在使用 gpssh
命令,或者使用 ssh 登录命令的时候会遇到如下报错:
ssh_exchange_identification: Connection closed by remote host
问题分析
ssh 连接数超过了限制。
解决方案
- 打开目标服务器的 sshd 配置文件
# vim /etc/ssh/sshd_config
- 找到 MaxStartup 参数配置
# MaxStartup 10:30:60
参数配置含义:
MaxStartup
三元组形式 10:30:60
10:当连接数达到 10 时就开始拒绝连接,不过不是全部拒绝。
30:当连接数到达 10 时,之后的连接有 30 的概率被拒绝掉。
60:当连接数达到 60 时,之后的连接就全部拒绝了。
- 修改参数并将
#
去掉,MaxStartup 30:30:60
。 - 重启 sshd 服务
systemctl restart sshd # or service sshd restart
- 使用 ssh 重新登录。
3 psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 192.168.247.129:6005 pid=448116)
日志
psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 10.162.115.5:6005 pid=448116)
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
VACUUM
问题分析
事务的 age 超过限制。
解决方案
$ psql -d testdb -c "vacuum freeze;"
4 could not read block 3684552 of temporary file
问题描述
在运行 SQL 语句的时候会遇到如下报错:
could not read block 3684552 of temporary file: read only 0 of 8192 bytes
问题分析
-
在 SQL 计算时如果服务器磁盘不足会产生磁盘相关问题。
-
在 SQL 计算时磁盘异常也会出现该问题。
解决方案
- 在 SQL 计算时用以下命令实时观察磁盘的使用情况。
du -sh
- 使用
journalctl
命令查找关于磁盘的错误信息。// 查看 journalctl 日志保存总大小 journalctl --disk-usage // 查看 2023 年 3 月 30 日下午 20:20 之后的日志 journalctl --since "2023-03-30 20:20:00" // 查看从 2023 年 3 月 1 日 00:00:00 到 2023 年 3 月 31 日 00:00:00 之间的所有日志 journalctl --since "2023-03-01 00:00:00" --until "2023-03-31 00:00:00" // 可以使用 "yesterday"、"today"、"tomorrow"或者 "now" 获取日志的时间段 journalctl --since yesterday // 获取 2023-03-25 00 之后的日志保存到本地磁盘上 journalctl --since "2023-03-25 00:00:00" > journalctl.log // 在 journalctl.log 日志中查找关于磁盘错误的日志 cat journalctl.log|grep "I/O"
5 YMatrix 可以修改服务器主机名(Hostname)吗?
可以的。
-
修改
/etc/hosts
备份保存/etc/hosts
文件。cp /etc/hosts /tmp/hosts_bak
修改
/etc/hosts
文件中 YMatrix 的 Hostname。 -
修改 YMatrix 元数据信息
切换用户。$ su - mxadmin
关闭 YMatrix。
$ mxstop -af
启动 YMatrix Master 节点。
$ mxstart -m
进入 YMatrix 命令行,运行命令。
PGOPTIONS='-c gp_session_role=utility' psql -U mxadmin postgres set allow_system_table_mods='ture'; update gp_segment_configuration set hostname=<更改之后的主机名>,address=<更改之后的主机名> where address=<原来的主机名>;
关闭 YMatrix。
$ mxstop -mf
启动 YMatrix。
$ mxstart -m
-
修改 YMatrix 监控信息
部署 Grafana 监控需要进行如下操作:psql -d matrixmgr matrixmgr=# SELECT mxmgr_remove_all('local'); matrixmgr=# truncate local.matrix_manager_config; matrixmgr=# SELECT mxmgr_init_local();
修改
dashboard.json
中与 Hostname 相关的信息。
例如原本 Hostname 为 dw1,修改后为 sdw1,需要将dashboard.json
中所有的 dw1 修改为 sdw1,并重新导入 Dashboard。
部署 Prometheus 监控进行如下操作
matrixmgr=# SELECT mxmgr_remove_exporter();
matrixmgr=# SELECT mxmgr_remove_gate_exporter();
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_exporter();
6 数据库启动报错 FATAL: database "postgres" does not exist
问题分析
主机异常,重启后集群启动失败,检查数据库日志,报错:
FATAL","3D000","database ""postgres"" does not exist","The database subdirectory ""base/13376"" is missing
检查对应的文件目录,发现 13376
目录丢失。
问题复现
- 搭建一套集群
- 将集群关闭
[mxadmin@mdw3 base]$ mxstop -af
- 将
13376
目录删除或者改名[mxadmin@mdw3 base]$ mv 13376 13376bak
- 启动集群
[mxadmin@mdw3 ~]$ mxstart -a 2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Starting MatrixDB cluster with: [no-prompt:true]. 2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Connecting to physical cluster's etcd cluster ... 2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Connected to physical cluster's etcd cluster: [http://192.168.8.28:4679]. 2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting database cluster ID ... 2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collected database cluster ID: AuWFhsrjyywC4xfMahgyor 2023-05-24:23:26:50.271 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting cluster info ... 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:---------------------------- 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master instance parameters 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:---------------------------- 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Host = mdw3 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Port = 5432 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Directory = /mxdata_20230514185455/master/mxseg-1 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:---------------------------------------- 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Segment instances that will be started 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:---------------------------------------- 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Host Port DataDir Role 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3 6000 /mxdata_20230514185455/primary/mxseg0 PRIMARY 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3 6001 /mxdata_20230514185455/primary/mxseg1 PRIMARY 2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:50.275 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:50.603 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [1/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:51.605 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:51.608 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:51.748 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [2/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:52.749 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:52.752 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:52.781 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [3/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:53.782 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:53.785 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:53.906 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [4/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:54.907 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:54.912 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:54.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [5/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:55.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:55.936 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:55.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [6/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:56.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:56.961 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:57.085 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [7/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:58.086 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:58.089 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:58.114 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [8/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:26:59.115 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:26:59.118 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:26:59.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:27:00.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ... 2023-05-24:23:27:00.143 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service 2023-05-24:23:27:00.299 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)). 2023-05-24:23:27:00.301 mxstart:mxadmin:mdw3:017121-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))'. Please confirm database cluster has been deployed. Error: rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))
- 检查日志
解决方案
- 主机异常导致 Master 节点
postgres
库文件丢失。如果有此集群中 Standby,将 Standby 切换接管 Master,重建 Standby。 - 在没有配置 Standby 的情况下,如果只是 postgres 库文件丢失,在没有写入业务表的情况下,可以将其他节点的库文件拷贝到 Master 节点。
注意!
建议配置完备的 Standby、Mirror 镜像机制,保证集群的安全可用。
7 rpc error: code = Unknown desc = primary not found in shard state
问题描述
执行 mxstart -a
时报错:
2023-05-31:16:03:25.325 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.326 mxstart:mxadmin:mdw:026098-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-31:16:03:26.335 mxstart:mxadmin:mdw:026098-[INFO]:-Send start to cluster service
2023-05-31:16:03:26.716 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.721 mxstart:mxadmin:mdw:026098-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state'. Please confirm database cluster has been deployed.
Error: rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state
问题分析
- 集群发生了 double failure 导致不能启动。
解决方案
- 使用
mxpacklogs
工具收集日志发送给我们的售前/售后人员进行分析。
8 mxaddmirrors 执行部署添加 Mirror 报错
报错信息
[mxadmin@mdw ~]$ mxaddmirrors setup --plan-file /tmp/mxaddmirrors.plan --mode cli
[20230606:14:37:04][INFO] start: create_user_and_directories on 3 hosts
[20230606:14:37:04][INFO] done
[20230606:14:37:05][INFO] start: initialize_deployer on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: reg_mirrors on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: init_mirrors on 3 hosts
[20230606:14:37:06][ERROR] init_mirrors: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off
[20230606:14:37:06][INFO] revert start: init_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: reg_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: initialize_deployer on hosts
[20230606:14:37:13][INFO] revert done
[20230606:14:37:13][INFO] revert start: create_user_and_directories on 3 hosts
[20230606:14:37:13][INFO] revert done
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off
解决方案
- 在所有
segment
节点上面找到pg_hba.conf
文件,添加白名单权限for i in `find /mxdata_20230606142513/ -name pg_hba.conf` ; do echo "host all all 0.0.0.0/0 md5" >>$i; done
2. 重新加载配置文件
mxstop -u
9 ERROR: replication slot "internal_wal_replication_slot" already exists
报错信息
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not send replication command "CREATE_REPLICATION_SLOT "internal_wal_replication_slot" PHYSICAL RESERVE_WAL": ERROR: replication slot "internal_wal_replication_slot" already exists
解决方案
-
查询 Segment 节点链接信息
=# SELECT E'PGOPTIONS=\'-c gp_role\=utility\' psql -h ' ||hostname ||' -p ' ||port FROM gp_segment_configuration where content<>-1; ?column? ----------------------------------------------------- PGOPTIONS='-c gp_role=utility' psql -h sdw3 -p 6000 PGOPTIONS='-c gp_role=utility' psql -h sdw2 -p 6000 PGOPTIONS='-c gp_role=utility' psql -h sdw1 -p 6000 (3 rows)
-
分别登录第一步查询输出的节点连接信息,查询
slot
信息
SELECT * FROM pg_replication_slots ;
- 分别登录第一步查询输出的节点连接信息,删除
slot
SELECT pg_drop_replication_slot('internal_wal_replication_slot');
10 ERROR: interconnect Error: Could not set up tcp listener socket
报错信息
**** con19047721,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
**** con19047731,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
**** con19047723,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
问题分析
-
使用
journalctl
发现NetworkManager
一直在重启dhclient
服务,在重启期间会导致网络中断现象。NetworkManager[2285]: <info> [1691711564.7553] device (ens10f0): state change: config -> ip-config (rea NetworkManager[2285]: <info> [1691711564.7562] dhcp4 (ens10f0): activation: beginning transaction (time NetworkManager[2285]: <info> [1691711564.7599] dhcp4 (ens10f0): dhclient started with pid 88186 dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 6 (xid=0x34e8d695) dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 12 (xid=0x34e8d695)
-
在 CentOS 7 中会存在
network
和NetworkManager
两个与网络配置有关的服务,network
是通过静态网络接口配置,而NetworkManager
是动态网络来实现网络的管理,然而这两个服务同时启动时,会产生冲突,NetworkManager
会在网络断开的时候会清理路由,导致网络断开。 -
在监控上查看找
NetStat
面板,查看连接使用情况。
解决方案
- 关掉
NetworkManager
服务systemctl stop NetworkManager systemctl disable NetworkManager
- 查看其他的应用是否有连接没有关闭的情况。
11 使用 mxshift 工具进行迁移之前,如何增加白名单?
将目标端 IP 网段配置到源端 Master 机器上的 pg_hba
文件中即可。
12 如果我在内网环境使用 mxshift 工具,需要开通哪些端口?
- 源端数据库:如果未配置
disable-connector = false
,则会在源端数据库的 Segment 节点上为相应 Primary 实例生成随机端口;配置为true
,系统会则根据源端集群初始化时配置的 Primary 实例端口进行分配。 - 目标端数据库:YMatrix 会在目标端数据库开启随机端口,如果需要手动指定,则需通过修改环境变量来实现:
export MXSHIFT_WORKER_PORT_BASE = <端口号>
,同时需指定配置文件中的压缩方式--compress-method
。开启压缩会使得再次执行 mxshift 后,执行日志能够显示端口范围。范围中的最大值会根据并发度以及 Segment 数量变化。
13 index ron requires 19160 bvtes, maximum size is 8191
报错信息
java. lang.RuntimeException: org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 Internal Server Error:
[ERROR: index row requires 19160 bytes, maximum size is 8191 (seg35 10.17.8.132:6012 pid=15508) (SQLSTATE 54000)]
问题分析
插入的数据时超过了 btree
索引的行数限制。
postgres=# CREATE TABLE test ( x text ) DISTRIBUTED BY(x);
CREATE TABLE
Time: 26.768 ms
postgres=# INSERT INTO test(x) VALUES ( repeat('x', 900000) );
INSERT 0 1
Time: 16.707 ms
postgres=# CREATE INDEX test_x ON test USING btree(x);
ERROR: index row requires 10328 bytes, maximum size is 8191 (seg0 172.16.172.148:6000 pid=21674)
Time: 16.257 ms
postgres=# CREATE INDEX test_x ON test USING brin(x);
CREATE INDEX
Time: 19.866 ms
解决方案
减少字段的长度或者更换其他的索引。