本文档介绍 YMatrix 运维的常见问题。
问题描述
使用 YMatrix 忘记密码。
解决方案
=# alter user mxadmin with password 'you password';
/etc/matrixdb/auth.conf
。 # sudo cat /etc/matrixdb/auth.conf
ssh_exchange_identification: Connection closed by remote host
问题描述
在使用 gpssh
命令,或者使用 ssh 登录命令的时候会遇到如下报错:
ssh_exchange_identification: Connection closed by remote host
问题分析
ssh 连接数超过了限制。
解决方案
# vim /etc/ssh/sshd_config
# MaxStartup 10:30:60
参数配置含义:
MaxStartup
三元组形式 10:30:60
10:当连接数达到 10 时就开始拒绝连接,不过不是全部拒绝。
30:当连接数到达 10 时,之后的连接有 30 的概率被拒绝掉。
60:当连接数达到 60 时,之后的连接就全部拒绝了。
#
去掉,MaxStartup 30:30:60
。systemctl restart sshd
# or
service sshd restart
psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 192.168.247.129:6005 pid=448116)
日志
psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 10.162.115.5:6005 pid=448116)
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
VACUUM
问题分析
事务的 age 超过限制。
解决方案
$ psql -d testdb -c "vacuum freeze;"
could not read block 3684552 of temporary file
问题描述
在运行 SQL 语句的时候会遇到如下报错:
could not read block 3684552 of temporary file: read only 0 of 8192 bytes
问题分析
在 SQL 计算时如果服务器磁盘不足会产生磁盘相关问题。
在 SQL 计算时磁盘异常也会出现该问题。
解决方案
du -sh
journalctl
命令查找关于磁盘的错误信息。// 查看 journalctl 日志保存总大小
journalctl --disk-usage
// 查看 2023 年 3 月 30 日下午 20:20 之后的日志
journalctl --since "2023-03-30 20:20:00"
// 查看从 2023 年 3 月 1 日 00:00:00 到 2023 年 3 月 31 日 00:00:00 之间的所有日志
journalctl --since "2023-03-01 00:00:00" --until "2023-03-31 00:00:00"
// 可以使用 "yesterday"、"today"、"tomorrow"或者 "now" 获取日志的时间段
journalctl --since yesterday
// 获取 2023-03-25 00 之后的日志保存到本地磁盘上
journalctl --since "2023-03-25 00:00:00" > journalctl.log
// 在 journalctl.log 日志中查找关于磁盘错误的日志
cat journalctl.log|grep "I/O"
可以的。
修改 /etc/hosts
备份保存 /etc/hosts
文件。
cp /etc/hosts /tmp/hosts_bak
修改 /etc/hosts
文件中 YMatrix 的 Hostname。
修改 YMatrix 元数据信息
切换用户。
$ su - mxadmin
关闭 YMatrix。
$ mxstop -af
启动 YMatrix Master 节点。
$ mxstart -m
进入 YMatrix 命令行,运行命令。
PGOPTIONS='-c gp_session_role=utility' psql -U mxadmin postgres
set allow_system_table_mods='ture';
update gp_segment_configuration set hostname=<更改之后的主机名>,address=<更改之后的主机名> where address=<原来的主机名>;
关闭 YMatrix。
$ mxstop -mf
启动 YMatrix。
$ mxstart -m
修改 YMatrix 监控信息
部署 Grafana 监控需要进行如下操作:
psql -d matrixmgr
matrixmgr=# SELECT mxmgr_remove_all('local');
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_local();
修改 dashboard.json
中与 Hostname 相关的信息。
例如原本 Hostname 为 dw1,修改后为 sdw1,需要将 dashboard.json
中所有的 dw1 修改为 sdw1,并重新导入 Dashboard。
部署 Prometheus 监控进行如下操作
matrixmgr=# SELECT mxmgr_remove_exporter();
matrixmgr=# SELECT mxmgr_remove_gate_exporter();
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_exporter();
FATAL: database "postgres" does not exist
问题分析
主机异常,重启后集群启动失败,检查数据库日志,报错:
FATAL","3D000","database ""postgres"" does not exist","The database subdirectory ""base/13376"" is missing
检查对应的文件目录,发现 13376
目录丢失。
问题复现
[mxadmin@mdw3 base]$ mxstop -af
13376
目录删除或者改名 [mxadmin@mdw3 base]$ mv 13376 13376bak
[mxadmin@mdw3 ~]$ mxstart -a
2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Starting MatrixDB cluster with: [no-prompt:true].
2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Connecting to physical cluster's etcd cluster ...
2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Connected to physical cluster's etcd cluster: [http://192.168.8.28:4679].
2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting database cluster ID ...
2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collected database cluster ID: AuWFhsrjyywC4xfMahgyor
2023-05-24:23:26:50.271 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting cluster info ...
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master instance parameters
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Host = mdw3
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Port = 5432
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Directory = /mxdata_20230514185455/master/mxseg-1
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Segment instances that will be started
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Host Port DataDir Role
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3 6000 /mxdata_20230514185455/primary/mxseg0 PRIMARY
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3 6001 /mxdata_20230514185455/primary/mxseg1 PRIMARY
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:50.275 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:50.603 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [1/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:51.605 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:51.608 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:51.748 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [2/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:52.749 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:52.752 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:52.781 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [3/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:53.782 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:53.785 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:53.906 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [4/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:54.907 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:54.912 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:54.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [5/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:55.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:55.936 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:55.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [6/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:56.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:56.961 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:57.085 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [7/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:58.086 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:58.089 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:58.114 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [8/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:59.115 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:59.118 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:59.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:27:00.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:27:00.143 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:27:00.299 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:27:00.301 mxstart:mxadmin:mdw3:017121-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))'. Please confirm database cluster has been deployed.
Error: rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))
解决方案
postgres
库文件丢失。如果有此集群中 Standby,将 Standby 切换接管 Master,重建 Standby。注意!
建议配置完备的 Standby、Mirror 镜像机制,保证集群的安全可用。
rpc error: code = Unknown desc = primary not found in shard state
问题描述
执行 mxstart -a
时报错:
2023-05-31:16:03:25.325 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.326 mxstart:mxadmin:mdw:026098-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-31:16:03:26.335 mxstart:mxadmin:mdw:026098-[INFO]:-Send start to cluster service
2023-05-31:16:03:26.716 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.721 mxstart:mxadmin:mdw:026098-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state'. Please confirm database cluster has been deployed.
Error: rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state
问题分析
解决方案
mxpacklogs
工具收集日志发送给我们的售前/售后人员进行分析。报错信息
[mxadmin@mdw ~]$ mxaddmirrors setup --plan-file /tmp/mxaddmirrors.plan --mode cli
[20230606:14:37:04][INFO] start: create_user_and_directories on 3 hosts
[20230606:14:37:04][INFO] done
[20230606:14:37:05][INFO] start: initialize_deployer on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: reg_mirrors on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: init_mirrors on 3 hosts
[20230606:14:37:06][ERROR] init_mirrors: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off
[20230606:14:37:06][INFO] revert start: init_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: reg_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: initialize_deployer on hosts
[20230606:14:37:13][INFO] revert done
[20230606:14:37:13][INFO] revert start: create_user_and_directories on 3 hosts
[20230606:14:37:13][INFO] revert done
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off
解决方案
segment
节点上面找到 pg_hba.conf
文件,添加白名单权限for i in `find /mxdata_20230606142513/ -name pg_hba.conf` ; do echo "host all all 0.0.0.0/0 md5" >>$i; done
2. 重新加载配置文件
mxstop -u
ERROR: replication slot "internal_wal_replication_slot" already exists
报错信息
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not send replication command "CREATE_REPLICATION_SLOT "internal_wal_replication_slot" PHYSICAL RESERVE_WAL": ERROR: replication slot "internal_wal_replication_slot" already exists
解决方案
查询 Segment 节点链接信息
=# SELECT E'PGOPTIONS=\'-c gp_role\=utility\' psql -h ' ||hostname ||' -p ' ||port FROM gp_segment_configuration where content<>-1;
?column?
-----------------------------------------------------
PGOPTIONS='-c gp_role=utility' psql -h sdw3 -p 6000
PGOPTIONS='-c gp_role=utility' psql -h sdw2 -p 6000
PGOPTIONS='-c gp_role=utility' psql -h sdw1 -p 6000
(3 rows)
分别登录第一步查询输出的节点连接信息,查询 slot
信息
SELECT * FROM pg_replication_slots ;
slot
SELECT pg_drop_replication_slot('internal_wal_replication_slot');
ERROR: interconnect Error: Could not set up tcp listener socket
报错信息
**** con19047721,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
**** con19047731,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
**** con19047723,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
问题分析
使用 journalctl
发现 NetworkManager
一直在重启 dhclient
服务,在重启期间会导致网络中断现象。
NetworkManager[2285]: <info> [1691711564.7553] device (ens10f0): state change: config -> ip-config (rea
NetworkManager[2285]: <info> [1691711564.7562] dhcp4 (ens10f0): activation: beginning transaction (time
NetworkManager[2285]: <info> [1691711564.7599] dhcp4 (ens10f0): dhclient started with pid 88186
dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 6 (xid=0x34e8d695)
dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 12 (xid=0x34e8d695)
在 CentOS 7 中会存在 network
和 NetworkManager
两个与网络配置有关的服务,network
是通过静态网络接口配置,而 NetworkManager
是动态网络来实现网络的管理,然而这两个服务同时启动时,会产生冲突,NetworkManager
会在网络断开的时候会清理路由,导致网络断开。
在监控上查看找NetStat
面板,查看连接使用情况。
解决方案
NetworkManager
服务systemctl stop NetworkManager
systemctl disable NetworkManager
将目标端 IP 网段配置到源端 Master 机器上的 pg_hba
文件中即可。
disable-connector = false
,则会在源端数据库的 Segment 节点上为相应 Primary 实例生成随机端口;配置为 true
,系统会则根据源端集群初始化时配置的 Primary 实例端口进行分配。export MXSHIFT_WORKER_PORT_BASE = <端口号>
,同时需指定配置文件中的压缩方式 --compress-method
。开启压缩会使得再次执行 mxshift 后,执行日志能够显示端口范围。范围中的最大值会根据并发度以及 Segment 数量变化。index ron requires 19160 bvtes, maximum size is 8191
报错信息
java. lang.RuntimeException: org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 Internal Server Error:
[ERROR: index row requires 19160 bytes, maximum size is 8191 (seg35 10.17.8.132:6012 pid=15508) (SQLSTATE 54000)]
问题分析
插入的数据时超过了 btree
索引的行数限制。
postgres=# CREATE TABLE test ( x text ) DISTRIBUTED BY(x);
CREATE TABLE
Time: 26.768 ms
postgres=# INSERT INTO test(x) VALUES ( repeat('x', 900000) );
INSERT 0 1
Time: 16.707 ms
postgres=# CREATE INDEX test_x ON test USING btree(x);
ERROR: index row requires 10328 bytes, maximum size is 8191 (seg0 172.16.172.148:6000 pid=21674)
Time: 16.257 ms
postgres=# CREATE INDEX test_x ON test USING brin(x);
CREATE INDEX
Time: 19.866 ms
解决方案
减少字段的长度或者更换其他的索引。