YMatrix 运维常见问题

本文档介绍 YMatrix 运维的常见问题。


1 忘记密码


问题描述

使用 YMatrix 忘记密码。

解决方案

  1. 数据库密码忘记 给数据库用户重新设置一个新的密码。
    =# alter user mxadmin with password 'you password';
  2. 图形化界面的登录密码忘记 查看 /etc/matrixdb/auth.conf
    # sudo cat /etc/matrixdb/auth.conf


2 ssh_exchange_identification: Connection closed by remote host


问题描述 在使用 gpssh 命令,或者使用 ssh 登录命令的时候会遇到如下报错:

ssh_exchange_identification: Connection closed by remote host

问题分析

ssh 连接数超过了限制。

解决方案

  1. 打开目标服务器的 sshd 配置文件
    # vim /etc/ssh/sshd_config
  2. 找到 MaxStartup 参数配置

# MaxStartup 10:30:60

参数配置含义: MaxStartup 三元组形式 10:30:60 10:当连接数达到 10 时就开始拒绝连接,不过不是全部拒绝。 30:当连接数到达 10 时,之后的连接有 30 的概率被拒绝掉。 60:当连接数达到 60 时,之后的连接就全部拒绝了。

  1. 修改参数并将 # 去掉,MaxStartup 30:30:60
  2. 重启 sshd 服务
    systemctl restart sshd 
    # or 
    service sshd restart
  3. 使用 ssh 重新登录。


3 psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 192.168.247.129:6005 pid=448116)


日志

psql: WARNING:  database "testdb" must be vacuumed within 529926866 transactions  (seg35 10.162.115.5:6005 pid=448116)
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
VACUUM

问题分析

事务的 age 超过限制。

解决方案

$ psql -d testdb -c "vacuum freeze;"


4 could not read block 3684552 of temporary file


问题描述

在运行 SQL 语句的时候会遇到如下报错:

could not read block 3684552 of temporary file: read only 0 of 8192 bytes

问题分析

  1. 在 SQL 计算时如果服务器磁盘不足会产生磁盘相关问题。

  2. 在 SQL 计算时磁盘异常也会出现该问题。

解决方案

  1. 在 SQL 计算时用以下命令实时观察磁盘的使用情况。
    du -sh 
  2. 使用 journalctl 命令查找关于磁盘的错误信息。
    // 查看 journalctl 日志保存总大小
    journalctl --disk-usage
    // 查看 2023 年 3 月 30 日下午 20:20 之后的日志
    journalctl --since "2023-03-30 20:20:00"
    // 查看从 2023 年 3 月 1 日 00:00:00 到 2023 年 3 月 31 日 00:00:00 之间的所有日志
    journalctl --since "2023-03-01 00:00:00" --until "2023-03-31 00:00:00"
    // 可以使用 "yesterday"、"today"、"tomorrow"或者 "now" 获取日志的时间段
    journalctl --since yesterday
    // 获取 2023-03-25 00 之后的日志保存到本地磁盘上
    journalctl --since "2023-03-25 00:00:00"  > journalctl.log
    // 在 journalctl.log 日志中查找关于磁盘错误的日志
    cat journalctl.log|grep "I/O"


5 YMatrix 可以修改服务器主机名(Hostname)吗?


可以的。

  1. 修改 /etc/hosts
    备份保存 /etc/hosts 文件。

    cp /etc/hosts /tmp/hosts_bak

    修改 /etc/hosts 文件中 YMatrix 的 Hostname。

  2. 修改 YMatrix 元数据信息
    切换用户。

    $ su - mxadmin

    关闭 YMatrix。

    $ mxstop -af

    启动 YMatrix Master 节点。

    $ mxstart -m

    进入 YMatrix 命令行,运行命令。

    PGOPTIONS='-c gp_session_role=utility' psql -U mxadmin postgres
    set allow_system_table_mods='ture';
    update gp_segment_configuration set hostname=<更改之后的主机名>,address=<更改之后的主机名> where address=<原来的主机名>;

    关闭 YMatrix。

    $ mxstop -mf

    启动 YMatrix。

    $ mxstart -m
  3. 修改 YMatrix 监控信息
    部署 Grafana 监控需要进行如下操作:

    psql -d matrixmgr
    matrixmgr=# SELECT mxmgr_remove_all('local');
    matrixmgr=# truncate local.matrix_manager_config;
    matrixmgr=# SELECT mxmgr_init_local();

    修改 dashboard.json 中与 Hostname 相关的信息。
    例如原本 Hostname 为 dw1,修改后为 sdw1,需要将 dashboard.json 中所有的 dw1 修改为 sdw1,并重新导入 Dashboard。

部署 Prometheus 监控进行如下操作

matrixmgr=# SELECT mxmgr_remove_exporter();
matrixmgr=# SELECT mxmgr_remove_gate_exporter();
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_exporter();


6 数据库启动报错 FATAL: database "postgres" does not exist


问题分析

主机异常,重启后集群启动失败,检查数据库日志,报错:

FATAL","3D000","database ""postgres"" does not exist","The database subdirectory ""base/13376"" is missing

检查对应的文件目录,发现 13376 目录丢失。

问题复现

  1. 搭建一套集群
  2. 将集群关闭
    [mxadmin@mdw3 base]$ mxstop -af
  3. 13376 目录删除或者改名
    [mxadmin@mdw3 base]$ mv 13376 13376bak
  4. 启动集群
    [mxadmin@mdw3 ~]$ mxstart -a
    2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Starting MatrixDB cluster with: [no-prompt:true].
    2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Connecting to physical cluster's etcd cluster ...
    2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Connected to physical cluster's etcd cluster: [http://192.168.8.28:4679].
    2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting database cluster ID ...
    2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collected database cluster ID: AuWFhsrjyywC4xfMahgyor
    2023-05-24:23:26:50.271 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting cluster info ...
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master instance parameters
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Host         = mdw3
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Port         = 5432
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Directory    = /mxdata_20230514185455/master/mxseg-1
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Segment instances that will be started
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Host   Port    DataDir                                 Role      
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3   6000    /mxdata_20230514185455/primary/mxseg0   PRIMARY   
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3   6001    /mxdata_20230514185455/primary/mxseg1   PRIMARY   
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:50.275 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:50.603 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [1/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:51.605 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:51.608 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:51.748 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [2/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:52.749 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:52.752 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:52.781 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [3/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:53.782 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:53.785 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:53.906 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [4/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:54.907 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:54.912 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:54.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [5/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:55.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:55.936 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:55.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [6/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:56.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:56.961 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:57.085 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [7/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:58.086 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:58.089 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:58.114 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [8/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:59.115 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:59.118 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:59.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:27:00.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:27:00.143 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:27:00.299 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:27:00.301 mxstart:mxadmin:mdw3:017121-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))'. Please confirm database cluster has been deployed.
    Error: rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))
  5. 检查日志

解决方案

  1. 主机异常导致 Master 节点 postgres 库文件丢失。如果有此集群中 Standby,将 Standby 切换接管 Master,重建 Standby。
  2. 在没有配置 Standby 的情况下,如果只是 postgres 库文件丢失,在没有写入业务表的情况下,可以将其他节点的库文件拷贝到 Master 节点。

注意!
建议配置完备的 Standby、Mirror 镜像机制,保证集群的安全可用。


7 rpc error: code = Unknown desc = primary not found in shard state


问题描述

执行 mxstart -a 时报错:

2023-05-31:16:03:25.325 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.326 mxstart:mxadmin:mdw:026098-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-31:16:03:26.335 mxstart:mxadmin:mdw:026098-[INFO]:-Send start to cluster service
2023-05-31:16:03:26.716 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.721 mxstart:mxadmin:mdw:026098-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state'. Please confirm database cluster has been deployed.
Error: rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state

问题分析

  1. 集群发生了 double failure 导致不能启动。

解决方案

  1. 使用 mxpacklogs 工具收集日志发送给我们的售前/售后人员进行分析。


8 mxaddmirrors 执行部署添加 Mirror 报错


报错信息

[mxadmin@mdw ~]$ mxaddmirrors setup --plan-file /tmp/mxaddmirrors.plan --mode cli
[20230606:14:37:04][INFO] start: create_user_and_directories on 3 hosts
[20230606:14:37:04][INFO] done
[20230606:14:37:05][INFO] start: initialize_deployer on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: reg_mirrors on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: init_mirrors on 3 hosts
[20230606:14:37:06][ERROR] init_mirrors: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"

STDERR:
    pg_basebackup: error: could not connect to server: FATAL:  no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off

[20230606:14:37:06][INFO] revert start: init_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: reg_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: initialize_deployer on hosts
[20230606:14:37:13][INFO] revert done
[20230606:14:37:13][INFO] revert start: create_user_and_directories on 3 hosts
[20230606:14:37:13][INFO] revert done
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"

STDERR:
    pg_basebackup: error: could not connect to server: FATAL:  no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off

解决方案

  1. 在所有 segment 节点上面找到 pg_hba.conf 文件,添加白名单权限
    for i in `find /mxdata_20230606142513/ -name pg_hba.conf` ; do echo "host            all             all             0.0.0.0/0       md5" >>$i; done

2. 重新加载配置文件

mxstop -u 


9 ERROR: replication slot "internal_wal_replication_slot" already exists


报错信息

Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"

STDERR:
    pg_basebackup: error: could not send replication command "CREATE_REPLICATION_SLOT "internal_wal_replication_slot" PHYSICAL RESERVE_WAL": ERROR:  replication slot "internal_wal_replication_slot" already exists

解决方案

  1. 查询 Segment 节点链接信息

    =# SELECT E'PGOPTIONS=\'-c gp_role\=utility\' psql -h ' ||hostname ||' -p ' ||port  FROM gp_segment_configuration where content<>-1;
                       ?column?                       
    -----------------------------------------------------
    PGOPTIONS='-c gp_role=utility' psql -h sdw3 -p 6000
    PGOPTIONS='-c gp_role=utility' psql -h sdw2 -p 6000
    PGOPTIONS='-c gp_role=utility' psql -h sdw1 -p 6000
    (3 rows)
  2. 分别登录第一步查询输出的节点连接信息,查询 slot 信息

SELECT * FROM pg_replication_slots ;
  1. 分别登录第一步查询输出的节点连接信息,删除 slot
    SELECT pg_drop_replication_slot('internal_wal_replication_slot');


10 ERROR: interconnect Error: Could not set up tcp listener socket


报错信息

**** con19047721,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
**** con19047731,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,
**** con19047723,,seg14,,,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,,,0,,"ic_tcp.c",293,

问题分析

  1. 使用 journalctl 发现 NetworkManager 一直在重启 dhclient 服务,在重启期间会导致网络中断现象。

    NetworkManager[2285]: <info>  [1691711564.7553] device (ens10f0): state change: config -> ip-config (rea
    NetworkManager[2285]: <info>  [1691711564.7562] dhcp4 (ens10f0): activation: beginning transaction (time
    NetworkManager[2285]: <info>  [1691711564.7599] dhcp4 (ens10f0): dhclient started with pid 88186
    dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 6 (xid=0x34e8d695)
    dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 12 (xid=0x34e8d695)
  2. 在 CentOS 7 中会存在 networkNetworkManager 两个与网络配置有关的服务,network 是通过静态网络接口配置,而 NetworkManager 是动态网络来实现网络的管理,然而这两个服务同时启动时,会产生冲突,NetworkManager 会在网络断开的时候会清理路由,导致网络断开。

  3. 在监控上查看找NetStat 面板,查看连接使用情况。

解决方案

  1. 关掉 NetworkManager 服务
    systemctl stop NetworkManager  
    systemctl disable NetworkManager
  2. 查看其他的应用是否有连接没有关闭的情况。


11 使用 mxshift 工具进行迁移之前,如何增加白名单?


将目标端 IP 网段配置到源端 Master 机器上的 pg_hba 文件中即可。


12 如果我在内网环境使用 mxshift 工具,需要开通哪些端口?


  • 源端数据库:如果未配置 disable-connector = false,则会在源端数据库的 Segment 节点上为相应 Primary 实例生成随机端口;配置为 true,系统会则根据源端集群初始化时配置的 Primary 实例端口进行分配。
  • 目标端数据库:YMatrix 会在目标端数据库开启随机端口,如果需要手动指定,则需通过修改环境变量来实现: export MXSHIFT_WORKER_PORT_BASE = <端口号>,同时需指定配置文件中的压缩方式 --compress-method。开启压缩会使得再次执行 mxshift 后,执行日志能够显示端口范围。范围中的最大值会根据并发度以及 Segment 数量变化。


13 index ron requires 19160 bvtes, maximum size is 8191


报错信息

java. lang.RuntimeException: org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 Internal Server Error:
[ERROR: index row requires 19160 bytes, maximum size is 8191 (seg35 10.17.8.132:6012 pid=15508) (SQLSTATE 54000)]

问题分析

插入的数据时超过了 btree 索引的行数限制。

postgres=# CREATE TABLE test ( x text ) DISTRIBUTED BY(x);
CREATE TABLE
Time: 26.768 ms

postgres=# INSERT INTO test(x) VALUES ( repeat('x', 900000) );
INSERT 0 1
Time: 16.707 ms

postgres=# CREATE INDEX test_x ON test USING btree(x);
ERROR:  index row requires 10328 bytes, maximum size is 8191  (seg0 172.16.172.148:6000 pid=21674)
Time: 16.257 ms

postgres=# CREATE INDEX test_x ON test USING brin(x);
CREATE INDEX
Time: 19.866 ms

解决方案

减少字段的长度或者更换其他的索引。