MatrixGate 常见问题

本文档介绍 MatrixGate 使用中的常见问题。

1 数据加载性能低


做数据加载时,发现磁盘 IO 利用率高,但是加载速度很慢。

问题分析

使用 gpcheckperf 看磁盘性能,网络性能,发现磁盘性能仅 80MB/s。

解决方案

加载多块磁盘提升 I/O 性能,将 WAL 和 Data 数据盘分开, 尽可能提升 I/O 性能。

2 failed to acquire resources on on or more segments ,fatal out of memory


mxgate 同时开启加载 30 张表时,出现 OOM 异常。

问题分析

yMatrix 是多进程模式,高并发请求过来,连接数过多,无法分配内存给相应的请求,从而报错。

解决方案

调整 /etc/sysctl.conf 中的内核参数 vm.overcommit_memory = 2
调整 mxgate prepared 参数,从 prepared=10 改为 prepared = 5

3 mxgate发生锁等待卡住

问题分析

查看 mxgate 进程状态:

  1. mxgate status 查看 mxgate 的进程状态
  2. 查看是否有数据正常入库
  3. 也可用通过 ps -ef|grep mxgate 查看 mxgate 的进程状态

查看锁等待信息: 关注 granted 列值为 false, 表示尚未获得锁的进程,可以使用如下SQL查看:

create view v_locks_monitor as   
with t_wait as    
(    
  select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted,   
  a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath,    
  b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name   
    from pg_locks a,pg_stat_activity b where a.pid=b.pid and not a.granted   
),   
t_run as   
(   
  select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted,   
  a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath,   
  b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name   
    from pg_locks a,pg_stat_activity b where a.pid=b.pid and a.granted   
),   
t_overlap as   
(   
  select r.* from t_wait w join t_run r on   
  (   
    r.locktype is not distinct from w.locktype and   
    r.database is not distinct from w.database and   
    r.relation is not distinct from w.relation and   
    r.page is not distinct from w.page and   
    r.tuple is not distinct from w.tuple and   
    r.virtualxid is not distinct from w.virtualxid and   
    r.transactionid is not distinct from w.transactionid and   
    r.classid is not distinct from w.classid and   
    r.objid is not distinct from w.objid and   
    r.objsubid is not distinct from w.objsubid and   
    r.pid <> w.pid   
  )    
),    
t_unionall as    
(    
  select r.* from t_overlap r    
  union all    
  select w.* from t_wait w    
)    
select locktype,datname,relation::regclass,page,tuple,virtualxid,transactionid::text,classid::regclass,objid,objsubid,   
string_agg(   
'Pid: '||case when pid is null then 'NULL' else pid::text end||chr(10)||   
'Lock_Granted: '||case when granted is null then 'NULL' else granted::text end||' , Mode: '||case when mode is null then 'NULL' else mode::text end||' , FastPath: '||case when fastpath is null then 'NULL' else fastpath::text end||' , VirtualTransaction: '||case when virtualtransaction is null then 'NULL' else virtualtransaction::text end||' , Session_State: '||case when state is null then 'NULL' else state::text end||chr(10)||   
'Username: '||case when usename is null then 'NULL' else usename::text end||' , Database: '||case when datname is null then 'NULL' else datname::text end||' , Client_Addr: '||case when client_addr is null then 'NULL' else client_addr::text end||' , Client_Port: '||case when client_port is null then 'NULL' else client_port::text end||' , Application_Name: '||case when application_name is null then 'NULL' else application_name::text end||chr(10)||    
'Xact_Start: '||case when xact_start is null then 'NULL' else xact_start::text end||' , Query_Start: '||case when query_start is null then 'NULL' else query_start::text end||' , Xact_Elapse: '||case when (now()-xact_start) is null then 'NULL' else (now()-xact_start)::text end||' , Query_Elapse: '||case when (now()-query_start) is null then 'NULL' else (now()-query_start)::text end||chr(10)||    
'SQL (Current SQL in Transaction): '||chr(10)||  
case when query is null then 'NULL' else query::text end,    
chr(10)||'--------'||chr(10)    
order by    
  (  case mode    
    when 'INVALID' then 0   
    when 'AccessShareLock' then 1   
    when 'RowShareLock' then 2   
    when 'RowExclusiveLock' then 3   
    when 'ShareUpdateExclusiveLock' then 4   
    when 'ShareLock' then 5   
    when 'ShareRowExclusiveLock' then 6   
    when 'ExclusiveLock' then 7   
    when 'AccessExclusiveLock' then 8   
    else 0   
  end  ) desc,   
  (case when granted then 0 else 1 end)  
) as lock_conflict  
from t_unionall   
group by   
locktype,datname,relation,page,tuple,virtualxid,transactionid::text,classid,objid,objsubid ;

查看特定表名锁信息, 如果有其他处于 idle in transaction 的进程,则找到 mppsessionid:

select * from pg_locks where relation='t1_p'::regclass;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid   |       mode       | g
ranted | fastpath | mppsessionid | mppiswriter | gp_segment_id 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+--------+------------------+--
-------+----------+--------------+-------------+---------------
 relation |   693089 |  2693588 |      |       |            |               |         |       |          | 23/40692           | 100289 | RowExclusiveLock | t
       | t        |        37099 | t           |            -1
 relation |   693089 |  2693588 |      |       |            |               |         |       |          | 33/198609          | 100676 | AccessShareLock  | t
       | t        |        37099 | t           |             4
 relation |   693089 |  2693588 |      |       |            |               |         |       |          | 33/198609          | 100676 | RowExclusiveLock | t
       | t        |        37099 | t           |             4
(3 rows)

查看事务是否处于 waiting 状态:

[mxadmin@sdw21 gpconfigs]$ gpssh -f ./seg_host 
=> ps -ef|grep 37099
[sdw18] mxadmin  213588 152755  0 21:19 ?        00:00:00 postgres:  8000, mxadmin test 192.168.100.32(31024) con37099 seg8 idle
[sdw18] mxadmin  213589 152762  0 21:19 ?        00:00:00 postgres:  8001, mxadmin test 192.168.100.32(20614) con37099 seg9 idle
[sdw18] mxadmin  213590 152766  0 21:19 ?        00:00:00 postgres:  8002, mxadmin test 192.168.100.32(22542) con37099 seg10 idle
[sdw18] mxadmin  213591 152778  0 21:19 ?        00:00:00 postgres:  8003, mxadmin test 192.168.100.32(45560) con37099 seg11 idle
[sdw18] mxadmin  213592 152782  0 21:19 ?        00:00:00 postgres:  8004, mxadmin test 192.168.100.32(27438) con37099 seg12 idle
[sdw18] mxadmin  213593 152788  0 21:19 ?        00:00:00 postgres:  8005, mxadmin test 192.168.100.32(47258) con37099 seg13 idle
[sdw18] mxadmin  213594 152791  0 21:19 ?        00:00:00 postgres:  8006, mxadmin test 192.168.100.32(62898) con37099 seg14 idle
[sdw18] mxadmin  213966 212889  0 21:20 pts/2    00:00:00 grep --color=auto 37099
[sdw19] mxadmin  236149 203253  0 21:19 ?        00:00:00 postgres:  8000, mxadmin test 192.168.100.32(27718) con37099 seg16 idle
[sdw19] mxadmin  236150 203248  0 21:19 ?        00:00:00 postgres:  9007, mxadmin test 192.168.100.32(37320) con37099 seg15 idle
[sdw19] mxadmin  236151 203267  0 21:19 ?        00:00:00 postgres:  8003, mxadmin test 192.168.100.32(64854) con37099 seg19 idle
[sdw19] mxadmin  236152 203263  0 21:19 ?        00:00:00 postgres:  8002, mxadmin test 192.168.100.32(65230) con37099 seg18 idle
[sdw19] mxadmin  236153 203257  0 21:19 ?        00:00:00 postgres:  8001, mxadmin test 192.168.100.32(40968) con37099 seg17 idle
[sdw19] mxadmin  236154 203269  0 21:19 ?        00:00:00 postgres:  8004, mxadmin test 192.168.100.32(54360) con37099 seg20 idle
[sdw19] mxadmin  236155 203278  0 21:19 ?        00:00:00 postgres:  8005, mxadmin test 192.168.100.32(41424) con37099 seg21 idle
[sdw19] mxadmin  236156 203279  0 21:19 ?        00:00:00 postgres:  8006, mxadmin test 192.168.100.32(41532) con37099 seg22 idle
[sdw19] mxadmin  236157 203280  0 21:19 ?        00:00:00 postgres:  8007, mxadmin test 192.168.100.32(23184) con37099 seg23 idle
[sdw19] mxadmin  236295 235047  0 21:20 pts/1    00:00:00 grep --color=auto 37099
[sdw20] mxadmin  188516  35758  0 21:19 ?        00:00:00 postgres:  8000, mxadmin test 192.168.100.32(47346) con37099 seg24 idle
[sdw20] mxadmin  188517  35773  0 21:19 ?        00:00:00 postgres:  8003, mxadmin test 192.168.100.32(59058) con37099 seg27 idle
[sdw20] mxadmin  188518  35779  0 21:19 ?        00:00:00 postgres:  8004, mxadmin test 192.168.100.32(58736) con37099 seg28 idle
[sdw20] mxadmin  188519  35767  0 21:19 ?        00:00:00 postgres:  8002, mxadmin test 192.168.100.32(11736) con37099 seg26 idle
[sdw20] mxadmin  188520  35783  0 21:19 ?        00:00:00 postgres:  8005, mxadmin test 192.168.100.32(45638) con37099 seg29 idle
[sdw20] mxadmin  188521  35760  0 21:19 ?        00:00:00 postgres:  8001, mxadmin test 192.168.100.32(52888) con37099 seg25 idle
[sdw20] mxadmin  188522  35784  0 21:19 ?        00:00:00 postgres:  8006, mxadmin test 192.168.100.32(42540) con37099 seg30 idle
[sdw20] mxadmin  188523  35785  0 21:19 ?        00:00:00 postgres:  8007, mxadmin test 192.168.100.32(41324) con37099 seg31 idle
[sdw20] mxadmin  188650 187294  0 21:20 pts/1    00:00:00 grep --color=auto 37099

解决方案

  1. 如果确实发现处于 waiting 的事务,或者事务一直完不成,尽管拿到的是行级锁,也会影响自动创建子分区
  2. 则找到具体的 pid ,然后通过 select pg_terminate_backend(pid); 杀掉
  3. plpython udf 不接受 kill 信号,杀不掉
  4. 建议采用 kill -3 杀掉,最好不要用 kill -9

4 mxgate排除某一列


命令行使用时,通过参数 --exclude-columns 排除指定的列即可, 如是以后台运行模式,则在配置文件修改参数 exclude-columns=["列名"], 重启生效即可。

示例

以命令的方式进行加载 csv 文件(有表头),采用 tail -n +2 的方式过去第一行。

$ tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database postgres --db-master-host mdw --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256  --delimiter ','  --exclude-columns trip_duration 

5 mxgate 流式加载增加列不停服的办法


准备测试数据

-- 建表
create table t1(id int,a int);

-- 生成测试数据
insert into t1 select i,i from generate_series(1,10) as i;

-- 导出到文件
copy t1 to '/home/mxadmin/data/t1.csv' with delimiter '|';

-- 清空
truncate table t1;

生成配置文件

$ mxgate config --db-database postgres \
            --db-master-host localhost \
            --db-master-port 6432 \
            --db-user mxadmin \
            --target public.t1 \
            --time-format raw \
            --delimiter '|' \
            > mxgate.conf

启动 mxgate

$ mxgate start --config mxgate.conf 
******************************************************
 __  __       _        _       ____       _       
|  \/  | __ _| |_ _ __(_)_  __/ ___| __ _| |_ ___ 
| |\/| |/ _` | __| '__| \ \/ / |  _ / _` | __/ _ \
| |  | | (_| | |_| |  | |>  <| |_| | (_| | ||  __/
|_|  |_|\__,_|\__|_|  |_/_/\_\\____|\__,_|\__\___|
  Version: v4.6.5+Dev (git: master 4f01c039)
  Your Copy is Licensed to: http://www.ymatrix.cn
******************************************************
Launching MatrixGate daemon...
MatrixGate daemon started successfully

加载数据

curl http://localhost:8086/ -X POST -H 'Content-Type: text/plain' --data-binary "@t1.csv"

查看数据

postgres=# select * from t1;
 id  | a 
-----+-----
  1 |  1
  2 |  2
  3 |  3
  4 |  4
  5 |  5

暂停 mxgate

指定进程 pid 暂停

$ mxgate pause -X -S -p 133478 

指定 job 名称 schema.table 停

$ mxgate pause -X -S --job public.t1

pause 默认是异步的,这时还有行排他锁,如果要做 ddl 添加字段,会等待行排他锁结束, 加上 -S,就代表是暂停 mxgate 进程是同步执行的。

输出如下:

******************************************************
 __  __       _        _       ____       _       
|  \/  | __ _| |_ _ __(_)_  __/ ___| __ _| |_ ___ 
| |\/| |/ _` | __| '__| \ \/ / |  _ / _` | __/ _ \
| |  | | (_| | |_| |  | |>  <| |_| | (_| | ||  __/
|_|  |_|\__,_|\__|_|  |_/_/\_\\____|\__,_|\__\___|
  Version: v4.6.5+Dev (git: master 4f01c039)
  Your Copy is Licensed to: http://www.ymatrix.cn
******************************************************
begin to pause all jobs, please wait...
public.t1 paused

暂停 mxgate 任务时,会将 web 外部表的定义删除。

增加一个字段

alter table t1 add column b int;

重新启动指定的任务

$ mxgate resume -R --job public.t1

恢复 mxgate 进程时,会再次创建外部表。

加载数据

$ curl http://localhost:8086/ -X POST -H 'Content-Type: text/plain' --data-binary "@t1.csv"

数据加载正常

6 mxgate socket 文件缺失?


正常情况下 socket 文件如下:

$ cat /tmp/.s.MXGATED.4329.17056.lock 
## This mxgate process is part of MatrixDB internal tools
90969
/var/log/matrixdb/matrixgate.2022-09-15_093212-90969.log
0
4329.17056
17056
/tmp/mxgate.conf

经过排查,是 /tmp/.s.MXGATED.8086.0.lock 文件如果长时间没有被使用,文件可能会被系统删除。

解决方案

$ ps -ef |grep mxgate
mxadmin   59272 277634  0 11:57 pts/1    00:00:00 grep --color=auto mxgate
mxadmin   68731      1 40 Jan27 ?        7-11:13:10 /usr/local/matrixdb-4.3.11~rc6.enterprise/bin/mxgated daemon --config mxgate_chj_iot.conf
$ kill 68731

$ touch /tmp/.s.MXGATED.8086.0.lock
$ cat /tmp/.s.MXGATED.8086.0.lock  # 内容不重要,关键是文件要有6行
336941

$ wc -l /tmp/.s.MXGATED.8086.0.lock 
6 /tmp/.s.MXGATED.8086.0.lock

$ chmod 644 .s.MXGATED.8086.0.lock

$ mxgate status

******************************************************
 __  __       _        _       ____       _       
|  \/  | __ _| |_ _ __(_)_  __/ ___| __ _| |_ ___ 
| |\/| |/ _` | __| '__| \ \/ / |  _ / _` | __/ _ \
| |  | | (_| | |_| |  | |>  <| |_| | (_| | ||  __/
|_|  |_|\__,_|\__|_|  |_/_/\_\\____|\__,_|\__\___|
  Version: v4.3.13 (git: HEAD cef61ced)
  Your Copy is Licensed to: yMatrix.cn; 2022-04-13; any
******************************************************
PID          336941 alive
Launched At  2022-01-28 21:08:16
Up For       18 days 0 hours 39 minutes 15 seconds
Binary       /usr/local/matrixdb-4.3.11~rc6.enterprise/bin/mxgated
Log          
Config    

7 data format error: invalid input syntax for type integer: "1970-01-01 00:00:01"


问题分析

mxgate 设计之初主要是为了快速加载时序类数据,默认第一列为时间类型。

解决方案

mxgate 中的参数 time-format 默认为 unix-second,将其改为 raw 即可。

8 mxgate 入库时间字段默认值相差8小时


问题分析

mxgate 在参数 time-format 默认为 unix-second,会导致入数相差8小时。

解决方案

mxgate 中的参数 time-format 默认为 unix-second,将其改为 raw 即可。

9 Aborting operation regardless of REJECT LIMIT value, last error was: invalid input syntax for type timestamp: ""22:31:01"", column batch_num


问题分析

mxgate 数据加载,源端和目标端数据类型不一致报错。

表结构

create table s1.test1
 id                   character varying(36),
 created_date_time    timestamp(6) without time zone , 
 modified_date_time   timestamp(6) without time zone ,
 substep              character varying(36)        ,
 substep_id           character varying(6)         ,
 txdate               date                         ,
 batch_num            time(0) without time zone    
 )
distributed by (id)

执行命令

$ mxgate  --source transfer --src-host 172.26.14.15 --src-port 5432 --src-db wdp --src-user user1 --src-password password1 --src-schema s1 --src-table test1 --compress "gzip" --port-base 9000 --local-ip 172.26.14.17 --db-master-host 172.26.14.17 --db-database wdp --target s1.test1 --format csv --time-format raw --use-auto-increment=false

后台 Segment 报错日志

"Aborting operation regardless of REJECT LIMIT value, last error was: invalid input syntax for type timestamp: ""22:31:01"", column batch_num",

解决方案

将目标端的 batch_num 修改为 time(0) without time zone 类型,两边的数据类型需要保持一致。

10 ERROR c.s.o.matrixdb.MatrixdbSource 77 -MatrixdbSource sendingPostRequest errorCode: 400, errorMsg: target table name is empty


问题分析

在通过 HTTP 方式写入 mxgate 时提示表为空

解决方案

1、检查 HTTP 写入时的表填写是否正确

2、查看 mxgate.conf 配置文件中表名与数据库中的表是否正确

11 responseCode:500 errorMsg: ERROR: invalid byte sequence for encoding "UTF8": 0x81


问题分析

在通过HTTP方式写入mxgate时提示编码错误

解决方案

请检查源数据中是否有ASCII符号