Every compress_threshold rows of data (default 1200) form what we call a Range. Within a Range, the data of a single column (compress_threshold rows of it) is called a stripe. If that column's data is particularly large, the stripe is further split into a number of 1 MB chunks, so a read never has to pull an entire compress_threshold-row slice of the column out at once.
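As a rough back-of-the-envelope illustration of how a stripe maps to chunks (illustrative only; it assumes, for simplicity, that the 1 MB limit applies to the raw stripe bytes):

-- Stripe size and chunk count at compress_threshold = 1200
-- for a fixed-width column vs. a wide variable-width column.
SELECT col,
       bytes_per_value * 1200                                AS stripe_bytes,
       ceil((bytes_per_value * 1200) / (1024.0 * 1024))::int AS chunks
FROM (VALUES ('double precision', 8),
             ('jsonb, ~1 kB per row', 1024)) AS t(col, bytes_per_value);
-- => a double precision stripe (~9.6 kB) fits in one chunk,
--    while the wide jsonb stripe (~1.2 MB) is split into two.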
compress_threshold also has some impact on the compression ratio:
when more data is compressed in a single batch, compression can work better. In principle, more of the repeated data gets compressed together, instead of being split across two batches where it can never be deduplicated against itself.
| id | compress_threshold=1200 | 3600 | 10000 | 50000 |
|---|---|---|---|---|
| 1 | 8.8032 | 7.7846 | 6.6471 | 6.6824 |
| 2 | 0.981 | 0.997 | 2.2284 | 2.2322 |
| 3 | 3.3512 | 3.3143 | 3.3184 | 3.3177 |
| 4 | 7.7059 | 7.7326 | 6.6034 | 5.5994 |
| 5 | 5.5732 | 5.5042 | 5.5939 | 6.633 |
| 6 | 0.478 | 0.38 | 0.545 | 0.545 |
| 7 | 1.1908 | 1.1917 | 2.2172 | 2.2222 |
| 8 | 3.3602 | 3.3129 | 3.3018 | 3.3227 |
| 9 | 8.8429 | 7.74 | 7.7751 | 7.7795 |
| 10 | 3.3835 | 4.4239 | 5.5496 | 5.5499 |
| 11 | 0.845 | 0.743 | 2.2307 | 2.2299 |
| 12 | 1.1998 | 1.1887 | 2.2164 | 2.2138 |
| 13 | 6.6514 | 6.6351 | 6.6784 | 6.6424 |
| 14 | 0.837 | 0.776 | 1.1364 | 1.1275 |
| 15 | 1.1436 | 1.1359 | 1.183 | 1.1783 |
| 16 | 1.1768 | 1.1787 | 2.2673 | 2.2757 |
| 17 | 9.9419 | 8.8705 | 8.8922 | 8.8771 |
| 18 | 11.11337 | 11.11522 | 10.10384 | 10.10647 |
| 19 | 4.4792 | 3.3942 | 4.4918 | 4.467 |
| 20 | 2.2668 | 2.2389 | 3.3475 | 3.3346 |
| 21 | 11.11391 | 10.10633 | 11.1119 | 10.10742 |
| 22 | 3.3009 | 2.2864 | 3.372 | 3.3377 |
| sum | 97.73958 | 92.05015 | 100.81134 | 99.78189 |
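The figures above come from the source benchmark. If you want to collect a similar on-disk comparison yourself, the standard size functions are one way to do it (the table names below are hypothetical, one table loaded per compress_threshold setting):

SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS on_disk_size
FROM pg_class
WHERE relname IN ('t_ct_1200', 't_ct_3600', 't_ct_10000', 't_ct_50000')  -- hypothetical
ORDER BY relname;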
| Scenario | Metric | compress_threshold=1200 | 3600 | 10000 | 50000 |
|---|---|---|---|---|---|
| Write performance | Partitioned table (rows/s) | 852,334 | 919,031 | 1,055,371 | 1,057,424 |
| Write performance | Non-partitioned table (rows/s) | 991,463 | 1,033,751 | 1,054,292 | 1,076,714 |
| Query performance | time_bucket=1h (1-day range) | 357ms | 342ms | 333ms | 324ms |
| Query performance | time_bucket=1d (1-month range) | 7,384ms | 6,312ms | 6,008ms | 6,025ms |
| Query performance | time_bucket=30d (1-year range) | 94,278ms | 75,101ms | 68,184ms | 69,006ms |
| Query performance | Point lookup | 19.416 ms | 20.043 ms | 23.692 ms | 37.844 ms |
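The query rows above are time_bucket aggregations at different granularities plus a point lookup. As a sketch of the query shapes involved (table and column names are hypothetical, and time_bucket is assumed to be available, e.g. via the matrixts extension):

-- time_bucket = 1h over a 1-day range
SELECT time_bucket('1 hour', ts) AS bucket, avg(v1) AS avg_v1
FROM metrics
WHERE ts >= '2026-01-01' AND ts < '2026-01-02'
GROUP BY 1
ORDER BY 1;

-- point lookup
SELECT * FROM metrics
WHERE device_id = 42 AND ts = '2026-01-01 00:10:00';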
The index compression architecture follows the mars3btree design (see that section); the syntax is:
CREATE INDEX idx_name ON table_name
USING mars3btree (column_list)
WITH (
    compresstype = 'lz4',   -- compression algorithm
    compresslevel = 1,      -- compression level
    compressctid = true,    -- whether to compress the CTID column
    encodechain = '',       -- encoding chain
    minmax = true           -- enable min/max optimization
);
- compresstype (compression algorithm): supports lz4, zstd, and mxcustom; the default is lz4. lz4: fast compression/decompression with a medium ratio; zstd: higher ratio, somewhat slower; mxcustom: must be used together with encodechain.
- compresslevel (compression level): 1 to 9, default 1. For query-intensive workloads 1 to 3 is recommended (prioritizing decompression speed); for storage-sensitive workloads 6 to 9 (prioritizing compression ratio).
- compressctid (whether the CTID column is compressed): defaults to true and is best left at true; the CTID column usually compresses very well.

In one customer scenario, after index compression was enabled:
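As a general illustration, separate from that customer case, the options above might be combined like this for a storage-sensitive workload (the table, index, and column names are hypothetical):

CREATE INDEX idx_orders_ts ON orders          -- hypothetical table / index
USING mars3btree (order_ts, customer_id)
WITH (
    compresstype = 'zstd',    -- higher compression ratio than lz4, somewhat slower
    compresslevel = 6,        -- storage-sensitive range (6 ~ 9), favoring ratio
    compressctid = true,      -- CTID usually compresses very well, keep enabled
    minmax = true             -- keep min/max pruning
);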
As mentioned earlier, compression (whether zstd / lz4 or encodings such as RLE / dict / bitpack) relies on one core fact: the more regular the data within a block / stripe, the better it compresses.
lz4 / zstd rely on repeated substrings / repeated patterns. After sorting, the field combinations within a block are more similar to one another, especially for wide tables. When data from nearby entities / nearby times is clustered in the same stripe, the value range within the block narrows, repeats and runs of identical values grow, dictionaries shrink, and the bit width needed by delta / bitpacking drops, improving both the encodings and the general-purpose compressors.
adw=# CREATE TABLE t_sort_good (
device_id int NOT NULL,
ts timestamptz NOT NULL,
site_id int NOT NULL,
status smallint NOT NULL,
v1 double precision NOT NULL,
v2 double precision NOT NULL,
attrs jsonb NOT NULL
)
USING MARS3
WITH (compresstype=zstd,compresslevel=3,mars3options='prefer_load_mode=bulk')
ORDER BY (device_id, ts);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'device_id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
adw=# CREATE TABLE t_sort_bad (
device_id int NOT NULL,
ts timestamptz NOT NULL,
site_id int NOT NULL,
status smallint NOT NULL,
v1 double precision NOT NULL,
v2 double precision NOT NULL,
attrs jsonb NOT NULL
)
USING MARS3
WITH (compresstype=zstd,compresslevel=3,mars3options='prefer_load_mode=bulk')
ORDER BY (ts, device_id);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'device_id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
Then generate the source data and insert it into both tables:
DROP TABLE IF EXISTS t_src;
CREATE UNLOGGED TABLE t_src AS
WITH
dev AS (
SELECT d AS device_id,
(d % 200) AS site_id,
(1000 + (d % 500))::double precision AS base
FROM generate_series(1, 10000) AS d
),
pts AS (
SELECT device_id, site_id, base,
g AS seq,
(timestamp '2026-01-01' + (g || ' seconds')::interval) AS ts
FROM dev
CROSS JOIN generate_series(1, 1000) AS g
)
SELECT
device_id,
ts,
site_id,
    -- status: switches to a new segment every 300 seconds, forming long runs
((seq / 300) % 4)::smallint AS status,
    -- v1/v2: per-device baseline plus a small fluctuation
(base + (seq % 10) * 0.1)::double precision AS v1,
(base * 0.1 + (seq % 20) * 0.01)::double precision AS v2,
    -- attrs: key pattern is highly similar, but values vary with device/time
jsonb_build_object(
'fw', '1.2.' || (device_id % 10),
'model', 'm' || (device_id % 50),
'tag', 'S' || site_id,
'k', seq % 5
) AS attrs
FROM pts;
INSERT INTO t_sort_good SELECT * FROM t_src;
INSERT INTO t_sort_bad SELECT * FROM t_src;
To make sure the figures are accurate, run vacuum full + vacuum several times after the inserts complete.
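A minimal version of that step, repeated as needed:

VACUUM FULL t_sort_good;
VACUUM t_sort_good;
VACUUM FULL t_sort_bad;
VACUUM t_sort_bad;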
adw=# select segid,level,total_nruns,visible_nruns,invisible_nruns,level_size from matrixts_internal.mars3_level_stats('t_sort_good') where level_size <> '0 bytes';
segid | level | total_nruns | visible_nruns | invisible_nruns | level_size
-------+-------+-------------+---------------+-----------------+------------
0 | 1 | 1 | 1 | 0 | 8839 kB
1 | 1 | 1 | 1 | 0 | 8933 kB
3 | 1 | 1 | 1 | 0 | 8552 kB
2 | 1 | 1 | 1 | 0 | 8728 kB
(4 rows)
adw=# select segid,level,total_nruns,visible_nruns,invisible_nruns,level_size from matrixts_internal.mars3_level_stats('t_sort_bad') where level_size <> '0 bytes';
segid | level | total_nruns | visible_nruns | invisible_nruns | level_size
-------+-------+-------------+---------------+-----------------+------------
0 | 2 | 1 | 1 | 0 | 26 MB
1 | 2 | 1 | 1 | 0 | 26 MB
3 | 2 | 1 | 1 | 0 | 25 MB
2 | 2 | 1 | 1 | 0 | 25 MB
(4 rows)
adw=# \dt+ t_sort_bad
List of relations
Schema | Name | Type | Owner | Storage | Size | Description
--------+------------+-------+---------+---------+--------+-------------
public | t_sort_bad | table | mxadmin | mars3 | 103 MB |
(1 row)
adw=# \dt+ t_sort_good
List of relations
Schema | Name | Type | Owner | Storage | Size | Description
--------+-------------+-------+---------+---------+-------+-------------
public | t_sort_good | table | mxadmin | mars3 | 35 MB |
(1 row)
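The same comparison can be reduced to a single ratio with the standard size functions (assuming pg_table_size reports the MARS3 on-disk size consistently with the \dt+ output above):

SELECT pg_table_size('t_sort_bad')::numeric
     / pg_table_size('t_sort_good') AS size_ratio;
-- roughly 3 with the sizes shown above (103 MB vs 35 MB)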
adw=# select * from t_sort_good limit 10;
device_id | ts | site_id | status | v1 | v2 | attrs
-----------+------------------------+---------+--------+--------+--------------------+-----------------------------------------------------
3 | 2026-01-01 00:00:01+08 | 3 | 0 | 1003.1 | 100.31000000000002 | {"k": 1, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:02+08 | 3 | 0 | 1003.2 | 100.32000000000001 | {"k": 2, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:03+08 | 3 | 0 | 1003.3 | 100.33000000000001 | {"k": 3, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:04+08 | 3 | 0 | 1003.4 | 100.34000000000002 | {"k": 4, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:05+08 | 3 | 0 | 1003.5 | 100.35000000000001 | {"k": 0, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:06+08 | 3 | 0 | 1003.6 | 100.36000000000001 | {"k": 1, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:07+08 | 3 | 0 | 1003.7 | 100.37 | {"k": 2, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:08+08 | 3 | 0 | 1003.8 | 100.38000000000001 | {"k": 3, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:09+08 | 3 | 0 | 1003.9 | 100.39000000000001 | {"k": 4, "fw": "1.2.3", "tag": "S3", "model": "m3"}
3 | 2026-01-01 00:00:10+08 | 3 | 0 | 1003 | 100.4 | {"k": 0, "fw": "1.2.3", "tag": "S3", "model": "m3"}
(10 rows)
adw=# select * from t_sort_bad limit 10;
device_id | ts | site_id | status | v1 | v2 | attrs
-----------+------------------------+---------+--------+--------+--------------------+-------------------------------------------------------
1 | 2026-01-01 00:00:01+08 | 1 | 0 | 1001.1 | 100.11000000000001 | {"k": 1, "fw": "1.2.1", "tag": "S1", "model": "m1"}
12 | 2026-01-01 00:00:01+08 | 12 | 0 | 1012.1 | 101.21000000000001 | {"k": 1, "fw": "1.2.2", "tag": "S12", "model": "m12"}
15 | 2026-01-01 00:00:01+08 | 15 | 0 | 1015.1 | 101.51 | {"k": 1, "fw": "1.2.5", "tag": "S15", "model": "m15"}
20 | 2026-01-01 00:00:01+08 | 20 | 0 | 1020.1 | 102.01 | {"k": 1, "fw": "1.2.0", "tag": "S20", "model": "m20"}
23 | 2026-01-01 00:00:01+08 | 23 | 0 | 1023.1 | 102.31000000000002 | {"k": 1, "fw": "1.2.3", "tag": "S23", "model": "m23"}
35 | 2026-01-01 00:00:01+08 | 35 | 0 | 1035.1 | 103.51 | {"k": 1, "fw": "1.2.5", "tag": "S35", "model": "m35"}
38 | 2026-01-01 00:00:01+08 | 38 | 0 | 1038.1 | 103.81000000000002 | {"k": 1, "fw": "1.2.8", "tag": "S38", "model": "m38"}
40 | 2026-01-01 00:00:01+08 | 40 | 0 | 1040.1 | 104.01 | {"k": 1, "fw": "1.2.0", "tag": "S40", "model": "m40"}
44 | 2026-01-01 00:00:01+08 | 44 | 0 | 1044.1 | 104.41000000000001 | {"k": 1, "fw": "1.2.4", "tag": "S44", "model": "m44"}
47 | 2026-01-01 00:00:01+08 | 47 | 0 | 1047.1 | 104.71000000000001 | {"k": 1, "fw": "1.2.7", "tag": "S47", "model": "m47"}
(10 rows)
In short:
Same amount of data; the only difference is the sort key.
- t_sort_good, sorted by (device_id, ts): similar data clusters together and compression is friendly. Within a single stripe/block you mostly get one device over a contiguous stretch of time, so the effects described earlier kick in: value ranges converge, runs get longer, dictionaries shrink, delta/bitpacking widths drop. For the same number of rows, the physical bytes are therefore clearly smaller.
- t_sort_bad, sorted by (ts, device_id): devices are interleaved and compression is unfriendly. A single timestamp pulls in a large number of different devices, so within one block values are scattered, runs are short, and dictionaries grow. The compression ratio degrades noticeably, and ending up roughly 3x larger on disk is entirely normal.
In other words: the badly sorted table is not only larger, it is also harder to compact; as the level stats above show, it only settles down at a higher level.