Compression and Performance Impact

Every compress_threshold rows (default: 1200) constitute a Range. Within a Range, the data of a single column (the column values for those compress_threshold rows) is called a Stripe. If a column's data within a Stripe is particularly large, the Stripe is split into multiple 1 MB chunks. During reads, the system therefore does not necessarily fetch all compress_threshold rows' worth of data at once.
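
As a rough sketch of this layout, the arithmetic can be written out for a fixed-width column (the function and the per-value byte size are illustrative assumptions; only the compress_threshold default and the 1 MB chunk size come from the description above):

```python
import math

def layout(total_rows, compress_threshold=1200, bytes_per_value=8, chunk_bytes=1 << 20):
    """Estimate how one column's data is laid out.

    A Range holds compress_threshold rows; the column slice inside a
    Range is a Stripe; a Stripe larger than 1 MB is split into chunks.
    """
    n_ranges = math.ceil(total_rows / compress_threshold)
    stripe_bytes = compress_threshold * bytes_per_value   # uncompressed stripe size
    chunks_per_stripe = max(1, math.ceil(stripe_bytes / chunk_bytes))
    return n_ranges, chunks_per_stripe

# 10 million rows, default threshold: 8334 Ranges; an 8-byte column's
# Stripe (9600 bytes) fits comfortably in a single chunk.
print(layout(10_000_000))  # → (8334, 1)
```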

The compress_threshold parameter also influences the compression ratio:

Compressing a larger batch of data simultaneously often yields better compression ratios. This is because repetitive patterns are more likely to be captured within a single batch. If data is split across multiple batches, these patterns may be fragmented, reducing compression efficiency.
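
This effect is easy to reproduce with a general-purpose compressor (a sketch using Python's zlib standing in for the engine's LZ4/ZSTD codecs; the row format is made up):

```python
import zlib

# Repetitive "rows": the same few value patterns recur many times.
rows = [f"device-{i % 50},status-{i % 4}\n".encode() for i in range(12_000)]

def compressed_size(batch_rows):
    """Total compressed bytes when rows are compressed batch_rows at a time."""
    total = 0
    for start in range(0, len(rows), batch_rows):
        batch = b"".join(rows[start:start + batch_rows])
        total += len(zlib.compress(batch))
    return total

small = compressed_size(100)     # many small batches
large = compressed_size(12_000)  # one large batch
print(small, large)
assert large < small  # a larger batch captures more repetition
```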

Impact of compress_threshold on Read Performance

Per-query results for each compress_threshold setting:

 Query ID |     1200 |     3600 |     10000 |     50000
----------+----------+----------+-----------+-----------
        1 |   8.8032 |   7.7846 |    6.6471 |    6.6824
        2 |    0.981 |    0.997 |    2.2284 |    2.2322
        3 |   3.3512 |   3.3143 |    3.3184 |    3.3177
        4 |   7.7059 |   7.7326 |    6.6034 |    5.5994
        5 |   5.5732 |   5.5042 |    5.5939 |     6.633
        6 |    0.478 |     0.38 |     0.545 |     0.545
        7 |   1.1908 |   1.1917 |    2.2172 |    2.2222
        8 |   3.3602 |   3.3129 |    3.3018 |    3.3227
        9 |   8.8429 |     7.74 |    7.7751 |    7.7795
       10 |   3.3835 |   4.4239 |    5.5496 |    5.5499
       11 |    0.845 |    0.743 |    2.2307 |    2.2299
       12 |   1.1998 |   1.1887 |    2.2164 |    2.2138
       13 |   6.6514 |   6.6351 |    6.6784 |    6.6424
       14 |    0.837 |    0.776 |    1.1364 |    1.1275
       15 |   1.1436 |   1.1359 |     1.183 |    1.1783
       16 |   1.1768 |   1.1787 |    2.2673 |    2.2757
       17 |   9.9419 |   8.8705 |    8.8922 |    8.8771
       18 | 11.11337 | 11.11522 |  10.10384 |  10.10647
       19 |   4.4792 |   3.3942 |    4.4918 |     4.467
       20 |   2.2668 |   2.2389 |    3.3475 |    3.3346
       21 | 11.11391 | 10.10633 |   11.1119 |  10.10742
       22 |   3.3009 |   2.2864 |     3.372 |    3.3377
      Sum | 97.73958 | 92.05015 | 100.81134 |  99.78189

Impact of compress_threshold on Write and Query Performance

Results for each compress_threshold setting:

 Test Scenario     | Metric                         |      1200 |      3600 |     10000 |     50000
-------------------+--------------------------------+-----------+-----------+-----------+-----------
 Write Performance | Partitioned Table (rows/s)     |   852,334 |   919,031 | 1,055,371 | 1,057,424
 Write Performance | Non-Partitioned Table (rows/s) |   991,463 | 1,033,751 | 1,054,292 | 1,076,714
 Query Performance | time_bucket=1h (1-day range)   |    357 ms |    342 ms |    333 ms |    324 ms
 Query Performance | time_bucket=1d (1-month range) |  7,384 ms |  6,312 ms |  6,008 ms |  6,025 ms
 Query Performance | time_bucket=30d (1-year range) | 94,278 ms | 75,101 ms | 68,184 ms | 69,006 ms
 Query Performance | Point Lookup                   | 19.416 ms | 20.043 ms | 23.692 ms | 37.844 ms

Index Compression

For index compression architecture, refer to the mars3btree section.

CREATE INDEX idx_name ON table_name
USING mars3btree (column_list)
WITH (
    compresstype = 'lz4',           -- Compression algorithm
    compresslevel = 1,              -- Compression level
    compressctid = true,            -- Compress CTID column?
    encodechain = '',               -- Encoding chain
    minmax = true                   -- Enable min/max optimization
);

Parameters:

  1. compresstype: Supported algorithms are lz4, zstd, and mxcustom.
    • Default: lz4. Offers fast compression/decompression with moderate ratios.
    • zstd: Provides higher compression ratios but is slightly slower.
    • mxcustom: Must be used in conjunction with encodechain.
  2. compresslevel: Supports values 1–9 (default: 1).
    • For query-intensive workloads, use 1–3 to prioritize decompression speed.
    • For storage-sensitive scenarios, use 6–9 to prioritize compression ratio.
  3. compressctid: Determines whether the CTID column is compressed. Default is true. It is recommended to keep this enabled, as CTID columns typically achieve high compression ratios.

Customer Case Study: In a specific customer scenario, enabling index compression yielded the following results:

  • TOB Cluster: Minimal impact on node CPU and memory usage, while saving 24% of YMatrix partition storage space.
  • TOC Cluster: Significant results. Despite a slight increase in CPU and memory consumption, it saved 63% of YMatrix partition storage space.
  • Throughput (50 vehicles, 1-day GPS data): Service throughput with index compression was 3.3x higher than without.
  • Throughput (50 vehicles, 3-day GPS data): Service throughput with index compression was 1.1x higher than without.

Impact of Sort Keys on Compression

As previously mentioned, compression (whether using ZSTD, LZ4, or encoding schemes like RLE, Dictionary, and Bitpacking) relies on a core principle: the more regular the data within a block or Stripe, the better the compression.

  • LZ4/ZSTD: Depend on repeated substrings and patterns. Sorting data ensures that field combinations within a block are more similar, especially in wide tables.
  • Clustering Effect: When data from similar entities or time periods is clustered within the same Stripe:
    • Value ranges within blocks converge.
    • Repetition and run-lengths increase.
    • Dictionary sizes decrease.
    • Bit widths for delta/bitpacking reduce.
    • Consequently, both encoding and general compression efficiency improve.
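
The clustering effect can be illustrated with a generic compressor (again a sketch, with Python's zlib standing in for ZSTD/LZ4 and a made-up device/value data set):

```python
import random
import zlib

random.seed(0)
# Per-device readings: (device_id, slowly varying value), clustered by device.
rows = [(d, 1000 + d % 500 + s % 10) for d in range(200) for s in range(500)]
shuffled = rows[:]
random.shuffle(shuffled)  # interleaving devices destroys locality

def size(data):
    payload = b"".join(f"{d},{v};".encode() for d, v in data)
    return len(zlib.compress(payload))

clustered, mixed = size(rows), size(shuffled)
print(clustered, mixed)
assert clustered < mixed  # device-clustered data compresses much better
```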

Example Setup:

-- Good Sort Key: device_id, ts
adw=# CREATE TABLE t_sort_good (
  device_id   int NOT NULL,
  ts          timestamptz NOT NULL,
  site_id     int NOT NULL,
  status      smallint NOT NULL,
  v1          double precision NOT NULL,
  v2          double precision NOT NULL,
  attrs       jsonb NOT NULL
)
USING MARS3
WITH (compresstype=zstd, compresslevel=3, mars3options='prefer_load_mode=bulk')
ORDER BY (device_id, ts);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'device_id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

-- Bad Sort Key: ts, device_id
adw=# CREATE TABLE t_sort_bad (
  device_id   int NOT NULL,
  ts          timestamptz NOT NULL,
  site_id     int NOT NULL,
  status      smallint NOT NULL,
  v1          double precision NOT NULL,
  v2          double precision NOT NULL,
  attrs       jsonb NOT NULL
)
USING MARS3
WITH (compresstype=zstd, compresslevel=3, mars3options='prefer_load_mode=bulk')
ORDER BY (ts, device_id);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'device_id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

Data Generation and Insertion:

DROP TABLE IF EXISTS t_src;
CREATE UNLOGGED TABLE t_src AS
WITH
  dev AS (
    SELECT d AS device_id,
           (d % 200) AS site_id,
           (1000 + (d % 500))::double precision AS base
    FROM generate_series(1, 10000) AS d
  ),
  pts AS (
    SELECT device_id, site_id, base,
           g AS seq,
           (timestamp '2026-01-01' + (g || ' seconds')::interval) AS ts
    FROM dev
    CROSS JOIN generate_series(1, 1000) AS g
  )
SELECT
  device_id,
  ts,
  site_id,
  -- status: Changes segment every 300 seconds, forming long runs
  ((seq / 300) % 4)::smallint AS status,
  -- v1/v2: Device baseline + small fluctuations
  (base + (seq % 10) * 0.1)::double precision AS v1,
  (base * 0.1 + (seq % 20) * 0.01)::double precision AS v2,
  -- attrs: Highly similar key patterns, values vary by device/time
  jsonb_build_object(
    'fw',  '1.2.' || (device_id % 10),
    'model', 'm' || (device_id % 50),
    'tag',  'S' || site_id,
    'k',    seq % 5
  ) AS attrs
FROM pts;

INSERT INTO t_sort_good SELECT * FROM t_src;
INSERT INTO t_sort_bad  SELECT * FROM t_src;

Verification: To make the size comparison accurate, execute VACUUM FULL followed by several VACUUM passes after insertion, so that the data is fully merged before measuring.

-- Check t_sort_good stats
adw=# SELECT segid, level, total_nruns, visible_nruns, invisible_nruns, level_size 
FROM matrixts_internal.mars3_level_stats('t_sort_good') 
WHERE level_size <> '0 bytes';
 segid | level | total_nruns | visible_nruns | invisible_nruns | level_size
-------+-------+-------------+---------------+-----------------+------------
     0 |     1 |           1 |             1 |               0 | 8839 kB
     1 |     1 |           1 |             1 |               0 | 8933 kB
     3 |     1 |           1 |             1 |               0 | 8552 kB
     2 |     1 |           1 |             1 |               0 | 8728 kB
(4 rows)

-- Check t_sort_bad stats
adw=# SELECT segid, level, total_nruns, visible_nruns, invisible_nruns, level_size 
FROM matrixts_internal.mars3_level_stats('t_sort_bad') 
WHERE level_size <> '0 bytes';
 segid | level | total_nruns | visible_nruns | invisible_nruns | level_size
-------+-------+-------------+---------------+-----------------+------------
     0 |     2 |           1 |             1 |               0 | 26 MB
     1 |     2 |           1 |             1 |               0 | 26 MB
     3 |     2 |           1 |             1 |               0 | 25 MB
     2 |     2 |           1 |             1 |               0 | 25 MB
(4 rows)

-- Check table sizes
adw=# \dt+ t_sort_bad
                           List of relations
 Schema |    Name    | Type  |  Owner  | Storage |  Size  | Description
--------+------------+-------+---------+---------+--------+-------------
 public | t_sort_bad | table | mxadmin | mars3   | 103 MB |
(1 row)

adw=# \dt+ t_sort_good
                           List of relations
 Schema |    Name     | Type  |  Owner  | Storage | Size  | Description
--------+-------------+-------+---------+---------+-------+-------------
 public | t_sort_good | table | mxadmin | mars3   | 35 MB |
(1 row)

-- Sample data from t_sort_good (Sorted by device_id, then ts)
adw=# SELECT * FROM t_sort_good LIMIT 10;
 device_id |           ts           | site_id | status |   v1   |         v2         |                        attrs
-----------+------------------------+---------+--------+--------+--------------------+-----------------------------------------------------
         3 | 2026-01-01 00:00:01+08 |       3 |      0 | 1003.1 | 100.31000000000002 | {"k": 1, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:02+08 |       3 |      0 | 1003.2 | 100.32000000000001 | {"k": 2, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:03+08 |       3 |      0 | 1003.3 | 100.33000000000001 | {"k": 3, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:04+08 |       3 |      0 | 1003.4 | 100.34000000000002 | {"k": 4, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:05+08 |       3 |      0 | 1003.5 | 100.35000000000001 | {"k": 0, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:06+08 |       3 |      0 | 1003.6 | 100.36000000000001 | {"k": 1, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:07+08 |       3 |      0 | 1003.7 |             100.37 | {"k": 2, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:08+08 |       3 |      0 | 1003.8 | 100.38000000000001 | {"k": 3, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:09+08 |       3 |      0 | 1003.9 | 100.39000000000001 | {"k": 4, "fw": "1.2.3", "tag": "S3", "model": "m3"}
         3 | 2026-01-01 00:00:10+08 |       3 |      0 |   1003 |              100.4 | {"k": 0, "fw": "1.2.3", "tag": "S3", "model": "m3"}
(10 rows)

-- Sample data from t_sort_bad (Sorted by ts, then device_id - mixed devices)
adw=# SELECT * FROM t_sort_bad LIMIT 10;
 device_id |           ts           | site_id | status |   v1   |         v2         |                         attrs
-----------+------------------------+---------+--------+--------+--------------------+-------------------------------------------------------
         1 | 2026-01-01 00:00:01+08 |       1 |      0 | 1001.1 | 100.11000000000001 | {"k": 1, "fw": "1.2.1", "tag": "S1", "model": "m1"}
        12 | 2026-01-01 00:00:01+08 |      12 |      0 | 1012.1 | 101.21000000000001 | {"k": 1, "fw": "1.2.2", "tag": "S12", "model": "m12"}
        15 | 2026-01-01 00:00:01+08 |      15 |      0 | 1015.1 |             101.51 | {"k": 1, "fw": "1.2.5", "tag": "S15", "model": "m15"}
        20 | 2026-01-01 00:00:01+08 |      20 |      0 | 1020.1 |             102.01 | {"k": 1, "fw": "1.2.0", "tag": "S20", "model": "m20"}
        23 | 2026-01-01 00:00:01+08 |      23 |      0 | 1023.1 | 102.31000000000002 | {"k": 1, "fw": "1.2.3", "tag": "S23", "model": "m23"}
        35 | 2026-01-01 00:00:01+08 |      35 |      0 | 1035.1 |             103.51 | {"k": 1, "fw": "1.2.5", "tag": "S35", "model": "m35"}
        38 | 2026-01-01 00:00:01+08 |      38 |      0 | 1038.1 | 103.81000000000002 | {"k": 1, "fw": "1.2.8", "tag": "S38", "model": "m38"}
        40 | 2026-01-01 00:00:01+08 |      40 |      0 | 1040.1 |             104.01 | {"k": 1, "fw": "1.2.0", "tag": "S40", "model": "m40"}
        44 | 2026-01-01 00:00:01+08 |      44 |      0 | 1044.1 | 104.41000000000001 | {"k": 1, "fw": "1.2.4", "tag": "S44", "model": "m44"}
        47 | 2026-01-01 00:00:01+08 |      47 |      0 | 1047.1 | 104.71000000000001 | {"k": 1, "fw": "1.2.7", "tag": "S47", "model": "m47"}
(10 rows)

Summary:

  • t_sort_good: Groups similar data together. Data written to disk in the same batch is more regular and repetitive, making it easier for the compressor to work efficiently.
  • t_sort_bad: Mixes completely different data together. Data written in the same batch is chaotic and random, making compression difficult.

Comparison Results (Same Data Volume, Different Sort Keys):

  • t_sort_good Total Size: 35 MB
  • t_sort_bad Total Size: 103 MB (Approximately 3x larger)

Run/Level Statistics: Although each segment contains only one Run in both cases:

  • t_sort_good Runs reside in Level 1, approximately 8.5–8.9 MB per segment.
  • t_sort_bad Runs reside in Level 2, approximately 25–26 MB per segment.

Analysis:

  1. t_sort_good (Sort Key: device_id, ts): Similarity Clustering → Compression-Friendly. Sorting by (device_id, ts) ensures that each Stripe/block contains a continuous time series for a single device. This leads to:

    • Long Repeats: Columns like device_id, site_id, and status exhibit long runs of identical values, ideal for RLE and Dictionary encoding.
    • Small Deltas: Columns like ts, v1, and v2 change slowly over time within a device, resulting in small deltas ideal for delta/bitpacking.
    • Pattern Regularity: The key structure in attrs (JSONB) is highly consistent within a device, benefiting ZSTD/LZ4.
    • Result: Significantly smaller physical size for the same number of rows.
  2. t_sort_bad (Sort Key: ts, device_id): Device Mixing → Compression-Unfriendly. Sorting by (ts, device_id) mixes a large number of devices at every timestamp. Within a single block:

    • High Cardinality: device_id changes almost every row, increasing dictionary size and reducing RLE run lengths to nearly 1.
    • Scattered Patterns: site_id and status are disrupted, breaking run-length efficiency.
    • Irregular JSON: The combination patterns in attrs become chaotic, reducing repeated substrings.
    • Net Effect: Even though ts is sequential, the randomness of other columns negates most compression benefits.
    • Result: Significantly worse compression ratio; here the physical size is roughly 3x larger (103 MB vs. 35 MB).
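
The run-length argument above can be checked with a toy model: count how many runs an RLE encoder would emit for the device_id column under each sort order (itertools.groupby standing in for a real RLE encoder; the row counts are made up):

```python
from itertools import groupby

n_devices, n_points = 100, 50

# (device_id, ts) pairs under each sort key
good = sorted((d, t) for d in range(n_devices) for t in range(n_points))  # device-major
bad = sorted(((d, t) for d in range(n_devices) for t in range(n_points)),
             key=lambda r: (r[1], r[0]))                                  # time-major

def rle_runs(column):
    """Number of runs an RLE encoder would emit for this column."""
    return sum(1 for _ in groupby(column))

print(rle_runs([d for d, _ in good]))  # → 100: one long run per device
print(rle_runs([d for d, _ in bad]))   # → 5000: device_id changes every row
```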

In conclusion, a poor sort key not only increases storage size but also makes data governance more difficult, requiring data to reach higher levels before stabilizing.
