Brief Overview of Storage Engine Principles

MARS3 Overview

MARS3 is a self-developed LSM-Tree based storage engine for YMatrix. It employs a hybrid row-column architecture, introducing a "row-first, column-later" dual-storage path on top of traditional LSM structures. This design inherits the write-friendly characteristics of row stores while retaining the high-performance analytical query capabilities of column stores. MARS3 supports features such as encoding chain compression, data updates and deletions, MVCC mechanisms, BRIN indexes, and hybrid storage, satisfying the requirements of both Analytical Processing (AP) and Transactional Processing (TP) scenarios.

MARS3 supports data updates and deletions via UPDATE (except in Unique Mode) and DELETE clauses.

It also supports adding or dropping columns and accommodates COPY and pg_dump operations.
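A minimal sketch of these operations (table and column names are illustrative, not taken from the MARS3 documentation):

```sql
-- Illustrative MARS3 table; all names here are hypothetical.
CREATE TABLE readings (ts timestamp, dev_id int, val float)
USING MARS3
ORDER BY (dev_id, ts);

UPDATE readings SET val = 0 WHERE dev_id = 42;      -- not available in Unique Mode
DELETE FROM readings WHERE ts < '2025-01-01';

ALTER TABLE readings ADD COLUMN note text;          -- add a column
ALTER TABLE readings DROP COLUMN note;              -- drop a column

-- Bulk load via COPY (the file path is a placeholder).
COPY readings FROM '/tmp/readings.csv' WITH (FORMAT csv);
```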

Internal Principles

For each individual MARS3 table, data is stored internally using an LSM Tree structure. The LSM Tree (Log Structured Merge Tree) is a hierarchical, ordered, disk-oriented data structure. Its core philosophy is to fully leverage disk performance by performing batch sequential writes, which offers significantly higher performance than random writes.

The internal structure of MARS3 is described below.

Run

Data stored within MARS3 is ordered. A continuous sequence of ordered data is referred to as a Run.

Runs are categorized into two types: Row Store Runs and Column Store Runs. Data inserted for high-speed writing is initially stored as Row Store Runs. Subsequently, these are converted into Column Store Runs to facilitate reading and compression.

Each individual Run has a size limit defined by the table-level parameter max_runsize, which specifies the maximum size of a single Run at table creation time. The default value is 4096 MB, and the maximum is 16384 MB.

You can use the function matrixts_internal.mars3_files to view the extension files and incremental files of a MARS3 table:

select * from matrixts_internal.mars3_files('test');

These files primarily include DATA, LINK, FSM, and DELTA. If indexes exist on the table, corresponding INDEX files will also be present.

  • DATA: The primary data file used to store user data.
  • FSM (Free Space Map): In YMatrix, update and delete operations do not modify existing data space directly; instead, they utilize tuple multi-versioning. This leads to "expired data," where a tuple version becomes invisible to all transactions. At this point, the space it occupies can be reclaimed. The FSM file tracks these available spaces and efficiently allocates them when needed.
  • LINK: Used during update and delete operations to maintain the upstream and downstream relationships of tuple versions during compaction.
  • DELTA: Stores deletion information. MARS3 update and delete operations do not modify data in place; rather, they rely on DELTA files (containing XMAX and other deletion info) and version information to mask old data, thereby controlling data visibility.
  • INDEX and INDEX_1_TOAST: Files used to store index data. Currently, MARS3 supports BRIN and BTREE indexes.

Level

MARS3 organizes data based on the LSM-Tree model. Individual Run files are organized into Levels, with a maximum of 10 levels: L0, L1, L2, ..., L9.

Compaction (merging) is triggered when the number of Runs in a level reaches a certain threshold or when the total size of multiple Runs in the same level exceeds a specific limit. Once merged into a single Run, it is promoted to a higher level. To accelerate Run promotion, multiple merge tasks can run concurrently within the same level.


In YMatrix, a set of background merge processes periodically checks the status of various tables and executes merge operations. YMatrix provides the utility function matrixts_internal.mars3_level_stats to view the status of each level within a MARS3 table.

select * from matrixts_internal.mars3_level_stats('test') limit 10;

This operation is highly useful for assessing table health, such as verifying if Runs are merging as expected, checking for excessive invisible Runs, and ensuring Run counts remain within normal ranges.

As a general rule of thumb:

  • When level = 0, if the number of Runs exceeds 3, the status is considered unhealthy.
  • When level = 1, if the number of Runs exceeds 50, the status is considered unhealthy.
  • When level > 1, if the number of Runs exceeds 10, the status is considered unhealthy.
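The rules of thumb above can be turned into a quick check against mars3_level_stats. The column names level and num_runs are assumptions here; verify them against the function's actual output on your version:

```sql
-- Flag levels whose Run counts exceed the rules of thumb.
-- Column names (level, num_runs) are assumed, not documented.
SELECT level, num_runs,
       CASE
         WHEN level = 0 AND num_runs > 3  THEN 'unhealthy'
         WHEN level = 1 AND num_runs > 50 THEN 'unhealthy'
         WHEN level > 1 AND num_runs > 10 THEN 'unhealthy'
         ELSE 'healthy'
       END AS status
FROM matrixts_internal.mars3_level_stats('test');
```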

Range and Stripe

Column stores perform direct reads and writes without a buffer layer like Shared Buffers or page flushing mechanisms. Every compress_threshold rows of data (default is 1200) constitute a Range. A contiguous block of data for a specific column within a Range (containing compress_threshold rows) is called a Stripe.

If a column's data is particularly large, the Stripe is further split into 1 MB chunks, so a read does not have to fetch an entire compress_threshold-sized Stripe at once.

RUN 
 └── range (split by rows, default 1 range per 1200 rows)
      ├── column1 stripe (1200 datum)
      ├── column2 stripe (1200 datum)
      ├── column3 stripe (1200 datum)
      └── ...
  • Range: A logical row window. Every 1200 rows form a Range. Internally, data is stored in columnar format and compressed by column.
  • Stripe: A physical column block. It represents the contiguous storage block for a specific column within that 1200-row range.
  • Datum: The smallest unit of data. It represents the value of a single column in a single row (the native unit of the PostgreSQL kernel).
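Range size is set per table via compress_threshold (see Configuration Parameters below). A sketch that makes each Range, and therefore each per-column Stripe, cover 10000 rows instead of the default 1200 (table name is illustrative):

```sql
CREATE TABLE wide_metrics (
    ts      timestamp,
    dev_id  int,
    payload text
)
USING MARS3
WITH (compress_threshold=10000)
ORDER BY (dev_id, ts);
```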

Row Store vs. Column Store

  • MARS3 supports a "row-first, column-later" storage approach. Essentially, data enters the system in a row-oriented format suitable for fast writing, and gradually transitions to a column-oriented format suitable for analysis during background governance. This enables continuous "write and analyze" operations.
    • Row Store Format: Optimized for writing and accessing fresh data. It facilitates rapid ingestion of new data and is better suited for small-range reads and detailed lookups (especially when data is newly written and not yet fully organized).
    • Column Store Format: Optimized for scanning and aggregation. Once data is organized, column storage significantly improves scan throughput and compression efficiency, making large-scale aggregation and filtering queries more resource-efficient.
  • Compared to storing data directly in columnar format, this approach offers several benefits:
    • Faster write speeds for high-frequency, small-batch data.
    • Reduced need for large memory buffers for data caching.
    • Ensures uniform tuple counts across data blocks.

Data Writing

  • Data is written to memory via INSERT and then flushed to Runs in the L0 level.
  • To accommodate different scenario requirements, YMatrix supports three write modes, determined jointly by the table-level parameters prefer_load_mode and rowstore_size. See the Configuration Parameters section below for details:
    1. Normal: Standard mode. New data is first written to Row Store Runs in the L0 level; once it accumulates to rowstore_size, it is moved to Column Store Runs in the L1 level. Compared to Bulk mode, this involves one additional I/O operation, with column store conversion performed asynchronously rather than synchronously. This mode suits high-frequency, small-batch write scenarios where I/O capacity is sufficient and latency sensitivity is high.
    2. Bulk: Batch loading mode. Suitable for low-frequency, large-batch write scenarios. Data is written directly to Column Store Runs in the L1 level. Compared to Normal mode, this reduces I/O by one step, and column store conversion becomes synchronous. This mode is suitable for low-frequency, large-batch data writes where I/O capacity is limited and latency is less critical.
    3. Single: Data is inserted directly into the rowstore, and tuples are placed directly into Shared Buffers.
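The mode is fixed per table through mars3options; a sketch of the first two modes (table names are illustrative):

```sql
-- Standard mode: row store first in L0, asynchronous column store conversion.
CREATE TABLE t_normal (ts timestamp, v int)
USING MARS3
WITH (mars3options='prefer_load_mode=normal, rowstore_size=64')
ORDER BY (ts);

-- Batch loading mode: data goes directly to Column Store Runs in L1.
CREATE TABLE t_bulk (ts timestamp, v int)
USING MARS3
WITH (mars3options='prefer_load_mode=bulk')
ORDER BY (ts);
```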

For more detailed information, refer to Write Path Overview.

MARS3 Indexes

Currently, MARS3 supports BRIN and BTREE indexes.

  • BTREE: Suitable for transactional systems focused on "exact lookup." It achieves fast positioning through row-level pointers.
  • BRIN: Suitable for large-scale analytical systems focused on "range scans." It significantly reduces invalid I/O through block-level summaries.
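A sketch of creating both index types. The table and index names are illustrative; the access method names mars3_brin and mars3btree are the ones this document mentions, so verify them on your version:

```sql
-- Illustrative table for the index examples.
CREATE TABLE metrics_idx_demo (ts timestamp, dev_id bigint, power float)
USING MARS3
ORDER BY (dev_id, ts);

-- BRIN: block-level summaries, suited to range scans.
CREATE INDEX metrics_ts_brin ON metrics_idx_demo USING mars3_brin (ts);

-- BTREE: row-level pointers, suited to exact lookups.
CREATE INDEX metrics_dev_btree ON metrics_idx_demo USING mars3btree (dev_id);
```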

Note!
For MARS3 tables, a maximum of 16 indexes are allowed per table (regardless of whether they are on the same column, or whether they are BRIN or BTREE).

BRIN Indexes

  • MARS3 supports creating mars3_brin and mars3_default_brin indexes, and supports deleting and adding BRIN indexes.
  • Each Run creates its own independent BRIN index file upon generation.
  • Default Brin is a key feature of the MARS3 storage engine. It provides default BRIN index support at the table level without requiring manual index creation. Unlike regular CREATE INDEX USING BRIN (which only benefits Index Scans), Sequential Scans can also benefit from Default Brin, significantly improving query efficiency. Note that Default Brin does not consume index slots; even with a Default Brin, you can still create up to 16 additional indexes.
|  | mars3_brin | mars3_default_brin |
| --- | --- | --- |
| Creation Method | Manual creation required | Automatically created; no manual action needed |
| Query Support | Filters data only during IndexScan | Filters data during both IndexScan and SeqScan |
| Technical Version | brinV2 | brinV2 |
| Parameterized Query | Supports parameterized queries (param-IndexScan) | Supports parameterized queries (param-SeqScan) |

BTREE Indexes

BTREE indexes are a general-purpose index type based on a balanced multi-way tree structure. By organizing index nodes in key-value order, they enable fast and precise positioning of single rows or small data ranges. Query complexity remains stable at O(logN), supporting both equality queries and efficient range scans and sorting operations. Since they do not rely on the physical distribution of data, BTREE indexes exhibit strong stability in high-concurrency transaction processing scenarios. They are the default choice for primary keys, unique constraints, and highly selective queries. However, they are not suitable for low-selectivity columns or wide-range scans on large tables.

mars3btree is the dedicated B-tree implementation within the MARS3 storage engine. The internal pages of the index remain standard B-tree pages. mars3btree supports two types:

  • NORMAL: Standard row-style B-tree (used for RowStore); uncompressed.
  • COMPRESSED: Column-style compressed B-tree (used for ColumnStore); compressed.

Sort Key

The sort key is a core design element determining whether the engine can achieve optimal scan efficiency and long-term stability. Ordered data combined with reliable block-level metadata significantly boosts scan efficiency. A well-chosen sort key ensures stronger locality within Runs and at higher levels, allowing query filter conditions to match continuous ranges more effectively and enabling efficient skipping. Conversely, an unreasonable sort key results in scattered data distribution; filter conditions fail to converge the scan range, causing the system to behave as if it were performing a full scan despite having indexes or metadata.

  • In MARS3, data is stored in sorted order. When creating a table, you must specify the sort order by defining sort columns (multiple columns are supported). The fields involved in this sort order are called the Sort Key.
  • The sort key can be specified only once. It cannot be modified, added to, or removed.
  • To maximize the performance benefits of ordering, choose fields that are frequently used and have good filtering effects as the sort key. For example, in an equipment monitoring table, event timestamps and device IDs can serve as the sort key.
  • If the sort key is of a text type and byte-order sorting is acceptable, applying COLLATE C to that column can accelerate sorting.
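For example (hypothetical table), applying COLLATE "C" to a text sort column at table creation:

```sql
-- Byte-order sorting on the text sort column via COLLATE "C".
CREATE TABLE events (
    tag text COLLATE "C",
    ts  timestamp,
    msg text
)
USING MARS3
ORDER BY (tag, ts);
```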

For more details on sort keys and selection principles, see Sort Keys and Data Locality.

Compression

For more information on the impact of compression on performance, refer to Compression and Performance Impact.

Updates and Deletes

  • MARS3 performs deletions via the DELETE command. Deletions are recorded in the Delta file of the corresponding Run and are physically removed only during Run compaction.
  • MARS3 performs updates via the UPDATE command. An update first deletes the original data and then inserts a new record.
  • In Unique Mode, MARS3 supports DELETE but not the UPDATE clause. To update a specific record identified by a Unique Key (the key values corresponding to the sort key defined at table creation), simply insert a new record with the same Unique Key; the INSERT completes the update automatically. For example, given CREATE TABLE mars3_t(c1 int NOT NULL, c2 int) USING MARS3 WITH (uniquemode=true) ORDER BY (c1, c2);, the Unique Key is (c1, c2).

Note!
If Unique Mode is enabled, the first field in the ORDER BY clause must be defined with a NOT NULL constraint.
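Continuing the mars3_t example above (a payload column is added here purely for illustration), an update in Unique Mode is just another INSERT with the same Unique Key:

```sql
CREATE TABLE mars3_t (c1 int NOT NULL, c2 int, payload text)
USING MARS3
WITH (uniquemode=true)
ORDER BY (c1, c2);

INSERT INTO mars3_t VALUES (1, 10, 'first');
-- Inserting the same Unique Key (c1, c2) = (1, 10) replaces the old row:
INSERT INTO mars3_t VALUES (1, 10, 'updated');
-- DELETE is also supported in Unique Mode:
DELETE FROM mars3_t WHERE c1 = 1 AND c2 = 10;
```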

For more technical details, see Updates and Deletes.

Compaction and Reclamation

  • Overlapping data ranges within Runs cause read amplification, reducing query efficiency. Therefore, when the number of Runs on disk exceeds a certain threshold, MARS3 merges multiple Runs on disk through sorting, outputting a single Run. This process is called compaction.
  • During compaction, data remains readable and writable:
    • Reads access only the input files being merged.
    • The merge does not pick up data that is newly written while it is in progress.
    • Reads, writes, and compactions do not block each other.
  • After compaction completes, the Runs involved in the merge are automatically marked as reclaimable based on their Transaction IDs.

For more technical details, see Background Governance.

Support for MVCC Mechanism

  • The MVCC (Multiversion Concurrency Control) mechanism is often referred to as multi-version management. Its core function is handling data updates and deletions without blocking concurrent reads.
  • In multi-version management, updates and deletions do not necessarily modify the original data in place. Instead, a new version of the data is written and the original version is marked as invalid. Data thus possesses multiple versions: each data item carries version information, and historical versions are preserved.
  • MARS3 update and delete operations do not modify data in place. Instead, they rely on Delta files and version information to mask old data, thereby controlling data visibility.
  • Note: Continuously updating or deleting data in the same Run will cause the physical space occupied by its Delta file to grow continuously. However, once all data in the current Run is deleted, this growth will cease. Furthermore, MARS3 compaction operations can automatically clear Dead data. Alternatively, you can plan to regularly use VACUUM to clean up Dead data.

Support for Buckets

MARS3 Bucket is a storage-layer parallel execution optimization mechanism designed by YMatrix for parallel scan scenarios within the MPP architecture. By organizing data into multiple logical buckets based on the hash of the distribution key during the write phase, it ensures that data with the same distribution key is processed by the same worker during parallel scans. This preserves data distribution semantics (locus), avoids unnecessary data redistribution (Motion), and achieves a performance leap from "scanning faster" to "computing more locally."

create table foo (c1 int, c2 int) using mars3 with (mars3options='nbuckets = 2');

Valid values for nbuckets range from 1 to 128. The default value is 1, indicating a single bucket (i.e., no bucketing is performed).

For more technical details, see MARS3 Bucket Technical Deep Dive.

Using MARS3

Creating MARS3 Tables

Provided the matrixts extension has been created, the simplest way to create a table is to add the USING clause and append the ORDER BY clause to the CREATE TABLE statement. Extended examples can be found in Table Design Best Practices.

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS3 
  ORDER BY (dev_id,ts);

Note!
MARS3 tables support the creation of BRIN indexes, but it is not mandatory. Starting from version 6.3.0, the requirement to specify a sort key using the ORDER BY clause during table creation has been removed for MARS3 tables.

Configuration Parameters

Note!
The following configuration parameters are table-level settings, configured in the WITH clause at table creation time: rowstore_size, prefer_load_mode, level_size_amplifier, and nbuckets go inside the mars3options='a=1,b=2,...' string, while compress_threshold is set directly in WITH (see the configuration example below). They apply to individual tables and cannot be modified after creation. For more information, see Database Table Configuration Parameters.

The following parameters regulate the size of L0 Level Runs and indirectly control the size of Runs above L1.

| Parameter | Unit | Default Value | Range | Description |
| --- | --- | --- | --- | --- |
| rowstore_size | MB | 64 | 8 – 1024 | Controls when to switch L0 Runs. When data size exceeds this value, a new Run is started. |

The following parameters set the compression threshold, which can adjust compression effectiveness and improve read efficiency. Setting it too low yields negligible compression, while setting it too high consumes excessive memory.

| Parameter | Unit | Default Value | Range | Description |
| --- | --- | --- | --- | --- |
| compress_threshold | Tuples | 1200 | 1 – 100000 | Compression threshold. Controls how many tuples of a column are compressed together as a single unit; it is the upper limit of tuples within one compression unit. |

The following parameters specify the data loading mode in MARS3.

| Parameter | Unit | Default Value | Range | Description |
| --- | --- | --- | --- | --- |
| prefer_load_mode | – | normal | normal / bulk / single | Data loading mode. normal: standard mode; new data is first written to Row Store Runs in the L0 level and moved to Column Store Runs in the L1 level once it accumulates to rowstore_size; one extra I/O step compared to bulk, with asynchronous column store conversion; suited to high-frequency, small-batch writes where I/O capacity is sufficient and latency matters. bulk: batch loading mode; data is written directly to Column Store Runs in the L1 level, saving one I/O step, with synchronous column store conversion; suited to low-frequency, large-batch writes where I/O capacity is limited and latency is less critical. single: data is inserted directly into the rowstore, with tuples placed into Shared Buffers. |

The following parameters specify the size amplifier factor for Levels.

| Parameter | Unit | Default Value | Range | Description |
| --- | --- | --- | --- | --- |
| level_size_amplifier | – | 8 | 1 – 1000 | Level size amplifier: the threshold for triggering merge operations at each Level, calculated as rowstore_size * (level_size_amplifier ^ level). Larger values give faster writes but slower reads; choose based on the workload (write-heavy vs. read-heavy, compression ratio, etc.). Ensure the number of Runs per Level does not become excessive, as this can degrade query performance or even prevent new data insertion. |

The following parameters specify the number of buckets.

| Parameter | Unit | Default Value | Range | Description |
| --- | --- | --- | --- | --- |
| nbuckets | – | 1 | 1 – 128 | Number of buckets; controls the bucket count to optimize parallel query performance. For best practices, see the MARS3 Bucket Best Practices section in "Table Design and Data Distribution Best Practices". |

Configuration Example:

=# CREATE EXTENSION matrixts;
CREATE TABLE t(
    time timestamp with time zone,
    tag_id int,
    i4 int4,
    i8 int8
)
USING MARS3
WITH (compresstype=zstd, compresslevel=3, compress_threshold=1200,
        mars3options='rowstore_size=64, prefer_load_mode=normal, level_size_amplifier=8, nbuckets=2')
DISTRIBUTED BY (tag_id)
PARTITION BY RANGE (time)
(
    START ('2026-02-01 00:00:00+08')
    END   ('2026-03-01 00:00:00+08')
    EVERY (INTERVAL '1 day')
)
ORDER BY (time, tag_id);

Utility Functions

  • matrixts_internal.mars3_level_stats: Views the status of each Level in a MARS3 table. This helps assess table health, such as verifying if Runs are merging as expected and if their counts are within limits.
  • matrixts_internal.mars3_files: Views the file status of a MARS3 table. Useful for checking if extension files and incremental files (Data, Delta, Index files, etc.) meet expectations.
  • matrixts_internal.mars3_info_brin: Views the status of a specific BRIN index on a MARS3 table.
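Typical invocations are sketched below. The table and index names are placeholders, and the argument of mars3_info_brin is an assumption — check the function's actual signature on your version:

```sql
-- Level status of table 'test'.
select * from matrixts_internal.mars3_level_stats('test');

-- Files (DATA, DELTA, INDEX, ...) belonging to table 'test'.
select * from matrixts_internal.mars3_files('test');

-- Status of a specific BRIN index; passing the index name here is an
-- assumption, not a documented signature.
select * from matrixts_internal.mars3_info_brin('test_ts_brin');
```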

HEAP Overview

HEAP is the default storage engine for YMatrix, also known as Heap Storage. Inherited from PostgreSQL, it supports only row storage and does not support column storage or compression. It is implemented based on the MVCC mechanism and is suitable for scenarios requiring extensive updates and deletions.

Using the MVCC Mechanism

Under the influence of the MVCC mechanism, HEAP tables do not physically delete data when handling update and delete operations. Instead, they mask old data using version information (controlling data visibility). Consequently, frequent updates or deletions on HEAP tables cause the physical space occupied to grow continuously. It is necessary to plan periodic VACUUM operations to clean up old data.

Using HEAP

You can create a HEAP table in YMatrix using the following SQL statement:

=# CREATE TABLE disk_heap(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);

AORO Overview

AORO (Append-Only Row-oriented) is a storage organization paradigm designed for analytical databases. It writes data sequentially by row in an append-only manner, does not support in-place updates or deletions, and maintains versions via timestamps or transaction IDs. It balances write throughput, query efficiency, and MVCC consistency. AORO supports row storage.

For AO tables with frequent update and delete operations, periodic cleanup of old data must also be planned. However, on AO tables VACUUM needs to reset bitmaps and compact physical files, so it typically runs slower than it does on HEAP tables.
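An AORO table can be created with the Greenplum-style append-optimized options that YMatrix inherits; a minimal sketch (table name is illustrative):

```sql
-- Append-optimized, row-oriented (AORO) table.
CREATE TABLE events_ao (
    time   timestamp with time zone,
    tag_id int,
    val    float
)
WITH (appendonly=true, orientation=row)
DISTRIBUTED BY (tag_id);
```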

Note!
For detailed information on storage engines, usage, and best practices, see Table Design and Data Distribution Best Practices.