Yesterday at the PostgreSQL Meetup in Suisse Romande, Lucy Linder discussed the exciting challenges of migrating 7000 applications from PostgreSQL 13 to 15. She highlighted the following error raised during pg_restore:
ERROR: index row size 3056 exceeds btree version 4 maximum 2704 for index
HINT: Values larger than 1/3 of a buffer page cannot be indexed. Consider a function index of an MD5 hash of the value, or use full-text indexing.
The slide mentions "RDS Only" because the issue wasn't reproduced when importing into vanilla PostgreSQL. However, this doesn't necessarily mean that the software running in the managed service is different: a couple of settings, such as the TOAST compression algorithm, can affect this behavior. Unfortunately, managed services that do not make their source public remain black boxes.
Limits of PostgreSQL and B-Trees with fixed-size blocks
PostgreSQL raises this error when the size of an index entry exceeds one-third of the block size. This limitation exists in all traditional databases that use B-Tree indexes with fixed-size pages. In Oracle, the limit can be higher (up to 50% of the block), but a leaf must still hold at least two index entries so that each level reduces the number of branches, ultimately ending in a single root. Additionally, each block must contain some metadata. PostgreSQL 12 reduced the maximum entry size by 8 bytes to store extra metadata used by an improvement in splitting pages with duplicate entries, which explains why the error can appear during an upgrade.
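As a rough back-of-the-envelope check, assuming the default 8kB block size, the 2704-byte figure in the error message comes from one-third of a page minus this per-page and per-entry overhead:

show block_size;
-- 8192 bytes per page by default; one third is about 2730 bytes.
-- After page header and item overhead, the btree version 3 maximum was 2712 bytes,
-- and btree version 4 reserves 8 more bytes, giving the 2704 bytes shown in the error.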
I executed a simple script that gradually increases the size of an indexed column in PostgreSQL:
create table demo (id bigint, description text, iterations int default 0);
create index on demo(description);
insert into demo values (generate_series(1,1000), '');
update demo
set description = description||md5(random()::text)::text , iterations=iterations+1
where id=42
returning iterations, pg_size_pretty(pg_column_size(description)::bigint)
\watch 0.1
This loop ends quickly: once the indexed column reaches 2692 bytes, the next update fails because the index row would exceed the maximum size:
...
UPDATE 1
Fri 20 Sep 2024 12:24:45 PM GMT (every 0.1s)
iterations | pg_size_pretty
------------+----------------
83 | 2660 bytes
(1 row)
UPDATE 1
Fri 20 Sep 2024 12:24:45 PM GMT (every 0.1s)
iterations | pg_size_pretty
------------+----------------
84 | 2692 bytes
(1 row)
UPDATE 1
ERROR: index row size 2736 exceeds btree version 4 maximum 2704 for index "demo_description_idx"
DETAIL: Index row references tuple (9,21) in relation "demo".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
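Following the hint, a functional index on a hash keeps each index entry small. Here is a minimal sketch on the demo table above; note that such an index only serves equality predicates, not LIKE or range searches:

create index on demo (md5(description));
-- queries must apply the same expression to benefit from the index:
-- select * from demo where md5(description) = md5('the exact value searched');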
You might think that two kilobytes are more than enough for an index key, but there are several reasons why you might encounter this limit. For example, you may want to include a long description column in a secondary index on a table with many columns to leverage an Index Only Scan. Or you may have a table where the description is usually small and is searched on its first few words with a LIKE query, but a couple of rows have much larger descriptions.
Modern SQL databases store documents in JSON or JSONB, and you may want to index large attributes. SQL databases can also perform text searches without requiring another database. For example, here is an issue encountered in Mattermost on PostgreSQL that will not happen when migrating to YugabyteDB.
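To illustrate how such documents and text are typically indexed, here is a sketch with a hypothetical docs table (not part of the demo above): GIN indexes, used for JSONB containment and full-text search, are not subject to the B-Tree index row size limit shown above:

create table docs (id bigint primary key, doc jsonb);
-- containment search on the whole document; jsonb_path_ops stores hashes, so value size doesn't matter:
create index on docs using gin (doc jsonb_path_ops);
-- full-text search on a large attribute extracted from the document:
create index on docs using gin (to_tsvector('simple', doc->>'description'));
-- example predicates that can use these indexes:
-- select * from docs where doc @> '{"status": "active"}';
-- select * from docs where to_tsvector('simple', doc->>'description') @@ to_tsquery('simple', 'postgres');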
No page limits in YugabyteDB with LSM-Trees
With modern storage like SSDs, where random reads are fast, indexes no longer have to stick to a fixed-block-size structure, and modern databases use LSM-Trees. In an LSM-Tree, new index entries never have to split fixed-size blocks as they do in a B-Tree. Inserts go to a memory table that is periodically flushed to SST files, which are later compacted. This approach uses only fast I/O operations: random writes to memory, sequential writes to disk, and sequential or random reads from disk. The only operation that remains slow on SSDs, random writes to disk, is entirely avoided and transformed into sequential writes thanks to the Log-Structured Merge-Tree.
With YugabyteDB, you can run the same example as above without failure. The only limit is the RPC message size, which defaults to 255MB (rpc_max_message_size=267386880).
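As a sketch, assuming a YugabyteDB cluster and the same demo table definition as above, the DDL and the \watch loop can be reused as-is; the only visible difference is that the index uses YugabyteDB's lsm access method instead of btree:

create table demo (id bigint, description text, iterations int default 0);
create index on demo(description);  -- created as an lsm index by default in YugabyteDB
insert into demo values (generate_series(1,1000), '');
-- the same update ... \watch 0.1 loop as above keeps growing the indexed value
-- far beyond 2704 bytes without raising an error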
To reach this limit, I use the orafce extension, installed by default in YugabyteDB, to generate large random strings:
yugabyte=# create extension orafce;
CREATE EXTENSION
yugabyte=# update demo
set description = dbms_random.string('p',128*1024*1014)
where id=42
returning pg_size_pretty(pg_column_size(description)::bigint);
pg_size_pretty
----------------
127 MB
(1 row)
UPDATE 1
yugabyte=# update demo
set description = dbms_random.string('p',255*1024*1014)
where id=42
returning pg_size_pretty(pg_column_size(description)::bigint);
ERROR: Sending too long RPC message (662458904 bytes)
This limitation is not due to storage but simply to the size limit of network messages between nodes in the cluster. In a cloud-native database, large objects of hundreds of megabytes should be kept in object storage, such as S3, and referenced from the database rather than stored within it.
YugabyteDB stores data in distributed LSM-Trees based on RocksDB and can hold hundreds of megabytes in its index entries. This is why table rows can be stored directly in the primary key index, similar to an Index-Organized Table but without overflow. This structure keeps rows, including all their MVCC versions, clustered on the primary key, enabling fast range scans on a partial key.
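As an illustration of this clustering, here is a sketch with a hypothetical events table: declaring a range-sharded primary key in YugabyteDB stores the rows ordered in the primary-key LSM-Tree, so a partial-key predicate is served by a range scan:

create table events (
 tenant_id bigint,
 event_time timestamptz,
 payload text,
 primary key (tenant_id asc, event_time desc)  -- range sharding: rows clustered on the key
);
-- a partial-key predicate reads a contiguous range of the primary-key LSM-Tree:
-- select * from events where tenant_id = 42 order by event_time desc limit 10;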