In a previous post, Postgres dead tuple space reused without vacuum, I used pageinspect to examine the internals of MVCC in PostgreSQL. YugabyteDB is PostgreSQL-compatible at the query layer but has a different storage layer, so pageinspect doesn't apply. The storage is adapted from RocksDB, and we can use the Yugabyte version of sst_dump.
This demo is similar to @denismagda's PostgreSQL MVCC Backstage, but on YugabyteDB, and we will talk about it during the next Open Hours.
I start a docker container from the YugabyteDB image:
docker run -it yugabytedb/yugabyte:2.20.2.0-b145 bash
I start a single-node cluster:
export PATH="/home/yugabyte/bin:$PATH"
cd
yugabyted destroy
yugabyted start --advertise_address 0.0.0.0 --tserver_flags=ysql_enable_packed_row=false
I disable the Packed Rows optimization to make the observation easier. We will talk about it later.
I connect to it:
yugabyted connect ysql
I create the account table for the demo with a single row:
create table account(id int primary key, balance money, comment text);
insert into account values (1, 500, 'Deposit #1');
select * from account;
\q
I have added a text column that is easier to find in the file (when encryption is disabled, of course).
Find the row on the disk
The yugabyted status output mentions the Data Directory location.
Let's search for my "Deposit" text under this directory:
sh-4.4# grep -r Deposit /root/var/data
Binary file /root/var/data/yb-data/tserver/wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 matches
sh-4.4#
The only occurrence on disk is in the WAL (Write-Ahead Log).
Write Ahead Logging
I can check what it looks like:
sh-4.4# strings /root/var/data/yb-data/tserver/wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 | grep -C3 -iE "^|Deposit.*"
yugalogf
* e2cada1498c64837a7ca7cba189093db0
balance
comment
public@
a@a
a@aH
SDeposit #1
[e}D
sh-4.4#
I can see the presence of "Deposit #1" prefixed with an "S" because it is stored as a String. Now you know why you need to always enable encryption in a production environment. It is free with YugabyteDB, which is fully Open Source.
For the moment, my row is visible only in the WAL, which is written ahead, and not in any data file:
sh-4.4# grep -r Deposit .
Binary file ./wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 matches
sh-4.4#
The reason is that the first level of the LSM-Tree, called MemTable or MemStore, is in memory. It is protected by the WAL and stays in memory only until it is flushed.
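As a rough mental model of this write path (not YugabyteDB's actual code), the sketch below uses a list as the WAL, a dict as the MemTable, and sorted lists as SST files. The class name TinyLsm and its methods are made up for illustration.

```python
class TinyLsm:
    """Toy LSM-tree write path: WAL first, then MemTable, then flush to an SST."""

    def __init__(self):
        self.wal = []        # append-only log, stands in for the WAL file
        self.memtable = {}   # in-memory level, rebuilt from the WAL after a crash
        self.sst_files = []  # immutable sorted runs, stand in for *.sst files

    def put(self, key, value):
        self.wal.append((key, value))  # durability first
        self.memtable[key] = value     # then the in-memory write

    def flush(self):
        # Write the MemTable content as one sorted, immutable run.
        if self.memtable:
            self.sst_files.append(sorted(self.memtable.items()))
            self.memtable = {}


db = TinyLsm()
db.put(("id=1", "comment"), "Deposit #1")
print(len(db.wal), len(db.memtable), len(db.sst_files))  # 1 1 0
db.flush()
print(len(db.wal), len(db.memtable), len(db.sst_files))  # 1 0 1
```

Before the flush, the value exists only in the log and in memory, which matches what grep finds on disk at this point of the demo.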
Flush
I can force a flush and see the SST file:
sh-4.4# yb-ts-cli flush_all_tablets
Successfully flushed all tablets
sh-4.4# grep -r Deposit .
Binary file ./data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/000010.sst.sblock.0 matches
Binary file ./wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db/wal-000000001 matches
sh-4.4#
SST file
The SST file is contained in a directory that stores the RocksDB datastore for the tablet:
sh-4.4# cd $(dirname $(grep -rl Deposit /root/var/data/yb-data/tserver/data/rocksdb/))
sh-4.4# ls -l
total 160
-rw-r--r--. 1 root root 0 Mar 19 16:39 000003.log
-rw-r--r--. 1 root root 66381 Mar 19 16:49 000010.sst
-rw-r--r--. 1 root root 78 Mar 19 16:49 000010.sst.sblock.0
-rw-r--r--. 1 root root 16 Mar 19 16:49 CURRENT
-rw-r--r--. 1 root root 37 Mar 19 16:39 IDENTITY
-rw-r--r--. 1 root root 0 Mar 19 16:39 LOCK
-rw-r--r--. 1 root root 472 Mar 19 16:49 MANIFEST-000011
-rw-r--r--. 1 root root 4300 Mar 19 16:39 OPTIONS-000007
-rw-r--r--. 1 root root 4301 Mar 19 16:39 OPTIONS-000009
I can check the character strings in the SST file:
sh-4.4# strings 000010.sst.sblock.0 | grep -C3 -iE "^|Deposit.*"
SDeposit #1
sh-4.4#
However, we have a tool to decode it and show the RocksDB key-value documents stored by YugabyteDB:
sh-4.4# sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 16:56:42.758445 2920 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 16:56:42.758615 2920 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000010.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
sh-4.4#
I'll explain some options later. Each SQL row is a document with a key (DocKey), a sub-document for the row itself (SystemColumnId(0)) and one for each non-key column (ColumnId(1), ColumnId(2)).
You can recognize the values for these non-key columns: 50000 for the balance (the money type stores $500.00 as an integer number of cents) and Deposit #1 for the comment.
The key is the id, with the value [1] prefixed by a hash code 0x1210.
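To work with this output programmatically, a small parser can split each decoded line into its parts. This is only a sketch based on the lines printed above: the regular expression and the field names are my own, and reading "w" as a write index within the batch is an assumption.

```python
import re

# Matches lines such as:
# SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
LINE = re.compile(
    r"SubDocKey\(DocKey\((0x[0-9a-f]+), \[(.*?)\], \[(.*?)\]\), "
    r"\[(?:(\w+)\((\d+)\); )?HT\{ physical: (\d+)(?: w: (\d+))? \}\]\) -> (.*)$"
)


def parse(line):
    """Return a dict describing one decoded RegularDB line, or None if no match."""
    m = LINE.search(line)
    if m is None:
        return None
    hash_code, hashed, ranged, col_kind, col_id, physical, write_id, value = m.groups()
    return {
        "hash": int(hash_code, 16),
        "hashed_key": hashed,
        "range_key": ranged,
        "column": (col_kind, int(col_id)) if col_kind else None,
        "physical_us": int(physical),
        "write_id": int(write_id) if write_id else 0,
        "value": value,
    }


sample = ("SubDocKey(DocKey(0x1210, [1], []), "
          "[ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000")
row = parse(sample)
print(row["column"], row["physical_us"], row["write_id"], row["value"])
# ('ColumnId', 1) 1710866355184535 1 50000
```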
The Multi-Version Concurrency Control uses a Hybrid Time (HT) with a physical component (1710866355184535 is the epoch in microseconds for Tue Mar 19 16:39:15.184535 UTC 2024) and a logical component, maintained like a Lamport clock, to order events despite clock skew.
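The physical component is easy to check by hand. A minimal sketch, assuming it is a count of microseconds since the Unix epoch as stated above (the function name is mine):

```python
from datetime import datetime, timedelta, timezone


def ht_physical_to_utc(physical_us):
    """Convert the physical part of a Hybrid Time (microseconds) to a UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=physical_us)


print(ht_physical_to_utc(1710866355184535).isoformat())
# 2024-03-19T16:39:15.184535+00:00
```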
I used --formatter_tablet_metadata to decode packed rows, an optimization that stores all column values in one key. The tablet metadata is stored in protobuf format and can be decoded with yb-pbc-dump:
sh-4.4# yb-pbc-dump /root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
yb.tablet.RaftGroupReplicaSuperBlockPB 0
-------
primary_table_id: "000033bd000030008000000000004000"
raft_group_id: "e2cada1498c64837a7ca7cba189093db"
tablet_data_state: TABLET_DATA_READY
partition {
partition_key_start: ""
partition_key_end: ""
}
wal_dir: "/root/var/data/yb-data/tserver/wals/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db"
kv_store {
kv_store_id: "e2cada1498c64837a7ca7cba189093db"
rocksdb_dir: "/root/var/data/yb-data/tserver/data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db"
tables {
table_id: "000033bd000030008000000000004000"
table_name: "account"
...
Here, I disabled Packed Rows to show the values in the WAL, but with this metadata, they would be displayed in the same way even if, physically, the Sub Documents are stored as one Document.
I decoded the SST file with --output_format=decoded_regulardb because I'm looking at changes committed to RegularDB, free of any transaction intents.
To understand the hash value 0x1210 for the value 1, you can use select yb_hash_code(1):
sh-4.4# yugabyted connect ysql
ysqlsh (11.2-YB-2.20.2.0-b0)
yugabyte=# select '0x'||to_hex(yb_hash_code(id)), * from account;
?column? | id | balance | comment
----------+----+---------+------------
0x1210 | 1 | $500.00 | Deposit #1
(1 row)
yugabyte=#
More rows inserted
Let's insert two more rows and flush the MemTable again:
yugabyte=# insert into account values (2, 600, 'Another'), (3, 700, 'One More');
INSERT 0 2
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:12:33.443876 3027 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:12:33.444073 3027 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000010.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
Process ./000012.sst
Sst file format: block-based
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#
The two new rows follow the same format, but in another SST file.
Let's compare with PostgreSQL. YugabyteDB has no CTID (the physical location in PostgreSQL heap tables) because table rows are stored by their key, so the equivalent is the key itself (visible as DocKey). YugabyteDB has no XMIN because it uses the Hybrid Time, which will never wrap around. YugabyteDB has no XMAX because new versions are stored under the same DocKey, so the end of visibility of a version is marked by the Hybrid Time of the next version.
All versions are stored together, which avoids the random reads you will find in PostgreSQL or Oracle to find the right version. Of course, after many flushes, the key may have versions in multiple SST files, and this is why those are compacted in the background.
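The "no XMAX" point can be illustrated with a few lines of code: given the versions of one column, the end of visibility of each version is simply the Hybrid Time of the next one. This is a simplified sketch, not the DocDB implementation; the two balance versions are taken from the update shown later in this post.

```python
def visibility_intervals(versions):
    """versions: list of (hybrid_time, value) for one column of one DocKey.

    Returns (start, end, value) tuples; end is None for the current version.
    """
    ordered = sorted(versions)  # oldest first
    intervals = []
    for i, (start, value) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else None
        intervals.append((start, end, value))
    return intervals


balance_versions = [(1710866355184535, 50000), (1710868871792282, 10000)]
for start, end, value in visibility_intervals(balance_versions):
    print(value, "visible from", start, "until", end)
```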
Updating a column
When a column is updated, YugabyteDB doesn't copy the whole row like PostgreSQL but simply adds the new version of the column as a new SubDocKey:
yugabyte=# update account set balance=100 where id = 1;
UPDATE 1
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:21:15.816469 3076 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:21:15.816608 3076 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000010.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
Process ./000012.sst
Sst file format: block-based
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
Process ./000013.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
yugabyte=#
The newly flushed file contains the new value, 10000, at the new timestamp, 1710868871792282. A read will now have to seek into the key in all SST files and read the four subdocuments:
yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from account where id = 1;
QUERY PLAN
-------------------------------------------------------------------
Index Scan using account_pkey on account (actual time=1.089..1.092 rows=1 loops=1)
Index Cond: (id = 1)
Storage Table Read Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 4.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 3.000
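The read above can be pictured as follows. This is a deliberately naive sketch: it scans lists standing in for SST files, whereas the real storage uses a merging iterator with seek and next operations. The tuple layout (hash code, column id, hybrid time, value) and the function name are assumptions for illustration, and the write index is ignored.

```python
def read_row(sst_files, doc_key, read_ht):
    """Return {column_id: value} for doc_key as of read_ht, across all SST files."""
    latest = {}  # column_id -> (hybrid_time, value)
    for sst in sst_files:
        for key, column, ht, value in sst:
            if key != doc_key or ht > read_ht:
                continue  # other row, or version newer than the read time
            if column not in latest or ht > latest[column][0]:
                latest[column] = (ht, value)
    return {column: value for column, (ht, value) in latest.items()}


# Content of the first and third SST files of the demo, for row id=1.
sst_10 = [(0x1210, 0, 1710866355184535, None),
          (0x1210, 1, 1710866355184535, 50000),
          (0x1210, 2, 1710866355184535, "Deposit #1")]
sst_13 = [(0x1210, 1, 1710868871792282, 10000)]

now = read_row([sst_10, sst_13], 0x1210, 1710868871792282)
before = read_row([sst_10, sst_13], 0x1210, 1710868000000000)
print(now[1], before[1])  # 10000 50000
```

A read time before the update still sees the old balance, which is what a long-running query on a consistent snapshot relies on.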
Compaction
To avoid reading from too many files, SST files are compacted in the background. For this small demo, just as I flushed manually, I compact manually:
yugabyte=# \! yb-ts-cli compact_all_tablets
Successfully compacted all tablets
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:26:01.916337 3118 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:26:01.916476 3118 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000014.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710866355184535 w: 1 }]) -> 50000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#
Now, all versions are clustered in the same file, fast to read. The versions must be kept for MVCC to allow long queries to read a consistent snapshot but, at least, they are not scattered throughout the database.
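Because each SST file is already sorted, a compaction is essentially a k-way merge. The sketch below reproduces the order seen in the compacted file using Python's heapq.merge. The tuple layout (hash code, column id, hybrid time, value) is my own simplification, and the write index is ignored.

```python
import heapq


def sort_key(entry):
    doc_key, column, ht, value = entry
    return (doc_key, column, -ht)  # newest version first within a column


def compact(sst_files):
    """Merge already-sorted runs into one sorted run (no version is dropped here)."""
    return list(heapq.merge(*sst_files, key=sort_key))


sst_10 = [(0x1210, 0, 1710866355184535, None),
          (0x1210, 1, 1710866355184535, 50000),
          (0x1210, 2, 1710866355184535, "Deposit #1")]
sst_12 = [(0xc0c4, 0, 1710868341526423, None),
          (0xc0c4, 1, 1710868341526423, 60000),
          (0xc0c4, 2, 1710868341526423, "Another"),
          (0xfca0, 0, 1710868341526423, None),
          (0xfca0, 1, 1710868341526423, 70000),
          (0xfca0, 2, 1710868341526423, "One More")]
sst_13 = [(0x1210, 1, 1710868871792282, 10000)]

merged = compact([sst_10, sst_12, sst_13])
for entry in merged[:4]:
    print(hex(entry[0]), entry[1], entry[2], entry[3])
```

The new balance (10000) lands right before the old one (50000), next to the other columns of the same row, as in the sst_dump output above.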
MVCC retention
By default, the intermediate versions are kept for 15 minutes, defined by timestamp_history_retention_interval_sec=900. For this demo, I reduce it to 60 seconds for the time it takes to run a compaction:
yugabyte=# \! yb-ts-cli set_flag --force timestamp_history_retention_interval_sec 60
yugabyte=# \! yb-ts-cli compact_all_tablets
Successfully compacted all tablets
yugabyte=# \! yb-ts-cli set_flag --force timestamp_history_retention_interval_sec 900
A new SST file now contains only the final versions, replacing the other SST files:
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:30:28.030417 3182 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:30:28.030556 3182 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000015.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#
The result is visible as one less subdocument to read:
yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from account where id = 1;
QUERY PLAN
-------------------------------------------------------------------
Index Scan using account_pkey on account (actual time=1.089..1.092 rows=1 loops=1)
Index Cond: (id = 1)
Storage Table Read Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 3.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 3.000
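The effect of the retention interval on the two compactions can be sketched like this. It is a simplified model, not the actual compaction filter: for one column, keep every version newer than the history cutoff, plus the newest version at or before the cutoff. The cutoff values are rough estimates derived from the log timestamps of the two compactions.

```python
def gc_versions(versions, history_cutoff):
    """versions: (hybrid_time, value) pairs for one column of one DocKey.

    Keep all versions newer than the cutoff and the newest one at or before it,
    because a snapshot taken at the cutoff still needs that one.
    """
    kept = []
    for ht, value in sorted(versions, reverse=True):  # newest first
        kept.append((ht, value))
        if ht <= history_cutoff:
            break  # anything older is invisible to every allowed read time
    return kept


balance = [(1710866355184535, 50000), (1710868871792282, 10000)]

cutoff_900s = 1710868261000000  # approx. first compaction time minus 900 seconds
cutoff_60s = 1710869368000000   # approx. second compaction time minus 60 seconds

print(gc_versions(balance, cutoff_900s))
# [(1710868871792282, 10000), (1710866355184535, 50000)]
print(gc_versions(balance, cutoff_60s))
# [(1710868871792282, 10000)]
```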
Note that with Packed Rows, the document would be:
SubDocKey(DocKey(0x1210, [1], []), [HT{ physical: 1710878213683630 }]) -> { 1: 50000 2: "Deposit #1" }
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710878309381790 }]) -> 10000
SubDocKey(DocKey(0xc0c4, [2], []), [HT{ physical: 1710878305322497 }]) -> { 1: 60000 2: "Another" }
SubDocKey(DocKey(0xfca0, [3], []), [HT{ physical: 1710878305322497 w: 1 }]) -> { 1: 70000 2: "One More" }
with all columns in one SubDocument and fewer next operations:
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 2.000
Transaction intents
Many databases, like PostgreSQL or Oracle, write the transaction intents, with lock information, in the final blocks and have to clean them up later. YugabyteDB doesn't pollute the RegularDB with those but stores them as provisional records in an IntentsDB until they are committed. The RocksDB iterators are efficient at merging from multiple files (the M in LSM-Tree stands for Merge) and YugabyteDB leverages this by using two LSM-Trees per tablet.
I open a transaction to update the row and look at the SST Files before I commit:
yugabyte=# begin transaction;
BEGIN
yugabyte=# update account set balance=100 where id = 1;
UPDATE 1
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:39:03.068542 3268 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:39:03.068686 3268 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000015.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#
I took care of flushing the MemTable, but there's nothing new in RegularDB. The ongoing changes are stored in the IntentsDB in a sibling directory:
yugabyte=# \! pwd
/root/var/data/yb-data/tserver/data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db
yugabyte=# \! ls ..
tablet-e2cada1498c64837a7ca7cba189093db tablet-e2cada1498c64837a7ca7cba189093db.intents tablet-e2cada1498c64837a7ca7cba189093db.snapshots
yugabyte=# \! ls $PWD.intents
000003.log 000012.sst 000012.sst.sblock.0 CURRENT IDENTITY LOCK MANIFEST-000011 OPTIONS-000007 OPTIONS-000009
yugabyte=#
This is where I use --output_format=decoded_intentsdb because IntentsDB has more information about the transactions involved:
yugabyte=# \! sst_dump --command=scan --output_format=decoded_intentsdb --file=$PWD.intents --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:40:49.912428 3293 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:40:49.912580 3293 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process /root/var/data/yb-data/tserver/data/rocksdb/table-000033bd000030008000000000004000/tablet-e2cada1498c64837a7ca7cba189093db.intents/000012.sst
Sst file format: block-based
SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 2 } -> TransactionId(33a6b08e-95aa-4dd5-a3a1-1601b08ddc20) none
SubDocKey(DocKey(0x1210, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 1 } -> TransactionId(33a6b08e-95aa-4dd5-a3a1-1601b08ddc20) none
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1710869935661132 } -> TransactionId(33a6b08e-95aa-4dd5-a3a1-1601b08ddc20) WriteId(0) 10000
TXN META 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 -> { transaction_id: 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 isolation: SNAPSHOT_ISOLATION status_tablet: 5c0155f301a44123a82f95a9fb3d0f87 priority: 10096702022479923001 start_time: { physical: 1710869935658346 } locality: GLOBAL old_status_tablet: }
TXN REV 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 HT{ physical: 1710869935661132 } -> SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1710869935661132 }
TXN REV 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 HT{ physical: 1710869935661132 w: 1 } -> SubDocKey(DocKey(0x1210, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 1 }
TXN REV 33a6b08e-95aa-4dd5-a3a1-1601b08ddc20 HT{ physical: 1710869935661132 w: 2 } -> SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1710869935661132 w: 2 }
yugabyte=#
My ongoing changes still use the DocKey and SubDocKey but hold additional information about locks. For example, the update has acquired strong locks on the column (kStrongRead, kStrongWrite) to let other transactions know that they cannot write to the same column. It also acquires weaker locks (kWeakRead, kWeakWrite) at a higher level, to detect conflicts faster. The IntentsDB also adds references to the transaction ID with two-way indexing: one to know if the transaction is committed when seeing a change, the other (the TXN REV entries) to find the intents to clean up after the transaction commits or rolls back.
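The two-way indexing can be pictured with two dictionaries. This is only a toy model under strong simplifications: the class IntentStore and its methods are hypothetical, one intent per key is assumed, and in the real IntentsDB both directions are key-value entries in the same RocksDB.

```python
class IntentStore:
    """Toy model of provisional records with a forward and a reverse index."""

    def __init__(self):
        self.by_key = {}  # intent key -> (txn_id, lock types, provisional value)
        self.by_txn = {}  # txn_id -> list of intent keys (like the TXN REV entries)

    def add(self, txn_id, key, locks, value=None):
        self.by_key[key] = (txn_id, locks, value)
        self.by_txn.setdefault(txn_id, []).append(key)

    def owner(self, key):
        """Which transaction holds an intent on this key, if any."""
        entry = self.by_key.get(key)
        return entry[0] if entry else None

    def resolve(self, txn_id, committed):
        """Remove all intents of a transaction; return the values to apply on commit."""
        applied = {}
        for key in self.by_txn.pop(txn_id, []):
            _, _, value = self.by_key.pop(key)
            if committed and value is not None:
                applied[key] = value
        return applied


txn = "33a6b08e-95aa-4dd5-a3a1-1601b08ddc20"
store = IntentStore()
store.add(txn, (0x1210, 1, "ColumnId(1)"), ("kStrongRead", "kStrongWrite"), 10000)
store.add(txn, (0x1210, 1), ("kWeakRead", "kWeakWrite"))
store.add(txn, (), ("kWeakRead", "kWeakWrite"))

print(store.owner((0x1210, 1, "ColumnId(1)")) == txn)  # True
to_regular_db = store.resolve(txn, committed=True)
print(to_regular_db)  # {(4624, 1, 'ColumnId(1)'): 10000}
```

The forward index answers "who owns this provisional record?", and the reverse index makes the cleanup after commit or rollback a lookup by transaction ID instead of a scan.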
Commit
On commit, the new state is written to RegularDB and the Intents are deleted.
yugabyte=# commit;
COMMIT
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets
yugabyte=# \! ls $PWD.intents
000003.log CURRENT IDENTITY LOCK MANIFEST-000011 OPTIONS-000007 OPTIONS-000009
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:56:12.125682 3389 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:56:12.125821 3389 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000015.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
Process ./000016.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710870886389512 }]) -> 10000
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets
The session that commits doesn't wait for the intents to be written to RegularDB. As soon as the transaction status is updated, the provisional records are visible to other transactions.
Delete
A delete is like an update with a "tombstone" marker:
yugabyte=# delete from account;
DELETE 3
yugabyte=# \! yb-ts-cli flush_all_tablets
Successfully flushed all tablets
yugabyte=# \! yb-ts-cli compact_all_tablets
Successfully compacted all tablets
yugabyte=# \! sst_dump --command=scan --output_format=decoded_regulardb --file=. --formatter_tablet_metadata=/root/var/data/yb-data/tserver/tablet-meta/${PWD#*/tablet-}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0319 17:59:46.082562 3448 doc_read_context.cc:71] fake_log_prefixLogAfterLoad: [0]
I0319 17:59:46.082723 3448 kv_formatter.cc:35] Found info for table ID 000033bd000030008000000000004000 (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name account, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process ./000018.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [HT{ physical: 1710871148344769 }]) -> DEL
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1710866355184535 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710870886389512 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1710868871792282 }]) -> 10000
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1710866355184535 w: 2 }]) -> "Deposit #1"
SubDocKey(DocKey(0xc0c4, [2], []), [HT{ physical: 1710871148344769 w: 1 }]) -> DEL
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1710868341526423 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 1 }]) -> 60000
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 2 }]) -> "Another"
SubDocKey(DocKey(0xfca0, [3], []), [HT{ physical: 1710871148344769 w: 2 }]) -> DEL
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1710868341526423 w: 3 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1710868341526423 w: 4 }]) -> 70000
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1710868341526423 w: 5 }]) -> "One More"
yugabyte=#
The end of life of the rows is marked by -> DEL. All versions stay for the 15-minute retention period, but they are stored together. The next compaction after the 15-minute retention will remove them all, and a new, much smaller SST file will be created.
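To see how a tombstone hides a row without removing its versions, here is a minimal sketch of a column read that takes row-level DEL markers into account. It is a simplified model of the behavior described above, with a hypothetical function name, and it returns None when no row is visible.

```python
def read_column(versions, tombstones, read_ht):
    """versions: (hybrid_time, value) for one column; tombstones: row-level DEL times."""
    visible = [(ht, value) for ht, value in versions if ht <= read_ht]
    if not visible:
        return None
    ht, value = max(visible, key=lambda pair: pair[0])  # newest visible version
    # The version is hidden if a DEL was written after it and before the read time.
    deleted = any(ht < del_ht <= read_ht for del_ht in tombstones)
    return None if deleted else value


balance = [(1710866355184535, 50000),
           (1710868871792282, 10000),
           (1710870886389512, 10000)]
row_tombstones = [1710871148344769]

print(read_column(balance, row_tombstones, 1710871000000000))  # 10000
print(read_column(balance, row_tombstones, 1710872000000000))  # None
```

A snapshot taken before the delete still reads the balance, which is why the versions must stay until the retention period has passed.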