YugabyteDB does not re-use the PostgreSQL code as-is. To get high performance in a distributed environment, many operations have been pushed down and offloaded to the storage. The SQL processing happens in YSQL (the PostgreSQL backend) which gets rows from DocDB (the distributed storage and transaction).
Some of those optimizations were introduced in previous releases:
- pushdowns for aggregates in release 2.0
- pushdown for Seq Scan filters in 2.13
- pushdown for Index Scan conditions in 2.15
How do they combine where running a select COUNT(*)
with a WHERE
clause?
I'm running this in version 2.17 and create a 1 million rows table with a range-sharded index:
create table demo ( id bigint primary key, val int );
insert into demo select n, n from generate_series(1,1000000) n;
create index demoi on demo(val asc, id)
split at values ( (333333),(666666) );
analyze demo;
set yb_enable_optimizer_statistics = on;
Count a few rows
I'll start with the most common case: the WHERE close is selective and that's why we have an index for it.
yugabyte=> explain (analyze, dist)
select count(*) from demo where val between 1 and 42;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4.01..4.02 rows=1 width=8) (actual time=0.805..0.805 rows=1 loops=1)
-> Index Only Scan using demoi on demo (cost=0.00..4.01 rows=1 width=0) (actual time=0.764..0.781 rows=42 loops=1)
Index Cond: ((val >= 1) AND (val <= 42))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Execution Time: 4.000 ms
Planning Time: 0.072 ms
Execution Time: 0.847 ms
Storage Read Requests: 1
Storage Write Requests: 0
Storage Execution Time: 4.000 ms
Peak Memory Usage: 24 kB
(12 rows)
This plan is efficient. The query planner estimated a few rows to be counted and choose the Index Only Scan. 42 rows were fetched from DocDB to YSQL in one read request, in 4 milliseconds, and then counted there for an additional 4 millisecond.
Index Scan and push-down
What is pushed down here? The answer is in the rows=42
actual output of the Scan operation. The index condition was pushed down, to scan a range of the index. But as it returned all 42 rows rather than the count, the Aggregate
was not pushed down.
This is acceptable for a few rows. We will see in a second part what happens with a million of rows that you definitely don't want to transport to count them on another node.
Seq Scan and push-down
What would happen if, for any reason, an index scan was not possible?
I disable enable_indexscan
for the query with a hint:
yugabyte=> explain (analyze, dist)
/*+ Set(enable_indexscan off) */
select count(*) from demo where val between 1 and 42;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Aggregate (cost=104697.50..104697.51 rows=1 width=8) (actual time=3767.276..3767.276 rows=1 loops=1)
-> Seq Scan on demo (cost=0.00..104697.50 rows=1 width=0) (actual time=26.033..3767.161 rows=42 loops=1)
Filter: ((val >= 1) AND (val <= 42))
Rows Removed by Filter: 999958
Storage Table Read Requests: 978
Storage Table Execution Time: 3652.016 ms
Planning Time: 0.110 ms
Execution Time: 3767.337 ms
Storage Read Requests: 978
Storage Write Requests: 0
Storage Execution Time: 3652.016 ms
Peak Memory Usage: 30 kB
(12 rows)
Now it takes more than 3 seconds because it has to scan all rows (one million) in DocDB, send all of them to the YSQL layer that applies the filter (Rows Removed by Filter: 999958
), gets the remaining ones (rows=42
) and finally count them to get one aggregated result (rows=1
)
This is where we want to push down the filter by enabling the yb_enable_expression_pushdown
session parameter:
yugabyte=> explain (analyze, dist)
/*+ Set(enable_indexscan off) Set(yb_enable_expression_pushdown on) */
select count(*) from demo where val between 1 and 42;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=104697.50..104697.51 rows=1 width=8) (actual time=1233.028..1233.028 rows=1 loops=1)
-> Seq Scan on demo (cost=0.00..104697.50 rows=1 width=0) (actual time=1233.015..1233.018 rows=3 loops=1)
Remote Filter: ((val >= 1) AND (val <= 42))
Partial Aggregate: true
Storage Table Read Requests: 1
Storage Table Execution Time: 1232.005 ms
Planning Time: 0.110 ms
Execution Time: 1233.096 ms
Storage Read Requests: 1
Storage Write Requests: 0
Storage Execution Time: 1232.005 ms
Peak Memory Usage: 30 kB
(12 rows)
This is a bit faster because the rows have been filtered in DocDB (Remote Filter
) and also counted there (Partial Aggregate: true
). Only the partial count has been fetched from the 3 tablets (rows=3
) and has added to finalize the aggregate to get one count (rows=1
).
When there is no index, this is the best we can do: 1.2 seconds here instead of the previous 3.7 seconds (without the push-down of filter and aggregate).
This is great but, unfortunately too simple if you have to use pg_hint_plan
to force a full table scan. The Seq Scan
you see in the execution plan is not a real Seq Scan. The aggregate push-down has been implemented in early versions of Yugabyte using the Foreign Scan node of PostgreSQL. You see Seq Scan
because Foreign Scan
was renamed in explain.c to hide this implementation detail.
When you think about it, it means that the SeqScan()
hint forces another one, the real Seq Scan
which is, in YugabyteDB, actually a subclass of it that is called YBSeqScan
. I know you get lost, but don't panic, I've summarized that in a table at the end of this post. This scan is visible in the execution plan as YB Seq Scan
and, in this version (2.17), it does not support the aggregate push down:
yugabyte=> explain (analyze, dist)
/*+ SeqScan(demo) Set(yb_enable_expression_pushdown on) */
select count(*) from demo where val between 1 and 42;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Aggregate (cost=104697.50..104697.51 rows=1 width=8) (actual time=1222.116..1222.116 rows=1 loops=1)
-> YB Seq Scan on demo (cost=0.00..104697.50 rows=1 width=0) (actual time=1222.093..1222.105 rows=42 loops=1)
Remote Filter: ((val >= 1) AND (val <= 42))
Planning Time: 0.114 ms
Execution Time: 1222.160 ms
Storage Read Requests: 0
Storage Write Requests: 0
Storage Execution Time: 0.000 ms
Peak Memory Usage: 30 kB
(9 rows)
The expression filter was pushed down, as a Remote Filter
in this YB Seq Scan
but the 42 rows have been fetched to be aggregated in YSQL.
Here, because there are only 42 rows, it doesn't show a difference in the response time.
Count many rows
I'll now count a lot of rows by changing the predicate:
yugabyte=> explain (analyze, dist)
select count(*) from demo where val between 42 and 1e6 ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=108021.14..108021.15 rows=1 width=8) (actual time=3790.036..3790.036 rows=1 loops=1)
-> Seq Scan on demo (cost=0.00..107190.29 rows=332340 width=0) (actual time=4.047..3705.883 rows=999959 loops=1)
Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
Rows Removed by Filter: 41
Storage Table Read Requests: 978
Storage Table Execution Time: 3416.015 ms
Planning Time: 0.072 ms
Execution Time: 3790.096 ms
Storage Read Requests: 978
Storage Write Requests: 0
Storage Execution Time: 3416.015 ms
Peak Memory Usage: 30 kB
(12 rows)
Here, without any push down, the scan takes nearly 4 seconds, one million rows were fetched by YSQL to be filtered (removing only 41 rows) and counted.
An index is not better, even with expression pushdown, because there are many rows and the aggregate is not pushed down:
yugabyte=> explain (analyze, dist)
/*+ IndexOnlyScan(demo) Set(yb_enable_expression_pushdown on) */
select count(*) from demo where val between 42 and 1e6 ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=124631.41..124631.42 rows=1 width=8) (actual time=5546.335..5546.336 rows=1 loops=1)
-> Index Only Scan using demoi on demo (cost=0.00..122138.86 rows=997019 width=0) (actual time=5.589..5451.448 rows=999959 loops=1)
Index Cond: (val >= 42)
Remote Filter: ((val)::numeric <= '1000000'::numeric)
Heap Fetches: 0
Storage Index Read Requests: 978
Storage Index Execution Time: 4928.021 ms
Planning Time: 0.126 ms
Execution Time: 5546.383 ms
Storage Read Requests: 978
Storage Write Requests: 0
Storage Execution Time: 4928.021 ms
Peak Memory Usage: 24 kB
(13 rows)
With the Seq Scan
, the aggregate can be pushed down by enabling yb_enable_expression_pushdown
:
yugabyte=> explain (analyze, dist)
/*+ Set(yb_enable_expression_pushdown on) */
select count(*) from demo where val between 42 and 1e6 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=108021.14..108021.15 rows=1 width=8) (actual time=1359.976..1359.976 rows=1 loops=1)
-> Seq Scan on demo (cost=0.00..107190.29 rows=332340 width=0) (actual time=1359.964..1359.967 rows=3 loops=1)
Remote Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
Partial Aggregate: true
Storage Table Read Requests: 1
Storage Table Execution Time: 1360.006 ms
Planning Time: 0.104 ms
Execution Time: 1360.037 ms
Storage Read Requests: 1
Storage Write Requests: 0
Storage Execution Time: 1360.006 ms
Peak Memory Usage: 30 kB
(12 rows)
Here, because the query planner estimations were good, a Seq Scan
, which is actually the Foreign Scan, has been chosen. If you force a SeqScan()
with a hint, the YB Seq Scan
comes, without pushing down the aggregate:
yugabyte=> explain (analyze, dist)
/*+ SeqScan(demo) Set(yb_enable_expression_pushdown on) */
select count(*) from demo where val between 42 and 1e6 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=108021.14..108021.15 rows=1 width=8) (actual time=2561.376..2561.376 rows=1 loops=1)
-> YB Seq Scan on demo (cost=0.00..107190.29 rows=332340 width=0) (actual time=5.691..2403.967 rows=999959 loops=1)
Remote Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
Planning Time: 0.112 ms
Execution Time: 2561.419 ms
Storage Read Requests: 0
Storage Write Requests: 0
Storage Execution Time: 0.000 ms
Peak Memory Usage: 30 kB
It is the same with the NoIndexScan()
hint:
yugabyte=> explain (analyze, dist)
/*+ NoIndexScan(demo) Set(yb_enable_expression_pushdown on) */
select count(*) from demo where val between 42 and 1e6 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=108021.14..108021.15 rows=1 width=8) (actual time=2763.550..2763.551 rows=1 loops=1)
-> YB Seq Scan on demo (cost=0.00..107190.29 rows=332340 width=0) (actual time=6.953..2631.524 rows=999959 loops=1)
Remote Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
Planning Time: 0.115 ms
Execution Time: 2763.651 ms
Storage Read Requests: 0
Storage Write Requests: 0
Storage Execution Time: 0.000 ms
Peak Memory Usage: 30 kB
(9 rows)
Then, if you need to force Full Table Scan that enables all push downs, you must use the query parameter planner, set at session level, or with the Set()
hint, but not the Scan hints.
A summary as of YugabyteDB 2.17
Here is a summary:
Seq Scan | YB Seq Scan | Index Scan | |
---|---|---|---|
In EXPLAIN as | Seq Scan |
YB Seq Scan |
Index Scan |
In code (/src/postgres/ ) |
ForeignScan | YBSeqScan | IndexScan |
Aggregate Push Down |
Partial Aggregate ✅ |
||
Expression Push Down |
Remote Filter ✅ |
Remote Filter ✅ |
Index Cond. ✅ |
Forced by hint | Set() | SeqScan() | IndexScan() |
Note:
- SeqScan in
nodeSeqscan.c
is not used - ForeignScan in
ybc_fdw.c
whill not be used anymore for YugabyteDB tables after https://github.com/yugabyte/yugabyte-db/issues/18082
Don't forget to enable the features in your session
set yb_enable_expression_pushdown = on;
set yb_enable_optimizer_statistics = on;
and have the tables analyzed. You shouldn't need to hint if the estimations are good.