There can be issues with optimizer statistics when a query contains a predicate with a value that is out of range, such as higher than the maximum value gathered during the last ANALYZE.
If statistics were updated in real time, the planner could easily estimate the number of rows as zero, or rather one, since the planner never estimates zero. However, statistics are gathered by ANALYZE or Auto-Analyze and can become stale until the next run.
Typically, there are always a few rows above the known maximum on columns that are constantly increasing, such as sequences or current timestamps. Oracle Database uses linear decay. PostgreSQL leverages the index to get a quick, accurate maximum and adjust its estimates. YugabyteDB is currently implementing the cost-based optimizer, which is available in preview. Here is a small test I did to compare the estimates.
I created a table with a ts timestamp column, set to the system time at insert, and a ty type column. I disabled Auto-Vacuum on the table to disable Auto-Analyze for the experiment:
postgres=# create table demo ( id bigserial primary key, ts timestamptz default clock_timestamp(), ty text )
with (autovacuum_enabled = off);
CREATE TABLE
postgres=# insert into demo (ty,ts)
select 'a', timestamp '2024-02-08 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400
I inserted one day of data, covering February 7, 2024 (the 86,400 seconds before February 8). I will query ranges of dates, including dates beyond the existing range and into the future. Additionally, I will query with an equality predicate on ty for both an existing and a non-existing value.
Here is an example without statistics. Look at the estimated number of rows:
postgres=# explain (analyze, summary off) select * from demo
where ts > timestamp '2024-02-07 14:00:00' - interval '1 hour';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1287.96 rows=19652 width=48) (actual time=0.013..10.890 rows=39599 loops=1)
Filter: (ts > '2024-02-07 13:00:00'::timestamp without time zone)
Rows Removed by Filter: 46801
(3 rows)
postgres=# explain (analyze, summary off) select * from demo
where ts > timestamp '2024-02-08 14:00:00' - interval '1 hour';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1287.96 rows=19652 width=48) (actual time=9.285..9.285 rows=0 loops=1)
Filter: (ts > '2024-02-08 13:00:00'::timestamp without time zone)
Rows Removed by Filter: 86400
(3 rows)
postgres=# explain (analyze, summary off) select * from demo
postgres-# where ts > timestamp '2024-02-09 14:00:00' - interval '1 hour';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1287.96 rows=19652 width=48) (actual time=9.385..9.385 rows=0 loops=1)
Filter: (ts > '2024-02-09 13:00:00'::timestamp without time zone)
Rows Removed by Filter: 86400
(3 rows)
postgres=# explain (analyze, summary off) select * from demo
where ty='a'
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1287.96 rows=295 width=48) (actual time=0.014..11.199 rows=86400 loops=1)
Filter: (ty = 'a'::text)
(2 rows)
postgres=# explain (analyze, summary off) select * from demo
where ty='b'
;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1287.96 rows=295 width=48) (actual time=7.595..7.595 rows=0 loops=1)
Filter: (ty = 'b'::text)
Rows Removed by Filter: 86400
(3 rows)
My objective is to compare the accuracy of the estimated rows (the rows= in cost=...) with the actual ones (the rows= in actual...). I do this across several scenarios: with and without ANALYZE, with an INSERT after the last ANALYZE, and with an index on the columns.
This has been run on PostgreSQL 16 and on YugabyteDB 2.19.3, with and without yb_enable_optimizer_statistics=on.
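For reference, the YugabyteDB parameter is toggled per session:
-- session-level switch used for the YugabyteDB runs
set yb_enable_optimizer_statistics = on;
show yb_enable_optimizer_statistics;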
Here is the summary from the above execution plans:
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 19652 | 1000 | 1 | 39599 |
ts>2024-02-08 | 19652 | 1000 | 1 | 0 |
ts>2024-02-09 | 19652 | 1000 | 1 | 0 |
ty='a' | 295 | 1000 | 1 | 86400 |
ty='b' | 295 | 1000 | 1 | 0 |
I have rows only for the day of 2024-02-07, but without any statistics the query planner doesn't know that: it falls back on generic default selectivities, applied to a row count guessed from the size of the table on disk.
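Those defaults are hard-coded in PostgreSQL's selfuncs.h: roughly one third of the rows for an inequality (DEFAULT_INEQ_SEL) and 0.5% for an equality (DEFAULT_EQ_SEL). As a rough cross-check, assuming a size-based guess of about 59,000 rows (my reverse-engineering of the plans above, not a value read from the catalog):
-- back-of-the-envelope check of the default selectivities
select round(59000 * 0.3333) as inequality_estimate  -- in the ballpark of the 19652 above
     , round(59000 * 0.005)  as equality_estimate    -- in the ballpark of the 295 above
;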
Let's ANALYZE the table and run the same queries:
postgres=# analyze demo;
ANALYZE
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 39275 | 86400 | 39752 | 39599 |
ts>2024-02-08 | 9 | 86400 | 9 | 0 |
ts>2024-02-09 | 9 | 86400 | 9 | 0 |
ty='a' | 86400 | 86400 | 86400 | 86400 |
ty='b' | 1 | 86400 | 1 | 0 |
With the table and column statistics, the estimated number of rows is correct in PostgreSQL. For the values out of range, the planner estimates a small number, higher than zero, to account for stale statistics. With yb_enable_optimizer_statistics=off, YugabyteDB uses the number of rows (pg_class.reltuples) but not the column statistics. With yb_enable_optimizer_statistics=on, it uses them to estimate the selectivity, with the PostgreSQL algorithm.
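The number of rows stored by the last ANALYZE is visible in the catalog:
-- table-level statistic used even when yb_enable_optimizer_statistics is off
select relname, reltuples
from pg_class
where relname = 'demo';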
Let's insert another day to see what happens when we query above the values gathered by the last ANALYZE:
postgres=# insert into demo (ty,ts)
select 'a', timestamp '2024-02-09 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 125723 | 86400 | 39752 | 125999 |
ts>2024-02-08 | 39422 | 86400 | 9 | 39599 |
ts>2024-02-09 | 17 | 86400 | 9 | 0 |
ty='a' | 172800 | 86400 | 86400 | 172800 |
ty='b' | 1 | 86400 | 1 | 0 |
Even if the statistics were not gathered, PostgreSQL knows about the newly inserted rows (this information is used by Auto-Vacuum and is visible in pg_stat_all_tables) and provides accurate estimations.
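For example, the counters maintained between ANALYZE runs can be checked with:
-- rows inserted/modified since the last ANALYZE, and when it last ran
select relname, n_live_tup, n_mod_since_analyze, last_analyze, last_autoanalyze
from pg_stat_all_tables
where relname = 'demo';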
Let's ANALYZE the table:
postgres=# analyze demo;
ANALYZE
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 125834 | 172800 | 125782 | 125999 |
ts>2024-02-08 | 39053 | 172800 | 38853 | 39599 |
ts>2024-02-09 | 17 | 172800 | 17 | 0 |
ty='a' | 172800 | 172800 | 172800 | 172800 |
ty='b' | 1 | 172800 | 1 | 0 |
The estimations are good with PostgreSQL, and with YugabyteDB when it uses the column statistics. However, the out-of-range 2024-02-09 is still over-estimated.
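The reason is visible in the column statistics: the upper histogram bound is the maximum ts value sampled by the last ANALYZE, and anything beyond it is out of range for the planner:
-- the last element of histogram_bounds is the highest ts the planner knows about
select tablename, attname, histogram_bounds
from pg_stats
where tablename = 'demo' and attname = 'ts';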
Let's create an index on the ts column and run the queries again:
postgres=# create index demots on demo(ts);
CREATE INDEX
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 125834 | 172800 | 125782 | 125999 |
ts>2024-02-08 | 39053 | 172800 | 38853 | 39599 |
ts>2024-02-09 | 1 | 172800 | 17 | 0 |
ty='a' | 172800 | 172800 | 172800 | 172800 |
ty='b' | 1 | 172800 | 1 | 0 |
Thanks to the index, the out-of-range estimation is now correct with PostgreSQL (the planner always puts 1 instead of 0): it can read the actual maximum value cheaply from the right-most leaf block of the B-Tree index.
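You can check how cheap that maximum is with the index in place. I expect a single-row backward scan of demots here (I'm describing the expected plan shape, not pasting output):
-- the current maximum comes from the right-most leaf of the B-Tree
explain (costs off)
select max(ts) from demo;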
Now that I have this index, I insert rows for an additional day and run the queries again:
postgres=# insert into demo (ty,ts)
select 'a', timestamp '2024-02-10 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 188694 | 172800 | 125782 | 212399 |
ts>2024-02-08 | 58563 | 172800 | 38853 | 125999 |
ts>2024-02-09 | 1165 | 172800 | 17 | 39599 |
ty='a' | 259122 | 172800 | 172800 | 259200 |
ty='b' | 1 | 172800 | 1 | 0 |
Having the actual maximum is not sufficient for an exact estimation but, without any new ANALYZE, PostgreSQL still raised its estimates for the out-of-range values.
I insert ten more days and run the queries again:
postgres=# insert into demo (ty,ts)
select 'b', timestamp '2024-02-20 00:00:00'-generate_series(1,864000)*interval'1 second'
;
INSERT 0 864000
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 817749 | 172800 | 125782 | 1076399 |
ts>2024-02-08 | 253794 | 172800 | 38853 | 989999 |
ts>2024-02-09 | 10658 | 172800 | 17 | 903599 |
ty='a' | 1122965 | 172800 | 172800 | 259200 |
ty='b' | 1 | 172800 | 1 | 864000 |
The estimation error is higher. This mechanism helps with stale statistics, when Auto-Analyze didn't run recently, but it is not a solution if you stop Auto-Analyze.
Note that I've inserted those new rows with a different ty value but, without ANALYZE, the query planner doesn't see them and still under-estimates the number of rows with ty='b'.
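The column statistics confirm it: they still only list the 'a' value gathered by the last ANALYZE:
-- most common values for ty as of the last ANALYZE ('b' is not there yet)
select most_common_vals, most_common_freqs, n_distinct
from pg_stats
where tablename = 'demo' and attname = 'ty';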
We have seen that an index can help for the ever-increasing value; let's try the same here:
postgres=# create index demo_b on demo(ty asc);
CREATE INDEX
Query | PostgreSQL | YB stats=off | YB stats=on | Actual |
---|---|---|---|---|
ts>2024-02-07 | 817749 | 172800 | 125782 | 1076399 |
ts>2024-02-08 | 253794 | 172800 | 38853 | 989999 |
ts>2024-02-09 | 10658 | 172800 | 17 | 903599 |
ty='a' | 1123200 | 172800 | 172800 | 259200 |
ty='b' | 1 | 172800 | 1 | 864000 |
This didn't change: the index-based adjustment applies to range predicates beyond the known minimum or maximum, not to an equality on a value the statistics have never seen. When you add rows with a new value in a column used by predicates, you must analyze the table. In this example, the misestimate can have severe consequences on joins: when the estimate is 1 row, the PostgreSQL planner can choose a Nested Loop and end up running it with thousands of loops. YugabyteDB lowers the consequences with Batched Nested Loops.
I disabled Auto-Vacuum for the experiment. In PostgreSQL, you should keep it enabled:
postgres=# alter table demo set ( autovacuum_enabled = on );
ALTER TABLE
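If Auto-Analyze doesn't keep up with a high insert rate, the per-table thresholds can also be lowered. The values below are only an illustration, not a recommendation:
-- trigger auto-analyze after ~1% of the rows changed instead of the 10% default
alter table demo set (
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_analyze_threshold = 1000
);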
YugabyteDB doesn't need VACUUM: it is not subject to bloat, heap fetches, or transaction ID wraparound. If you rely on statistics (yb_enable_optimizer_statistics=on), you should schedule ANALYZE.
In summary, either you run with no statistics at all and rely on generic plans, which is not recommended in PostgreSQL, or you must make sure that Auto-Analyze runs regularly. PostgreSQL has some nice ways to avoid misestimates for out-of-range predicates, but they only work for a small amount of staleness. With YugabyteDB, until the Cost-Based Optimizer is GA with a cluster-aware cost model, you can use the rule-based optimizer for OLTP queries and help the planner with hints for more complex queries. If you use the Cost-Based Optimizer, it is important to analyze all tables and set yb_enable_optimizer_statistics to ensure that the query planner uses all available statistics. When you revert back, you must "de-analyze" the tables by removing the gathered statistics.
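As a minimal sketch of such a hint, pinning the join method when a 1-row misestimate would otherwise make the planner pick a Nested Loop, assuming pg_hint_plan is enabled (it ships with YugabyteDB) and with orders being a hypothetical table that is not part of this demo:
/*+ HashJoin(demo orders) */
explain
select * from demo join orders using (id)
where demo.ty = 'b';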
Additionally, it is recommended to have a data model that is not too sensitive to small changes in cardinality. This can be achieved by creating the right indexes for your access patterns so that the best Scan and Join methods become obvious.