Fixing a slow JOIN in Postgres

Mircea Cadariu - Sep 2 - - Dev Community

We can keep our database-backed applications performing pretty well already by following a couple of simple rules, for example:

  • no N+1 queries
  • adding adequate indexes
  • keeping the output "narrow" (retrieving only the required columns)

The scenario I'm about to show you is a bit different. It was already down to just one query, it had the adequate indexes, but it was still taking ~30 seconds to run, and hence was unworkable. Everything was in place such that it would be fast, but somehow, it wasn't! I'll show you what I did such that it ran in <1s and explain every step with the reasoning behind it.

The tables

What I had is a many-to-many relationship, involving two entity classes, let's call them Foo and Bar. They were mapped with the @ManyToMany JPA annotation in the Java code. At the database level, besides the corresponding tables for storing the entities, there was a link table called foo_bar containing only two columns (foreign keys).

The tables were all pretty large, especially the link table totalling more than a hundred million rows. I should note that the link table did have standard b-tree indexes on both columns.

Table Nr. of rows
foo 2819724
bar 21109691
foo_bar 126167975

The query

The query was the following, written in JPQL:

delete 
 from 
 Foo foo 
where 
 foo.columnA in ... and 
 foo.columnB = ...
Enter fullscreen mode Exit fullscreen mode

Based on the above, Hibernate generated the following SQL query:

delete
 from foo_bar
where 
 foo_bar in (select 
              f1_0.id                   
             from 
              foo f1_0 
             where 
              f1_0.columnA in ... and 
              f1_0.columnB = ...
             )
Enter fullscreen mode Exit fullscreen mode

As you can see, it's actually a delete, however as you will see below, the bottleneck was a join operation in the execution plan.

With hindsight, the query could have been written as a native query, and using the CASCADE feature would allow solving it more declaratively. However, for the rest of this post I'll continue with the query that Hibernate generated, as I still think it's a good support for showing you some instruments you have when a query is slow.

I've extracted the set of parameters for an exemplar to reproduce. I've then added a transaction block around the query, such that I can rollback and no actual deletes happen.

The explain plan

When wanting to understand what exactly a query is doing to retrieve our data, we consult so-called explain plans. Let's have a look.

s3

Whilst it's clear what's slowing it down (the sequential scan reading all the contents of the large link table - the thick line in the image above on the left side), it's unexpected (at least for me). Given the presence of the indexes and the fact that the sub-select by itself only returns about ~800 rows, I had expected that we could avoid reading the entire large table like that, because given the data volume, it will always be slow.

No problem, let's see what our options are.

Looking for inspiration

Let's first get some inspiration by disabling some operations the database can use in planning, for example let's prevent it from going for hash joins. It will then have to use one of its alternatives (the other two options are nested loop and merge joins).

set enable_hashjoin = off;
Enter fullscreen mode Exit fullscreen mode

Oh - take a look at that!

s3

No more sequential scan of a hundred million rows. By disabling the hash join option, Postgres went for a much more efficient nested loop that fully utilises the indexes we've defined on the link table.

So far so good. We now know what we're after, the next question is how to get there, because disabling the hash joins like this is only meant to be for experimentation.

Hypothesis

The question to ask is - what's preventing Postgres from employing the more efficient nested loop alternative? Let's take a step back and reflect on how Postgres makes its decisions with regards to how it retrieves our data from disk.

Postgres uses a cost-based optimiser that computes the cost for the various alternatives, and then selects the best one. The costs are calculated based on various factors, including table statistics of the data and configuration properties.

From the Postgres codebase we learn that it collects a sample of 300 multiplied by the so-called default statistics target, a configuration option we can control. If you're wondering like me why the 300, it's not due the 300 Spartans confronting the Persians at Thermopylae. For the real reason, you can have a look here, where as usual, the code has a helpful comment indicating even the research paper on which the choice is based on.

For our use-case, the intuition is that due to the size of the table (hundreds of millions), the sample might be too small to get an accurate representation of the data, which might lead to less efficient planning, but let's try to confirm that with some numbers.

Checking the statistics

One of the statistics Postgres collects is an n_distinct, an estimate of the number of distinct rows in the table. Let's see this for table foo_bar:

select 
 n_distinct 
from 
 pg_stats 
where 
 tablename = 'foo_bar' and 
 attname = 'foo_id'; 

28001
Enter fullscreen mode Exit fullscreen mode

Let's now compute the actual number of distinct values and compare with the estimate. For this, I'm using the following SQL query:

select 
 count(distinct(foo_bar.foo_id)) 
from 
 foo_bar;

910322
Enter fullscreen mode Exit fullscreen mode

There we go! The entry in the pg_stats is ~30 times smaller than the actual number of distinct rows.

This is a problem because it will impact the calculation of the selectivity value used by the planner. To calculate this, Postgres uses the frequencies of the most common values (MCV) in a table. Note that how many we collect is bounded by the default statistics target value we talked about earlier. If the values we're looking for in a table are not in this list, Postgres fallbacks to using this following formula for selectivity:

selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
Enter fullscreen mode Exit fullscreen mode

Now if the values are not in the MCV (because we collected too few), and result of this is wrong because of the wrong estimate of distinct values, then indeed, the planner will not choose the best plan. We might get a nested loop when the selectivity is low, or a hash join when the selectivity is high, which we don't want.

Increasing the amount of statistics

Let's try to improve the situation by allowing Postgres to use a larger sample size in order to get better statistics. This means it will store more values in the MCV list as well as looking at more rows when determining the n_distinct. We have to keep in mind however that it will take longer for it to create plans ("Planning time" in the explain analyse output).

Let's increase the sample size from the default of 100 to a value let's say 10 times bigger, but only for one column, like so:

alter table foo_bar alter foo_id set statistics 1000;
Enter fullscreen mode Exit fullscreen mode

If we query for the n_distinct now we'll get the same value as before, it will only update after running:

analyse foo_bar;
Enter fullscreen mode Exit fullscreen mode

This will take a while. Remember, it has more work to do.

After a couple of minutes, it finished. Let's have a look if the estimate is closer to reality now.

select 
 n_distinct 
from 
 pg_stats 
where 
 tablename = 'foo_bar' and 
 attname = 'foo_id';

121894
Enter fullscreen mode Exit fullscreen mode

Progress! Still about 9 times smaller than the actual number, but let's run explain analyse now and see if we'll get the nested loop being chosen.

Bingo

s3

This executes in under a second, which is a big improvement over where we started. Nice!

Altering the amount of statistics collected was enough to significantly improve this example, however, in case you need even more control, at least for the n_distinct you can manually set it to whatever you want with the following command:

alter table tool_analysis_finding alter column tool_analysis_id set (n_distinct = ...);
Enter fullscreen mode Exit fullscreen mode

However, I would advise against going directly for this approach, because it means we won't benefit anymore from the automatic statistic collection that the database does in the background in an unattended fashion.

An alternative - changing random_page_cost

Let's look at something else. From the Postgres code we can look up other variables that come into the picture when deciding to choose an index scan over a sequential scan. For example, there's this random_page_cost that can be found here in the file costsize.c.

For a description of this setting, have a look here. Basically, it's the estimate for the cost of retrieving a page non-sequentially from disk. With modern hardware like SSDs, there isn't such a big difference between sequential retrieval and random. The default configuration of 4 is not suitable therefore. For example, Crunchy Bridge has changed this value to 1.1 for all new databases on their platform.

Let's try adjusting this to 1.1 and see what happens.

Result

s3

It worked again! We got the nested loop and sub-second execution time again - great stuff.

Conclusion

Postgres features a very clever query planner that does an excellent job finding an efficient way to retrieve our data in most of the cases. However, with some guidance from us (mainly in uncommon situations like very large tables), giving it more details about the context, or letting it use some more storage or time for its internal operations, it continues to deliver the results to our queries as fast as possible.

Further reading

. . . .
Terabox Video Player