You're a web developer helping to build a growing SaaS application. You've got your back-end programming language and framework under control. You know about building models, controllers, and views.
Your web framework of choice handles all the database stuff behind the scenes. You're most likely using a relational database like PostgreSQL. It's all good.
However, your user base has grown. Your product has landed larger customers with more users.
Now you're facing some new issues with your application:
- Certain pages on your site take forever to load
- That global search feature in your app is totally useless because it takes too long to load
- Those gnarly multi-filter reports run fine for simple combinations, but add a few more filters and your report crawls to a halt
- Customer-facing analytics/reports for your biggest customers is near impossible to query against without taking minutes to run
These are all usually symptoms of database issues. Some common causes include:
- Poorly written queries
- Lack of or misplaced indexes
- Data structures that don't lend themselves to decent performance
- Using the wrong database tool for the job
- Approaches that were fine for expected smaller loads but have outlived their usefulness
I want to take you through some of the database techniques, tricks and solutions I wish I knew years ago as a developer helping build multiple SaaS applications. I'll focus on using PostgreSQL as my database of choice - but many of these ideas are transferable.
We'll start this series with indexing.
Indexing
I remember some of my earlier jobs trying to get certain queries to perform well.
"Maybe if I just change how I'm joining tables it'll go faster?"
"Maybe if I don't select as much data it'll be faster?"
It was like black magic where you had to use trial-and-error to figure out if a query could make things faster.
I remember specific analytical reports that I couldn't get to work very efficiently. Now that I understand how to use indexes well I know what I should/could have done 🥲.
One of the learnings that propelled my ability to create efficient SQL queries and build pages/reports that load quickly is learning why and when to apply indexes.
Why Indexes Work
I don't think I need to dig into this too much: just like an index at the back of a book - database indexes are a structure that allow searching through many records really fast.
If I want to find which pages in a book talk about "gravity", then instead of searching through every single page, I can look at the index to see if this topic is listed and which pages I should turn to!
Databases have many types of indexes: but understanding the basic index types like b-tree indexes can get you a long way. These are called "b-tree" in PostgreSQL. In SQL Server most indexes are b-tree structures.
Index Basics
In an index for a book, you have generally two different pieces to consider:
- The book to index
- The term to index
The pages that you need to turn to are the "values" of the index that tell you where to look.
In a database, the parts of an index that you need to know are similar:
- The table to index
- The column to index
The "pages" that the query engine needs to look up are found using the values of those indexes.
Note: Yes, in some databases "pages" are ironically the same term for where our rows/records are stored.
In reality, the index doesn't look like the above. It might look like a b-tree. Or, it might be stored some other way as there are many different types of indexes.
For example, hash indexes work well for looking up a record by a unique string
(like an email address):
The value
stored is a pointer to where on the database's file storage the row/tuple/record with that specific username
is. The database engine can "look-up" that disk location super quick.
Otherwise, the database engine would have to loop through all the records in the table to figure out which one has that specific username.
Multi-Column Indexes
Usually, in a book's index, each index is for one term. But imagine (and it probably exists) a book where there are multiple terms that are related to each other quite often: like "space" and "time".
There might be a "multi-term index" where each index helps you to find sentences where both terms are used.
Similarly, you can index multiple database columns in the same index.
create index my_index on orders (user_id, completed);
This type of index can help to support more complex querying scenarios.
How And When To Use An Index
Here are the common cases when it probably makes sense to add an index to a column on a relational table with some tips sprinkled throughout.
Looking up specific records by id
Consider this query:
select * from users where id = 5;
If id
is a primary key, in all relational databases an index on the id
column is already created for you. This one's easy - you're already covered!
If an index didn't exist on the id
column, then this query would have to loop through every single record in the table until it finds the correct record.
What many developers miss - which will help with later tips - is the same principle applies to all queries including delete
and update
.
Consider these two statements:
delete from users
where id = 5;
update users
set age = 22
where id = 5;
Both of these statements have to first find the row that has id
5.
Because id
is a primary key, it's already indexed.
The kicker is this: if there wasn't an index on id then these statements would have to loop through the entire table until it finds the row with id
5.
Without an index, the select
, delete
and update
would be really slow on any non-trivially sized table.
This idea is foundational to the following tips and will unlock your database skills because it applies to all types of queries and statements.
Selecting a subset of records
Consider this query:
select * from orders
where user_id = 5;
Just like querying against the id
column, creating an index on the user_id
column will greatly speed this up in any non-trivially sized table.
create index my_index orders (user_id);
Remember the principle that indexes don't just apply to select
queries but to any time that the query engine needs to find records?
This index would apply to the following delete
and update
statements:
delete from orders
where user_id = 5;
update orders
set price = 5.33
where user_id = 5;
Note: This is why answering the question "should I add an index" is not always straightforward. The answer depends on whether use said column in delete
, update
or select
statements. As you'll read later, impacts on write performance, disk usage, etc. also impact this decision!
Conditions On Multiple Columns
select *
from users
where work_remote = true
and tenant_id = 5;
In this query, we're looking for all users that are owned by a specific SaaS tenant where each user is working remotely.
Now that the query is filtering by more than one column the possibilities of what you can/should do become more "interesting".
What are your options?
- No indexes
- Create an index on
tenant_id
- Create an index on
work_remote
- Create both of the indexes above
- Create a multi-column index on (
tenant_id
,work_remote
) - Create a multi-column index on (
work_remote
,tenant_id
)
That's a lot of options! And it grows even more if we add another column to the mix!
This is where the trade-offs of using indexes play into your decision. Here are some general guidelines that might be helpful as a starting point:
- Is your table only few thousand records in length?
- It might be best to not use any indexes
- Or just index
tenant_id
as a high-value lower impact change and monitor
- Do you have many other queries that use these columns in different combinations?
- Might be best to consider using two separate indexes
- Do you need this query to perform as well as it possibly could (and these tables are more than a few thousand records)?
- Create an index with the order (tenant_id, work_remote)
Note: Some databases are really poor at applying multiple indexes in the same query. It's good to benchmark these things vs. assuming things and understand what indexing optimizations your database supports. For example, PostgreSQL supports using bitmap index scans where many separate indexes can be combined and used by the same query. Not every database supports this.
Order Matters?
The order of columns defined in a multi-column index make all the difference.
That is, order matters in a b-tree index. For other specialized index types, it might not.
Why? The purpose of an index is to eliminate as many non-matching records as quickly as possible. You always want the first column in an index to be the most specific of the columns. This will whittle down the pool of possible records that match your criteria faster.
In the image above, we want to whittle down our possible records as quickly as possible.
The first option with work_remote
as the leading indexed column doesn't really help us. Best case, we still have 1 million records left to search through. We've only split the possible records to search by two.
The second option using tenant_id
will more quickly reduce the number of possible records. We're able to quickly grab a very selective subset of all the records: only the records belonging to tenant_id
5 will be fetched from the table.
Index columns used in joins
Usually, queries with join
will join
on columns that are foreign keys.
Consider this query:
select *
from users u
inner join addresses a
on u.id = a.user_id;
This is a silly query to run because it grabs data for every single user.
What does the query engine need to do?
- Loop through every record in the users table.
- For each
user
record, find theaddress
record who'suser_id
is the same value as theid
column on the users table. - Return the data to the database client.
For no. 2 above, each time the query engine needs to find a specific record in the address
table it will loop through all the records in that table.
Let me repeat: the query will loop through the entire addresses
table for every single user record.
If both tables are large: this could take a while. You'll have a really slow query.
Joins With Foreign Keys
Luckily for you, some databases will automatically create an index on all foreign keys that you create! So no worries, right?
Oh. But some databases won't do this.
We're all looking at you PostgreSQL:
... it is often a good idea to index the referencing columns too. Because this is not always needed, and there are many choices available on how to index, the declaration of a foreign key constraint does not automatically create an index on the referencing columns.
Oh ya. SQL Server too:
Creating a foreign key constraint doesn't automatically create a corresponding index.
In contrast, MySQL will create an index on a foreign key column for you:
Such an index is created on the referencing table automatically if it does not exist.
Foreign Key Joins In Update And Delete Statements
Again, this idea of adding an index to foreign keys affects update
and delete
statements. Consider the following:
delete
from addresses a
inner join users u
on u.id = a.user_id
where u.work_remote = true;
This statement would delete all addresses where the corresponding user works remotely. If there's no index on foreign key column addresses.user_id
then that table would need to do a full scan (e.g. full loop through the table) for every single user record that works remotely.
In PostgreSQL and SQL Server, you'd definitely want to be aware of this.
Note: It's also worth mentioning that some ORMs will automatically create an index for all foreign keys you create, like Entity Framework! I pays to understand the tools that you're working with 🥲.
Joins Using Non-Foreign Key Columns
I'd say this is rare in many SaaS applications. But, it does happen in analytical queries, unstructured data, etc.
Consider the following:
select *
from account_codes codes
inner join imported_codes imported
on codes.code = imported.code
where imported.tenant_id = 4;
Imagine that you have some account codes (whatever that is) in our system. We've also imported codes from a 3rd party system that doesn't match up with our data by any IDs.
In this join, you could (most likely) significantly speed up the query by indexing both sides:
create index on account_codes (code);
create index on imported_codes (code);
Cascading Operations
Consider this schema:
create table orders (
id bigint primary key,
placed_at timestamp with time zone not null,
user_id bigint references users (id) on delete cascade
)
Let's focus on the foreign key definition.
on delete cascade
means that whenever a user record is deleted any referencing records in orders
will automatically be deleted too.
Imagine user
with id
5 is deleted. The database has to find all orders where user_id = 5
to delete them.
If there's no index on user_id
then this will require a full table scan. If you tried to delete many user records, then this means many full tables scans against the orders
table!
Note: If you ever have a page in your application that deletes records and it takes longer than expected - it could be there are some cascading deletes that might benefit from some indexes!
Index Trade-Offs
Indexes aren't magic. As with all things in software development: there are trade-offs.
Let's shoot through some of the important trade-offs:
Storage
Indexes use a lot of disk space. In many cases, one index can use the same space as the table it's indexing. Yup.
Once you start using multi-column indexes and specialized index types then your indexes can be even larger than your tables!
The benefit is that you can significantly speed up certain queries or operations like deleting and updating records. But if you aren't careful and considerate of this point, you can quickly cause your database size to expand significantly.
This can lead to slow backups, slow performance due to memory pressure, etc.
Memory Pressure
Databases will load the most often used objects into available RAM to speed things up (tables, indexes, etc.). However, if you create too many indexes then the amount of data that your database needs to keep in memory can start to exceed available RAM.
Not entire indexes are kept in memory - only the most often used disk pages, to be specific. But the point still applies: you're forcing the database to require storing too much "stuff" in RAM. This can start to cause severe issues if you're not careful!
Write Performance
Indexes have to be in sync with the table that they index. This is one way that relational databases keep the data consistent and referential integrity in place.
Every time a new record is inserted, updated or deleted, the corresponding index entries across all indexes have to be modified/added.
The more indexes you have the more work has to be done to insert
, update
or delete
a record.
This is why guidance around bulk loading data usually recommends that you turn off all indexes, load your data and then turn the indexes back on again.
If you have tables where you need data modifications to be super fast then consider keeping your usage of indexes at a minimum.
More Index Types
You don't need to understand how all the different index types work.
However, it's really useful to have a cursory knowledge of what problems specialized indexes solve. If you come across these problems then you'll remember that there is a solution. Then, you can get into the specifics.
Here's a quick rundown of some more specialized indexes available in PostgreSQL and what problems they solve:
-
Hash index
- Fast equality checks. Really useful for scanning against large unique string/text values.
create index user_email on users using hash (email);
-
GIN index (Generalized Inverted Index)
- Useful for full-text search and indexing JSON data.
create index gin_docs on documents using gin (to_tsvector('english', text));
-
Expression index
- Supports indexing usage of functions within queries
create index lower_email on users (lower(email));
-
Covering index
- Supports squeezing as much performance as possible out of very specific queries.
create index covering_orders on orders using btree (user_id, ordered_at desc) include (special_instructions, coupon_code);
-
Bloom index
- You know those pesky reports that have a million different filters the user can choose? Ya - this is for that 🤣.
create index my_bloom on report_data using bloom (index1, index2, index3, index4, index5);
-
BRIN index
- Useful when you have a table that is append-only
create index my_brin on facts using brin (created_at);
My Final Tip
There's much to say about standards and how they give you transferable knowledge as a developer.
But, the fact is: learning in-depth details about how your specific database works does have a massive impact on how you think, design, code and troubleshoot databases.
You don't need to become a database expert. But... I mean... all of your slow pages, reports, slow backups, large disk usage, downtime, etc. are caused by your database!
For real applications with a legit user base: databases are important. Critical.
That doesn't mean you can't get by with a general purpose database using general design and coding guidelines. But if you ever land those big customers that push your application - your database is going to be the bottleneck.
Learn how it works.
A great place to start are the PostgreSQL official docs. It's some of the best written documentation in the industry. The docs really get into the nitty-gritty and help you to understand what's happening at a lower level. Much of this knowledge, in my experience, is transferable.
Start with the documentation on indexes. Take what you've learned here and dig into some more!