When working with databases, it's fundamental that we understand how to optimize our queries. One of the most important aspects of scaling our use cases is being able to retrieve information as fast as possible. Before we dive in specifics, let's have an overview on how databases typically store information.
When we run an INSERT
statement, a new row is created under a table. However, this is just a logical representation of data that is stored in disk. Imagine you are implementing a database and, for the sake of simplicity, the inserted data end up in a simple text file, separated by rows and commas.
1, Adam, EMPLOYEE
2, Anne, EMPLOYEE
3, John, CUSTOMER
4, Jane, CUSTOMER
5, Cris, EMPLOYEE
...
How would you implement the logic to fetch a user by ID? The most obvious answer would be to just traverse the rows and look for the searched ID, which has a linear time complexity. Maybe this works fine for our small file, but what if we have 20 million users? Every query would need to traverse potentially millions of rows to find a single user.
In a database management system, this would mean a sequential access in a full table scan. It would be much easier if the database already knows where to fetch the data so it doesn't need to scan the entire file, right? Fortunately, it exists and we will see how to get the most advantages of it. Let's see how indexes work.
Database Indexes
We already saw what we don't want to happen when looking for data: sequential access in a table scan. Indexes are used to avoid this scenario by providing an additional data structure that provides references to data in disk.
The typical implementation of indexes use variations of the B-Tree data structure to store references to pages where actual data lives. Since they are balanced, it's possible to use binary search, which has logarithmic time complexity and is generally faster than linear complexity.
In this example, let's say we are looking for user with ID 15. We start reading from the root node. As we perform a search inside of it, we fall into the reference between 7 and 16, which brings us to the next node, which is a leaf node. It contains the ID we are looking for and provides the address of the page that the database needs to load to access the data we need.
This ensures that we don't need to traverse all our dataset anymore to find a user by ID.
A practical example
This example uses PostgreSQL running in a Docker image. To follow, you can simply run:
docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres
docker exec -it some-postgres /bin/sh
Let's first define our table without any index and insert some data.
CREATE TABLE users (
id INTEGER,
name VARCHAR(100),
type VARCHAR(100)
);
INSERT INTO users (id, name, type) VALUES
(1, 'Alice', 'EMPLOYEE'),
(2, 'Bob', 'CUSTOMER'),
(3, 'Charlie', 'EMPLOYEE'),
(4, 'David', 'CUSTOMER'),
(5, 'Eve', 'EMPLOYEE'),
(6, 'Frank', 'CUSTOMER'),
(7, 'Grace', 'EMPLOYEE'),
(8, 'Hank', 'CUSTOMER'),
(9, 'Ivy', 'EMPLOYEE'),
(10, 'Jack', 'CUSTOMER');
Now, let's run a EXPLAIN ANALYZE statement to understand what happens when searching for a user with ID=7:
EXPLAIN ANALYZE SELECT * FROM users WHERE id=7;
It will return something like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..12.12 rows=1 width=440) (actual time=0.087..0.092 rows=1 loops=1)
Filter: (id = 7)
Rows Removed by Filter: 9
Planning Time: 0.121 ms
Execution Time: 0.119 ms
(5 rows)
Take a look at the first statement, it clearly says that the database will run a sequential scan on the users table (Seq Scan on users). As we saw, this is something we want to avoid.
Let's now create an index for our table:
ALTER TABLE users ADD PRIMARY KEY (id);
If we run the EXPLAIN ANALYZE statement again, we will see that the database still did the sequential scan. But why does that happened? Most likely, since our data is very small, reading the entire table might be less costly than using the index, since it would need to find the reference to the data and load the entire page anyway, there would be additional overhead, so it's just simpler to only scan the entire table. The database is smart enough to plan the query according to the context we currently have. Now, let's change our scenario and add 1 million users to our table:
DELETE FROM users WHERE 1=1;
INSERT INTO users (id, name, type)
SELECT
i,
'User_' || i,
CASE WHEN i % 2 = 0 THEN 'EMPLOYEE' ELSE 'CUSTOMER' END
FROM
generate_series(1, 1000000) AS i;
Let's see how the database plans our query this time:
EXPLAIN ANALYZE SELECT * FROM users WHERE id=20;
This will return:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Index Scan using users_pkey on users (cost=0.42..8.44 rows=1 width=24) (actual time=0.050..0.052 rows=1 loops=1)
Index Cond: (id = 20)
Planning Time: 0.130 ms
Execution Time: 0.219 ms
(4 rows)
Now, we can see that the database chose to use an index scan (Index Scan using users_pkey). This effectively means that, in order to find our user, the database performed a binary search in the index structure to locate the actual data.
Conclusion
In this article, we saw how indexes can power our databases and speed-up reading operations. We explored how indexes use variations of B-Trees to use the advantages of binary search algorithm over sequential search. Finally, we explored in practice how the database plans the query execution on tables with and without indexes.