A few days ago, a team member asked me a question about partial indexing. I think it's a good topic, so today we'll go through two frequently asked questions about it:
- What are the benefits of partial indexing?
- When should I use a partial index?
Before getting to the questions, let's cover the basics of partial indexing.
The purpose of a database index is to speed up queries: a dedicated data structure, ideally small enough to stay cached in memory, lets the database locate rows without paying the long access latency of scanning the data on disk.
The mainstream RDBMSs MySQL and PostgreSQL use B+ trees, while the document-based MongoDB uses B-trees.
Most database indexes do not store the whole row in a leaf node. Instead, the leaf holds the indexed "value" together with the primary key, so after finding the leaf, the database still has to fetch the rest of the row through the primary key.
A partial index only creates nodes for the "values" that meet a given condition. MongoDB and PostgreSQL both support partial indexes, but InnoDB, the storage engine most often used with MySQL, does not. Since MySQL 8.0 there is support for functional indexes, which can be used to approximate the effect of partial indexes.
What are the benefits of partial indexing?
With this background, it should be no surprise that the biggest advantage of a partial index is the smaller amount of memory the index occupies. Instead of creating nodes for all values, a partial index creates nodes for only some of them, so how much is saved depends on the condition used.
For example, if I have a single-field index on a with values from 0 to 9, I can save about 50% of the space overhead if I set the partial index condition to a >= 5. Note that in this simplified model, where each distinct value is one index node, the saving depends on the value range and not on the amount of data: the value a = 0 gets an index node even if only one row has it, just like a = 1 with 1000 rows.
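The example above can be sketched as a toy model in plain JavaScript, with no database involved. The buildIndex helper and the sample rows are hypothetical, and a real engine stores B-tree pages rather than a Map, so the counts only illustrate the idea.

```javascript
// Toy model: an "index" maps each distinct value of `a` to the ids of the rows holding it.
// buildIndex and the sample rows are made up for illustration.
function buildIndex(rows, filter = () => true) {
  const index = new Map();
  for (const row of rows) {
    if (!filter(row)) continue; // rows outside the partial condition get no entry
    if (!index.has(row.a)) index.set(row.a, []);
    index.get(row.a).push(row.id);
  }
  return index;
}

// 100 rows whose `a` cycles through 0..9.
const rows = Array.from({ length: 100 }, (_, id) => ({ id, a: id % 10 }));

const fullIndex = buildIndex(rows);
const partialIndex = buildIndex(rows, (row) => row.a >= 5);

console.log(fullIndex.size);    // 10 distinct values indexed
console.log(partialIndex.size); // 5 distinct values indexed
```

Half of the distinct values fall outside a >= 5, so the partial index in this model holds half as many nodes as the full one.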
In addition to saving space, query performance also improves. The query efficiency of a B-tree is O(log N), where N refers to the number of index nodes, not the number of rows.
A partial index gives us a more compact tree with, say, M nodes. Since M < N, we have log M < log N, so a partial index can improve query performance. However, the benefit is small: because of the logarithm, the effect on query performance is negligible unless the difference between N and M is significant.
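To put rough numbers on this, the sketch below counts how many levels a lookup has to descend for a given number of index entries. The treeHeight helper and the fanout of 100 keys per node are assumptions for illustration; real fanout depends on key size and page size.

```javascript
// Approximate number of levels in a balanced tree holding `entries` keys.
// A fanout of 100 keys per node is an assumed value, not taken from any engine.
function treeHeight(entries, fanout = 100) {
  let height = 1;
  let capacity = fanout; // keys reachable with `height` levels
  while (capacity < entries) {
    capacity *= fanout;
    height += 1;
  }
  return height;
}

console.log(treeHeight(1e7)); // full index, N = 10 million      -> 4
console.log(treeHeight(5e6)); // partial index, M = 5 million     -> 4
console.log(treeHeight(5e4)); // partial index, M = 50 thousand   -> 3
```

Under these assumptions, halving the index does not remove a single level, and even a 200-fold reduction removes only one.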
To sum up, the biggest advantage of partial indexing is the reduced space overhead of the index, with a modest improvement in query performance on top.
When should I use a partial index?
To figure out when to use one, we first need to know what happens when the query results may contain data that doesn't meet the partial index condition.
A bit abstract, right? Let's use MongoDB as an example.
First, we build a partial index on the name field of the users collection that covers only adults, in order to make queries for adults faster.
db.users.createIndex(
{ name: 1 },
{ partialFilterExpression: { age: { $gte: 18 } } }
)
However, if our scenario also needs to query by name without an age limit, then such a query will not use the index.
db.users.find({name: "Tom"})
For instance, if there are three Toms aged 8, 18 and 28, then index entries are created for the Toms aged 18 and 28 but not for the Tom aged 8. The MongoDB optimizer knows the index would return an incomplete result for this query, so it will not use the partial index and falls back to a full collection scan instead.
Yes, that's the price. As soon as a query can match data outside the range covered by the partial index, it becomes a full collection scan.
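The rule behind this is that a query can only use a partial index if its own condition guarantees that every match lies inside the index's filter. Below is a much simplified model of that check for the age >= 18 filter above, in plain JavaScript. The canUsePartialIndex helper is hypothetical and handles only equality, $gte and $gt on age; it is not how MongoDB implements its planner.

```javascript
// Simplified model: the query's condition on `age` must imply age >= 18.
// Hypothetical helper for illustration, not MongoDB code.
const MIN_AGE = 18;

function canUsePartialIndex(query) {
  const age = query.age;
  if (age === undefined || age === null) return false; // no age condition: minors might match
  if (typeof age === "number") return age >= MIN_AGE;   // e.g. { age: 30 }
  if (typeof age.$gte === "number") return age.$gte >= MIN_AGE;
  if (typeof age.$gt === "number") return age.$gt >= MIN_AGE;
  return false; // anything else: stay conservative and skip the index
}

console.log(canUsePartialIndex({ name: "Tom" }));                       // false
console.log(canUsePartialIndex({ name: "Tom", age: { $gte: 21 } }));    // true
console.log(canUsePartialIndex({ name: "Tom", age: { $gte: 10 } }));    // false
console.log(canUsePartialIndex({ name: "Tom", age: 30 }));              // true
```

In other words, adding an age condition that stays within the filter, such as age >= 21, is what makes the partial index eligible again.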
So what kind of scenario is suitable for partial indexing? In my opinion, two conditions need to be satisfied.
- The value range is large, but the range that is commonly queried is small.
- Queries that cannot use the index will not impact the database too much, e.g., because they are infrequent or the collection is small.
The second point is simpler to understand, but what does the first point exactly mean?
From the previous section, we know the biggest advantage of partial indexing is reducing index space rather than improving query performance, so the most effective scenario is one where the data is spread over a wide range and the application only uses a small, common fraction of it.
Take a practical example of a membership system.
From registration to membership, a user may need to go through many steps, and only becomes a member after email verification. The system basically does not care about non-members; it only processes verified member data.
Then the partial index can be designed like this.
db.users.createIndex(
{ name: 1 },
{ partialFilterExpression: { email: { $exists: true } } }
)
This works because the membership system only cares about member information, i.e., users who have an email. A large number of users may start the registration process but never pass verification, which leaves a large amount of data that does not need to be indexed. In this way, the index space is reduced as much as possible, and member queries become more efficient.
Conclusion
After all, there is no perfect solution. Partial indexing saves index space and slightly improves query performance, but the price is that queries reaching outside the indexed range fall back to a full table scan.
Therefore, it is important to understand how partial indexes work and which scenarios they suit.
The effectiveness of an indexing mechanism depends entirely on the data distribution and the characteristics of the application, whether it is a regular index or a partial index. Therefore, this article can't give you a one-size-fits-all solution; it can only offer a guideline, and the trade-off is yours to make.