In this article, I will show you what is the ideal pattern of a MongoDB shard key. Although there is a good page on the MongoDB official manual, it still not provides a formula to choose a shard key.
TL;DR
The formula is
{coarselyAscending : 1, search : 1}
I will explain the reason in the following sections.
User Scenario
In order to well-describe the formula, I will use an example to illustrate the scenario. There is a collection within application logs, and the format is like:
{
"id": "4df16cf0-2699-410f-a07e-ca0bc3d3e153",
"type": "app",
"level": "high",
"ts": 1635132899,
"msg": "Database crash"
}
Each log has the same template, id
is a UUID
, ts
is an epoch, and both type
and level
are a finite enumeration. I will leverage the terminologies in the official manual to explain some incorrect designs.
Low Cardinality Shard Key
From the mentioned example, we usually choose type
at first sight. Because, we always use type
to identify the logging scope. However, if we choose the type
as the shard key, it must encounter a hot-spot problem. Hot-spot problem means there is a shard size much larger than others. For example, there are 3 shards corresponding to 3 types of logs, app, web and admin, the most popular user is on app. Therefore, the shard size with app log will be very large. Furthermore, due to the low-cardinality shard key, the shards cannot be rebalanced anymore.
Ascending Shard Key
Alright, if type
cannot be the shard key, how about ts
? We always search for the most recently logs, and ts
are fully uniform distributed, it should be a proper choice. Actually, no. When the shard key is an ascending data, it works at the very first time. Nevertheless, it will result in a performance impact soon. The reason is ts
is always ascending, so the data will always insert into the last shard. The last shard will be rebalanced frequently. Worst of all, the query pattern used to search from the last shard as well, i.e. the search will often be the rebalance period.
Random Shard Key
Based on the previous sections, we know type
, level
and ts
all are not good shard key candidates. Thus, we can use id
as the shard key, so that we can spread the data evenly without frequent changes. This approach will work fine when the data set is limited. After the data set becomes huge, the overhead of rebalance will be very high. Because the data is random, MongoDB has to random access the data while rebalancing. On the other hand, if the data is ascending, MongoDB can retrieve the data chunks via the sequential access.
Solution
A good MongoDB shard key should be like:
{coarselyAscending : 1, search : 1}
In order to prevent the random access, we choose the coarsely ascending data be the former. This pick also won't meet the problem of frequently rebalancing. And we put a search pattern on the latter to ensure the related data can be located at the same shard as much as possible. In our example, I will not only choose the shard key but also redesign our search pattern. The ts
is fine to address the log at the specific time; however, it is a bit inefficient for a time range query like from 3 month ago til now. Hence, I will add one more key, month
, in the document, so we therefore can leverage the MongoDB date type and make a proper shard key. The collection will be:
{
"id": "4df16cf0-2699-410f-a07e-ca0bc3d3e153",
"type": "app",
"level": "high",
"ts": 1635132899,
"msg": "Database crash",
"month": new Date(2021, 10) // only month
}
And, the shard key is {month: 1, type: 1}
.
The key point here is we use month
instead of ts
to avoid frequently rebalaning. The month
is not made just for the shard key; on the contrary, we also use it for our search pattern. Instead of calculating the relationship between timestamp and the date, we can use getMonth
to find results faster. For instance,
var d = new Date();
d.setMonth(d.getMonth() - 1); //1 month ago
db.data.find({month:{$gte: d}});
To sum up, this article provides the concepts of designing MongoDB shard key. You might not have a coarsely ascending data so, but you can refer to the concepts and find out a proper key design for your applications.