Cardinality Counting in Redis

ChunTing Wu - Dec 22 '21 - - Dev Community

Cardinality counting is used to calculate the amount of elements without any duplication. In Redis, there are many data structures able to accomplish this job. However, what is the most applicable way for your use cases? This article will show the consideration behind the technical selection.

User Scenario

Suppose we need to get the failure rate in a sensor network to investigate the reporting qualities. Therefore, we have to record the health status in hours by the incoming requests.

The key point is to simplify the process, we don't want to get the value first, determine whether it is existing, and then insert the record like:

figure 1

Instead, we should insert the record every time, and the storage can de-duplicate for us. Or, we can limited pre-process data to make the storage do faster.

Assume that we have a sensor A, and the sensor requested to the server at 1/2 1:11, 1/3 2:22, and 1/8 3:00.

Alright, let's see how Redis did cardinality counting.

Set

The basic idea is using set. Before adding to the set, we have to pre-process the date. Due to our requirement, we only keep the hour without minutes and seconds.

const date1 = new Date(2021, 0, 2, 1, 0);
const d1 = date1.toISOString(); 
Enter fullscreen mode Exit fullscreen mode

Then, we can add d1 to the set via SADD.

SADD sensorA "2021-01-02T01:00:00.000Z"
SADD sensorA "2021-01-03T02:00:00.000Z"
SADD sensorA "2021-01-08T03:00:00.000Z"
Enter fullscreen mode Exit fullscreen mode

In order to get the health status, we can use SCARD.

SCARD sensorA
> 3
Enter fullscreen mode Exit fullscreen mode

The implementation of using set is simple; nevertheless, if we want to count the health status during the specific period like in 2021, set cannot handle this request.

Sorted Set

Therefore, if we would like to meet the needs of specific time period and whole time, we can leverage sorted set. The implementation is similar to the set. Firstly, pre-process the date.

const date1 = new Date(2021, 0, 2, 1, 0);
const d1 = date1.getTime();
Enter fullscreen mode Exit fullscreen mode

It is unlike using ISO string, we are using epoch here to find the specific time range easily. Now, we can add to the sorted set via ZADD.

ZADD sensorA 1609520400000 1609520400000
ZADD sensorA 1609610400000 1609610400000
ZADD sensorA 1610046000000 1610046000000
Enter fullscreen mode Exit fullscreen mode

To find whole count in it:

ZCARD sensorA
> 3
Enter fullscreen mode Exit fullscreen mode

On the other hand, in order to search the specific time range, we assign the start and end in ZCOUNT.

ZCOUNT sensorA 1609520400000 1610040000000
> 2
Enter fullscreen mode Exit fullscreen mode

Bitmap

We walked through two approaches, but neither set nor sorted set are space efficiency. The detail implementation in Redis takes a huge space to indicate the data structure. When the amount of sensors become larger or the duration of records become longer, the space in Redis will grow quickly.

How to reduce the space? We can leverage the extended function of string, bitmap. Bitmap is very space efficiency, each bit takes 1 bit as its meaning.

But the pre-process is a little complicated, we have to get an offset to operate bits. For example, we can calculate the diff hours between the service launch time and the current time, e.g. 2021/1/2 1:11.

const base = new Date(2021, 0, 1, 0, 0);
const date1 = new Date(2021, 0, 2, 1, 11);
const diffTime = Math.abs(date1 - base);
const diffHours = Math.ceil(diffTime / (1000 * 60 * 60));
Enter fullscreen mode Exit fullscreen mode

After that, make the offset be 1.

SETBIT sensorA 26 1
SETBIT sensorA 51 1
SETBIT sensorA 171 1
Enter fullscreen mode Exit fullscreen mode

Hence we can get the total health status by BITCOUNT.

BITCOUNT sensorA
> 3
Enter fullscreen mode Exit fullscreen mode

BITCOUNT also provides the range match, so we can assign start and end to search a specific time range like sorted set. It is noteworthy that the start and end here represents the bytes offset. We have to convert the start time and end time to the diff hours bytes, the calculation is complex, so I am not going to provide an example in this article to prevent losing focus.

HyperLogLog

The final approach is called hyperloglog. This is an algorithm for big data statistics. Redis provides it as the built-in method.

Whether it is set, sorted set or bitmap, the space usage will larger and larger as time goes by. For instance, if we keep the health status for 10 years, even bitmap will take huge space, 365 * 10 * 24 / 1024 ~ 85.5 KB.

However, in hyperloglog, the space usage is constant. No matter how long the retention you need, hyperloglog takes 12 KB constantly. And the pre-process is like set,

const date1 = new Date(2021, 0, 2, 1, 0);
const d1 = date1.toISOString();
Enter fullscreen mode Exit fullscreen mode

Then, we can add the date to hyperloglog via PFADD.

PFADD sensorA "2021-01-02T01:00:00.000Z"
PFADD sensorA "2021-01-03T02:00:00.000Z"
PFADD sensorA "2021-01-08T03:00:00.000Z"
Enter fullscreen mode Exit fullscreen mode

It is simple to get the total count.

PFOCUNT sensorA
> 3
Enter fullscreen mode Exit fullscreen mode

Hyperloglog is not quite precise, the result of PFCOUNT may contain some deviations when the data set is huge, but the performance is pretty well.

Conclusion

Let's summarize those 4 approaches.

Set Sorted Set Bitmap HyperLogLog
Implementation effort low low high low
Specific time range V V
Space cost high high low to medium low

The example in this article is trivial, however, I believe you can know the concepts of those approaches. The most important thing is each approach has its own strength and drawback, using them in a smart way is developer's responsibility.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terabox Video Player