Redis & Hyperloglog

Redis is amazing. It seems very ordinary. It describes itself as an “in-memory data store”. So far so good. But, it’s also used as “database, cache, streaming engine, and message broker”. Keeping this list in mind gives you an idea of what to expect when trying to understand the what and how of it. It’s like having a lorry, car, bus, plane and rocket all in one vehicle.

One data structure it uses is HyperLogLog. This was not invented by Redis but it uses it. LogLog is beautifully summarized by the authors:

“Using an auxiliary memory smaller than the size of this abstract, the LogLog algorithm makes it possible to estimate in a single pass and within a few percents the number of different words in the whole of shakespeare’s works”

This means it provides a method to count the number of distinct elements (cardinality) a set, here a file, contains. LogLog gets it’s name from the memory needed, LogLog Nmax where N stands for an upper bound on cardinalities. It was also designed to be used distributed or parallelized XD.

It’s used when storing the actual unique data is not possible due to the enormous amount of it. It estimates how much unique data there is. It initializes a table called register of binary items, hashes elements, inspects the binary hash of it in terms of leading zeros to determine where the register should be updated. It then uses a formula to calculate the number of unique items; based on the longest number of leading zeros it observed and the elements of the register. It reminds us of how a bloom filter is implemented. The size of the table/register it used for the shakespeare’s works is 256 bytes of 4 bits.

Super loglog is the best version of the loglog algorithm. Hyper loglog is a more accurate version of it. There is a lot of stats at work in this, for once, an interesting algorithm to dive in.

So, Redis allows you to use Hyperloglog as a supported structure. If you need to count unique elements which occur in massively great numbers, it seems a great option!