Data Structures

HyperLogLog: Counting Unique Elements in O(1) Memory

Цели урока

  • Understand the cardinality estimation problem
  • Grasp the idea: leading zeros as an indicator of count
  • Implement HyperLogLog with buckets
  • Use Redis PFCOUNT

Предварительные знания

  • Bloom Filter
  • Hash functions
  • Bloom Filter

Google handles billions of search queries a day. How do you find out how many unique queries there were in a month without storing them all? HyperLogLog answers with 99% accuracy using only 12 kilobytes.

  • Redis PFCOUNT - built-in HyperLogLog
  • Google BigQuery APPROX_COUNT_DISTINCT
  • Apache Spark - approximate distinct count
  • Analytics systems - unique users

Flajolet's near-optimal estimator

HyperLogLog was published in 2007 by Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier in the paper 'HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm'. It was the last in a line of estimators. The Flajolet-Martin algorithm of 1985 introduced the core trick of using the position of the lowest set bit in a hash to gauge cardinality. The LogLog algorithm of Marianne Durand and Philippe Flajolet in 2003 added the idea of splitting the stream into many buckets and averaging their estimates. HyperLogLog refined this by replacing the geometric mean with a harmonic mean, which sharply reduced the influence of outlier buckets and brought the standard error down to about 1.04 over the square root of the bucket count. The result counts billions of distinct items in roughly 12 kilobytes. Today it powers Redis PFCOUNT and BigQuery's APPROX_COUNT_DISTINCT, among many other systems.

The Problem: Counting Unique Elements

**The task:** how many unique visitors does a site have? How many unique IPs? How many unique queries?

**Redis PFCOUNT** uses HyperLogLog: 12 KB for any number of elements.

HyperLogLog uses fixed memory because:

The Idea: The Probability of Rare Events

**Observation:** if you hash elements into a binary representation, rare patterns reveal how many elements there are.

**Intuition:** the more unique elements there are, the higher the chance of seeing a 'rare' hash with many leading zeros. The maximum number of zeros is about log₂(n).

If the maximum number of leading zeros is 10, the rough estimate of the element count is:

The Improvement: Many Buckets

**Problem:** a single counter has high variance. Solution: many independent estimates.

**Why the harmonic mean?** The arithmetic mean is distorted by outliers. The harmonic mean is robust against rare large values.

**Memory:** 16384 buckets × 6 bits = 12 KB. Accuracy: σ ≈ 1.04/√m ≈ 0.81%.

More buckets in HyperLogLog means:

Implementing HyperLogLog

The complexity of the add() operation in HyperLogLog:

Applications of HyperLogLog

  • **Redis PFCOUNT** - counting unique keys and values
  • **Google BigQuery** - APPROX_COUNT_DISTINCT
  • **PostgreSQL** - an extension for cardinality estimation
  • **Analytics** - unique visitors, sessions, events
  • **Network monitoring** - unique IPs and connections

**Merging HLLs** - you can combine several HyperLogLogs into one. Handy for time aggregation (days → weeks → months).

The advantage of HyperLogLog over a HashSet for counting unique visitors:

HyperLogLog

  • Estimates the count of unique elements in O(1) memory
  • Idea: rare hashes (many zeros) point to a large n
  • Many buckets → harmonic mean → ~1% accuracy
  • Redis: 12 KB for any number of elements
  • You can merge several HLLs into one

Probabilistic structures

HyperLogLog belongs to the family of sketch algorithms. Others include Count-Min Sketch (frequencies), Bloom Filter (membership), and MinHash (set similarity).

  • Bloom Filter — Membership testing

Связанные уроки

  • ds-24-bloom-filter — Both are probabilistic sketches using minimal memory
  • ds-06-hash-intro — Hashing spreads elements to estimate leading-zero patterns
  • prob-07-expectation — The estimator is built on expected leading-zero counts
  • db-19-redis — Redis ships PFCOUNT backed by HyperLogLog
  • stat-02-estimation — Cardinality estimation is statistical point estimation
  • crypto-19-hash-functions
HyperLogLog: Counting Unique Elements in O(1) Memory

0

1

Sign In