Algorithms

Rabin-Karp: String Hashing

Цели урока

  • Understand the hash search idea
  • Implement Rolling Hash
  • Understand the collision problem
  • Apply for multiple pattern search

Предварительные знания

  • Naive substring search
  • Substring Search: Introduction

Why compare 1000 characters when you can compare 2 numbers? Rabin-Karp uses the mathematical magic of hashing for lightning-fast search.

  • Plagiarism detection in texts
  • Antivirus (signature search)
  • Data deduplication
  • Similar image search (perceptual hashing)

Hashing turns string search into arithmetic

Richard Karp and Michael Rabin published their algorithm in 1987 in a paper titled Efficient randomized pattern-matching algorithms. Their idea was different from KMP: instead of comparing characters, compare hashes. Hash the pattern once, then hash each window of the text and compare numbers. The trick that makes it fast is the rolling hash. When the window slides one character to the right, the hash is not recomputed from scratch. The algorithm subtracts the contribution of the outgoing character and adds the incoming one in constant time, treating the window as a number in some base modulo a prime. A hash match is then verified character by character to rule out collisions. Average time is linear, and the same rolling-hash idea later powered multi-pattern search and content-defined chunking in systems like rsync.

Idea: compare hashes

0

1

Sign In

**Rabin-Karp's idea:** instead of character-by-character comparison - compare hashes. If hashes differ - strings are definitely different. If equal - verify.

**Problem:** computing a hash for each position is O(m), total O(nm). No better than naive!

**Solution:** Rolling Hash - recompute hash in O(1) on each shift.

If hash(pattern) ≠ hash(text[i:i+m]), then:

Rolling Hash

**Polynomial hash:** represent a string as a number in base-`base`.

**Rolling update:**

**Collisions:** different strings can have the same hash. When hashes are equal, ALWAYS verify the strings!

Why is mod (modulus) needed?

Rabin-Karp Implementation

CaseComplexityReason
Best/AverageO(n + m)Few collisions
WorstO(nm)All hashes match (many collisions)

**Double hashing:** use two hashes with different base and mod. Collision probability ≈ 1/mod² - practically 0.

Why do we STILL compare strings when pattern_hash == window_hash?

Applications

**1. Searching for multiple patterns:**

**2. Plagiarism detection (common substrings):**

**3. Longest Common Substring (binary search on length):**

**Rabin-Karp vs KMP:** KMP guarantees O(n+m), Rabin-Karp is simpler and better for multiple pattern search.

For searching 100 patterns of the same length in text, what is more efficient?

Rabin-Karp

  • Compare hashes, not strings
  • Rolling Hash: O(1) per shift
  • O(n+m) average, O(nm) worst (collisions)
  • Double hashing reduces collisions
  • Great for multiple pattern search

Related Topics

Rabin-Karp is the foundation for many string algorithms.

  • KMP — Alternative with O(n+m) guarantee
  • Hashing — Foundation of the algorithm

Связанные уроки

  • alg-23-pattern-matching — Rabin-Karp is a hashing answer to naive search
  • ds-06-hash-intro — Rolling hash builds on basic hashing ideas
  • alg-24-kmp — Both reach O(n+m); KMP is exact, RK is probabilistic
  • crypto-19-hash-functions — Hash collisions explain Rabin-Karp false positives
Rabin-Karp: String Hashing