Algorithms
Rabin-Karp: String Hashing
Цели урока
- Understand the hash search idea
- Implement Rolling Hash
- Understand the collision problem
- Apply for multiple pattern search
Предварительные знания
- Naive substring search
Why compare 1000 characters when you can compare 2 numbers? Rabin-Karp uses the mathematical magic of hashing for lightning-fast search.
- Plagiarism detection in texts
- Antivirus (signature search)
- Data deduplication
- Similar image search (perceptual hashing)
Hashing turns string search into arithmetic
Richard Karp and Michael Rabin published their algorithm in 1987 in a paper titled Efficient randomized pattern-matching algorithms. Their idea was different from KMP: instead of comparing characters, compare hashes. Hash the pattern once, then hash each window of the text and compare numbers. The trick that makes it fast is the rolling hash. When the window slides one character to the right, the hash is not recomputed from scratch. The algorithm subtracts the contribution of the outgoing character and adds the incoming one in constant time, treating the window as a number in some base modulo a prime. A hash match is then verified character by character to rule out collisions. Average time is linear, and the same rolling-hash idea later powered multi-pattern search and content-defined chunking in systems like rsync.