Discrete Mathematics
Discrete Math in Computer Science
Recognizing an NP-complete problem in a system design interview saves one from searching for a polynomial solution that does not exist. Knowing RSA's math tells one why e=65537 is standard and what breaking it would require. Understanding Bloom filter probability lets one size it correctly for a billion-URL crawler. This is discrete math as a working engineering tool.
- **Chrome Safe Browsing:** Bloom filter with ~8 bits/URL lets the browser check dangerous sites locally without sending every URL to Google servers
- **TLS 1.3:** ECDH (discrete log on elliptic curves) for key exchange, Miller-Rabin for RSA key generation, SHA-256 for all hashing
- **LLVM register allocator:** graph coloring (NP-complete) solved with greedy approximation - this is what compiles the code to machine instructions
- **Apache Cassandra:** Bloom filters per SSTable reduce disk reads by 10-100x for nonexistent keys
Предварительные знания
Complexity Classes: P, NP, NP-Completeness
Google's PageRank relies on the P-vs-NP boundary: RSA-2048 security depends on factorization hardness, a problem a 4,000-qubit quantum computer would solve in minutes. Complexity theory asks which problems are fundamentally hard to solve. The classes P and NP define the boundary between efficiently solvable and possibly not. Understanding this boundary is critical when choosing algorithms: it tells one when to look for an approximation instead of an exact solution.
**Complexity classes:** - **P:** problems solvable in polynomial time O(n^k). Sorting, shortest path, matrix multiplication. - **NP:** problems whose solutions can be *verified* in polynomial time. P is a subset of NP. - **NP-complete:** the hardest problems in NP. If any NP-complete problem can be solved in polynomial time, all of NP can. - **P = NP?** Open problem, $1M Clay Institute prize. Most researchers believe P != NP.
| Problem | Class | CS Application |
|---|---|---|
| Sorting | P (O(n log n)) | Databases, search, merge |
| 3-SAT | NP-complete | Software verification, constraint solving |
| Hamiltonian path | NP-complete | TSP, network routing |
| Graph k-coloring (k>=3) | NP-complete | Register allocation in compilers |
| Integer linear programming | NP-complete | Scheduling, resource allocation |
| Primality testing | P (AKS, 2002) | RSA and cryptographic key generation |
In practice, NP-complete problems are solved by approximation algorithms, heuristics, or specialized solvers (CPLEX, Gurobi for ILP). LLVM's register allocator uses greedy graph coloring (not optimal). TSP is solved with 2-opt or Lin-Kernighan heuristics. The theoretical hardness tells one to stop looking for an exact polynomial algorithm.
In system design interviews: if one recognize a problem as NP-complete, say so explicitly. The correct answer is "this reduces to X (NP-complete), so we use an approximation / greedy heuristic / LP relaxation" rather than claiming a polynomial exact solution exists.
Why is graph 3-coloring NP-complete, but 2-coloring (bipartite checking) solvable in O(V+E)?
Probabilistic Algorithms: Miller-Rabin Primality Test
Probabilistic algorithms use randomness to achieve fast results that are correct with high probability. The Miller-Rabin primality test is used by OpenSSL, Python's `secrets` module, and Java's `BigInteger.isProbablePrime()` to generate cryptographic keys.
**Miller-Rabin test (based on Fermat's little theorem):** Given odd n, write n-1 = 2^s * d. For a random witness a: - If a^d = 1 (mod n), n is probably prime - If a^(2^r * d) = -1 (mod n) for some r in [0, s-1], n is probably prime - Otherwise, n is definitely composite (a is a witness to compositeness) With k independent witnesses: P(false positive) < 4^(-k).
**Las Vegas vs Monte Carlo algorithms:** - **Las Vegas:** always correct, random running time. Example: randomized quicksort (always sorts correctly, expected O(n log n)). - **Monte Carlo:** fixed time, small error probability. Example: Miller-Rabin (always fast, P(false prime) < 4^(-k)). - **RP:** Monte Carlo with one-sided error only (no false negatives or no false positives, not both).
Why Miller-Rabin instead of the deterministic AKS algorithm in production: AKS is theoretically O(log^6 n) but the constants are enormous. Miller-Rabin with k=40 is faster in practice, and P(error) < 4^(-40) ≈ 10^(-24), which is smaller than the probability of a cosmic ray flipping a bit in hardware.
Miller-Rabin returns 'composite' for a number. What can one conclude with certainty?
RSA and Diffie-Hellman: The Math Behind the Protocol
RSA and Diffie-Hellman are two foundational cryptographic protocols built directly on discrete math. RSA rests on the hardness of integer factorization. DH rests on the discrete logarithm problem in a cyclic group. Both are direct applications of number theory and group theory.
**RSA (Rivest-Shamir-Adleman, 1977):** 1. Choose large primes p, q; compute n = p*q, phi(n) = (p-1)*(q-1) 2. Public key: e with gcd(e, phi(n)) = 1 (typically e = 65537) 3. Private key: d = e^(-1) mod phi(n) via extended Euclidean algorithm 4. Encrypt: c = m^e mod n 5. Decrypt: m = c^d mod n (correct by Euler's theorem: m^(phi(n)) = 1 mod n) Security: no known polynomial algorithm for factoring n when |p|, |q| ~ 2048 bits.
Diffie-Hellman (1976): the first key exchange protocol over a public channel. Alice and Bob publicly agree on (p, g). Alice sends g^a mod p, Bob sends g^b mod p. Both compute g^(ab) mod p as their shared secret. An eavesdropper who sees g^a and g^b cannot compute g^(ab) without solving the discrete logarithm.
**What TLS 1.3 actually uses:** X25519 (ECDH over Curve25519) for ephemeral key exchange, AES-256-GCM for symmetric encryption, HMAC-SHA-256 for message authentication, and ECDSA or RSA only for certificate signing. The discrete math from this course is literally running inside every HTTPS connection.
In RSA with p=5, q=11, n=55, phi(n)=40, e=3, what is the private key d?
Bloom Filters: Probability Meets Hashing
A Bloom filter is a space-efficient probabilistic data structure for set membership. It uses k hash functions and a bit array of size m. False negatives are impossible. False positives occur with a precisely computable probability that decreases as m grows.
**False positive rate after inserting n items:** p_fp ≈ (1 - e^(-kn/m))^k **Optimal number of hash functions** for given m and n: k* = (m/n) * ln(2) ≈ 0.693 * (m/n) **Optimal bit array size** for target false positive rate eps: m = -n * ln(eps) / (ln 2)^2 ≈ 1.44 * n * log2(1/eps)
Bloom filters are used where a false positive is cheap but a false negative is unacceptable. Chrome Safe Browsing downloads a Bloom filter of ~300k dangerous URLs so the browser checks locally without sending every URL to Google's servers. Cassandra and HBase use them to avoid disk reads for keys that definitely do not exist.
Quick sizing rules: fp_rate=1% needs about 9.6 bits/element with k=7 hashes. fp_rate=0.1% needs about 14.4 bits/element with k=10 hashes. Halving the false positive rate costs about 1.44 additional bits per element. A standard Redis Bloom filter for 1 billion URLs at 1% fp_rate needs roughly 1.2 GB.
A Bloom filter returns 'definitely not in set'. What can one conclude?
Key Ideas
- **P vs NP:** P is efficiently solvable; NP is efficiently verifiable. NP-complete problems (3-SAT, TSP, graph coloring) are the hardest in NP - use approximations, not exact algorithms.
- **Miller-Rabin:** probabilistic primality test with P(error) < 4^(-k). One-sided error only: 'composite' is always correct. Used in every major crypto library.
- **RSA:** security from factoring n=p*q. Keys satisfy e*d = 1 (mod phi(n)). All operations use fast modular exponentiation.
- **Diffie-Hellman:** key exchange over public channel. Security from discrete log in Z_p* or on an elliptic curve. The math behind every TLS handshake.
- **Bloom filter:** bit array + k hash functions. No false negatives. P(false positive) is exactly computable. Memory-efficient for billion-scale membership testing.
Related Topics
This lesson synthesizes all previous topics into practical CS applications:
- Groups and Rings — RSA and DH are built on Z_p* group theory; AES uses GF(2^8) field arithmetic
- Random Variables — Probabilistic algorithms are analyzed using expected running time and tail probability bounds
- Discrete Math in Interviews — NP-completeness recognition, Bloom filter design, and crypto fundamentals appear in FAANG system design rounds
Вопросы для размышления
- If P = NP were proven tomorrow, which specific cryptographic systems would break immediately, and which might survive? (Hint: consider which ones rely on NP-hard assumptions.)
- Miller-Rabin with k=40 rounds gives P(error) < 4^(-40). OpenSSL uses k=64 for RSA-2048 key generation. Why do we need more rounds for larger keys, even though the math is the same?
- Design a web crawler for 10 billion URLs. Compare a Bloom filter (fp_rate=0.1%) vs a hash set for deduplication: compute memory requirements for each, and explain when woulding prefer one over the other.