Information Theory
Info theory at the interview
Information theory shows up at FAANG interviews more often than you would expect: 'explain KL divergence', 'why does L2 regularization help generalization', 'what is the information bottleneck'. These questions separate candidates who understand ML theoretically from those who just call libraries.
- **Google Brain** has stated publicly that candidates who can explain ELBO through information principles receive significantly higher scores at ML Research interviews.
- **Meta AI Research** often asks 'how do you measure how much information about the input a hidden representation retains?', directly inside the Information Bottleneck frame.
- **OpenAI** uses KL divergence as a central component of RLHF (the ChatGPT training loop). Understanding it helps at interviews for alignment and RL teams.
Предварительные знания
Entropy questions
Entropy is the most common information theory topic at interviews. Basic questions: 'what is entropy?', 'what is the entropy of a coin?', 'what happens to the entropy if you add an event of probability zero?'. Tougher ones: 'minimum and maximum entropy with n outcomes?', 'how is entropy related to the number of bits needed to code something?'.
**Core entropy facts:** H(X) = -sum p(x) log_2 p(x) bits. H = 0 iff X is deterministic. H = log_2 n iff the distribution is uniform (maximum). H(X | Y) <= H(X) (conditioning reduces entropy). H(X, Y) = H(X) + H(Y | X). H(X) >= 0. Zero-probability events: 0 log 0 = 0 by convention.
| Question | Answer | Trick |
|---|---|---|
| H(deterministic X) = ? | 0 bits | -1 * log(1) = 0 |
| H(uniform {1..n}) = ? | log_2(n) bits | Maximum |
| H(X, Y) when X = Y? | H(X) = H(Y) | H(X | X) = 0 |
| H(X + Y) vs H(X) + H(Y)? | <= sum | Correlation lowers it |
| 0 log 0 = ? | 0 (by convention) | lim p -> 0 of p log p = 0 |
Historical note
These questions are standard at ML engineering interviews at Google, Meta, Amazon. The usual opening is 'what is entropy intuitively?', then it moves into concrete problems. The key is being able to tie the formal definition to the intuition of 'uncertainty'.
A high probability on one outcome means high entropy.
The opposite: the more concentrated one outcome, the lower the entropy. Maximum entropy sits at the uniform distribution.
Entropy is uncertainty. If one outcome is almost certain (p ~ 1) there is no uncertainty, no information, entropy ~ 0.
Interview question: 'A coin has p(H) = 0.99. Roughly what is its entropy?' Best answer:
Coding questions
A frequent task: 'design an optimal code for an alphabet with these probabilities'. The expectation is a Huffman tree, the average code length, and a comparison with entropy. Another type: 'what is the minimum size of a file with N symbols of entropy H?' - the answer is nH bits plus a small overhead.
**Huffman at the interview:** 1. Sort the probabilities. 2. Merge the two smallest. 3. Repeat. Average length L: H(X) <= L < H(X) + 1. If asked 'can you reach H(X)?', only when probabilities are powers of two. Otherwise use arithmetic coding or ANS.
| Task | Method | Key fact |
|---|---|---|
| Min bits for n symbols | nH(X) | AEP: exactly nH bits needed |
| Average code word | Huffman tree | H <= L < H + 1 |
| Min code per 1 symbol | Huffman | Integer bits, >= H(X) |
| Min code per block, n -> infinity | Arithmetic / ANS | -> H(X) bits/symbol |
| Cannot compress below... | H(X) | Source coding theorem |
Historical note
Coding problems are standard at Systems Design and ML engineering interviews at FAANG. The most common is the link between entropy and minimum size, with the answer 'nH bits' from Shannon's source coding theorem (1948).
Lossless compression always shrinks files.
Lossless compression cannot shrink a random file. It is already at maximum entropy. Some files even grow under 'compression'.
Pigeonhole: not all length-n strings can be shorter than n bits. For every well-compressed input there are inputs that grow.
A file holds 10000 symbols from {A, B, C, D} with probabilities 1/2, 1/4, 1/8, 1/8. Minimum size under lossless compression?
Capacity questions
Capacity questions are common at System Design interviews: 'how do you estimate the theoretical limit of a Wi-Fi channel?'. The expectation is applying Shannon's formula C = B log_2(1 + SNR), knowing the BSC, and basic math. ML engineers also get 'what is mutual information?' and 'how does it relate to channel capacity?'.
**Interview points:** 1. C = B log_2(1 + SNR) for AWGN. 2. BSC(p): C = 1 - H(p). 3. I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X). 4. Independent X, Y: I(X; Y) = 0. 5. I(X; Y) >= 0 always. 6. Data processing: I(X; Y) >= I(X; g(Y)).
| Interview question | Key answer | Formula |
|---|---|---|
| I(X; Y) when X is independent of Y? | 0 | I = H(X) - H(X | Y) = H(X) - H(X) |
| I(X; Y) maximum? | min(H(X), H(Y)) | Reached at functional dependence |
| C = ? for an ideal channel | infinity (noiseless) | SNR -> infinity: C -> infinity |
| What limits internet speed? | Bandwidth + SNR | C = B log(1 + SNR) |
| Data processing inequality | I(X; Y) >= I(X; g(Y)) | Processing cannot grow I |
Historical note
Capacity and mutual information questions appear often in the context of 'how would you design a recommender system?' with I(user; recommendation) as a quality metric. A real-world ML product application of Shannon theory.
Feature engineering can increase mutual information between features and target.
By DPI feature engineering cannot increase I(X; Y), only preserve or reduce it. The goal is to avoid losing useful information during transforms.
Markov chain: X -> g(X) -> Y. By DPI: I(X; Y) >= I(g(X); Y). Features g(X) are functions of X, so information about Y cannot exceed what is in X.
Interview question: 'What is the data processing inequality and why does it matter for ML?'
Practical applications
The final family of questions: 'where does information theory show up in real systems?'. The right move is to connect theory to concrete instances: JPEG -> R-D tradeoff; recommender systems -> I(user; recommendation) as a quality signal; A/B tests -> KL divergence between groups; differential privacy -> an information limit on leakage.
**IT applications in ML industry:** 1. Entropy regularization in RL (SAC, max-entropy policy). 2. KL divergence in RLHF (KL from a reference policy). 3. Mutual information in feature selection. 4. Differential privacy: epsilon-DP bounds I(algorithm; data). 5. Model compression = minimizing I(weights; data).
| System | IT use | Metric |
|---|---|---|
| SAC (RL) | Entropy regularization: max H(pi) | Entropy H(pi) |
| RLHF (ChatGPT) | KL penalty from ref policy | KL[pi || pi_ref] |
| Recommender | Max I(user; item) | Mutual information |
| Differential privacy | epsilon-DP bounds I | epsilon, delta parameters |
| Knowledge distillation | KL[soft_teacher || student] | KL divergence |
Historical note
Soft Actor-Critic (Haarnoja et al., 2018) formalized entropy regularization in RL. RLHF (InstructGPT, 2022) uses KL divergence as a penalty preventing policy mode collapse. Differential Privacy (Dwork, 2006) is an information-theoretic privacy guarantee.
Information theory only applies to compression and telecommunications.
IT runs through ML: VAE (ELBO), RLHF (KL penalty), differential privacy (epsilon-DP), feature selection (MI), model compression (MDL). It is the universal language for describing information in systems.
Information is a foundational concept. Wherever there is data, learning, or uncertainty, there is IT. ML is largely 'learning efficient information representations'.
RLHF for ChatGPT adds a KL[pi || pi_ref] penalty. Why?
Takeaways
- **Entropy:** H = 0 (deterministic) up to log_2(n) (uniform). Biased coin p = 0.99 has H ~ 0.08 bits. Conditioning reduces entropy.
- **Coding:** Huffman gives H <= L < H + 1. nH bits is the minimum size for n symbols. At power-of-two probabilities Huffman is optimal.
- **Capacity:** C = B log_2(1 + SNR) for AWGN. BSC(p): C = 1 - H(p). I(X; Y) = H(X) - H(X | Y) >= 0.
- **ML applications:** RLHF (KL penalty), SAC (entropy reg), VAE (ELBO), differential privacy (epsilon-DP). Data processing inequality: processing does not create information.
Related topics
Interview preparation depends on every earlier topic.
- Entropy and mutual information — Base definitions, a precondition for every other question.
- Information theory in ML — ELBO, IB, MINE are the most common advanced questions at ML interviews.
- Channel and capacity — Shannon's formula and channel capacity are System Design staples.
Вопросы для размышления
- If the interviewer asks for 'intuition behind KL divergence', how do you answer in 2 minutes with no formulas?
- How would you answer 'why is ChatGPT sometimes confidently wrong?' using IT concepts (output entropy, KL from reference)?
- Suppose the task is 'given N videos, pick 10 to show the user'. How would an information-theoretic approach help here?