Information Theory

Info theory at the interview

Information theory shows up at FAANG interviews more often than you would expect: 'explain KL divergence', 'why does L2 regularization help generalization', 'what is the information bottleneck'. These questions separate candidates who understand ML theoretically from those who just call libraries.

  • **Google Brain** has stated publicly that candidates who can explain ELBO through information principles receive significantly higher scores at ML Research interviews.
  • **Meta AI Research** often asks 'how do you measure how much information about the input a hidden representation retains?', directly inside the Information Bottleneck frame.
  • **OpenAI** uses KL divergence as a central component of RLHF (the ChatGPT training loop). Understanding it helps at interviews for alignment and RL teams.

Предварительные знания

  • Information Theory in Deep Learning

Entropy questions

Entropy is the most common information theory topic at interviews. Basic questions: 'what is entropy?', 'what is the entropy of a coin?', 'what happens to the entropy if you add an event of probability zero?'. Tougher ones: 'minimum and maximum entropy with n outcomes?', 'how is entropy related to the number of bits needed to code something?'.

**Core entropy facts:** H(X) = -sum p(x) log_2 p(x) bits. H = 0 iff X is deterministic. H = log_2 n iff the distribution is uniform (maximum). H(X | Y) <= H(X) (conditioning reduces entropy). H(X, Y) = H(X) + H(Y | X). H(X) >= 0. Zero-probability events: 0 log 0 = 0 by convention.

QuestionAnswerTrick
H(deterministic X) = ?0 bits-1 * log(1) = 0
H(uniform {1..n}) = ?log_2(n) bitsMaximum
H(X, Y) when X = Y?H(X) = H(Y)H(X | X) = 0
H(X + Y) vs H(X) + H(Y)?<= sumCorrelation lowers it
0 log 0 = ?0 (by convention)lim p -> 0 of p log p = 0

Historical note

These questions are standard at ML engineering interviews at Google, Meta, Amazon. The usual opening is 'what is entropy intuitively?', then it moves into concrete problems. The key is being able to tie the formal definition to the intuition of 'uncertainty'.

A high probability on one outcome means high entropy.

The opposite: the more concentrated one outcome, the lower the entropy. Maximum entropy sits at the uniform distribution.

Entropy is uncertainty. If one outcome is almost certain (p ~ 1) there is no uncertainty, no information, entropy ~ 0.

Interview question: 'A coin has p(H) = 0.99. Roughly what is its entropy?' Best answer:

Coding questions

A frequent task: 'design an optimal code for an alphabet with these probabilities'. The expectation is a Huffman tree, the average code length, and a comparison with entropy. Another type: 'what is the minimum size of a file with N symbols of entropy H?' - the answer is nH bits plus a small overhead.

**Huffman at the interview:** 1. Sort the probabilities. 2. Merge the two smallest. 3. Repeat. Average length L: H(X) <= L < H(X) + 1. If asked 'can you reach H(X)?', only when probabilities are powers of two. Otherwise use arithmetic coding or ANS.

TaskMethodKey fact
Min bits for n symbolsnH(X)AEP: exactly nH bits needed
Average code wordHuffman treeH <= L < H + 1
Min code per 1 symbolHuffmanInteger bits, >= H(X)
Min code per block, n -> infinityArithmetic / ANS-> H(X) bits/symbol
Cannot compress below...H(X)Source coding theorem

Historical note

Coding problems are standard at Systems Design and ML engineering interviews at FAANG. The most common is the link between entropy and minimum size, with the answer 'nH bits' from Shannon's source coding theorem (1948).

Lossless compression always shrinks files.

Lossless compression cannot shrink a random file. It is already at maximum entropy. Some files even grow under 'compression'.

Pigeonhole: not all length-n strings can be shorter than n bits. For every well-compressed input there are inputs that grow.

A file holds 10000 symbols from {A, B, C, D} with probabilities 1/2, 1/4, 1/8, 1/8. Minimum size under lossless compression?

Capacity questions

Capacity questions are common at System Design interviews: 'how do you estimate the theoretical limit of a Wi-Fi channel?'. The expectation is applying Shannon's formula C = B log_2(1 + SNR), knowing the BSC, and basic math. ML engineers also get 'what is mutual information?' and 'how does it relate to channel capacity?'.

**Interview points:** 1. C = B log_2(1 + SNR) for AWGN. 2. BSC(p): C = 1 - H(p). 3. I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X). 4. Independent X, Y: I(X; Y) = 0. 5. I(X; Y) >= 0 always. 6. Data processing: I(X; Y) >= I(X; g(Y)).

Interview questionKey answerFormula
I(X; Y) when X is independent of Y?0I = H(X) - H(X | Y) = H(X) - H(X)
I(X; Y) maximum?min(H(X), H(Y))Reached at functional dependence
C = ? for an ideal channelinfinity (noiseless)SNR -> infinity: C -> infinity
What limits internet speed?Bandwidth + SNRC = B log(1 + SNR)
Data processing inequalityI(X; Y) >= I(X; g(Y))Processing cannot grow I

Historical note

Capacity and mutual information questions appear often in the context of 'how would you design a recommender system?' with I(user; recommendation) as a quality metric. A real-world ML product application of Shannon theory.

Feature engineering can increase mutual information between features and target.

By DPI feature engineering cannot increase I(X; Y), only preserve or reduce it. The goal is to avoid losing useful information during transforms.

Markov chain: X -> g(X) -> Y. By DPI: I(X; Y) >= I(g(X); Y). Features g(X) are functions of X, so information about Y cannot exceed what is in X.

Interview question: 'What is the data processing inequality and why does it matter for ML?'

Practical applications

The final family of questions: 'where does information theory show up in real systems?'. The right move is to connect theory to concrete instances: JPEG -> R-D tradeoff; recommender systems -> I(user; recommendation) as a quality signal; A/B tests -> KL divergence between groups; differential privacy -> an information limit on leakage.

**IT applications in ML industry:** 1. Entropy regularization in RL (SAC, max-entropy policy). 2. KL divergence in RLHF (KL from a reference policy). 3. Mutual information in feature selection. 4. Differential privacy: epsilon-DP bounds I(algorithm; data). 5. Model compression = minimizing I(weights; data).

SystemIT useMetric
SAC (RL)Entropy regularization: max H(pi)Entropy H(pi)
RLHF (ChatGPT)KL penalty from ref policyKL[pi || pi_ref]
RecommenderMax I(user; item)Mutual information
Differential privacyepsilon-DP bounds Iepsilon, delta parameters
Knowledge distillationKL[soft_teacher || student]KL divergence

Historical note

Soft Actor-Critic (Haarnoja et al., 2018) formalized entropy regularization in RL. RLHF (InstructGPT, 2022) uses KL divergence as a penalty preventing policy mode collapse. Differential Privacy (Dwork, 2006) is an information-theoretic privacy guarantee.

Information theory only applies to compression and telecommunications.

IT runs through ML: VAE (ELBO), RLHF (KL penalty), differential privacy (epsilon-DP), feature selection (MI), model compression (MDL). It is the universal language for describing information in systems.

Information is a foundational concept. Wherever there is data, learning, or uncertainty, there is IT. ML is largely 'learning efficient information representations'.

RLHF for ChatGPT adds a KL[pi || pi_ref] penalty. Why?

Takeaways

  • **Entropy:** H = 0 (deterministic) up to log_2(n) (uniform). Biased coin p = 0.99 has H ~ 0.08 bits. Conditioning reduces entropy.
  • **Coding:** Huffman gives H <= L < H + 1. nH bits is the minimum size for n symbols. At power-of-two probabilities Huffman is optimal.
  • **Capacity:** C = B log_2(1 + SNR) for AWGN. BSC(p): C = 1 - H(p). I(X; Y) = H(X) - H(X | Y) >= 0.
  • **ML applications:** RLHF (KL penalty), SAC (entropy reg), VAE (ELBO), differential privacy (epsilon-DP). Data processing inequality: processing does not create information.

Related topics

Interview preparation depends on every earlier topic.

  • Entropy and mutual information — Base definitions, a precondition for every other question.
  • Information theory in ML — ELBO, IB, MINE are the most common advanced questions at ML interviews.
  • Channel and capacity — Shannon's formula and channel capacity are System Design staples.

Вопросы для размышления

  • If the interviewer asks for 'intuition behind KL divergence', how do you answer in 2 minutes with no formulas?
  • How would you answer 'why is ChatGPT sometimes confidently wrong?' using IT concepts (output entropy, KL from reference)?
  • Suppose the task is 'given N videos, pick 10 to show the user'. How would an information-theoretic approach help here?

Связанные уроки

  • stat-03-mle
Info theory at the interview

0

1

Sign In