Information Theory

Info theory at the interview

Information theory shows up at FAANG interviews more often than you would expect: 'explain KL divergence', 'why does L2 regularization help generalization', 'what is the information bottleneck'. These questions separate candidates who understand ML theoretically from those who just call libraries.

**Google Brain** has stated publicly that candidates who can explain ELBO through information principles receive significantly higher scores at ML Research interviews.
**Meta AI Research** often asks 'how do you measure how much information about the input a hidden representation retains?', directly inside the Information Bottleneck frame.
**OpenAI** uses KL divergence as a central component of RLHF (the ChatGPT training loop). Understanding it helps at interviews for alignment and RL teams.

Предварительные знания

Information Theory in Deep Learning

Entropy questions

Entropy is the most common information theory topic at interviews. Basic questions: 'what is entropy?', 'what is the entropy of a coin?', 'what happens to the entropy if you add an event of probability zero?'. Tougher ones: 'minimum and maximum entropy with n outcomes?', 'how is entropy related to the number of bits needed to code something?'.

**Core entropy facts:** H(X) = -sum p(x) log_2 p(x) bits. H = 0 iff X is deterministic. H = log_2 n iff the distribution is uniform (maximum). H(X | Y) <= H(X) (conditioning reduces entropy). H(X, Y) = H(X) + H(Y | X). H(X) >= 0. Zero-probability events: 0 log 0 = 0 by convention.

Question	Answer	Trick
H(deterministic X) = ?	0 bits	-1 * log(1) = 0
H(uniform {1..n}) = ?	log_2(n) bits	Maximum
H(X, Y) when X = Y?	H(X) = H(Y)	H(X \| X) = 0
H(X + Y) vs H(X) + H(Y)?	<= sum	Correlation lowers it
0 log 0 = ?	0 (by convention)	lim p -> 0 of p log p = 0

Historical note

These questions are standard at ML engineering interviews at Google, Meta, Amazon. The usual opening is 'what is entropy intuitively?', then it moves into concrete problems. The key is being able to tie the formal definition to the intuition of 'uncertainty'.

A high probability on one outcome means high entropy.

The opposite: the more concentrated one outcome, the lower the entropy. Maximum entropy sits at the uniform distribution.

Entropy is uncertainty. If one outcome is almost certain (p ~ 1) there is no uncertainty, no information, entropy ~ 0.

Interview question: 'A coin has p(H) = 0.99. Roughly what is its entropy?' Best answer:

Coding questions

A frequent task: 'design an optimal code for an alphabet with these probabilities'. The expectation is a Huffman tree, the average code length, and a comparison with entropy. Another type: 'what is the minimum size of a file with N symbols of entropy H?' - the answer is nH bits plus a small overhead.

**Huffman at the interview:** 1. Sort the probabilities. 2. Merge the two smallest. 3. Repeat. Average length L: H(X) <= L < H(X) + 1. If asked 'can you reach H(X)?', only when probabilities are powers of two. Otherwise use arithmetic coding or ANS.

Task	Method	Key fact
Min bits for n symbols	nH(X)	AEP: exactly nH bits needed
Average code word	Huffman tree	H <= L < H + 1
Min code per 1 symbol	Huffman	Integer bits, >= H(X)
Min code per block, n -> infinity	Arithmetic / ANS	-> H(X) bits/symbol
Cannot compress below...	H(X)	Source coding theorem

Historical note

Coding problems are standard at Systems Design and ML engineering interviews at FAANG. The most common is the link between entropy and minimum size, with the answer 'nH bits' from Shannon's source coding theorem (1948).

Lossless compression always shrinks files.

Lossless compression cannot shrink a random file. It is already at maximum entropy. Some files even grow under 'compression'.

Pigeonhole: not all length-n strings can be shorter than n bits. For every well-compressed input there are inputs that grow.

A file holds 10000 symbols from {A, B, C, D} with probabilities 1/2, 1/4, 1/8, 1/8. Minimum size under lossless compression?

Capacity questions

Capacity questions are common at System Design interviews: 'how do you estimate the theoretical limit of a Wi-Fi channel?'. The expectation is applying Shannon's formula C = B log_2(1 + SNR), knowing the BSC, and basic math. ML engineers also get 'what is mutual information?' and 'how does it relate to channel capacity?'.

**Interview points:** 1. C = B log_2(1 + SNR) for AWGN. 2. BSC(p): C = 1 - H(p). 3. I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X). 4. Independent X, Y: I(X; Y) = 0. 5. I(X; Y) >= 0 always. 6. Data processing: I(X; Y) >= I(X; g(Y)).

Interview question	Key answer	Formula
I(X; Y) when X is independent of Y?	0	I = H(X) - H(X \| Y) = H(X) - H(X)
I(X; Y) maximum?	min(H(X), H(Y))	Reached at functional dependence
C = ? for an ideal channel	infinity (noiseless)	SNR -> infinity: C -> infinity
What limits internet speed?	Bandwidth + SNR	C = B log(1 + SNR)
Data processing inequality	I(X; Y) >= I(X; g(Y))	Processing cannot grow I

Historical note

Capacity and mutual information questions appear often in the context of 'how would you design a recommender system?' with I(user; recommendation) as a quality metric. A real-world ML product application of Shannon theory.

Feature engineering can increase mutual information between features and target.

By DPI feature engineering cannot increase I(X; Y), only preserve or reduce it. The goal is to avoid losing useful information during transforms.

Markov chain: X -> g(X) -> Y. By DPI: I(X; Y) >= I(g(X); Y). Features g(X) are functions of X, so information about Y cannot exceed what is in X.

Interview question: 'What is the data processing inequality and why does it matter for ML?'

Practical applications

The final family of questions: 'where does information theory show up in real systems?'. The right move is to connect theory to concrete instances: JPEG -> R-D tradeoff; recommender systems -> I(user; recommendation) as a quality signal; A/B tests -> KL divergence between groups; differential privacy -> an information limit on leakage.

**IT applications in ML industry:** 1. Entropy regularization in RL (SAC, max-entropy policy). 2. KL divergence in RLHF (KL from a reference policy). 3. Mutual information in feature selection. 4. Differential privacy: epsilon-DP bounds I(algorithm; data). 5. Model compression = minimizing I(weights; data).

System	IT use	Metric
SAC (RL)	Entropy regularization: max H(pi)	Entropy H(pi)
RLHF (ChatGPT)	KL penalty from ref policy	KL[pi \|\| pi_ref]
Recommender	Max I(user; item)	Mutual information
Differential privacy	epsilon-DP bounds I	epsilon, delta parameters
Knowledge distillation	KL[soft_teacher \|\| student]	KL divergence

Historical note

Soft Actor-Critic (Haarnoja et al., 2018) formalized entropy regularization in RL. RLHF (InstructGPT, 2022) uses KL divergence as a penalty preventing policy mode collapse. Differential Privacy (Dwork, 2006) is an information-theoretic privacy guarantee.

Information theory only applies to compression and telecommunications.

IT runs through ML: VAE (ELBO), RLHF (KL penalty), differential privacy (epsilon-DP), feature selection (MI), model compression (MDL). It is the universal language for describing information in systems.

Information is a foundational concept. Wherever there is data, learning, or uncertainty, there is IT. ML is largely 'learning efficient information representations'.

RLHF for ChatGPT adds a KL[pi || pi_ref] penalty. Why?

Takeaways

**Entropy:** H = 0 (deterministic) up to log_2(n) (uniform). Biased coin p = 0.99 has H ~ 0.08 bits. Conditioning reduces entropy.
**Coding:** Huffman gives H <= L < H + 1. nH bits is the minimum size for n symbols. At power-of-two probabilities Huffman is optimal.
**Capacity:** C = B log_2(1 + SNR) for AWGN. BSC(p): C = 1 - H(p). I(X; Y) = H(X) - H(X | Y) >= 0.
**ML applications:** RLHF (KL penalty), SAC (entropy reg), VAE (ELBO), differential privacy (epsilon-DP). Data processing inequality: processing does not create information.

Вопросы для размышления

If the interviewer asks for 'intuition behind KL divergence', how do you answer in 2 minutes with no formulas?
How would you answer 'why is ChatGPT sometimes confidently wrong?' using IT concepts (output entropy, KL from reference)?
Suppose the task is 'given N videos, pick 10 to show the user'. How would an information-theoretic approach help here?

Связанные уроки

stat-03-mle

Information Theory

Info theory at the interview

**Google Brain** has stated publicly that candidates who can explain ELBO through information principles receive significantly higher scores at ML Research interviews.
**Meta AI Research** often asks 'how do you measure how much information about the input a hidden representation retains?', directly inside the Information Bottleneck frame.
**OpenAI** uses KL divergence as a central component of RLHF (the ChatGPT training loop). Understanding it helps at interviews for alignment and RL teams.

Предварительные знания

Information Theory in Deep Learning

Entropy questions

Question	Answer	Trick
H(deterministic X) = ?	0 bits	-1 * log(1) = 0
H(uniform {1..n}) = ?	log_2(n) bits	Maximum
H(X, Y) when X = Y?	H(X) = H(Y)	H(X \| X) = 0
H(X + Y) vs H(X) + H(Y)?	<= sum	Correlation lowers it
0 log 0 = ?	0 (by convention)	lim p -> 0 of p log p = 0

Historical note

A high probability on one outcome means high entropy.

The opposite: the more concentrated one outcome, the lower the entropy. Maximum entropy sits at the uniform distribution.

Entropy is uncertainty. If one outcome is almost certain (p ~ 1) there is no uncertainty, no information, entropy ~ 0.

Interview question: 'A coin has p(H) = 0.99. Roughly what is its entropy?' Best answer:

Coding questions

Task	Method	Key fact
Min bits for n symbols	nH(X)	AEP: exactly nH bits needed
Average code word	Huffman tree	H <= L < H + 1
Min code per 1 symbol	Huffman	Integer bits, >= H(X)
Min code per block, n -> infinity	Arithmetic / ANS	-> H(X) bits/symbol
Cannot compress below...	H(X)	Source coding theorem

Historical note

Lossless compression always shrinks files.

Lossless compression cannot shrink a random file. It is already at maximum entropy. Some files even grow under 'compression'.

Pigeonhole: not all length-n strings can be shorter than n bits. For every well-compressed input there are inputs that grow.

A file holds 10000 symbols from {A, B, C, D} with probabilities 1/2, 1/4, 1/8, 1/8. Minimum size under lossless compression?

Capacity questions

Interview question	Key answer	Formula
I(X; Y) when X is independent of Y?	0	I = H(X) - H(X \| Y) = H(X) - H(X)
I(X; Y) maximum?	min(H(X), H(Y))	Reached at functional dependence
C = ? for an ideal channel	infinity (noiseless)	SNR -> infinity: C -> infinity
What limits internet speed?	Bandwidth + SNR	C = B log(1 + SNR)
Data processing inequality	I(X; Y) >= I(X; g(Y))	Processing cannot grow I

Historical note

Feature engineering can increase mutual information between features and target.

By DPI feature engineering cannot increase I(X; Y), only preserve or reduce it. The goal is to avoid losing useful information during transforms.

Markov chain: X -> g(X) -> Y. By DPI: I(X; Y) >= I(g(X); Y). Features g(X) are functions of X, so information about Y cannot exceed what is in X.

Interview question: 'What is the data processing inequality and why does it matter for ML?'

Practical applications

System	IT use	Metric
SAC (RL)	Entropy regularization: max H(pi)	Entropy H(pi)
RLHF (ChatGPT)	KL penalty from ref policy	KL[pi \|\| pi_ref]
Recommender	Max I(user; item)	Mutual information
Differential privacy	epsilon-DP bounds I	epsilon, delta parameters
Knowledge distillation	KL[soft_teacher \|\| student]	KL divergence

Historical note

Information theory only applies to compression and telecommunications.

Information is a foundational concept. Wherever there is data, learning, or uncertainty, there is IT. ML is largely 'learning efficient information representations'.

RLHF for ChatGPT adds a KL[pi || pi_ref] penalty. Why?

Takeaways

**Entropy:** H = 0 (deterministic) up to log_2(n) (uniform). Biased coin p = 0.99 has H ~ 0.08 bits. Conditioning reduces entropy.
**Coding:** Huffman gives H <= L < H + 1. nH bits is the minimum size for n symbols. At power-of-two probabilities Huffman is optimal.
**Capacity:** C = B log_2(1 + SNR) for AWGN. BSC(p): C = 1 - H(p). I(X; Y) = H(X) - H(X | Y) >= 0.
**ML applications:** RLHF (KL penalty), SAC (entropy reg), VAE (ELBO), differential privacy (epsilon-DP). Data processing inequality: processing does not create information.

Вопросы для размышления

If the interviewer asks for 'intuition behind KL divergence', how do you answer in 2 minutes with no formulas?
How would you answer 'why is ChatGPT sometimes confidently wrong?' using IT concepts (output entropy, KL from reference)?
Suppose the task is 'given N videos, pick 10 to show the user'. How would an information-theoretic approach help here?

Связанные уроки

stat-03-mle

Info theory at the interview

Предварительные знания

Entropy questions

Historical note

Coding questions

Historical note

Capacity questions

Historical note

Practical applications

Historical note

Takeaways

Related topics

Вопросы для размышления

Связанные уроки

Info theory at the interview

Предварительные знания

Entropy questions

Historical note

Coding questions

Historical note

Capacity questions

Historical note

Practical applications

Historical note

Takeaways

Related topics

Вопросы для размышления

Связанные уроки