Deep Learning

Autoencoders and VAE

2013. Diederik Kingma and Max Welling publish 'Auto-Encoding Variational Bayes' with 14 pages of math and 50 lines of code. A year later VAEs generate faces of people who never existed; two years later, interpolations between Hollywood celebrities. By 2023 the same principles sit at the heart of Stable Diffusion: a VAE encodes images into latent space, diffusion runs there, and the decoder returns the result. One trick - reparameterization - changed the field of deep learning.

**Anomaly detection in production**: AE for credit card fraud, manufacturing defects, network intrusion - the model is trained only on normal behaviour, so an input it reconstructs poorly (high reconstruction error) is flagged as an anomaly
**Recommender systems**: VAE-CF, published in 2018 by Netflix researchers and collaborators, trains on the user x movie matrix and recommends by decoding a user's latent vector into scores over movies
**Drug discovery**: ChemVAE and MolVAE encode molecules, written as SMILES strings, into latent space; finding new compounds = searching for z points with desired properties and decoding them back to a molecular structure
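
The anomaly-detection recipe above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the data is synthetic, the 99th-percentile threshold is an arbitrary choice, and the top principal directions stand in for a trained autoencoder (for a purely linear model the two coincide).

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" behaviour: 500 samples in 5-D that lie close to a 2-D subspace.
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 5))

# Stand-in for a trained autoencoder: the top-2 principal directions.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]                       # shape (2, 5)

def reconstruction_error(x):
    z = (x - mean) @ components.T         # encode to 2-D
    x_rec = z @ components + mean         # decode back to 5-D
    return np.sum((x - x_rec) ** 2, axis=1)

# Threshold: the 99th percentile of errors seen on normal data.
normal_errors = reconstruction_error(normal)
threshold = np.percentile(normal_errors, 99)

# A sample pushed 3 units along a direction the model cannot represent.
outlier = (mean + 3.0 * vt[2])[None, :]
is_anomaly = reconstruction_error(outlier) > threshold
```

The threshold is fitted on normal data only, which matches the setting described above: no labelled fraud examples are needed to train or calibrate the detector.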

From deep autoencoders to the VAE

In 2006 Geoffrey Hinton and Ruslan Salakhutdinov published a Science paper showing that deep autoencoders, pretrained layer by layer, could compress data far better than PCA - a key result of the deep learning revival. In 2013 Diederik Kingma and Max Welling introduced the Variational Autoencoder, which turned the autoencoder into a probabilistic generative model. Their reparameterization trick made the sampling step differentiable, so the whole model could be trained end to end with backpropagation, and it became a foundation of modern generative modeling.

Prerequisites

Encoder-decoder architectures and bottlenecks
Reconstruction loss (MSE) and gradient-based training
Probability basics: normal distribution, KL divergence

Encoder-Decoder Architecture

An **autoencoder** is a neural network that learns to copy its input to its output through a narrow bottleneck. Structure: an encoder compresses x into a latent vector z of dimensionality far smaller than the input, a decoder reconstructs x' from z. Loss is MSE or binary cross-entropy between x and x'. The value lies not in the copy but in the latent code z: the network is forced to learn a compact representation that preserves essential features. This is unsupervised learning - no labels needed, only the data itself.
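
As a rough sketch of the encoder-bottleneck-decoder structure, the example below trains a purely linear autoencoder with plain gradient descent on MSE. Everything in it is an assumption chosen for brevity: the synthetic 8-D data, the 2-D bottleneck, the learning rate, and the absence of nonlinearities. A real autoencoder would normally use a deep learning framework and nonlinear layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 8-D that really live on a 2-D plane, plus slight noise.
basis = rng.normal(size=(2, 8))
x = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 8))

latent_dim = 2
w_enc = 0.1 * rng.normal(size=(8, latent_dim))   # encoder weights: x -> z
w_dec = 0.1 * rng.normal(size=(latent_dim, 8))   # decoder weights: z -> x'

def mse(a, b):
    return float(np.mean((a - b) ** 2))

initial_loss = mse(x @ w_enc @ w_dec, x)

lr = 0.01
for _ in range(2000):
    z = x @ w_enc                     # encode into the bottleneck
    x_rec = z @ w_dec                 # decode back to input space
    err = (x_rec - x) / len(x)        # gradient of 0.5 * squared error / N w.r.t. x_rec
    grad_dec = z.T @ err
    grad_enc = x.T @ (err @ w_dec.T)
    w_dec -= lr * grad_dec
    w_enc -= lr * grad_enc

final_loss = mse(x @ w_enc @ w_dec, x)
```

Because the data has only two underlying degrees of freedom, a 2-D code is enough to reconstruct it almost perfectly; the interesting output is `z`, not `x_rec`.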

Autoencoder variants:
**Denoising AE** - feeds in a corrupted input x + noise and targets the clean x, so the network learns to ignore noise
**Sparse AE** - adds L1 regularisation on the z activations, forcing most units to zero and producing a more interpretable representation
**Contractive AE** - adds regularisation on the encoder Jacobian, making z robust to small changes in x

None of these solves the main problem of a plain AE: the latent space can be 'holey', so a random z away from the training examples decodes to garbage.
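
The three variants differ only in what is fed to the network or added to the loss, which a short sketch can make concrete. The function names, the penalty weights, and the single sigmoid-layer encoder used for the contractive term are assumptions for illustration; they are not taken from a specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_pair(x, noise_std, rng):
    """Denoising AE: corrupt the input, keep the clean x as the target."""
    return x + noise_std * rng.standard_normal(x.shape), x

def sparse_penalty(z, weight):
    """Sparse AE: L1 penalty on latent activations, averaged over the batch."""
    return weight * float(np.mean(np.sum(np.abs(z), axis=1)))

def contractive_penalty(x, w, b, weight):
    """Contractive AE: squared Frobenius norm of dz/dx for z = sigmoid(x @ w + b)."""
    z = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # (batch, latent)
    dz_sq = (z * (1.0 - z)) ** 2             # squared sigmoid derivative per unit
    w_sq = np.sum(w ** 2, axis=0)            # (latent,), summed over input dims
    return weight * float(np.mean(dz_sq @ w_sq))

x = rng.normal(size=(4, 3))                  # toy batch: 4 samples, 3 features
w = rng.normal(size=(3, 2))                  # hypothetical encoder weights
b = np.zeros(2)

x_noisy, target = denoising_pair(x, noise_std=0.1, rng=rng)
z = 1.0 / (1.0 + np.exp(-(x @ w + b)))
extra_loss = sparse_penalty(z, weight=1e-3) + contractive_penalty(x, w, b, weight=1e-2)
```

In each case the penalty is added to the usual reconstruction loss; none of the terms says anything about where codes sit in latent space as a whole, which is why the 'holes' remain.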

What happens to an autoencoder if the bottleneck is removed (latent_dim equals input size)?

Latent Space

The latent space z of an autoencoder is a compressed map of input data. After training on MNIST a 32-dimensional z encodes digit attributes: stroke thickness, slant, loop shape. Interpolation in latent space shows smooth transitions: a point between 'three' and 'eight' in latent space decodes to a hybrid morphing between them. This is more than compression - the network is learning a representation of data structure.
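
Interpolation itself is just a straight line between two latent codes. The sketch below assumes two hypothetical 32-dimensional codes (random vectors here, standing in for the encodings of a 'three' and an 'eight'); a trained decoder, not shown, would turn each row into an image.

```python
import numpy as np

def interpolate(z_a, z_b, steps):
    """Return `steps` latent vectors evenly spaced on the segment z_a -> z_b."""
    t = np.linspace(0.0, 1.0, steps)[:, None]   # column of mixing weights
    return (1.0 - t) * z_a + t * z_b

# Hypothetical 32-D latent codes; in practice these come from the encoder.
rng = np.random.default_rng(0)
z_three = rng.normal(size=32)
z_eight = rng.normal(size=32)

path = interpolate(z_three, z_eight, steps=9)   # shape (9, 32)
# Each row of `path` would be passed to the trained decoder to render one frame.
```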

Problem with plain AE: latent space is **unstructured**. Neighbouring points in z may decode into entirely different images, and large regions of the space contain no training points - a random z in those 'holes' decodes to noise. Plain AE is therefore a poor generator: sampling z ~ N(0, I) and decoding almost always yields garbage. VAE solves precisely this - it forces the latent distribution to be smooth, filled, and close to N(0, I).

Why is a plain autoencoder poorly suited for generating new images?

KL Divergence and VAE

**Variational Autoencoder (VAE)** by Kingma & Welling (2013) solves the holey-latent-space problem with a radically different approach. The encoder outputs not a point z, but parameters of a distribution - mu and sigma of a normal distribution q(z|x). From this distribution z ~ N(mu, sigma^2) is sampled and fed into the decoder. Loss has two parts: (1) reconstruction loss as before, (2) KL divergence KL(q(z|x) || N(0, I)) - a penalty for the latent distribution deviating from standard normal. This penalty makes latent space smooth and continuous.

ELBO loss: reconstruction + KL divergence to the prior p(z) = N(0, I).
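
Written out as a per-example loss to minimise, with q(z|x) for the encoder as above and assuming p(x|z) denotes the decoder's likelihood (a symbol introduced here for the formula), the objective can be stated as:

```latex
\mathcal{L}(x)
  = \underbrace{-\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction}}
  + \underbrace{\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{regularisation}},
  \qquad p(z) = \mathcal{N}(0, I),
  \qquad -\mathcal{L}(x) \le \log p(x)
```

The negated loss is the ELBO. With a Gaussian decoder of fixed variance the reconstruction term reduces to MSE up to constants; with a Bernoulli decoder it becomes binary cross-entropy.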

The reparameterization trick is the key technical move in VAE: to push gradients through the stochastic sampling z ~ N(mu, sigma^2), write z = mu + sigma * epsilon, where epsilon ~ N(0, I) is random noise. Now gradients flow through mu and sigma (deterministic outputs of the encoder), and epsilon is treated as an external noise input that needs no gradient. Without this trick, backprop cannot pass through the sampling step directly; the alternative is a score-function estimator, which has much higher variance. The KL divergence between the diagonal Gaussian q(z|x) = N(mu, sigma^2) and the prior N(0, I) has a closed form, summed over latent dimensions: KL = 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
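
Both formulas in this paragraph fit in a few lines of NumPy. One assumption goes beyond the text: the encoder is taken to output `log_var` = log sigma^2 rather than sigma, a common choice that keeps sigma positive. The example values of `mu` and `log_var` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); also return eps."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps, eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

mu = np.array([0.5, -1.0])
log_var = np.log(np.array([0.25, 2.0]))   # sigma^2 = 0.25 and 2.0

z, eps = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)   # approximately 1.0966
```

Since z is an ordinary arithmetic expression in mu and sigma, its derivatives are immediate: dz/dmu = 1 and dz/dsigma = eps. The KL term is zero exactly when mu = 0 and sigma = 1, that is, when q(z|x) already equals the prior.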

Why does VAE use the reparameterization trick z = mu + sigma * epsilon instead of sampling z ~ N(mu, sigma^2) directly?

Generation with VAE

After training, VAE generation is a one-liner: z ~ N(0, I) -> decoder(z) -> new image. KL regularisation keeps latent space close to standard normal, so random points decode into meaningful images (rather than noise as in plain AE). VAE gives principled generative capabilities: conditional generation (Conditional VAE - feed class c alongside z), interpolation between specific images via linear combinations of their mu vectors, and attribute arithmetic in latent space.
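
The sampling step, including the conditional variant, can be sketched as follows. The decoder here is a single random linear layer with a sigmoid, used only so the shapes can be checked; its weights, the 16-D latent size, and the 784-pixel output are assumptions, and a real decoder would be the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, num_classes, image_dim = 16, 10, 784

# Stand-in for trained decoder weights (random, for shape-checking only).
w = 0.01 * rng.normal(size=(latent_dim + num_classes, image_dim))

def decode(z, class_id):
    """Conditional decoding: the one-hot class c is concatenated to z."""
    onehot = np.eye(num_classes)[class_id]             # (batch, num_classes)
    h = np.concatenate([z, onehot], axis=1) @ w
    return 1.0 / (1.0 + np.exp(-h))                    # pixel intensities in (0, 1)

z = rng.standard_normal((4, latent_dim))               # z ~ N(0, I)
images = decode(z, np.array([3, 3, 8, 8]))             # two 'threes', two 'eights'
```

Dropping the one-hot input gives the unconditional case: draw z from N(0, I) and decode it.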

VAE vs GAN - the two main generative models before diffusion. VAE: easier to train, stable convergence, explicit probabilistic interpretation (ELBO), but images come out 'blurry', because a pixel-wise reconstruction loss rewards the decoder for outputting an average of the plausible images rather than one sharp sample. GAN: sharp images, but unstable training (mode collapse, vanishing gradients) and no explicit likelihood. Diffusion models (DALL-E 2, Stable Diffusion) are iterative denoising processes that give better quality and stability at the cost of speed (dozens of forward passes per image). VAE remains relevant as a latent module: Stable Diffusion uses a VAE internally to compress images.

Misconception: autoencoder and VAE are the same model with different regularisation

Correction: an autoencoder is a deterministic compression function; a VAE is a probabilistic generative model

AE minimises only reconstruction loss and outputs a point z; VAE models p(x, z) and optimises the ELBO (a lower bound on log-likelihood). The distinction is fundamental: AE is feature learning, VAE is density modelling. VAE can sample new images; AE cannot (without tricks).

VAE-generated MNIST images look 'blurry'. What is the fundamental reason?

Key Ideas

**Autoencoder** - encoder + decoder with a narrow bottleneck, trained to copy input; the value is the latent representation z learned through compression
**Latent space** of plain AE is unstructured: random z decodes to noise, which makes plain AE a poor generator
**VAE** replaces a point z with a distribution q(z|x) = N(mu, sigma^2) and adds KL divergence to N(0, I) - making latent space smooth and suitable for generation
**Reparameterization trick** z = mu + sigma * epsilon is the technical key to VAE, making stochastic sampling differentiable by isolating randomness in epsilon

Questions for reflection

VAE produces 'blurry' images, GAN produces sharp but unstable ones, diffusion produces sharp and stable but slow. What trade-offs justify using each model in 2025?
The reparameterization trick solved a specific problem - differentiability of stochastic sampling. What other ML tasks could benefit from similar 'externalising of randomness'?
If VAE latent space allows meaningful interpolation and arithmetic, can it be used to understand the internal representations of neural networks in general? What are the limits of this approach?

Related lessons

dl-12 — Distributed training - context for scaling VAEs
dl-14 — GAN - alternative generative framework, contrast with VAE
it-03 — KL divergence in the VAE loss - straight from information theory
prob-04-bayes — VAE - Bayesian inference: prior, likelihood, posterior
la-13-eigenvectors
