Machine Learning
Autoencoders and VAE
What if a neural network could compress images better than JPEG? A standard compression algorithm uses the same rules for all images - photographs, drawings, diagrams. An autoencoder takes a different approach: it looks at thousands of examples and learns on its own to find the most compact representation, keeping only what truly matters and discarding the rest. From 784 pixels of a digit it retains 32 numbers - and reconstructs the image from them with almost no loss. How does it do that?
- **Fraud detection** - an autoencoder is trained on normal bank transactions, and when a fraudulent transaction is presented as input, the reconstruction error spikes sharply: the model doesn't know how to reconstruct what it has never seen, and this becomes the alarm signal
- **Face and image generation** - a VAE trained on millions of photos creates a continuous latent space in which you can smoothly change age, facial expression, and head rotation by moving just a few numbers in the z vector
- **Denoising in medical imaging** - a Denoising Autoencoder removes noise from X-rays and MRI scans while preserving diagnostically important details, allowing doctors to see pathologies hidden behind artifacts
Prerequisites
From deep autoencoders to the variational leap
In 2006 Geoffrey Hinton and Ruslan Salakhutdinov published a Science paper showing that a deep autoencoder, carefully pretrained layer by layer, could compress data into a low-dimensional code far better than principal component analysis. It was one of the results that helped revive interest in deep networks. The next big step came in 2013, when Diederik Kingma and Max Welling introduced the Variational Autoencoder, which treats the latent code as a probability distribution rather than a fixed point. That change turned the autoencoder from a compression tool into a true generative model that can be sampled from, and it remains a cornerstone of generative modeling today.
Encoder-Decoder architecture
An autoencoder is a neural network that learns to **reproduce its own input at the output**. Sounds pointless? The secret is that between input and output there is a narrow passage - a **bottleneck**. The network consists of two parts: the **encoder** compresses the input into a small representation z, and the **decoder** tries to reconstruct the original data from this compressed code. If reconstruction succeeds - the network has learned to find a compact representation that preserves the essence of the data.
Why is this useful? The encoder learns to **extract the most important features** from the data. If a 28x28 image (784 pixels) can be encoded into 32 numbers and then reconstructed - it means those 32 numbers capture the essence of the image. This is **learned compression**: unlike JPEG, which uses fixed rules, an autoencoder adapts to the specific type of data and finds patterns unique to it.
**Why the bottleneck is needed:** Without a bottleneck (if latent size >= input size), the network can simply copy the input to the output via an identity mapping. No useful learning will happen.

**The bottleneck forces the network to:**

- Discard noise and insignificant details
- Find patterns and correlations between features
- Create a compact, informative representation

**Analogy:** suppose you need to describe a photograph in just 10 words. You would pick the most important: "man in red jacket against mountain backdrop". The bottleneck forces the network to do the same - choose the essence.
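As a rough illustration of the encoder-bottleneck-decoder shape, here is a minimal NumPy sketch of a single forward pass. The sizes 784 and 32 come from the text; the single-layer design, the ReLU and sigmoid activations, the random untrained weights, and the names `encode` and `decode` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_DIM = 784   # 28x28 image flattened, as in the text
LATENT_DIM = 32   # bottleneck size, as in the text

# Randomly initialised (untrained) weights; training would adjust these.
W_enc = rng.normal(0.0, 0.01, size=(INPUT_DIM, LATENT_DIM))
b_enc = np.zeros(LATENT_DIM)
W_dec = rng.normal(0.0, 0.01, size=(LATENT_DIM, INPUT_DIM))
b_dec = np.zeros(INPUT_DIM)

def encode(x):
    """Compress a batch of inputs into latent codes z (ReLU activation)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(z):
    """Map latent codes back to pixel space (sigmoid keeps outputs in (0, 1))."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec + b_dec)))

x = rng.random((4, INPUT_DIM))   # a batch of 4 fake "images"
z = encode(x)                    # shape (4, 32): the bottleneck
x_rec = decode(z)                # shape (4, 784): the reconstruction
```

Because the weights are untrained, `x_rec` is not yet a meaningful reconstruction; the point is only that all information about `x` has to pass through the 32 numbers in `z`.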
The autoencoder's loss function is the **reconstruction error**: how much the output X' differs from the input X. **MSE** (Mean Squared Error) is most commonly used: the mean of (x_i - x_i')^2 over all elements. For binary data (black-and-white images) **Binary Cross-Entropy** works better. Training proceeds via standard backpropagation through both parts of the network simultaneously.
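The two reconstruction losses can be written in a few lines of NumPy. This is a sketch rather than a reference implementation: the function names and the `eps` clipping constant are choices made here, and a deep learning framework would normally supply its own versions.

```python
import numpy as np

def mse(x, x_rec):
    """Mean of squared element-wise differences."""
    return float(np.mean((x - x_rec) ** 2))

def binary_cross_entropy(x, x_rec, eps=1e-7):
    """Mean BCE for targets in [0, 1]; eps keeps log() away from zero."""
    x_rec = np.clip(x_rec, eps, 1.0 - eps)
    return float(-np.mean(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec)))

x = np.array([0.0, 1.0, 1.0, 0.0])       # binary "pixels"
good = np.array([0.1, 0.9, 0.8, 0.2])    # close reconstruction
bad = np.array([0.9, 0.1, 0.2, 0.8])     # poor reconstruction

mse_good, mse_bad = mse(x, good), mse(x, bad)
bce_good, bce_bad = binary_cross_entropy(x, good), binary_cross_entropy(x, bad)
```

Both losses are lower for `good` than for `bad`, which is all the optimizer needs: a number that shrinks as the reconstruction improves.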
Applications of the encoder-decoder architecture go far beyond compression. The encoder can be used separately as a **feature extractor**: representations learned on large data are often better than hand-crafted features. This is especially valuable for tasks with little labeled data - first train the autoencoder on unlabeled data (unsupervised pretraining), then use the encoder as a starting point for a classifier.
What happens if the latent layer (bottleneck) size of an autoencoder equals or exceeds the input size?
Latent space
The autoencoder's bottleneck creates a **latent space** - a hidden low-dimensional space in which data is represented in compressed form. Each point in this space - a vector z - encodes one example from the training set. A surprising property: **similar inputs end up close together** in the latent space. All 5s from MNIST cluster in one region, all 3s in another. The network discovers data structure on its own, without supervision.
An autoencoder performs **nonlinear dimensionality reduction**. Classic PCA also compresses data, but only finds linear dependencies. An autoencoder with nonlinear activations (ReLU, Sigmoid) can discover **complex nonlinear structures** in data. If you use an autoencoder with only linear layers and MSE loss, the optimal solution spans the same subspace as PCA, so the two are equivalent as compressors - but with nonlinearities the autoencoder is significantly more powerful.
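The link to PCA can be made concrete with a small sketch. Here PCA (computed with an SVD) is read as a linear encoder and decoder sharing one weight matrix, and compared with a random subspace of the same size. The toy data, the dimensions, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 10-D that mostly vary along 2 hidden directions.
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)            # PCA assumes centred data

k = 2
# Rows of Vt are principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                    # (10, k): acts as linear encoder weights

Z = X @ V_k                       # "encode": project onto the top-k subspace
X_rec = Z @ V_k.T                 # "decode": map back with the transposed weights
pca_error = np.mean((X - X_rec) ** 2)

# Baseline: a random k-dimensional subspace reconstructs worse.
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
random_error = np.mean((X - X @ Q @ Q.T) ** 2)
```

A linear autoencoder trained with MSE converges toward the same subspace as `V_k`, although its individual latent axes need not line up with the principal components.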
**Latent space dimension - the key trade-off:**

- **Too small** (2-5 for MNIST): the network can't remember all details, reconstructions are blurry. But the space is well-structured and easy to visualize.
- **Optimal** (16-64 for MNIST): good balance between reconstruction quality and compression. The network learns meaningful features.
- **Too large** (200+ for MNIST): excellent reconstruction, but the latent space is chaotic with lots of redundancy. The network is not forced to find a compact representation.

Rule of thumb: start with latent_dim roughly 10-50x smaller than the input and tune by reconstruction quality.
One of the most interesting properties of the latent space is **interpolation**. If you take two images, encode them to points z1 and z2, then walk uniformly from z1 to z2 in the latent space, decoding intermediate points - you get smooth transitions between images. A 3 gradually transforms into an 8, a smile into a serious face. This works because the latent space is continuous: nearby points decode into similar images.
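The walk from z1 to z2 is plain linear interpolation. A minimal sketch follows, assuming the latent vectors are NumPy arrays; the helper name `interpolate` and the 3-dimensional example codes are made up for illustration.

```python
import numpy as np

def interpolate(z1, z2, steps):
    """Return `steps` evenly spaced latent vectors from z1 to z2 inclusive."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.array([(1.0 - t) * z1 + t * z2 for t in ts])

z1 = np.array([0.0, 2.0, -1.0])   # hypothetical code of the first image
z2 = np.array([4.0, 0.0, 1.0])    # hypothetical code of the second image
path = interpolate(z1, z2, steps=5)
# Each row of `path` would be passed to the decoder to render one frame.
```

Decoding the rows of `path` in order gives the sequence of intermediate images described above.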
Latent representations are useful not only for visualization. The vector z can be used as an **input feature for downstream tasks**: classification, clustering, similarity search. Autoencoders are trained on unlabeled data (unsupervised) and extract features that then help solve tasks with few labels. This approach is called **representation learning**.
How is an autoencoder's latent space fundamentally different from PCA as a dimensionality reduction method?
Variational Autoencoder (VAE)
A regular autoencoder compresses data well, but its latent space has a serious problem: it is **non-uniform and has holes**. Between digit clusters in the latent space there can be "dead zones" where no real point landed during training. Decoding a random point from such a zone produces garbage, resembling nothing. This means a regular autoencoder **cannot generate new data** - it only compresses and reconstructs existing data.
**VAE (Variational Autoencoder)** solves this problem elegantly: instead of encoding the input to a point z, the encoder outputs **distribution parameters** - mean (mu) and standard deviation (sigma). Each input is described not by a single point but by a "cloud" of possible codes. This forces clouds of different examples to overlap, filling the holes in the latent space.
**Reparameterization trick - the key to training VAE:**

Problem: if z is a random variable, how do we backpropagate through it?

Solution - reparameterization: `z = mu + sigma * epsilon`, where `epsilon ~ N(0, 1)`

- `mu` and `sigma` - encoder outputs (learnable)
- `epsilon` - random noise from standard normal distribution
- Gradient w.r.t. `mu` and `sigma` is computed normally
- Stochasticity is isolated in `epsilon` (independent of parameters)

This trick allows training the VAE with standard backpropagation despite the random sampling.
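The formula `z = mu + sigma * epsilon` translates directly into code. The sketch below uses NumPy only to show the sampling step; the fixed `mu` and `sigma` values stand in for encoder outputs, and the function name `reparameterize` is chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * epsilon with epsilon ~ N(0, 1)."""
    epsilon = rng.standard_normal(mu.shape)  # all randomness lives here
    return mu + sigma * epsilon

mu = np.array([1.0, -2.0])     # stand-in for the encoder's mean output
sigma = np.array([0.5, 0.1])   # stand-in for the encoder's std output
samples = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
```

Since `z` is a deterministic function of `mu` and `sigma` once `epsilon` is drawn, the derivative of `z` with respect to `mu` is 1 and with respect to `sigma` is `epsilon`, which is what lets an autodiff framework pass gradients through the sampling step.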
The VAE loss function has two parts: **reconstruction loss** (how well data is reconstructed) and **KL-divergence** (how much the latent distribution deviates from the standard normal N(0,1)). The KL-loss acts as a **regularizer**: it nudges all encodings toward the origin and unit variance, preventing holes and isolated clusters. The balance between the two losses determines quality: too strong KL - blurry reconstructions, too weak - holes in latent space.
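For a diagonal Gaussian encoder, the KL term has a closed form, so the two-part loss can be sketched as follows. The weight `beta` is an assumption added here to make the balance between the two terms explicit (`beta = 1` gives the plain sum), and summed squared error is just one possible reconstruction term.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return float(0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2)))

def vae_loss(x, x_rec, mu, sigma, beta=1.0):
    """Reconstruction term (summed squared error) plus beta-weighted KL term."""
    reconstruction = float(np.sum((x - x_rec) ** 2))
    return reconstruction + beta * kl_to_standard_normal(mu, sigma)

# The KL term vanishes only when the code distribution is exactly N(0, 1).
kl_at_prior = kl_to_standard_normal(np.zeros(4), np.ones(4))
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

Raising `beta` pulls the codes harder toward N(0,1) at the cost of reconstruction quality; lowering it does the opposite.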
The main advantage of VAE is the **ability to generate**. Since the KL-loss pushes the latent space toward a standard normal distribution, you can simply sample z from N(0,1) and decode - yielding a new, previously non-existent but realistic image. You can also control generation: change individual coordinates of z and observe how the output changes. One coordinate might control the tilt of a digit, another the stroke thickness, another the writing style.
What role does KL-divergence play in the VAE loss function?
Denoising and practical applications
A **Denoising Autoencoder (DAE)** is a variant that addresses a specific problem: a regular autoencoder may learn a "lazy" mapping, simply copying the input. DAE fights this elegantly: the input is a **corrupted (noisy) version of the data**, while the expected output is the **clean original**. The network is forced to learn not to copy pixels but to **understand the data structure** so it can distinguish signal from noise.
Why does DAE learn better than a regular autoencoder? Adding noise is a form of **regularization**. The network can't just memorize the input because the noise is different every time. It is forced to learn **robust features** - those that survive in the presence of noise. These features reflect the true structure of the data. That's why DAE is often used for **pretraining**: features robust to noise turn out to be useful for any downstream tasks.
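The data side of a denoising autoencoder is simple to sketch: corrupt the input, keep the clean original as the target. The Gaussian noise model, the noise level 0.3, and the helper name `corrupt` are assumptions for this example; masking or salt-and-pepper noise are other common choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std, rng):
    """Add Gaussian noise and clip back to the valid pixel range [0, 1]."""
    noisy = x + noise_std * rng.standard_normal(x.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = rng.random((8, 784))          # stand-in for a batch of clean images
noisy_a = corrupt(clean, 0.3, rng)    # fresh noise on every call,
noisy_b = corrupt(clean, 0.3, rng)    # so no two corruptions are identical

# Training pair for a denoising autoencoder: noisy input, clean target.
model_input, target = noisy_a, clean
```

The reconstruction loss is then computed between the network's output for `model_input` and `target`, never against the noisy version.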
**Other variants of autoencoders:**

- **Sparse Autoencoder:** adds L1 regularization on the activations of the latent layer. Most neurons in z are pushed toward zero - each input activates only a small subset. Result: each neuron specializes in a specific feature.
- **Contractive Autoencoder (CAE):** penalizes the sensitivity of the latent representation to small changes in input. The (squared) Frobenius norm of the encoder's Jacobian is added to the loss. Result: z changes little with small input noise - the model learns robust features.
- **Convolutional Autoencoder:** uses convolutional layers instead of fully connected. Encoder: Conv + Pooling (compression). Decoder: ConvTranspose or Upsampling (expansion). Standard for working with images.
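The sparse and contractive penalties are both extra terms added to the reconstruction loss. The sketch below assumes a single-layer sigmoid encoder `h = sigmoid(W @ x + b)`, for which the Jacobian has a simple closed form; the function names are illustrative, and the squared Frobenius norm is used, as is common in practice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparse_penalty(h):
    """L1 penalty on latent activations: sum of absolute values."""
    return float(np.sum(np.abs(h)))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W @ x + b)."""
    h = sigmoid(W @ x + b)
    # Row j of the Jacobian is h_j * (1 - h_j) * W[j, :].
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W**2, axis=1)))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 5 inputs -> 3 latent units
b = np.zeros(3)
x = rng.normal(size=5)

l1 = sparse_penalty(sigmoid(W @ x + b))
contractive = contractive_penalty(x, W, b)
```

In training, each penalty would be multiplied by a small coefficient and added to the reconstruction error.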
Practical applications of autoencoders go far beyond image generation.

- **Anomaly detection:** train an autoencoder on normal data, then measure the reconstruction error for new inputs. If the error is high - the input is anomalous, the model can't reconstruct it. This approach is used to detect fraudulent transactions, manufacturing defects, and network intrusions.
- **Data compression:** encoder reduces size, decoder reconstructs.
- **Pretraining:** encoder extracts features from unlabeled data, then is fine-tuned on labeled data.
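The anomaly detection recipe can be sketched end to end on toy data. To keep the example short and deterministic, a linear autoencoder fitted with an SVD stands in for a trained network; the synthetic data, the 99th-percentile threshold, and the names `reconstruction_error` and `is_anomaly` are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data: 500 points in 5-D lying close to a 2-D plane.
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 5))
mean = normal.mean(axis=0)

# Stand-in for a trained autoencoder: a linear one fitted via SVD (PCA).
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
V = Vt[:2].T

def reconstruction_error(x):
    """Per-sample mean squared error after encode -> decode."""
    centred = x - mean
    rec = centred @ V @ V.T
    return np.mean((centred - rec) ** 2, axis=1)

# Threshold: 99th percentile of errors on normal data (an assumed choice).
threshold = np.percentile(reconstruction_error(normal), 99)

def is_anomaly(x):
    """Flag samples whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x) > threshold

# An outlier: a point pushed far off the plane the model has learned.
off_plane = (mean + 5.0 * Vt[2])[None, :]
flagged = is_anomaly(off_plane)
```

The model reconstructs points near the learned plane well and fails on `off_plane`, so its error lands far above the threshold. With a real autoencoder the structure is the same; only `reconstruction_error` changes.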
Autoencoders became the foundation for an entire family of models. VAEs helped open the path to deep generative modeling, a field that also produced GANs and diffusion models. The denoising autoencoder's core idea resurfaces in **Denoising Diffusion Probabilistic Models** (DDPM) - the family of models behind systems such as Stable Diffusion and DALL-E 2. The idea is similar: train a network to remove noise, then iteratively transform pure noise into an image.
**Misconception:** Autoencoders are only for generating images
**Reality:** Autoencoders are a powerful tool for data compression, anomaly detection, pretraining, and denoising with any type of data
Image generation is just one application (mostly VAE). Regular autoencoders are widely used for fraud detection (anomalous transactions have high reconstruction error), data compression (encoder as a compressor), pretraining neural networks on unlabeled data, and denoising (denoising autoencoder). They work with tables, time series, text, audio - not just images.
How can an autoencoder detect anomalies if it was trained only on normal data?
Key ideas
- **Encoder-Decoder architecture:** an autoencoder compresses input through a bottleneck and reconstructs it at the output - the bottleneck forces the network to learn the most important features, discarding noise and insignificant details
- **Latent space:** the bottleneck creates a low-dimensional data representation where similar inputs cluster together - a nonlinear analogue of PCA that enables visualization and interpolation between examples
- **VAE for generation:** a variational autoencoder encodes data into distributions (mu and sigma) instead of points, and KL-divergence aligns the latent space with N(0,1) - the result: a continuous space without holes from which new data can be generated
- **Denoising and variants:** adding noise during training forces the network to learn robust features; sparse regularization gives interpretability; contractive penalty gives stable representations
- **From compression to generation:** autoencoders started as a compression tool - from 784 pixels to 32 numbers - but grew into a foundation of generative AI, from VAE to the diffusion models behind Stable Diffusion and DALL-E 2
Related topics
Autoencoders sit at the intersection of information compression and generative modeling, linking classical dimensionality reduction methods to modern generative architectures:
- GAN (Generative Adversarial Networks) — An alternative approach to data generation: while VAE learns through reconstruction and KL-divergence, GAN uses a competition between a generator and a discriminator. GANs often produce sharper images, but VAE provides a more structured latent space
- PCA (Principal Component Analysis) — The linear predecessor of autoencoders: PCA finds optimal linear projections of data, and a linear autoencoder with MSE loss is mathematically equivalent to PCA. Nonlinear autoencoders extend this idea to complex nonlinear structures
Questions for reflection
- Why does VAE generate blurrier images than GAN? How are reconstruction loss, KL-divergence, and generation quality related - and what could be changed in the architecture to improve sharpness?
- If you train an autoencoder on photos of cats and then feed it a photo of a dog, what happens to the reconstruction error and why? How is this property used in practice?
- A denoising autoencoder and a Denoising Diffusion Probabilistic Model (DDPM) both use the same idea - training on noise removal. What is so different about DDPM's approach that allows it to generate photographic-quality images?
Related lessons
- ml-31-transformers — Builds on deep network and encoder concepts
- ml-33-gan — Both are generative, different training signal
- ml-19-pca — Linear autoencoder recovers PCA subspace
- la-15-svd — PCA bottleneck relates to SVD factorization
- ml-20-anomaly-detection — Reconstruction error flags anomalies
- stat-14-pca