Machine Learning

Generative Adversarial Networks (GAN)

In 2014, a PhD student at the University of Montreal named Ian Goodfellow was arguing with friends at a bar about how to make a neural network generate realistic images. Every idea seemed too complicated. Then it hit him: what if you pitted two neural networks against each other - one creates fakes, the other exposes them? He went home, wrote the code in a single night, ran it - and it worked on the very first try. That was the birth of GANs - generative adversarial networks, which today create photorealistic faces of people who have never existed.

  • **Face generation (ThisPersonDoesNotExist.com)** - StyleGAN creates photorealistic portraits of non-existent people at 1024x1024 resolution that are often hard to tell apart from real photographs, from skin texture to reflections in the eyes
  • **Super-resolution and restoration** - ESRGAN upscales images (4x in the original model), synthesizing plausible detail that was not present in the input: from a blurry 64x64 photo it produces a sharp 256x256 one. Similar models are applied in medical imaging and the restoration of old films
  • **Data augmentation in medicine** - GANs generate synthetic X-rays and MRIs to train diagnostic models, addressing the shortage of labeled medical data and patient privacy requirements

Prerequisites

  • Autoencoders and VAE

A bar argument that became a generative revolution

The story goes that Ian Goodfellow sketched the idea for generative adversarial networks during an argument in a Montreal bar in 2014: pit two networks against each other, one generating fakes and one trying to catch them, and let the contest push the generator toward realism. He coded a working prototype that same night, and the 2014 paper launched a whole field. In 2015 Radford and colleagues introduced DCGAN, which made GANs train stably with convolutional architectures and clear design rules. By 2018 NVIDIA's StyleGAN, led by Tero Karras, was producing photorealistic human faces so convincing that telling them from real photographs became genuinely hard.

Generator: Creating Data from Noise

The generator is a neural network that takes a **random vector z** (latent vector) drawn from a normal distribution and transforms it into synthetic data - for example, an image. Think of a counterfeiter who receives random ink blotches and learns to turn them into convincing banknotes. At the start of training, the generator produces meaningless noise, but gradually learns to create increasingly realistic data.

**Latent space** is the space of all possible input vectors z. Each point in this space corresponds to some output image. A remarkable property of a well-trained GAN: *similar points in latent space produce similar images*. If z1 generates a smiling face, then z1 shifted slightly will produce a face with a slightly different smile. This allows smooth interpolation between images.
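The interpolation idea can be sketched in a few lines. This is a minimal illustration using only the standard library: the 8-dimensional latent size is an arbitrary choice for the example, and the trained generator that would turn each point into an image is not shown.

```python
import random

def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors: t=0 gives z_a, t=1 gives z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

rng = random.Random(0)
latent_dim = 8  # illustrative only; real GANs often use 100 or 512 dimensions

# Two latent vectors drawn from a standard normal distribution.
z1 = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

# Five evenly spaced points on the straight line from z1 to z2.
path = [lerp(z1, z2, i / 4) for i in range(5)]
```

Feeding each point of `path` to a well-trained generator would be expected to give a gradual morph from the first image to the second.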

**Why Tanh at the generator output?** Tanh produces values from -1 to +1. Images are normalized to the same range. This provides:

  • **Symmetric range** - centered around zero, which helps training stability
  • **Bounded output** - pixel values don't diverge to infinity
  • **Compatibility** - the discriminator receives real and fake images on the same scale

The alternative is Sigmoid (0 to 1), but Tanh generally gives better stability when training GANs.
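To make the scale matching concrete, here is a small sketch of the usual normalization for 8-bit images. The helper names are made up for this example, and the mapping `pixel / 127.5 - 1` is one common convention rather than the only option.

```python
import math

def to_tanh_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0

def from_tanh_range(value):
    """Map a generator output in [-1, 1] back to an 8-bit pixel value."""
    return int(round((value + 1.0) * 127.5))

# Tanh squashes any raw activation into (-1, 1), so generator outputs
# always live on the same scale as the normalized real images.
raw_activation = 3.7
bounded = math.tanh(raw_activation)
```

Real images go through `to_tanh_range` before reaching the discriminator, and generated images go through `from_tanh_range` only when they are saved or displayed.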

The generator never sees real data directly. It learns only through **gradients from the discriminator**. When the discriminator says "this is fake," the generator receives a signal indicating which direction to update its weights so it can fool the discriminator next time. It's like a counterfeiter who has never seen real money but gets feedback from a counterfeit detector: "the watermark is off here," "the ink color is wrong."

What is the input to the generator in a GAN?

Discriminator: The Fake Detector

The discriminator is a **binary classifier** that takes an image (real or generated) and outputs the probability that it is genuine. If the generator is a counterfeiter, the discriminator is a detective specializing in catching fakes. Its job: learn to distinguish real data from the training set from fakes created by the generator.

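Discriminators typically use **LeakyReLU** instead of regular ReLU in their hidden layers. Standard ReLU zeros out all negative values, which can lead to "dead neurons" - neurons through which the gradient stops flowing. LeakyReLU lets a small negative signal pass through (typically 0.2 * x for x < 0), which keeps gradients flowing back to the generator and stabilizes GAN training. The use of **BatchNorm** in the discriminator varies: simple discriminators often omit it, and WGAN with gradient penalty avoids it in the critic, while DCGAN applies it to every discriminator layer except the input layer.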
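The difference between the two activations is easiest to see in their gradients. The sketch below writes both functions and their derivatives by hand for a single scalar. The slope of 0.2 follows the value quoted above, and the function names are chosen for this example.

```python
def relu(x):
    """Standard ReLU: negative inputs are zeroed out."""
    return x if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU: negative inputs are scaled down instead of zeroed."""
    return x if x > 0 else negative_slope * x

def relu_grad(x):
    """Gradient of ReLU: exactly zero for negative inputs (a 'dead' unit)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, negative_slope=0.2):
    """Gradient of LeakyReLU: small but non-zero for negative inputs."""
    return 1.0 if x > 0 else negative_slope
```

For a negative pre-activation such as -2.0, `relu_grad` returns 0.0 while `leaky_relu_grad` returns 0.2, so some learning signal still passes through the unit.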

**The adversarial game: generator vs discriminator** The two networks train simultaneously with opposing objectives:

  • **Discriminator** wants: D(real) = 1 (real is genuine), D(G(z)) = 0 (fake is counterfeit)
  • **Generator** wants: D(G(z)) = 1 (fool the discriminator into thinking fakes are real)

This is the **adversarial** approach: one network's success is the other's failure. The generator cannot directly improve its images - it can only try to fool the discriminator. And the discriminator cannot relax - the generator keeps getting better.
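These opposing targets are commonly turned into losses with binary cross-entropy. The sketch below assumes that formulation, which the text above does not spell out, and works on single probabilities rather than batches. The function names are illustrative.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for one probability and a 0/1 target."""
    eps = 1e-12  # guards against log(0)
    prediction = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1.0 - prediction))

def discriminator_loss(d_real, d_fake):
    """D is rewarded for D(real) -> 1 and D(G(z)) -> 0."""
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    """G is rewarded for D(G(z)) -> 1, i.e. for fooling D."""
    return bce(d_fake, 1)
```

As `d_fake` rises, `generator_loss` falls while `discriminator_loss` grows, which is the "one network's success is the other's failure" relationship in numeric form.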

Key difference from autoencoders: an autoencoder learns to *reconstruct* input data through a bottleneck, whereas a GAN learns to *generate* new data that is indistinguishable from real data. An autoencoder compares output to input directly (pixel-to-pixel loss), while a GAN uses the discriminator as a "trained critic." This is why GANs generate sharper, more realistic images - the discriminator notices blurriness and penalizes it.

What activation function is typically used in the hidden layers of a GAN discriminator, and why?

GAN Training Dynamics

GAN training is a **minimax game**: the generator minimizes and the discriminator maximizes the value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]. The first term measures how well D recognizes real data. The second measures how well D rejects fakes. D wants both terms maximized (correct classification), G wants the second term minimized (fooling D).
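The value function can be estimated from a batch of discriminator outputs by replacing each expectation with a sample mean. The probabilities below are made-up numbers for illustration, not measurements from a trained model.

```python
import math

def value_function(d_real, d_fake):
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs (probability that the input is real).
confident_d = value_function(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
fooled_d = value_function(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
```

A confident discriminator pushes V toward 0 from below, while the fully fooled one gives V = -log 4 (about -1.386), the value at which the generator has matched the data distribution.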

Ideally, training reaches **Nash equilibrium**: the discriminator outputs 0.5 for any input because it can no longer distinguish real data from generated data. In practice, this equilibrium is unstable. The two main enemies of GAN training are **mode collapse** and the **imbalance** between generator and discriminator.

**Mode collapse - the main GAN problem:** Mode collapse occurs when the generator finds a "loophole" - one type of image that reliably fools the discriminator - and starts generating only that.

Example: a GAN trained on faces generates only smiling female faces. A thousand different z vectors produce nearly identical images. The generator has "collapsed" into one mode of the data distribution, ignoring all its diversity.

Why it happens: it is more profitable for the generator to perfectly master one mode than to risk diversity. This is a rational strategy in the minimax game, but a disaster for generation quality.
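One rough way to spot this symptom is to measure how far apart generated samples are from each other. The sketch below is an ad hoc diagnostic written for this example, not a standard metric, and the 2-D points are made-up stand-ins for generated images.

```python
import math
from itertools import combinations

def mean_pairwise_distance(samples):
    """Average Euclidean distance over all pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Made-up outputs for four different z vectors.
healthy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed = [[0.5, 0.5], [0.5, 0.51], [0.51, 0.5], [0.5, 0.5]]

healthy_spread = mean_pairwise_distance(healthy)
collapsed_spread = mean_pairwise_distance(collapsed)
```

A spread that shrinks toward zero during training, while the z vectors stay diverse, suggests that the generator is mapping many inputs to nearly the same output.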

**Wasserstein GAN (WGAN) - solving the instability problem:** A standard GAN uses BCE loss, which can cause vanishing gradients when the discriminator becomes too strong. WGAN replaces the loss with an estimate of the **Wasserstein distance** (Earth Mover's Distance):

  • The discriminator ("critic") outputs an unbounded score rather than a probability - no Sigmoid
  • loss_D = D(fake) - D(real) (minimizing this loss maximizes the gap between real and fake scores)
  • loss_G = -D(fake) (the generator wants a high score)
  • **Weight clipping** (original WGAN) or a **gradient penalty** (WGAN-GP) enforces the Lipschitz constraint on the critic

WGAN provides a meaningful quality metric: the lower the estimated Wasserstein distance, the better the generator. In a standard GAN, the generator loss often does not correlate well with output quality.
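The two WGAN losses, and the clipping variant of the Lipschitz constraint, can be sketched with plain lists of critic scores. The scores and weights are made-up numbers. The clipping shown here clamps the critic's weights, with 0.01 as the threshold used in the original WGAN setup. The gradient penalty variant is not shown.

```python
def critic_loss(scores_real, scores_fake):
    """loss_D = mean D(fake) - mean D(real); minimizing it widens the gap."""
    mean_real = sum(scores_real) / len(scores_real)
    mean_fake = sum(scores_fake) / len(scores_fake)
    return mean_fake - mean_real

def generator_loss(scores_fake):
    """loss_G = -mean D(fake); minimizing it raises the critic's score for fakes."""
    return -sum(scores_fake) / len(scores_fake)

def clip_weights(weights, clip_value=0.01):
    """Clamp every critic weight to [-clip_value, clip_value] after an update."""
    return [max(-clip_value, min(clip_value, w)) for w in weights]

# Critic scores are unbounded real numbers, not probabilities.
loss_d = critic_loss(scores_real=[2.0, 3.0], scores_fake=[-1.0, 0.0])
loss_g = generator_loss(scores_fake=[-1.0, 0.0])
clipped = clip_weights([0.5, -0.003, -2.0])
```

The negated critic loss, mean D(real) - mean D(fake), is the quantity usually plotted as the Wasserstein estimate during training.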

What is mode collapse in the context of GAN training?

DCGAN: Deep Convolutional GANs

**DCGAN** (Deep Convolutional GAN) was the first architecture to systematize the rules for building stable convolutional GANs. Before DCGAN (2015), attempts to use CNNs in GANs often ended in unstable training. Alec Radford and colleagues identified a set of architectural guidelines that became a widely used baseline for many subsequent GANs.

**DCGAN architectural rules:**

  1. **No pooling** - replace Max Pooling with strided convolutions (discriminator) and transposed convolutions (generator). Let the network learn to downsample/upsample on its own.
  2. **Batch Normalization everywhere** - except the generator's output layer and the discriminator's input layer. BatchNorm stabilizes training and helps prevent mode collapse.
  3. **No fully-connected hidden layers** - only convolutional layers. A fully-connected layer is used only to project z into the initial tensor of the generator.
  4. **ReLU in the generator, LeakyReLU in the discriminator** - except at the generator output (Tanh) and discriminator output (Sigmoid).

**Transposed convolution** (sometimes called "deconvolution," though that term is mathematically imprecise) is the key operation in the DCGAN generator. A strided convolution reduces spatial resolution (32x32 -> 16x16 with stride 2), while a transposed convolution increases it (16x16 -> 32x32). Intuitively, it inserts zeros between pixels and applies a standard convolution, "stretching" the image. With stride=2 and a matching kernel size and padding (for example, kernel 4 and padding 1), the output is exactly twice the input size.
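The size arithmetic can be checked with the standard output-size formulas (no dilation, no output padding). The 1-D function below is a toy version of the operation written for this example. It uses the equivalent "scatter" view, in which each input value adds a scaled copy of the kernel to the output, instead of explicitly inserting zeros.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a regular convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_output_size(size, kernel, stride, padding):
    """Spatial size after a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel

def transposed_conv1d(signal, kernel, stride):
    """1-D transposed convolution without padding: every input value
    adds a scaled copy of the kernel to the output, `stride` positions apart."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

down = conv_output_size(32, kernel=4, stride=2, padding=1)
up = transposed_conv_output_size(16, kernel=4, stride=2, padding=1)
stretched = transposed_conv1d([1.0, 2.0], kernel=[1.0, 1.0], stride=2)
```

With kernel 4, stride 2 and padding 1, the two formulas are exact inverses for even input sizes, which is why this combination appears so often in DCGAN-style generators and discriminators.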

One of DCGAN's most striking discoveries was **arithmetic in latent space**. It turned out that latent vectors can be added and subtracted with semantically meaningful results: vector("man with glasses") - vector("man") + vector("woman") = vector("woman with glasses"). This suggested that a GAN doesn't simply memorize training images but learns a *meaningful* representation - the latent space is organized around semantic attributes.
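The arithmetic itself is plain component-wise addition and subtraction. The sketch below uses made-up 3-D vectors in which one axis stands for "glasses" and another for "woman". Real latent spaces are not aligned to axes this neatly, so this is only a toy picture of the idea. Averaging a few z vectors per concept, as done here, is a common way to reduce noise.

```python
def mean_vector(vectors):
    """Component-wise average of several latent vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def combine(a, b, c):
    """Return a - b + c, component-wise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Made-up latents: axis 0 ~ "glasses", axis 1 ~ "woman", axis 2 ~ everything else.
man_with_glasses = mean_vector([[1.0, 0.0, 0.2], [0.9, 0.1, -0.1], [1.1, -0.1, 0.2]])
man = mean_vector([[0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [-0.1, -0.1, 0.2]])
woman = mean_vector([[0.0, 1.0, 0.0], [0.1, 0.9, 0.3], [-0.1, 1.1, 0.0]])

woman_with_glasses = combine(man_with_glasses, man, woman)
```

In this toy setup the result has both the "glasses" and the "woman" components close to 1, which mirrors the behavior described above.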

Which architectural decisions are rules of DCGAN?

StyleGAN and Modern GANs

**StyleGAN** (2018, NVIDIA) is a revolutionary architecture that reimagined the GAN generator. Instead of feeding z directly into the convolutional layers, StyleGAN first passes z through a **mapping network** (8 fully-connected layers), transforming it into an intermediate vector w. This w vector is then "injected" into each layer of the generator via **Adaptive Instance Normalization (AdaIN)**, controlling the style at each level of detail.

**Adaptive Instance Normalization (AdaIN)** is the key mechanism in StyleGAN. For each layer it normalizes the feature maps (subtracts the mean, divides by std), then scales and shifts them using parameters derived from the style vector w. The formula: AdaIN(x, w) = scale(w) * normalize(x) + bias(w). This allows the w vector to independently control "style" at each level of detail.
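The formula can be sketched for a single channel. In StyleGAN the scale and bias come from a learned transformation of w. Here they are passed in directly as plain numbers, which is a simplification made for this example.

```python
import math

def adain(feature_map, scale, bias, eps=1e-8):
    """Normalize one channel's feature map to zero mean and unit std,
    then apply the style-derived scale and bias."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)  # eps avoids division by zero
    return [[scale * (v - mean) / std + bias for v in row] for row in feature_map]

channel = [[1.0, 2.0], [3.0, 4.0]]
styled = adain(channel, scale=2.0, bias=5.0)
```

After the call, the channel's mean equals the bias and its standard deviation equals the scale, regardless of the statistics it had before. This is how the style overrides the statistics of each layer's features.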

**Progressive growing: from 4x4 to 1024x1024** The original StyleGAN used **progressive growing** - a technique of gradually increasing resolution:

  1. Start training on 4x4 images (simple task, fast convergence)
  2. Add layers for 8x8, continue training
  3. Progressively grow to 16x16, 32x32, 64x64... 1024x1024

This solves a practical problem: training a GAN directly at 1024x1024 is very hard - too many details, too unstable. Progressive growing lets the network first learn coarse structure, then gradually add fine detail. **StyleGAN2** abandoned progressive growing in favor of a skip-connection generator and a residual discriminator, alongside path length regularization and other improvements.
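The resolution schedule is a simple doubling sequence. The sketch below also includes the fade-in blend that progressive growing uses when a new layer is added. The function names are illustrative, and the blend is applied to a single pixel value for brevity.

```python
def progressive_resolutions(start=4, final=1024):
    """Resolutions visited when each stage doubles the previous one."""
    resolutions = []
    size = start
    while size <= final:
        resolutions.append(size)
        size *= 2
    return resolutions

def fade_in(old_pixel, new_pixel, alpha):
    """Blend the upsampled output of the previous stage with the new layer's
    output; alpha ramps from 0 to 1 while the new layer is phased in."""
    return (1.0 - alpha) * old_pixel + alpha * new_pixel

stages = progressive_resolutions()
```

Going from 4x4 to 1024x1024 takes nine stages. The gradual blend keeps a freshly added, untrained layer from disrupting what the lower-resolution layers have already learned.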

**StyleGAN2** (2020) fixed artifacts present in the original StyleGAN (characteristic "droplet" patterns on images), removed progressive growing, and improved generation quality. **StyleGAN3** (2021) addressed "texture sticking" - the phenomenon where textures become attached to pixel coordinates instead of following the object's geometry.

Today the generative model landscape has shifted. **Diffusion models** (DALL-E 2, Stable Diffusion, Midjourney) have surpassed GANs in generation quality and diversity. They train more stably (no adversarial dynamics), are much less prone to mode collapse, and better cover the entire data distribution. However, GANs remain relevant: they generate images in a single forward pass (milliseconds), whereas diffusion models typically require dozens of denoising steps (seconds). For real-time applications - video, interactive systems - GANs are still often the preferred choice.

**Myth:** GANs are the best way to generate images and nothing can surpass them.

**Reality:** Diffusion models (DALL-E 2, Stable Diffusion) have surpassed GANs in quality and training stability, but GANs remain faster at inference and are essential for real-time applications.

**Why:** GANs suffer from mode collapse and unstable training. Diffusion models train stably, better cover the full data distribution, and generate more diverse results. However, diffusion models typically require dozens of denoising steps (seconds), while GANs generate in a single forward pass (milliseconds). The choice depends on the task: for maximum quality - diffusion, for speed - GAN.

What role does the mapping network play in StyleGAN?

Key Ideas

  • **The generator** takes a random vector z from a normal distribution and transforms it into a synthetic image through a series of layers - it never sees real data and learns only through gradients from the discriminator
  • **The discriminator** is a binary classifier that distinguishes real data from generated data, using LeakyReLU instead of ReLU to prevent dead neurons
  • **The minimax game** alternates between training D (to distinguish real from fake) and G (to fool D), but suffers from mode collapse and instability - WGAN with Wasserstein distance addresses some of these problems
  • **DCGAN** established architectural rules: strided convolutions instead of pooling, Batch Normalization, no fully-connected hidden layers - and revealed latent space arithmetic (man with glasses - man + woman = woman with glasses)
  • **StyleGAN** decoupled generation into a mapping network (z -> w) and a synthesis network with AdaIN, enabling style control at every level of detail - from pose to skin texture
  • Just like that night in 2014 when Goodfellow's idea of two competing networks worked on the first run, GANs demonstrated that the adversarial approach is a powerful training principle - though today diffusion models have surpassed GANs in quality, leaving them their advantage in inference speed

Related Topics

GANs sit at the intersection of generative models and adversarial approaches, connecting classical autoencoders with modern image generation:

  • Autoencoders — A predecessor to GANs in generative modeling: an autoencoder compresses data into latent space and reconstructs it back, while a GAN replaces the pixel-to-pixel reconstruction loss with a discriminator-critic, producing sharper and more realistic images
  • Image Classification — The GAN discriminator is really just a CNN classifier trained on a binary real/fake task. Architectural choices from classification (convolutions, BatchNorm, strided convolutions) are directly carried over to GANs and form the foundation of DCGAN

Questions for Reflection

  • Why can a GAN generator, which never sees real data directly, produce realistic images? What role does the discriminator play as an intermediary between real data and the generator?
  • Mode collapse is one of the main problems with GANs. If you were designing a new GAN architecture, what mechanisms would you add to encourage diversity in generation?
  • Diffusion models have surpassed GANs in generation quality, but GANs generate in milliseconds while diffusion models take seconds. In which real-world applications is GAN speed critical, and in which can you afford to wait for better quality?

Related Lessons

  • ml-32-autoencoders — Shares generative latent-space foundation
  • ml-38-image-classification — Discriminator is a binary image classifier
  • ml-27-activation-functions — Generator stability depends on activations
  • prob-04-bayes — GANs implicitly model the data distribution
  • aie-26-image-generation — GANs underpin practical image generation
  • dl-14