Computer Vision

Filtering and Convolution

Lesson objectives

  • Understand convolution: kernel, padding, stride - and the output size formula
  • Choose the right blur type for a given noise profile
  • Apply Canny edge detection with correct threshold parameters
  • Build a morphological pipeline to clean binary masks

Prerequisites

  • Digital Images: Pixels and Color

Yann LeCun and the First Practical CNN

1989. Yann LeCun at Bell Labs trains LeNet, a convolutional neural network for recognizing handwritten digits on US Postal Service ZIP codes (the refined 1998 version is known as LeNet-5). The revolutionary insight: instead of fully connected layers, 5x5 convolution kernels with shared weights. Fewer parameters and, crucially, robustness to position: the same digit in different positions activates the same kernel. By 1998, LeNet was processing an estimated 10-20% of bank checks in the US. In 2012, its ideas powered AlexNet, which broke ImageNet records and launched the deep learning era.

2015. ResNet: 3.57% top-5 error on ImageNet with an ensemble of residual networks up to 152 layers deep, below the commonly cited human estimate of 5.1%. One trick: residual connections. Today, systems like DALL-E 3, Stable Diffusion, and Tesla Autopilot still build on convolutional components. At the core: a small matrix, often 3x3, sliding across an image. The same idea Yann LeCun applied in 1989.

  • **Portrait Mode on smartphones** - Gaussian blur applied to a mask computed by a neural network. Every photo taken involves convolution running 30-40 times in sequence
  • **Lane detection (e.g. Tesla Autopilot)** - the classical pipeline is Canny edge detection + Hough transform for lane markings; a driving system has to process about 30 frames per second across 8 cameras
  • **OCR (Google Lens, Apple Live Text)** - morphological operations join broken letter segments and remove scan artifacts before the neural network runs
  • **DALL-E 3 and Stable Diffusion** - a convolutional encoder compresses the image into latent space before the diffusion process begins

The Convolution Operation

1989. Yann LeCun at Bell Labs trains LeNet on handwritten digits for the US Postal Service. The key decision: instead of a fully connected layer, a **convolution kernel** - a 5x5 matrix sliding across the image. Same numbers, same weights at every position. That meant roughly 10x fewer parameters and one of the first practical neural networks for CV. Today, convolution kernels are inside iPhone Portrait Mode, Tesla Autopilot, and DALL-E 3.

**Padding** solves the border problem: when a 3x3 kernel slides near the edges, boundary pixels lack a full neighborhood. **Same padding** - pad the border (usually with zeros) so the output matches the input size. **Valid** - no padding, so the output shrinks by (kernel size - 1) along each axis. **Stride** - the step size of the kernel shift (typically 1). In CNNs, stride=2 is often used instead of pooling to reduce resolution.
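The rules above reduce to one formula per axis: output = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding on each side, and s the stride. A minimal sketch, assuming a hypothetical helper name `conv_output_size` (not part of any library):

```python
def conv_output_size(n: int, k: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    if k > n + 2 * padding:
        raise ValueError("kernel is larger than the padded input")
    return (n + 2 * padding - k) // stride + 1


# A 32-pixel axis with a 3-pixel kernel:
valid = conv_output_size(32, 3)                        # no padding -> 30
same = conv_output_size(32, 3, padding=1)              # pad 1 per side -> 32
strided = conv_output_size(32, 3, padding=1, stride=2) # stride 2 -> 16
print(valid, same, strided)
```

For a 2D image the formula is applied to height and width independently.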

A pure-Python loop implementation is **catastrophically slow** (seconds per image). OpenCV calls optimized C++ code instead: `cv2.filter2D(img, -1, kernel)` is thousands of times faster. Write convolution by hand only to understand the mechanics.
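For reference, the hand-written version could look like the sketch below. It assumes a grayscale image, stride 1, and no padding; `filter2d_naive` is a hypothetical name used only for illustration. Like `cv2.filter2D`, it applies the kernel without flipping it.

```python
import numpy as np


def filter2d_naive(img, kernel):
    """Slide the kernel over the image as-is (no flip), 'valid' mode, stride 1."""
    img = np.asarray(img, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # Multiply the kernel by the region under it and sum everything up
            region = img[y:y + kh, x:x + kw]
            out[y, x] = np.sum(region * kernel)
    return out


img = np.arange(25).reshape(5, 5)   # toy 5x5 "image" with values 0..24
box = np.ones((3, 3)) / 9.0         # 3x3 averaging kernel
result = filter2d_naive(img, box)
print(result.shape)                 # (3, 3)
```

Unlike this sketch, `cv2.filter2D` pads the border by default, so its output has the same size as the input.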

The kernel determines **what the convolution does**. All-ones, divided by the number of elements - averaging (blur). Positive center, negative surroundings - edge detection. Gaussian - smooth blur without artifacts. One operation, endless effects. CNNs learn kernels automatically from data - that is the central breakthrough of deep learning in CV.

A 100x100 image is convolved with a 5x5 kernel, no padding (valid). What is the output size?

Blurring and Noise Reduction

A paradox: blurring is one of the most useful operations in CV. It removes noise (random brightness fluctuations), smooths textures before edge detection, and reduces distracting detail. Without blur: Canny on a noisy image generates thousands of false edges. With blur: only real contours survive. Harris, SIFT, Canny - all start with Gaussian blur.

**Averaging blur** - the simplest: a kernel of equal values. Each pixel is replaced by the average of its neighbors. It blurs everything uniformly including edges - which is why it is rarely used in practice. **Gaussian blur** - the gold standard: weights distributed along a bell curve, the center pixel contributes the most.
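The difference between the two kernels is easiest to see in their weights. A minimal sketch, assuming square kernels and the hypothetical helper names `box_kernel` and `gaussian_kernel`:

```python
import numpy as np


def box_kernel(size):
    """Averaging kernel: every weight equals 1 / (size * size)."""
    return np.full((size, size), 1.0 / (size * size))


def gaussian_kernel(size, sigma):
    """Gaussian kernel: weights follow a bell curve and sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0        # e.g. [-2, -1, 0, 1, 2]
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))    # 1D bell curve
    kernel = np.outer(g, g)                        # separable 2D kernel
    return kernel / kernel.sum()


box = box_kernel(5)
gauss = gaussian_kernel(5, sigma=1.0)
print(np.round(gauss, 3))
```

In the Gaussian kernel the center weight is the largest and the corners are the smallest, while the box kernel treats all 25 neighbors equally. In practice OpenCV's `cv2.blur` and `cv2.GaussianBlur` build and apply such kernels internally.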

| Filter | Kernel | Best for | Preserves edges? |
|---|---|---|---|
| Averaging | All values = 1/N^2 | Quick preview | No |
| Gaussian | Gaussian-weighted | General noise reduction | Partially |
| Median | Median of neighbors | Salt-and-pepper noise | Yes |
| Bilateral | Gaussian + intensity weighting | Noise reduction with edge preservation | Yes |
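The median row can be checked on a toy image. The sketch below assumes a flat gray image with two outlier pixels and a hypothetical helper `filter_3x3` that applies any reducer to each 3x3 neighborhood; it is written for clarity, not speed.

```python
import numpy as np


def filter_3x3(img, reducer):
    """Apply reducer (e.g. np.mean or np.median) to each 3x3 neighborhood."""
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    out = np.empty(img.shape, dtype=np.float64)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = reducer(padded[y:y + 3, x:x + 3])
    return out


img = np.full((7, 7), 100, dtype=np.uint8)  # flat gray background
img[3, 3] = 255                             # "salt" pixel
img[1, 5] = 0                               # "pepper" pixel

mean_out = filter_3x3(img, np.mean)
median_out = filter_3x3(img, np.median)

print(mean_out[3, 3])     # about 117: the outlier is smeared into its neighbors
print(median_out[3, 3])   # 100.0: the outlier is discarded entirely
```

Averaging spreads each outlier over its neighborhood, whereas the median ignores it as long as outliers are a minority in the window. The OpenCV counterpart is `cv2.medianBlur`.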

In a CV pipeline, **Gaussian blur almost always comes first**. Canny edge detector, Harris corner detector, SIFT - all start with blurring. The **sigma** parameter controls blur strength: sigma=1 is gentle, sigma=10 is aggressive.

A surveillance camera image has heavy salt-and-pepper noise (random black and white pixels). Which filter works best?

Edge Detection: Sobel and Canny

Tesla Autopilot processes 8 video streams in real time. The first task: find lane markings, curbs, object silhouettes. These are **edges**: locations of sharp brightness changes. For an algorithm, an edge is a location where the **gradient** (rate of brightness change) reaches a local maximum. Two 3x3 convolution kernels are all it takes for primary detection.

**Sobel** - the simplest detector: two 3x3 kernels, one finds horizontal gradients (Gx), the other vertical (Gy). Full gradient magnitude: G = sqrt(Gx^2 + Gy^2).
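A minimal sketch of the Sobel computation on a synthetic step edge, assuming the standard 3x3 Sobel kernels and a hypothetical helper `correlate_valid` (no padding, no kernel flip):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)  # responds to left-right changes
SOBEL_Y = SOBEL_X.T                                 # responds to top-bottom changes


def correlate_valid(img, kernel):
    """Apply the kernel as-is at every position where it fits entirely."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out


img = np.zeros((5, 6), dtype=np.float64)
img[:, 3:] = 255.0                      # dark left half, bright right half

gx = correlate_valid(img, SOBEL_X)
gy = correlate_valid(img, SOBEL_Y)
magnitude = np.sqrt(gx ** 2 + gy ** 2)  # G = sqrt(Gx^2 + Gy^2)
print(magnitude[0])                     # nonzero only around the step
```

Gy is zero everywhere here because brightness does not change from top to bottom; only Gx responds to the step. The raw response is two pixels wide, which is one of the weaknesses Canny addresses. OpenCV exposes the same operator as `cv2.Sobel`.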

**Canny** - the gold standard in edge detection, published by John Canny in 1986. Not just one filter, but a **4-step pipeline**, each step fixing a specific weakness of plain Sobel: Gaussian blur (remove noise) -> Sobel (find gradient) -> non-maximum suppression (thin edges to 1 pixel) -> hysteresis thresholding (drop weak isolated edges).

**Hysteresis thresholding** is Canny's defining feature. It uses two thresholds instead of one, because a single threshold forces a bad trade-off: set it low and noise passes through, set it high and contours break apart. Canny takes the best of both: strong edges are found with the high threshold, and weaker pixels are kept only if they connect to a strong edge and stay above the low threshold. Practical rule: high:low = 2:1 or 3:1.
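The hysteresis step on its own can be sketched as a flood fill that starts from strong pixels. This is an illustrative version under simplifying assumptions (a precomputed gradient magnitude array, 8-connectivity, hypothetical function name `hysteresis`), not OpenCV's implementation:

```python
import numpy as np


def hysteresis(magnitude, low, high):
    """Keep strong pixels (>= high) plus weak pixels (>= low) that are
    8-connected to a strong pixel, directly or through other weak pixels."""
    strong = magnitude >= high
    candidate = magnitude >= low
    edges = np.zeros(magnitude.shape, dtype=bool)
    stack = list(zip(*np.nonzero(strong)))   # start from every strong pixel
    h, w = magnitude.shape
    while stack:
        y, x = stack.pop()
        if edges[y, x]:
            continue
        edges[y, x] = True
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and candidate[ny, nx] and not edges[ny, nx]):
                    stack.append((ny, nx))
    return edges


mag = np.array([
    [0,  0,  0,  0, 0,  0],
    [0, 90, 60, 55, 0,  0],   # one strong pixel followed by a weak tail
    [0,  0,  0,  0, 0, 60],   # isolated weak pixel
    [0,  0,  0,  0, 0,  0],
], dtype=np.float64)

edges = hysteresis(mag, low=50, high=80)
print(edges.astype(int))
```

The weak tail (60, 55) survives because it touches the strong pixel (90), while the isolated 60 is dropped even though it is above the low threshold. With OpenCV the whole pipeline is a single call, `cv2.Canny(image, threshold1, threshold2)`, where the two thresholds are the low and high values.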

In Canny edge detection, non-maximum suppression is needed to:

Morphological Operations

OCR systems (Google Lens, Tesseract, Apple Live Text) work with binary masks. After threshold segmentation, a text mask typically has holes inside letters, small noise, and broken contours. **Morphological operations** are the cleanup tools. They work with a **structuring element** (a small kernel) that defines the neighborhood shape.

Two basic operations: **erosion** - if at least one pixel under the kernel = 0, the result = 0. Objects shrink, small specks disappear. **Dilation** - if at least one pixel = 1, the result = 1. Objects grow, holes fill in.

**Opening** = erosion -> dilation. Removes small noise (erosion deletes specks), then restores object size (dilation). **Closing** = dilation -> erosion. Fills holes and gaps (dilation expands), then returns the contour (erosion shrinks). Standard pipeline for cleaning OCR masks: opening first, then closing.
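The four operations can be sketched in NumPy for a 0/1 mask and a square structuring element. The function names and the border handling (ones outside the image for erosion, zeros for dilation, so the border itself never creates or removes pixels) are assumptions made for this illustration:

```python
import numpy as np


def erode(mask, k=3):
    """Result is 1 only if every pixel under the k x k square is 1."""
    h, w = mask.shape
    padded = np.pad(mask, k // 2, mode="constant", constant_values=1)
    out = np.ones_like(mask)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def dilate(mask, k=3):
    """Result is 1 if at least one pixel under the k x k square is 1."""
    h, w = mask.shape
    padded = np.pad(mask, k // 2, mode="constant", constant_values=0)
    out = np.zeros_like(mask)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def opening(mask, k=3):
    return dilate(erode(mask, k), k)


def closing(mask, k=3):
    return erode(dilate(mask, k), k)


mask = np.zeros((12, 12), dtype=np.uint8)
mask[2:9, 2:9] = 1      # 7x7 object
mask[5, 5] = 0          # one-pixel hole inside the object
mask[10, 10] = 1        # one-pixel noise speck outside it

cleaned = closing(opening(mask))
print(cleaned)
```

Opening removes the speck and leaves the object (hole included) unchanged; closing then fills the hole without growing the object. Note that the structuring element must be small relative to the object: on a much smaller object with a hole, the erosion inside opening could wipe the object out. OpenCV provides these operations as `cv2.erode`, `cv2.dilate`, and `cv2.morphologyEx` with `cv2.MORPH_OPEN` / `cv2.MORPH_CLOSE`.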

| Operation | Formula | Effect | Use case |
|---|---|---|---|
| Erosion | Shrink | Objects get smaller | Remove thin connections, noise |
| Dilation | Expand | Objects get larger | Fill holes, connect segments |
| Opening | Erode -> Dilate | Remove small objects | Noise cleanup |
| Closing | Dilate -> Erode | Fill holes | Close contours |
| Gradient | Dilate - Erode | Object outline | Outline visualization |

**Morphological gradient** (`MORPH_GRADIENT`) = dilation - erosion. The result is an object outline with thickness proportional to the kernel size. A fast alternative to Canny for binary masks.
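A minimal sketch of the gradient as "dilation minus erosion" on a 0/1 mask, using hypothetical helper names and a square 3x3 structuring element:

```python
import numpy as np


def _shifted_views(mask, k, pad_value):
    """All k*k shifted copies of the mask, one per structuring-element cell."""
    h, w = mask.shape
    padded = np.pad(mask, k // 2, mode="constant", constant_values=pad_value)
    return [padded[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]


def dilate(mask, k=3):
    return np.max(_shifted_views(mask, k, 0), axis=0)


def erode(mask, k=3):
    return np.min(_shifted_views(mask, k, 1), axis=0)


def morph_gradient(mask, k=3):
    # Dilation is never smaller than erosion, so the difference stays in {0, 1}
    return dilate(mask, k) - erode(mask, k)


mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 2:6] = 1                  # 4x4 square

outline = morph_gradient(mask)
print(outline)
```

With a 3x3 element the outline is two pixels thick: one ring just outside the object (added by dilation) and one ring just inside it (removed by erosion). A larger element gives a thicker outline. In OpenCV the equivalent is `cv2.morphologyEx` with `cv2.MORPH_GRADIENT`.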

Common misconception: convolution and correlation are the same operation

Mathematically, convolution flips the kernel 180 degrees before applying it (reflecting along both axes). Correlation applies the kernel as-is. For symmetric kernels (Gaussian, averaging) there is no difference. For asymmetric kernels (Sobel, directional) the difference is significant.

OpenCV and most deep learning frameworks implement correlation but call it convolution. True mathematical convolution is used in signal processing. When constructing a kernel for a specific direction, remember that cv2.filter2D() performs correlation.
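The distinction can be verified numerically. In the sketch below, `correlate_valid` and `convolve_valid` are hypothetical helpers; true convolution is implemented as correlation with the kernel rotated by 180 degrees.

```python
import numpy as np


def correlate_valid(img, kernel):
    """Apply the kernel as-is (what cv2.filter2D does), no padding."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out


def convolve_valid(img, kernel):
    """True convolution: flip the kernel along both axes, then correlate."""
    return correlate_valid(img, kernel[::-1, ::-1])


sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)   # asymmetric
gauss = np.array([[1, 2, 1],
                  [2, 4, 2],
                  [1, 2, 1]], dtype=np.float64) / 16  # symmetric

img = np.zeros((5, 6))
img[:, 3:] = 1.0                                      # dark-to-bright step

# Symmetric kernel: both operations agree
print(np.allclose(correlate_valid(img, gauss), convolve_valid(img, gauss)))

# Asymmetric kernel: the sign of the response flips
print(correlate_valid(img, sobel_x)[0])
print(convolve_valid(img, sobel_x)[0])
```

For the Sobel kernel, rotating by 180 degrees is the same as negating it, so correlation reports the dark-to-bright step as positive and true convolution reports it as negative. When the sign or direction matters, flip the kernel yourself before passing it to a correlation-based routine.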

A binary mask has small black holes inside an object and small white noise specks around it. Which pipeline is correct?

Main points

  • **Convolution** - one operation (kernel x region), but different kernels produce different effects: blur, edge detection, sharpening. Yann LeCun applied this idea in 1989 - it underlies all modern CV
  • **Blur** removes noise and prepares for analysis. Gaussian - universal, Median - for salt-and-pepper noise, Bilateral - when edges need to be preserved
  • **Canny** = 4-step pipeline (blur -> gradient -> NMS -> hysteresis). Two thresholds solve the noise-vs-broken-contours dilemma. Rule: high:low = 2:1 or 3:1
  • **Morphology** cleans binary masks: opening removes noise, closing fills holes. OCR, medical scanners, object detection - all use this pipeline
  • **CNNs in 2024** learn kernels automatically - but understanding Sobel, Gaussian, Canny helps interpret what the network has learned

Related topics

Convolution is the bridge between raw pixels and high-level image understanding:

  • Digital Images: Pixels and Color — Convolution operates on pixel arrays - understanding coordinates and data types is required to construct kernels correctly
  • Features: SIFT, SURF, ORB — Feature detectors use Gaussian blur for scale-space and Sobel for gradients internally

Questions for reflection

  • Why does the Canny edge detector start with blurring when blurring destroys detail? Doesn't that contradict the goal of finding edges?
  • If CNNs learn convolution kernels automatically, why understand Sobel and Canny? In which situations are classical filters more reliable?
  • Morphological operations work on binary masks. What if an object has semi-transparent edges (alpha gradient)? How would the pipeline be adapted?

Related lessons

  • cv-01 — Pixels, coordinates, and data types - the foundation for convolution
  • cv-03 — Feature detectors SIFT and ORB use Gaussian blur and gradients internally
  • dl-05 — CNNs learn convolution kernels automatically - this is the evolution of manual filters
  • aie-25-multimodal — DALL-E 3 and Stable Diffusion use convolutional encoders inside
  • cv-04 — Object detection and tracking build on edge detection and morphology
  • la-06-transformations