Deep Learning
Neural Networks: From Biology to Mathematics
**86 billion neurons** in the human brain - and just **800 artificial ones** are enough to recognize handwritten digits with 98% accuracy. How did scientists compress the essence of how the brain works into a few lines of code?
- **Face recognition:** Face ID on the iPhone uses neural networks and an infrared depth map of the face to identify the owner even in the dark
- **Voice assistants:** Siri, Alexa, Google Assistant - neural networks convert sound waves to text and back
- **Recommendations:** Netflix, YouTube, Spotify use neural networks to predict what a user will enjoy
Prerequisites
- Basic linear algebra: vectors, dot product, matrix multiplication
- Single-variable calculus: derivatives and the chain rule
- Reading simple Python and NumPy code
When biology became a formula
In 1943 neurophysiologist Warren McCulloch and logician Walter Pitts published the first mathematical model of a neuron, using threshold logic to show that networks of simple binary units could compute any logical function. In 1952 Alan Hodgkin and Andrew Huxley described how a real neuron fires through ion currents, work that earned them the 1963 Nobel Prize in Physiology or Medicine. Then in 1958 Frank Rosenblatt built the perceptron, the first neuron that adjusted its own weights from data. Three steps, fifteen years: biology turned into mathematics, and mathematics turned into a machine that learns.
Biological vs Artificial Neuron
**In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts asked a bold question:** can we describe how the brain works with a formula? They studied the biological neuron - a cell that receives electrical signals through dendrites, processes them in the cell body (soma), and transmits the result through an axon. And they proposed a simple mathematical model.
**The analogy works like this:** input signals (x₁, x₂, x₃) are the dendrites. Weights (w₁, w₂, w₃) are the synapses, determining the importance of each input. The soma sums the signals: z = w₁·x₁ + w₂·x₂ + w₃·x₃ + b, where the bias b sets the firing threshold (with a step activation the neuron fires when the weighted sum of inputs reaches -b). The activation function f(z) decides whether the neuron will "fire".
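The weighted sum and activation described above can be sketched in a few lines of NumPy. The input values, weights, and bias here are made up for illustration, and the step activation is just one possible choice of f(z).

```python
import numpy as np

def step(z):
    """Binary threshold activation: fire (1) when z >= 0, otherwise stay silent (0)."""
    return 1 if z >= 0 else 0

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    z = np.dot(w, x) + b
    return activation(z)

# Illustrative values only (not taken from any real model)
x = np.array([1.0, 0.0, 1.0])    # input signals x1, x2, x3
w = np.array([0.5, -0.6, 0.2])   # synaptic weights w1, w2, w3
b = -0.5                         # bias

output = neuron(x, w, b, step)   # z = 0.5 + 0.2 - 0.5 = 0.2, so the neuron fires
print(output)
```

With all inputs at zero, z equals the bias alone, so this particular neuron stays silent.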
**The McCulloch-Pitts model (1943)** - the first formal neuron model. It is binary: the neuron is either active (1) or not (0). Despite its simplicity, this idea became the foundation of all deep learning.
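As a rough sketch of how binary threshold units can act as logic gates, the snippet below uses a simplified weighted variant of the model. The original paper treated inhibitory inputs as an absolute veto rather than as negative weights, and the function names and thresholds here are illustrative choices.

```python
def mp_neuron(inputs, weights, threshold):
    """Binary unit: output 1 if the weighted sum reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def AND(a, b):
    # Both inputs must be active to reach the threshold of 2
    return mp_neuron([a, b], [1, 1], threshold=2)

def OR(a, b):
    # A single active input is enough to reach the threshold of 1
    return mp_neuron([a, b], [1, 1], threshold=1)

def NOT(a):
    # A negative weight stands in for an inhibitory input
    return mp_neuron([a], [-1], threshold=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```

Since AND, OR, and NOT are enough to build any Boolean expression, wiring such units together gives networks that compute arbitrary logical functions.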
**Biological neurons are far more complex** than the artificial version. A real neuron has thousands of synapses, fires with temporal spike patterns, and even a single dendrite can do nonlinear computation on its own. The artificial neuron is a crude but useful simplification.
McCulloch and Pitts: an unlikely pair
**McCulloch was a philosopher and neurophysiologist.** Pitts was an 18-year-old self-taught prodigy with no fixed address. They met at a seminar, and McCulloch invited Pitts to live with him. Their 1943 paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity," showed that networks of such neurons can compute any logical function, before electronic stored-program computers had been built.
What role do weights play in an artificial neuron?
Perceptron: the first trainable model
**In 1957, Frank Rosenblatt took the next step:** the McCulloch-Pitts neuron could not learn - its weights had to be set by hand. Rosenblatt proposed the **Perceptron** - a neuron that adjusts its own weights based on data. After a public demonstration in 1958, The New York Times described it as the embryo of an electronic computer that the Navy expected "will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
**The perceptron formula:** y = sign(w·x + b), where sign is a threshold function (returns +1 if the argument ≥ 0, else -1). The key idea: if the perceptron makes a mistake, we adjust the weights toward the correct answer.
**The Perceptron Learning Rule** is brilliantly simple: if the answer is correct - do nothing. If wrong - shift the weights toward the correct answer. It is mathematically proven that if the data is **linearly separable**, the perceptron will always find a solution in a finite number of steps.
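One common way to write this rule is w ← w + η·y·x and b ← b + η·y whenever a sample is misclassified. The sketch below applies it to the logical AND function, which is linearly separable. The function name `train_perceptron`, the learning rate, and the epoch limit are assumptions made for this example.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule for labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else -1
            if pred != yi:
                # Wrong answer: shift the weights toward the correct label
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:
            # A full pass without errors: the data is separated
            break
    return w, b

# Logical AND with labels -1 / +1 (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, 1])

w, b = train_perceptron(X, y_and)
predictions = np.where(X @ w + b >= 0, 1, -1)
print(predictions)
```

Correctly classified samples leave the weights untouched, which is why training stops as soon as one full pass produces no mistakes.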
**But there is a problem.** In 1969, Marvin Minsky and Seymour Papert published the book "Perceptrons", mathematically proving that a single perceptron cannot solve the XOR problem.
**Minsky and Papert's book contributed to an "AI winter"** - funding for neural network research largely dried up for about 15 years. The irony: the solution (multilayer networks) was already known, but a practical, widely known algorithm for training them was missing until backpropagation was popularized in 1986.
Why can't a perceptron solve the XOR problem?
Layers: from perceptron to neural network
**The solution to the XOR problem turned out to be elegant:** if a single perceptron draws one straight line, a combination of several can draw an arbitrarily complex boundary. This gave rise to the idea of **multilayer neural networks** (multilayer perceptron, MLP).
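To make this concrete, here is a minimal sketch of a two-layer network that computes XOR. The weights are set by hand rather than learned, and the choice of an OR unit and a NAND unit in the hidden layer is just one of several possible constructions.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

# Hidden layer (hand-picked weights): unit 1 acts as OR, unit 2 as NAND
W1 = np.array([[1.0, 1.0],
               [-1.0, -1.0]])
b1 = np.array([-0.5, 1.5])

# Output unit: AND of the two hidden units
W2 = np.array([1.0, 1.0])
b2 = -1.5

def xor_net(x):
    h = step(W1 @ x + b1)           # each hidden unit draws one straight line
    return int(step(W2 @ h + b2))   # the output unit combines the two half-planes

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(np.array([a, b], dtype=float)))
```

Each hidden unit still draws only one line. XOR is the region between the two lines, which the output unit selects.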
**A neural network consists of three types of layers.** **Input layer** - receives data (pixels, numbers, text). **Hidden layers** - intermediate layers that extract increasingly abstract features. **Output layer** - produces the result (class, number, probability).
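The three layer types map directly onto a forward pass. The sketch below assumes 28×28-pixel digit images (784 inputs), one hidden layer of 800 neurons, and 10 output classes. These sizes, the ReLU and softmax choices, and the random untrained weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Assumed sizes: input layer, one hidden layer, output layer
layer_sizes = [784, 800, 10]

# One (weights, biases) pair per connection between consecutive layers
params = [
    (rng.normal(0.0, 0.01, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x, params):
    a = x                                   # input layer: the raw data
    for W, b in params[:-1]:
        a = relu(W @ a + b)                 # hidden layers: linear step + nonlinearity
    W, b = params[-1]
    return softmax(W @ a + b)               # output layer: class probabilities

x = rng.random(784)                          # stand-in for a flattened image
probs = forward(x, params)
print(probs.shape, probs.sum())
```

Because the weights are random, the output probabilities are meaningless here. The point is only the flow of data from layer to layer.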
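**Universal Approximation Theorem (Cybenko, 1989; Hornik et al., 1989):** a neural network with a single hidden layer, a nonlinear activation such as the sigmoid, and a sufficient number of neurons can approximate **any continuous function** on a bounded (compact) input domain to arbitrary precision. The theorem guarantees that suitable weights exist; it does not say how many neurons are needed or how to find the weights by training.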
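A small numerical experiment can illustrate the flavor of this result, though it proves nothing. In the sketch below the hidden weights are random and fixed, and only the output weights are fitted by least squares. This is a simplification chosen for brevity, not how networks are normally trained, and the target function sin(x) and the layer widths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a bounded interval
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)

# One hidden layer of tanh units with random, fixed weights and biases
max_hidden = 50
w = rng.normal(0.0, 2.0, size=max_hidden)
b = rng.normal(0.0, 2.0, size=max_hidden)
hidden = np.tanh(np.outer(x, w) + b)        # shape (200, max_hidden)

def rms_error(n_hidden):
    """Fit output weights for the first n_hidden units; return the RMS fit error."""
    features = np.column_stack([hidden[:, :n_hidden], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.sqrt(np.mean((features @ coef - target) ** 2)))

errors = {n: rms_error(n) for n in (2, 10, 50)}
for n, err in errors.items():
    print(f"{n:3d} hidden units: RMS error {err:.2e}")
```

Each wider network reuses the units of the narrower one and adds more, so the best achievable fit can only stay the same or improve as the layer grows.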
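**But then why does deep learning use dozens of layers instead of one huge one?** Because for many functions depth is far more efficient than width: there are function families that a deep network represents compactly but that a single hidden layer can only match with exponentially more neurons. As an illustration, a network with 10 layers of 100 neurons each (1000 neurons total) can represent functions for which a single layer might need billions of neurons.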
| Property | Wide (1 layer) | Deep (many layers) |
|---|---|---|
| Parameters | May require exponentially many | Often polynomially many for the same function |
| Feature hierarchy | None - everything in one layer | Yes - from simple to complex |
| Example (vision) | Tries to recognize a cat directly | Edges → textures → parts → cat |
| Trainability | Hard to optimize | Has its own problems (vanishing gradients) |
**Deep networks build a hierarchy.** In image recognition: the first layer finds edges, the second finds corners and textures, the third finds object parts (ear, eye), the fourth finds whole objects (cat, dog). Each layer uses the results of the previous one.
What does the Universal Approximation Theorem state?
Activation Functions: nonlinearity is everything
**Without activation functions, a multilayer network is useless.** Here's why: if each layer is a linear transformation (y = Wx + b), then a composition of linear functions is again a linear function. Two layers without activation are equivalent to one: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = W'x + b'.
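The identity above is easy to check numerically. In this sketch the layer shapes and the random values are arbitrary, and `W_eq` and `b_eq` are names introduced here for W' and b'.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with arbitrary shapes: 3 -> 4 -> 2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Passing x through both layers without any activation
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layers, one_layer))
```

The same argument repeats for any number of stacked linear layers, so depth alone adds no expressive power.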
**An activation function introduces nonlinearity** - it is what allows the network to approximate complex functions. Let's look at the main ones.
| Function | Formula | Range | When to use |
|---|---|---|---|
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | (0, 1) | Output: probability |
| Tanh | tanh(z) = (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ) | (-1, 1) | Hidden layers (zero-centered) |
| ReLU | max(0, z) | [0, ∞) | Hidden layers (default) |
| Leaky ReLU | max(0.01z, z) | (-∞, ∞) | When ReLU doesn't work |
| GELU | z·Φ(z), Φ = standard normal CDF | [≈-0.17, ∞) | Transformers (BERT, GPT) |
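The formulas in the table translate almost directly into NumPy. This is a minimal sketch: GELU is written with the exact normal CDF via `math.erf`, whereas many libraries use a faster tanh-based approximation, and the 0.01 slope for Leaky ReLU is the value from the table.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

# math.erf works on scalars, so wrap it to accept arrays
_erf = np.vectorize(math.erf)

def gelu(z):
    # z * Phi(z), where Phi is the standard normal CDF
    return z * 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

z = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, gelu):
    print(f.__name__, f(z))
```

Evaluating `gelu` on a fine grid shows where the lower bound of about -0.17 in the table comes from: the function dips slightly below zero for small negative inputs.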
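**Why did ReLU win?** Until the early 2010s, sigmoid and tanh were the usual choices. But they have a serious flaw - the **vanishing gradient problem**: for large positive or large negative z the function saturates and its derivative approaches zero (the sigmoid's derivative never exceeds 0.25). In a deep network, backpropagation multiplies these derivatives layer by layer (chain rule), so almost no gradient reaches the first layers - they stop learning. ReLU's derivative is exactly 1 for every positive input, so it does not shrink the gradient in this way.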
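A back-of-the-envelope calculation shows how quickly the gradient shrinks. The sketch below looks only at the activation derivatives and ignores the weight matrices, which also enter the chain rule. The depth of 10 is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 for any positive input

depth = 10

# Best case for sigmoid: every layer sits at z = 0, where the derivative is largest
sigmoid_factor = sigmoid_grad(0.0) ** depth

# ReLU on an active path: the factor stays at 1 regardless of depth
relu_factor = relu_grad(1.0) ** depth

print(f"sigmoid, {depth} layers: {sigmoid_factor:.2e}")
print(f"ReLU,    {depth} layers: {relu_factor:.2e}")
print(f"sigmoid derivative at z = 10: {sigmoid_grad(10.0):.2e}")
```

Even in the most favorable case the sigmoid scales the gradient by at most 0.25 per layer, and saturated units make the factor smaller still.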
**ReLU is not perfect either.** The "dying ReLU" problem: if a neuron's input is always negative, its output and gradient are permanently zero - the neuron "dies". Leaky ReLU solves this by passing a small signal (0.01·z) for negative values.
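The effect can be reproduced with a single neuron whose pre-activation is negative for every input. The large negative bias below is a contrived value chosen to force that situation, and the random inputs are placeholders for real data.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(1000, 3))       # placeholder inputs
w = np.array([0.1, 0.1, 0.1])
b = -10.0                            # contrived bias: pushes every pre-activation below zero

z = X @ w + b                        # negative for all 1000 samples

# Derivative of each activation with respect to z
relu_grad = (z > 0).astype(float)            # 0 everywhere: no learning signal
leaky_grad = np.where(z > 0, 1.0, 0.01)      # small but nonzero: the neuron can recover

print("ReLU gradient sum:      ", relu_grad.sum())
print("Leaky ReLU gradient sum:", leaky_grad.sum())
```

With a zero gradient on every sample, no weight update can ever move the ReLU neuron back into its active region, while the leaky variant still receives a small correction.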
**Common misconception:** more layers always means better results, so a deep network is automatically more powerful than a shallow one.
**In reality:** without nonlinear activation functions a deep network is mathematically equivalent to a single linear layer. And an excessively deep network without special techniques (residual connections, batch normalization) can train worse than a shallower one because of vanishing or exploding gradients.
**Why:** a composition of linear functions is again a linear function, so it is precisely the nonlinear activations that give deep networks their power. Gradient problems, in turn, can make naively increasing depth harmful. ResNet (2015) showed that skip connections make it practical to train networks with more than a hundred layers.
Why did ReLU become the standard activation function for hidden layers?
Key Ideas
- **Artificial neuron** - a simplified model of a biological one: inputs × weights → sum + bias → activation
- **Perceptron** (1957) - the first trainable neuron, but limited to linear problems (cannot solve XOR)
- **Multilayer networks** solve the linearity problem: each layer extracts increasingly complex features
- **Activation function** - the source of nonlinearity: without it a deep network collapses into a single linear layer. ReLU became the default for hidden layers because it mitigates the vanishing gradient problem
- Those 800 neurons for digits are now fully explained: layers + activations + weight training
Related Topics
Neural networks are the foundation of all deep learning. Here is where these ideas lead:
- Backpropagation: the weight-training algorithm, that is, how a neural network learns from its mistakes
- Linear Regression: a neuron without an activation function is a linear regression model, which makes it a good foundation for understanding neurons
Questions for reflection
- Why did the "AI winter" after Minsky's book last 15 years, even though the solution (multilayer networks) was already known?
- If the Universal Approximation Theorem guarantees that one hidden layer can approximate any function, why do we need deep networks?
- Can artificial neural networks be considered a model of the brain? Where does the analogy break down?
Related lessons
- dl-02 — Neural network architectures: from perceptron to CNN and RNN
- ml-01-intro — Machine learning basics as the foundation
- calc-01-sequences — Calculus to understand gradients
- la-01-vectors-intro — Matrix multiplication is the core of the forward pass
- aie-03-llm-fundamentals — LLMs are a direct application of deep architectures
- ml-26-backpropagation — Backpropagation as iterative error feedback that adjusts weights