Deep Learning

Neural Networks: From Biology to Mathematics

**86 billion neurons** in the human brain - and just **800 artificial ones** are enough to recognize handwritten digits with 98% accuracy. How did scientists compress the essence of how the brain works into a few lines of code?

**Face recognition:** Face ID on the iPhone uses a neural network with millions of neurons to identify the owner even in the dark
**Voice assistants:** Siri, Alexa, Google Assistant - neural networks convert sound waves to text and back
**Recommendations:** Netflix, YouTube, Spotify use neural networks to predict what a user will enjoy

Prerequisites

Basic linear algebra: vectors, dot product, matrix multiplication
Single-variable calculus: derivatives and the chain rule
Reading simple Python and NumPy code

When biology became a formula

In 1943 neurophysiologist Warren McCulloch and logician Walter Pitts published the first mathematical model of a neuron, using threshold logic to show that networks of simple binary units could compute any logical function. In 1952 Alan Hodgkin and Andrew Huxley described how a real neuron fires through ion currents, work that earned them the 1963 Nobel Prize in Physiology or Medicine. Then in 1958 Frank Rosenblatt built the perceptron, the first neuron that adjusted its own weights from data. Three steps, fifteen years: biology turned into mathematics, and mathematics turned into a machine that learns.

Biological vs Artificial Neuron

**In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts asked a bold question:** can we describe how the brain works with a formula? They studied the biological neuron - a cell that receives electrical signals through dendrites, processes them in the cell body (soma), and transmits the result through an axon. And they proposed a simple mathematical model.

**The analogy works like this:** input signals (x₁, x₂, x₃) play the role of the dendrites. Weights (w₁, w₂, w₃) play the role of the synapses and determine the importance of each input. The soma sums the signals: z = w₁·x₁ + w₂·x₂ + w₃·x₃ + b, where the bias b shifts the firing threshold (a threshold θ corresponds to b = -θ). The activation function f(z) decides whether the neuron will "fire".
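The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `step` and `neuron` and the numeric values of `x`, `w`, and `b` are assumptions chosen for the example.

```python
import numpy as np

def step(z):
    """Binary threshold activation: fire (1) if z >= 0, otherwise stay silent (0)."""
    return 1 if z >= 0 else 0

def neuron(x, w, b, activation=step):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    z = np.dot(w, x) + b
    return activation(z)

# Illustrative values (assumed, not taken from the lesson).
x = np.array([1.0, 0.0, 1.0])    # input signals ("dendrites")
w = np.array([0.5, -0.6, 0.2])   # weights ("synapses")
b = -0.6                         # bias: a more negative b means a higher threshold

print(neuron(x, w, b))           # weighted sum is about 0.1, so the neuron fires
print(neuron(x, w, -1.0))        # weighted sum is about -0.3, so it stays silent
```

Lowering the bias from -0.6 to -1.0 is enough to silence the neuron for the same inputs, which is what "the bias shifts the threshold" means in practice.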

**The McCulloch-Pitts model (1943)** - the first formal neuron model. It is binary: the neuron is either active (1) or not (0). Despite its simplicity, this idea became the foundation of all deep learning.

**Biological neurons are far more complex** than the artificial version. A real neuron has thousands of synapses, fires with temporal spike patterns, and even a single dendrite can do nonlinear computation on its own. The artificial neuron is a crude but useful simplification.

McCulloch and Pitts: an unlikely pair

**McCulloch was a philosopher and neurophysiologist.** Pitts was an 18-year-old self-taught prodigy with no fixed address. They met at a seminar, and McCulloch invited Pitts to live with him. Their 1943 paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity," showed that networks of simple neurons can compute any logical function, a few years before the first general-purpose electronic computers were built.

What role do weights play in an artificial neuron?

Perceptron: the first trainable model

**In 1957, Frank Rosenblatt took the next step.** The McCulloch-Pitts neuron could not learn: its weights had to be set by hand. Rosenblatt proposed the **Perceptron**, a neuron that adjusts its own weights based on data, and demonstrated it publicly in 1958. The New York Times described it as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

**The perceptron formula:** y = sign(w·x + b), where sign is a threshold function (returns +1 if the argument ≥ 0, else -1). The key idea: if the perceptron makes a mistake, we adjust the weights toward the correct answer.

**The Perceptron Learning Rule** is brilliantly simple: if the answer is correct - do nothing. If wrong - shift the weights toward the correct answer. It is mathematically proven that if the data is **linearly separable**, the perceptron will always find a solution in a finite number of steps.
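The learning rule can be sketched as follows. The lesson does not spell out the exact update, so this example assumes the standard form w ← w + η·y·x and b ← b + η·y on each mistake; the function name `train_perceptron`, the learning rate `lr`, and the choice of the logical AND as a linearly separable dataset are all illustrative assumptions.

```python
import numpy as np

def sign(z):
    """Threshold function from the formula above: +1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule: update weights only on misclassified examples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if sign(np.dot(w, xi) + b) != yi:
                # Wrong answer: shift the weights toward the correct label.
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:
            # A full pass with no mistakes: the data has been separated.
            break
    return w, b

# Logical AND with labels in {-1, +1}; this dataset is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w, b = train_perceptron(X, y)
preds = [sign(np.dot(w, xi) + b) for xi in X]
print(preds)  # [-1, -1, -1, 1]
```

On this dataset the loop stops after a handful of passes, as the convergence guarantee predicts. Replacing `y` with the XOR labels would make it run until `max_epochs` without ever reaching zero mistakes.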

**But there is a problem.** In 1969, Marvin Minsky and Seymour Papert published the book "Perceptrons", mathematically proving that a single perceptron cannot solve the XOR problem.
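A short derivation shows where the contradiction comes from. It uses the convention y = sign(w·x + b) with +1 for arguments ≥ 0; encoding XOR's "true" as +1 and "false" as -1 is an assumption made here for illustration.

```latex
\begin{align*}
(0,0) \mapsto -1 &: \quad b < 0 \\
(0,1) \mapsto +1 &: \quad w_2 + b \ge 0 \\
(1,0) \mapsto +1 &: \quad w_1 + b \ge 0 \\
(1,1) \mapsto -1 &: \quad w_1 + w_2 + b < 0 \\[4pt]
\text{Adding the two middle lines:} &\quad w_1 + w_2 + 2b \ge 0
  \;\Rightarrow\; w_1 + w_2 + b \ge -b > 0
\end{align*}
```

The last line contradicts the requirement for the point (1,1), so no choice of w₁, w₂, b satisfies all four conditions: the four XOR points cannot be separated by a single straight line.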

**Minsky and Papert's book is widely credited with contributing to an "AI winter"** for neural networks: funding for this research largely dried up for about 15 years. The irony: the solution (multilayer networks) was already known, but an efficient algorithm to train them had not yet been established.

Why can't a perceptron solve the XOR problem?

Layers: from perceptron to neural network

**The solution to the XOR problem turned out to be elegant:** if a single perceptron draws one straight line, a combination of several can draw an arbitrarily complex boundary. This gave rise to the idea of **multilayer neural networks** (multilayer perceptron, MLP).
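To make this concrete, here is a sketch of a two-layer network that computes XOR with hand-set weights (no training involved). The particular weights are an assumption chosen for readability: one hidden neuron acts as OR, the other as AND, and the output fires when OR is true but AND is not. Outputs are encoded as 0/1 here rather than ±1.

```python
import numpy as np

def step(z):
    """Element-wise threshold: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

# Hidden layer (hand-set, illustrative weights):
#   h1 = OR(x1, x2)   fires when x1 + x2 - 0.5 >= 0
#   h2 = AND(x1, x2)  fires when x1 + x2 - 1.5 >= 0
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output neuron: h1 AND NOT h2, i.e. h1 - h2 - 0.5 >= 0
W2 = np.array([1.0, -1.0])
b2 = -0.5

def xor_net(x):
    """Forward pass through the two layers."""
    h = step(W1 @ x + b1)
    return int(step(W2 @ h + b2))

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))
# [0, 0] 0
# [0, 1] 1
# [1, 0] 1
# [1, 1] 0
```

Each hidden neuron draws one straight line; the output neuron combines the two half-planes into a band, which is a region no single line can produce.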

**A neural network consists of three types of layers.** **Input layer** - receives data (pixels, numbers, text). **Hidden layers** - intermediate layers that extract increasingly abstract features. **Output layer** - produces the result (class, number, probability).

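**Universal Approximation Theorem (Cybenko, 1989; Hornik et al., 1989):** a neural network with a single hidden layer, a suitable nonlinear activation, and a sufficient number of neurons can approximate **any continuous function** on a bounded, closed region of the input space to arbitrary precision. In this sense neural networks are universal approximators. The theorem only says that such a network exists; it does not say how many neurons are needed or how to find the weights.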

**But then why does deep learning use dozens of layers instead of one huge one?** Because for many functions depth is far more efficient than width. There are families of functions that a deep network can represent with a modest number of neurons (say, 10 layers of 100 neurons each, 1000 in total), while a single hidden layer would need exponentially more.

Property	Wide (1 layer)	Deep (many layers)
Parameters	May require exponentially many (for some functions)	Often far fewer for the same function
Feature hierarchy	None - everything in one layer	Yes - from simple to complex
Example (vision)	Tries to recognize a cat directly	Edges → textures → parts → cat
Trainability	Hard to optimize	Has its own problems (vanishing gradients)

**Deep networks build a hierarchy.** In image recognition: the first layer finds edges, the second finds corners and textures, the third finds object parts (ear, eye), the fourth finds whole objects (cat, dog). Each layer uses the results of the previous one.

What does the Universal Approximation Theorem state?

Activation Functions: nonlinearity is everything

**Without activation functions, a multilayer network is useless.** Here's why: if each layer is a linear transformation (y = Wx + b), then a composition of linear functions is again a linear function. Two layers without activation are equivalent to one: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = W'x + b'.
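The identity above is easy to check numerically. The sketch below uses random matrices with arbitrary, assumed shapes (3 inputs, 4 hidden units, 2 outputs) and compares the two-layer computation with the collapsed single layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (shapes are illustrative).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Layer by layer: W2 (W1 x + b1) + b2
two_layers = W2 @ (W1 @ x + b1) + b2

# Collapsed form: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layers, one_layer))  # True
```

The same collapse works for any number of stacked linear layers, which is why depth alone adds nothing until a nonlinearity is inserted between them.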

**An activation function introduces nonlinearity** - it is what allows the network to approximate complex functions. Let's look at the main ones.

Function	Formula	Range	When to use
Sigmoid	σ(z) = 1/(1+e⁻ᶻ)	(0, 1)	Output: probability
Tanh	tanh(z) = (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ)	(-1, 1)	Hidden layers (zero-centered)
ReLU	max(0, z)	[0, ∞)	Hidden layers (default)
Leaky ReLU	max(0.01z, z)	(-∞, ∞)	When ReLU doesn't work
GELU	z·Φ(z)	(-0.17, ∞)	Transformers (BERT, GPT)
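The functions in the table can be written directly in NumPy. This is a minimal sketch: the slope `alpha=0.01` follows the table, while the GELU version computes Φ(z) exactly through the standard library's `math.erf` (frameworks often use a faster approximation instead).

```python
import math
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Zero-centered squashing into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Passes positive values through, zeroes out negative ones."""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return np.maximum(alpha * z, z)

# Standard normal CDF, applied element-wise via math.erf.
_erf = np.vectorize(math.erf)

def gelu(z):
    """z * Phi(z), where Phi is the standard normal CDF."""
    return z * 0.5 * (1.0 + _erf(z / math.sqrt(2)))

z = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, gelu):
    print(f.__name__, f(z))
```

Printing the values side by side shows the differences the table describes: sigmoid and tanh flatten out at both ends, ReLU cuts negatives to zero, Leaky ReLU and GELU let a small negative signal through.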

**Why did ReLU win?** Before the 2010s, sigmoid and tanh were the standard choices. But they have a serious flaw, the **vanishing gradient problem**: for large positive or large negative values of z the derivative approaches zero, and even at its peak the sigmoid derivative is only 0.25. In a deep network, gradients are multiplied layer by layer (chain rule), so by the time they reach the first layers virtually nothing is left and the network stops learning.
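A rough numerical sketch of the effect. It deliberately ignores the weight matrices and looks only at the activation derivatives that the chain rule multiplies together, which is a simplifying assumption; the depth of 10 is also just an example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest value the derivative can take
print(sigmoid_grad(5.0))   # well below 0.01: the saturated region

depth = 10
# Best case for sigmoid: every layer sits exactly at the peak of the derivative.
best_case = sigmoid_grad(0.0) ** depth
# ReLU on its active side has derivative 1, so the product does not shrink.
relu_case = 1.0 ** depth

print(best_case)   # below one millionth
print(relu_case)   # 1.0
```

Even in the most favourable case, ten sigmoid layers scale the gradient down by a factor of about a million, while active ReLU units pass it through unchanged.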

**ReLU is not perfect either.** The "dying ReLU" problem: if a neuron's input is always negative, its output and gradient are permanently zero - the neuron "dies". Leaky ReLU solves this by passing a small signal (0.01·z) for negative values.
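The difference shows up directly in the derivatives. In this small sketch the helper names and sample inputs are assumptions; the slope 0.01 matches the value given above.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0])
print(relu_grad(z))        # [0. 0. 1.]
print(leaky_relu_grad(z))  # [0.01 0.01 1.  ]
```

With plain ReLU a neuron whose input stays negative receives a zero gradient and its weights never change again; with Leaky ReLU a small gradient still flows, so the neuron has a chance to recover.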

**Myth:** more layers always mean better results; a deep network is automatically more powerful than a shallow one.

**Reality:** without nonlinear activation functions a deep network is mathematically equivalent to a single linear layer. And an excessively deep network without special techniques (residual connections, batch normalization) can train worse than a shallower one because of vanishing or exploding gradients.

**Why:** a composition of linear functions is again a linear function, so it is precisely the nonlinear activations that give deep networks their power. Gradient problems, in turn, can make naively increasing depth harmful. ResNet (2015) showed that skip connections make very deep networks far easier to train.

Why did ReLU become the standard activation function for hidden layers?

Key Ideas

**Artificial neuron** - a simplified model of a biological one: inputs × weights → sum + bias → activation
**Perceptron** (1957) - the first trainable neuron, but limited to linear problems (cannot solve XOR)
**Multilayer networks** solve the linearity problem: each layer extracts increasingly complex features
**Activation function** - the key to everything: without nonlinearity a deep network is useless. ReLU became the standard because it largely avoids the vanishing gradient problem
Those 800 neurons for digits are now fully explained: layers + activations + weight training

Questions for reflection

Why did the "AI winter" after Minsky and Papert's book last about 15 years, even though the solution (multilayer networks) was already known?
If the Universal Approximation Theorem guarantees that one hidden layer can approximate any function, why do we need deep networks?
Can artificial neural networks be considered a model of the brain? Where does the analogy break down?

Related lessons

dl-02 — Neural network architectures: from perceptron to CNN and RNN
ml-01-intro — Machine learning basics as the foundation
calc-01-sequences — Calculus to understand gradients
la-01-vectors-intro — Matrix multiplication is the core of the forward pass
aie-03-llm-fundamentals — LLMs are a direct application of deep architectures
ml-26-backpropagation — Backpropagation as iterative error feedback that adjusts weights
