Deep Learning

RNN, LSTM, GRU

2014: seq2seq with LSTM matches and then overtakes phrase-based machine translation, and by 2016 Google Translate has moved to an LSTM-based neural system. 2017: the Transformer arrives and 'kills' the RNN. Yet in 2024 LSTM still runs in Apple Siri, every iPhone audio chip, and Tesla climate control. Speed and compactness are not a compromise - they are a different design choice. An LSTM with 2.5M parameters processes text 15x faster than GPT-2 (117M) on CPU.

LSTM in Apple Siri - real-time on-device speech recognition, under 20ms latency, fully offline
GRU in Google Magenta music generation - melodies and chords as token sequences
Bidirectional LSTM in NER taggers - Bi-LSTM-CRF models reach about 91 F1 on CoNLL-2003 (Lample et al., 2016)
Tesla climate control - time-series prediction without Transformer overhead on embedded hardware

Prerequisites

Backpropagation and the chain rule (BPTT extends them through time)
What the vanishing gradient problem is and why depth makes it worse
Parameter sharing as used in CNNs

Three ideas that taught networks to remember

Recurrent networks took shape in 1990 when Jeffrey Elman introduced the Elman network, adding a context layer that fed the previous hidden state back as input so the model could carry memory across a sequence. The problem was that plain RNNs forgot quickly, since gradients vanished after a handful of steps. In 1997 Sepp Hochreiter and Jürgen Schmidhuber answered with the LSTM, whose gated cell state acts as a protected highway for gradients across hundreds of steps. In 2014 Kyunghyun Cho and colleagues proposed the GRU, a leaner variant with two gates instead of three that often matches LSTM at lower cost.

Sequences and Memory: why fixed windows fall short

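**2014. Sutskever, Vinyals, Le publish seq2seq with LSTM. On WMT'14 English-to-French it scores 34.8 BLEU against 33.3 for a phrase-based baseline.** Two years later Google Translate switched to an LSTM-based neural system (GNMT), which was reported to cut translation errors by about 60% in human evaluation. Before this, machine translation relied on n-gram statistics with fixed windows of 5-7 words. LSTM was among the first practical architectures that could retain information from the beginning of a sentence while processing its end. One result changed NLP permanently.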

The fixed window (sliding window) approach is intuitive: take the last N tokens and predict the next one. **The problem**: language dependencies are not bounded by window size. "The cat that lived across the street **was** orange" - the verb "was" agrees with "cat" across five intervening words, and a longer relative clause pushes the subject out of any fixed window. A recurrent network takes a different route: instead of a fixed window, it maintains a hidden state h_t passed forward at every step. Time is folded into the weights: the same parameters are applied at every step, so a longer sequence adds no parameters.
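The window limitation can be checked directly on the example sentence. This is a minimal sketch: `window_context` is a hypothetical helper written for this illustration, not part of any library, and whitespace splitting stands in for real tokenization.

```python
def window_context(tokens, position, window):
    """Return the `window` tokens immediately before `position`."""
    start = max(0, position - window)
    return tokens[start:position]

tokens = "The cat that lived across the street was orange".split()
verb_position = tokens.index("was")

# A 3-token window sees only the end of the relative clause.
narrow = window_context(tokens, verb_position, window=3)
# The window has to reach back 6 tokens before the subject appears.
wide = window_context(tokens, verb_position, window=6)

print(narrow)           # ['across', 'the', 'street']
print("cat" in narrow)  # False
print("cat" in wide)    # True
```

Any fixed N can be defeated by a longer clause between subject and verb, which is the motivation for carrying a state forward instead.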

**Char-RNN (Karpathy, 2015)**: an LSTM trained character-by-character on Shakespeare started generating structurally valid scenes, stage directions, and character-consistent dialogue after 100k training steps. This was a widely shared public demonstration that an LSTM can learn structure spanning hundreds of characters - not just recent context.

**Intuition for hidden state**: h_t is a compressed "memory" of everything the model has seen up to step t. With long sequences, a plain RNN must pack all history into a fixed-size vector. LSTM addresses this structurally - separating "what to remember" from "what to pass forward".
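The fixed-size memory can be made concrete with a plain (Elman-style) recurrent step. This is a sketch under assumptions: the update h_t = tanh(W_h h_{t-1} + W_x x_t + b), the names `W_x` and `b`, the random initialization, and the toy inputs are all illustrative choices, not taken from the lesson.

```python
import math
import random

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrent step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(W_x[i][k] * x[k] for k in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(xs, hidden_size, input_size, seed=0):
    """Apply the SAME weights at every step and return the final hidden state."""
    rng = random.Random(seed)
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
           for _ in range(hidden_size)]
    W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
           for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    h = [0.0] * hidden_size
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
    return h

short_seq = [[1.0, 0.0], [0.0, 1.0]]  # 2 steps
long_seq = short_seq * 50             # 100 steps

h_short = run_rnn(short_seq, hidden_size=4, input_size=2)
h_long = run_rnn(long_seq, hidden_size=4, input_size=2)
print(len(h_short), len(h_long))  # 4 4
```

Two steps or a hundred, the history ends up in the same four numbers. That is the compression pressure that gating is meant to manage.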

What is the fundamental limitation of the sliding window approach for language modeling compared to RNN?

LSTM Gates: selective memory through sigmoid

**Apple Siri has been doing real-time speech recognition on-device since 2016. The architecture: LSTM.** Not GPT, not Transformer - LSTM, because it fits in 2MB, runs on the Neural Engine under 20ms, and requires no internet connection. Quality trails cloud models, but speed and privacy matter more. This illustrates LSTM's core advantage: compact, fast, and good enough for sequential real-time data.

The central innovation of LSTM (Hochreiter & Schmidhuber, 1997) is the **cell state** c_t: a separate vector that flows through the whole sequence nearly unchanged unless the gates decide otherwise. In the now-standard formulation three gates control information flow: **forget gate** (what to drop from the past; added by Gers et al. in 2000), **input gate** (what to add from the current input), **output gate** (what to expose externally). Each gate is a sigmoid (output 0-1) multiplied elementwise by the relevant vector. Sigmoid acts as a soft switch: 0 means block completely, 1 means pass through unchanged.
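The gate mechanics can be sketched as a single LSTM step in plain Python. The assumptions here are the common formulation in which every gate reads the concatenation [h_{t-1}; x_t], the parameter names (`W_f`, `b_f`, and so on), and the zero weights in `make_params`, which are chosen only so that the biases alone drive the gates.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    z = h_prev + x  # list concatenation: [h_{t-1}; x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(p["W_g"], p["b_g"], z)]  # candidate values
    # Cell update: keep part of the past, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Exposed state: output gate filters the squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def make_params(hidden, inputs, b_f=0.0, b_i=0.0, b_o=0.0, b_g=0.0):
    """Toy parameters: zero weights, so only the biases set the gates."""
    def zeros():
        return [[0.0] * (hidden + inputs) for _ in range(hidden)]
    return {"W_f": zeros(), "b_f": [b_f] * hidden,
            "W_i": zeros(), "b_i": [b_i] * hidden,
            "W_o": zeros(), "b_o": [b_o] * hidden,
            "W_g": zeros(), "b_g": [b_g] * hidden}

c_prev = [0.7, -0.3]
# Forget gate near 1, input gate near 0: the cell state passes through.
_, c_keep = lstm_step([1.0], [0.0, 0.0], c_prev, make_params(2, 1, b_f=20.0, b_i=-20.0))
# Forget gate near 0: the past is erased.
_, c_erase = lstm_step([1.0], [0.0, 0.0], c_prev, make_params(2, 1, b_f=-20.0, b_i=-20.0))
print(c_keep)
print(c_erase)
```

With a large positive forget bias the cell state comes out almost unchanged. With a large negative one it collapses toward zero, which is the soft-switch behaviour described above.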

**GRU vs LSTM**: GRU trains faster (fewer parameters), performs comparably on tasks up to 100 tokens. LSTM holds a small advantage on longer dependencies (500+ tokens). Practical rule: start with GRU, switch to LSTM if quality falls short. Production speech recognition (Siri, Google Voice) typically uses LSTM for slightly better accuracy.
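The "fewer parameters" claim follows from counting weight blocks. The sketch below assumes one bias vector per block and a single layer. The layer sizes are arbitrary, and frameworks such as PyTorch keep two bias vectors per block, so their totals come out slightly higher.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Each block maps [h_{t-1}; x_t] to a hidden-sized vector, plus a bias."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

def lstm_params(input_size, hidden_size):
    # forget, input, output gates + candidate cell values
    return gated_rnn_params(input_size, hidden_size, n_blocks=4)

def gru_params(input_size, hidden_size):
    # update, reset gates + candidate hidden state
    return gated_rnn_params(input_size, hidden_size, n_blocks=3)

lstm = lstm_params(input_size=128, hidden_size=256)
gru = gru_params(input_size=128, hidden_size=256)
print(lstm, gru, gru / lstm)  # 394240 295680 0.75
```

For the same layer sizes a GRU carries three quarters of the LSTM's recurrent parameters, which accounts for most of its speed advantage per step.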

**LSTM vs Transformer in production**: an LSTM with 2.5M parameters processes text 15x faster than GPT-2 (117M parameters) on CPU. For batch processing millions of reviews or running on edge hardware, LSTM frequently wins economically despite slightly lower accuracy.

What does the forget gate do in LSTM?

Bidirectionality: reading twice

**"The bank can guarantee deposits will cover future tuition costs."** The word "bank" - financial institution or riverbank? A unidirectional RNN processes "The bank" and commits to a meaning before seeing "deposits" and "tuition" later. Bi-LSTM taggers address this in NER: Bi-LSTM-CRF models reach about 91 F1 on CoNLL-2003 (Lample et al., 2016). The key: the model reads the sentence twice, left-to-right and right-to-left, then merges both contexts.

Bi-LSTM is not one network but two LSTMs with separate parameters, trained jointly on the same data. The forward LSTM processes the sequence from the start, the backward LSTM processes it from the end. At each position t their hidden states are concatenated: h_t = [h_t_forward; h_t_backward]. The resulting vector captures context from both sides. **The same idea carries over to BERT**: if Bi-LSTM reads both directions sequentially, why not train a Transformer to attend to everything at once via self-attention?
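The two-pass structure can be sketched independently of the cell type. To keep the arithmetic visible, this example swaps the LSTM cell for a toy scalar recurrence h_t = 0.5 * h_{t-1} + x_t, which is an assumption made purely for illustration. A real Bi-LSTM would use two LSTM cells with separate parameters.

```python
def run_direction(xs, step, h0=0.0):
    """Run a recurrent step over xs and collect the state at every position."""
    h, states = h0, []
    for x in xs:
        h = step(h, x)
        states.append(h)
    return states

def bidirectional(xs, step_fwd, step_bwd):
    fwd = run_direction(xs, step_fwd)
    # Process the reversed sequence, then flip back so index t lines up.
    bwd = run_direction(xs[::-1], step_bwd)[::-1]
    # "Concatenate" the two states at each position.
    return list(zip(fwd, bwd))

def leaky_sum(h, x):
    """Toy stand-in for an LSTM cell (NOT an LSTM)."""
    return 0.5 * h + x

xs = [1.0, 0.0, 0.0, 4.0]
states = bidirectional(xs, leaky_sum, leaky_sum)
print(states)
# [(1.0, 1.5), (0.5, 1.0), (0.25, 2.0), (4.125, 4.0)]
```

At position 0 the forward state has seen only the first input, while the backward state already carries a trace of the large value at the end. That right-hand context is what a unidirectional model lacks.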

**Bi-LSTM cannot be used for autoregressive text generation**. When generating the next token, future tokens do not exist yet - the backward pass is impossible. Bi-LSTM is for tasks where the full sequence is available at once: classification, NER, machine translation encoder. For the decoder in seq2seq - only unidirectional LSTM.

Why can Bi-LSTM not be used for autoregressive token-by-token text generation?

Vanishing Gradient: why LSTM is the solution, not the victim

**Bengio, Simard, Frasconi, 1994. "Learning Long-Term Dependencies with Gradient Descent is Difficult."** The paper showed formally that during BPTT in a plain RNN, the gradient is multiplied at every step by the recurrent Jacobian: the weight matrix W_h scaled by the activation derivative. If the norm of that factor stays below 1, the gradient decays exponentially; if it stays above 1, the gradient can explode exponentially. Over a 100-step sequence a per-step factor of 0.5 scales the gradient by about 10^-30, while a factor of 2 scales it by about 10^30. Either way, learning long-range dependencies becomes practically impossible.
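The orders of magnitude come from plain repeated multiplication. The sketch assumes a scalar recurrence with a constant per-step factor, so 0.5 and 2.0 stand in for the norm of the recurrent Jacobian. Real networks have a different factor at every step.

```python
def gradient_scale(per_step_factor, steps):
    """Scale applied to a gradient after flowing back through `steps` steps."""
    return per_step_factor ** steps

vanishing = gradient_scale(0.5, 100)
exploding = gradient_scale(2.0, 100)
print(f"{vanishing:.2e}")  # 7.89e-31
print(f"{exploding:.2e}")  # 1.27e+30
```

Even a factor as mild as 0.9 leaves less than 0.003% of the gradient after 100 steps, so the problem does not require extreme weights.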

**How LSTM solves the problem**: cell state c_t is updated additively: c_t = f_t * c_{t-1} + i_t * g_t. During backpropagation the gradient from c_t to c_{t-1} flows through an elementwise multiplication by f_t - not a multiplication by the full weight matrix. While the forget gate stays close to 1, this is a direct highway for gradients through time without exponential decay. Cell state acts like residual connections in ResNet, but along the time axis instead of depth.
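Written side by side, the two backward paths look as follows. This is a sketch under assumptions: the plain RNN is taken to be h_t = tanh(W_h h_{t-1} + W_x x_t + b), where W_x and b are names introduced here. The symbol \odot is the elementwise product written as * in the text. The LSTM line keeps only the direct path through the cell state and ignores the indirect dependence of the gates on h_{t-1}.

```latex
\begin{aligned}
\text{plain RNN:}\quad
  h_t &= \tanh(W_h h_{t-1} + W_x x_t + b),
  &\frac{\partial h_t}{\partial h_{t-1}} &= \operatorname{diag}\!\left(1 - h_t^{2}\right) W_h \\[4pt]
\text{LSTM:}\quad
  c_t &= f_t \odot c_{t-1} + i_t \odot g_t,
  &\frac{\partial c_t}{\partial c_{t-1}} &\approx \operatorname{diag}(f_t) \\[4pt]
\text{across } T-k \text{ steps:}\quad
  &&\frac{\partial c_T}{\partial c_k} &\approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
\end{aligned}
```

The RNN product repeats the same matrix W_h at every step, so its spectrum dictates decay or explosion. The LSTM product contains only gate values in (0, 1) that the network itself learns, and it can hold them near 1 for the components it needs to preserve.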

**Practical LSTM limit**: despite the cell state highway, LSTM loses information in practice on sequences longer than 300-500 tokens. This is not a bug - it is a fundamental consequence of sequential computation. Each step subtly modifies cell state through the forget gate. After 500 steps, information about the beginning is diluted regardless. This is why Transformer (2017) with global self-attention replaced LSTM in NLP.
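The dilution effect can be estimated with the same kind of arithmetic. The sketch assumes a forget gate that holds a constant value of 0.99 at every step, which is an illustrative number rather than a measurement. Real gates vary per step and per component.

```python
import math

def retention(forget_value, steps):
    """Fraction of the original cell content left after `steps` updates,
    assuming the forget gate holds the same value at every step."""
    return forget_value ** steps

def half_life(forget_value):
    """Steps after which half of the original content remains."""
    return math.log(0.5) / math.log(forget_value)

for steps in (100, 300, 500):
    print(steps, f"{retention(0.99, steps):.4f}")
print(f"half-life at f=0.99: {half_life(0.99):.1f} steps")
```

Under this assumption roughly a third of the original content survives 100 steps and under 1% survives 500, which is in line with the practical range quoted above.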

Misconception: vanishing gradient is a bug in LSTM that was never fully fixed

LSTM was specifically designed to solve vanishing gradient. Cell state is an architectural solution: a gradient highway through time. Plain RNN suffers from vanishing gradient; LSTM largely avoids it by construction, as long as the forget gate stays close to 1.

The confusion arises because LSTM still has a practical limit (~500 tokens). But that is information dilution through the forget gate over very long sequences, not the uncontrolled gradient decay of a plain RNN. The two have different causes and different solutions.

Why does the cell state in LSTM effectively address the vanishing gradient problem?

Key ideas

**RNN folds time into weights**: the same weights are reused at every step, hidden state h_t propagates across steps, and BPTT trains through time - but plain RNN suffers from exponential gradient vanishing/explosion
**Cell state as gradient highway**: LSTM solves vanishing gradient architecturally - additive update of c_t instead of matrix multiplication creates a direct gradient path across hundreds of steps
**GRU - compact alternative**: 2 gates instead of 3, no separate cell state, trains faster, comparable quality on tasks up to 100 tokens
**Bi-LSTM reads twice**: forward + backward hidden states are concatenated, providing context from both directions - the backbone of NER and seq2seq encoders

Questions for reflection

LSTM at 2.5M parameters vs GPT-2 at 117M: in what production scenarios does choosing LSTM make economic and technical sense?
The LSTM cell state and ResNet residual connections solve a similar problem. What is the fundamental similarity, and where does the analogy break down?
Transformer displaced LSTM in NLP. Yet on real-time streaming data (audio, IoT sensors), LSTM still dominates. Why is sequential computation an advantage rather than a limitation in those scenarios?