Machine Learning

ML Types: supervised, unsupervised, reinforcement

Arthur Samuel and the birth of the term

In 1959, Arthur Samuel of IBM published his work on a checkers program that improved through self-play. He coined the term "machine learning" as an alternative to "explicit programming". The definition usually attributed to him: "Field of study that gives computers the ability to learn without being explicitly programmed". Fifty-three years later, the ImageNet moment of 2012 started turning machine learning into an industry worth hundreds of billions of dollars.

Lesson objectives

Distinguish supervised, unsupervised, and reinforcement learning by data structure and feedback mechanism
Identify the ML type for a given business problem
Understand why self-supervised learning became the foundation for GPT and BERT
Know production examples of each paradigm

2012. ImageNet. AlexNet cut the top-5 error rate from 26% to 15% in a single year. Supervised learning on 1.2 million labeled images. Three years later: ResNet, 3.57% top-5 error - below the roughly 5% estimated for a human annotator. But GPT-4 was not pre-trained on billions of human-labeled texts. AlphaGo Zero received no correct moves as labels. Three different answers to the question "how to learn without explicit rules" split all of ML.

**Gmail spam filter** - supervised learning on billions of labeled emails, 99.9% blocking accuracy
**Spotify Discover Weekly** - unsupervised clustering of 600M+ listeners across track embedding space
**ChatGPT (RLHF)** - PPO updates language model weights based on human comparisons of responses
**Tesla Autopilot** - supervised learning on 1.4 billion miles of real driving with road object annotations
**Google Data Center** - RL agent cut cooling costs by 40% while managing 120 environment variables
**AlphaFold2 (DeepMind)** - supervised plus self-supervised combination for protein structure prediction, median GDT score of about 92 out of 100 at CASP14

Prerequisites

What Is Machine Learning

Supervised Learning: learning with a teacher

2012. Toronto. AlexNet cut the ImageNet top-5 error rate from 26% to 15% in a single year. 1.2 million labeled images, each annotated by a human. That is the essence of **supervised learning**: a model trains on labeled pairs **(x, y)**, where x is input data (features) and y is the correct answer (label). The model learns a function f(x) ≈ y. The "teacher" is not a person at a screen - it is **the dataset itself**.

Supervised learning solves two types of tasks:
- **Classification** - predicting a discrete category: spam / not spam, cat / dog, benign / malignant tumor
- **Regression** - predicting a continuous number: apartment price, tomorrow's temperature, customer churn probability

Classification vs Regression

Two types of supervised learning tasks

**Classification (discrete labels):**
- Email → spam or not spam
- Photo → cat, dog, bird
- X-ray → healthy or sick
- Transaction → fraud or legitimate

**Regression (continuous values):**
- Apartment area → price (USD 350K)
- Date and location → temperature (22.5 C)
- Customer parameters → churn probability (0.73)
- Stock history → tomorrow's price (USD 142.30)
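The two task types can be sketched side by side. The snippet below is a minimal illustration on made-up toy data, not a production method: `fit_line` is an ordinary least-squares fit for regression, and `fit_centroids` / `classify` form a nearest-centroid classifier. All function names and numbers are assumptions introduced for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y ≈ w * x + b for a single feature (regression)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    w = covariance / variance
    return w, y_bar - w * x_bar


def fit_centroids(xs, labels):
    """Nearest-centroid classifier: remember the mean feature value per class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(values) for label, values in groups.items()}


def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Regression: apartment area (m^2) -> price (thousand USD), toy numbers.
areas = [30, 50, 70, 90]
prices = [150, 250, 350, 450]
w, b = fit_line(areas, prices)
predicted_price = w * 60 + b  # a continuous number

# Classification: links per email -> label, toy numbers.
links = [0, 1, 2, 8, 9, 10]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
centroids = fit_centroids(links, labels)
predicted_label = classify(centroids, 7)  # a discrete category
```

Both models learn f(x) ≈ y from labeled pairs; the only difference is whether y is a number or a category.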

**How to measure quality?** For classification - **accuracy** (fraction of correct answers), **precision**, and **recall**. For regression - **MSE** (mean squared error) and **MAE** (mean absolute error). Metrics are computed on the **test set** - data the model has never seen during training. ImageNet top-1 accuracy for ResNet-152: 78.57%. Two models can only be compared fairly when they are evaluated on the same test set.
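The metrics named above follow directly from their definitions. The sketch below computes them in plain Python on small made-up prediction lists; the function names and data are illustrative assumptions, and in practice a library such as scikit-learn would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors more strongly."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average size of the error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification on a toy test set (1 = spam, 0 = not spam).
labels_true = [1, 0, 1, 1, 0, 0, 1, 0]
labels_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(labels_true, labels_pred)
prec, rec = precision_recall(labels_true, labels_pred)

# Regression on a toy test set.
values_true = [3.0, 5.0, 2.0]
values_pred = [2.5, 5.0, 4.0]
regression_mse = mse(values_true, values_pred)
regression_mae = mae(values_true, values_pred)
```

Note how the single large error (2.0 predicted as 4.0) dominates MSE far more than MAE.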

A company wants to predict how much each customer will spend next month. What type of supervised learning task is this?

Unsupervised Learning: learning without a teacher

Spotify never asks users which "musical cluster" they belong to. The algorithm receives 600 million listeners with listening histories - and finds the structure on its own. This is **unsupervised learning**: working with **unlabeled data** - only inputs **x**, no labels **y**. Patterns, groups, anomalies - all discovered without a "teacher".

Three main unsupervised learning tasks:
- **Clustering** - grouping similar objects: customer segments, document types, gene groups
- **Dimensionality reduction** - compressing 1000 features down to 50 while preserving meaning: PCA, t-SNE, UMAP
- **Anomaly detection** - finding unusual data points: fraud, equipment failures, network intrusions
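Clustering is the easiest of the three tasks to show in a few lines. Below is a minimal K-Means sketch in plain Python, assuming 2-D points and hand-picked starting centroids (real implementations choose them randomly or with k-means++). The data and the names `kmeans` and `dist2` are illustrative assumptions.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled toy data: only x, no y.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (5, 5)])
```

No label is ever supplied: the two groups emerge only from distances between the points.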

	Supervised	Unsupervised
Data	Labeled (x, y)	Unlabeled (x only)
Goal	Predict label y	Find hidden structure
Quality evaluation	Compare with correct answer	No 'correct' answer
Example task	Spam / not spam	Find customer groups
Typical algorithms	Linear Regression, Random Forest, Neural Networks	K-Means, PCA, DBSCAN, Autoencoders

Unsupervised learning in production

Where it runs right now

**Spotify Discover Weekly** - K-Means plus collaborative filtering segment listeners across a track embedding space. The result is a weekly playlist of 30 tracks, reported to outperform manual curation in engagement.
**PayPal Fraud Detection** - Isolation Forest and autoencoders across 15 million transactions per day. An anomaly is a point that sits far from normal behavior, so the method needs zero labeled fraud examples.
**Genomics** - PCA compresses data on roughly 20,000 genes down to 2-3 components. Plotted on those axes, populations and family relationships become visible to the naked eye.
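The idea that "an anomaly is a point far from normal behavior" can be illustrated without any labels. The sketch below does not use Isolation Forest or autoencoders; it swaps in a much simpler stand-in, the modified z-score based on the median and the median absolute deviation (MAD), with the conventional 3.5 cutoff. The transaction amounts and the function name are assumptions made up for the example.

```python
from statistics import median


def robust_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread at all: this simple rule cannot rank the points
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Unlabeled transaction amounts (toy data); nobody marked the fraud.
amounts = [20, 22, 19, 21, 23, 20, 500]
anomalies = robust_outliers(amounts)
```

The median is used instead of the mean because a single extreme value would otherwise inflate the mean and standard deviation enough to hide itself.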

The main challenge of unsupervised learning is **there is no objective quality metric**. In supervised learning the prediction is compared to the correct answer. In unsupervised there is no "correct answer". Did K-Means divide customers well into 3 groups? Maybe 5 are needed? Or 7? The answer depends on business context, not mathematics. Silhouette score helps, but does not resolve the question.
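To make the silhouette score concrete, here is a minimal sketch for 1-D points, assuming absolute difference as the distance and at least two clusters. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b). The data and function name are illustrative assumptions.

```python
def silhouette_score(points, labels):
    """Mean silhouette over 1-D points; requires at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels)
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


points = [1, 2, 3, 10, 11, 12]
compact = silhouette_score(points, [0, 0, 0, 1, 1, 1])    # groups match the gaps
scrambled = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # groups ignore the gaps
```

A higher score means tighter, better-separated clusters, but it only ranks candidate groupings geometrically. Whether three segments or five are useful to the business remains a judgment call.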

An online store wants to divide customers into groups by behavior, without knowing in advance what groups exist. Which approach fits?

Reinforcement Learning: learning through rewards

2016. AlphaGo beat Lee Sedol 4:1. The task: roughly 10^170 legal board positions. Supervised learning alone could not solve it - labeling every Go position is impossible, and imitating human games (which AlphaGo used only as a starting point) cannot go beyond human play. Unsupervised could not either: structure without a goal says nothing about how to win. The answer was **reinforcement learning**: an agent learns by interacting with an environment and receiving rewards or penalties. The correct move is never given. Only the objective - win.

Key components of RL:
- **Agent** - the decision-maker (robot, game bot, algorithm)
- **Environment** - the world the agent interacts with (game, road, market)
- **State** - the current state of the world (board position, car speed)
- **Action** - the agent's action (move right, accelerate, buy a stock)
- **Reward** - a numerical feedback signal (+1 for a correct move, -100 for losing)
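The five components above map onto a short interaction loop. The sketch below uses a hypothetical five-cell corridor as the environment and a fixed "always move right" policy as the agent; the class, the reward values (+10 at the goal, -1 per step), and the `reset` / `step` interface are assumptions chosen for illustration, loosely following the convention used by common RL toolkits.

```python
class Corridor:
    """Hypothetical environment: cells 0..4, the goal is the rightmost cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at the left end and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action -1 (left) or +1 (right); return (state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 10 if done else -1  # small penalty per step, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward, steps = 0, 0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(state):
    """A fixed policy; a learning agent would adjust this from rewards."""
    return 1


total_reward, steps = run_episode(Corridor(), always_right)
```

Nothing in the loop tells the agent which action is correct. It only sees states and rewards, which is what separates this setup from supervised learning.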

The core dilemma in RL is **exploration vs exploitation**. A restaurant rated 4.2 has been found. Return there (exploitation - use what is known) or try a new one that could be 4.8 or 2.5 (exploration - discover the unknown)? The agent must balance proven strategies against better alternatives. Too much exploitation: stuck in a local optimum. Too much exploration: never converge.
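One simple way to balance the two is the epsilon-greedy rule: with probability epsilon pick a random option, otherwise pick the option with the best estimate so far. The sketch below applies it to the restaurant example as a multi-armed bandit. The true mean ratings, the noise level, and the parameter values are assumptions made up for illustration.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, noise=0.3, seed=0):
    """Run an epsilon-greedy agent; return visit counts and rating estimates."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore: try any restaurant
        else:
            arm = max(range(len(true_means)), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[arm], noise)  # noisy rating for this visit
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return counts, estimates


# Restaurant 0 is the known 4.2; restaurants 1 and 2 are the unknown 4.8 and 2.5.
counts, estimates = epsilon_greedy([4.2, 4.8, 2.5])
```

With epsilon set to 0 the agent would settle on the first restaurant it tries and never discover the 4.8. With epsilon set to 1 it would keep visiting the 2.5 a third of the time forever.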

RL in production

Breakthrough applications of reinforcement learning

**AlphaGo (DeepMind, 2016)** - MCTS plus policy gradient across 10^170 positions. Self-play: the agent played millions of games against itself, gradually improving its strategy.
**ChatGPT / InstructGPT** - the final training stage uses **RLHF** (Reinforcement Learning from Human Feedback): humans compare responses, PPO shifts weights toward more helpful outputs.
**Boston Dynamics Spot** - quadruped walking via RL in simulation: thousands of falls, penalty for falling, reward for distance. The policy trained in simulation transfers to real hardware.
**Google Data Center Cooling** - DeepMind cut cooling energy use by 40% using an RL agent controlling fans and chillers across 120 environment variables.

**Key difference from supervised:** in supervised the model gets the correct answer **immediately** - here is the input, here is the label. In RL the reward can be **delayed**: a chess move on turn 10 might lead to a win on turn 40. The agent must figure out which actions led to success - this is the **credit assignment problem**. RLHF in ChatGPT runs into the same problem: the human preference yields one reward for the whole response, and the algorithm has to work out which parts of the response earned it.
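A standard first step toward credit assignment is the discounted return, G_t = r_t + γ·G_{t+1}, which passes a late reward back to earlier steps with decreasing weight. The sketch below is a minimal illustration; the discount factor and the reward sequence are assumptions chosen so the arithmetic is easy to follow.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, working backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Three moves with no feedback, then a win worth +1 on the last move.
episode_rewards = [0, 0, 0, 1]
returns = discounted_returns(episode_rewards, gamma=0.5)
```

With gamma = 0.5 the final reward of 1 is credited as 0.5, 0.25 and 0.125 to the three earlier moves, so actions closer to the win receive more credit.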

A robot vacuum learns to clean an apartment more efficiently. It receives +1 for each square meter cleaned and -10 for colliding with furniture. What type of ML is this?

Hybrid approaches: semi-supervised and self-supervised

Labeling data is expensive. Annotating medical scans requires specialists whose time costs thousands of dollars per month. But unlabeled scans number in the millions. **Semi-supervised learning** combines a small amount of labeled data with a large amount of unlabeled data. Even without a label, data carries information about the structure of the space.
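One simple semi-supervised technique is self-training (pseudo-labeling): fit a model on the few labeled points, label the unlabeled points it is confident about, and refit. The sketch below is a minimal illustration with a 1-D nearest-centroid model and at least two classes assumed. The `margin` confidence rule, the data, and the function name are assumptions made up for the example, not a method described in this lesson.

```python
from statistics import mean


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels; return both sets."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        # Fit: one centroid per class from everything labeled so far.
        by_class = {}
        for x, label in labeled:
            by_class.setdefault(label, []).append(x)
        centroids = {label: mean(xs) for label, xs in by_class.items()}

        # Predict: accept a pseudo-label only if the best class wins clearly.
        confident, uncertain = [], []
        for x in remaining:
            ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
            best, runner_up = ranked[0], ranked[1]
            gap = abs(x - centroids[runner_up]) - abs(x - centroids[best])
            (confident if gap >= margin else uncertain).append((x, best))
        if not confident:
            break
        labeled.extend(confident)
        remaining = [x for x, _ in uncertain]
    return labeled, remaining


# Two expert labels, six unlabeled measurements (toy data).
seed_labels = [(1.0, "a"), (10.0, "b")]
pool = [2.0, 3.0, 4.0, 8.0, 9.0, 5.4]
final_labeled, still_unlabeled = self_train(seed_labels, pool)
```

The ambiguous point near the middle stays unlabeled, which is the intended behavior: wrong pseudo-labels would otherwise be fed back into training.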

**Self-supervised learning** goes further: the model **creates its own task from unlabeled data**. It hides part of the input and tries to reconstruct it:
- **BERT** masks 15% of words in a sentence and learns to predict them: "The cat sat on the [MASK]" → "windowsill"
- **GPT** predicts the next word: "Machine learning is a" → "subfield"
- **MAE (Masked Autoencoder)** hides 75% of image patches and reconstructs the image

No labels - but the task is **generated automatically** from the structure of the data.
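The phrase "generated automatically" can be shown directly: the training pairs below are built from raw text with no annotator involved. This is a simplified sketch, assuming whitespace tokenization. Real BERT additionally replaces some selected tokens with random words or leaves them unchanged, and real GPT works on subword tokens. The function names are illustrative.

```python
import random


def next_token_pairs(tokens):
    """GPT-style task: every prefix is an input, the following token is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """BERT-style task: hide some tokens; the hidden tokens are the labels."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = masked[position]  # label comes from the data itself
        masked[position] = mask_token
    return masked, targets


tokens = "the cat sat on the windowsill".split()
gpt_pairs = next_token_pairs(tokens)
bert_input, bert_targets = masked_pairs(tokens)
```

A single unlabeled sentence of six tokens already yields five next-token examples, which is why the amount of training signal scales with the amount of raw text.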

Self-supervised: the foundation of modern AI

How self-supervised learning changed the industry

**GPT-4** was pre-trained in a self-supervised way: it read a very large text corpus (the exact size has not been published), predicting the next token. No human-provided labels during pre-training.
**CLIP (OpenAI)** - contrastive learning: the model learns to connect images and text. 400 million image-caption pairs from the internet.
**SimCLR (Google)** - takes one image, creates two augmentations, and learns that they are **the same thing**. Completely without labels.
Result: self-supervised pre-training sharply reduces the need for labels. SimCLR fine-tuned on just 1% of ImageNet labels reached 85.8% top-5 accuracy, ahead of AlexNet trained on 100% of the labels.
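The contrastive idea behind CLIP and SimCLR can be sketched with a simplified InfoNCE-style loss: each item in one view should be more similar to its own partner in the other view than to anyone else's partner. The version below is one-directional and uses tiny hand-written 2-D embeddings. The actual CLIP and SimCLR losses are symmetric, and SimCLR also treats other items in the same view as negatives, so treat this as an illustration of the principle only. The names and the temperature value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Average of -log softmax(similarity to the true partner) over all anchors."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(logit) for logit in logits))
        losses.append(log_denominator - logits[i])  # positive pair sits at index i
    return sum(losses) / len(losses)


# Two items, two views each (e.g. two augmentations, or image and caption).
view_a = [(1.0, 0.0), (0.0, 1.0)]
aligned_b = [(0.9, 0.1), (0.1, 0.9)]   # partners point the same way
shuffled_b = [(0.1, 0.9), (0.9, 0.1)]  # partners are swapped

aligned_loss = info_nce(view_a, aligned_b)
shuffled_loss = info_nce(view_a, shuffled_b)
```

Training pushes embeddings toward the low-loss configuration, and the only supervision is the knowledge of which two views came from the same source.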

Approach	Labeled data	Unlabeled data	Example
Supervised	All data labeled	Not used	Spam classifier
Unsupervised	No labels	All data	K-Means clustering
Semi-supervised	A little (1-10%)	A lot (90-99%)	Medical diagnostics
Self-supervised	No labels *	All data	GPT, BERT, CLIP
Reinforcement	No labels	No data **	AlphaGo, robotics

* Self-supervised **formally** creates labels automatically (the masked word = label), so it is sometimes classified as supervised. The key difference: **labeling is free**, because it is generated from the data itself.
** In RL, data is generated through the agent's interaction with the environment, not stored in advance.

True or false: supervised learning is always better than unsupervised because it uses labeled data.

Key ideas

**Supervised learning** - training on labeled pairs (input → answer): classification predicts a category, regression predicts a number. AlexNet, ResNet, Gmail spam filter
**Unsupervised learning** - finding hidden structure in unlabeled data: clustering, dimensionality reduction, anomaly detection. Spotify segmentation, PayPal fraud detection
**Reinforcement learning** - agent learns through environment interaction, balancing exploration and exploitation. AlphaGo, ChatGPT RLHF, Boston Dynamics
**Self-supervised learning** - model creates a task from data structure (masking, token prediction). Foundation of GPT-4, BERT, CLIP - no human labels during pre-training
The paradigm choice is determined by the task and label availability, not preference
Arthur Samuel coined the term "machine learning" in 1959 for a checkers program that taught itself
Semi-supervised and self-supervised fill the gap: when labeled data is scarce but unlabeled data is an ocean

Questions for reflection

Building a recommendation system for an online store - which ML types would be combined and why? Where supervised, where unsupervised?
Why did self-supervised learning become the dominant approach in NLP (GPT, BERT), while supervised remained dominant in computer vision for a long time? What changed?
If reinforcement learning can beat world champions in Go and Dota 2, why is it not used everywhere instead of supervised learning?

Related lessons

ml-01-intro — Core concepts of model, training, and generalization
ml-09-gradient-descent — Gradient descent updates weights inside the supervised pipeline
ml-48-rl-intro — Deep dive into RL: Q-Learning, Policy Gradient, MDP
prob-17 — Markov chains are the math behind states and transitions in RL
aie-03-llm-fundamentals — GPT trains self-supervised on next-token prediction
calc-08-chain-rule — Chain rule is the math behind backprop in supervised learning
stat-01-sampling