Recommender Systems

Context-Aware Recommendations

In 2016, YouTube optimized its recommendations for watch time. Users spent more time on the platform, but in surveys they said, "I feel like I'm wasting time." In 2019 the team published a paper on switching to multi-task learning, which jointly optimizes watch time, satisfaction, and engagement. Context awareness and multi-objective optimization have since become the industry standard.

**Netflix Content Scheduling** - different content recommended on weekdays vs weekends; seasonal collections (summer blockbusters, holiday films)
**Foursquare/Swarm** - time+location aware venue recommendations; the "morning near office = coffee" model
**SASRec on Amazon datasets** - self-attentive sequential recommendation, benchmarked on Amazon review data; a standard baseline for sequential recommendation in e-commerce

Prerequisites

Collaborative filtering and the user-item rating matrix
Feature interactions and embeddings from deep recommendation models
The explore-exploit tradeoff behind bandits

Factorization Machines and the Rise of Context

In 2010, Steffen Rendle introduced Factorization Machines, a model that learns pairwise interactions between every feature through shared latent vectors. The breakthrough was practical: context such as time of day, location, device, or weather could be added as ordinary features alongside user and item ids, and the model would still estimate reliable interaction weights even when most feature combinations were never observed. This made context a first-class signal rather than an afterthought and turned the rating-prediction problem into general feature-based prediction. Factorization Machines underpin much of context-aware recommendation and connect directly to contextual bandits, where the system also balances exploiting known preferences against exploring new context.
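
To make the "shared latent vectors" idea concrete, here is a minimal sketch of second-order FM scoring in plain Python. The feature layout (two users, two items, two context flags), the random weights, and the function names are illustrative assumptions; a real model would learn `w0`, `w`, and `V` from data. The sketch uses the standard reformulation of the pairwise term, which avoids looping over all feature pairs, and includes a naive version for comparison.

```python
import random

def fm_predict(x, w0, w, V):
    """Score one example with a second-order Factorization Machine.

    x  : dict {feature_index: value} with only the non-zero features
    w0 : global bias
    w  : per-feature linear weights
    V  : per-feature latent vectors, each of length k
    """
    k = len(V[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    # Pairwise term via the identity
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    # which is linear in the number of non-zero features.
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())
        s_sq = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same model, but with an explicit loop over every feature pair."""
    idx = sorted(x)
    linear = sum(w[i] * x[i] for i in idx)
    pairwise = 0.0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

# Hypothetical one-hot layout: 0-1 = users, 2-3 = items, 4-5 = context (morning, evening).
rng = random.Random(0)
n_features, k = 6, 3
w0 = 0.1
w = [rng.uniform(-1, 1) for _ in range(n_features)]   # untrained, random weights
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n_features)]

# "user 0 looks at item 1 in the evening"
x = {0: 1.0, 3: 1.0, 5: 1.0}
print(fm_predict(x, w0, w, V))
```

Because the user-context and item-context weights are dot products of latent vectors, a pair such as (user 0, evening) gets a usable weight even if that exact combination never appeared in training, as long as each feature appeared somewhere.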

Temporal Context: seasonality and interest decay

Netflix analyzed viewing patterns and found that on weekday evenings users watch short episodes (20-30 min), while on weekends they watch feature films and long drama series. The same user wants different content at different times, and a context-blind recommender cannot see that difference.

**Seasonality vs short-term patterns:** seasonality (summer -> beach films, December -> holiday content) is a long-period cycle that repeats yearly. Circadian patterns (morning -> podcasts, evening -> series) repeat daily. A model needs to capture both scales, either through separate features or through a multi-scale architecture.
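
One common way to give a model both scales as separate features is to map each periodic quantity onto a circle. The sketch below is an illustration, not a prescribed feature set: the helper names are made up, and the 365-day year ignores leap years.

```python
import math

def cyclic(value, period):
    """Map a periodic value (hour, day of year, ...) to a point on the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def time_features(hour, day_of_year):
    """Two scales side by side: daily (circadian) and yearly (seasonal)."""
    h_sin, h_cos = cyclic(hour, 24)
    d_sin, d_cos = cyclic(day_of_year, 365)   # assumption: leap years ignored
    return [h_sin, h_cos, d_sin, d_cos]

def gap(h1, h2):
    """Distance between two hours in the encoded (sin, cos) space."""
    a, b = cyclic(h1, 24), cyclic(h2, 24)
    return math.hypot(a[0] - b[0], a[1] - b[1])

# 23:00 and 00:00 are one hour apart, exactly like 00:00 and 01:00,
# whereas as raw numbers they would be 23 apart.
print(gap(23, 0), gap(0, 1), gap(0, 12))
```

With this encoding, late evening and just after midnight land next to each other, which a raw 0-23 integer cannot express.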

Why encode the hour of day via sin/cos rather than as a number 0-23?

Location-Aware: geographic context and local relevance

In 2012, Foursquare found that a user at 8:00 AM near their office is very likely looking for coffee. The same user at 7:00 PM in the same area is more likely looking for a bar or restaurant. Location and time together create a context that neither variable describes on its own.
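
A minimal way to express "location and time together" is a crossed context key. The sketch below simply counts which venue category was chosen under each (area, time bucket) pair; the bucket boundaries, class name, and the toy check-ins are all hypothetical, and this is not a description of Foursquare's actual model.

```python
from collections import Counter, defaultdict

def time_bucket(hour):
    """Coarse, hand-picked buckets (an assumption for this example)."""
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 17:
        return "midday"
    if 17 <= hour < 23:
        return "evening"
    return "night"

class ContextCounter:
    """Counts chosen categories per crossed (area, time bucket) context."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, area, hour, category):
        self.counts[(area, time_bucket(hour))][category] += 1

    def top(self, area, hour, n=1):
        ranked = self.counts[(area, time_bucket(hour))].most_common(n)
        return [category for category, _ in ranked]

# Made-up check-ins for one user.
model = ContextCounter()
for hour, category in [(8, "coffee"), (9, "coffee"), (8, "coffee"),
                       (19, "bar"), (20, "bar"), (19, "restaurant")]:
    model.observe("office", hour, category)

print(model.top("office", 8))    # same area, morning
print(model.top("office", 19))   # same area, evening
```

Keying on the area alone would merge the morning and evening counts into one ranking, which is exactly the information loss the paragraph above describes.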

**Home vs Office vs Traveling:** one user has several "local contexts". Clustering the locations a user visits (for example, k-means on geocoordinates) separates them into groups such as home, office, gym, and "traveling". Content preferences differ for each cluster.
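
The clustering step can be sketched with a small k-means in plain Python. Everything here is illustrative: the coordinates are invented, the deterministic farthest-point initialization is a convenience for the example, and treating latitude and longitude as flat Euclidean coordinates is only a rough approximation that is acceptable at city scale.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means on (lat, lon) tuples; returns the k centroids."""
    # Deterministic farthest-point initialization: start from the first point,
    # then repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def assign(point, centroids):
    """Index of the nearest centroid, i.e. the point's local context."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Invented check-in coordinates: three visits each to three places.
points = [
    (52.5200, 13.4050), (52.5203, 13.4046), (52.5198, 13.4055),   # "home"
    (52.5005, 13.4500), (52.5001, 13.4497), (52.4998, 13.4504),   # "office"
    (52.5400, 13.3600), (52.5404, 13.3603), (52.5397, 13.3598),   # "gym"
]
centroids = kmeans(points, k=3)
labels = [assign(p, centroids) for p in points]
print(labels)
```

The cluster index can then be used as a categorical context feature. Naming the clusters (home, office, gym) and flagging points far from every centroid as "traveling" would need extra rules that this sketch does not cover.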

Why do location-aware recommendations require accounting for TIME in addition to coordinates?

Session-Based Recommendations: modeling short-term intent

Classic collaborative filtering uses the user's entire history. But user intent **within a session** shifts: a user who came looking for sneakers may, five clicks later, be browsing socks and shorts. Session-based recommendation models the current intent from the sequence of actions in the session, ignoring long-term history.
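
As a minimal illustration of using only the current session, the sketch below fits first-order transition counts ("after item A, users clicked item B") and recommends from the last click. This is a deliberately simple Markov-style baseline, not one of the neural models discussed in this section, and the sessions are made up.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count how often each item is followed directly by each other item."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            transitions[current][nxt] += 1
    return transitions

def recommend_next(transitions, session, n=2):
    """Rank candidates by what most often followed the session's last click."""
    if not session:
        return []
    seen = set(session)
    ranked = transitions.get(session[-1], Counter()).most_common()
    return [item for item, _ in ranked if item not in seen][:n]

# Hypothetical click sessions; no user ids or long-term profiles are involved.
sessions = [
    ["sneakers", "socks", "shorts"],
    ["sneakers", "socks"],
    ["sneakers", "laces"],
    ["socks", "shorts"],
]
transitions = fit_transitions(sessions)

print(recommend_next(transitions, ["sneakers"], n=1))
print(recommend_next(transitions, ["sneakers", "socks"], n=1))
```

The recommendation changes as the session moves from sneakers to socks, even though nothing about the user is known. Sequence models such as GRU4Rec or SASRec condition on the whole session rather than only the last click.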

Model	Architecture	Advantage	Disadvantage
GRU4Rec (2016)	GRU	Effective for short sessions	Poor long-range dependency capture
SASRec (2018)	Transformer (causal)	Long-range dependencies, parallel training	More parameters, slower
BERT4Rec (2019)	BERT (bidirectional)	Bidirectional context	Cannot use directly in online inference
FMLP-Rec (2022)	MLP + FFT	Faster than Transformer, competitive quality	Less studied
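
The "causal" versus "bidirectional" distinction in the table comes down to which positions each step may attend to. The sketch below builds the two attention masks as boolean matrices; the function names are illustrative and no model is trained here.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible(mask, i):
    """Indices that position i is allowed to look at."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

n = 4  # a session of four clicks
print(visible(causal_mask(n), 1))         # only the past and the current click
print(visible(bidirectional_mask(n), 1))  # past and future clicks
```

At serving time there are no future clicks to attend to, which is one way to read the table's note that a bidirectional model needs adaptation before it can be used online.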

How do session-based recommendations differ from classic collaborative filtering?

Multi-Task Learning: joint optimization of multiple objectives

In 2019, YouTube found that optimizing only for clicks (CTR) promoted clickbait, while optimizing only for watch time promoted long, boring videos. **Multi-task learning** addresses this: jointly optimizing CTR, watch time, like probability, and a "no regret" signal surfaced content users actually wanted to see.

**Tasks in MTL:** CTR (click-through rate), CVR (conversion), watch time, skip rate, like/share/save. Each task has its own labels and loss. Final scoring is a weighted combination: `score = w1*CTR + w2*watch_time - w3*skip_rate`. Weights are tuned via A/B tests.
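
The weighted combination can be sketched as follows. The weights, the candidate names, and the predicted values are invented for illustration, and the sketch assumes `watch_time` has been normalized to the 0-1 range so that it is comparable with the two rates.

```python
def final_score(pred, weights):
    """score = w1*CTR + w2*watch_time - w3*skip_rate, using per-task predictions."""
    return (
        weights["ctr"] * pred["ctr"]
        + weights["watch_time"] * pred["watch_time"]
        - weights["skip_rate"] * pred["skip_rate"]
    )

def rank(candidates, weights):
    """Candidate names sorted from highest to lowest combined score."""
    return sorted(
        candidates,
        key=lambda name: final_score(candidates[name], weights),
        reverse=True,
    )

# Hypothetical per-task predictions for two videos.
candidates = {
    "clickbait":      {"ctr": 0.30, "watch_time": 0.10, "skip_rate": 0.70},
    "solid_tutorial": {"ctr": 0.15, "watch_time": 0.60, "skip_rate": 0.10},
}
# Hypothetical weights; in practice these would be tuned via A/B tests.
weights = {"ctr": 1.0, "watch_time": 1.0, "skip_rate": 0.5}

print(rank(candidates, weights))
print(max(candidates, key=lambda name: candidates[name]["ctr"]))  # CTR-only winner
```

Ranking by CTR alone puts the clickbait video first, while the combined score prefers the video that people actually watch.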

Misconception: multi-task learning complicates the system without real gain, so it's better to build separate specialized models.

In practice, MTL can improve each task through shared representations: signal from one task helps another. MMoE (Multi-gate Mixture-of-Experts) gives each task its own gate over a shared pool of experts, so conflicting tasks can route through different specialized paths. A single inference pass instead of N independent models also lowers latency.

Shared lower layers in MTL act as regularization - rare signals (likes) get more training signal through shared features with frequent events (clicks). N separate models don't have this effect.
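
The shared-experts, per-task-gates structure can be sketched as a forward pass in plain Python. This is a toy under stated assumptions: weights are random and untrained, each expert and tower is a single linear layer, and the class name, layer sizes, and task names are made up.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyMMoE:
    """Shared experts, one softmax gate and one output tower per task."""

    def __init__(self, in_dim, expert_dim, n_experts, tasks, seed=0):
        rng = random.Random(seed)
        self.experts = [rand_matrix(expert_dim, in_dim, rng) for _ in range(n_experts)]
        self.gates = {t: rand_matrix(n_experts, in_dim, rng) for t in tasks}
        self.towers = {t: rand_matrix(1, expert_dim, rng) for t in tasks}

    def forward(self, x):
        # Experts are evaluated once and shared by every task.
        expert_out = [relu(matvec(W, x)) for W in self.experts]
        preds, gate_weights = {}, {}
        for task in self.gates:
            # Each task mixes the same experts with its own weights.
            g = softmax(matvec(self.gates[task], x))
            mixed = [
                sum(g[e] * expert_out[e][d] for e in range(len(expert_out)))
                for d in range(len(expert_out[0]))
            ]
            logit = matvec(self.towers[task], mixed)[0]
            preds[task] = 1.0 / (1.0 + math.exp(-logit))   # probability per task
            gate_weights[task] = g
        return preds, gate_weights

model = TinyMMoE(in_dim=4, expert_dim=3, n_experts=2, tasks=["click", "like"])
preds, gate_weights = model.forward([1.0, 0.0, 0.5, -0.2])
print(preds)
print(gate_weights)
```

A single call to `forward` yields predictions for every task, and the per-task gate weights show how each task can lean on a different mix of the same experts.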

YouTube moved to multi-task learning (CTR + watch time + satisfaction). Why is CTR-only optimization insufficient?