Recommender Systems

Multi-Objective and Re-Ranking

Spotify Discover Weekly - 30 personalized tracks every Monday. The first version (2015) delivered very similar tracks: high relevance, zero variety. Users listened to the first few and closed it. After adding diversity-aware re-ranking, completion rate grew dramatically - the playlist began to "feel like discovery" rather than "more of the same".

**Spotify Discover Weekly** - diversity re-ranking to balance personalization and discovery of new music
**LinkedIn Feed** - fairness constraints: new content creators get guaranteed minimum exposure regardless of initial engagement
**TikTok For You Page** - epsilon-greedy-style exploration: roughly one in every N videos is an exploration slot for new content with no history

Prerequisites

Ranking and scoring of candidates
Embeddings and cosine similarity
Basic probability (Beta distribution, expectation)

From Learning to Rank to Multi-Task Ranking

In 2005, Chris Burges and colleagues at Microsoft Research published RankNet, a neural learning-to-rank model trained on pairwise preferences with a probabilistic (cross-entropy) cost. RankNet was deployed in Microsoft's web search, the predecessor of Bing, and started the modern learning-to-rank era, later followed by LambdaRank and LambdaMART. The next leap was learning many objectives at once. In 2018 Jiaqi Ma and colleagues at Google introduced Multi-gate Mixture-of-Experts (MMoE): a set of expert networks is shared across tasks, and each task has its own gating network that decides how to weight the experts, so loosely related objectives stop fighting each other. In 2019 Zhe Zhao and the YouTube team applied multi-task ranking in production, predicting engagement and satisfaction objectives together and adding a shallow tower to correct position bias. That architecture is why a single ranking model can balance clicks, watch time, and satisfaction at the re-ranking stage.
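To make the gating idea concrete, here is a minimal forward-pass sketch of an MMoE-style layer in plain Python. It is an illustration, not the published implementation: the function names, the ReLU experts, the single-layer sigmoid towers, and the random weights are all assumptions chosen for brevity.

```python
import math
import random

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    """Return one probability per task.

    expert_weights: list of matrices, one per shared expert
    gate_weights:   {task: matrix with one row per expert}
    tower_weights:  {task: vector over the expert output dimension}
    """
    # Every task sees the same expert outputs (ReLU is an assumption here).
    expert_out = [[max(0.0, h) for h in linear(w, x)] for w in expert_weights]
    dim = len(expert_out[0])
    preds = {}
    for task, gate_w in gate_weights.items():
        # The per-task gate produces a softmax weight for each expert.
        gate = softmax(linear(gate_w, x))
        mixed = [sum(g * e[d] for g, e in zip(gate, expert_out)) for d in range(dim)]
        logit = sum(w * m for w, m in zip(tower_weights[task], mixed))
        preds[task] = 1.0 / (1.0 + math.exp(-logit))
    return preds

# Hypothetical toy setup: 3 input features, 2 experts with 4 hidden units, 2 tasks.
rng = random.Random(0)
def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

experts = [rand_matrix(4, 3) for _ in range(2)]
gates = {"click": rand_matrix(2, 3), "watch_time": rand_matrix(2, 3)}
towers = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in gates}
predictions = mmoe_forward([0.5, -1.0, 2.0], experts, gates, towers)
```

Because each task has its own gate, two tasks can lean on different experts for the same input, which is the mechanism that reduces interference between objectives.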

Diversity: why 10 similar recommendations are worse than 10 varied ones

Early versions of Spotify Discover Weekly delivered 30 tracks very similar to each other - same genre, same tempo. Users listened to the first 5 and closed the playlist. After introducing diversity-aware re-ranking, the playlist began to "feel fresh" - and completion rate grew. Relevance and diversity are different quality axes.

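**Maximal Marginal Relevance (MMR)** is an iterative selection algorithm that balances relevance and novelty. At each step it selects the candidate with the highest MMR score: λ times its relevance minus (1-λ) times its maximum similarity to the items already selected. With λ=1 the result is a pure relevance ranking; lower values of λ penalize redundancy more strongly.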
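The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names `mmr` and `cosine`, the dictionary inputs, and the toy embeddings are hypothetical, and cosine similarity is used because it is listed in the prerequisites.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(relevance, embeddings, k, lam=0.7):
    """Greedy MMR over item ids.

    relevance:  {item: relevance score}
    embeddings: {item: vector}
    lam:        trade-off; 1.0 = relevance only, 0.0 = redundancy penalty only
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def score(item):
            # Similarity to the closest already-selected item (0 for the first pick).
            max_sim = max(
                (cosine(embeddings[item], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" is different but less relevant.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}
diverse = mmr(relevance, embeddings, k=2, lam=0.5)   # picks "a", then "c"
greedy = mmr(relevance, embeddings, k=2, lam=1.0)    # picks "a", then "b"
```

In the toy data the near-duplicate "b" loses the second slot to "c" once the redundancy penalty is switched on.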

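**Submodular optimization:** diversity objectives are often submodular - adding each new item yields diminishing marginal gains in variety. For monotone submodular objectives under a size constraint, the greedy algorithm is guaranteed to reach at least (1-1/e) ≈ 63% of the optimal value in polynomial time. MMR follows the same greedy pattern, although the formal guarantee applies only when the objective being maximized is actually monotone submodular.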
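For reference, the diminishing-returns property and the greedy bound can be stated as follows. The symbols are notation introduced here, not taken from the text: f is a set function measuring the value of a selection, S and T are sets of items, x is a candidate item, and k is the number of items to pick.

```latex
% Submodularity (diminishing returns): adding x helps a small set at least as much as a larger one
f(S \cup \{x\}) - f(S) \;\ge\; f(T \cup \{x\}) - f(T)
\qquad \text{for all } S \subseteq T,\; x \notin T

% Greedy guarantee for monotone, non-negative submodular f under a cardinality constraint
f(S_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{|S| \le k} f(S)
```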

MMR with lambda=0 (minimum) prioritizes:

Fairness: equity for providers and users

Spotify found that 90% of listens go to 1% of artists - even among tracks of equal quality. New artists receive almost no exposure. This is a **provider fairness** problem: inequity for content creators. A **user fairness** problem arises when the system systematically serves minority groups worse (users with niche tastes, non-English speakers).

| Fairness type | Who suffers | Metric | Solution |
| --- | --- | --- | --- |
| Provider fairness | New/niche artists | Exposure per group | Min-exposure constraints |
| User fairness | Users with niche tastes | Error rate by group | Group-specific calibration |
| Disparate impact | Protected categories | DI ratio (>= 0.8) | Post-processing re-ranking |
| Popularity bias | Unpopular items | Long-tail coverage | Exploration boost for new items |
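Two of the metrics in the table can be computed directly from a ranked list. The sketch below is illustrative only: the function names, the group labels, and the logarithmic position discount used for exposure are assumptions, not definitions given in the text. The disparate impact ratio is taken here as the selection rate of the protected group divided by that of the reference group.

```python
from math import log2

def exposure_by_group(ranking, group_of):
    """Sum position-discounted exposure per group (1/log2(rank+1) is an assumed discount)."""
    totals = {}
    for rank, item in enumerate(ranking, start=1):
        group = group_of[item]
        totals[group] = totals.get(group, 0.0) + 1.0 / log2(rank + 1)
    return totals

def disparate_impact_ratio(selected, candidates, group_of, protected, reference):
    """Selection rate of the protected group divided by that of the reference group."""
    def rate(group):
        pool = [c for c in candidates if group_of[c] == group]
        chosen = [s for s in selected if group_of[s] == group]
        return len(chosen) / len(pool) if pool else 0.0
    ref_rate = rate(reference)
    return rate(protected) / ref_rate if ref_rate else float("inf")

# Hypothetical catalog: 5 new artists (n1..n5) and 5 established ones (e1..e5).
group_of = {f"n{i}": "new" for i in range(1, 6)}
group_of.update({f"e{i}": "established" for i in range(1, 6)})
candidates = list(group_of)
selected = ["e1", "e2", "n1", "e3", "e4", "n2"]

di = disparate_impact_ratio(selected, candidates, group_of, "new", "established")
exposure = exposure_by_group(selected, group_of)
```

Here new artists are selected at a rate of 2/5 and established artists at 4/5, so the ratio is 0.5, below the 0.8 threshold from the table.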

A disparate impact ratio < 0.8 means:

Exploration-Exploitation: bandits for new items

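Cold-start problem: a new track on Spotify has no interaction history, so collaborative filtering has no signal for it and scores it at or near zero. Epsilon-greedy addresses this by recommending a random item with probability ε (exploration) and the best item by current estimate with probability 1-ε (exploitation). Thompson Sampling is a Bayesian alternative without a hard ε: it keeps a distribution over each item's success rate (for example, a Beta distribution over click probability), samples one value per item, and recommends the item with the highest sample.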
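The epsilon-greedy rule fits in a few lines. This is a minimal sketch; the function name, the `estimates` dictionary, and the track ids are hypothetical.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an item id from {item: estimated reward}.

    With probability epsilon return a uniformly random item (exploration),
    otherwise the item with the highest current estimate (exploitation).
    """
    items = sorted(estimates)  # fixed order keeps behaviour reproducible
    if rng.random() < epsilon:
        return rng.choice(items)
    return max(items, key=estimates.get)

# A brand-new track starts with an estimate of 0.0 and is only ever
# shown through the exploration branch.
estimates = {"hit_song": 0.31, "steady_song": 0.12, "new_track": 0.0}
choice = epsilon_greedy(estimates, epsilon=0.1, rng=random.Random(42))
```

Note that the exploration rate stays at ε no matter how much has already been learned about each item.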

**When to use which method:** epsilon-greedy is simple but inefficient, because it explores at a constant rate even after the estimates have converged. Thompson Sampling is adaptive - high uncertainty leads to more exploration, low uncertainty to more exploitation. LinUCB incorporates user context, which makes it a strong candidate when exploration needs to be personalized.
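The adaptive behaviour of Thompson Sampling can be seen in a small Beta-Bernoulli simulation. This is a sketch under assumptions: the class and function names, the uniform Beta(1, 1) prior, the binary reward, and the click rates in the example are all hypothetical.

```python
import random

class BetaThompson:
    """Thompson Sampling with a Beta posterior per item and binary rewards."""

    def __init__(self, items, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior: nothing is known about any item yet.
        self.alpha = {item: 1.0 for item in items}
        self.beta = {item: 1.0 for item in items}

    def select(self):
        # Draw one plausible success rate per item and pick the largest draw.
        draws = {
            item: self.rng.betavariate(self.alpha[item], self.beta[item])
            for item in self.alpha
        }
        return max(draws, key=draws.get)

    def update(self, item, success):
        if success:
            self.alpha[item] += 1.0
        else:
            self.beta[item] += 1.0

def simulate(true_rates, rounds, seed=0):
    """Run the bandit against fixed (hypothetical) click rates; return pull counts."""
    bandit = BetaThompson(true_rates, seed=seed)
    env = random.Random(seed + 1)
    pulls = {item: 0 for item in true_rates}
    for _ in range(rounds):
        item = bandit.select()
        pulls[item] += 1
        bandit.update(item, env.random() < true_rates[item])
    return pulls

pulls = simulate({"established": 0.1, "new_track": 0.7}, rounds=2000)
```

An item with few observations has a wide posterior, so it occasionally produces a high draw and gets shown; as evidence accumulates the posterior narrows and the weaker item is selected less and less often.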

Why is Thompson Sampling more efficient than epsilon-greedy for cold-start?

Business Rule Injection: reality on top of ML

An ML model optimizes a proxy metric. The business has additional requirements: content safety, licensing restrictions, sponsored content, regional legal prohibitions. A **post-processing re-ranking layer** applies these rules after ML scoring - without retraining the model.

**Re-ranking pipeline architecture:** ML candidate generation (1000 items) -> ML scoring & ranking -> business rules post-processing -> diversity re-ranking (MMR) -> fairness constraints -> final top-K. Each layer is independent and can be changed without retraining the model.
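The layering can be sketched as a chain of independent stages applied to already-scored candidates. This is an illustrative sketch: the function name, the stage signatures, and the example rule (a blocked-item set) are assumptions, and the diversity and fairness stages are passed in as plain callables so each can be swapped without touching the others.

```python
def rerank_pipeline(scores, is_allowed, diversify=None, enforce_fairness=None, k=10):
    """Post-process ML scores into a final top-k list.

    scores:           {item: ML score} from the ranking model
    is_allowed:       business-rule predicate (safety, licensing, region, ...)
    diversify:        optional callable list -> list (for example, an MMR re-ranker)
    enforce_fairness: optional callable list -> list (for example, min-exposure promotion)
    """
    # 1. ML ranking: order candidates by model score, best first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # 2. Business rules: drop anything that violates a hard constraint.
    ranked = [item for item in ranked if is_allowed(item)]
    # 3. Diversity re-ranking, if configured.
    if diversify is not None:
        ranked = diversify(ranked)
    # 4. Fairness constraints, if configured.
    if enforce_fairness is not None:
        ranked = enforce_fairness(ranked)
    # 5. Final top-k.
    return ranked[:k]

# Hypothetical example: "a" is blocked in this region; "d" is a new artist
# that the fairness stage promotes to the first slot.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
blocked = {"a"}
promote_d = lambda ranked: ["d"] + [i for i in ranked if i != "d"]

plain = rerank_pipeline(scores, lambda i: i not in blocked, k=2)
fair = rerank_pipeline(scores, lambda i: i not in blocked,
                       enforce_fairness=promote_d, k=2)
```

Changing a rule means editing one predicate or one stage; the scoring model and its weights are untouched.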

Common misconception: diversity and fairness conflict with relevance, so they cannot be optimized simultaneously.

In practice, trading a small amount of relevance for more diversity and fairness often increases long-term engagement and user satisfaction. MMR with lambda=0.7 typically loses 2-5% precision while gaining 20-30% diversity. This is an acceptable trade-off for most products.

Users tire of filter bubbles - homogeneous content reduces session engagement. Studies at Netflix and Spotify have reported that diverse recommendations increase user return rate.

Why are business rules applied after ML scoring rather than as constraints during model training?