Recommender Systems
Sequential Recommendations
Netflix, 2016. RNN-based recommendations replace matrix factorization. The key insight: viewing history is a sequence, not a set. The next film is determined by the last 3-5 watched items, not by averaging the full history. TikTok goes further: a Transformer over a 100-item sequence predicts the next video in under 10 ms. The architecture evolves from GRU4Rec to SASRec to BERT4Rec - each step revealing a new aspect of user behavior in time.
- TikTok: Transformer over a 100-item session sequence predicts the next video, accounting for watch-completion patterns and topic switches
- Spotify Discover Weekly: sequential patterns in listening history reveal daily mood through temporal genre transitions
- Amazon: 'customers also viewed' reimagined through SASRec - the order of product views encodes purchase intent
- Booking.com: sequential hotel recommendations via a BERT4Rec-based model incorporate trip history and seasonal patterns
Prerequisites
- Deep learning recommendation models (NCF, embeddings)
- Recurrent networks and attention as ways to process ordered data
- The idea of a user-item interaction history
From GRU4Rec to BERT4Rec: Sequence Models Take Over
In 2016, Balazs Hidasi and colleagues published GRU4Rec, the first work to model a browsing session as a sequence fed through a recurrent network rather than as a bag of items. It showed that order carries intent. In 2018, Wang-Cheng Kang and Julian McAuley replaced the RNN with self-attention in SASRec, letting the model attend directly to any earlier item in the sequence and training far faster. In 2019, Fei Sun and colleagues at Alibaba introduced BERT4Rec, applying the masked-item objective of BERT so the model could use context from both directions. Within three years, sequential recommendation moved from a niche idea to the architecture behind the feeds at TikTok, Amazon, and Booking.com.
GRU4Rec: Recurrent Networks for Sessions
Netflix, 2016. The recommendations team replaces matrix factorization with a recurrent network. The core observation: a viewing history is a sequence, not a set. The next film depends on the last few watched items, not an average across the entire history. GRU4Rec (Hidasi et al., 2016) is the first work applying GRU to session-based recommendations. Input: a sequence of item IDs within a session. Output at each step: a probability distribution over all items.
GRU is often preferred over LSTM for session-based recommendations: it has fewer parameters for the same hidden size (2 gates instead of 3) and is faster on short sessions (5-20 items). Zalando deployed GRU4Rec for session-based recommendations in e-commerce and measured +15% CTR over item-KNN.
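The parameter gap can be checked with a quick count. This is a minimal sketch, assuming the textbook formulation with one input matrix, one recurrent matrix, and one bias vector per block (some frameworks keep two bias vectors, which shifts the totals slightly). The function names and the size `d = 100` are hypothetical.

```python
def recurrent_params(input_dim: int, hidden_dim: int, weight_sets: int) -> int:
    """Parameters of one recurrent layer made of `weight_sets` blocks.

    Assumption: each block has an input matrix, a recurrent matrix,
    and a single bias vector.
    """
    per_block = input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim
    return weight_sets * per_block


def gru_params(input_dim: int, hidden_dim: int) -> int:
    # 2 gates (update, reset) + 1 candidate state = 3 blocks
    return recurrent_params(input_dim, hidden_dim, weight_sets=3)


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    # 3 gates (input, forget, output) + 1 candidate cell = 4 blocks
    return recurrent_params(input_dim, hidden_dim, weight_sets=4)


d = 100  # hypothetical embedding and hidden size
print(gru_params(d, d), lstm_params(d, d))
```

Under these assumptions the GRU layer carries three quarters of the LSTM layer's parameters at any hidden size.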
Session-parallel mini-batches are the key trick in GRU4Rec. Each position (slot) in the batch follows a separate session, one item per step. Hidden states persist across steps within a session; when a session ends, the next waiting session takes over its slot and that slot's hidden state is reset. This enables thousands of sessions of different lengths to be processed in parallel without truncation.
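The slot mechanics can be sketched as a plain generator. This is an illustrative sketch rather than the GRU4Rec reference code: the function name, the `reset` flag list, and the stopping rule (stop as soon as a finished slot cannot be refilled) are assumptions made for the example.

```python
from typing import Iterator, List, Tuple

Batch = Tuple[List[int], List[int], List[bool]]


def session_parallel_batches(
    sessions: List[List[int]], batch_size: int
) -> Iterator[Batch]:
    """Yield (inputs, targets, reset) triples, one per time step.

    Slot i of every batch follows one session at a time. reset[i] is True
    when slot i has just switched to a new session, so the caller should
    zero that slot's hidden state before the step.
    """
    # Sessions with a single item have no (input, target) pair.
    usable = [s for s in sessions if len(s) >= 2]
    if len(usable) < batch_size:
        return
    active = list(range(batch_size))  # session index held by each slot
    cursor = [0] * batch_size         # position inside that session
    reset = [True] * batch_size       # every slot starts fresh
    next_session = batch_size         # next session waiting for a slot

    while True:
        inputs = [usable[s][p] for s, p in zip(active, cursor)]
        targets = [usable[s][p + 1] for s, p in zip(active, cursor)]
        yield inputs, targets, reset

        reset = [False] * batch_size
        for slot in range(batch_size):
            cursor[slot] += 1
            # Session exhausted: hand the slot to the next waiting session.
            if cursor[slot] + 1 >= len(usable[active[slot]]):
                if next_session >= len(usable):
                    return  # nothing left to fill the slot (assumed stop rule)
                active[slot] = next_session
                cursor[slot] = 0
                reset[slot] = True
                next_session += 1


toy_sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]
for step in session_parallel_batches(toy_sessions, batch_size=2):
    print(step)
```

With this stopping rule a few trailing steps of the sessions still in progress are dropped once the pool of waiting sessions runs dry; a production loader would need to decide how to handle that tail.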
The limitation of GRU4Rec: no attention mechanism over distant items. If a user watched a film 15 positions ago, GRU gradually forgets it. For long sessions (Netflix, YouTube) this is critical - it directly motivates the move to self-attention.
Why does GRU4Rec use session-parallel mini-batches instead of standard batching?
Self-Attention for Item Sequences
2018. Self-attention in NLP has already shown that long-range dependencies can be captured directly - without a recurrent pass. The same idea is applied to item sequences. Each item in a user's history attends to all other items and weights them by relevance. Positional embeddings encode order: an item at position 1 and an item at position 10 carry different context even with the same ID.
Self-attention complexity: $O(L^2 d)$, where $L$ is sequence length and $d$ is dimensionality. RNN: $O(L d^2)$. When $L < d$, attention is cheaper. Typical sessions: $L = 50{-}200$, $d = 64{-}256$. At $L = 100$, $d = 256$: attention needs about $100^2 \cdot 256 \approx 2.56$M operations, the RNN about $100 \cdot 256^2 \approx 6.55$M.
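The comparison is easy to reproduce. The sketch below counts only the dominant terms $L^2 d$ and $L d^2$ per layer and ignores constant factors; the function names are hypothetical.

```python
def attention_cost(seq_len: int, dim: int) -> int:
    # Dominant term of one self-attention layer: L^2 * d
    return seq_len ** 2 * dim


def recurrent_cost(seq_len: int, dim: int) -> int:
    # Dominant term of one recurrent layer: L * d^2
    return seq_len * dim ** 2


# One configuration with L < d and one with L > d, both inside the typical ranges.
for seq_len, dim in [(100, 256), (200, 64)]:
    att = attention_cost(seq_len, dim)
    rnn = recurrent_cost(seq_len, dim)
    cheaper = "attention" if att < rnn else "RNN"
    print(f"L={seq_len}, d={dim}: attention={att:,} rnn={rnn:,} -> {cheaper}")
```

The second configuration shows the flip side: once $L > d$, the recurrent layer has the lower operation count, although it still cannot be parallelized across time steps.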
Positional encoding is critical for recommendations. Consider: a user watches an Action film, then Comedy, then Action. Without positions, self-attention treats the history as an unordered set, so the model cannot tell the Action film at position 1 from the one at position 3, or know which item was watched last. Netflix found that the last 3-5 items have disproportionate influence on the next choice - learnable positional embeddings encode this directly.
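A minimal sketch of how a position vector separates two occurrences of the same item. The embedding tables here are random stand-ins for learned parameters, and all names and sizes (`DIM`, `MAX_LEN`, `encode`) are hypothetical.

```python
import random

rng = random.Random(0)
DIM = 4      # hypothetical embedding size
MAX_LEN = 3  # hypothetical maximum sequence length


def random_vector():
    return [rng.gauss(0.0, 0.1) for _ in range(DIM)]


# In a trained model both tables are learned; random values stand in here.
item_emb = {"action": random_vector(), "comedy": random_vector()}
pos_emb = [random_vector() for _ in range(MAX_LEN)]


def encode(history):
    """Add the position vector to each item vector, element by element."""
    return [
        [i + p for i, p in zip(item_emb[item], pos_emb[pos])]
        for pos, item in enumerate(history)
    ]


encoded = encode(["action", "comedy", "action"])
# Same item ID at positions 0 and 2, but the model receives different vectors.
print(encoded[0] != encoded[2])
```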
Under what condition is self-attention computationally more efficient than RNN?
SASRec: Unidirectional Transformer for Recommendations
Self-Attentive Sequential Recommendation (Wang-Cheng Kang, Julian McAuley, 2018). The approach: take a Transformer decoder (past items only), remove cross-attention, keep 2 self-attention layers with a causal mask. The causal mask means the position holding item $t$ attends only to items $1, 2, ..., t$ (itself and everything earlier) and is trained to predict item $t+1$, so no future item leaks in. Inference: the last hidden state is multiplied by the item embedding matrix, and the top-K items by inner product are the recommendations.
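Both the mask and the scoring step fit in a few lines. This sketch assumes the common convention that a position may attend to itself and to everything before it; the toy hidden state and item vectors are made up, and `causal_mask` and `top_k` are hypothetical helper names.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def top_k(hidden, item_embeddings, k):
    """Rank items by inner product with the last hidden state."""
    scores = {
        item: sum(h * e for h, e in zip(hidden, emb))
        for item, emb in item_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


for row in causal_mask(3):
    print(row)

# Toy values: a 2-dimensional hidden state and three candidate items.
last_hidden = [1.0, 0.0]
catalog = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]}
print(top_k(last_hidden, catalog, k=2))
```

At catalog scale the exhaustive scoring loop would normally be replaced by a matrix multiply or an approximate nearest-neighbor index.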
SASRec became the standard baseline for sequential recommendations. On Amazon Beauty: NDCG@10 = 0.0735 (SASRec) vs 0.0627 (GRU4Rec) vs 0.0481 (Caser). On MovieLens-1M: HR@10 = 0.8295 vs 0.7367 (GRU4Rec). SASRec requires 2-3x fewer parameters than BERT4Rec.
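For reference, HR@K and NDCG@K can be computed as follows. The sketch assumes a leave-one-out protocol with a single held-out item per user, which is common for these benchmarks but not stated above; the ranks are made-up values, not results from any paper.

```python
import math


def hit_rate_at_k(ranks, k):
    """Share of users whose held-out item is ranked in the top k (1-based ranks)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """With one relevant item per user, the ideal DCG is 1, so NDCG = 1/log2(rank + 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the held-out item for four users.
ranks = [1, 3, 12, 7]
print(hit_rate_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```

HR@K only asks whether the item made the list, while NDCG@K also rewards placing it near the top.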
SASRec is trained with binary cross-entropy and negative sampling rather than a full softmax. For each positive item, one negative item is sampled (one the user never interacted with). A softmax over all items (100K+) is too expensive to compute at every position. Negative sampling approximates it and preserves quality at 50-100x lower computational cost.
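The loss for a single position can be sketched as below, assuming scores are inner products between the hidden state and item embeddings. The helper names and the toy vectors are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_negative(all_items, seen, rng):
    """Draw one item the user has not interacted with."""
    candidates = [item for item in all_items if item not in seen]
    return rng.choice(candidates)


def bce_loss(hidden, pos_emb, neg_emb):
    """Binary cross-entropy for one positive and one sampled negative item."""
    pos_score = dot(hidden, pos_emb)
    neg_score = dot(hidden, neg_emb)
    return -math.log(sigmoid(pos_score)) - math.log(1.0 - sigmoid(neg_score))


rng = random.Random(0)
negative = sample_negative([1, 2, 3, 4], seen={1, 2}, rng=rng)

hidden = [1.0, 0.0]
# Low loss when the positive scores high and the negative scores low.
print(bce_loss(hidden, pos_emb=[2.0, 0.0], neg_emb=[-2.0, 0.0]))
# Higher loss when the two are swapped.
print(bce_loss(hidden, pos_emb=[-2.0, 0.0], neg_emb=[2.0, 0.0]))
```

Each position touches two item embeddings instead of the whole catalog, which is where the savings come from.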
Why does SASRec use causal (unidirectional) attention rather than bidirectional?
BERT4Rec: Bidirectional Training for Recommendations
BERT4Rec (Sun et al., 2019) applies the BERT idea to item sequences. Cloze task: 15-20% of items in a session are randomly masked, and the model learns to reconstruct them from context in both directions. Bidirectional attention means that if item[4] is masked, the model predicts it from item[1], item[2], item[3] on the left and item[5], item[6] on the right. Hypothesis: bidirectional context produces richer item representations.
BERT4Rec inference: a [MASK] token is appended to the end of the sequence; the model predicts it - this is the next item. The paradox: during training the model uses bidirectional context, but at inference it does not. [MASK] at the last position sees only the past (no future exists), which closely mirrors a unidirectional setup.
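The difference between the training input and the inference input can be sketched in a few lines. This is an illustration under simple assumptions: each item is masked independently with probability `mask_prob`, and the function names and the `"[MASK]"` string token are hypothetical.

```python
import random

MASK = "[MASK]"


def cloze_mask(sequence, mask_prob, rng):
    """Return (masked_sequence, labels) for Cloze training.

    labels[i] holds the original item where position i was masked, else None.
    """
    masked, labels = [], []
    for item in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(item)
        else:
            masked.append(item)
            labels.append(None)
    return masked, labels


def inference_input(history):
    """Append one mask token; the model's prediction for it is the next item."""
    return list(history) + [MASK]


rng = random.Random(0)
session = [10, 11, 12, 13, 14, 15]
print(cloze_mask(session, mask_prob=0.2, rng=rng))  # masks can land anywhere
print(inference_input(session))                     # the mask is always last
```

During training a masked position usually has real items on both sides. At inference the only mask sits at the end, so everything it attends to lies in the past.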
Comparison on MovieLens-20M: BERT4Rec NDCG@10 = 0.2711 vs SASRec = 0.2713. Difference within noise. On Amazon Books (sparse): SASRec is better by 8%. Conclusion: bidirectional training helps on dense datasets with long sessions; it loses on sparse ones. Most production systems use SASRec or its variants.
Misconception: Bidirectional attention is always better than unidirectional for sequential recommendations
Reality: SASRec (unidirectional) matches or outperforms BERT4Rec on most benchmarks, especially on sparse data
Bidirectional training via Cloze task is a proxy objective that does not align with next-item prediction. At inference, the [MASK] token sees only the past, reproducing a unidirectional setup. Bidirectional training only helps on dense datasets with long sessions, where the richness of context outweighs the train/inference alignment gap.
What is the key train/inference compromise in BERT4Rec?
Key Ideas
- **GRU4Rec** (2016) - first application of RNN to session-based recommendations. Session-parallel batching preserves hidden states. Limitation: no direct access to distant items.
- **Self-Attention** captures dependencies at any distance in $O(L^2 d)$. Positional embeddings encode item order. More efficient than RNN when $L < d$.
- **SASRec** (2018) - 2-layer Transformer with causal mask. Unidirectional attention aligns with next-item prediction. Standard baseline for sequential rec.
- **BERT4Rec** (2019) - Cloze task with bidirectional attention. Train/inference mismatch: at inference [MASK] sees only the past. Advantage only on dense datasets.
Related Topics
Sequential recommendations bridge NLP architectures and recommender system objectives:
- Deep Learning for Recommendations — GRU4Rec and SASRec build on neural CF and embedding approaches from the previous lesson
- Collaborative Filtering — Sequential models address the static nature of CF: user and item embeddings update with each new interaction
- Transformer and Self-Attention — SASRec and BERT4Rec are direct adaptations of the Transformer for discrete item sequences
- Model Evaluation Metrics — NDCG@K, HR@K, MRR are standard metrics for evaluating sequential recommendations
- AI Service API Integration — Deploying sequential recommendations requires a real-time inference API with under 10 ms latency
Questions for Reflection
- SASRec uses learnable positional embeddings rather than sinusoidal. What is the difference between 'absolute position in the session' and 'relative distance from the last item'? Which approach is more appropriate for recommendations?
- If a user has interacted with 500 items over six months but the model is trained on sequences of length 200 - what is the correct truncation strategy? Remove older items (from the start) or newer ones (from the end)? How does this choice affect quality?
- BERT4Rec shows an advantage on dense datasets and underperforms on sparse ones. What mechanism explains this? Is it related to the train/inference mismatch, or to the quality of item representations?
Related Lessons
- rec-04 — Deep learning rec baselines (NCF, Two-Tower) are the prerequisite for sequential architectures
- rec-01 — Sequential models extend collaborative filtering by adding a time axis to user-item interactions
- dl-05 — GRU4Rec and SASRec apply the same RNN/Transformer primitives from deep learning to item sequences
- ml-05-evaluation — NDCG@K and HR@K from standard ML evaluation are the canonical metrics for sequential rec benchmarks
- aie-05-api-integration — Deploying sequential rec models requires real-time inference APIs with sub-10ms latency
- rec-02
- rec-03
- ml-01-intro