Feature Engineering for Recommendations
Lesson Objectives
- Separate user features into session and historical and understand their dynamics
- Build item features with position bias correction
- Create cross features for both linear and non-linear models
- Use learned embeddings as dense features for ranking
In 2009 Netflix awarded $1M to the team that improved the RMSE of its rating predictions by 10%. The winning team (BellKor's Pragmatic Chaos) owed much of its gain not to a single better algorithm but to better feature engineering: temporal patterns, interaction features, viewing context. Netflix today estimates that strong feature engineering contributes +40% to quality compared to a baseline with poor features.
- **Netflix** - temporal features (time of day, day of week) outperform algorithm changes in A/B tests
- **Google** - Deep & Cross Network (DCN) for automatic feature crossing across millions of features
- **Alibaba** - DIN (Deep Interest Network) uses attention over user history embeddings
Prerequisites
- One-hot encoding and sparse vectors
- Embeddings and dot products
- Linear and logistic models
Factorization Machines and the Feature-Engineering Era
Before deep learning, recommendation accuracy came mostly from hand-built features: feature crosses that multiply user and item attributes, bucketized counts, and learned embeddings for high-cardinality ids. The trouble is that explicit crosses explode in number and most are never observed in the data, so their weights cannot be learned. In 2010 Steffen Rendle introduced Factorization Machines, which give every feature a latent vector and model each pairwise interaction as a dot product of those vectors. That let the model estimate interaction strength even for feature pairs that never co-occur in training, fixing the sparsity problem that crippled hand-crafted crosses. Factorization Machines won several Kaggle and KDD Cup competitions and directly inspired later hybrids such as DeepFM, where an FM component and a deep network share the same embeddings.
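The pairwise-interaction idea described above can be sketched in a few lines. This is a minimal illustration, not a trained model: the weights and latent vectors are random, and the feature indices (1 for a user, 4 for an item, 5 for a numeric context value) are hypothetical. It scores a sparse input with the second-order FM formula and checks the linear-time reformulation against the explicit double loop.

```python
import random

def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine score for a sparse input.

    x  : dict {feature_index: value}, non-zero features only
    w0 : global bias
    w  : dict {feature_index: linear weight}
    V  : dict {feature_index: latent vector (list of k floats)}
    """
    k = len(next(iter(V.values())))
    linear = sum(w[i] * xi for i, xi in x.items())
    # Pairwise term via the identity
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())
        s_sq = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same model with the explicit double loop over feature pairs."""
    idx = sorted(x)
    score = w0 + sum(w[i] * x[i] for i in idx)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dot = sum(p * q for p, q in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score

# Random parameters stand in for learned ones (assumption for illustration).
random.seed(0)
n_features, k = 6, 4
w0 = 0.1
w = {i: random.uniform(-1, 1) for i in range(n_features)}
V = {i: [random.uniform(-1, 1) for _ in range(k)] for i in range(n_features)}

# Hypothetical input: one-hot user (1), one-hot item (4), numeric context (5).
x = {1: 1.0, 4: 1.0, 5: 0.5}
fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
```

Because every feature has its own latent vector, the interaction between user 1 and item 4 gets a score from `V[1]` and `V[4]` even if that exact pair never appeared together in training. An explicit cross weight could not provide that.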
User Features: session and historical
A ranking model sees a user as a feature vector. Recommendation quality is directly determined by how well that vector captures the user's current intent. User features fall into two groups with different dynamics: **session features** (what the user is doing right now) and **historical features** (what they did over the past days and months).
Session features are critical for intent detection: a user who listens to jazz in the morning may want rock in the evening. A model without session features produces averaged recommendations and misses the current context.
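The split between the two groups can be sketched as follows. This is a simplified illustration under stated assumptions: the event schema (`ts`, `genre`), the function names, and the 30-minute `session_gap` cut-off are all hypothetical, and production systems usually define sessions by inactivity gaps rather than a fixed window.

```python
from collections import Counter

def genre_share(events):
    """Fraction of events per genre; empty input gives an empty dict."""
    counts = Counter(e["genre"] for e in events)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def user_features(events, now, session_gap=30 * 60):
    """Split a user's event log into session and historical features.

    events      : list of {"ts": unix seconds, "genre": str}
    now         : current unix time
    session_gap : hypothetical cut-off; newer events count as the session
    """
    session = [e for e in events if now - e["ts"] <= session_gap]
    history = [e for e in events if now - e["ts"] > session_gap]
    return {
        "session_len": len(session),
        "session_genre_share": genre_share(session),
        "hist_genre_share": genre_share(history),
    }

# Toy log: mostly jazz over the past days, rock in the last ten minutes.
now = 1_700_000_000
day = 86_400
events = (
    [{"ts": now - day * d, "genre": "jazz"} for d in range(1, 9)]
    + [{"ts": now - day * d, "genre": "rock"} for d in (9, 10)]
    + [{"ts": now - 60 * m, "genre": "rock"} for m in (1, 5, 10)]
)
feats = user_features(events, now)
```

In this toy log the historical profile is 80% jazz, while the session profile is entirely rock. A model that sees only `hist_genre_share` would keep recommending jazz.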
Why is time of day better encoded as sin/cos rather than as an integer 0-23?
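One way to explore this question is to compare distances between encoded hours. The sketch below assumes a hypothetical helper `encode_hour` that maps an hour onto the unit circle. It then measures how far apart 23:00 and 00:00 end up, compared with 00:00 and 01:00 or 00:00 and 12:00.

```python
import math

def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return (math.sin(angle), math.cos(angle))

def dist(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

d_23_0 = dist(encode_hour(23), encode_hour(0))   # adjacent across midnight
d_0_1 = dist(encode_hour(0), encode_hour(1))     # adjacent within the day
d_0_12 = dist(encode_hour(0), encode_hour(12))   # opposite sides of the clock
```

With the integer encoding, 23 and 0 differ by 23 units although they are one hour apart. On the circle, the two adjacent pairs are equally close and midnight versus noon is the maximum distance.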
Item Features: content and statistical
An item (a track, video, or article) is described by two groups of features with different origins. **Content features** are extracted from the item itself: genre, tempo, key, tags, duration. **Statistical features** are aggregates of user behavior: CTR over the last 7 days, average completion rate, number of playlist additions. Both are necessary: content features mitigate the item cold-start problem, and statistical features rank popular items more precisely.
Position bias is the main trap in statistical features: an item's CTR depends on where it appears in the results, not just on quality. Models trained without debiasing learn to promote what was already shown at the top - amplifying the popularity of the already popular.
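One common correction is "clicks over expected clicks" (COEC): divide an item's clicks by the clicks an average item would have collected in the same positions. The sketch below is an illustration with made-up impression logs and hypothetical function names. It also uses the average CTR per position as the propensity estimate, which is a crude assumption: in practice that average is itself confounded by which items were placed where, and is often estimated from randomized traffic.

```python
from collections import defaultdict

def position_ctr(logs):
    """Average CTR per display position (crude propensity estimate)."""
    shows, clicks = defaultdict(int), defaultdict(int)
    for _item, pos, clicked in logs:
        shows[pos] += 1
        clicks[pos] += clicked
    return {p: clicks[p] / shows[p] for p in shows}

def raw_ctr(logs):
    """Plain clicks / impressions per item, ignoring position."""
    shows, clicks = defaultdict(int), defaultdict(int)
    for item, _pos, clicked in logs:
        shows[item] += 1
        clicks[item] += clicked
    return {i: clicks[i] / shows[i] for i in shows}

def coec(logs, pos_ctr):
    """Clicks over expected clicks per item."""
    actual, expected = defaultdict(float), defaultdict(float)
    for item, pos, clicked in logs:
        actual[item] += clicked
        expected[item] += pos_ctr[pos]
    return {i: actual[i] / expected[i] for i in actual if expected[i] > 0}

def impressions(item, pos, shows, clicks):
    """Build toy log rows: (item, position, clicked) with `clicks` ones."""
    return [(item, pos, 1 if n < clicks else 0) for n in range(shows)]

# Made-up data: A and C are always shown at position 1, B and D at position 5.
logs = (
    impressions("A", 1, 100, 20) + impressions("C", 1, 100, 40)
    + impressions("B", 5, 100, 10) + impressions("D", 5, 100, 0)
)
raw = raw_ctr(logs)
debiased = coec(logs, position_ctr(logs))
```

Here item A has twice the raw CTR of item B (0.20 versus 0.10), yet B's COEC is higher: it collects twice the clicks expected at position 5, while A collects only two thirds of what position 1 normally yields.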
Why does a recommender system need content features (genre, audio tempo) when statistical features (CTR, completion rate) provide an accurate quality signal?
Cross Features and Feature Crossing
A model knows that a user listens to rock and that a track is in the rock genre. Separately, these features do not convey their interaction. **Feature crossing** creates a new feature from the combination: `user_genre=rock * item_genre=rock = 1.0`, directly encoding a genre match. This lets linear models capture non-linear patterns.
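A minimal sketch of explicit crossing for a linear model follows. The feature names, the `genre_match` indicator, and the bucket count are illustrative assumptions. The hashed cross uses `zlib.crc32` only as a stable hash to keep the number of cross features bounded, and the weights are set by hand rather than learned.

```python
import zlib

def base_features(user, item):
    """Separate one-hot features for user genre and item genre."""
    return {
        f"user_genre={user['genre']}": 1.0,
        f"item_genre={item['genre']}": 1.0,
    }

def crossed_features(user, item, n_buckets=1_000_003):
    """Base features plus two crosses: a match flag and a hashed full cross."""
    feats = base_features(user, item)
    # Cross 1: a single indicator that the genres match.
    feats["genre_match"] = 1.0 if user["genre"] == item["genre"] else 0.0
    # Cross 2: the full (user_genre, item_genre) pair, hashed into a bucket
    # so the feature space stays bounded (hypothetical bucket count).
    key = f"user_genre={user['genre']}|item_genre={item['genre']}"
    bucket = zlib.crc32(key.encode("utf-8")) % n_buckets
    feats[f"cross_bucket={bucket}"] = 1.0
    return feats

def linear_score(feats, weights):
    """Dot product of a sparse feature dict with a sparse weight dict."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

# Hand-set weight for illustration: only the match indicator matters.
weights = {"genre_match": 2.0}
rock_fan = {"genre": "rock"}
score_match = linear_score(crossed_features(rock_fan, {"genre": "rock"}), weights)
score_miss = linear_score(crossed_features(rock_fan, {"genre": "jazz"}), weights)
```

With only the base features, no choice of linear weights can score both rock-rock and jazz-jazz above both mismatched pairs, because the two matched scores and the two mismatched scores sum to the same total. The cross feature removes that limitation.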
Why create a cross feature `mobile_x_short_track` instead of passing `device` and `duration` as separate features?
Embeddings as features
User ID and Item ID are categorical features with millions of unique values. For a catalog of 100 million IDs, one-hot encoding yields a 100-million-dimensional vector, which is infeasible. An **embedding** compresses each ID into a dense low-dimensional vector (32-512 dimensions), where similar entities are geometrically close. This vector becomes the primary feature for the ranking model.
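Mechanically, an embedding is a table lookup that is equivalent to multiplying a one-hot vector by a weight matrix. The sketch below uses a hypothetical `EmbeddingTable` class with tiny sizes and random values standing in for trained ones, so geometric closeness here carries no meaning. It only shows how the vectors are fetched and turned into ranking features.

```python
import random

class EmbeddingTable:
    """ID -> dense vector lookup. Values are random stand-ins, not trained."""

    def __init__(self, num_ids, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.rows = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_ids)
        ]

    def lookup(self, idx):
        return self.rows[idx]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ranking_features(user_vec, item_vec):
    """Concatenate both embeddings and append their dot product."""
    return user_vec + item_vec + [dot(user_vec, item_vec)]

num_users, num_items, dim = 1000, 500, 8   # toy sizes (assumption)
users = EmbeddingTable(num_users, dim, seed=1)
items = EmbeddingTable(num_items, dim, seed=2)

user_vec = users.lookup(42)
item_vec = items.lookup(7)
features = ranking_features(user_vec, item_vec)

# The lookup equals one-hot @ matrix, without materializing the one-hot.
one_hot = [0.0] * num_users
one_hot[42] = 1.0
via_matmul = [
    sum(one_hot[r] * users.rows[r][c] for r in range(num_users))
    for c in range(dim)
]
```

The ranker receives `2 * dim + 1` dense values per user-item pair instead of two vectors whose length equals the number of IDs.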
Why use a learned embedding for User ID in a recommendation system rather than one-hot encoding?
Summary: Feature Engineering for Recommendations
- User features: session (intent) + short-term + long-term historical
- Item features: content (cold start) + statistical (quality) + position bias debiasing
- Feature crossing: explicit combinations for linear models, FM/DCN for automatic crossing
- Embeddings: User ID and Item ID compressed to dense low-dimensional vectors (32-512 dimensions) via two-tower training
- Feature Store: centralized storage for online serving of current feature values
Related Topics
Feature engineering is the central component of the ranking stage in the two-stage pipeline.
- Candidate Generation — Candidates from retrieval are enriched with features before being passed to the ranker
- Matrix Factorization — MF trains item and user embeddings that are then used as ranking features
- Multi-Objective and Re-Ranking — Re-ranking uses the same features, adding diversity constraints and business rules
Questions for Reflection
- How can a model's exposure to position bias be detected and measured?
- When does a sequence-based user embedding (BERT4Rec) outperform a simple average over interaction history?
- How should a Feature Store handle feature updates under high traffic (100K+ RPS)?
Related Lessons
- rec-09 — Candidates must exist before features are built
- rec-04 — Dense features feed deep ranking models
- rec-11 — Feature quality is validated through A/B testing
- aie-09-embeddings — Embeddings turn categories into dense features
- prob-04-bayes — Target encoding leans on Bayesian smoothing
- stat-08-correlation