Recommender Systems

Graph Neural Networks for Recommendations

2019. Pinterest. 3 billion edges. Matrix factorization sees each pin in isolation. The graph sees: this pin leads to that pin in two hops. PinSage captured that structure - and recommendation engagement rose 30%. Not from bigger data. From the right mathematics: propagation through graph topology.

Pinterest PinSage: 300M pins, GraphSAGE with random walk importance sampling, +30% engagement in production over the previous system
Alibaba GATNE: multi-relational graph (purchases, views, clicks) as separate edge types with Graph Attention Networks - +10% CTR on Taobao
YouTube DNN + graph: user-item interactions modeled as a graph for the co-watch matrix, two-stage retrieval and ranking pipeline
Microsoft Azure Recommendations: LightGCN in a cloud pipeline for B2B recommendations, replacing ALS with lower latency

Предварительные знания

Sequential recommendation models and embeddings
Graphs and adjacency: nodes, edges, neighbors
Message passing as repeated aggregation over neighbors

Sequential Recommendations

PinSage, NGCF, and LightGCN: Graphs Reach Web Scale

In 2018, Rex Ying and colleagues at Pinterest and Stanford published PinSage, the first graph convolutional network deployed at web scale. It ran over a graph of three billion nodes using random-walk sampling so neighborhoods never had to fit in memory, and it lifted engagement in production. In 2019, Xiang Wang and colleagues introduced NGCF, which propagated collaborative signal explicitly through the user-item bipartite graph. In 2020, Xiangnan He and colleagues showed with LightGCN that the feature transformations and non-linearities NGCF inherited from generic GCNs actually hurt recommendation accuracy; stripping them out left only neighborhood aggregation and produced a simpler, stronger, faster model that became the standard baseline.

The User-Item Graph: Structure a Matrix Loses

2019. Pinterest. 300 million pins, 3 billion edges in the graph. A user saves a pin "DIY wooden shelf" - and two hops away in the graph sits "Scandinavian interior design". Matrix factorization sees two independent vectors. The graph sees a connection.

The bipartite recommendation graph is $G = (U, V, E)$, where $U$ is users, $V$ is items, and $E$ is interactions. An edge $(u, i) \in E$ means user $u$ interacted with item $i$. No user-user or item-item edges - only cross-connections.

Why does the graph beat the matrix? Matrix $R \in \mathbb{R}^{|U| \times |V|}$ stores only direct interactions. The graph allows signal to propagate through multi-hop paths: user A liked item X, item X was liked by user B, B liked item Y - so Y may appeal to A. This is called high-order connectivity, and it is precisely what matrix factorization loses.

The $D^{-1/2} A D^{-1/2}$ normalization is critical. Without it, high-degree nodes (popular items) dominate aggregation and drown out the signal from niche pins. After normalization, each neighbor's contribution is scaled inversely proportional to the square root of its degree.

Real-world scale: Pinterest - 300M nodes, 3B edges. A 300M x 300M matrix at 0.00001% density is beyond any RAM budget. Storage relies on sparse COO or CSR format. PyTorch Geometric and DGL are specialized exactly for such sparse graphs.

What is high-order connectivity in a user-item graph, and why does matrix factorization fail to capture it?

LightGCN: Graph Convolution Without the Noise

He et al., 2020. The original NGCF added feature transformations and nonlinearities to GNN - by analogy with classical GCN. The LightGCN authors asked: why? The ablation study answered: remove transformations - performance improves. Remove nonlinearities - improves further. What remained: pure message passing on the bipartite graph.

LightGCN is iterative neighbor aggregation with no weight matrices and no activations. At each layer $k$: $e_u^{(k)} = \sum_{i \in N_u} \frac{1}{\sqrt{|N_u||N_i|}} e_i^{(k-1)}$ for users and symmetrically for items. The final embedding is a weighted sum across all layers: $e_u = \sum_{k=0}^{K} \alpha_k e_u^{(k)}$.

Layer $k=0$: the node's own embedding (id embedding). Layer $k=1$: average of first-order neighbors (items the user interacted with). Layer $k=2$: users who interacted with the same items - similar users. Layer $k=3$: items of similar users - classic item-item CF through 3 hops of the graph.

On Amazon-Book: LightGCN Recall@20 = 0.0411 vs NGCF = 0.0344 (+19%). On Yelp2018: LightGCN = 0.0649 vs NGCF = 0.0579 (+12%). All gains come from removing components. The practical lesson: before adding complexity, check whether it is helping or hurting.

Why does LightGCN outperform NGCF despite having a simpler architecture?

PinSage: GraphSAGE at 3 Billion Edges

2019. Pinterest. LightGCN operates on the full graph - but what if the graph does not fit in memory? PinSage (Ying et al., 2018) is GraphSAGE for industrial scale. Instead of the full adjacency matrix: sampled neighbor aggregation, so each node processes a fixed number of neighbors. Engagement from recommendations rose 30% - because "similar pin" now means "connected within 2 hops in the graph".

The core idea of GraphSAGE: sample a fixed number of neighbors $S$ for each node, aggregate their features, concatenate with the node's own vector, apply a linear transformation. PinSage adds: importance sampling via random walks - neighbors are selected proportionally to visit counts in the random walk, not uniformly. This emphasizes structurally important neighbors.

PinSage vs LightGCN in production: LightGCN requires the full normalized adjacency matrix in memory - impossible at Pinterest's scale. PinSage samples neighbors on the fly from a Kafka stream of interactions. Each pin has multi-modal features: image (ResNet-50, 2048 dim), text (word2vec, 256 dim), metadata. GraphSAGE aggregates these rich features - LightGCN works only with ID embeddings.

Curriculum training in PinSage: start with easy negatives (random pins), then introduce hard negatives (pins from the top-1000 by cosine similarity but without interaction). Without this strategy the model learns to separate visibly different pins and performs poorly on truly similar ones.

Inference at Pinterest: daily batch recomputation of all 300M pin embeddings on a MapReduce cluster. New pins are handled via inductive inference - the model uses features, not IDs - so it can produce an embedding for a new pin without retraining. This is the principal advantage of PinSage over transductive LightGCN.

LightGCN and PinSage solve the same problem with different methods - either can be chosen freely

LightGCN is transductive (ID-only, full graph in memory); PinSage is inductive (features, sampling) - these are different production regimes

LightGCN requires knowledge of the entire graph during training and cannot handle new nodes without retraining. PinSage works with features and can embed a new pin without retraining - critical for Pinterest, where thousands of new pins are added every second.

Why does PinSage use importance sampling via random walks rather than uniform neighbor sampling?

Key Ideas

**Bipartite graph** $G=(U,V,E)$ encodes user-item interactions. High-order connectivity through $k$ hops is what matrix factorization loses. Normalization $D^{-1/2}AD^{-1/2}$ equalizes the contribution of popular and niche nodes.
**LightGCN** (2020): pure message passing with no weight matrices or nonlinearities. Final embedding = mean across K layers. Outperforms NGCF by +12-19% with fewer parameters. Applicable when the graph fits in memory.
**PinSage** (2018): GraphSAGE for industrial scale. Importance sampling via random walks, multi-modal features (image + text), curriculum training with hard negatives. Inductive - new pins without retraining.
**Transductive vs inductive**: LightGCN works with ID embeddings (full graph required), PinSage with features (new nodes can be inferred). The choice is driven by production constraints, not model quality alone.

Вопросы для размышления

LightGCN averages embeddings from all K layers with equal weights (alpha_k = 1/(K+1)). What if a user's interactions are skewed - a few very popular items and many niche ones? How does this affect the final embedding, and what can be done?
PinSage uses inductive inference through pin features (image + text). But for Pinterest users there are no analogous multi-modal features - only interaction history. How should a new user with no history be handled?
LightGCN works only with binary interactions (liked / not liked). How would one extend the model to account for implicit feedback of varying strength: view, click, purchase, repeat purchase?

Связанные уроки

rec-01 — GNN recommendations build on top of collaborative filtering user-item embeddings
rec-04 — Neural CF and Two-Tower are the predecessors of GNN-based recommendation
dl-08 — LightGCN and PinSage apply the same GNN message-passing primitives as graphs for molecules or social networks
rec-05 — GNN is an alternative to sequential models: graph topology instead of temporal order
ml-05-evaluation — NDCG@K and Recall@K are standard metrics for evaluating GNN-based recommenders
ml-01

Recommender Systems

Graph Neural Networks for Recommendations

Pinterest PinSage: 300M pins, GraphSAGE with random walk importance sampling, +30% engagement in production over the previous system
Alibaba GATNE: multi-relational graph (purchases, views, clicks) as separate edge types with Graph Attention Networks - +10% CTR on Taobao
YouTube DNN + graph: user-item interactions modeled as a graph for the co-watch matrix, two-stage retrieval and ranking pipeline
Microsoft Azure Recommendations: LightGCN in a cloud pipeline for B2B recommendations, replacing ALS with lower latency

Предварительные знания

Sequential recommendation models and embeddings
Graphs and adjacency: nodes, edges, neighbors
Message passing as repeated aggregation over neighbors

Sequential Recommendations

PinSage, NGCF, and LightGCN: Graphs Reach Web Scale

The User-Item Graph: Structure a Matrix Loses

What is high-order connectivity in a user-item graph, and why does matrix factorization fail to capture it?

LightGCN: Graph Convolution Without the Noise

Why does LightGCN outperform NGCF despite having a simpler architecture?

PinSage: GraphSAGE at 3 Billion Edges

LightGCN and PinSage solve the same problem with different methods - either can be chosen freely

LightGCN is transductive (ID-only, full graph in memory); PinSage is inductive (features, sampling) - these are different production regimes

Why does PinSage use importance sampling via random walks rather than uniform neighbor sampling?

Key Ideas

**Bipartite graph** $G=(U,V,E)$ encodes user-item interactions. High-order connectivity through $k$ hops is what matrix factorization loses. Normalization $D^{-1/2}AD^{-1/2}$ equalizes the contribution of popular and niche nodes.
**LightGCN** (2020): pure message passing with no weight matrices or nonlinearities. Final embedding = mean across K layers. Outperforms NGCF by +12-19% with fewer parameters. Applicable when the graph fits in memory.
**PinSage** (2018): GraphSAGE for industrial scale. Importance sampling via random walks, multi-modal features (image + text), curriculum training with hard negatives. Inductive - new pins without retraining.
**Transductive vs inductive**: LightGCN works with ID embeddings (full graph required), PinSage with features (new nodes can be inferred). The choice is driven by production constraints, not model quality alone.

Вопросы для размышления

LightGCN averages embeddings from all K layers with equal weights (alpha_k = 1/(K+1)). What if a user's interactions are skewed - a few very popular items and many niche ones? How does this affect the final embedding, and what can be done?
PinSage uses inductive inference through pin features (image + text). But for Pinterest users there are no analogous multi-modal features - only interaction history. How should a new user with no history be handled?
LightGCN works only with binary interactions (liked / not liked). How would one extend the model to account for implicit feedback of varying strength: view, click, purchase, repeat purchase?

Связанные уроки

rec-01 — GNN recommendations build on top of collaborative filtering user-item embeddings
rec-04 — Neural CF and Two-Tower are the predecessors of GNN-based recommendation
dl-08 — LightGCN and PinSage apply the same GNN message-passing primitives as graphs for molecules or social networks
rec-05 — GNN is an alternative to sequential models: graph topology instead of temporal order
ml-05-evaluation — NDCG@K and Recall@K are standard metrics for evaluating GNN-based recommenders
ml-01

Graph Neural Networks for Recommendations

Предварительные знания

PinSage, NGCF, and LightGCN: Graphs Reach Web Scale

The User-Item Graph: Structure a Matrix Loses

LightGCN: Graph Convolution Without the Noise

PinSage: GraphSAGE at 3 Billion Edges

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Graph Neural Networks for Recommendations

Предварительные знания

PinSage, NGCF, and LightGCN: Graphs Reach Web Scale

The User-Item Graph: Structure a Matrix Loses

LightGCN: Graph Convolution Without the Noise

PinSage: GraphSAGE at 3 Billion Edges

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки