Databases

Graph Databases

Google's Knowledge Graph contains 500 billion facts about 5 billion entities. For a search on 'Barack Obama', Google traverses the relationship graph to find spouse, children, profession, alma mater, and related events - all in a single query. A relational equivalent would need 15+ JOINs across tables and could take seconds, while the graph traversal takes milliseconds.

**PayPal**: Neo4j for real-time fraud detection. Graph traversal identifies money laundering rings: account A -> B -> C -> A with high transaction frequency. The same detection in SQL requires 5+ self-JOINs and runs too slowly for real-time blocking.
**LinkedIn**: property graph of 900+ million users for 'People You May Know' and degree-of-connection calculations. BFS at depth 2 finds 2nd-degree connections in milliseconds across the full network.
**NASA**: dependency graph for 300,000+ spacecraft components. Impact analysis - 'which missions fail if this sensor breaks?' - traverses the dependency graph in seconds instead of minutes with recursive SQL CTEs.

The Property Graph Model

A property graph consists of nodes (entities) and edges (relationships), each with a label and a map of properties. Unlike relational tables where relationships are implicit (via foreign keys and JOINs), graph edges are first-class objects stored with pointers directly to adjacent nodes. Traversal follows pointers - no JOIN computation required.
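The model described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular database stores its data; the `Node`, `Edge`, `connect`, and `neighbors` names are hypothetical and chosen only for this example. The point it shows is that each node keeps direct references to its own edges, so a hop is a pointer lookup rather than a search through a shared table.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str
    properties: dict
    # Each node holds direct references to its outgoing edges, so a hop
    # never has to search a global relationship table.
    out_edges: list = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    label: str
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)


def connect(source, label, target, **properties):
    """Create a labeled edge and attach it to its source node."""
    edge = Edge(label, source, target, properties)
    source.out_edges.append(edge)
    return edge


def neighbors(node, label):
    """Follow stored references to adjacent nodes over edges with this label."""
    return [e.target for e in node.out_edges if e.label == label]


alice = Node("Person", {"name": "Alice"})
bob = Node("Person", {"name": "Bob"})
carol = Node("Person", {"name": "Carol"})
connect(alice, "KNOWS", bob, since=2019)
connect(bob, "KNOWS", carol, since=2021)

# Two hops: friends of Alice's friends.
friends_of_friends = [
    fof.properties["name"]
    for friend in neighbors(alice, "KNOWS")
    for fof in neighbors(friend, "KNOWS")
]
```

Each hop here touches only the edges attached to the current node, which is why traversal cost tracks the size of the neighborhood rather than the size of the whole dataset.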

Google Knowledge Graph contains 500 billion facts about 5 billion entities. When searching 'Barack Obama', the graph traversal finds spouse, children, profession, education, and related events in a single multi-hop query. This relationship traversal would require dozens of JOINs in a relational model.

Why is a graph database more efficient than SQL for 'friends of friends' queries as the dataset grows?

Cypher Query Language

Cypher is Neo4j's declarative graph query language, designed to visually represent graph patterns using ASCII art syntax. Nodes are expressed as parentheses (), relationships as arrows -[]->, and properties as curly braces {}. Cypher queries describe the pattern to find, not the traversal algorithm to execute.
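To make "describe the pattern, not the traversal" concrete, here is a toy Python sketch. The pattern is written as plain data, roughly mirroring a Cypher query such as `MATCH (a {name: 'Alice'})-[:KNOWS]->(b) RETURN b.name`, and a hypothetical `match_pattern` helper decides how to evaluate it. The data layout and helper are assumptions made for illustration; a real query engine plans and executes patterns very differently.

```python
# Toy graph: node id -> properties, plus (source, type, target) edges.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}
edges = [(1, "KNOWS", 2), (1, "WORKS_WITH", 3), (2, "KNOWS", 3)]

# The pattern says WHAT to find: a node with these properties, followed
# by one outgoing relationship of the given type.
pattern = {"a": {"name": "Alice"}, "rel": "KNOWS"}


def match_pattern(nodes, edges, pattern):
    """Return the names of all b such that (a)-[:rel]->(b) matches."""

    def node_matches(node_id, wanted):
        return all(nodes[node_id].get(k) == v for k, v in wanted.items())

    return [
        nodes[dst]["name"]
        for src, rel, dst in edges
        if rel == pattern["rel"] and node_matches(src, pattern["a"])
    ]


result = match_pattern(nodes, edges, pattern)
```

The caller never states whether to start from Alice or from the edge list; that choice belongs to the matcher, which is the sense in which the query is declarative.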

LinkedIn uses a graph model for its professional network of 900+ million users. The 'People You May Know' feature computes common connections and 2nd-degree paths in real time. A Cypher-style traversal of 2-3 hops through the connection graph completes in milliseconds, whereas the equivalent chain of SQL JOINs over a 900-million-node network could take minutes.
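A 2-hop traversal with mutual-connection counts can be sketched as follows. This is an illustrative toy, not LinkedIn's implementation; the `connections` data and the `people_you_may_know` function are invented for the example.

```python
from collections import Counter

# Hypothetical undirected connection graph: user -> set of direct connections.
connections = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave", "erin"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
    "erin": {"bob"},
}


def people_you_may_know(connections, user):
    """Rank 2nd-degree connections by number of mutual connections."""
    direct = connections[user]
    mutual = Counter()
    for friend in direct:                      # hop 1
        for candidate in connections[friend]:  # hop 2
            # Skip the user and anyone already directly connected.
            if candidate != user and candidate not in direct:
                mutual[candidate] += 1
    return mutual.most_common()


suggestions = people_you_may_know(connections, "alice")
```

The work done is proportional to the number of connections of the user's connections, not to the total number of users in the network.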

Cypher query: MATCH (a)-[:KNOWS*2..4]-(b) WHERE a.name = 'Alice' RETURN b. What does *2..4 specify?

Graph Traversal Algorithms

Graph traversal algorithms explore nodes and edges to answer questions about connectivity, shortest paths, and community structure. BFS (Breadth-First Search) finds shortest paths by hop count. Dijkstra's algorithm finds shortest paths by total edge weight. PageRank scores a node's importance from the structure of its incoming links. Many graph databases ship these algorithms as built-in procedures or libraries (in Neo4j, the Graph Data Science library) that run directly on the stored graph structure.
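The difference between hop-count and weighted shortest paths is easiest to see side by side. The following is a standard-library Python sketch of BFS and Dijkstra's algorithm over a small hypothetical graph; it illustrates the algorithms themselves, not the API of any graph database.

```python
import heapq
from collections import deque


def bfs_hops(adjacency, start):
    """Minimum hop count from start to every reachable node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, {}):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def dijkstra(adjacency, start):
    """Minimum total edge weight from start to every reachable node."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in adjacency.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical weighted graph: graph[node][neighbor] = edge weight.
graph = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 7},
    "C": {"D": 1},
    "D": {},
}

hops = bfs_hops(graph, "A")
cost = dijkstra(graph, "A")
# Nodes exactly two hops away, i.e. "2nd-degree connections" of A.
second_degree = sorted(n for n, h in hops.items() if h == 2)
```

In this graph C is one hop from A but the cheapest weighted route to it goes through B, so the two algorithms disagree on which path is "shortest".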

PayPal uses Neo4j for real-time fraud detection. Graph traversal identifies rings: account A sends to B, B to C, C back to A - a common money laundering pattern. Detecting this in SQL takes one self-JOIN on the transactions table per hop, so longer rings need a chain of 5+ self-JOINs. Neo4j pattern matching runs in milliseconds on live transaction data.
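The ring pattern described above amounts to finding a bounded-length cycle that returns to the starting account. A minimal depth-first sketch in Python follows; the `transfers` data and the `find_rings` function are hypothetical, and a production system would also weigh amounts, timing, and transaction frequency.

```python
def find_rings(transfers, start, max_hops=5):
    """Return simple cycles start -> ... -> start using at most max_hops edges."""
    rings = []

    def walk(node, path):
        for nxt in transfers.get(node, []):
            if nxt == start and len(path) >= 2:
                # Closed the loop: record the full ring.
                rings.append(path + [start])
            elif nxt not in path and len(path) < max_hops:
                # Extend the path without revisiting accounts.
                walk(nxt, path + [nxt])

    walk(start, [start])
    return rings


# Hypothetical directed transfer graph: sender -> list of receivers.
transfers = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}

rings = find_rings(transfers, "A")
```

The hop limit keeps the search bounded; without it, dense transaction graphs would make the number of candidate paths explode.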

LinkedIn shows '2nd degree connections' - friends of friends. Which algorithm efficiently finds these?

Graph DB vs SQL: When to Choose

Graph databases excel at relationship-heavy queries with variable depth. Relational databases excel at structured transactional workloads, aggregations, and reporting. ACID guarantees are not exclusive to SQL (Neo4j is also ACID-compliant), but relational systems remain the usual choice for the transactional core. Most production systems combine both: PostgreSQL as the source of truth for transactional data, Neo4j or Amazon Neptune for relationship queries.
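The division of labor in such a hybrid setup can be sketched with stand-ins. In the Python example below, `sqlite3` plays the role of PostgreSQL and a plain dictionary plays the role of the graph store; both substitutions, along with the `record_transfer` function and the table layout, are assumptions made so the sketch runs with the standard library alone.

```python
import sqlite3
from collections import defaultdict

# sqlite3 stands in for PostgreSQL; the dict stands in for a graph database.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transfers ("
    "id INTEGER PRIMARY KEY, sender TEXT, receiver TEXT, amount REAL)"
)
graph = defaultdict(set)


def record_transfer(sender, receiver, amount):
    # 1. Transactional write: the relational store is the source of truth.
    with db:
        db.execute(
            "INSERT INTO transfers (sender, receiver, amount) VALUES (?, ?, ?)",
            (sender, receiver, amount),
        )
    # 2. Projection into the graph. Done inline here for brevity; in a real
    #    deployment this step would typically run asynchronously.
    graph[sender].add(receiver)


record_transfer("A", "B", 100.0)
record_transfer("B", "C", 95.0)

# Aggregation goes to the relational side...
total = db.execute("SELECT SUM(amount) FROM transfers").fetchone()[0]
# ...while multi-hop relationship questions go to the graph side.
reachable_in_two = {c for b in graph["A"] for c in graph[b]}
```

The key design choice is that the graph is a derived view: if it is lost or falls behind, it can be rebuilt from the relational records.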

A financial application needs to store transactions (ACID) AND detect fraud patterns (graph traversal). What architecture is recommended?

Graph Database Use Cases

Graph databases are purpose-built for connected data problems. The canonical use cases - social networks, fraud detection, recommendation engines, knowledge graphs, and network topology - all share the same characteristic: the answer depends on traversing relationships across many hops, not on scanning rows or columns.

NASA uses a graph database to model dependencies among 300,000+ spacecraft components. An impact analysis query - 'if thermal sensor X fails, which missions are at risk?' - traverses the dependency graph to find all systems that depend on that sensor. In a relational model, this would require a stored procedure with recursive CTEs that takes minutes. The graph query returns in seconds.
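An impact analysis of this kind is a traversal over reversed dependency edges: start at the failed component and walk to everything that depends on it, directly or transitively. The Python sketch below uses invented component names and a hypothetical `impacted_by` function; it is not based on NASA's actual data model.

```python
from collections import deque

# Hypothetical data: each key depends on the components listed as its value.
depends_on = {
    "thermal_control": ["thermal_sensor_x"],
    "power_system": ["thermal_control"],
    "mission_alpha": ["power_system", "nav_unit"],
    "mission_beta": ["nav_unit"],
}


def impacted_by(depends_on, failed):
    """Return everything that transitively depends on the failed component."""
    # Invert the edges: component -> things that depend on it.
    dependents = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(item)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


impacted = impacted_by(depends_on, "thermal_sensor_x")
at_risk_missions = sorted(n for n in impacted if n.startswith("mission_"))
```

The traversal depth is not known in advance, which is what makes this awkward to express as a fixed number of JOINs.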

Which use case is NOT a good fit for a graph database?

Summary

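**Property graph model**: nodes and edges are first-class objects with properties. Edge traversal follows stored pointers - roughly O(1) per edge followed, independent of total graph size, vs. an index lookup (O(log n)) or table scan (O(n)) for each SQL JOIN hop.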
**Cypher**: declarative pattern-matching language. *2..4 specifies variable-length paths (2 to 4 hops). shortestPath() finds minimum-hop connections.
**Graph algorithms**: BFS for hop-count paths, Dijkstra for weighted paths, PageRank for node importance, Louvain for community detection - all in the Neo4j GDS library.
**Choose graph DB** when queries traverse unknown-depth relationships (fraud rings, org hierarchies, friend-of-friend). **Choose SQL** for ACID transactions, aggregations, and reporting.
**Hybrid architecture** is common: PostgreSQL for transactional writes, Neo4j for relationship queries, synced via CDC or Kafka.

Questions for Reflection

LinkedIn shows 2nd-degree connections with mutual connection count. How would this query differ between Neo4j Cypher and PostgreSQL recursive CTEs? At what scale does the difference become critical?
A fraud detection system flags account A. The team wants to find all accounts reachable from A within 5 hops via SENT relationships. What graph algorithm is used, and what would the equivalent SQL look like?
When would adding Neo4j to a PostgreSQL-based application NOT be worth the operational complexity?

Related Lessons

ds-16-graphs-intro
