Knowledge Graphs
Property Graphs: Neo4j
Lesson Objectives
- Understand the difference between Property Graph and the RDF approach
- Create nodes with labels and relationships with properties in Neo4j
- Write Cypher queries with variable-length paths for real-world tasks
- Choose between Neo4j and PostgreSQL based on the nature of the problem
Prerequisites
April 2016: the Panama Papers, 11.5 million leaked documents about offshore companies. ICIJ loaded the data into Neo4j and built a graph of millions of connections between companies, directors, and assets. Asking the same multi-hop questions in SQL would have meant long chains of JOINs for every new hypothesis. In Cypher, patterns such as circular ownership structures could be written as short path queries.
- **Fraud Detection** (eBay, PayPal, ICIJ/Panama Papers): detecting circular transactions and shell companies
- **Recommendations** (Airbnb, Walmart): collaborative filtering through a user-purchase-product graph
- **Drug Discovery** (AstraZeneca, Novartis): finding connections between genes, proteins, and diseases in a KG with billions of facts
Emil Eifrem and Neo4j: born from a JOIN problem
In 2000 Emil Eifrem was building a content management system for a Swedish company. A relational schema with countless JOINs performed catastrophically on hierarchy queries. On a flight he sketched a graph on a napkin and realized an entirely different kind of database was needed. That idea became Neo4j, whose first production release shipped in 2010. By 2024 it is one of the most widely used graph databases, with over 1,000 enterprise customers. GQL, the ISO graph query standard published in 2024, borrows heavily from Cypher.
Nodes and Labels
RDF is expressive but verbose. The 2000s brought an alternative - the **Property Graph**. Instead of triples it uses full-fledged objects: each **node** (vertex) carries one or more **labels** (type tags) and a set of **properties** (key-value pairs). This is far closer to how developers naturally think about objects, and it is a large part of why companies such as eBay, Airbnb, and AstraZeneca adopted Neo4j.
**Labels** are tags that define entity type. One node can wear multiple labels: (:Person:Scientist:Author). Far more flexible than a rigid OWL class hierarchy. Labels drive indexing and scope queries.
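The node model can be sketched in a few lines of Python. This is an illustration only, not Neo4j's storage format: the `Node` class, the `nodes_with_label` helper, and the sample data are hypothetical names, and the Cypher in the comments is the rough equivalent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy property-graph node: a set of labels plus key-value properties."""
    labels: set[str]
    properties: dict[str, object] = field(default_factory=dict)


# Roughly: CREATE (:Person:Scientist:Author {name: 'Einstein', born: 1879})
einstein = Node({"Person", "Scientist", "Author"}, {"name": "Einstein", "born": 1879})
# A second node with fewer labels and a different property set.
curie = Node({"Person", "Scientist"}, {"name": "Curie"})
nodes = [einstein, curie]


def nodes_with_label(all_nodes: list[Node], label: str) -> list[Node]:
    """Scope a lookup by label, similar in spirit to MATCH (n:Label)."""
    return [n for n in all_nodes if label in n.labels]


authors = nodes_with_label(nodes, "Author")
scientists = nodes_with_label(nodes, "Scientist")
print([n.properties["name"] for n in authors])     # ['Einstein']
print([n.properties["name"] for n in scientists])  # ['Einstein', 'Curie']
```

One node answers to several labels at once, so the same entity shows up in every label-scoped lookup it belongs to.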
**Neo4j** is the most popular graph database (over 1,000 enterprise customers). It stores data natively as a graph via **index-free adjacency**: each node physically points to its neighbors, making traversal O(1) per edge regardless of graph size.
| Characteristic | RDF | Property Graph |
|---|---|---|
| Basic unit | Triple (s, p, o) | Node + Relationship |
| Properties on edges | Reification (complex) | Natively supported |
| Identification | URI (global) | Internal ID (local) |
| Multiple types | rdf:type (separate triples) | Labels (built-in) |
| Standard | W3C (RDF, OWL, SPARQL) | GQL (ISO/IEC 39075, 2024); adoption in progress |
| Interoperability | High (URI, Linked Data) | Limited |
A node in Neo4j has labels [:Person, :Author, :Scientist]. What does this mean?
Relationships
**Relationships** (edges) in a Property Graph are first-class objects. Each edge has a **type**, a **direction**, and its **own properties**. This is the key advantage over classic RDF, where attaching properties to an edge requires reification: the statement itself becomes a resource described by extra triples. Storing "when=2023, amount=5000, reason=payment" directly on the edge is what keeps fraud detection queries in Neo4j compact.
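To make the contrast concrete, here is a hedged Python sketch of the same fact in both models. The `Relationship` class and the `alice`/`bob` identifiers are hypothetical. The reified form uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) written as plain tuples.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    """Toy property-graph edge: type, direction (start -> end), own properties."""
    start: str
    rel_type: str
    end: str
    properties: dict = field(default_factory=dict)


# Property graph: one edge carries its own data.
transfer = Relationship(
    "alice", "TRANSFER", "bob",
    {"when": 2023, "amount": 5000, "reason": "payment"},
)

# RDF reification: the statement becomes a resource described by extra triples.
reified = [
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "alice"),
    ("stmt1", "rdf:predicate", "TRANSFER"),
    ("stmt1", "rdf:object", "bob"),
    ("stmt1", "when", 2023),
    ("stmt1", "amount", 5000),
    ("stmt1", "reason", "payment"),
]

print(transfer.properties["amount"])  # 5000
print(len(reified))                   # 7
```

One edge object on the property-graph side corresponds to seven triples on the reified side, which is the verbosity the paragraph above refers to.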
Relationships in Neo4j are **always directed** when created. But queries can ignore direction: `(a)-[:KNOWS]-(b)` matches the relationship either way. Relationship types follow CAPS_SNAKE_CASE by convention: ACTED_IN, DIRECTED, BORN_IN.
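The difference between a directed and an undirected match can be sketched as follows. This is a simplified model with hypothetical names: edges are plain `(start, type, end)` tuples, and the two helper functions mimic the two pattern forms.

```python
# Every stored edge has a direction: (start, type, end).
edges = [("ann", "KNOWS", "ben"), ("cat", "KNOWS", "ann")]


def match_directed(all_edges, start, rel_type):
    """Like (start)-[:TYPE]->(b): follow the stored direction only."""
    return sorted(e for s, t, e in all_edges if s == start and t == rel_type)


def match_undirected(all_edges, node, rel_type):
    """Like (node)-[:TYPE]-(b): accept the edge in either direction."""
    outgoing = {e for s, t, e in all_edges if s == node and t == rel_type}
    incoming = {s for s, t, e in all_edges if e == node and t == rel_type}
    return sorted(outgoing | incoming)


print(match_directed(edges, "ann", "KNOWS"))    # ['ben']
print(match_undirected(edges, "ann", "KNOWS"))  # ['ben', 'cat']
```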
**Index-free adjacency**: in Neo4j each node stores direct pointers to its relationships. Hopping from a node to a neighbor costs O(1) per relationship, whether the graph holds 10 nodes or 10 billion. The equivalent JOIN in a relational database typically goes through an index lookup whose cost grows with table size (O(log n) for a B-tree), and that cost is paid again on every hop.
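A rough Python analogy for the idea, with the assumptions stated up front: the edge table is scanned here without any index (a real relational database would normally use one), and a Python dict stands in for the direct pointers a native graph store keeps. All names are hypothetical.

```python
from collections import defaultdict

# Relational-style storage: one global table of (source, target) rows.
edge_table = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def neighbors_by_scan(table, node):
    """No index: inspect every row, so the cost grows with the table."""
    return [dst for src, dst in table if src == node]


# Graph-style storage: each node keeps its own list of neighbors.
adjacency = defaultdict(list)
for src, dst in edge_table:
    adjacency[src].append(dst)


def neighbors_by_adjacency(adj, node):
    """Adjacency kept with the node: the cost depends only on its degree."""
    return adj.get(node, [])


print(neighbors_by_scan(edge_table, "a"))      # ['b', 'c']
print(neighbors_by_adjacency(adjacency, "a"))  # ['b', 'c']
```

Both functions return the same neighbors. The difference is how much unrelated data each one has to look at as the graph grows.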
Relationships form **paths** - chains of node-relationship-node. Many of Neo4j's most capable queries are path searches: shortest path between people (six degrees of separation), all transaction paths from A to B (fraud detection), recommendations through shared neighbors.
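A shortest-path search of the kind mentioned above can be sketched with breadth-first search. This is a generic BFS in Python under simple assumptions (unweighted, directed edges in a dict), not a description of how Neo4j implements `shortestPath`. The `knows` graph and function name are hypothetical.

```python
from collections import deque


def shortest_path(adj, start, goal):
    """Return one shortest node sequence from start to goal, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


knows = {"ann": ["ben", "cat"], "ben": ["dan"], "cat": ["dan"], "dan": ["eve"]}
path = shortest_path(knows, "ann", "eve")
print(path)  # ['ann', 'ben', 'dan', 'eve']
```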
How is a relationship in a Property Graph better than a predicate in RDF?
Properties
**Properties** are key-value pairs attachable to nodes and relationships alike. This makes the Property Graph self-describing: all information lives inside the graph, no external schema required.
| Data Type | Example | Applies To |
|---|---|---|
| String | name: 'Einstein' | Node |
| Integer | born: 1879 | Node |
| Float | rating: 4.7 | Relationship |
| Boolean | active: true | Node |
| List | genres: ['Sci-Fi', 'Drama'] | Node |
| Date | since: date('2020-01-15') | Relationship |
| Point | location: point({latitude: 48.39, longitude: 9.98}) | Node |
Unlike a relational database, a Property Graph is **schema-optional**. Two nodes labeled :Person can carry different property sets. Flexible, but risky: no guaranteed data integrity without explicit constraints.
Neo4j supports **constraints**: UNIQUE (property uniqueness), NOT NULL (property existence, i.e. a mandatory field), and NODE KEY (existence plus uniqueness over one or more properties). They blend schema-optional flexibility with a minimal set of integrity guarantees.
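The interplay between schema-optional nodes and constraints can be sketched in Python. The `PersonStore` class, its property names, and the sample values are hypothetical. The Cypher in the comments follows Neo4j 5 style syntax and uses assumed constraint names.

```python
class ConstraintError(ValueError):
    """Raised when a write would violate a declared constraint."""


class PersonStore:
    """Toy store for :Person nodes with two illustrative constraints."""

    def __init__(self):
        self.nodes = []
        self._emails = set()  # backs the uniqueness check

    def add(self, **properties):
        # Like: CREATE CONSTRAINT person_name FOR (p:Person)
        #       REQUIRE p.name IS NOT NULL
        if properties.get("name") is None:
            raise ConstraintError("Person.name must not be null")
        # Like: CREATE CONSTRAINT person_email FOR (p:Person)
        #       REQUIRE p.email IS UNIQUE
        # Uniqueness only applies to nodes that actually have the property.
        email = properties.get("email")
        if email is not None:
            if email in self._emails:
                raise ConstraintError(f"Person.email {email!r} already exists")
            self._emails.add(email)
        self.nodes.append(properties)


store = PersonStore()
store.add(name="Ada", email="ada@example.com")
store.add(name="Alan", born=1912)  # different property set, still valid

try:
    store.add(name="Ada II", email="ada@example.com")
except ConstraintError as err:
    print(err)  # duplicate email is rejected
```

The two accepted nodes carry different property sets, which is the schema-optional part. The rejected write is the integrity guarantee the constraints add on top.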
A property graph is schema-optional. This means:
Cypher
**Cypher** is the declarative query language for Property Graphs, built by Neo4j. Its signature feature: **ASCII-art syntax** - graph patterns get drawn directly in the query. `(a)-[:ACTED_IN]->(m)` reads naturally as "a, connected by an ACTED_IN edge to m". SQL forces mental translation of a graph into JOINs. Cypher lets developers think in the graph itself.
Cypher supports **variable-length paths**: `(a)-[:KNOWS*1..6]->(b)` matches every path of 1 to 6 KNOWS relationships from a to b. This unlocks six-degrees-of-separation queries, transaction path searches, and supply chain analysis, problems that need recursive CTEs or long JOIN chains in SQL.
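What a variable-length pattern enumerates can be sketched as a bounded depth-first expansion. This Python sketch assumes a simple directed graph with no parallel edges, so an edge is identified by its `(from, to)` pair. It mirrors Cypher's rule that a relationship is not reused within one path, while nodes may repeat. Function and variable names are hypothetical.

```python
def variable_length_paths(adj, start, min_hops, max_hops):
    """All paths from start with min_hops..max_hops edges, no edge reused."""
    results = []

    def expand(path, used_edges):
        hops = len(path) - 1
        if hops >= min_hops:
            results.append(list(path))
        if hops == max_hops:
            return
        node = path[-1]
        for nxt in adj.get(node, []):
            edge = (node, nxt)
            if edge in used_edges:
                continue  # each relationship appears at most once per path
            used_edges.add(edge)
            path.append(nxt)
            expand(path, used_edges)
            path.pop()
            used_edges.remove(edge)

    expand([start], set())
    return results


knows = {"ann": ["ben"], "ben": ["cat"], "cat": ["ann", "dan"]}
# Roughly: MATCH p = (a {name: 'ann'})-[:KNOWS*1..3]->(b) RETURN p
paths = variable_length_paths(knows, "ann", 1, 3)
for p in paths:
    print(" -> ".join(p))
```

With the sample graph this yields four paths, one of which returns to `ann` after three hops.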
| Cypher Pattern | Meaning | SQL Equivalent |
|---|---|---|
| (n:Person) | Node with label Person | FROM persons |
| (n)-[:ACTED_IN]->(m) | Directed relationship | JOIN ... ON ... |
| (n)-[:KNOWS*2]-(m) | Path of length 2 | Nested JOIN |
| (n)-[:KNOWS*1..5]-(m) | Path of length 1-5 | Recursive CTE |
| shortestPath(...) | Shortest path | No direct equivalent |
| WHERE NOT (a)-->(b) | Absence of relationship | NOT EXISTS (subquery) |
**GQL** (Graph Query Language) is an ISO standard (ISO/IEC 39075), published in 2024 and heavily influenced by Cypher. It is the first international standard for property graph queries and aims to play the role for graph databases that SQL plays for relational ones. Neo4j, Amazon Neptune, and TigerGraph have all announced support.
Neo4j stretches well past social networks: **fraud detection** (circular transaction rings), **recommendations** (collaborative filtering through the graph), **supply chain** (tracking chains of custody), **drug discovery** (gene-protein-disease links), **identity resolution** (merging customer data from disparate systems).
Misconception: Neo4j is only for social networks and friend graphs.
Property Graphs are used in fraud detection, recommendation systems, drug discovery, supply chain, identity resolution, and dozens of other domains.
Any problem where relationships matter - who is connected to whom, through whom, along which paths - is a candidate for a graph database. Fraud detection (transaction rings), recommendations (collaborative filtering), drug discovery (gene→protein→disease→drug), and Master Data Management are all production Neo4j use cases.
What does the query `MATCH (a)-[:TRANSFER*3..5]->(a) RETURN a` do?
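One way to reason about this query is to simulate it. The Python sketch below looks for accounts that can reach themselves through 3 to 5 transfers without reusing a transfer within a path. The account names and the `accounts_in_rings` function are hypothetical. Unlike the Cypher query, which returns one row per matching path, this version reports each account once (closer to `RETURN DISTINCT a`).

```python
def accounts_in_rings(transfers, min_hops=3, max_hops=5):
    """Accounts that start a cycle of min_hops..max_hops transfers."""
    adj = {}
    for src, dst in transfers:
        adj.setdefault(src, []).append(dst)

    def returns_to_start(start, node, hops, used):
        for nxt in adj.get(node, []):
            edge = (node, nxt)
            if edge in used:
                continue  # a transfer is used at most once per path
            if nxt == start and min_hops <= hops + 1 <= max_hops:
                return True
            if hops + 1 < max_hops and returns_to_start(
                start, nxt, hops + 1, used | {edge}
            ):
                return True
        return False

    return sorted(a for a in adj if returns_to_start(a, a, 0, frozenset()))


transfers = [
    ("acc1", "acc2"), ("acc2", "acc3"), ("acc3", "acc1"),  # ring of length 3
    ("acc3", "acc4"),                                      # dead end
    ("acc5", "acc6"), ("acc6", "acc5"),                    # ring of length 2
]
print(accounts_in_rings(transfers))  # ['acc1', 'acc2', 'acc3']
```

The two-account loop is excluded because it is shorter than three hops and cannot be extended without reusing a transfer.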
Key Takeaways
- **Nodes** with multiple **labels** and **properties** - flexible, self-describing objects
- **Relationships** - first-class objects with a type, direction, and their own properties
- **Properties** (key-value) on both nodes and edges; schema-optional with optional constraints
- **Cypher** - ASCII-art query language with variable-length paths and shortestPath for tasks that are awkward to express in SQL
Related Topics
Property Graphs are an alternative to the RDF approach:
- RDF and the Semantic Web — RDF and Property Graph are two approaches to knowledge graphs with different trade-offs
- What Is a Knowledge Graph — A Property Graph is a concrete implementation of the knowledge graph concept from the first lesson
Questions for Reflection
- When would Neo4j be chosen over PostgreSQL? And when is PostgreSQL the better fit?
- How would a recommendation system for an online store be modeled as a Property Graph?
- Why does SQL struggle with queries like "find all paths of length 3-5 between A and B"? What makes graph databases faster?
Related Lessons
- kg-02 - RDF is the alternative model, with different trade-offs from the Property Graph
- kg-01 — The concept of a knowledge graph is introduced in the first lesson
- kg-04 — Cypher and Neo4j are the foundation for embeddings and ML on graphs
- ds-14 — Fraud detection is a production Neo4j use case with real money at stake
- ml-05 — Graph neural networks operate on top of the property graph model
- alg-07 — Graph traversal algorithms (BFS/DFS) are the essence of Cypher variable-length paths
- ml-01-intro