Knowledge Graphs

SPARQL and Cypher

Wikidata knows that Tolstoy was born in Yasnaya Polyana, completed War and Peace in 1869, and was nominated for the Nobel Prize several times. Retrieving "all 19th-century Russian writers born in the Tula region" means matching patterns against more than a billion triples. Graph query languages such as SPARQL and Cypher were built for exactly this kind of question.

**Wikidata Query Service** handles millions of SPARQL queries per day against one of the world's largest open knowledge bases
**Neo4j** has been used in production at organizations such as eBay and NASA, for workloads like recommendations and knowledge management
**Pharma:** path queries through protein-protein interaction graphs to find drug interaction routes

SPARQL: SELECT, WHERE, FILTER

The Wikidata knowledge graph holds more than a billion statements, stored as triples (subject, predicate, object). Retrieving "all cities with a population above one million" requires a query language designed for triples: that language is **SPARQL** (SPARQL Protocol and RDF Query Language).

SPARQL syntax resembles SQL: `SELECT` names the output variables, `WHERE` defines a graph pattern, `FILTER` adds value constraints. Variables start with `?` and are bound by matching triples in the graph.

Each line inside `WHERE` is a triple pattern. In a query for large cities, the variables `?city` and `?population` are bound to actual graph nodes and values, while `wdt:P31` ("instance of") and `wd:Q515` ("city") are prefixed names that abbreviate Wikidata IRIs. Multiple patterns are implicitly ANDed: all must match simultaneously.
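The matching described above can be sketched in a few lines of Python. This is a toy matcher, not a SPARQL engine: the `ex:` entities and the population figures are invented placeholders, and `match` and `solve` are hypothetical helper names. The query string shows the syntax being imitated (`wdt:P1082` is Wikidata's "population" property).

```python
# The SPARQL query the toy matcher below imitates (shown for syntax only).
QUERY = """
SELECT ?city ?population WHERE {
  ?city wdt:P31 wd:Q515 .          # ?city is an instance of (P31) city (Q515)
  ?city wdt:P1082 ?population .    # ?city has population (P1082) ?population
  FILTER(?population > 1000000)
}
"""

# Toy triples; entity names and population figures are invented placeholders.
TRIPLES = [
    ("ex:Alpha", "wdt:P31", "wd:Q515"),
    ("ex:Alpha", "wdt:P1082", 2_500_000),
    ("ex:Beta", "wdt:P31", "wd:Q515"),
    ("ex:Beta", "wdt:P1082", 300_000),
    ("ex:Gamma", "wdt:P1082", 9_000_000),  # has a population but is not typed as a city
]

def match(pattern, triples, binding):
    """Yield extensions of `binding` for each triple that matches `pattern`."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    break  # variable is already bound to a different value
            elif term != value:
                break  # constant term does not match this triple
        else:
            yield new

def solve(patterns, triples):
    """Join all patterns: every pattern must match under the same bindings."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, triples, s)]
    return solutions

rows = solve(
    [("?city", "wdt:P31", "wd:Q515"), ("?city", "wdt:P1082", "?population")],
    TRIPLES,
)
big_cities = [r for r in rows if r["?population"] > 1_000_000]  # the FILTER step
print(big_cities)
```

`ex:Gamma` never appears in the result even though its population is large: the first pattern fails for it, and every pattern has to hold for the same binding of `?city`.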

**OPTIONAL** in SPARQL works like a SQL LEFT JOIN: `OPTIONAL { ?book schema:isbn ?isbn }` also returns books that have no ISBN, leaving `?isbn` unbound for them. Without OPTIONAL the pattern is mandatory and non-matching rows are excluded.
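A small Python sketch of the difference, under the assumption that an unbound variable is modelled as `None`. The book identifiers and ISBN strings are invented placeholders, and the two functions are hypothetical helpers, not part of any SPARQL library.

```python
# SPARQL shown for syntax only; `a` is shorthand for rdf:type.
QUERY = """
SELECT ?book ?isbn WHERE {
  ?book a schema:Book .
  OPTIONAL { ?book schema:isbn ?isbn }
}
"""

# Toy data: book identifiers and ISBN strings are invented placeholders.
BOOKS = ["ex:book1", "ex:book2", "ex:book3"]
ISBNS = {"ex:book1": "isbn-0001", "ex:book3": "isbn-0003"}

def with_optional(books, isbns):
    """Keep every book; ?isbn is None (unbound) when no ISBN triple exists."""
    return [{"?book": b, "?isbn": isbns.get(b)} for b in books]

def without_optional(books, isbns):
    """Mandatory pattern: books lacking an ISBN drop out of the result."""
    return [{"?book": b, "?isbn": isbns[b]} for b in books if b in isbns]

print(with_optional(BOOKS, ISBNS))     # three rows, one with an unbound ?isbn
print(without_optional(BOOKS, ISBNS))  # two rows
```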

The SPARQL line `?person wdt:P19 wd:Q649 .` means: match every `?person` whose place of birth (`wdt:P19`) is Moscow (`wd:Q649`).

Cypher: MATCH, CREATE, WITH

Neo4j stores graphs as labeled nodes and typed edges. **Cypher** was designed for it with a visual syntax: patterns look like ASCII graph diagrams. `(node)` is a node, `-[rel]->` is a directed edge.

The pattern `(actor:Person)-[:ACTED_IN]->(movie:Movie)` reads literally: find a node labeled Person connected by an ACTED_IN edge to a Movie node. Curly braces filter on properties inline. For graph data this is far more readable than SQL JOINs.
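To make the pattern concrete, here is a minimal Python sketch of what such a `MATCH` does on a toy property graph. All names and titles are invented placeholders, `node_ok` and `match` are hypothetical helpers, and a real database would use indexes rather than scanning every edge.

```python
# Cypher shown for syntax only; {name: ...} is the inline property filter.
QUERY = """
MATCH (actor:Person {name: 'Actor A'})-[:ACTED_IN]->(movie:Movie)
RETURN movie.title
"""

# Toy property graph; all names and titles are invented placeholders.
NODES = {
    1: {"labels": {"Person"}, "props": {"name": "Actor A"}},
    2: {"labels": {"Person"}, "props": {"name": "Actor B"}},
    3: {"labels": {"Movie"}, "props": {"title": "Film X"}},
    4: {"labels": {"Movie"}, "props": {"title": "Film Y"}},
}
EDGES = [(1, "ACTED_IN", 3), (1, "ACTED_IN", 4), (2, "ACTED_IN", 3), (2, "DIRECTED", 4)]

def node_ok(node_id, label, props=None):
    """True if the node carries `label` and every requested property value."""
    node = NODES[node_id]
    wanted = props or {}
    return label in node["labels"] and all(
        node["props"].get(key) == value for key, value in wanted.items()
    )

def match(src_label, rel_type, dst_label, src_props=None):
    """Return (source, target) id pairs matching (src)-[:rel_type]->(dst)."""
    return [
        (s, d)
        for s, r, d in EDGES
        if r == rel_type and node_ok(s, src_label, src_props) and node_ok(d, dst_label)
    ]

pairs = match("Person", "ACTED_IN", "Movie", {"name": "Actor A"})
titles = [NODES[d]["props"]["title"] for _, d in pairs]
print(titles)
```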

`WITH` creates a pipeline: it aggregates or transforms data and passes the result to the next query stage. Cypher has no `HAVING` clause, so filtering on an aggregate means computing it in `WITH` and applying `WHERE` in the stage that follows.
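The two stages can be mimicked in Python as an aggregation followed by a filter. The actor and film names are invented placeholders, and the Cypher text is an illustrative query rather than one taken from a specific dataset.

```python
from collections import Counter

# Cypher shown for syntax only: aggregate in WITH, then filter with WHERE.
QUERY = """
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WITH actor, count(movie) AS films
WHERE films >= 2
RETURN actor.name, films
"""

# Toy (actor, film) pairs standing in for the MATCH result; names are invented.
ACTED_IN = [
    ("Actor A", "Film X"),
    ("Actor A", "Film Y"),
    ("Actor B", "Film X"),
    ("Actor C", "Film X"),
    ("Actor C", "Film Y"),
    ("Actor C", "Film Z"),
]

# Stage 1: WITH actor, count(movie) AS films
films_per_actor = Counter(actor for actor, _ in ACTED_IN)

# Stage 2: WHERE films >= 2, applied to the aggregated rows
prolific = {actor: n for actor, n in films_per_actor.items() if n >= 2}
print(prolific)
```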

**SPARQL vs Cypher:** SPARQL is the W3C standard for RDF, powering Wikidata and DBpedia. Cypher started as Neo4j's proprietary language and is now open as openCypher. The semantics are similar, but the data models and syntax differ.

What does the pattern `(a)-[:KNOWS]->(b)-[:KNOWS]->(c)` express in a Cypher MATCH? A two-hop path: `a` knows `b`, and `b` in turn knows `c`.

Aggregation in SPARQL and Cypher

Knowledge graphs are rarely queried for a single triple. Typical needs are statistical: "how many films did a director make", "average city population per country". Both SPARQL and Cypher support aggregate functions: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`.

Cypher's `collect()` gathers all matching values into a list, so a single query can return both an aggregate and sample data. The closest SPARQL equivalent is `GROUP_CONCAT(?value; separator=",")`, which returns one concatenated string rather than a list.

In SPARQL, aggregation follows SQL conventions: `GROUP BY` + `HAVING`. In Cypher, aggregate functions in `RETURN` or `WITH` automatically group by all non-aggregated variables - no explicit `GROUP BY` needed.
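The following Python sketch shows the same aggregation written both ways and emulated on toy rows. The `ex:` prefix, the actor and film names, and the `aggregate` helper are assumptions made for illustration. The list mirrors Cypher's `collect()`, the joined string mirrors SPARQL's `GROUP_CONCAT`.

```python
from collections import defaultdict

# Cypher: no GROUP BY; actor.name is the implicit grouping key.
CYPHER = """
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
RETURN actor.name, count(movie) AS films, collect(movie.title) AS titles
"""

# SPARQL: grouping is explicit. The ex: vocabulary is hypothetical.
SPARQL = """
SELECT ?actor (COUNT(?movie) AS ?films)
       (GROUP_CONCAT(?title; separator=",") AS ?titles)
WHERE { ?movie ex:actor ?actor ; ex:title ?title . }
GROUP BY ?actor
"""

# Toy (actor, title) rows; names are invented placeholders.
ROWS = [("Actor A", "Film X"), ("Actor A", "Film Y"), ("Actor B", "Film X")]

def aggregate(rows):
    """Group by the non-aggregated column and compute count, list, and joined string."""
    groups = defaultdict(list)
    for actor, title in rows:
        groups[actor].append(title)  # grouping key = the non-aggregated column
    return [
        {"actor": actor, "films": len(titles), "titles": titles, "concat": ",".join(titles)}
        for actor, titles in groups.items()
    ]

result = aggregate(ROWS)
print(result)
```

Adding a second non-aggregated expression to the Cypher `RETURN` would silently change the grouping key, a common surprise for developers used to an explicit `GROUP BY`.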

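In `RETURN actor.name, count(movie)` in Cypher, what determines the grouping key? The non-aggregated expression `actor.name`: rows are grouped by it, and `count(movie)` is computed once per group.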

Shortest Paths and Traversal

LinkedIn's knowledge graph holds "worked at the same company" and "studied at the same university" edges. Finding how two people are connected through colleagues is a path query. This is where graph databases and their query languages tend to beat relational joins: each additional hop is one more step in the pattern rather than another self-join.

The syntax `[:KNOWS*]` denotes a variable-length path. `*1..3` means one to three edges. `shortestPath()` is a built-in Cypher function using BFS. For weighted shortest paths, `apoc.algo.dijkstra` from the APOC library is used.
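A bounded shortest-path search can be sketched with a plain breadth-first search in Python. This is an illustration of the idea, not Neo4j's implementation. The people, the `KNOWS` edges, and the `shortest_path` helper are invented for the example, and edges are treated as directed.

```python
from collections import deque

# Cypher shown for syntax only: shortest KNOWS path of at most three hops.
QUERY = """
MATCH p = shortestPath(
  (a:Person {name: 'Alice'})-[:KNOWS*..3]->(b:Person {name: 'Dan'})
)
RETURN length(p)
"""

# Toy directed KNOWS edges; names are invented placeholders.
KNOWS = {
    "Alice": ["Bob"],
    "Bob": ["Carol", "Alice"],
    "Carol": ["Dan"],
    "Dan": [],
}

def shortest_path(graph, start, goal, max_hops):
    """Breadth-first search: the first path that reaches `goal` is a shortest one."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 == max_hops:
            continue  # hop limit reached, do not extend this path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path within the hop limit

print(shortest_path(KNOWS, "Alice", "Dan", max_hops=3))
print(shortest_path(KNOWS, "Alice", "Dan", max_hops=2))
```

BFS explores paths in order of increasing length, which is why the first hit is guaranteed to be a shortest path when all edges count equally. Weighted edges need a different algorithm such as Dijkstra's.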

SPARQL 1.1 Property Paths allow regular-expression-like operators over predicates: `*` (zero or more), `+` (one or more), `/` (sequence), `|` (alternative). A single operator does the work that a recursive CTE would do in SQL when traversing a hierarchy.
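The difference between `*` and `+` is easiest to see on a small hierarchy. The Python sketch below emulates both on a toy subclass graph. The `ex:` classes, the `ex:subClassOf` predicate, and the `reachable` helper are assumptions made for illustration.

```python
# SPARQL shown for syntax only; the ex: vocabulary is hypothetical.
QUERY_STAR = "SELECT ?c WHERE { ex:Poodle ex:subClassOf* ?c }"  # zero or more hops
QUERY_PLUS = "SELECT ?c WHERE { ex:Poodle ex:subClassOf+ ?c }"  # one or more hops

# Toy hierarchy: each class maps to its direct superclasses.
SUBCLASS_OF = {
    "ex:Poodle": ["ex:Dog"],
    "ex:Dog": ["ex:Mammal"],
    "ex:Mammal": ["ex:Animal"],
    "ex:Animal": [],
}

def reachable(graph, start, include_start):
    """Nodes reachable from `start`; include_start=True models `*`, False models `+`."""
    found = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(graph.get(node, []))
    if include_start:
        found.add(start)  # the zero-hop match that `*` allows
    return found

star = reachable(SUBCLASS_OF, "ex:Poodle", include_start=True)
plus = reachable(SUBCLASS_OF, "ex:Poodle", include_start=False)
print(sorted(star))
print(sorted(plus))
```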

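**Performance:** path queries can be expensive. Cypher `shortestPath` uses BFS and stops at the first hit, so it is usually efficient; unbounded variable-length patterns are not. A SPARQL `*` path on a deep graph may return millions of solutions, so bound the starting point and add a `LIMIT`.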

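What does `wdt:P279*` mean in a SPARQL Property Path? Zero or more hops along `wdt:P279` ("subclass of"): the path matches the class itself and every direct or indirect superclass.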

SPARQL and Cypher

SPARQL: W3C standard for RDF graphs. SELECT + triple patterns in WHERE + FILTER + GROUP BY/HAVING
Cypher: visual syntax for Neo4j - (node)-[:EDGE]->(node). MATCH finds, CREATE inserts, WITH pipelines results
Aggregation: COUNT, SUM, AVG in both. Cypher groups implicitly by all non-aggregated fields
Paths: Cypher shortestPath() and [:REL*1..N]. SPARQL Property Paths with the *, +, / operators

Related lessons

SPARQL and Cypher are query layers on top of specific graph storage models:

RDF and Triplestores — The data model SPARQL queries against
Ontologies and OWL — Predicate and class vocabularies used in SPARQL queries
GNN on Knowledge Graphs — ML layer over the same graphs queried via Cypher and SPARQL

Questions for reflection

In which scenarios are SPARQL Property Paths preferable to recursive CTEs in SQL for hierarchy traversal?
Why does Cypher omit explicit GROUP BY, and what mistakes can this cause for developers coming from SQL?
How would a query for "people connected to Alice within three handshakes" look in both SPARQL and Cypher?
