AI Engineering

Knowledge Graphs + LLM: GraphRAG, structured knowledge, Neo4j + embeddings

Lesson goals

Understand the limitations of vector search and when a Knowledge Graph is needed
Implement entity extraction and entity resolution via LLM
Integrate Neo4j with vector embeddings for hybrid search
Break down the full Microsoft GraphRAG architecture (indexing + query phases)

Google Knowledge Graph holds 500 billion facts. Every time search answers "Einstein was born in 1879" without a source link, that answer comes from a knowledge graph, not an LLM. RAG with a graph delivers accuracy that vector search alone cannot: Microsoft GraphRAG (2024) showed +20-40% accuracy vs naive RAG on multi-hop questions. The gap between "find similar text" and "understand the structure of knowledge" is what this lesson covers.

Microsoft GraphRAG (2024) - +20-40% accuracy vs naive RAG on multi-hop questions, used in Azure for enterprise document analysis
LinkedIn Knowledge Graph: 1B+ entities, search + recommendations + skill matching - all on top of one graph
Amazon Product Graph: products, brands, categories, reviews linked in a graph - the backbone of recommendations for a retailer with USD 400B+ in annual revenue
NASA: Knowledge Graphs linking scientific publications, missions, and technologies - cross-domain search that vector similarity alone handles poorly

From Google Knowledge Graph to GraphRAG

On May 16, 2012 Google launched the Knowledge Graph under the slogan "things, not strings": search began to understand entities and the relationships between them, not just text. At launch the graph covered more than 500 million entities and about 3.5 billion facts; within seven months the data had tripled, reaching roughly 570 million entities and 18 billion facts. That turned graph-structured knowledge into a mainstream technology: answers like a birth date or a capital city started coming from the graph rather than from link ranking. Structured knowledge had existed earlier in projects such as Cyc and DBpedia, but Google was the one that ran it in production across billions of queries.

In 2024 Microsoft Research published GraphRAG, an approach in which entities and relationships are first extracted from text, a graph is built, summaries are generated per graph community, and the LLM answers on top of those summaries. In the paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" the team showed that GraphRAG produces more comprehensive and diverse answers over large corpora than plain vector-search RAG. The code went up on GitHub in July 2024. The loop closed: knowledge graphs from the search era came back as a way to give an LLM structured memory.

Prerequisites

Advanced RAG: hybrid search, re-ranking, query expansion, self-RAG

Limits of vector search and why Knowledge Graphs exist

The opening example, a search engine answering "Einstein was born in 1879" straight from a knowledge graph, already shows why graphs are needed. The harder question is why vector RAG still can't replace them.

Vector search finds chunks **similar in meaning**. But it's blind to **relationships between entities**. Ask "which clients of company X also invested in Y?" - and the system breaks: the answer doesn't live in a single chunk, it has to be assembled across multiple graph nodes. Embeddings weren't built for that.

A Knowledge Graph (KG) is a graph where **nodes** = entities (people, companies, concepts) and **edges** = relationships between them. Unlike a vector store, a KG holds the **structure of knowledge** - not just textual similarity, but the actual semantics of connections. That's why Google can answer "Who is the husband of Einstein's sister?" in milliseconds, with no LLM involved: the answer is two edge lookups away from the "Einstein" node.
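To make the multi-hop point concrete, here is a minimal sketch of a graph stored as plain Python tuples. The company names and the relation names `client_of` and `invested_in` are invented for illustration. It answers the "which clients of company X also invested in Y?" question by intersecting two sets of edges, which is exactly the step a similarity search over chunks has no way to express.

```python
from collections import defaultdict

# Hypothetical toy data: (source, relation, target) edges.
EDGES = [
    ("Acme Fund", "client_of", "Company X"),
    ("Beta Capital", "client_of", "Company X"),
    ("Gamma LLC", "client_of", "Company X"),
    ("Acme Fund", "invested_in", "Company Y"),
    ("Gamma LLC", "invested_in", "Company Y"),
    ("Beta Capital", "invested_in", "Company Z"),
]


def build_index(edges):
    """Index edges by (relation, target) so each lookup is one dict access."""
    index = defaultdict(set)
    for source, relation, target in edges:
        index[(relation, target)].add(source)
    return index


def clients_who_invested(index, company, investment):
    """Answer: which clients of `company` also invested in `investment`?

    The answer is the intersection of two edge sets; no single edge
    (or text chunk) contains it on its own.
    """
    clients = index[("client_of", company)]
    investors = index[("invested_in", investment)]
    return sorted(clients & investors)


index = build_index(EDGES)
print(clients_who_invested(index, "Company X", "Company Y"))
# ['Acme Fund', 'Gamma LLC']
```

A graph database does the same thing with indexes and a query language instead of a dictionary, but the shape of the operation is the same: follow typed edges, then combine the results.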

| Characteristic | Vector Search | Knowledge Graph | GraphRAG (hybrid) |
| --- | --- | --- | --- |
| Query type | Semantic similarity | Structural relationships | Both |
| Multi-hop reasoning | No | Yes | Yes |
| Build speed | Fast (embed chunks) | Slow (extract entities) | Slow |
| Updates | Simple (re-embed) | More complex (re-extract) | More complex |
| Best for | Document Q&A | Relationship analytics | Complex questions across a corpus |

**Microsoft GraphRAG** (2024) showed +20-40% accuracy vs naive RAG on multi-hop questions. On tasks like "name the main themes across all documents" the reported gap reaches 70%. These figures come from Microsoft's own evaluation, so treat them as a starting point and measure on your own corpus before committing to the extra indexing cost.

In which case does a Knowledge Graph have an advantage over pure vector search?

Entity Extraction: pulling entities and relationships from text

Before a graph can answer anything, it has to be populated. Every document in the corpus passes through an LLM that produces not a summary but a structure: **(Subject, Predicate, Object)** triples. For example: (Anthropic, founded_by, Dario Amodei) and (Dario Amodei, worked_at, OpenAI). Each triple is an edge in the future graph.
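As a rough sketch of what the extraction step has to handle, the snippet below defines a `Triple` type, a prompt template, and a defensive parser for the model's reply. The prompt wording, the JSON schema, and the names `Triple`, `EXTRACTION_PROMPT`, and `parse_triples` are assumptions made for this example. The LLM call itself is left out, and a hard-coded reply stands in for it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str


# Hypothetical prompt; tune the wording and output schema for your model.
EXTRACTION_PROMPT = (
    "Extract factual (subject, predicate, object) triples from the text below. "
    'Respond with a JSON list of objects with keys "subject", "predicate", '
    '"object" and nothing else.\n\nText:\n{text}'
)


def parse_triples(llm_output: str) -> list[Triple]:
    """Parse the model's JSON reply, skipping malformed or incomplete items."""
    try:
        items = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue
        values = [item.get(key) for key in ("subject", "predicate", "object")]
        if all(isinstance(v, str) and v.strip() for v in values):
            triples.append(Triple(*(v.strip() for v in values)))
    return triples


# A reply the model might return for a short passage (hard-coded here).
# The third item is missing its object and is dropped by the parser.
sample_reply = """[
  {"subject": "Anthropic", "predicate": "founded_by", "object": "Dario Amodei"},
  {"subject": "Dario Amodei", "predicate": "worked_at", "object": "OpenAI"},
  {"subject": "Anthropic", "predicate": "founded_by"}
]"""

for triple in parse_triples(sample_reply):
    print(triple)
```

Validating the reply before it reaches the graph matters because a single malformed item would otherwise become a dangling edge.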

The first obstacle is **Entity Resolution** - the same company might appear as "OpenAI", "open ai", "Open AI Inc.". The graph fills with duplicates unless these are merged upfront. Common approaches, each with a different cost/precision trade-off:

**Embedding similarity** - if the embeddings of two entity names are close (cosine > 0.92), treat them as one entity
**LLM-based** - ask an LLM "are X and Y the same entity?" (expensive but accurate)
**Rule-based** - normalization (lowercase, remove Inc./Ltd.), fuzzy matching (Levenshtein < 3)
**Hybrid** - rule-based for simple cases + LLM for complex ones
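The rule-based option from the list above can be sketched in a few lines: normalize the name, then merge names whose normalized forms are fewer than 3 edits apart. The suffix list and the greedy "first match wins" strategy are simplifications chosen for this example, not a recommended production design.

```python
import re

# Legal suffixes to drop during normalization (illustrative, not exhaustive).
SUFFIXES = {"inc", "ltd", "llc", "corp", "co"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def resolve(names, max_distance=3):
    """Greedily map each name to the first-seen name it matches.

    A name joins an existing entry when the normalized forms are fewer
    than `max_distance` edits apart; otherwise it starts a new entry.
    """
    canonical = []  # (normalized form, original name) pairs
    mapping = {}
    for name in names:
        norm = normalize(name)
        for known_norm, known_name in canonical:
            if levenshtein(norm, known_norm) < max_distance:
                mapping[name] = known_name
                break
        else:
            canonical.append((norm, name))
            mapping[name] = name
    return mapping


print(resolve(["OpenAI", "open ai", "Open AI Inc.", "Anthropic"]))
```

A fixed edit-distance threshold is unsafe for short names (two unrelated three-letter tickers are always fewer than 4 edits apart), which is one reason the hybrid approach escalates borderline pairs to an LLM or an embedding comparison.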

**Entity extraction via LLM is expensive for large corpora.** 10,000 documents at an average of 2000 tokens each - that's ~20M input tokens. At USD 3 per 1M input tokens (Claude Sonnet) that is USD 60 for input alone, before output tokens are counted. Microsoft GraphRAG recommends batching documents and extracting incrementally. Delta updates are far cheaper than a full re-index.
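The arithmetic above is simple enough to keep as a helper when budgeting an indexing run. In this sketch the function name and the delta size of 200 documents are hypothetical, and the price is the per-million input-token figure quoted above.

```python
def extraction_input_cost(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """Input-token cost of one extraction pass; output tokens are not included."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Full pass over the corpus from the example: 10,000 docs x 2,000 tokens.
full_index = extraction_input_cost(10_000, 2_000, 3.0)

# Hypothetical delta: 200 new or changed documents since the last run.
delta_update = extraction_input_cost(200, 2_000, 3.0)

print(f"full re-index: USD {full_index:.2f}")
print(f"delta update:  USD {delta_update:.2f}")
```

With these inputs the full pass comes to USD 60.00 and the delta to USD 1.20, which is why re-extracting only changed documents is the usual way to keep a graph current.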

What is a 'triple' in the context of a Knowledge Graph?

Neo4j + Embeddings: hybrid search over a graph

Neo4j is the de-facto industry standard graph database for Knowledge Graphs. The 5.x line (5.11 and later) added **vector indexes** - graph and embeddings now live in one place. No need to keep a separate vector store such as Qdrant in sync with the graph. One query, both data types.

The hybrid's strength: vector search as the entry point, graph traversal for expansion. A query for "AI safety research" first finds the nearest entities by embedding similarity. Then it walks edges 2 hops deep - and surfaces context that doesn't exist in any single chunk.
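The two steps, vector entry point and then graph expansion, can be sketched in memory before touching a database. The embeddings, entity names, and adjacency list below are made-up toy data. The Cypher string at the end shows roughly how the same idea maps onto Neo4j's `db.index.vector.queryNodes` procedure; the index name `entity_embeddings` and the `name` property are assumptions about the schema.

```python
import math
from collections import deque

# Hypothetical toy data: entity -> embedding, plus an undirected adjacency list.
EMBEDDINGS = {
    "AI safety": [0.9, 0.1, 0.0],
    "Alignment research": [0.8, 0.2, 0.1],
    "GPU pricing": [0.0, 0.1, 0.9],
}
NEIGHBORS = {
    "AI safety": ["Anthropic"],
    "Anthropic": ["AI safety", "Dario Amodei"],
    "Dario Amodei": ["Anthropic", "OpenAI"],
    "OpenAI": ["Dario Amodei"],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_entities(query_vec, k):
    """Step 1: vector search picks the entry points."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda name: cosine(query_vec, EMBEDDINGS[name]),
        reverse=True,
    )
    return ranked[:k]


def expand(seeds, max_hops=2):
    """Step 2: breadth-first traversal collects nodes within `max_hops` of a seed."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in NEIGHBORS.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - set(seeds)


seeds = nearest_entities([1.0, 0.0, 0.0], k=1)
print(seeds, sorted(expand(seeds)))
# ['AI safety'] ['Anthropic', 'Dario Amodei']

# Roughly the same idea as a single Cypher query (not executed here).
# Seeds without any neighbors are dropped by MATCH; use OPTIONAL MATCH to keep them.
HYBRID_QUERY = """
CALL db.index.vector.queryNodes('entity_embeddings', $k, $embedding)
YIELD node, score
MATCH (node)-[*1..2]-(neighbor)
RETURN node.name AS seed, score, collect(DISTINCT neighbor.name) AS context
"""
```

Note that "OpenAI" is three hops from the seed and is therefore excluded. The hop limit is the main knob for trading context breadth against noise.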

**Alternatives to Neo4j:** Amazon Neptune (managed, pricey), ArangoDB (multi-model), FalkorDB (Redis-based, fast to spin up). For prototypes - NetworkX (Python) or an adjacency list in PostgreSQL with JSONB. Move to Neo4j later, once the graph proves its value.

How does hybrid search work in Neo4j with a vector index?

GraphRAG Pipeline: full architecture from documents to answers

Microsoft GraphRAG is the reference implementation of the full pipeline. Published in 2024, it showed +20-40% accuracy on multi-hop questions. Architecturally it splits into two phases: **indexing** (build the graph) and **query** (answer questions). Indexing is expensive and runs up front, with incremental updates as the corpus changes. Queries are much cheaper by comparison, because they read precomputed entities and summaries instead of re-processing the corpus.

**Community Detection** is GraphRAG's key innovation. The Leiden algorithm clusters related entities into "communities", and an LLM generates a **summary for each cluster**. This unlocks a new class of queries - **global questions**: "what are the main themes in the corpus?", "what connects all these companies?". Without community summaries the alternative is feeding the LLM the entire corpus at once.
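The shape of this step, grouping entities and then writing one summary per group, can be sketched without a graph library. This sketch deliberately swaps Leiden for plain connected components, which only separates disconnected parts of the graph, so it shows the data flow rather than the clustering quality. The edge list and the `fake_llm` placeholder are invented for the example.

```python
from collections import defaultdict

# Hypothetical entity graph as undirected edges.
EDGES = [
    ("Anthropic", "Dario Amodei"),
    ("Dario Amodei", "OpenAI"),
    ("Neo4j", "Cypher"),
    ("Cypher", "openCypher"),
]


def detect_communities(edges):
    """Group nodes into connected components.

    Stand-in for Leiden: Leiden would also split densely connected
    regions inside one component; this sketch does not.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, communities = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarize_communities(communities, summarize):
    """Call `summarize` (an LLM wrapper supplied by the caller) once per community."""
    return {i: summarize(members) for i, members in enumerate(communities)}


def fake_llm(members):
    # Placeholder so the sketch runs offline; a real call would prompt a model.
    return f"Community of {len(members)} entities: {', '.join(members)}"


communities = detect_communities(EDGES)
summaries = summarize_communities(communities, fake_llm)
for community_id, text in summaries.items():
    print(community_id, text)
```

At query time a global question is answered from these summaries rather than from raw chunks, which is what keeps "themes across the corpus" questions within a context window.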

**GraphRAG indexing costs are significantly higher than regular RAG.** For a corpus of 1000 documents: regular RAG (embed) ~USD 0.50, GraphRAG (extract + resolve + summarize) ~USD 50-150. A 100-300x gap. Only justified when multi-hop reasoning or global summarization is genuinely needed - and verified on actual data.

**When to use GraphRAG vs regular RAG:**

**Regular RAG** - simple document Q&A where the answer lives in one chunk. Fast, cheap, predictable.
**GraphRAG** - enterprise knowledge bases with cross-references, legal documents, scientific papers with citations, entity relationship analytics.
**Hybrid** - vector search for 90% of queries, GraphRAG for complex analytical ones. A router classifies the query type and routes to the right pipeline.
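A router for the hybrid setup can start as a keyword heuristic and be replaced later with an LLM or a trained classifier. The patterns and the pipeline names `graphrag` and `vector_rag` below are illustrative assumptions, not part of any library.

```python
import re

# Illustrative patterns that hint at relationship or corpus-wide questions.
GRAPH_PATTERNS = [
    r"\bmain themes\b",
    r"\bacross all\b",
    r"\bconnects?\b",
    r"\brelationships?\b",
    r"\balso\b",
]


def route(query: str) -> str:
    """Return 'graphrag' for relational or global questions, else 'vector_rag'."""
    text = query.lower()
    if any(re.search(pattern, text) for pattern in GRAPH_PATTERNS):
        return "graphrag"
    return "vector_rag"


for question in [
    "What is the refund policy?",
    "Which clients of company X also invested in Y?",
    "What are the main themes across all documents?",
]:
    print(route(question), "<-", question)
```

Defaulting to the cheap pipeline and escalating only on a positive signal keeps the expensive path reserved for the minority of queries that need it. Logging the router's decisions makes it possible to check that split against real traffic.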

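**Quick start without Neo4j:** Microsoft's `graphrag` library works with file-based storage (Parquet). For prototyping: `pip install graphrag`, then `graphrag init` to scaffold a project, add your documents and API key to it, and run `graphrag index` - the graph builds locally. Exact flags (such as the project root) differ between releases, so check `graphrag --help` for your version. Move to Neo4j later, once the proof-of-concept is solid.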

What is Community Detection in GraphRAG and why is it needed?

Myth: "Knowledge graphs are legacy technology from before LLMs - the semantic web failed"

Reality: Microsoft GraphRAG (2024) showed +20-40% accuracy vs naive RAG on multi-hop questions. Knowledge graphs are having a second life precisely because of LLMs - they can now be built automatically from unstructured text.

The semantic web stalled largely because it required manual annotation. LLMs removed that bottleneck: a model can extract triples from arbitrary text automatically, and GraphRAG builds its pipeline on that. Google, LinkedIn, and Amazon have been running production knowledge graphs with billions of nodes for years. The "legacy" label is a myth rooted in memories of RDF/OWL's struggles in the 2000s - not the current state of the field.

Key Takeaways

Vector search finds by meaning - Knowledge Graph finds by relationships. GraphRAG combines both, delivering +20-40% accuracy on multi-hop questions
Entity extraction via LLM produces triples (Subject, Predicate, Object) - each triple becomes an edge in the graph
Entity Resolution is a mandatory step: without it the graph fills with duplicates (OpenAI vs open ai vs Open AI Inc.)
Neo4j 5.x supports vector indexes - graph and embeddings in a single DB, one query returns both data types
GraphRAG indexing is expensive (USD 50-150 per 1K documents, 100-300x more than naive RAG) - only justified when multi-hop reasoning is genuinely needed

What's Next

Knowledge Graphs are one component of a larger AI system architecture. The next step is learning to design the entire system end-to-end: from API gateway to LLM orchestrator, vector DB, caching, and monitoring.

AI System Design — Full architecture of a production AI application from scratch
Advanced RAG — Overview of advanced RAG techniques that GraphRAG builds upon

Related lessons

aie-13-advanced-rag — Graph retrieval extends advanced RAG
aie-42-ai-system-design — Knowledge graphs ground system-level retrieval
aie-09-embeddings — Embeddings link entities to graph nodes
aie-12-rag-fundamentals — Structured graph augments vector retrieval
db-29-graph — Same graph storage and traversal model
qd-01-intro