Knowledge Graphs

What Is a Knowledge Graph

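Knowledge graphs are already in production. Google's Knowledge Graph holds about 500 billion facts, and Wikidata has more than 100 million entities. NASA uses Neo4j for mission management, and Microsoft uses an Azure knowledge graph behind Copilot's enterprise RAG. Combining an LLM with a KG is a form of neuro-symbolic AI: LlamaIndex's KnowledgeGraphIndex lets a model such as GPT-4 answer from structured facts, which reduces hallucination. This is not just research - it was production practice by 2024.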

  • **Google Knowledge Panel**: direct answers drawn from about 500B facts - no page click required
  • **Microsoft Copilot enterprise**: Azure Knowledge Graph as structured memory for RAG, with a reported accuracy gain of about 40% over plain embeddings
  • **NASA missions**: Neo4j component graph - fast failure cascade analysis for any spacecraft part
  • **Drug discovery**: BioKG / OpenTargets link genes, proteins, and diseases; a reported $2B+ saved by filtering out dead-end molecules early
  • **LlamaIndex KnowledgeGraphIndex**: the LLM reads the graph, not raw text - a reported 3-5x fewer factual hallucinations

Nodes (Entities)

Google's Knowledge Graph stores about 500 billion facts about 5 billion entities. When the search engine shows a "Christopher Nolan - director of Inception" card without requiring a page click, the answer does not come from a fancy database table. It comes from a **knowledge graph**: a network of connected entities. Microsoft Copilot enterprise RAG and LlamaIndex KnowledgeGraphIndex (neuro-symbolic AI) rely on the same principle. Each entity is a **node** (vertex) in the graph.

Wikidata, the largest open knowledge graph, holds more than 100 million entities, each with a unique identifier (for example, Q42 = Douglas Adams).

Entities in a Knowledge Graph (KG) are not just text strings. Each entity has a **unique identifier** (URI), a **type** (class), and **properties**. "Moscow" in a KG is a specific entity distinct from "Moscow Mule" (cocktail) or "Moscow" (film). Wikidata holds 100M+ such unambiguous entities - which is precisely why it's valuable for training data annotation in LLMs.

Entities are grouped into **classes** (types): Person, Place, Organization, Event. This creates a hierarchy: Scientist is a subclass of Person, University is a subclass of Organization. This typing enables meaningful queries: "All scientists born in Germany".
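A minimal Python sketch of typed entities and a class hierarchy, assuming a tiny in-memory model. The `ex:` identifiers, the `born_in_country` property, and the sample entities are illustrative assumptions, not real Wikidata records.

```python
from dataclasses import dataclass, field

# Hypothetical class hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "Scientist": "Person",
    "University": "Organization",
}


@dataclass
class Entity:
    uri: str    # unique identifier (illustrative "ex:" prefix)
    type: str   # class the entity belongs to
    properties: dict = field(default_factory=dict)


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or is a transitive subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


# Illustrative sample data.
ENTITIES = [
    Entity("ex:einstein", "Scientist", {"label": "Albert Einstein", "born_in_country": "Germany"}),
    Entity("ex:planck", "Scientist", {"label": "Max Planck", "born_in_country": "Germany"}),
    Entity("ex:curie", "Scientist", {"label": "Marie Curie", "born_in_country": "Poland"}),
    Entity("ex:princeton", "University", {"label": "Princeton University"}),
]


def scientists_born_in(country):
    """Answer 'all scientists born in <country>' using the entity types."""
    return [
        e.properties["label"]
        for e in ENTITIES
        if is_a(e.type, "Scientist") and e.properties.get("born_in_country") == country
    ]


print(scientists_born_in("Germany"))
print(is_a("Scientist", "Person"))
```

Because the type is stored on the entity, the query filters on class membership rather than on matching text in a name.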

How does an entity in a Knowledge Graph differ from a record in a traditional database?

Edges (Relationships)

Entities without relationships are just a list. The power of a Knowledge Graph lies in its **edges** - typed relationships between entities. Every edge has a **direction** and a **type**: "Einstein → born_in → Ulm" is not the same as "Ulm → born_in → Einstein".

Relationship types in a KG are standardized through **ontologies** (schemas). Wikidata defines thousands of relationship types (properties): P19 (place of birth), P69 (educated at), P106 (occupation). Standardization lets different systems exchange knowledge without ambiguity.

| Relationship type (Wikidata) | ID | Example |
|---|---|---|
| instance of | P31 | Moscow → P31 → City |
| subclass of | P279 | City → P279 → Settlement |
| country | P17 | Moscow → P17 → Russia |
| capital | P36 | Russia → P36 → Moscow |
| occupation | P106 | Einstein → P106 → Physicist |
| date of birth | P569 | Einstein → P569 → 1879-03-14 |

Edges in a KG can carry **qualifiers** - metadata on the edges themselves. For example: "Einstein → citizenship → Germany [start_time: 1879, end_time: 1896]". This is impossible in a simple graph - it requires a **property graph** or **RDF reification**.
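One way to picture qualifiers is an edge object that carries its own key-value metadata, as in a property graph. The sketch below is an assumption-laden illustration: the `Edge` class and the `holds_in` helper are hypothetical names, not part of any graph library.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    qualifiers: dict = field(default_factory=dict)  # metadata about the edge itself


# The citizenship example from the text, with its time qualifiers.
citizenship = Edge(
    "Einstein", "citizenship", "Germany",
    {"start_time": 1879, "end_time": 1896},
)


def holds_in(edge, year):
    """Return True if the edge is valid in the given year, per its qualifiers."""
    start = edge.qualifiers.get("start_time")
    end = edge.qualifiers.get("end_time")
    return (start is None or start <= year) and (end is None or year <= end)


print(holds_in(citizenship, 1890))  # True
print(holds_in(citizenship, 1900))  # False
```

An edge without qualifiers is treated as always valid, which matches how an unqualified statement is usually read.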

Relationships create **multi-hop paths**: "Einstein → worked_at → Princeton → located_in → New Jersey → part_of → USA". These paths enable answers to complex questions - "In which country is the university where Einstein worked?" - without storing that fact directly. NASA uses Neo4j on this same principle: a graph of spacecraft component relationships lets engineers instantly trace the failure cascade of any single part.
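The multi-hop path above can be sketched in a few lines of Python, assuming the graph is a plain list of (subject, predicate, object) tuples. The `follow` helper is a hypothetical name introduced for illustration.

```python
# Directed, typed edges from the example path in the text.
TRIPLES = [
    ("Einstein", "worked_at", "Princeton"),
    ("Princeton", "located_in", "New Jersey"),
    ("New Jersey", "part_of", "USA"),
    ("Einstein", "born_in", "Ulm"),
]


def follow(start, predicates):
    """Follow a chain of predicates from start and return the set of end nodes."""
    frontier = {start}
    for predicate in predicates:
        frontier = {o for (s, p, o) in TRIPLES if s in frontier and p == predicate}
    return frontier


# "In which country is the university where Einstein worked?"
print(follow("Einstein", ["worked_at", "located_in", "part_of"]))  # {'USA'}

# Direction matters: there is no born_in edge that starts at Ulm.
print(follow("Ulm", ["born_in"]))  # set()
```

The fact "Einstein worked in the USA" is never stored; it falls out of three separate edges.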

How do edges in a Knowledge Graph differ from foreign keys in a relational database?

Triples

Every fact in a Knowledge Graph is recorded as a **triple**: (subject, predicate, object). This is the atomic unit of knowledge. The entire graph is a collection of such triples. "Moscow - capital - Russia" = one triple. "Earth - orbits - Sun" = another.

The object can be one of two things: an **entity** (another node in the graph) or a **literal** (a value: string, number, date). "Einstein → born_in → Ulm" - object is an entity. "Einstein → birth_date → 1879-03-14" - object is a literal. Literals are the leaves of the graph.
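A small sketch of the entity/literal distinction, assuming entities are wrapped in a hypothetical `Node` class and literals are ordinary Python values:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Node:
    """An entity in the graph; anything that is not a Node is a literal."""
    name: str


EINSTEIN = Node("Einstein")
ULM = Node("Ulm")

TRIPLES = [
    (EINSTEIN, "born_in", ULM),                   # object is an entity
    (EINSTEIN, "birth_date", date(1879, 3, 14)),  # object is a literal
]


def is_literal(obj):
    return not isinstance(obj, Node)


def outgoing(subject):
    """All (predicate, object) pairs that start at the given subject."""
    return [(p, o) for (s, p, o) in TRIPLES if s == subject]


for predicate, obj in outgoing(EINSTEIN):
    kind = "literal" if is_literal(obj) else "entity"
    print(predicate, "->", obj, f"({kind})")
```

A literal never appears in the subject position, so nothing can be attached to it; that is what makes literals the leaves of the graph.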

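Wikidata contains more than **1.7 billion** triples, DBpedia (extracted from Wikipedia) holds around 3 billion, and YAGO about 120 million. Triples are a general-purpose format: most facts can be expressed directly as (subject, predicate, object), and facts that need extra context, such as a time span, can be expressed with additional triples about the statement.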

The power of triples lies in **composability**. Complex questions decompose into chains: "In which country was the creator of the theory of relativity born?" = (?, created, Relativity) → (Einstein, born_in, ?) → (Ulm, country, ?) → Germany. Each step is one triple.
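The decomposition above can be sketched as pattern matching over a triple list, where `None` plays the role of the `?` wildcard. The `match` function is a hypothetical helper, not a real query engine.

```python
TRIPLES = [
    ("Einstein", "created", "Relativity"),
    ("Einstein", "born_in", "Ulm"),
    ("Ulm", "country", "Germany"),
]


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# (?, created, Relativity) -> (Einstein, born_in, ?) -> (Ulm, country, ?)
creator = match(p="created", o="Relativity")[0][0]
birthplace = match(s=creator, p="born_in")[0][2]
country = match(s=birthplace, p="country")[0][2]
print(country)  # Germany
```

Each step binds one unknown and feeds it into the next pattern, which is roughly what a graph query language does when it joins triple patterns.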

Which triple correctly represents the fact "Python was created by Guido van Rossum"?

RDF

Triples are a great model, but a standard format is needed. **RDF** (Resource Description Framework) is the W3C standard for representing triples. In RDF, everything is identified by a **URI** (Uniform Resource Identifier), which allows data from different sources to be merged without naming conflicts.

URIs in RDF solve the **disambiguation** problem: http://www.wikidata.org/entity/Q937 refers to Albert Einstein the physicist, not a namesake. When two different sources both use the URI for Q937, they are guaranteed to be talking about the same person.
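A short sketch of why shared URIs make merging trivial, assuming two hypothetical sources represented as sets of triples. The URI is written in Wikidata's canonical `http://www.wikidata.org/entity/` form; the predicate names are illustrative.

```python
EINSTEIN = "http://www.wikidata.org/entity/Q937"

# Two hypothetical sources that both identify the subject by URI.
source_a = {(EINSTEIN, "label", "Albert Einstein")}
source_b = {(EINSTEIN, "birth_date", "1879-03-14")}

# Merging is a set union: identical subjects line up automatically.
merged = source_a | source_b
facts_by_uri = {p: o for (s, p, o) in merged if s == EINSTEIN}
print(facts_by_uri)

# The same data keyed by name strings does not line up.
by_name_a = {("Albert Einstein", "label", "Albert Einstein")}
by_name_b = {("A. Einstein", "birth_date", "1879-03-14")}
subjects = {s for (s, _, _) in by_name_a | by_name_b}
print(len(subjects))  # 2: the merge treats one person as two
```

With string keys, a separate entity-resolution step would be needed to decide that both names refer to the same person.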

| RDF format | Extension | Characteristics |
|---|---|---|
| Turtle | .ttl | Human-readable, compact |
| N-Triples | .nt | One triple per line, easy to process |
| RDF/XML | .rdf | XML-based, legacy format |
| JSON-LD | .jsonld | JSON-compatible, for web APIs |
| N-Quads | .nq | N-Triples + named graph (context) |
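To make the N-Triples row concrete, here is a minimal serializer sketch in plain Python. It assumes a hypothetical `http://example.org/` namespace for the predicates and for Ulm, and it escapes only backslashes and double quotes, so it is not a complete N-Triples writer.

```python
WD = "http://www.wikidata.org/entity/"
EX = "http://example.org/"  # hypothetical namespace for illustration
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Quote a literal and optionally attach a datatype IRI."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    suffix = f"^^<{datatype}>" if datatype else ""
    return f'"{escaped}"{suffix}'


def to_ntriples(triples):
    """One triple per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for (s, p, o) in triples)


triples = [
    (iri(WD + "Q937"), iri(EX + "born_in"), iri(EX + "Ulm")),
    (iri(WD + "Q937"), iri(EX + "birth_date"), literal("1879-03-14", XSD_DATE)),
]

print(to_ntriples(triples))
```

In practice a library such as rdflib would handle parsing and serialization; the sketch only shows how little structure the line-based format needs.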

**Wikidata** and **DBpedia** are among the largest open KGs available as RDF, and **Schema.org** is a shared vocabulary widely used for markup on web pages. Google's Knowledge Panel (the info card in search results) is powered by a proprietary KG enriched with Schema.org markup from web pages. JSON-LD is one of the formats through which websites pass structured data to search engines.
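As an illustration of Schema.org markup in JSON-LD, the sketch below builds a `Person` description with Python's standard `json` module and wraps it in the script tag that web pages typically use. The sample values are illustrative, and what a given search engine does with the markup is not shown here.

```python
import json

# A Schema.org Person described in JSON-LD.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Albert Einstein",
    "birthDate": "1879-03-14",
    "birthPlace": {"@type": "Place", "name": "Ulm"},
    "sameAs": "http://www.wikidata.org/entity/Q937",
}

# Pages embed JSON-LD inside a script element with this media type.
markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(person, indent=2)
    + "\n</script>"
)

print(markup)
```

The `sameAs` property links the page's description to the Wikidata URI, so a consumer can tell which Einstein is meant.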

Tim Berners-Lee's Semantic Web

In 2001, Tim Berners-Lee (creator of the WWW), together with James Hendler and Ora Lassila, published "The Semantic Web" in Scientific American, describing a vision of the internet where machines understand the meaning of data. RDF (1999) and OWL (2004) are the standards created to realize this vision. Although the "semantic web" in its original form never fully materialized, its technologies live on in Knowledge Graphs, Schema.org, and Linked Data.

Misconception: a Knowledge Graph is just a database with tables

A Knowledge Graph stores semantic relationships between entities, creating a network of facts that supports logical inference.

In a relational database, a relationship is a foreign key whose meaning lives in the schema and the application code, not in the data. In a KG, every edge is typed ("born_in" ≠ "worked_at"), entities have global URIs, and the graph supports multi-hop queries and automated reasoning. That's why Copilot through Azure KG can answer complex enterprise questions accurately, while plain SQL over the same data needs a chain of JOINs or a recursive query to express "find everything connected within 3 hops".
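A "within N hops" query is a bounded breadth-first search over the edge list. The sketch below assumes a small in-memory graph with illustrative data and a hypothetical `within_hops` helper; a graph database would run the equivalent traversal natively.

```python
from collections import deque

# Illustrative directed edges: (source, relation, target).
EDGES = [
    ("Einstein", "worked_at", "Princeton"),
    ("Princeton", "located_in", "New Jersey"),
    ("New Jersey", "part_of", "USA"),
    ("USA", "part_of", "North America"),
]


def within_hops(start, max_hops):
    """Breadth-first search over outgoing edges; returns {node: hop distance}."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for source, _, target in EDGES:
            if source == node and target not in distance:
                distance[target] = distance[node] + 1
                queue.append(target)
    del distance[start]  # report only the neighbours, not the start node
    return distance


print(within_hops("Einstein", 3))
# {'Princeton': 1, 'New Jersey': 2, 'USA': 3}
```

"North America" is four hops away, so it stays out of the result; raising `max_hops` to 4 would include it without changing the query's shape.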

Why does RDF use URIs instead of plain strings?

Key Takeaways

  • **Nodes** - entities with a unique URI, type, and properties; 100M+ in Wikidata, 500B+ facts in Google KG
  • **Edges** - typed, directed relationships; multi-hop paths answer questions without storing every fact
  • **Triples** (subject, predicate, object) - the atomic unit of knowledge; Wikidata has 1.7B+ triples
  • **RDF** - W3C standard with global URIs; JSON-LD delivers KG data from millions of websites to search engines
  • **LLM + KG** - neuro-symbolic AI: LlamaIndex KnowledgeGraphIndex is reported to cut hallucinations 3-5x on factual queries

Related Topics

Knowledge Graphs combine ideas from databases, the semantic web, and AI:

  • RDF and the Semantic Web — The next lesson covers RDF formats, OWL ontologies, and the SPARQL query language in depth
  • Property Graphs: Neo4j — An alternative graph model with native properties on edges

Questions for Reflection

  • Pick any public scientist and write five triples about them in (subject, predicate, object) format. Which predicates had to be invented because they don't exist in Wikidata?
  • Why can't Google simply store all the world's facts in a single SQL table?
  • What is the advantage of URIs over plain strings when merging data from Wikipedia, IMDb, and academic papers?

Related Lessons

  • db-01-intro — Relational and graph models are two alternative ways to store connected facts
  • nlp-01 — Knowledge graphs are used in NLP for entity linking, fact-checking, and knowledge-augmented LLMs
  • ds-01-arrays — Graph data structures (adjacency list, matrix) are the in-memory representation of a knowledge graph
  • st-01-feedback-loops — Ontologies and systems models both describe networks of interacting concepts
  • dm-01