Indexing: The Graph and Subgraphs

Ethereum generates roughly 1 million events per day: Transfer, Swap, Mint, Burn, Deposit. Each one is a record in the logs of a transaction receipt, and per-block Bloom filters let a node skip blocks that cannot contain a given event. But try querying "all of a user's swaps on Uniswap over the past year" directly through RPC, and you'll get a timeout, a bill for hundreds of dollars from Alchemy, and an empty page instead of a dApp. The problem is that an Ethereum node stores data in a format optimized for block validation, not for analytical queries. The Graph flips this model: the protocol pre-processes every block, organizes events into entities, and provides a GraphQL API that responds in milliseconds. How exactly does a subgraph turn a raw log stream into a structured database? How do you process events from thousands of contracts whose addresses are unknown in advance? And why, even with The Graph, do you still need a cache for RPC calls?

  • **Uniswap Analytics** - the Uniswap dashboard (info.uniswap.org) shows trading volumes, TVL, and top pools in real time. All this data comes from a subgraph that indexes Swap, Mint, and Burn events from thousands of pools. Without The Graph, every dashboard visitor would generate millions of RPC requests
  • **Aave and lending protocols** - when you see interest rates, loan volumes, and liquidation history on Aave, that's aggregated data from a subgraph. Block handlers update rates every block, event handlers track every Deposit, Borrow, and Liquidation
  • **OpenSea and NFT marketplaces** - sale history, collection rankings, floor price - all indexed from Seaport contract event logs. Without indexing, the query 'all BAYC sales in the past month' would take hours instead of milliseconds

Prerequisites

  • Transactions and Receipts in Ethereum

The Graph Protocol and Subgraphs

Imagine you need to show a user's complete swap history on Uniswap. Call `eth_getLogs` across millions of blocks? That means hours of waiting and terabytes of traffic. **The Graph** solves this: the protocol creates **pre-indexed databases** from blockchain events, queryable via GraphQL in milliseconds.

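A **subgraph** is the unit of indexing in The Graph. Each subgraph defines which contracts to listen to, which events to handle, and how to transform data into structured entities. A subgraph consists of three parts: the manifest (`subgraph.yaml`), which specifies the contracts and events to index; the schema (`schema.graphql`), which describes the entities; and the AssemblyScript mappings, which transform raw event logs into those entities.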

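The Graph protocol operates as a **decentralized network** of participants, each performing a specific role, with coordination happening through the **GRT** token: indexers stake GRT and run nodes that index subgraphs and serve queries, curators signal GRT on subgraphs they consider worth indexing, delegators stake GRT with indexers without running a node themselves, and consumers pay query fees.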

A subgraph in The Graph protocol has three key components. Which one defines WHICH contracts and events to listen to?

GraphQL queries for blockchain data

Once a subgraph is deployed and an indexer has processed all the blocks, data is available through a **GraphQL API**. Unlike REST, where each endpoint returns a fixed structure, GraphQL lets the client request exactly the fields and relations it needs. For blockchain data this is critical: the same subgraph can serve both an aggregated-statistics dashboard and a detailed transaction explorer.

GraphQL in The Graph supports powerful filtering, pagination, and nested entity mechanisms:
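As a rough illustration of these mechanisms, the sketch below builds the JSON body for a filtered, sorted query with a nested entity. The entity and field names (`swaps`, `origin`, `amountUSD`, `pool`, `token0`) are modelled on public Uniswap subgraphs but are assumptions here; the real names come from your subgraph's `schema.graphql`.

```python
import json

# Hypothetical entity and field names; a real subgraph's schema.graphql
# defines what is actually queryable.
SWAPS_QUERY = """
query UserSwaps($user: Bytes!, $minUsd: BigDecimal!) {
  swaps(
    first: 100
    orderBy: timestamp
    orderDirection: desc
    where: { origin: $user, amountUSD_gt: $minUsd }
  ) {
    id
    timestamp
    amountUSD
    pool {  # nested entity resolved in the same request
      id
      token0 { symbol }
      token1 { symbol }
    }
  }
}
"""


def build_request(user: str, min_usd: str) -> str:
    """Return the JSON body to POST to a subgraph's GraphQL endpoint."""
    # Addresses are conventionally stored as lowercase hex in subgraphs.
    variables = {"user": user.lower(), "minUsd": min_usd}
    return json.dumps({"query": SWAPS_QUERY, "variables": variables})
```

The `_gt` suffix is one of the filter suffixes The Graph generates for each field; the nested `pool { ... }` selection is what replaces a second round trip in a REST design.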

The Graph uses **cursor-based pagination** through the `id` field. A single query returns at most 1000 entities (`first: 1000`). To retrieve all data, order by `id` and iterate with `where: { id_gt: "last_received_id" }`. Offset pagination (`skip`) is limited to 5000 and is inefficient for large datasets.
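The iteration can be sketched as a small generator. This is a minimal sketch, not a client library: `fetch_page` is a caller-supplied function (an assumption of this example) that runs the query and returns the list of entities, and `transfers` is a hypothetical entity name.

```python
from typing import Callable, Dict, Iterator, List

Page = List[Dict[str, str]]

# Hypothetical entity name; the id_gt filter and orderBy: id are the cursor.
PAGE_QUERY = """
query Page($lastId: ID!, $first: Int!) {
  transfers(first: $first, orderBy: id, where: { id_gt: $lastId }) {
    id
  }
}
"""


def iter_all(
    fetch_page: Callable[[str, int], Page], page_size: int = 1000
) -> Iterator[Dict[str, str]]:
    """Yield every entity by walking the id cursor.

    fetch_page(last_id, first) is expected to execute PAGE_QUERY with those
    variables and return the resulting list of entities.
    """
    last_id = ""  # the empty string sorts before every non-empty id
    while True:
        page = fetch_page(last_id, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short page means the end was reached
        last_id = page[-1]["id"]
```

Because each request filters on `id_gt`, the cost of fetching page N does not grow with N, which is the practical advantage over `skip`.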

One of The Graph's unique capabilities is **time-travel queries**: querying data at a specific block. This allows retrieving the state of entities in the past, which is critical for analytics and auditing:
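A time-travel query differs from a normal one only by the `block` argument on the entity field. The sketch below builds such a request; the `pool` entity and its `liquidity` field are assumed names for illustration.

```python
def pool_at_block(pool_id: str, block_number: int) -> dict:
    """Build a request body that reads one entity as of a past block."""
    # `pool` and `liquidity` are hypothetical schema names.
    query = """
    query PoolAt($id: ID!, $block: Int!) {
      pool(id: $id, block: { number: $block }) {
        id
        liquidity
      }
    }
    """
    return {
        "query": query,
        "variables": {"id": pool_id.lower(), "block": block_number},
    }
```

For example, `pool_at_block("0x...", 18_000_000)` asks for the entity as it was stored after block 18,000,000 was indexed, provided the indexer still keeps history for that block.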

Why does GraphQL outperform a REST API for blockchain data? Let's compare approaches using the task of "showing a DeFi protocol dashboard" as an example:

A dApp wants to get the liquidity state of a Uniswap V3 pool at block #18,000,000 (three months ago). Which mechanism of The Graph makes this possible?

Event indexing and mappings

The heart of a subgraph is its **mapping handlers**: AssemblyScript functions called for each matching event. Graph Node scans blocks, decodes event logs through the ABI, and passes structured data to the handler. The mapping's job is to transform a 'raw' blockchain event into entities for storage.
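To make the shape of a handler concrete, here is a toy model written in Python purely for readability (real mappings are AssemblyScript and use generated entity classes). The `TransferEvent` type, the `store` dictionary, and the id scheme are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TransferEvent:
    """A decoded log, roughly what Graph Node hands to a handler."""
    tx_hash: str
    log_index: int
    block_number: int
    sender: str
    recipient: str
    value: int


# Stand-in for the subgraph store: entity id -> entity fields.
store: Dict[str, dict] = {}


def handle_transfer(event: TransferEvent) -> None:
    """Turn one raw event into a Transfer entity and update an aggregate."""
    entity_id = f"{event.tx_hash}-{event.log_index}"  # unique per log
    store[entity_id] = {
        "from": event.sender,
        "to": event.recipient,
        "value": event.value,
        "block": event.block_number,
    }
    # Load-modify-save of a running total per recipient.
    account = store.setdefault(f"account-{event.recipient}", {"received": 0})
    account["received"] += event.value
```

The handler does two typical things: it writes one immutable record per event, and it updates an aggregate that a dashboard can later read without rescanning the events.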

In addition to event handlers, The Graph supports **call handlers** (function calls) and **block handlers** (every block). Each type serves a specific purpose:

A separate challenge is indexing **dynamic contracts**. For example, the Uniswap Factory creates new pools via `createPool()`. The addresses of these pools are unknown at subgraph deploy time. For this, The Graph provides **Data Source Templates**:
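The idea behind templates can be modelled in a few lines: the factory's handler registers each new pool address, and from then on logs from that address are routed to the pool handlers. This is a Python toy, not Graph Node's implementation; `FACTORY`, the log dictionary shape, and the handler names are hypothetical. In a real mapping, the registration step is a `create(address)` call on the class generated for the template.

```python
from typing import Dict, List, Set

FACTORY = "0xfactory"            # hypothetical static data source address
tracked_pools: Set[str] = set()  # data sources created at runtime
swaps: List[Dict[str, str]] = []


def handle_pool_created(pool_address: str) -> None:
    """Factory handler: start tracking the new pool from this point on."""
    tracked_pools.add(pool_address.lower())


def process_log(log: Dict[str, str]) -> None:
    """Route one decoded log to the matching handler, if any."""
    address = log["address"].lower()
    if address == FACTORY and log["event"] == "PoolCreated":
        handle_pool_created(log["pool"])
    elif address in tracked_pools and log["event"] == "Swap":
        swaps.append({"pool": address, "tx": log["tx"]})
    # Logs from addresses that were never registered are ignored.
```

Note that a pool is only indexed from the moment the factory event is processed, so events are picked up starting with the pool's creation rather than from the subgraph's start block.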

**Chain reorgs** (chain reorganizations) are a serious problem for indexers. If two validators produce competing blocks at the same height, only one stays in the canonical chain and the other is orphaned (under proof of work such blocks were called uncles/ommers). Graph Node handles reorgs automatically: it rolls entities back to the last block common to both chains and re-indexes from there. For critical data, check the confirmation depth: on Ethereum, a block is considered finalized after two epochs, which is 64 slots (~13 minutes).
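The rollback logic can be sketched as follows. This is a simplified model under stated assumptions: the indexer remembers the hash of every block it processed, `canonical` maps block numbers to the hashes the node currently reports, and the 64-block depth comes from the paragraph above.

```python
from typing import Dict, List, Tuple

# Each indexed block: (number, hash, entities written in that block).
IndexedBlock = Tuple[int, str, Dict[str, int]]


def rollback(
    indexed: List[IndexedBlock], canonical: Dict[int, str]
) -> List[IndexedBlock]:
    """Drop blocks from the tip until the hash matches the canonical chain."""
    kept = list(indexed)
    while kept and canonical.get(kept[-1][0]) != kept[-1][1]:
        kept.pop()  # reorged out: this block's entity writes are discarded
    return kept


def is_final(block_number: int, head_number: int, depth: int = 64) -> bool:
    """Treat a block as settled once it is `depth` blocks behind the head."""
    return head_number - block_number >= depth
```

After `rollback` returns, indexing resumes from the block following the last kept one, this time along the canonical chain.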

Uniswap V3 Factory creates thousands of pools with unique addresses. How does a subgraph index events (Swap, Mint) from contracts whose addresses are unknown at deploy time?

RPC caching and alternative indexers

The Graph handles event indexing, but dApps interact with the blockchain not only for historical data. Calls like `eth_call` (reading contract state), `eth_getBalance`, `eth_blockNumber` - all of these are **RPC requests** to a node, each costing money and adding latency. Smart **caching** of these requests is the difference between a dApp that loads in 200ms and one that hangs for 5 seconds.
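One simple caching policy keys the expiry on the block tag: a result for a numbered block does not change, while a result for `latest` is only good until the next block. The sketch below assumes a 12-second block time and a caller-supplied `fetch` function; the class name and key layout are illustrative, and a production cache would also wait for finality before pinning a recent block forever.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

BLOCK_TIME_SECONDS = 12.0  # assumed Ethereum slot time


class RpcCache:
    """Cache eth_call-style results keyed by (method, params, block tag)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[Tuple[str, str, str], Tuple[Any, Optional[float]]] = {}

    def get_or_fetch(
        self, method: str, params: str, block: str, fetch: Callable[[], Any]
    ) -> Any:
        key = (method, params, block)
        hit = self._entries.get(key)
        if hit is not None and (hit[1] is None or hit[1] > self._clock()):
            return hit[0]
        value = fetch()
        # Hex block numbers are pinned: no expiry. Moving tags such as
        # "latest" expire after roughly one block.
        pinned = block.startswith("0x")
        expires = None if pinned else self._clock() + BLOCK_TIME_SECONDS
        self._entries[key] = (value, expires)
        return value
```

Injecting the clock keeps the class testable without sleeping; in an application you would construct `RpcCache()` with the default monotonic clock.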

A key optimization is **Multicall** (batching multiple `eth_call` calls into a single RPC request). Instead of 50 separate `balanceOf()` calls for 50 tokens, a dApp sends a single `eth_call` to the Multicall3 contract, which executes all 50 calls inside the EVM and returns the results in one response. No transaction is sent and no gas is paid, because the batch is itself a read-only call:
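The batch itself is just a list of (target, allowFailure, callData) tuples, the shape of Multicall3's `Call3` struct. The sketch below prepares that list for `balanceOf` without any web3 library; sending it to the contract's `aggregate3` function and decoding the result is left out, and the helper names are made up for this example.

```python
from typing import List, Tuple

# First 4 bytes of keccak256("balanceOf(address)").
BALANCE_OF_SELECTOR = "70a08231"

# (target, allowFailure, callData), mirroring Multicall3's Call3 struct.
Call3 = Tuple[str, bool, str]


def balance_of_calldata(holder: str) -> str:
    """ABI-encode balanceOf(holder): selector + address padded to 32 bytes."""
    raw = holder.lower()
    if raw.startswith("0x"):
        raw = raw[2:]
    if len(raw) != 40:
        raise ValueError("expected a 20-byte hex address")
    return "0x" + BALANCE_OF_SELECTOR + raw.rjust(64, "0")


def build_batch(tokens: List[str], holder: str) -> List[Call3]:
    """One Call3 per token; allowFailure=True so a single failing token
    does not revert the whole batch."""
    data = balance_of_calldata(holder)
    return [(token, True, data) for token in tokens]
```

With 50 token addresses, `build_batch` yields 50 entries that travel in one request, so the dApp pays one round trip of latency instead of 50.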

The Graph is not the only indexing solution. The ecosystem offers tools optimized for different tasks:

**CDN for static data**: token metadata (name, symbol, decimals), contract ABIs, token lists - all of this is effectively immutable data that should be served through a CDN (Cloudflare, Fastly). One `eth_call` for `name()` costs ~$0.0001 on Alchemy, while a cached CDN response costs close to nothing. For large dApps, the savings can reach thousands of dollars per month.

Common misconception: The Graph fully replaces RPC calls to a node, so if a dApp has a subgraph, direct calls to a node are not needed.

In reality, The Graph indexes only event logs (and optionally call/block data); it does not replace eth_call for reading current contract state. To get the current balance, allowance, or the result of a view function, you still need an RPC call to a node. Subgraph and RPC complement each other: subgraph is for historical and aggregated data, RPC is for current state.

This misconception arises from conflating two types of data: event-driven (historical events, aggregates) and state-driven (current contract state). The Graph handles the former. When a user wants to see their current USDC balance - that's an eth_call to the contract, not a subgraph query. When they need the history of all their transfers over a year - that's a GraphQL query to the subgraph, not scanning blocks via RPC.

A dApp calls eth_call to read balanceOf(Alice) at a specific block #18,000,000 (not latest). What TTL is correct for caching this result?

Key ideas

  • **Subgraph = schema + manifest + mappings**: schema.graphql describes entities, subgraph.yaml specifies contracts and events, AssemblyScript mappings transform raw event logs into structured data. The Graph's decentralized network coordinates indexers through the GRT token
  • **GraphQL instead of REST** for blockchain data: one query with precise fields, nested entities, and filters replaces dozens of REST calls. Time-travel queries allow querying state at a specific block - a unique capability for analytics and auditing
  • **Three handler types**: event handlers (90% of cases - react to event logs), call handlers (intercept function calls via trace), block handlers (fire on every block for regular snapshots). Data Source Templates solve the problem of dynamic contracts (Factory pattern)
  • **Caching is multi-layered**: immutable data (specific block) is cached forever, latest data - for 12 seconds (one block). Multicall batches dozens of eth_calls into one RPC request. Alternative indexers (Ponder, Envio, Goldsky) offer TypeScript, speed, and SQL queries
  • At the start we asked: how do you show a year of swap history without killing the RPC node? The answer is pre-indexing. A subgraph processes each block once, turning an event stream into a database with a GraphQL API. RPC is for current state, subgraph is for historical and aggregated data. Together they cover all dApp needs.

Related topics

Blockchain data indexing ties together transaction mechanics, nodes, token standards, and DeFi protocols:

  • Transactions and receipts - Event logs from transaction receipts are the raw data that The Graph indexes. Bloom filters allow quickly filtering out blocks without the needed events, and the receipt trie guarantees their integrity
  • Nodes: Full, Light, Archive - Graph Node connects to an archive node for access to historical data. A full node keeps state only for the most recent blocks and prunes older state, while an archive node keeps it for all blocks. Time-travel queries require an archive node
  • Token standards (ERC-20, ERC-721) - Standardized Transfer and Approval events from ERC-20/ERC-721 are the most commonly indexed in the Ethereum ecosystem. Subgraphs for Uniswap, OpenSea, and Aave are built around these events
  • AMM: automated market making - DeFi protocol subgraphs index Swap, Mint, and Burn events from AMM pools. Aggregated data (TVL, volumes, prices) is computed in mappings and available through GraphQL

Questions for reflection

  • The Graph Network requires indexers to stake GRT. How does this affect data reliability? What happens if an indexer returns incorrect results for a GraphQL query?
  • Why did The Graph choose AssemblyScript (compiled to WASM) for mappings instead of JavaScript or Python? What advantages and limitations does this create? How do Ponder and Envio work around this decision?
  • You are designing a dApp that needs the current token balance (eth_call), transfer history for the past year (event logs), and the token price in real time (updated every block). What combination of tools (subgraph, RPC, cache, multicall) would you choose for each task and why?

Related lessons

  • db-09-indexes-btree