Backend Transport
Serialization: JSON, Protobuf, Avro
Google handles 10+ billion internal RPC calls per second. Switch all of that to JSON and server-to-server traffic balloons 5-10x - petabytes of extra data per day. That is the exact reason Google built Protocol Buffers in 2001, now battle-tested by Uber, Netflix, Spotify, and thousands more.
- **Kafka + Avro** at LinkedIn processes 7+ trillion messages per day - Schema Registry ensures 2,000+ services can update independently without breaking the data format
- **Discord** switched from JSON to MessagePack for WebSocket messages and cut traffic by 30-40%, critical at 200+ million users
- **gRPC + Protobuf** at Dropbox reduced inter-service message size by 8x compared to JSON REST API
Protobuf: from Google internal tool to industry standard
In 2001, Google faced a problem: hundreds of services exchanged data through plain text. Serialization speed and data size became bottlenecks. The team developed Protocol Buffers - a binary format with code generation. In 2008, Google open-sourced it. gRPC (2015) followed - a high-performance RPC framework on top of Protobuf. Today Protobuf is used in Kubernetes, Envoy, and Cloud Spanner. In parallel, Apache Avro (2009) emerged from the Hadoop ecosystem - Doug Cutting solved the same problem for Big Data, but focused on self-describing format for long-term storage.
Предварительные знания
Why Serialization Is Needed
TCP ships **bytes** - a sequence of zeros and ones. Code, on the other hand, deals in objects, structs, and classes. So how does `{ name: "Alice", age: 30 }` in one server's memory show up intact in another's?
**Serialization** turns a data structure into a format that travels over the network or persists to disk. **Deserialization** runs the trip backward: bytes/text back into an object in memory.
Serialization solves three problems: **1)** translating an internal memory representation into a portable format, **2)** crossing language boundaries (JS object → JSON → Python dict), **3)** compactness on the wire.
An object in memory carries pointers, virtual method tables, and alignment padding. Shipping raw memory between two services is impossible even when both run the same language: different runtime versions, different processor architectures (little-endian vs big-endian) - they will never agree.
Why can't raw object memory be sent from one server to another?
JSON and XML: Text Formats
**JSON (JavaScript Object Notation)** is the de-facto standard for web APIs. Douglas Crockford pulled it together in the early 2000s as a lightweight alternative to XML. Human-readable, supported by every language, minimal syntax.
**XML (eXtensible Markup Language)** is JSON's predecessor, still alive in SOAP, configurations (Maven, Android), and enterprise systems. More verbose, but it carries attributes, namespaces, and XSD validation.
| Characteristic | JSON | XML |
|---|---|---|
| Readability | High | Medium (verbose) |
| Data size | More compact (~30% less than XML) | Larger (opening/closing tags) |
| Data types | string, number, boolean, null, array, object | All strings (types via XSD) |
| Schema validation | JSON Schema (optional) | XSD, DTD (mature ecosystem) |
| Comments | Not supported | Supported <!-- ... --> |
| Parsing | Fast (built into browser/JS) | Slower (DOM/SAX parsers) |
| Usage in 2025 | REST API, configs, NoSQL | SOAP, Enterprise, Android, SVG |
JSON beat XML on the web not by being better but by being simpler. `JSON.parse()` is baked into every browser. XML is genuinely more expressive: namespaces dodge name collisions, XSLT transforms documents, XPath queries them precisely.
The Achilles heel of text formats is **size and speed**. `"age": 30` burns 8 bytes; the actual number 30 is just 1 byte. Field names repeat in every object across an array. Fine at 1,000 requests per second. At 1,000,000 it falls apart.
What is the main advantage of JSON over binary formats?
Protocol Buffers: Binary Efficiency
**Protocol Buffers (Protobuf)** is Google's binary serialization format. Built in 2001 to power Google's internal inter-service traffic; open-sourced in 2008. Today it is the foundation of gRPC and the de-facto standard for high-load backends.
The defining idea of Protobuf: **schema-first**. Describe the data structure in a `.proto` file, then a compiler generates serialization/deserialization code in any language. No hand-rolled parsing.
**Field numbers** (1, 2, 3...) are not indices - they are identifiers. They get encoded into the binary in place of field names. Instead of `"name":` (6 bytes), Protobuf writes a single byte holding the field number and type. Multiply that across thousands of messages and the savings stack up fast.
Protobuf encodes numbers as **varints**: small numbers eat fewer bytes. 1 - one byte, 300 - two, 1,000,000 - three. JSON always encodes numbers as text: '1' = 1 byte, '300' = 3 bytes, '1000000' = 7 bytes vs 3 in Protobuf.
Protobuf is no silver bullet. Binary data is unreadable without the .proto file. `curl` returns garbage instead of clean JSON. Debugging is harder. The trade: serialization is 5-10x faster than JSON and the payload is 3-10x smaller.
Why does Protobuf use field numbers (1, 2, 3) instead of field names?
Avro and MessagePack: Other Binary Formats
Protobuf is not the only binary format worth knowing. Two others stand out: **Apache Avro** (the Big Data and Kafka world) and **MessagePack** (fast binary JSON).
**Apache Avro** comes from Doug Cutting (Hadoop's author), built for the Apache ecosystem. The defining difference from Protobuf: **the schema travels with the data**. An Avro file leads with a schema header, followed by data blocks. A reader decodes the data without needing a separate .proto file.
**Kafka + Avro** is the gold standard for event streaming. Schema Registry (Confluent) stores all versions of Avro schemas. A producer publishes an event specifying the schema id. The consumer automatically retrieves the schema from the Registry and decodes the message.
**MessagePack** sells itself as "JSON, but faster and more compact". Unlike Protobuf and Avro, MessagePack **needs no schema** - it is schemaless like JSON. Field names stick around, just encoded more efficiently.
| Criterion | Protobuf | Avro | MessagePack |
|---|---|---|---|
| Schema | Required (.proto) | Embedded in file/Registry | Not needed (schemaless) |
| Field encoding | By numbers (1, 2, 3) | By order in schema | By names (like JSON) |
| Data size | Minimal | Compact | Medium (names preserved) |
| Ecosystem | gRPC, Google Cloud | Kafka, Hadoop, Spark | Redis, MessagePack-RPC |
| Code generation | Yes (protoc) | Yes (avro-tools) | No (schemaless) |
| Ease of adoption | Medium (needs compiler) | Medium (needs Registry) | High (drop-in JSON) |
Format choice tracks the use case. **Protobuf** - gRPC and strictly typed APIs. **Avro** - Kafka and Big Data. **MessagePack** - when faster JSON is needed without an architecture overhaul.
How does Avro differ from Protobuf in its approach to schemas?
Schema Evolution: Don't Break Production
Schemas change. A `phone` field gets added to `User`. Service A is updated and sends `phone`. Service B is not - it has never heard of `phone`. What happens?
**Schema evolution** is a format's ability to keep working when producer and consumer run **different schema versions**. Three flavors of compatibility:
Protobuf is engineered for schema evolution from day one. **Field numbers** carry the weight. Adding a new field hands it a **new number**. Old code does not know about number 6 - it skips it. New code finds no number 6 in old data - it falls back to the default.
**Protobuf golden rule: never reuse field numbers.** Once `phone = 4` is deleted, number 4 is permanently "taken". Add `reserved 4;` and the compiler refuses to let anyone reclaim it.
| Action | Protobuf | Avro | JSON |
|---|---|---|---|
| Add field | Safe (new field number) | Safe (default value) | Safe (new key) |
| Delete field | Safe (reserved) | Safe (ignored) | Dangerous (code may crash) |
| Change type | Breaks compatibility | Breaks compatibility | Silently breaks at runtime |
| Rename | Safe (wire = number) | Breaks (wire = name) | Breaks (wire = name) |
| Validation | Compile-time (protoc) | Registry (runtime) | None (runtime only) |
JSON has no built-in schema evolution. Service A adds a field, service B may crash - or silently ignore it. Zero guarantees. For critical systems with dozens of services and independent deploys, Protobuf and Avro are flat-out safer.
Back to that `{ name: "Alice", age: 30 }` from the top of the lesson - the full path is now clear: object -> serialization (JSON/Protobuf/Avro) -> bytes -> TCP -> network -> TCP -> deserialization -> object on the other service. And it is clear how that pipeline keeps working as data schemas evolve.
JSON is sufficient for any task - why complicate things with binary formats?
JSON is ideal for public APIs and debugging, but for high-throughput internal communication, binary formats give 3-10x savings in size and 5-10x in parsing speed
At 100 requests per second, the difference between JSON and Protobuf is invisible. At 100,000 requests/sec, every extra byte equals gigabytes of traffic per day. Google uses Protobuf internally for all services because at their scale, JSON would generate terabytes of extra traffic daily. Binary formats with schemas also catch errors at compile time, not at runtime in production.
Key Ideas
- **Serialization** converts an in-memory object to a format transmittable over a network. Deserialization is the reverse
- **JSON** - text, human-readable, universal. Ideal for public APIs and debugging
- **Protobuf** - binary, schema-first, compact (3-10x smaller than JSON). Standard for gRPC and internal APIs
- **Avro** - binary with embedded schema. Standard for Kafka and Big Data. **MessagePack** - binary JSON without schema
- **Schema evolution** is critical for microservices with independent deploys. Protobuf field numbers and Avro Schema Registry solve this
Related Topics
Serialization is the bridge between data in code and data on the wire:
- TCP/IP and OSI — Serialized bytes are transmitted through TCP/UDP at the transport layer
- gRPC — gRPC uses Protobuf as its default serialization format
- Kafka — Kafka most commonly uses Avro + Schema Registry for messages
Вопросы для размышления
- Designing an API for a mobile app with millions of users. Mobile data is expensive. Which serialization format to choose for API responses and why?
- Why does Kafka use Avro instead of Protobuf, even though Protobuf is more compact? Consider storing messages for months or years.
- 50 microservices across 5 different languages. How does schema evolution in Protobuf help deploy them independently of each other?
Связанные уроки
- it-03 — Data compression targets the same compactness problem via a different mechanism
- bd-01 — Database storage formats apply the same binary encoding principles
- se-03 — API design includes choosing serialization formats for public contracts
- sd-03-scalability — Format choice affects throughput at horizontal scale
- stream-01 — Kafka uses Avro + Schema Registry for event streaming
- comp-31-bytecode