Backend Transport

Serialization: JSON, Protobuf, Avro

Google handles 10+ billion internal RPC calls per second. Switch all of that to JSON and server-to-server traffic balloons 5-10x - petabytes of extra data per day. That is the exact reason Google built Protocol Buffers in 2001, now battle-tested by Uber, Netflix, Spotify, and thousands more.

**Kafka + Avro** at LinkedIn processes 7+ trillion messages per day - Schema Registry ensures 2,000+ services can update independently without breaking the data format
**Discord** switched from JSON to MessagePack for WebSocket messages and cut traffic by 30-40%, critical at 200+ million users
**gRPC + Protobuf** at Dropbox reduced inter-service message size by 8x compared to JSON REST API

Protobuf: from Google internal tool to industry standard

In 2001, Google faced a problem: hundreds of services exchanged data through plain text. Serialization speed and data size became bottlenecks. The team developed Protocol Buffers - a binary format with code generation. In 2008, Google open-sourced it. gRPC (2015) followed - a high-performance RPC framework on top of Protobuf. Today Protobuf is used in Kubernetes, Envoy, and Cloud Spanner. In parallel, Apache Avro (2009) emerged from the Hadoop ecosystem - Doug Cutting solved the same problem for Big Data, but focused on self-describing format for long-term storage.

Предварительные знания

TCP/IP and OSI: The Foundation of Everything

Why Serialization Is Needed

TCP ships **bytes** - a sequence of zeros and ones. Code, on the other hand, deals in objects, structs, and classes. So how does `{ name: "Alice", age: 30 }` in one server's memory show up intact in another's?

**Serialization** turns a data structure into a format that travels over the network or persists to disk. **Deserialization** runs the trip backward: bytes/text back into an object in memory.

Serialization solves three problems: **1)** translating an internal memory representation into a portable format, **2)** crossing language boundaries (JS object → JSON → Python dict), **3)** compactness on the wire.

An object in memory carries pointers, virtual method tables, and alignment padding. Shipping raw memory between two services is impossible even when both run the same language: different runtime versions, different processor architectures (little-endian vs big-endian) - they will never agree.

Why can't raw object memory be sent from one server to another?

JSON and XML: Text Formats

**JSON (JavaScript Object Notation)** is the de-facto standard for web APIs. Douglas Crockford pulled it together in the early 2000s as a lightweight alternative to XML. Human-readable, supported by every language, minimal syntax.

**XML (eXtensible Markup Language)** is JSON's predecessor, still alive in SOAP, configurations (Maven, Android), and enterprise systems. More verbose, but it carries attributes, namespaces, and XSD validation.

Characteristic	JSON	XML
Readability	High	Medium (verbose)
Data size	More compact (~30% less than XML)	Larger (opening/closing tags)
Data types	string, number, boolean, null, array, object	All strings (types via XSD)
Schema validation	JSON Schema (optional)	XSD, DTD (mature ecosystem)
Comments	Not supported	Supported <!-- ... -->
Parsing	Fast (built into browser/JS)	Slower (DOM/SAX parsers)
Usage in 2025	REST API, configs, NoSQL	SOAP, Enterprise, Android, SVG

JSON beat XML on the web not by being better but by being simpler. `JSON.parse()` is baked into every browser. XML is genuinely more expressive: namespaces dodge name collisions, XSLT transforms documents, XPath queries them precisely.

The Achilles heel of text formats is **size and speed**. `"age": 30` burns 8 bytes; the actual number 30 is just 1 byte. Field names repeat in every object across an array. Fine at 1,000 requests per second. At 1,000,000 it falls apart.

What is the main advantage of JSON over binary formats?

Protocol Buffers: Binary Efficiency

**Protocol Buffers (Protobuf)** is Google's binary serialization format. Built in 2001 to power Google's internal inter-service traffic; open-sourced in 2008. Today it is the foundation of gRPC and the de-facto standard for high-load backends.

The defining idea of Protobuf: **schema-first**. Describe the data structure in a `.proto` file, then a compiler generates serialization/deserialization code in any language. No hand-rolled parsing.

**Field numbers** (1, 2, 3...) are not indices - they are identifiers. They get encoded into the binary in place of field names. Instead of `"name":` (6 bytes), Protobuf writes a single byte holding the field number and type. Multiply that across thousands of messages and the savings stack up fast.

Protobuf encodes numbers as **varints**: small numbers eat fewer bytes. 1 - one byte, 300 - two, 1,000,000 - three. JSON always encodes numbers as text: '1' = 1 byte, '300' = 3 bytes, '1000000' = 7 bytes vs 3 in Protobuf.

Protobuf is no silver bullet. Binary data is unreadable without the .proto file. `curl` returns garbage instead of clean JSON. Debugging is harder. The trade: serialization is 5-10x faster than JSON and the payload is 3-10x smaller.

Why does Protobuf use field numbers (1, 2, 3) instead of field names?

Avro and MessagePack: Other Binary Formats

Protobuf is not the only binary format worth knowing. Two others stand out: **Apache Avro** (the Big Data and Kafka world) and **MessagePack** (fast binary JSON).

**Apache Avro** comes from Doug Cutting (Hadoop's author), built for the Apache ecosystem. The defining difference from Protobuf: **the schema travels with the data**. An Avro file leads with a schema header, followed by data blocks. A reader decodes the data without needing a separate .proto file.

**Kafka + Avro** is the gold standard for event streaming. Schema Registry (Confluent) stores all versions of Avro schemas. A producer publishes an event specifying the schema id. The consumer automatically retrieves the schema from the Registry and decodes the message.

**MessagePack** sells itself as "JSON, but faster and more compact". Unlike Protobuf and Avro, MessagePack **needs no schema** - it is schemaless like JSON. Field names stick around, just encoded more efficiently.

Criterion	Protobuf	Avro	MessagePack
Schema	Required (.proto)	Embedded in file/Registry	Not needed (schemaless)
Field encoding	By numbers (1, 2, 3)	By order in schema	By names (like JSON)
Data size	Minimal	Compact	Medium (names preserved)
Ecosystem	gRPC, Google Cloud	Kafka, Hadoop, Spark	Redis, MessagePack-RPC
Code generation	Yes (protoc)	Yes (avro-tools)	No (schemaless)
Ease of adoption	Medium (needs compiler)	Medium (needs Registry)	High (drop-in JSON)

Format choice tracks the use case. **Protobuf** - gRPC and strictly typed APIs. **Avro** - Kafka and Big Data. **MessagePack** - when faster JSON is needed without an architecture overhaul.

How does Avro differ from Protobuf in its approach to schemas?

Schema Evolution: Don't Break Production

Schemas change. A `phone` field gets added to `User`. Service A is updated and sends `phone`. Service B is not - it has never heard of `phone`. What happens?

**Schema evolution** is a format's ability to keep working when producer and consumer run **different schema versions**. Three flavors of compatibility:

Protobuf is engineered for schema evolution from day one. **Field numbers** carry the weight. Adding a new field hands it a **new number**. Old code does not know about number 6 - it skips it. New code finds no number 6 in old data - it falls back to the default.

**Protobuf golden rule: never reuse field numbers.** Once `phone = 4` is deleted, number 4 is permanently "taken". Add `reserved 4;` and the compiler refuses to let anyone reclaim it.

Action	Protobuf	Avro	JSON
Add field	Safe (new field number)	Safe (default value)	Safe (new key)
Delete field	Safe (reserved)	Safe (ignored)	Dangerous (code may crash)
Change type	Breaks compatibility	Breaks compatibility	Silently breaks at runtime
Rename	Safe (wire = number)	Breaks (wire = name)	Breaks (wire = name)
Validation	Compile-time (protoc)	Registry (runtime)	None (runtime only)

JSON has no built-in schema evolution. Service A adds a field, service B may crash - or silently ignore it. Zero guarantees. For critical systems with dozens of services and independent deploys, Protobuf and Avro are flat-out safer.

Back to that `{ name: "Alice", age: 30 }` from the top of the lesson - the full path is now clear: object -> serialization (JSON/Protobuf/Avro) -> bytes -> TCP -> network -> TCP -> deserialization -> object on the other service. And it is clear how that pipeline keeps working as data schemas evolve.

JSON is sufficient for any task - why complicate things with binary formats?

JSON is ideal for public APIs and debugging, but for high-throughput internal communication, binary formats give 3-10x savings in size and 5-10x in parsing speed

At 100 requests per second, the difference between JSON and Protobuf is invisible. At 100,000 requests/sec, every extra byte equals gigabytes of traffic per day. Google uses Protobuf internally for all services because at their scale, JSON would generate terabytes of extra traffic daily. Binary formats with schemas also catch errors at compile time, not at runtime in production.

Key Ideas

**Serialization** converts an in-memory object to a format transmittable over a network. Deserialization is the reverse
**JSON** - text, human-readable, universal. Ideal for public APIs and debugging
**Protobuf** - binary, schema-first, compact (3-10x smaller than JSON). Standard for gRPC and internal APIs
**Avro** - binary with embedded schema. Standard for Kafka and Big Data. **MessagePack** - binary JSON without schema
**Schema evolution** is critical for microservices with independent deploys. Protobuf field numbers and Avro Schema Registry solve this

Вопросы для размышления

Designing an API for a mobile app with millions of users. Mobile data is expensive. Which serialization format to choose for API responses and why?
Why does Kafka use Avro instead of Protobuf, even though Protobuf is more compact? Consider storing messages for months or years.
50 microservices across 5 different languages. How does schema evolution in Protobuf help deploy them independently of each other?

Связанные уроки

it-03 — Data compression targets the same compactness problem via a different mechanism
bd-01 — Database storage formats apply the same binary encoding principles
se-03 — API design includes choosing serialization formats for public contracts
sd-03-scalability — Format choice affects throughput at horizontal scale
stream-01 — Kafka uses Avro + Schema Registry for event streaming
comp-31-bytecode