Stream Processing

Kafka Connect and Schema Registry

LinkedIn: 200 engineers and 50 databases

2016. LinkedIn. An engineering team of 200 maintaining 50 different data sources: MySQL, Oracle, HDFS, Voldemort (an in-house KV store), Espresso, dozens of external APIs. Each Kafka integration - a separate script. No unified monitoring, no standard format, no fault tolerance. When something failed - no one knew which of the 200 scripts had broken. Kafka Connect was built inside LinkedIn as the answer to this chaos, open-sourced in 2015 and shipped in Apache Kafka 0.9. Today Confluent Hub offers 200+ ready-made connectors - MySQL, PostgreSQL, MongoDB, Salesforce, Snowflake, S3 - each with SMT and Schema Registry support.

Kafka Connect: the end of integration scripts

2016. LinkedIn. 200 engineers. 50 different databases, storage systems, and external services. Each integration - a separate Python or Java script, written by someone at some point, maintained by no one. When MySQL moved to a new host, 12 such scripts failed simultaneously. When a new field was added to a schema, someone had to find all those scripts and fix them manually. Kafka Connect emerged as the answer to this chaos: one JSON configuration instead of thousands of lines of code.

Kafka Connect is a framework for moving data between Kafka and external systems. Not a library, not a pattern - a framework with its own runtime, REST API, and ecosystem of pre-built connectors. Source connector reads from an external source (PostgreSQL, MongoDB, S3, HTTP API) and writes to Kafka. Sink connector reads from Kafka and writes to an external system (Elasticsearch, Snowflake, BigQuery, S3). One worker can run dozens of connectors in parallel.

Workers come in two flavors. Standalone - a single process, file-based config, no fault tolerance. Distributed - a cluster of workers, REST API configuration, automatic task rebalancing when a node dies. Production always means distributed: when a worker goes down, its tasks migrate to surviving peers without data loss. The same mechanism as Kafka consumer groups - just for connectors.

Single Message Transforms (SMT) are a pipeline of transformations applied to each message before it hits Kafka or after it leaves. Rename a field, add a timestamp, mask PII data, filter null values - all without writing code. Chain: `ReplaceField -> MaskField -> InsertField`. Key constraint: SMT is stateless and operates on one message at a time. For stream joins or aggregations - that is Kafka Streams or Flink territory.

Debezium is the most popular source connector for CDC (Change Data Capture). Instead of timestamp-based polling, it reads the PostgreSQL WAL or MySQL binlog directly. Latency drops from seconds to milliseconds. But it requires replication configuration at the database level and a replication slot. More in the CDC lesson (stream-13).

A connector runs in distributed mode. One worker crashes. What happens to its tasks?

Schema Registry: a contract that cannot be broken

Kafka stores bytes. Just bytes. Producers put them in, consumers read them out. This gives flexibility - but creates a disaster when a producer changes its message format and consumers have no idea. That is exactly what happened at Uber in 2017: a geolocation team added a field to a Protobuf schema without notifying downstream consumers. Three services crashed with `UnmarshallingException`. Schema Registry is a central schema store that makes such incidents structurally impossible.

Confluent Schema Registry is an HTTP service storing schema versions for each Kafka topic. When a producer publishes an Avro-serialized message, the client automatically registers the schema and prepends a 5-byte header to every message: magic byte (0x00) + schema ID (4 bytes, int32). The consumer reads the header, fetches the schema from Registry by ID, and deserializes. The schema travels once at registration - not in every message. At 1 million messages per second, this saves gigabytes of network traffic.

Avro is not the only format but the most widely used in the Kafka ecosystem. The schema is defined in JSON; data is stored in a compact binary format with no field names - just values in schema order. 100 bytes of JSON becomes 20-30 bytes of Avro. Protobuf is popular in gRPC stacks and has better cross-language support. JSON Schema is for cases where readability matters more than size, or when JSON producers already exist. Schema Registry supports all three.

Subject naming strategy determines how Schema Registry maps schemas to topics. TopicNameStrategy (default): one key-schema and one value-schema per topic. RecordNameStrategy: one schema per record type regardless of topic - useful when multiple event types share one topic. SchemaNameStrategy: explicit subject name. The chosen strategy directly affects how strictly compatibility is enforced.

A producer sends 1 million Avro messages per second. How many times does the client call Schema Registry for the schema?

Schema Evolution: changing contracts without outages

Schemas change. Products grow, requirements shift, old fields become obsolete. The question is not whether the schema will change - it is whether that change will break consumers reading old messages or producers writing with new schemas. Schema Registry solves this through compatibility modes, validated every time a new schema version is submitted.

BACKWARD is the most common mode. The new schema version must be able to read messages written with the old schema. This means: new fields must have a default value, only fields with defaults can be deleted. Deployment scenario: first deploy new consumers with the new schema - they can read both old and new messages. Then deploy the new producer. Order: consumers before producers.

FORWARD is the mirror guarantee. Old consumers must be able to read new messages. This means: new required fields are forbidden, any field can be deleted (consumers simply ignore unknown fields). Deployment scenario: producers first, consumers second. Order: producers before consumers. In practice FORWARD is less common - controlling consumer deployment order is usually preferable.

FULL requires both conditions simultaneously. The strictest mode: only fields with defaults can be added or deleted. Deployment order does not matter. The cost is reduced flexibility. TRANSITIVE variants (`BACKWARD_TRANSITIVE`, `FORWARD_TRANSITIVE`, `FULL_TRANSITIVE`) check compatibility not just with the previous version but with all historical versions. Used when a topic contains messages from multiple old schema versions - for example, 7-day retention with rare schema changes.

One-direction rule: with BACKWARD, always deploy consumers before producers. One way to enforce this is feature flags on the producer side. The producer continues writing with the old schema until monitoring confirms all consumers have been updated.

FULL compatibility is the safest mode and should be used everywhere

FULL restricts flexibility without real benefit in most cases. BACKWARD covers 90% of scenarios - controlled deployment of consumers before producers is sufficient.

FULL forbids deleting any field without a default and adding any required field. The real need for FULL arises only with chaotic deployments where ordering cannot be controlled, or with multiple independent producer teams without coordination. In all other cases FULL is an unnecessary constraint.

Schema Registry is in BACKWARD_TRANSITIVE mode. The topic contains messages from versions v1, v2, and v3 (current). A team wants to register v4. What does Schema Registry check?

Key ideas

**Kafka Connect** is a no-code integration framework: source connectors move data from external systems into Kafka, sink connectors move it out. Distributed mode provides fault tolerance through automatic task rebalancing.
**SMT (Single Message Transforms)** - a stateless per-message transformation pipeline at the connector level. Rename fields, mask PII, filter nulls - without writing code. For stateful operations (join, aggregation) - Kafka Streams or Flink.
**Schema Registry** stores Avro/Protobuf/JSON Schema versions. Each message carries a 5-byte header with schema_id. Consumers fetch the schema by ID once, then cache it.
**BACKWARD** - new consumer reads old messages (deploy consumers first). **FORWARD** - old consumer reads new messages (deploy producers first). **FULL** - both directions. **TRANSITIVE** - compatibility checked against all historical versions.
**Debezium** + Kafka Connect = CDC without polling: reads PostgreSQL WAL or MySQL binlog, millisecond latency.

Вопросы для размышления

In what scenario is FORWARD compatibility preferable to BACKWARD - when consumer deployment order cannot be controlled?
Why does Avro store only values in binary format (no field names), and why is this not a problem for deserialization?
Is it possible to use Kafka Connect without Schema Registry - and what is the cost?

Связанные уроки

stream-04 — Kafka architecture is the foundation for understanding Connect
stream-05 — Kafka Streams complements Connect in the ecosystem
stream-13 — CDC is built on top of Kafka Connect (Debezium)
db-31-migrations — Schema evolution mirrors the DB migration problem
db-34-lsm — Storage engine knowledge helps configure sink connectors
ds-05-replication — Replication context helps understand offset semantics
db-02-relational-model