System Design: API Platform like Stripe
Stripe processes $1 trillion in payments per year. Every day, millions of developers call stripe.charge(). When a network timeout makes a client retry, a request without an idempotency key can be charged twice; at this volume, that would mean millions of double charges. With idempotency keys, there are zero double charges and retries are transparent to the caller. This lesson covers the infrastructure behind the simplest API call.
- **Stripe Idempotency Keys** handle thousands of duplicate requests daily; without this mechanism, network timeouts would make double charges the norm.
- **Twilio** generates SDKs for 9 languages from a single OpenAPI specification: a change in the API automatically propagates to all SDKs via a CI/CD pipeline, without manual work.
- **GitHub API versioning** kept /v3 frozen for 10 years while developing GraphQL /v4, maintaining both versions in parallel for 2M+ integrations.
Idempotency Keys
An idempotency key is a unique client-provided value, typically a UUID, that lets the server recognize a retry of a request it has already seen, so that repeated requests produce the same result. Stripe accepts an Idempotency-Key header on POST requests. The server caches responses for 24 hours: a duplicate request returns the cached response without re-executing the operation.
Idempotency keys solve the double-charge problem: the client sends a charge request, the network times out, and the client retries. Without an idempotency key, the customer is charged twice. With one, the server returns the cached result of the first charge. Stripe processes thousands of duplicate requests daily via this mechanism.
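The replay behavior described above can be sketched in a few lines. This is a minimal in-memory illustration, not Stripe's implementation: `IdempotencyStore`, `create_charge`, and the response shape are hypothetical names, and only the 24-hour window comes from the text.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

TTL_SECONDS = 24 * 60 * 60  # cache window taken from the text above


@dataclass
class IdempotencyStore:
    """In-memory stand-in for the server-side response cache (hypothetical)."""
    clock: Callable[[], float] = time.time
    _entries: Dict[str, Tuple[float, dict]] = field(default_factory=dict)

    def execute(self, key: str, operation: Callable[[], dict]) -> dict:
        now = self.clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < TTL_SECONDS:
            return cached[1]      # duplicate: replay the stored response
        response = operation()    # first request (or expired key): run the write
        self._entries[key] = (now, response)
        return response


charges = []  # stands in for the payments table


def create_charge() -> dict:
    charges.append({"amount": 1000})
    return {"id": f"ch_{len(charges)}", "amount": 1000}


store = IdempotencyStore()
first = store.execute("abc123", create_charge)
retry = store.execute("abc123", create_charge)  # client retry after a timeout
```

Here `retry` is the same response as `first`, and `charges` holds a single entry. A production version would also need a durable store and a way to handle a retry that arrives while the first request is still in flight; both are left out of this sketch.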
A client sends POST /charge with Idempotency-Key: abc123. The request times out on the client, but the server actually processed it. The client retries with the same key. What does the server return?
Webhook Delivery
A webhook is an HTTP callback: when an event occurs (payment.succeeded), the platform sends a POST to the client URL. The challenge: client servers can be slow, down, or return errors. Reliable delivery requires retry logic, exponential backoff, and idempotent handlers.
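Because the platform retries, the same event can reach the client more than once, which is why handlers need to be idempotent. A minimal sketch of a deduplicating receiver follows; it assumes each event carries a unique `id` and a `type` field (an assumption about the payload shape), and it keeps seen IDs in memory where a real service would use a database.

```python
from typing import Callable, Dict, Set


class WebhookReceiver:
    """Hypothetical client-side receiver that processes each event ID once."""

    def __init__(self, handlers: Dict[str, Callable[[dict], None]]):
        self._handlers = handlers
        self._seen: Set[str] = set()

    def receive(self, event: dict) -> int:
        event_id = event["id"]
        if event_id in self._seen:
            return 200  # already processed: acknowledge so retries stop
        handler = self._handlers.get(event["type"])
        if handler is not None:
            handler(event)  # if this raises, the ID is not recorded
        self._seen.add(event_id)
        return 200


fulfilled = []
receiver = WebhookReceiver({"payment.succeeded": lambda e: fulfilled.append(e["id"])})
event = {"id": "evt_1", "type": "payment.succeeded"}
receiver.receive(event)
receiver.receive(event)  # redelivery of the same event
```

After both calls, `fulfilled` contains `evt_1` only once. Recording the ID only after the handler succeeds means a failed attempt is processed again on the next delivery.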
An HMAC-SHA256 signature over the webhook payload prevents spoofing: only the platform and the endpoint owner know the shared secret. The client recomputes the signature and compares it in constant time: `crypto.timingSafeEqual(expected_sig, received_sig)`. Without signature verification, any HTTP client can send fake webhook events to the endpoint.
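The same check can be sketched in Python, where `hmac.compare_digest` plays the role of `crypto.timingSafeEqual`. The function names and the hex encoding of the signature are assumptions for illustration; real platforms define their own header format and often sign a timestamp as well to limit replay, which is omitted here.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Compute the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_sig: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected_sig = sign_payload(secret, payload)
    return hmac.compare_digest(expected_sig.encode(), received_sig.encode())
```

Verification has to run over the raw bytes of the body as received; re-serializing parsed JSON can change whitespace or key order and break the comparison.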
Client webhook endpoint returns 200 OK after 45 seconds (slow processing). Stripe timeout is 30s. What happens?
API Versioning
API versioning allows the platform to evolve without breaking clients. Stripe uses date-based versioning, selected with a header such as Stripe-Version: 2024-06-20. Each account is pinned to the version that was active when it first integrated. Breaking changes require a new version.
Stripe supports 50+ API versions simultaneously. Each version is a configuration layer that transforms the response. New fields are added to all versions; removed fields are shimmed for old versions. The cost: every response transformation is a potential bug.
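One way to picture the "configuration layer" idea is a chain of downgrade functions: handlers always build the newest response shape, and each older pinned version gets the changes introduced after it undone in turn. The sketch below is hypothetical throughout; the version dates and the two field changes are invented for illustration and are not Stripe's actual changelog.

```python
from typing import Callable, List, Tuple

Transform = Callable[[dict], dict]


def restore_flat_last4(resp: dict) -> dict:
    """Undo a hypothetical 2024-06-20 change that nested last4 under 'card'."""
    out = dict(resp)
    card = out.pop("card", {})
    out["card_last4"] = card.get("last4")
    return out


def restore_status_name(resp: dict) -> dict:
    """Undo a hypothetical 2023-10-16 rename of status 'paid' to 'succeeded'."""
    out = dict(resp)
    if out.get("status") == "succeeded":
        out["status"] = "paid"
    return out


# Newest first: each entry converts a response from that version to the one before it.
DOWNGRADES: List[Tuple[str, Transform]] = [
    ("2024-06-20", restore_flat_last4),
    ("2023-10-16", restore_status_name),
]


def render_for_version(latest: dict, pinned_version: str) -> dict:
    resp = latest
    for introduced_in, downgrade in DOWNGRADES:
        if pinned_version < introduced_in:  # ISO dates compare correctly as strings
            resp = downgrade(resp)
    return resp
```

An account pinned to `2023-10-16` gets only the first downgrade applied, while an account pinned to an earlier date gets both. Every function in the chain runs on live traffic, which is where the "potential bug" cost comes from.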
The API returns the field 'amount' as an integer (cents). The plan is to change it to a string (for multi-currency support). Is this a breaking change?
Rate Limit Design
Rate limiting for an API platform works at multiple levels: per API key (100 req/s on the free tier, 1000 req/s for enterprise), per endpoint (POST /charges: 100/min), and per IP for unauthenticated endpoints. A throttled request receives HTTP 429 with a Retry-After header; clients add jitter to their retry delay to prevent a thundering herd.
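The per-key level can be sketched with a token bucket. The text does not say which algorithm is used, so the token bucket is an assumption, as are the class names; the tier limits are the ones quoted above. Time is passed in explicitly to keep the example deterministic.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

TIER_LIMITS = {"free": 100, "enterprise": 1000}  # requests per second, from the text


@dataclass
class TokenBucket:
    rate: float        # tokens refilled per second
    capacity: float    # maximum burst size
    tokens: float
    updated_at: float

    def try_acquire(self, now: float) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds)."""
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        wait = (1 - self.tokens) / self.rate
        return False, max(1, math.ceil(wait))  # Retry-After is whole seconds


class RateLimiter:
    def __init__(self) -> None:
        self._buckets: Dict[str, TokenBucket] = {}

    def check(self, api_key: str, tier: str, now: float) -> Tuple[int, dict]:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            limit = TIER_LIMITS[tier]
            bucket = TokenBucket(rate=limit, capacity=limit, tokens=limit, updated_at=now)
            self._buckets[api_key] = bucket
        allowed, retry_after = bucket.try_acquire(now)
        if allowed:
            return 200, {}
        return 429, {"Retry-After": str(retry_after)}
```

Per-endpoint and per-IP limits could reuse the same bucket with a different lookup key, for example `(api_key, "POST /charges")`. In a multi-node gateway the bucket state would live in a shared store rather than in process memory.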
Stripe recommends exponential backoff with jitter for all retries: `sleep(min(2^attempt * 100ms + random(100ms), 30s))`. This naturally distributes retries over time even when many clients hit rate limits simultaneously.
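The formula above translates directly into code. In this sketch, `backoff_delay_ms` follows the formula term by term, while `call_with_retries`, its `send` callback, and the default of five attempts are hypothetical additions for illustration.

```python
import random
import time
from typing import Callable, Optional

BASE_MS = 100      # base delay from the formula
JITTER_MS = 100    # upper bound of the random term
CAP_MS = 30_000    # 30 s ceiling


def backoff_delay_ms(attempt: int, rng: random.Random) -> float:
    """min(2^attempt * 100ms + random(100ms), 30s)"""
    return min((2 ** attempt) * BASE_MS + rng.uniform(0, JITTER_MS), CAP_MS)


def call_with_retries(
    send: Callable[[], int],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Call send() until it returns something other than 429 or attempts run out."""
    rng = rng or random.Random()
    status = send()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        sleep(backoff_delay_ms(attempt, rng) / 1000)  # sleep() takes seconds
        status = send()
    return status
```

With these constants, attempt 0 waits between 100 and 200 ms, attempt 3 between 800 and 900 ms, and from attempt 9 onward the delay is capped at 30 s. Passing `sleep` and `rng` as parameters keeps the function testable without real waiting.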
The API returns 429 with Retry-After: 1. The client retries after exactly 1 second. Why is this a problem?
SDK Generation
SDK generation automatically creates client libraries from OpenAPI/Protobuf specifications. Stripe supports official SDKs for 7 languages generated from one source of truth. Benefits: type safety, built-in retry logic, idempotency key management, pagination helpers.
Twilio generates SDKs for 9 languages from a single OpenAPI specification via a CI/CD pipeline. A change to the API spec automatically generates updated SDKs and opens pull requests for human review. This eliminates the risk of SDK drift from the actual API.
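As an illustration of the pagination helpers mentioned above, here is a sketch of what a generated SDK might wrap around a cursor-paginated list endpoint. The page shape (`data`, `has_more`) and the `starting_after` cursor parameter are assumptions for this example, and `fake_list_charges` is a stand-in for a real HTTP call.

```python
from typing import Callable, Iterator, Optional


def auto_paginate(fetch_page: Callable[..., dict], page_size: int = 2) -> Iterator[dict]:
    """Yield every item across pages, following the cursor until has_more is false."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(limit=page_size, starting_after=cursor)
        yield from page["data"]
        if not page["has_more"] or not page["data"]:
            return
        cursor = page["data"][-1]["id"]  # last item's ID is the next cursor


ITEMS = [{"id": f"ch_{i}"} for i in range(1, 6)]  # fake backend data


def fake_list_charges(limit: int, starting_after: Optional[str] = None) -> dict:
    start = 0
    if starting_after is not None:
        ids = [item["id"] for item in ITEMS]
        start = ids.index(starting_after) + 1
    return {"data": ITEMS[start:start + limit], "has_more": start + limit < len(ITEMS)}


all_ids = [charge["id"] for charge in auto_paginate(fake_list_charges)]
```

The caller iterates over charges without seeing page boundaries. Because logic like this lives in the generator templates rather than in each SDK, a fix made once in the template reaches every generated language.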
Misconception: "API versioning is just /v1/ and /v2/ in the URL; nothing else matters."
In reality, versioning is a complete strategy: the semantics of breaking vs non-breaking changes, shadow testing of new versions against old ones, migration guides, a deprecation timeline (minimum 12 months), and SDK updates. Stripe supports 50+ versions simultaneously.
Why it matters: companies integrate APIs for years. A bank may not update its Stripe SDK for years. A public API is a contract; breaking it destroys trust and breaks production for clients.
An SDK generated from OpenAPI spec has a bug in pagination logic. All 9 language SDKs are affected. What is the correct fix process?
Summary
- **Idempotency keys** are the first line of defense against duplication: a client-generated UUID plus a server-side cache makes it safe to retry any write request, even after a network timeout.
- **Webhook delivery** requires the same reliability as regular APIs: a persistent queue, exponential backoff for up to 72 hours, HMAC signature verification, and idempotent handlers on the client side.
- **Breaking change = new version** is a strict rule. Non-breaking changes (such as adding fields) do not require a new version. Supporting old versions for years is the price of a public API.
Related Topics
The API platform combines patterns from several areas of the course:
- API Gateway: the platform's rate limiting is implemented in the API Gateway layer, where Kong/Envoy applies per-key limits before requests reach business logic.
- Dead Letter Queue: the webhook delivery engine follows the DLQ pattern. Failed deliveries go to a retry queue with exponential backoff, using the same algorithms as in the DLQ lesson.
- Transactional Outbox: webhook events are generated via the Outbox pattern. The payment.succeeded event is written to the DB atomically with the main transaction; a worker then reads the outbox and delivers the webhooks.
Questions for Reflection
- Stripe supports 50+ API versions simultaneously. What is the real cost: how many engineer-hours per year are spent supporting legacy versions vs the benefit of backward compatibility?
- Suppose an idempotency key is scoped to a user (userId + key). If the key is just a counter (1, 2, 3, ...), can it be attacked? How can keys be made unpredictable without losing convenience?
- An SDK is generated from OpenAPI. A bug in business logic (an incorrect calculation) is found in production, and a hotfix must be released. Which SDK versions need updating: only the latest, or all supported versions?