Real-Time Backend

Webhook Delivery

A payment provider sent a 'payment.failed' webhook. The endpoint returned 200 OK, but the handler crashed immediately after sending the response, before the event was processed. The user never got the payment failure notification. No retry happened - the provider saw the 200 and considered delivery successful. A week later the customer calls: 'why did you charge my card but not create the order?' Without a DLQ and a delivery log, this incident is impossible to investigate.

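  • **Stripe:** retries for up to 3 days with exponential backoff (live mode), dashboard with a full delivery log including response body. At-least-once semantics: the same event can be delivered more than once. Recommends recording processed event IDs and skipping any that were already handled.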
  • **Svix (webhook infrastructure):** used by Clerk, Resend, Loops. 90-day audit trail of every attempt. Auto-disable endpoint at >50% failure rate. Bulk replay via API - leader in webhook-as-a-service.
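  • **AWS SQS + Lambda:** the standard DLQ pattern in AWS. SQS moves a message to the DLQ once it has been received maxReceiveCount times without being deleted, as configured in the source queue's redrive policy. CloudWatch exposes queue depth and Lambda error metrics out of the box, from which a success rate can be derived.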
  • **Shopify:** merchant-facing dashboard with 5 days of webhook delivery history. Manual retry button right in the UI. A 40% reduction in support tickets after adding that UI, per their internal data.

At-least-once Delivery

At-least-once delivery is a guarantee: an event will reach the receiver at least once, even if the first attempt fails. The alternative, exactly-once, is practically unachievable in distributed systems because of the two-generals problem. At-least-once is a reasonable compromise, provided the receiver is idempotent.

A classic problem: the sender delivers a webhook, the receiver processes the request and performs all the side effects, but the connection drops while the 200 OK is being sent back. The sender never gets the confirmation, so it decides delivery failed and retries. The receiver gets the same webhook a second time. Without idempotency that means a duplicate row in the DB, a double charge, and a duplicate email.

Receiver idempotency is implemented via event_id: before processing, check whether this event_id has already been processed (Redis SET NX or a unique index in the DB). If yes, return 200 OK without re-processing. Stored IDs need long enough retention: Stripe retries for up to 3 days, so the TTL should be at least 72 hours.
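The check described above can be sketched as follows. This is a minimal illustration, not a production implementation: `ProcessedEvents` is a hypothetical in-memory stand-in for Redis `SET NX` with a TTL, and `handle_webhook` and the `process` callback are assumed names.

```python
import time
from typing import Callable, Dict

RETENTION_SECONDS = 72 * 60 * 60  # the 72-hour minimum retention from the text


class ProcessedEvents:
    """Hypothetical in-memory stand-in for Redis SET NX with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._expires_at: Dict[str, float] = {}
        self._clock = clock

    def claim(self, event_id: str, ttl: float = RETENTION_SECONDS) -> bool:
        """Return True only for the first claim of event_id within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(event_id)
        if expiry is not None and expiry > now:
            return False  # already claimed and not yet expired
        self._expires_at[event_id] = now + ttl
        return True

    def release(self, event_id: str) -> None:
        """Forget a claim so that a later retry can process the event."""
        self._expires_at.pop(event_id, None)


def handle_webhook(event: dict, store: ProcessedEvents,
                   process: Callable[[dict], None]) -> int:
    """Return the HTTP status code the receiver would send back."""
    if not store.claim(event["id"]):
        return 200  # duplicate delivery: acknowledge without re-processing
    try:
        process(event)
    except Exception:
        store.release(event["id"])  # let the sender's retry run processing again
        return 500
    return 200
```

Claiming the ID before processing closes the window in which two concurrent deliveries both pass the check. A crash between the claim and the side effects would still leave the event marked as processed, so where the side effects live in the same database, a unique index written in the same transaction is the stronger option.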

A webhook receiver processed the event (charged the card, sent the email), but the connection dropped before sending 200 OK. The sender will retry. How do you prevent a double charge?

Event Ordering

Webhooks do not guarantee delivery order. An 'order.updated' event can arrive before 'order.created' because of network delays, parallel retry queues, or different workers. The receiver has to either process events idempotently regardless of order, or reconstruct order itself.

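Stripe's documentation states that events are not guaranteed to be delivered in the order in which they were generated. The receiver should therefore not depend on arrival order. Two common strategies: compare a per-object sequence or version number when the provider supplies one, or always re-fetch the object via the API after receiving a webhook. Re-fetching guarantees fresh data regardless of event order but adds extra API calls.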
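The sequence-number strategy can be sketched as a guard that only applies an event if it is newer than what has already been applied for that object. The `sequence` field and the `ObjectProjection` class are assumptions for illustration; they are not part of any specific provider's payload.

```python
from typing import Dict


class ObjectProjection:
    """Keeps the latest known status per object, ignoring stale events."""

    def __init__(self) -> None:
        self.last_sequence: Dict[str, int] = {}
        self.status: Dict[str, str] = {}

    def apply(self, object_id: str, sequence: int, status: str) -> bool:
        """Apply the event if it is newer than the last one applied.

        Returns False for stale or duplicate events, which are safe to
        acknowledge with 200 OK and drop.
        """
        if sequence <= self.last_sequence.get(object_id, -1):
            return False
        self.last_sequence[object_id] = sequence
        self.status[object_id] = status
        return True
```

This works when each event carries the full current state of the object. If events carry deltas that all have to be applied, a stale event cannot simply be dropped, and buffering or re-fetching via the API is the safer choice.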

Two webhooks arrive for the same payment: 'payment_intent.processing' with sequence=2 and 'payment_intent.succeeded' with sequence=3 (assume the provider attaches a per-object sequence number). sequence=3 arrived first. What do you do?

Dead Letter Queue for Webhooks

A Dead Letter Queue (DLQ) is storage for events that failed to deliver after every retry. It is not a 'trash bin' - it is an operational observability and manual recovery tool. Without a DLQ, lost events vanish without a trace and the team only hears about the problem from customers.

AWS SQS DLQ automatically receives messages after N failed processing attempts. Shopify displays the DLQ in the merchant dashboard as 'Failed deliveries' with a manual retry button. That cuts support load: the merchant sees what went wrong and, once their endpoint is back, hits retry themselves.
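On the sender side, the "retry until exhausted, then dead-letter" flow might look like the sketch below. `Dispatcher`, `DeadLetter`, the default of 5 attempts, and the doubling backoff are all illustrative assumptions; `send` is a hypothetical callable that performs the HTTP request and returns the status code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeadLetter:
    event: dict
    attempts: int
    last_error: str  # kept so an operator can see why delivery failed


@dataclass
class Dispatcher:
    send: Callable[[dict], int]  # returns the receiver's HTTP status code
    max_attempts: int = 5
    base_delay: float = 1.0  # seconds; doubled after every failed attempt
    sleep: Callable[[float], None] = time.sleep
    dead_letters: List[DeadLetter] = field(default_factory=list)

    def deliver(self, event: dict) -> bool:
        """Try to deliver; after max_attempts failures, park the event in the DLQ."""
        last_error = ""
        for attempt in range(1, self.max_attempts + 1):
            try:
                status = self.send(event)
                if 200 <= status < 300:
                    return True
                last_error = f"HTTP {status}"
            except Exception as exc:  # timeouts, connection resets, etc.
                last_error = repr(exc)
            if attempt < self.max_attempts:
                self.sleep(self.base_delay * 2 ** (attempt - 1))
        self.dead_letters.append(DeadLetter(event, self.max_attempts, last_error))
        return False
```

A real system would persist the dead letters durably and schedule retries from a queue instead of sleeping in-process; the in-memory list here only shows where the event ends up and what metadata is worth keeping next to it.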

A client endpoint was down for 6 hours due to a deploy. The DLQ collected 10,000 events. The endpoint is back. How do you replay?
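One possible approach to a replay like this is to drain the DLQ oldest-first at a bounded rate, so the recovered endpoint is not hit with the whole backlog at once. The sketch below is illustrative: the `replay` function, the `created_at` field, and the `per_second` parameter are assumptions, and `send` is a hypothetical callable returning the HTTP status code.

```python
import time
from typing import Callable, Iterable, List


def replay(events: Iterable[dict], send: Callable[[dict], int],
           per_second: float,
           sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Replay events oldest-first at a bounded rate.

    Returns the events that still failed, so they can stay in the DLQ.
    """
    interval = 1.0 / per_second
    still_failing: List[dict] = []
    for event in sorted(events, key=lambda e: e["created_at"]):
        try:
            ok = 200 <= send(event) < 300
        except Exception:
            ok = False
        if not ok:
            still_failing.append(event)
        sleep(interval)  # simple pacing: at most per_second sends per second
    return still_failing
```

At 50 events per second, 10,000 events take at least 200 seconds to replay. Because the replayed events may overlap with deliveries the receiver already processed, this only works safely when the receiver is idempotent.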

Webhook Dashboard and Observability

A webhook dashboard is the operational UI for monitoring event delivery. Without it, webhook integration is a black box: the client does not know if they are receiving events, how many were lost, which endpoints are problematic. The dashboard turns invisible delivery into an observable process.

  • **Delivery log** - the history of every attempt: timestamp, status code, response time, payload. At least the last 72 hours (the full retry window)
  • **Success rate graph** - percentage of successful deliveries broken down by endpoint and event type
  • **Dead letter list** - DLQ events with a manual replay button and filters by type/time
  • **Latency percentiles** - p50/p95/p99 delivery time. Outliers signal problems with a specific endpoint
  • **Endpoint health** - automatic flagging of problematic endpoints (failure rate >10% over 1 hour) with client alerts
  • **Retry timeline** - visualizes when and why retries were attempted for a specific event
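Several of these panels can be derived from the delivery log alone. The following sketch computes the failure rate over the last hour, the health flag with the 10% threshold mentioned above, and latency percentiles. The `Attempt` record and the `endpoint_health` function are hypothetical names, and a real dashboard would typically compute this in the metrics store, not in application code.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, List


@dataclass
class Attempt:
    timestamp: float   # seconds since epoch
    status_code: int
    latency_ms: float


def endpoint_health(attempts: List[Attempt], now: float,
                    window_s: float = 3600.0,
                    threshold: float = 0.10) -> Dict[str, object]:
    """Summarize one endpoint's delivery attempts within the recent window."""
    recent = [a for a in attempts if now - a.timestamp <= window_s]
    if not recent:
        return {"attempts": 0, "failure_rate": 0.0, "unhealthy": False}

    failures = sum(1 for a in recent if not 200 <= a.status_code < 300)
    failure_rate = failures / len(recent)
    report: Dict[str, object] = {
        "attempts": len(recent),
        "failure_rate": failure_rate,
        "unhealthy": failure_rate > threshold,
    }
    if len(recent) >= 2:  # quantiles() needs at least two data points
        cuts = quantiles([a.latency_ms for a in recent], n=100)
        report.update(p50=cuts[49], p95=cuts[94], p99=cuts[98])
    return report
```

Grouping the attempts by endpoint and event type before calling such a function gives the per-endpoint breakdown that the success rate graph needs.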

The Stripe Webhook Dashboard shows every delivery attempt with response body, headers, and timing. Svix (webhook-as-a-service, used by Clerk, Resend) stores a 90-day audit log. On a successful 200 OK it stores the receiver's response body, which helps debug why processing went sideways even on a successful delivery.

Myth: if a webhook returned 200 OK, the event was processed correctly and you can skip storing delivery history.

Reality: 200 OK only means the HTTP request was accepted. Business logic on the receiver side might have failed, or an exception might have been thrown after the 200 response was sent. A delivery log and an audit trail are required for debugging real incidents.

In production, the gap between 'webhook delivered' and 'event processed correctly' is a source of the worst bugs. Shopify has cases where 200 OK was returned but a DB write on the merchant side failed due to deadlock. Without logs you cannot prove the issue is not on Shopify's side.

A webhook endpoint's success rate dropped from 99% to 60% over the last hour. What is the first diagnostic step?

Summary

  • **At-least-once + idempotency = a pair**: at-least-once guarantees no lost events, idempotency by event_id guarantees no duplicate side effects - together they deliver practical exactly-once at the business-effect level.
  • **DLQ is not a trash bin, it is a recovery tool**: events in the DLQ must be replayable with rate limiting, the client must see them via the dashboard and be able to retry manually.
  • **200 OK is not equal to successful processing**: delivery log with response body, latency percentiles, and endpoint health scoring turn webhook delivery from a black box into an observable process.

Related Topics

Webhook delivery guarantees rest on core patterns of reliable systems:

  • Webhooks (basics) — The previous lesson - HTTP mechanics, HMAC signatures, retry policies. This lesson covers the operational side of reliable delivery
  • Message Queues — DLQ is the direct analog of the Dead Letter Queue in Kafka/SQS/RabbitMQ. Same patterns: retry exhaustion, replay, monitoring
  • Idempotency — At-least-once delivery without receiver idempotency is useless - duplicates create double charges and inconsistent state

Questions for Reflection

  • A client endpoint deliberately returns 200 OK even on processing errors (to avoid triggering retries). How can the webhook provider detect that real processing is failing?
  • How do you design a system where webhook events must be processed strictly in order for each order, but in any order between different orders?
  • A client wants webhook events delivered exactly-once. Explain why this cannot be guaranteed at the infrastructure level and what you can offer instead.

Related Lessons

  • sd-17-rate-limiting