Node.js Internals

Error Handling: Error Processing

Imagine: your production service processes a million requests a day. One invalid JSON from a user - and the entire server crashes because you forgot `try/catch`. Or deploying a new version - and 10,000 active WebSocket connections are dropped because there was no graceful shutdown. Or an external API is unavailable for 5 minutes - and your service hangs because there is no timeout and circuit breaker.

**GitHub outage 2018:** A database failure led to a cascading failure of all services. There was no circuit breaker - each service tried to connect to the dead database, exhausted all connection pools, and crashed. Solution: circuit breaker + fallback to read-only mode.
**Slack incident 2020:** During deployment, there was no graceful shutdown. Active WebSocket connections were interrupted, and users lost messages. 100,000+ users were disconnected simultaneously. Solution: graceful shutdown with a drain period of 30s + health check.
**AWS Lambda cold start:** If unhandledRejection is not handled, the function crashes without logs. Debugging is impossible - no stack trace, no metrics. Solution: global handlers + structured logging + graceful error boundaries.

Types of Errors and the Error Class

Imagine: you are writing a server in Node.js. The user sent an invalid JSON - this is an **operational error** (expected error). You forgot to check `null` before `.toString()` - this is a **programmer error** (bug). The first should be handled by returning a 400; the second should be logged, and the code fixed.

**Operational errors** are expected issues, part of the normal operation of an application: network is unavailable, file not found, database rejected the request, user entered incorrect data. They **need** to be handled. **Programmer errors** are bugs in the code: accessing an undefined property, passing an incorrect type of argument, infinite recursion. They **cannot** be handled - the code needs to be fixed.

**Golden Rule:** Handle operational errors. For programmer errors, log the error and crash the process. Do not attempt to recover from a programmer error - the application's state may be unpredictable.

**Error class** in Node.js contains:
- `message` - error description
- `stack` - stack trace (where the error occurred)
- `name` - type of error (Error, TypeError, RangeError, ...)
- Additional fields for specific errors: `code` (ENOENT, ECONNREFUSED), `syscall`, `errno`
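One way the operational/programmer distinction might be carried in code is a small error hierarchy. This is a minimal sketch, not a Node.js API: the `AppError` and `NotFoundError` classes, the `isOperational` flag, and the `E_NOT_FOUND` code are all hypothetical names chosen for illustration.

```typescript
// Hypothetical base class for expected (operational) errors.
class AppError extends Error {
  readonly statusCode: number;
  readonly code: string;
  readonly isOperational: boolean;

  constructor(message: string, statusCode: number, code: string, isOperational = true) {
    super(message);
    // Keep `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.statusCode = statusCode;
    this.code = code;
    this.isOperational = isOperational;
  }
}

// Illustrative subclass: a resource that does not exist.
class NotFoundError extends AppError {
  constructor(resource: string) {
    super(`${resource} not found`, 404, 'E_NOT_FOUND');
  }
}

// Anything that is not an AppError is treated as a programmer error (a bug).
function isOperational(err: unknown): boolean {
  return err instanceof AppError && err.isOperational;
}

const notFound = new NotFoundError('User');
console.log(notFound.name, notFound.statusCode, notFound.code);
console.log(isOperational(notFound));
console.log(isOperational(new TypeError('x is not a function')));
```

With a helper like `isOperational`, a top-level handler can decide whether to answer the client with an error status or to log the error and let the process exit.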

Stack trace - crime map

Stack trace shows the code execution path up to the error:

```
Error: User not found
    at UserService.getUser (/app/services/user.service.ts:45:11)
    at UserController.getProfile (/app/controllers/user.controller.ts:23:28)
    at Layer.handle [as handle_request] (/app/node_modules/express/lib/router/layer.js:95:5)
```

**Read from bottom to top:**
1. Express called your controller
2. The controller's `getProfile` called `UserService.getUser`
3. An error was thrown on line 45 of user.service.ts

**Important:** In production, minimize the stack trace (remove node_modules), but keep source maps for debugging.

**Antipattern:** `catch (err) { console.log(err) }` - this is NOT error handling! You logged it and continued as if nothing happened. As a result, the user receives a successful response with `undefined` data, and after 5 minutes the application crashes with a mysterious error.
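By contrast, handling an error means deciding what happens next. The sketch below assumes a hypothetical `handleCreate` function and a `save` callback that stands in for a real database call. It turns a JSON `SyntaxError` into a 400 response and rethrows anything it does not recognize.

```typescript
interface HttpResult {
  status: number;
  body: unknown;
}

// `save` is a hypothetical stand-in for a real persistence call.
function handleCreate(rawBody: string, save: (data: unknown) => void): HttpResult {
  let data: unknown;
  try {
    data = JSON.parse(rawBody);
  } catch (err) {
    if (err instanceof SyntaxError) {
      // Operational error: tell the client what went wrong and stop here.
      return { status: 400, body: { error: 'Request body is not valid JSON' } };
    }
    throw err; // Anything else is unexpected: let it propagate.
  }
  save(data);
  return { status: 201, body: data };
}

const saved: unknown[] = [];
console.log(handleCreate('{"name": "Ann"}', (d) => { saved.push(d); }).status);
console.log(handleCreate('{not json', (d) => { saved.push(d); }).status);
```

The important difference from the antipattern is that the invalid request never reaches `save`, and the caller gets an explicit failure instead of a "successful" response with `undefined` data.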

Your API endpoint receives JSON, parses it, and saves it to the database. A SyntaxError was thrown during parsing. What type of error is this, and how should it be handled?

Asynchronous errors

In synchronous code, `try/catch` works perfectly. But in the asynchronous world of Node.js, errors can "get lost" and crash the process. Let's explore the evolution from callbacks to async/await and find out where to cushion the fall.

**unhandledRejection** - the most dangerous error in Node.js. Until version 15, the process did NOT crash by default; it only printed a warning. Since version 15, an unhandled rejection **crashes** the process by default. Always handle rejected promises!
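As a last line of defence, global handlers can record what happened before the process exits. The following is a rough sketch under stated assumptions: `installGlobalHandlers`, the `Logger` type, and the log field names are hypothetical, and the `exit` parameter is injectable only to keep the function testable. `process.on('unhandledRejection')` and `process.on('uncaughtException')` are the real Node.js events.

```typescript
type Logger = (entry: Record<string, unknown>) => void;

// Structured (JSON) log lines, so a log aggregator can index the fields.
const jsonLogger: Logger = (entry) => {
  console.error(JSON.stringify({ time: new Date().toISOString(), ...entry }));
};

function installGlobalHandlers(
  log: Logger = jsonLogger,
  exit: (code: number) => void = (code) => process.exit(code),
): void {
  process.on('unhandledRejection', (reason: unknown) => {
    // A rejection reason can be any value, not only an Error.
    const err = reason instanceof Error ? reason : new Error(String(reason));
    log({ level: 'fatal', event: 'unhandledRejection', message: err.message, stack: err.stack });
    exit(1); // State may be inconsistent: exit and let the supervisor restart us.
  });

  process.on('uncaughtException', (err: Error) => {
    log({ level: 'fatal', event: 'uncaughtException', message: err.message, stack: err.stack });
    exit(1);
  });
}
```

Calling `installGlobalHandlers()` once at startup would be enough. Once a listener is registered, Node.js no longer terminates the process on its own for that event, which is why the sketch exits explicitly after logging.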

Real Case: Microservice Crashed Without Logs

**Problem:** The service processes a RabbitMQ queue. It crashes once an hour without logs.

**Investigation:**

```typescript
// The code looked fine
async function processMessage(msg) {
  const data = JSON.parse(msg.content);
  await saveToDatabase(data);
  channel.ack(msg);
}

channel.consume(queue, processMessage);
```

**Issue:** `processMessage` does not catch errors. If `JSON.parse` or `saveToDatabase` throws, the promise returned by `processMessage` rejects, nobody handles it, and it turns into an unhandledRejection.

**Solution:**

```typescript
async function processMessage(msg) {
  try {
    // msg.content is a Buffer, so convert it to a string explicitly
    const data = JSON.parse(msg.content.toString());
    await saveToDatabase(data);
    channel.ack(msg);
  } catch (err) {
    console.error('Failed to process message:', err);
    channel.nack(msg); // Return to queue
  }
}
```

Now errors are handled, the service does not crash, and failed messages are returned to the queue. Keep in mind that a message that can never succeed (for example, invalid JSON) will be redelivered again and again, so such messages are usually rejected without requeueing or routed to a dead-letter queue.

**Async wrapper for Express:** Express 4 does not catch errors from async middleware: a rejected promise never reaches the error handler. (Express 5 forwards rejected promises to `next` automatically.) On Express 4, a wrapper is needed:

```typescript
// Wraps an async handler so that a rejected promise is passed to next()
const asyncHandler = (fn) => (req, res, next) => {
  Promise.resolve(fn(req, res, next)).catch(next);
};

app.get('/user/:id', asyncHandler(async (req, res) => {
  const user = await db.getUser(req.params.id);
  res.json(user);
  // Errors will automatically go to the error handler
}));
```

Or use the express-async-errors package.

You have an Express API with async/await handlers. One of the endpoints threw an error, but there was no try/catch. What will happen?

AsyncLocalStorage and Error Boundaries

**Domains API** (the `domain` module, marked deprecated since v1.4.2) attempted to solve the problem: how to isolate errors from different requests? If one request crashes, the entire server should not go down. However, domains turned out to be complex and unreliable. They were replaced with **AsyncLocalStorage** for context and **error boundaries** for isolation.

**AsyncLocalStorage** (Node.js 12.17+) is a way to propagate context through the entire chain of async calls without explicitly passing parameters. Ideal for request ID, user ID, tracing.

**AsyncLocalStorage vs global variables:** Global variables are shared across all requests (race conditions). AsyncLocalStorage isolates the context of each async flow - even if 1000 requests are executed simultaneously, each sees its own requestId.
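The isolation described above can be observed with two overlapping "requests". This is a small sketch using the real `AsyncLocalStorage` class from `node:async_hooks`; the names `requestContext`, `currentRequestId`, `handleRequest`, and `demo` are illustrative assumptions.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { setTimeout as sleep } from 'node:timers/promises';

interface RequestContext {
  requestId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Deep in the call chain: no requestId parameter, yet the right value is visible.
function currentRequestId(): string {
  return requestContext.getStore()?.requestId ?? 'no-context';
}

async function handleRequest(requestId: string, delayMs: number): Promise<string> {
  return requestContext.run({ requestId }, async () => {
    await sleep(delayMs); // Other requests interleave while this one waits.
    return `${requestId} saw ${currentRequestId()}`;
  });
}

// Two overlapping requests; each one still sees its own requestId.
async function demo(): Promise<string[]> {
  return Promise.all([handleRequest('req-1', 20), handleRequest('req-2', 5)]);
}

demo().then((lines) => console.log(lines));
```

A global variable in place of `requestContext` would be overwritten by whichever request started last, so both lines would report the same ID.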

Request ID tracing in microservices

**Problem:** You have 5 microservices. An error occurs somewhere in the chain, but it's unclear in which request.

**Solution:** Pass the request ID through all services:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { v4 as uuidv4 } from 'uuid';

// `app` is assumed to be an existing Express application
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// API Gateway creates requestId
app.use((req, res, next) => {
  const requestId = req.get('x-request-id') || uuidv4();
  requestContext.run({ requestId }, () => next());
});

// When calling another service
async function callUserService(userId: string) {
  const ctx = requestContext.getStore();
  const response = await fetch(`http://user-service/users/${userId}`, {
    // Pass it further; omit the header when there is no context
    headers: ctx ? { 'X-Request-ID': ctx.requestId } : {},
  });
  return response.json();
}

// In the logs of all services, there is one requestId
// You can trace the entire request path:
// API Gateway [req-123] -> User Service [req-123] -> DB error [req-123]
```

**Performance overhead:** AsyncLocalStorage has a small overhead (~5-10% on async operations). In most cases, this is unnoticeable, but if you have millions of RPS and a tight latency budget - measure before implementation.

You have an Express API. You need to log the requestId in every error. What is the best way to do this?

Graceful Shutdown

You are deploying a new version of the service. Kubernetes sends **SIGTERM**, gives 30 seconds to complete, and kills the process. If you simply call `process.exit()`, current requests will be interrupted, transactions will roll back, and clients will receive a 502. **Graceful shutdown** is a proper termination: wait for current requests to finish, close connections, and save the state.

**SIGTERM vs SIGKILL:** SIGTERM - a polite request to terminate (can be handled). SIGKILL - immediate process termination (cannot be handled). Kubernetes first sends SIGTERM, waits for terminationGracePeriodSeconds (default 30s), then SIGKILL.
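The SIGTERM side of this sequence might be handled roughly as follows. This is a sketch built on the standard `node:http` module; the function names, the `/health` path, and the 25-second budget are assumptions chosen to fit inside a 30-second grace period.

```typescript
import { createServer, Server } from 'node:http';

let shuttingDown = false;

// Health probe status: 503 tells the load balancer to stop routing here.
function healthStatus(): number {
  return shuttingDown ? 503 : 200;
}

const server: Server = createServer((req, res) => {
  if (req.url === '/health') {
    res.writeHead(healthStatus()).end();
    return;
  }
  res.end('ok');
});

function gracefulShutdown(timeoutMs: number): Promise<'drained' | 'forced'> {
  shuttingDown = true;
  return new Promise((resolve) => {
    // Safety net: give up before Kubernetes would send SIGKILL.
    const timer = setTimeout(() => resolve('forced'), timeoutMs);
    // Stop accepting new connections; the callback runs once existing ones end.
    server.close(() => {
      clearTimeout(timer);
      resolve('drained');
    });
  });
}

// Call once at startup, after server.listen(...).
function installSignalHandlers(): void {
  process.once('SIGTERM', () => {
    gracefulShutdown(25000).then((outcome) => {
      // Close database pools, flush logs, etc. here before exiting.
      process.exit(outcome === 'drained' ? 0 : 1);
    });
  });
}
```

A non-zero exit code on the forced path makes it visible in pod status that the drain did not finish in time.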

Real Case: Rolling Deployment Without Downtime

**Problem:** During deployment, Kubernetes kills old pods. If they are processing requests, clients receive a 502 error.

**Solution:**

1. **preStop hook** in Kubernetes - gives time before SIGTERM:

   ```yaml
   lifecycle:
     preStop:
       exec:
         command: ["/bin/sh", "-c", "sleep 5"]
   ```

   This gives time for Ingress/Service to update endpoints.
2. **Graceful shutdown** in Node.js - wait for requests to complete
3. **Health check** returns 503 during shutdown - load balancer stops sending traffic
4. **Timeout** - if requests are not completed before the 30s grace period runs out (the preStop sleep counts toward it), force shutdown

Result: 0 lost requests during deployment.

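**Keep-Alive connections:** HTTP keep-alive keeps the connection open after a response. Before Node.js 19, `server.close()` does NOT close idle keep-alive connections - they can keep the server alive long after `close()` was called. Since Node.js 19, `server.close()` also closes idle connections, and Node.js 18.2+ offers `server.closeIdleConnections()` and `server.closeAllConnections()`. On older versions, use the `http-terminator` library or manually track connections and call `socket.destroy()`.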
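Manual connection tracking, as mentioned above, might look like the sketch below. The `sockets` set and the `closeWithDeadline` helper are hypothetical names. The `'connection'` event, `server.close()`, and `socket.destroy()` are standard Node.js APIs.

```typescript
import { createServer } from 'node:http';
import type { Socket } from 'node:net';

const server = createServer((_req, res) => {
  res.end('ok');
});

// Every open TCP connection, including idle keep-alive ones.
const sockets = new Set<Socket>();

server.on('connection', (socket: Socket) => {
  sockets.add(socket);
  socket.once('close', () => {
    sockets.delete(socket);
  });
});

// Stop accepting connections, then destroy whatever is still open at the deadline.
function closeWithDeadline(deadlineMs: number): Promise<void> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      sockets.forEach((socket) => socket.destroy());
    }, deadlineMs);
    server.close(() => {
      clearTimeout(timer);
      resolve();
    });
  });
}
```

Destroying a socket at the deadline can still cut off a request that is in flight, so the deadline should be longer than the slowest request the service is expected to finish.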

Your Node.js service processes long requests (up to 60 seconds). When deploying in Kubernetes, requests are interrupted. `terminationGracePeriodSeconds = 30s`. How do you fix it?

Recovery Patterns

Errors are inevitable: the network is down, the database is overloaded, the external API is unavailable. **Recovery patterns** are strategies for recovery after failures. Instead of crashing immediately, the service tries to recover: it retries the request, switches to a fallback, isolates the broken component.

**Main Recovery Patterns:**

1. **Retry** - repeat the operation after a delay. Use for temporary failures (network, overload).
2. **Circuit Breaker** - if a service breaks, stop calling it ("open the circuit"). Try again after a timeout.
3. **Fallback** - if the main path doesn't work, use a backup (cache, default data, simplified logic).
4. **Timeout** - limit the waiting time. It's better to return an error quickly than to hang for minutes.
5. **Health Check** - periodically check the state of dependencies. If the database is unavailable, the health check returns 503.
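Two of these patterns, timeout and retry, are small enough to sketch without a library. The helper names `withTimeout` and `retry` and their parameters are assumptions for illustration. Production code would more likely use one of the libraries listed further down.

```typescript
// Rejects if `task` does not settle within `ms` milliseconds.
// Note: this stops the waiting, it does not cancel the underlying operation.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    task.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Runs `operation` up to `attempts` times with exponential backoff between tries.
async function retry<T>(
  operation: () => Promise<T>,
  attempts: number,
  baseDelayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Wait baseDelayMs, then 2x, then 4x, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

The two compose naturally: `retry(() => withTimeout(callDependency(), 5000), 3, 200)` would give each attempt a 5-second budget, where `callDependency` is a placeholder for any promise-returning call.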

Real Case: Cascading Failure of Microservices

**Problem:** Service A calls service B, which calls C. Service C has crashed. Now:
- C does not respond (timeout 30s)
- B is stuck waiting for C, requests accumulate
- A is stuck waiting for B
- All services are overloaded, crashing in a chain

**Solution:**
1. **Timeout:** Limit waiting time (5s instead of 30s)
2. **Circuit Breaker:** After 5 errors from C, service B stops calling it
3. **Fallback:** B returns cached data instead of fresh data
4. **Health Check:** B returns 503 if C is unavailable, A switches to another instance of B

Result: C's failure does not bring down the entire system.
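The circuit breaker and fallback steps can be illustrated with a deliberately simplified state machine. The `CircuitBreaker` class below is a hypothetical sketch, not the API of any library. It counts consecutive failures, and the injectable `now` clock exists only to make the behaviour easy to test.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker<T> {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  async call(operation: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return fallback(); // Fail fast: do not touch the broken dependency.
      }
      this.state = 'half-open'; // Timeout elapsed: allow a trial call.
    }
    try {
      const result = await operation();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

In the scenario above, service B would wrap its calls to C in something like `new CircuitBreaker(5, 30000)` and pass a fallback that reads from its cache. Real implementations such as the libraries listed below also limit concurrent trial calls in the half-open state, which this sketch does not.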

**Libraries for recovery patterns:**
- `cockatiel` - circuit breaker, retry, timeout, fallback
- `p-retry` - simple retry with backoff
- `async-retry` - retry for async/await
- `opossum` - circuit breaker for Node.js
- `nestjs-resilience` - integration with NestJS

**Misconception:** Retry solves all problems - if the request fails, I'll just repeat it.

**Reality:** Retry is suitable only for temporary failures (network hiccups). If the service is down, retry will worsen the problem by adding load to the dying service. Circuit breaker + fallback + timeout are needed.

**Thundering herd problem:** All clients simultaneously retry a failed service → it receives 10x the load and cannot recover. Circuit breaker isolates the broken component, giving it time to recover. Fallback returns at least some response to the client instead of endless waiting.
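A complementary technique that the text above does not cover is adding random jitter to retry delays, so that clients do not all retry at the same instant. The sketch below shows one common variant, often called "full jitter". The function name and parameters are assumptions, and `random` is injectable only so the result can be checked deterministically.

```typescript
// Picks a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffWithJitter(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Delays for the first four retries of one client (values differ on every run).
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffWithJitter(attempt, 100, 10000)} ms`);
}
```

Because each client draws its own delay, the retries of many clients are spread over the whole interval instead of arriving as one spike.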

Your service calls an external API that sometimes goes down for 5-10 minutes. Clients are complaining about timeouts. What is the best strategy?

Key Ideas

**Operational vs Programmer errors:** Handle operational (network, files, input) errors, log and crash programmer (bugs) errors. Do not attempt to recover from bugs - the state is unpredictable.
**Async errors kill the process:** Forgot `.catch()` or `try/catch` around `await` → unhandledRejection → crash (Node.js 15+). Use an async wrapper for Express, global handlers for logging.
**AsyncLocalStorage for context:** Store requestId, userId, tracing without explicit parameter passing. Isolated for each async flow.
**Graceful shutdown is mandatory:** SIGTERM → stop accepting new connections → drain in-flight requests → cleanup → exit. Without this, deployment = lost requests and 502 errors.
**Recovery patterns:** Retry (temporary failures), Circuit Breaker (isolate the broken), Fallback (cache/defaults), Timeout (don't hang forever), Health Check (return 503).

Questions for Reflection

Check your code: are there any async functions without try/catch or .catch()? What will happen if they throw an error?
Does your service have a graceful shutdown? What will happen to active requests during deployment?
How does your service respond to the unavailability of a database or external API? Is there retry, circuit breaker, fallback?
Can you trace the request ID from login to the error in the logs? How long will it take to investigate the production incident?

Related Lessons

comp-01-intro