Node.js Internals
Profiling & Debugging: Node.js Diagnostics
3 AM. The production server is consuming 100% CPU. Users can't log in. The PM writes in Slack: 'WTF???'. You open the monitoring - the graphs look like the cardiogram of a dying person. CPU spike. Memory is growing. Response time is 10 seconds. But WHO is to blame? WHICH function? WHERE in the code? Without a profiler, you are blind. With a profiler - one flame graph, 30 seconds of analysis, and you see: 85% CPU in the function validateRegex(). Hotfix: replaced the regex. CPU dropped to 10%. Incident closed. Profiling - the difference between 3 hours of debugging and 3 minutes.
- **E-commerce Black Friday:** The API started to crash under load. CPU at 100%, response time 5 seconds. Launched the 0x profiler on the canary instance. Flame graph showed: 90% of the time in JSON.stringify() of large Product objects with deeply nested relations (truly circular references would have made JSON.stringify() throw a TypeError rather than run slowly). Fix: serialization of only the necessary fields. CPU from 100% → 20%, withstood the peak without scaling.
- **SaaS platform memory leak:** Heap grows by 200MB/day. Out of memory in a week. Heap snapshot comparison: +500K EventEmitter objects in the old version of the library. Retention path: global → app → router → middleware → EventEmitter. It turned out that the middleware registered listeners on each request. Fix: move registration out of the middleware. Leak eliminated.
- **Real-time chat latency spike:** Users are complaining about message delays of up to 3 seconds. CPU is normal, memory is normal. Clinic.js Bubbleprof showed: 10 consecutive DB queries to load chat history (N+1 problem). Each query 250ms → total 2.5s. Fix: one JOIN query instead of N+1. Latency from 2.5s → 250ms.
Why Profile Node.js
Your API suddenly started consuming 90% CPU. Or the response time increased from 50ms to 2 seconds. Or memory is growing by 100MB per hour. **Without a profiler, you are blind** - you can only guess: is the database slowing down? Has V8 GC gone crazy? Is there an infinite loop somewhere? Profiling is the X-ray of performance: it shows exactly which function is consuming CPU, where memory is being allocated, and why the event loop is stalling.
**Types of Profiling:** **CPU profiling** - which function is executed for how long (flame graph). **Memory profiling** - who allocates objects and why they are not released (heap snapshots, allocation timeline). **Event Loop profiling** - why the event loop is slow (Clinic.js Doctor). Each type addresses its own class of problems.
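As a rough illustration of the event-loop side, the sketch below uses `monitorEventLoopDelay` from `node:perf_hooks` to show how a synchronous block surfaces as loop delay. The helper names `busyWait` and `measureLoopDelay` and the 200 ms block are assumptions made for the demo, not part of any tool mentioned here.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Hypothetical CPU-bound work: spins synchronously and blocks the event loop.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

// Runs `task` once while sampling event loop delay; resolves with delays in ms.
function measureLoopDelay(task) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  return new Promise((resolve) => {
    setTimeout(() => {
      task();
      // Give the loop another turn so the late timer tick gets recorded.
      setTimeout(() => {
        histogram.disable();
        resolve({
          maxMs: histogram.max / 1e6, // histogram values are in nanoseconds
          meanMs: histogram.mean / 1e6,
        });
      }, 50);
    }, 50);
  });
}

measureLoopDelay(() => busyWait(200)).then(({ maxMs, meanMs }) => {
  console.log(`max delay ${maxMs.toFixed(1)} ms, mean ${meanMs.toFixed(1)} ms`);
});
```

A large maximum with a small mean usually points to an occasional synchronous block rather than constant overload, which is the cue to reach for a CPU profiler.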
**Node.js Inspector Protocol** is Node's implementation of the Chrome DevTools Protocol for debugging and profiling. It operates over WebSocket and is compatible with Chrome DevTools, the VS Code debugger, and most profilers in the ecosystem. Run with the `--inspect` flag - you get access to the V8 engine: breakpoints, CPU/memory profiling, heap snapshots, async stack traces.
What type of profiling is needed if your application responds slowly to requests, but CPU and memory are normal?
Inspector Protocol and Chrome DevTools
**Inspector Protocol is a bridge between Node.js and developer tools.** When you run `node --inspect app.js`, Node starts a WebSocket server on port 9229 and accepts client connections (Chrome DevTools, VS Code, etc.) while the application keeps running; only `--inspect-brk` pauses execution until a client attaches. The protocol supports 100+ commands: from setting breakpoints to taking heap snapshots. This is the same protocol that Chrome uses for debugging browser JS.
**Launch Flags:** `--inspect` - starts the Inspector on 127.0.0.1:9229 (localhost only). `--inspect=0.0.0.0:9229` - access from any IP (dangerous without a tunnel!). `--inspect-brk` - starts with a pause on the first line (for debugging application startup). `--inspect-publish-uid=stderr,http` - controls where the Inspector WebSocket URL is exposed: printed to stderr, served from the HTTP `/json/list` endpoint, or both (the default).
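The same loopback-only binding can also be requested at runtime through the built-in `node:inspector` module. The snippet below is a minimal sketch: port `0` (let the OS pick a free port) is a choice made for the demo so that it does not collide with a process already using 9229.

```javascript
const inspector = require('node:inspector');

// Open the Inspector on loopback only; port 0 asks the OS for a free port.
// The third argument (wait) is false, so the process does not pause for a client.
inspector.open(0, '127.0.0.1', false);

// The WebSocket URL a client such as Chrome DevTools would connect to.
const wsUrl = inspector.url();
console.log('Inspector listening at', wsUrl);

// Close the Inspector again once the debugging session is over.
inspector.close();
```

Opening the Inspector only for the duration of a session, and only on 127.0.0.1, keeps the attack surface small compared with a permanently exposed port.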
**Chrome DevTools for Node.js:** Once connected, all tabs are available: **Console** (REPL in the application context), **Sources** (breakpoints, step-through debugging), **Memory** (heap snapshots, allocation profiling), **Profiler** (CPU flame graphs). You can even execute arbitrary code in the application context through the Console - a powerful tool for hotfixes in production.
Why is --inspect=0.0.0.0:9229 dangerous without an SSH tunnel in production?
CPU Profiling: Flame Graphs and V8 Profiler
**CPU profiling shows where the processor spends time.** The V8 profiler samples the call stack every ~1ms and builds a call tree with time percentages. **Flame graph** is a visualization: block width = time in function, height = stack depth. If you see a wide block, it's a hotspot consuming CPU. The flame graph instantly shows that `validateInput()` takes up 80% of the request time.
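One way to see the sampling data behind a flame graph is to capture a profile in-process with `node:inspector` and sum the sampled time per function. This is a sketch under assumptions: `hotLoop` is a made-up hotspot, the helper names are hypothetical, and each time delta is attributed to the sample that follows it, which is an approximation.

```javascript
const inspector = require('node:inspector');

// Promise wrapper around session.post, which uses an (err, result) callback.
function post(session, method, params = {}) {
  return new Promise((resolve, reject) => {
    session.post(method, params, (err, result) => (err ? reject(err) : resolve(result)));
  });
}

// Profiles a synchronous function and returns the raw CPU profile object.
async function captureCpuProfile(work) {
  const session = new inspector.Session();
  session.connect();
  try {
    await post(session, 'Profiler.enable');
    await post(session, 'Profiler.start');
    work();
    const { profile } = await post(session, 'Profiler.stop');
    return profile;
  } finally {
    session.disconnect();
  }
}

// Sums sampled time (microseconds) per function that was on top of the stack.
function topSelfTime(profile, limit = 5) {
  const nameById = new Map(
    profile.nodes.map((node) => [node.id, node.callFrame.functionName || '(anonymous)'])
  );
  const totals = new Map();
  profile.samples.forEach((nodeId, i) => {
    const name = nameById.get(nodeId);
    totals.set(name, (totals.get(name) || 0) + profile.timeDeltas[i]);
  });
  return [...totals.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}

// Hypothetical hotspot standing in for real request handling.
function hotLoop(iterations) {
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += Math.sqrt(i) % 7;
  return acc;
}

captureCpuProfile(() => hotLoop(5e6)).then((profile) => {
  console.table(topSelfTime(profile));
});
```

The returned object has the same shape as a `.cpuprofile` file, so it could also be written to disk with `JSON.stringify` and opened in Chrome DevTools.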
**How to read a flame graph:** The X-axis is not time, but alphabetical order (or sorting). The width of the block represents the percentage of CPU time. The Y-axis is the depth of the call stack (bottom - root, top - leaf functions). **Colors** are usually random for distinction, but some tools highlight: red - JavaScript, yellow - C++ (V8 internals), green - system calls.
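**Interpretation of results:** What matters is a function's self time, the part of its block not covered by child blocks above it. If a block is **wide with little above it**, the function itself is slow (heavy computation). If a block is **wide but almost fully covered by children**, the function is cheap and the time goes to its callees. A sampled flame graph does not show call counts: a fast function called a million times still appears as one wide block, because its samples are aggregated. Optimization strategies differ: for heavy self time - a better algorithm, for many cheap calls - caching or batching.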
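The difference between a function's own time and the time of everything it calls can be made concrete with folded stacks, the `frame;frame;frame count` text format that flame graph tools commonly consume. The sample data and the `frameWidths` helper below are illustrative assumptions. For brevity the sketch merges frames by name, whereas a real flame graph keeps a separate block per distinct call path.

```javascript
// Each line: a sampled stack (root first) and the number of samples that hit it.
const foldedStacks = [
  'main;handleRequest;validateInput 80',
  'main;handleRequest;render 15',
  'main;gc 5',
];

// Returns, per frame name, total width (frame anywhere on the stack)
// and self width (frame on top of the stack), both as percent of all samples.
function frameWidths(lines) {
  const total = new Map();
  const self = new Map();
  let allSamples = 0;
  for (const line of lines) {
    const split = line.lastIndexOf(' ');
    const frames = line.slice(0, split).split(';');
    const count = Number(line.slice(split + 1));
    allSamples += count;
    // A Set avoids double counting recursive frames within one stack.
    for (const frame of new Set(frames)) {
      total.set(frame, (total.get(frame) || 0) + count);
    }
    const leaf = frames[frames.length - 1];
    self.set(leaf, (self.get(leaf) || 0) + count);
  }
  const pct = (n) => (100 * n) / allSamples;
  return [...total.keys()].sort().map((name) => ({
    name,
    totalPct: pct(total.get(name)),
    selfPct: pct(self.get(name) || 0),
  }));
}

console.table(frameWidths(foldedStacks));
```

In this data `handleRequest` is 95% wide but has 0% self time, so optimizing its body would change nothing; `validateInput` carries 80% as self time and is the actual hotspot.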
What does it mean if a function in a flame graph occupies 50% of the width but is deep in the stack (many functions below it)?
Memory Profiling: Heap Snapshots and Allocation Timeline
**Memory profiling is the search for what is consuming memory and why it is not being released.** The two main tools are: **Heap Snapshot** (a static picture of memory at a point in time) and **Allocation Timeline** (a record of allocations over time). Heap snapshot shows **what** is currently in memory, allocation timeline shows **when** and **where** objects were created. To find leaks, **snapshot comparison** is used - a diff between two snapshots.
**Heap Snapshot contains:** All objects in the heap with their sizes (shallow size - the object itself, retained size - the object + everything it retains). Retention paths (chains of references from the GC root). Grouping by constructor (how many objects of type Array, Map, User, etc.). **Allocation Timeline:** Records each allocation with a timestamp and call stack. You can see that in the last 10 seconds, 100K User objects were created in the handleRequest function.
**Retention Path - the key to understanding leaks.** If an object in a snapshot has a retention path through several intermediate objects to the GC root, it means it is reachable and the GC will not remove it. A typical leak path: `window/global → EventEmitter → listeners array → callback closure → captured variables → leaked object`. By breaking any link in the chain, you release everything downstream.
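A retention chain of this kind is easy to reproduce. In the sketch below, the emitter `bus`, the event names, and both handlers are hypothetical. The leaky handler registers a listener per request, so each closure keeps its payload reachable from a long-lived object. The fixed handler relies on a single listener registered outside the request path.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical long-lived emitter, reachable for the whole process lifetime.
const bus = new EventEmitter();
bus.setMaxListeners(0); // disables the listener-count warning, for this demo only

// Leaky: every request adds a listener whose closure captures the payload.
function leakyHandler(payload) {
  bus.on('config-changed', () => payload.length);
}

// Fixed: the listener is registered once, outside the request path.
let configVersion = 0;
bus.on('config-reloaded', () => {
  configVersion += 1;
});
function fixedHandler(payload) {
  return { size: payload.length, configVersion };
}

for (let i = 0; i < 1000; i++) {
  leakyHandler(Buffer.alloc(1024));
  fixedHandler(Buffer.alloc(1024));
}

console.log(bus.listenerCount('config-changed')); // 1000 closures, each retaining a 1 KiB buffer
console.log(bus.listenerCount('config-reloaded')); // 1
```

In a heap snapshot comparison the leaky variant would show up as a growing number of closures retained through the emitter's listener array, while the fixed variant stays flat.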
What is the difference between Heap Snapshot and Allocation Timeline?
Clinic.js: Doctor, Flame, Bubbleprof
**Clinic.js is a Swiss army knife for diagnosing Node.js.** Three tools: **Clinic Doctor** (detects issues: Event Loop delay, I/O, memory), **Clinic Flame** (CPU flame graphs with V8 optimization info), **Clinic Bubbleprof** (visualization of async operations). Doctor tells **what** is broken, Flame shows **where** in the code, Bubbleprof explains **why** asynchronous operations are slow.
**Clinic Doctor detects:** Event Loop delays (synchronous blocks), I/O bottlenecks (slow read/write operations), Memory issues (heap growth, frequent GC). **Clinic Flame:** Advanced flame graph with V8 optimization states annotations (optimized/not optimized/deoptimized). **Clinic Bubbleprof:** Visualization of async operations as bubbles - size = latency, color = type (I/O, timers, etc).
**When to use what:** **Doctor** - the first step in diagnostics, quickly shows the class of the problem (CPU/Memory/I/O). **Flame** - when Doctor indicates a CPU issue, the flame graph finds the specific function. **Bubbleprof** - when latency is high, but CPU is not loaded → the problem lies in async operations or their orchestration. Together they provide a complete picture.
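The kind of async orchestration problem Bubbleprof tends to reveal can be sketched with a fake database that only counts round trips. Everything here is an assumption made for illustration: `createFakeDb`, its methods, and the 25 ms latency stand in for a real driver, and `getHistory` stands in for a single JOIN query.

```javascript
// Hypothetical in-memory "database" that counts round trips and adds fixed latency.
function createFakeDb(latencyMs = 25) {
  const messages = new Map([[1, 'hi'], [2, 'hello'], [3, 'bye']]);
  const delay = (value) =>
    new Promise((resolve) => setTimeout(() => resolve(value), latencyMs));
  const db = { roundTrips: 0 };
  db.listMessageIds = () => {
    db.roundTrips += 1;
    return delay([...messages.keys()]);
  };
  db.getMessage = (id) => {
    db.roundTrips += 1;
    return delay(messages.get(id));
  };
  // Single query standing in for a JOIN that returns the whole history.
  db.getHistory = () => {
    db.roundTrips += 1;
    return delay([...messages.values()]);
  };
  return db;
}

// N+1 pattern: one query for the ids, then one awaited query per message.
async function loadHistoryNPlusOne(db) {
  const ids = await db.listMessageIds();
  const history = [];
  for (const id of ids) {
    history.push(await db.getMessage(id));
  }
  return history;
}

// Single round trip for the same data.
async function loadHistoryJoined(db) {
  return db.getHistory();
}

async function main() {
  const slow = createFakeDb();
  const fast = createFakeDb();
  console.time('n+1');
  await loadHistoryNPlusOne(slow);
  console.timeEnd('n+1');
  console.time('joined');
  await loadHistoryJoined(fast);
  console.timeEnd('joined');
  console.log('round trips:', slow.roundTrips, 'vs', fast.roundTrips);
}

main();
```

CPU stays idle in both variants; the difference is purely in how many sequential waits are stacked end to end, which is why a CPU flame graph alone would not explain the latency.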
Clinic.js Bubbleprof shows a large bubble (250ms) for the 'DB Query' operation, executed 10 times in succession. What does this mean?
Profiling in Production
**Profiling in production is the art of minimal overhead.** You can't just enable `--inspect` on all servers - it's a security hole. You can't run Clinic.js - it will slow down the application by 30%. You need **production-safe** tools: sampling profiling (minimal overhead), on-demand activation via signals, automatic metric collection without stopping the service.
**Production-safe strategies:** **Sampling CPU profiling** - sampling the call stack every 10ms (overhead <5%). **Heap snapshots on-demand** - take a snapshot on SIGUSR2, not continuously. **Continuous profiling** - sending profiles to a monitoring system (DataDog, Grafana Pyroscope). **Canary profiling** - profile 1% of instances, the rest operate normally.
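On-demand heap snapshots can be wired up with the built-in `node:v8` module. The sketch below is one possible arrangement: the helper name `dumpHeapSnapshot`, the file naming, and the use of the OS temp directory are assumptions. Node also ships a `--heapsnapshot-signal=<signal>` flag that provides similar behavior without any code.

```javascript
const v8 = require('node:v8');
const path = require('node:path');
const os = require('node:os');

// Writes a heap snapshot into `dir` and returns the file path.
// writeHeapSnapshot is synchronous: it blocks the event loop while it runs
// and needs extra memory roughly proportional to the heap size.
function dumpHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  return v8.writeHeapSnapshot(file);
}

// On-demand trigger: `kill -USR2 <pid>` (SIGUSR2 is not available on Windows).
process.on('SIGUSR2', () => {
  const file = dumpHeapSnapshot();
  console.log(`Heap snapshot written to ${file}`);
});
```

Because the snapshot pauses the process, it is safer to trigger it on an instance that has been taken out of rotation or on a canary.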
**Security in Production:** Never open Inspector on 0.0.0.0. Keep it bound to 127.0.0.1 and connect through an SSH tunnel (for example `ssh -L 9229:127.0.0.1:9229 user@host`); `--inspect` accepts only `[host:]port`, not a Unix socket path. Heap snapshots contain sensitive data (tokens, passwords in memory) - encrypt before sending. Rotate old profiles automatically (retention policy 7 days). Log all profiling actions in the audit log.
**Myth:** Profiling is only needed when something is broken.
**Reality:** Continuous profiling in production allows detecting regressions before they become critical.
If you enable the profiler only during an incident, you see the consequences, not the cause. Continuous profiling builds a baseline of performance and automatically detects deviations. For example, a new deployment introduced a regression - a function started taking up 15% CPU instead of 5%. Without continuous profiling, you would find out about this in a week when users start complaining. With continuous profiling - a minute after deployment through diff flame graphs between versions.
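The comparison between versions boils down to diffing per-function CPU shares. The sketch below shows the idea on made-up numbers; the function names, the shares, and the 5-point threshold are all illustrative assumptions rather than output from a real continuous profiler.

```javascript
// Per-function CPU share (percent of samples) for two deployments; values are illustrative.
const baseline = { parseBody: 20, validateRegex: 5, render: 30 };
const current = { parseBody: 21, validateRegex: 15, render: 29 };

// Returns functions whose CPU share grew by more than `thresholdPct` percentage points,
// largest increase first. Functions missing from the baseline count as 0%.
function findRegressions(before, after, thresholdPct = 5) {
  return Object.keys(after)
    .map((name) => {
      const previous = before[name] ?? 0;
      return { name, before: previous, after: after[name], delta: after[name] - previous };
    })
    .filter((row) => row.delta > thresholdPct)
    .sort((a, b) => b.delta - a.delta);
}

console.log(findRegressions(baseline, current));
// [ { name: 'validateRegex', before: 5, after: 15, delta: 10 } ]
```

Running a check like this right after each deployment turns a vague "the service feels slower" into a named function and a number.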
Why does continuous CPU profiling in production use sampling with a 10ms interval instead of 1ms?
Key Ideas
- **Inspector Protocol - a universal protocol for debugging and profiling Node.js.** Operates via WebSocket, compatible with Chrome DevTools. Flags: --inspect (localhost), --inspect-brk (pause on start). In production, only through SSH tunnel, never 0.0.0.0.
- **CPU profiling shows hotspots through flame graphs.** Block width = CPU time, height = call stack. Tools: Chrome DevTools Profiler, 0x (automatic flamegraph), Clinic Flame (with V8 optimization info). Look for wide blocks - that's where the problem is.
- **Memory profiling: Heap Snapshots (what's in memory) + Allocation Timeline (where objects are created).** Heap snapshot comparison finds leaks through diff. Retention path shows who keeps the object alive. WeakMap/WeakRef for automatic cleanup.
- **Clinic.js - three tools: Doctor (what's broken), Flame (where in the code), Bubbleprof (why async is slow).** Doctor detects Event Loop delay, I/O bottlenecks, memory issues. Bubbleprof visualizes async operations - large bubbles indicate slow operations.
- **Production profiling: on-demand via signals, continuous profiling (DataDog/Pyroscope), canary (1% of instances).** Sampling every 10ms keeps overhead under 5%. Heap snapshots via SIGUSR2. Never --inspect on 0.0.0.0. Encrypt profiles - they contain sensitive data.
Related topics
Profiling is related to all aspects of Node.js performance - from the Event Loop to Memory Management:
- Performance Hooks: Performance Hooks measure the time of operations (mark/measure), profiling shows where this time is spent (flame graphs). Use together: mark() for custom metrics + CPU profiler for detailed analysis.
- Memory Management: Memory profiling tools (heap snapshots, allocation timeline) are used to find the leaks described in Memory Management. GC pauses are visible in the CPU profiler as V8 GC blocks.
- Event Loop: Event Loop delay indicates that the application is slowing down. The CPU profiler explains which function is blocking the loop. Clinic Doctor automatically links Event Loop metrics with hotspots.
Questions for Reflection
- Your API is responding slowly, but CPU and memory are normal. What type of profiling will you run first and why? Hint: if the CPU is not loaded, where is the time being spent?
- Flame graph shows that 60% of the CPU is in the parseJSON function, but this function is called from 10 different places. How to find out which of the 10 calls is the most frequent? Hint: look at the call stack above parseJSON.
- Heap snapshot comparison shows +100K Promise objects with a retention path through global.pendingRequests. What architectural patterns could have led to this leak? How to fix it without changing the application logic?