Node.js Internals
Buffer: Working with Binary Data
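Consider a server for an online game. Every second, a thousand TCP packets arrive - player coordinates, actions, chat. Storing each packet as a JavaScript string in UTF-16 inflates memory, adds a decoding step to every packet, and can corrupt binary payloads; under load this can push RAM consumption into the gigabytes and cause severe lag. Buffer solves this: raw bytes, minimal copying, native speed.
- **Cryptography**: `crypto.createHash('sha256').update(Buffer.from(password))` - hashing works with bytes, not strings. A string is first encoded to bytes (UTF-8 by default), so the same text in a different encoding produces a different hash; converting explicitly keeps the encoding visible.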
- **WebSocket**: when receiving a binary frame (e.g., protobuf), the data comes as a Buffer. Parsing it through String → JSON is impossible - it's not text, but an encoded structure.
- **Images**: the `sharp` library (image processing) works with Buffer. Reading PNG → resizing → saving to JPEG - all through Buffer, without intermediate files on disk.
Why Buffer is Needed
JavaScript was created for browsers, where the main data unit is text (strings). But in Node.js we work with **networking, file system, cryptography** - there data is transmitted as **raw bytes**, not Unicode characters.
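**Problem**: JavaScript strings are sequences of **UTF-16** code units (2 bytes each; characters outside the Basic Multilingual Plane take two units). For working with TCP packets, binary protocols, or images, this is inefficient and inconvenient, and arbitrary bytes do not always survive a round trip through a string.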
**Buffer** is a Node.js class for working with a sequence of bytes directly in memory. It's like `uint8_t[]` in C or `byte[]` in Java, but with a convenient API.
**Where it is critical:**
- **Cryptography**: encryption works with bytes, not characters
- **Network**: HTTP/2, WebSocket, TCP packets - these are bytes
- **Files**: images, videos, PDFs - not text
- **Performance**: keeping data in a Buffer avoids repeated Buffer → String → Buffer conversions in hot paths
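The gap between characters and bytes can be seen in a short sketch. The sample string and the four-byte `packet` are made up for illustration; the point is that string length is not byte length, and that binary data may not survive decoding as text.

```typescript
import { Buffer } from "node:buffer";

// A string's length counts UTF-16 code units, not bytes on the wire.
const text = "h\u00e9llo"; // "héllo"
const utf8Bytes = Buffer.from(text, "utf8");

console.log(text.length);                        // 5 code units
console.log(utf8Bytes.length);                   // 6 bytes: "é" takes two bytes in UTF-8
console.log(Buffer.byteLength(text, "utf16le")); // 10 bytes: two per code unit

// Arbitrary binary data does not survive a round trip through a UTF-8 string:
// invalid byte sequences are replaced with U+FFFD during decoding.
const packet = Buffer.from([0xff, 0xfe, 0x00, 0x48]);
const roundTripped = Buffer.from(packet.toString("utf8"), "utf8");
console.log(packet.equals(roundTripped));        // false
```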
Why can't regular JavaScript strings be used to work with TCP packets?
Creating a Buffer
In Node.js, there are three main ways to create a Buffer. Each is for its own task, and each has its pitfalls (especially with security).
**Main rule**: never use `new Buffer()` (deprecated since Node 6). Use static methods: `Buffer.from()`, `Buffer.alloc()`, `Buffer.allocUnsafe()`.
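`allocUnsafe` **skips the initialization (zero-fill) step** - it's faster but dangerous: the returned memory may still contain old data.
**Security**: In 2016, a memory-disclosure vulnerability was found in the `ws` (WebSocket) package: passing a number where payload data was expected made it allocate an uninitialized Buffer of that size (via `new Buffer(size)`, which behaved like today's `allocUnsafe`) and send stale process memory to the peer. **Rule**: use `allocUnsafe` only when all bytes will be immediately overwritten.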
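A minimal sketch of the three creation methods side by side. The variable names are illustrative; the only point is which calls copy data, which zero the memory, and which leave it uninitialized.

```typescript
import { Buffer } from "node:buffer";

const fromText = Buffer.from("Hi", "utf8"); // copies the encoded bytes
const fromBytes = Buffer.from([0x48, 0x69]); // copies the array values
const zeroed = Buffer.alloc(4);              // 4 bytes, all 0x00
const scratch = Buffer.allocUnsafe(4);       // 4 bytes, contents undefined
scratch.fill(0xab);                          // overwrite every byte before use

console.log(fromText.equals(fromBytes)); // true
console.log(zeroed);                     // <Buffer 00 00 00 00>
console.log(scratch);                    // <Buffer ab ab ab ab>
```

If there is any code path on which part of an `allocUnsafe()` Buffer is not written before it leaves the process, `Buffer.alloc()` is the safer choice.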
Why can Buffer.allocUnsafe() be dangerous for security?
Encodings
The buffer stores bytes, but often it is necessary to convert them to text (or vice versa). For this, **encodings** are used - rules for how bytes are transformed into characters.
**Important**: the same bytes can be read differently depending on the encoding. `0x48` in ASCII is 'H', while in UTF-16LE it is half of a character.
**UTF-8** - a standard for text. **Base64** - for transmitting bytes via JSON/HTTP. **Hex** - for debugging and hashes (SHA256 → 64 hex characters).
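The three encodings can be compared on the same bytes. This is a small sketch with made-up sample data; it also shows the `0x48` case from above and the length of a SHA-256 digest in hex.

```typescript
import { Buffer } from "node:buffer";
import { createHash } from "node:crypto";

const bytes = Buffer.from("Hello", "utf8");

console.log(bytes.toString("hex"));    // 48656c6c6f
console.log(bytes.toString("base64")); // SGVsbG8=
console.log(Buffer.from("SGVsbG8=", "base64").toString("utf8")); // Hello

// The same two bytes decode differently depending on the encoding.
const pair = Buffer.from([0x48, 0x00]);
console.log(pair.toString("utf16le"));       // H (one UTF-16 code unit)
console.log(pair.toString("latin1").length); // 2 (two separate characters)

// A SHA-256 digest is 32 bytes, so its hex form is 64 characters long.
const digest = createHash("sha256").update(bytes).digest();
console.log(digest.length, digest.toString("hex").length); // 32 64
```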
**Real Example: JWT Token**. A JWT consists of 3 parts separated by dots (header, payload, signature), each encoded as Base64URL, the URL-safe variant of Base64.
**Performance**: `toString()` with an encoding runs in Node.js's native C++ layer and is very fast. But converting Buffer → String → Buffer multiple times in a hot path is wasteful - working with Buffer directly is more efficient.
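A sketch of decoding the first two parts of a token. The token is built inside the example so it stays self-contained; the claims and the "signature" are placeholders, and the `base64url` encoding name assumes a reasonably recent Node.js version. Decoding does not verify the signature.

```typescript
import { Buffer } from "node:buffer";

// Build an illustrative token; the signature part is a placeholder, not a real HMAC.
const encode = (value: object): string =>
  Buffer.from(JSON.stringify(value), "utf8").toString("base64url");

const token = [
  encode({ alg: "HS256", typ: "JWT" }),
  encode({ sub: "42", name: "Ada" }),
  Buffer.from("not-a-real-signature").toString("base64url"),
].join(".");

// Decoding: split on dots, then Base64URL -> bytes -> UTF-8 JSON.
const [headerPart, payloadPart] = token.split(".");
const header = JSON.parse(Buffer.from(headerPart, "base64url").toString("utf8"));
const payload = JSON.parse(Buffer.from(payloadPart, "base64url").toString("utf8"));

console.log(header.alg);   // HS256
console.log(payload.name); // Ada
```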
Why do JWT tokens use Base64 encoding instead of just UTF-8?
Operations with Buffer
Buffer is not just an array of bytes. It is a set of methods for efficient handling of binary data: copying, comparing, searching, concatenating.
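A brief tour of those methods, using made-up sample data:

```typescript
import { Buffer } from "node:buffer";

const a = Buffer.from("abc");
const b = Buffer.from("abd");

console.log(a.equals(b));          // false
console.log(Buffer.compare(a, b)); // -1 (a sorts before b)

// Searching: indexOf() accepts a string, a number, or another Buffer.
const message = Buffer.from("key=value");
const eq = message.indexOf("=");
console.log(eq);                                  // 3
console.log(message.subarray(0, eq).toString());  // key
console.log(message.subarray(eq + 1).toString()); // value

// copy() writes bytes into an existing Buffer and returns the number copied.
const target = Buffer.alloc(5);
const copied = Buffer.from("hello world").copy(target, 0, 0, 5);
console.log(copied, target.toString()); // 5 hello

// concat() allocates a new Buffer and copies every piece into it.
const joined = Buffer.concat([a, b]);
console.log(joined.toString()); // abcabd
```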
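**The main difference from arrays**: `.slice()` on a Buffer returns a **view** (a reference to the same memory), not a copy, unlike `Array.prototype.slice()`. In current Node.js versions `buf.slice()` is deprecated; use `.subarray()`, which behaves the same way and has a more explicit name.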
Shared Memory in subarray()
```typescript
const buf = Buffer.from('Hello!');  // [0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x21]
const view = buf.subarray(2, 6);    // [0x6c, 0x6c, 0x6f, 0x21] - 'llo!'
view[0] = 0x4A;                     // Change 'l' → 'J'
console.log(buf.toString());        // "HeJlo!" - the original changed!
```

**Why?** `subarray()` returns a reference to the same memory. Changes in `view` affect `buf`.
**Key point**: modifying a subarray also changes the **original Buffer** (shared memory). This is efficient (no copying), but dangerous!
**Real Example: Parsing HTTP Chunked Transfer Encoding**. HTTP can send data in chunks. Each chunk starts with its size in hex followed by CRLF, then the data and another CRLF; a chunk of size zero ends the body.
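A minimal sketch of such a parser, assuming the complete body is already in one Buffer. `decodeChunked` is a hypothetical helper, not a Node.js API, and it does not handle trailers or data split across multiple reads.

```typescript
import { Buffer } from "node:buffer";

const CRLF = Buffer.from("\r\n");

// Decodes a complete chunked body held in a single Buffer.
function decodeChunked(body: Buffer): Buffer {
  const parts: Buffer[] = [];
  let offset = 0;
  while (offset < body.length) {
    const lineEnd = body.indexOf(CRLF, offset);
    if (lineEnd === -1) throw new Error("missing chunk-size line");
    const size = parseInt(body.toString("latin1", offset, lineEnd), 16);
    if (Number.isNaN(size)) throw new Error("invalid chunk size");
    if (size === 0) break; // the zero-size chunk terminates the body
    const dataStart = lineEnd + CRLF.length;
    const dataEnd = dataStart + size;
    if (dataEnd > body.length) throw new Error("truncated chunk");
    parts.push(body.subarray(dataStart, dataEnd)); // a view, no copy yet
    offset = dataEnd + CRLF.length; // skip the CRLF that follows the data
  }
  return Buffer.concat(parts); // single copy at the end
}

const wire = Buffer.from("5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n", "latin1");
console.log(decodeChunked(wire).toString()); // Hello, world
```

The chunk data is collected as `subarray()` views and copied once by `Buffer.concat()`, so the parser itself never duplicates payload bytes.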
**Performance**: `Buffer.concat()` creates a new Buffer and copies the data. When concatenating thousands of small pieces (for example, streaming), collecting them in an array and performing a single `concat()` at the end is far more efficient.
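The collect-then-concat pattern can be sketched as follows. The `incoming` array stands in for chunks that would normally arrive through a stream's `data` events.

```typescript
import { Buffer } from "node:buffer";

// Simulated stream chunks: 1000 one-byte Buffers.
const incoming: Buffer[] = Array.from({ length: 1000 }, (_, i) =>
  Buffer.from([i & 0xff]),
);

// Collect references first, copy once at the end.
const chunks: Buffer[] = [];
let total = 0;
for (const chunk of incoming) {
  chunks.push(chunk);
  total += chunk.length;
}

// Passing the total length saves concat() from computing it again.
const whole = Buffer.concat(chunks, total);

console.log(whole.length);           // 1000
console.log(whole[255], whole[256]); // 255 0
```

Calling `Buffer.concat([acc, chunk])` inside the loop instead would re-copy everything accumulated so far on every iteration.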
What happens to the original Buffer when a view obtained through .subarray() is modified?
Connection with TypedArrays
Buffer is not a standalone Node.js invention. It is a **subclass of Uint8Array** from the JavaScript standard (ES6), so under the hood it uses the same mechanisms as TypedArrays in browsers.
**TypedArray** is a family of classes for working with binary data: Uint8Array, Int16Array, Float32Array, etc. They are all built on top of **ArrayBuffer** - a block of memory.
**ArrayBuffer** is the foundation. All other classes are **views** (interpretations) of the same memory.
**Real Example: Parsing Binary Data (PNG Header)**. A PNG file starts with an 8-byte signature, followed by the IHDR chunk, which stores metadata in different formats: 32-bit big-endian integers for width and height, single bytes for bit depth and color type.
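A sketch of reading those fields with Buffer's typed read methods. `readPngHeader` and `PngInfo` are illustrative names, and the sample header is assembled by hand (no pixel data, CRC left as zeros) so the example runs without a file on disk.

```typescript
import { Buffer } from "node:buffer";

const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

interface PngInfo {
  width: number;
  height: number;
  bitDepth: number;
  colorType: number;
}

function readPngHeader(file: Buffer): PngInfo {
  if (file.length < 26 || !file.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new Error("not a PNG file");
  }
  // Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
  if (file.toString("latin1", 12, 16) !== "IHDR") {
    throw new Error("IHDR chunk expected");
  }
  return {
    width: file.readUInt32BE(16),  // 4 bytes, big-endian
    height: file.readUInt32BE(20), // 4 bytes, big-endian
    bitDepth: file.readUInt8(24),
    colorType: file.readUInt8(25),
  };
}

// Hand-built header for an 800x600, 8-bit RGBA image.
const sample = Buffer.alloc(29);
PNG_SIGNATURE.copy(sample, 0);
sample.writeUInt32BE(13, 8); // IHDR data length
sample.write("IHDR", 12, "latin1");
sample.writeUInt32BE(800, 16);
sample.writeUInt32BE(600, 20);
sample.writeUInt8(8, 24);
sample.writeUInt8(6, 25);

console.log(readPngHeader(sample));
// { width: 800, height: 600, bitDepth: 8, colorType: 6 }
```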
**DataView**: another way to read an ArrayBuffer with explicit control over endianness. Useful for network protocols (network byte order = big-endian).
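A small sketch of several views over one ArrayBuffer, including DataView's endianness flag. The values are arbitrary; note that when wrapping an existing Buffer, `byteOffset` and `length` should be passed, because a small Buffer may sit inside a larger shared allocation.

```typescript
import { Buffer } from "node:buffer";

const memory = new ArrayBuffer(8);
const bytes = new Uint8Array(memory);
const view = new DataView(memory);

// Write a 32-bit value in network byte order (big-endian is DataView's default).
view.setUint32(0, 0x01020304);
console.log(Array.from(bytes.subarray(0, 4))); // [ 1, 2, 3, 4 ]

// Reading the same bytes as little-endian gives a different number.
console.log(view.getUint32(0, true).toString(16)); // 4030201

// A Buffer can wrap the same ArrayBuffer without copying.
const buf = Buffer.from(memory);
buf[4] = 0xff;
console.log(bytes[4]);                                  // 255
console.log(buf.readUInt32BE(0) === view.getUint32(0)); // true

// For an existing Buffer, restrict the DataView to the Buffer's own region.
const pooled = Buffer.from("abcd");
const dv = new DataView(pooled.buffer, pooled.byteOffset, pooled.length);
console.log(dv.getUint8(0)); // 97
```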
What is an ArrayBuffer and how does it differ from a Buffer?
Memory Management
Buffer works with memory outside the V8 heap (in the case of large buffers) or uses **pooling** for small ones. This is critical for performance but creates peculiarities when working with the GC.
**Buffer pooling**: Node.js allocates large memory blocks (8KB) and slices them into smaller Buffers. This reduces fragmentation and speeds up allocation.
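Small Buffers (<4KB, that is, less than half of the pool size) created by `allocUnsafe()` or `Buffer.from()` are taken from the **pool**; larger ones get a dedicated allocation via malloc() **outside the V8 heap** to avoid bloating the GC. `Buffer.alloc()` never uses the pool.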
**Security and pooling**: `allocUnsafe()` may return a piece of the pool where **foreign data** was previously stored. This can leak confidential information!
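**Real example: memory leak during streaming**. A small view obtained with `.subarray()` keeps its entire underlying chunk alive. Keeping references to many small views from a large stream can accidentally retain most of the stream's data in memory; copy the bytes that need to outlive the chunk.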
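The retention effect can be observed through the size of the underlying ArrayBuffer. This is a sketch with a made-up 1 MiB chunk standing in for stream data; the copy goes into a `Buffer.alloc()` target so that it owns exactly the bytes it needs.

```typescript
import { Buffer } from "node:buffer";

// Stand-in for one large chunk read from a stream.
const bigChunk = Buffer.alloc(1024 * 1024, 0x61);

// A subarray is only a view: keeping it keeps the whole 1 MiB allocation alive.
const headerView = bigChunk.subarray(0, 16);
console.log(headerView.buffer.byteLength); // 1048576

// Copying the bytes into a fresh Buffer detaches them from the large chunk.
const headerCopy = Buffer.alloc(16);
headerView.copy(headerCopy);
console.log(headerCopy.buffer.byteLength); // 16
console.log(headerCopy.equals(headerView)); // true
```

Once only `headerCopy` is referenced, the garbage collector is free to release the large chunk.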
**Performance**: for high-throughput data processing (e.g., video streaming), a manually managed **Buffer pool** is the right approach - allocate large buffers once and reuse them.
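One way such a pool might look, as a rough sketch rather than a production design. `BufferPool`, its method names, and the sizes are all hypothetical; a real implementation would also need a policy for growth and for Buffers that are never returned.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical fixed-size pool: names and sizes are illustrative only.
class BufferPool {
  private readonly free: Buffer[] = [];

  constructor(private readonly size: number, count: number) {
    for (let i = 0; i < count; i++) this.free.push(Buffer.alloc(size));
  }

  // Hands out a pooled Buffer, or allocates a fresh one if the pool is empty.
  acquire(): Buffer {
    return this.free.pop() ?? Buffer.alloc(this.size);
  }

  // Zeroes the Buffer before reuse so old data cannot leak to the next user.
  release(buf: Buffer): void {
    if (buf.length !== this.size) return; // wrong size, not from this pool
    buf.fill(0);
    this.free.push(buf);
  }

  get available(): number {
    return this.free.length;
  }
}

const pool = new BufferPool(64 * 1024, 2);
const frame = pool.acquire();
frame.write("frame-1");
pool.release(frame);

const reused = pool.acquire();
console.log(reused === frame); // true
console.log(reused[0]);        // 0
console.log(pool.available);   // 1
```

Zeroing on release trades a little speed for safety; skipping it is only reasonable when every consumer overwrites the whole Buffer.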
**Common misconception**: "`Buffer.alloc()` and `Buffer.allocUnsafe()` are the same thing, just different names."
**In fact**: `Buffer.alloc()` fills memory with zeros (safe, but slower). `Buffer.allocUnsafe()` returns uninitialized memory (fast, but may contain old data - risk of leakage).
`allocUnsafe()` skips the memory zeroing stage. Unless all the bytes are overwritten immediately, remnants of old data (passwords, tokens) can be read. Use `alloc()` by default, `allocUnsafe()` only when the Buffer will be filled completely right away.
Why can Buffer.allocUnsafe() lead to a leak of confidential data?
Key Ideas
- **Buffer is a Uint8Array with additional methods**: it works with raw bytes (1 byte = 8 bits), unlike String (UTF-16, 2 bytes per character). It is built on top of ArrayBuffer from ES6.
- **Security**: `Buffer.alloc()` is safe (fills with zeros), `Buffer.allocUnsafe()` may leak old data from memory. Use `alloc()` by default, `allocUnsafe()` only when all bytes are immediately overwritten.
- **Performance**: `.subarray()` creates a view (shared memory), not a copy. `Buffer.concat()` in a loop is an anti-pattern (quadratic complexity). Large Buffers get their own allocation outside the V8 heap.
- **Encodings**: UTF-8 for text, Base64 for transmitting bytes via JSON/HTTP, Hex for debugging. The same bytes are read differently depending on the encoding.
- **TypedArrays**: Buffer === Uint8Array + Node.js API. ArrayBuffer - a low-level block of memory on which all views (Uint8Array, Int16Array, DataView) are built.
Related Topics
Buffer is the foundation for working with I/O in Node.js. Here's how it relates to other topics:
- Streams: streams transmit data in chunks, and by default each chunk is a Buffer. `stream.read()` returns a Buffer unless an encoding is set, and strings passed to `stream.write()` are encoded to bytes.
- Crypto: all cryptographic operations (hash, encrypt, sign) work on bytes. Strings are encoded first, so the string encoding affects the hash result.
- File System: `fs.readFile()` returns a Buffer by default. For text, the encoding must be specified explicitly: `'utf8'`.
Questions for Reflection
- When is Buffer.allocUnsafe() preferable over Buffer.alloc()? In what situations does the speed gain justify the risk of data leakage?
- Why do JWT tokens use Base64URL instead of plain Base64? What happens with the '+' and '/' characters in a URL?
- How is parsing of multipart/form-data (file upload via HTTP) typically implemented? How are boundaries between parts located using Buffer.indexOf()?