AI Engineering
Local LLM: Ollama, llama.cpp, vLLM - Running Models on Self-Hosted Hardware
Lesson Objectives
- Understand when self-hosted LLM is justified: privacy, cost breakeven, latency
- Set up Ollama and integrate with NestJS via the OpenAI-compatible API
- Distinguish engines: Ollama (dev), llama.cpp (edge), vLLM (production)
- Understand continuous batching - why vLLM is 2-4x faster under concurrency
- Choose quantization (GGUF Q4_K_M vs AWQ vs GPTQ) for specific hardware
`ollama run llama3` - one command, model running locally. No API keys, no bills, no data leaving the machine. Q4_K_M quantization fits 70B parameters into a single A100. vLLM with continuous batching squeezes throughput from one card that used to require a cluster. Apple MLX runs Llama 8B on a MacBook Pro at 60 tok/s. Self-hosted LLM is no longer a hobby - it's production infrastructure at Cloudflare, Discord, and Apple.
- Cloudflare Workers AI: Llama 3 on edge servers in 300+ locations, latency <50ms globally
- Discord: self-hosted models for content moderation on 100M+ messages per day - without sending data to OpenAI
- Apple Intelligence: an on-device LLM with about 3B parameters running directly on iPhone - complete privacy, no cloud
- Ollama: 5M+ downloads, became the standard for local LLM development in 2 years
- vLLM from UC Berkeley: 2-4x throughput vs Ollama under concurrent load, production at dozens of companies
The Evening That Brought LLMs to the Laptop
In **early 2023**, Meta's LLaMA weights leaked, and a powerful model was suddenly in the wild - but running it meant PyTorch, CUDA and multi-GPU rigs. In **March 2023**, Georgi Gerganov released **llama.cpp**, a pure C/C++ implementation with no dependencies; his stated goal was to run the model with 4-bit quantization on a MacBook. It worked, and the local-inference movement took off. The project later introduced the **GGUF** single-file model format (August 2023) to package weights for easy distribution. On top of that engine, **Ollama** (2023) wrapped everything in a single binary with a model registry and an OpenAI-compatible API, so a developer could go from nothing to a running model with `ollama run`. Serious LLMs no longer required a data center.
Prerequisites
Why Run LLMs Locally: Privacy, Cost, Latency
Every token you send to OpenAI goes to their servers. The same is true of Anthropic and Google. For most tasks that's fine. But some scenarios leave little room for negotiation: medical records (HIPAA), financial transactions (PCI DSS), proprietary source code, government systems. In those cases, a self-hosted LLM is often not an option but a requirement.
- **Healthcare (HIPAA)** - medical data cannot be sent without a BAA
- **Finance (PCI DSS)** - card data requires strict control
- **Government / Defense** - classified information cannot leave the perimeter
- **EU (GDPR / AI Act)** - data residency, transparency
- **Corporate IP** - source code, patents, internal documents
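The second driver is economics. GPT-4o at real scale is not pocket change: per-token pricing grows linearly with traffic, while a GPU server costs the same whether it is idle or saturated.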
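The third reason is latency. A cloud API endpoint may sit in another region or on another continent, and every request pays for that network round trip. A server on the same network gives TTFT (time to first token) of 20-50ms versus 200-800ms from the cloud. In streaming UIs, that difference is directly perceptible.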
| Setup | TTFT | Throughput | Notes |
|---|---|---|---|
| GPT-4o API | 200-800ms | 50-80 tok/s | Depends on OpenAI load |
| GPT-4o-mini API | 100-400ms | 80-120 tok/s | Also varies |
| Self-hosted 8B (A100) | 20-50ms | 100-150 tok/s | Stable |
| Self-hosted 8B (RTX 4090) | 30-80ms | 60-100 tok/s | Consumer GPU |
| Self-hosted 8B (M3 Pro) | 50-100ms | 40-60 tok/s | Dev, testing |
| Self-hosted 70B (2xA100) | 50-100ms | 30-50 tok/s | Production |
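Both metrics in the table can be derived from timestamps recorded while reading a streamed response. The sketch below is a minimal illustration, not a benchmark harness: `summarizeStream` and `StreamStats` are hypothetical names, and it assumes the caller stores one millisecond timestamp per received token.

```typescript
// Hypothetical helper: derives TTFT and decode throughput from timestamps
// (in milliseconds) captured while reading a streamed completion.
interface StreamStats {
  ttftMs: number;       // time from sending the request to the first token
  tokensPerSec: number; // generation speed after the first token arrived
}

function summarizeStream(requestSentMs: number, tokenTimesMs: number[]): StreamStats {
  if (tokenTimesMs.length === 0) {
    throw new Error("no tokens received");
  }
  const first = Math.min(...tokenTimesMs);
  const last = Math.max(...tokenTimesMs);
  const ttftMs = first - requestSentMs;
  // Throughput counts the tokens generated after the first one,
  // over the time between the first and the last token.
  const decodeSeconds = (last - first) / 1000;
  const tokensPerSec = decodeSeconds > 0 ? (tokenTimesMs.length - 1) / decodeSeconds : 0;
  return { ttftMs, tokensPerSec };
}

// Example: request sent at t=0, first token at 40 ms, then one token every 10 ms.
const stats = summarizeStream(0, [40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140]);
```

For this synthetic trace, `stats.ttftMs` is 40 and `stats.tokensPerSec` comes out at roughly 100. Keeping the two numbers separate matters: TTFT is dominated by network and prompt processing, while throughput reflects the decode speed of the hardware.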
**Self-hosted is not free.** GPU servers cost USD 1-10/hr + DevOps. For a small project (<10K req/day), cloud API is almost always cheaper.
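The breakeven point can be estimated with a few lines of arithmetic. This is a rough sketch under loud assumptions: the function names are hypothetical, the per-request price is back-derived from this lesson's example of 5,000 requests/day costing about USD 50/month, and the GPU price is the low end of the USD 1-10/hr range. None of these are provider quotes.

```typescript
// All prices below are illustrative assumptions, not quotes from any provider.
const DAYS_PER_MONTH = 30;
const HOURS_PER_MONTH = 24 * DAYS_PER_MONTH;

// Monthly API bill for a given traffic level and an average cost per request.
function monthlyApiCost(requestsPerDay: number, usdPerRequest: number): number {
  return requestsPerDay * DAYS_PER_MONTH * usdPerRequest;
}

// Monthly cost of one always-on GPU server plus a flat operations overhead.
function monthlySelfHostedCost(gpuUsdPerHour: number, opsUsdPerMonth: number = 0): number {
  return gpuUsdPerHour * HOURS_PER_MONTH + opsUsdPerMonth;
}

// Requests per day at which the API bill equals the self-hosted bill.
function breakevenRequestsPerDay(
  usdPerRequest: number,
  gpuUsdPerHour: number,
  opsUsdPerMonth: number = 0,
): number {
  return monthlySelfHostedCost(gpuUsdPerHour, opsUsdPerMonth) / (usdPerRequest * DAYS_PER_MONTH);
}

// Per-request price implied by "5,000 requests/day costs about USD 50/month".
const usdPerRequest = 50 / (5000 * DAYS_PER_MONTH);
// Assumed: one GPU at USD 1/hr and no DevOps cost at all (optimistic).
const breakeven = breakevenRequestsPerDay(usdPerRequest, 1);
```

With these inputs the breakeven lands at about 72,000 requests per day, and any DevOps overhead pushes it higher. A pricier cloud model raises the per-request cost and pulls the breakeven down, so the threshold depends heavily on which API is being replaced.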
Myth: a local model means a slow model
Reality: on modern hardware, a 7B model delivers 15-20 tok/s on a desktop CPU and 80-120 tok/s on an RTX 4090, and the GPU figure beats GPT-4o over the internet
Speed is determined by memory bandwidth, not 'locality'. A100 - 2 TB/s. RTX 4090 - 1 TB/s. Even Apple M3 Pro - 150-200 GB/s. A network roundtrip to OpenAI adds 100-500ms regardless of how fast their servers generate. For TTFT, local infrastructure wins almost every time.
A startup: 5,000 requests/day via GPT-4o-mini. Monthly bill ~USD 50. Should it switch to self-hosted?
Ollama: Setup, API, and Node.js Integration
**Ollama** downloads GGUF weights, manages GPU memory, hot-swaps between models, and exposes an OpenAI-compatible HTTP server on port 11434. For development and for private data, that means no API keys, no per-token bills, and no data leaving the machine.
The key detail: Ollama exposes an OpenAI-compatible API. That means almost no code changes. Only `baseURL` (and the model name) switches:
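A minimal sketch of that switch, written against the raw HTTP endpoint so it has no dependencies. `buildChatRequest` and the `ChatRequest` type are hypothetical helpers, and the API key values are placeholders. With the official `openai` package, the same URL would go into the `baseURL` constructor option instead.

```typescript
// Base URLs: Ollama's OpenAI-compatible endpoint lives under /v1 on port 11434.
const OPENAI_BASE_URL = "https://api.openai.com/v1";
const OLLAMA_BASE_URL = "http://localhost:11434/v1";

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the same chat-completions request for either backend.
// Only baseURL (and the model name) differ between cloud and local.
function buildChatRequest(
  baseURL: string,
  apiKey: string,
  model: string,
  prompt: string,
): ChatRequest {
  return {
    url: `${baseURL}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Ollama ignores the API key, but OpenAI-style clients expect one to be set.
const localRequest = buildChatRequest(OLLAMA_BASE_URL, "ollama", "llama3", "Hello!");
// "sk-placeholder" is a dummy value; a real key would come from configuration.
const cloudRequest = buildChatRequest(OPENAI_BASE_URL, "sk-placeholder", "gpt-4o-mini", "Hello!");
```

Either object can be sent with `fetch(request.url, { method: request.method, headers: request.headers, body: request.body })` on Node 18+. Keeping the base URL in configuration lets the same code path serve local development and cloud production.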
**NestJS integration** with health check and fallback:
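One possible shape for such a service, kept dependency-free so it runs outside a NestJS project. Everything named here (`LlmRouterService`, `pickBackend`, `Backend`, the model names) is a hypothetical illustration; in a real NestJS app the class would carry `@Injectable()` and be registered as a provider. The health check assumes Ollama's `GET /api/tags` endpoint, which lists installed models and only answers when the server is up.

```typescript
// Minimal HTTP function shape so the sketch has no dependencies;
// on Node 18+ the global fetch can be passed in for this parameter.
type HttpGet = (url: string) => Promise<{ ok: boolean }>;

interface Backend {
  name: "local" | "cloud";
  baseURL: string;
  model: string;
}

// Hypothetical configuration; the model names are placeholders.
const LOCAL: Backend = { name: "local", baseURL: "http://localhost:11434/v1", model: "llama3" };
const CLOUD: Backend = { name: "cloud", baseURL: "https://api.openai.com/v1", model: "gpt-4o-mini" };

// Pure routing rule: use the local model when it is healthy, else fall back.
function pickBackend(localHealthy: boolean): Backend {
  return localHealthy ? LOCAL : CLOUD;
}

// In NestJS this class would be decorated with @Injectable().
class LlmRouterService {
  private readonly httpGet: HttpGet;
  private readonly ollamaHost: string;

  constructor(httpGet: HttpGet, ollamaHost: string = "http://localhost:11434") {
    this.httpGet = httpGet;
    this.ollamaHost = ollamaHost;
  }

  // Health check: any network error counts as "unhealthy".
  async isLocalHealthy(): Promise<boolean> {
    try {
      const res = await this.httpGet(`${this.ollamaHost}/api/tags`);
      return res.ok;
    } catch {
      return false; // connection refused, DNS failure, and similar
    }
  }

  // Decides which backend the next request should go to.
  async resolveBackend(): Promise<Backend> {
    return pickBackend(await this.isLocalHealthy());
  }
}
```

Usage would look like `new LlmRouterService(fetch).resolveBackend()`, with the returned `baseURL` and `model` fed into the chat client. A production version would likely add a timeout to the health check and cache its result for a few seconds rather than probing on every request.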
**Apple Silicon:** Ollama uses the Metal API. M3 Pro: ~50-70 tok/s for 8B. Unified memory (32-128 GB) allows loading even 70B models. Apple MLX is an alternative framework from Apple that gives another 20-30% speed boost on M-chips.
Docker-compose: Ollama + NestJS. On startup, Ollama hasn't loaded the model yet. What happens with the first requests?
llama.cpp: Inference on CPU and Edge Devices
Ollama is a wrapper. Inside it runs **llama.cpp** - a C/C++ engine written by Georgi Gerganov over a weekend in 2023. The goal was simple: run LLaMA on a MacBook without a GPU. The result was a project with 60K+ GitHub stars and the core of half the local LLM ecosystem.
The key feature of llama.cpp: it runs **on CPU without a GPU** and supports ARM (Raspberry Pi, Android), Apple Metal, and Vulkan. Where there's no NVIDIA GPU, llama.cpp is often the most practical option.
- **Edge deployment** - Raspberry Pi, embedded, mobile
- **Maximum control** - batch sizes, threads, KV-cache
- **C/C++ integration** - direct calls without HTTP overhead
- **Custom builds** - SIMD optimizations for a specific CPU
| Device | RAM | Prompt (tok/s) | Generation (tok/s) | Suitability |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | 8 GB | 2-3 | 3-5 | IoT demo |
| Intel i7-12700 | 32 GB | 25-35 | 15-20 | Development |
| Apple M3 Pro | 36 GB | 80-120 | 50-70 | Dev + small prod |
| RTX 4090 | 24 GB | 200-300 | 80-120 | Production, single user |
| A100 80GB | 80 GB | 300-500 | 100-150 | Production, multi-user |
**GGUF format** is a binary format for quantized models developed for llama.cpp. It contains metadata and weights in a single file. LM Studio, Ollama, Jan - all use GGUF under the hood. Hugging Face stores thousands of ready-made GGUF files.
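The "metadata and weights in a single file" layout starts with a small fixed header that is easy to inspect. The sketch below assumes the GGUF v2/v3 layout (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian); `parseGgufHeader` and `GgufHeader` are hypothetical names, and the demo bytes are synthetic rather than taken from a real model.

```typescript
interface GgufHeader {
  version: number;
  tensorCount: number;
  metadataKvCount: number;
}

// Reads a little-endian uint64 as a JS number (exact while the value is below 2^53).
function readUint64LE(view: DataView, offset: number): number {
  const low = view.getUint32(offset, true);
  const high = view.getUint32(offset + 4, true);
  return high * 4294967296 + low;
}

// Parses the fixed 24-byte header at the start of a GGUF file.
function parseGgufHeader(bytes: Uint8Array): GgufHeader {
  if (bytes.length < 24) {
    throw new Error("file too short for a GGUF header");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3),
  );
  if (magic !== "GGUF") {
    throw new Error("not a GGUF file");
  }
  return {
    version: view.getUint32(4, true),
    tensorCount: readUint64LE(view, 8),
    metadataKvCount: readUint64LE(view, 16),
  };
}

// Synthetic header for illustration: version 3, 291 tensors, 24 metadata entries.
const demoBytes = new Uint8Array(24);
demoBytes.set([0x47, 0x47, 0x55, 0x46], 0); // ASCII "GGUF"
const demoView = new DataView(demoBytes.buffer);
demoView.setUint32(4, 3, true);
demoView.setUint32(8, 291, true);
demoView.setUint32(16, 24, true);
const header = parseGgufHeader(demoBytes);
```

The metadata key-value pairs (architecture, context length, tokenizer, quantization type) follow this header, and the tensor data comes after them. That is why a single `.gguf` file is enough for a runtime to load a model without side-car config files.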
An LLM is needed on a Raspberry Pi 5 (8GB) for IoT. What tool and model?
vLLM: Production Serving with Continuous Batching
Ollama is tuned for a single developer, not for heavy concurrency. When fifty simultaneous requests arrive, most of them wait in a queue, and the GPU stays underused while a few generations finish before the next ones begin. **vLLM** solves this differently.
Two key innovations: **PagedAttention** (KV-cache managed as virtual memory pages, no fragmentation) and **continuous batching** - dynamically inserting new requests into an already-running batch:
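A toy simulation makes the scheduling difference concrete. This is a deliberately simplified model, not vLLM's actual scheduler: one "step" generates one token for every running request, prefill and memory limits are ignored, and both function names are hypothetical.

```typescript
// Static batching: a batch runs until its longest request finishes,
// and only then is the next batch admitted.
function staticBatchingSteps(lengths: number[], slots: number): number {
  if (slots < 1) throw new Error("slots must be at least 1");
  let steps = 0;
  for (let i = 0; i < lengths.length; i += slots) {
    steps += Math.max(...lengths.slice(i, i + slots));
  }
  return steps;
}

// Continuous batching: as soon as a request finishes, its slot is
// handed to the next waiting request without draining the batch.
function continuousBatchingSteps(lengths: number[], slots: number): number {
  if (slots < 1) throw new Error("slots must be at least 1");
  let running: number[] = []; // remaining tokens per active request
  let next = 0;               // index of the next waiting request
  let steps = 0;
  while (next < lengths.length || running.length > 0) {
    // Fill any free slots from the waiting queue.
    const admitted = lengths.slice(next, next + (slots - running.length));
    running = running.concat(admitted);
    next += admitted.length;
    // One decode step: every running request emits one token.
    steps += 1;
    running = running.map((remaining) => remaining - 1).filter((remaining) => remaining > 0);
  }
  return steps;
}

// One long request (8 tokens) and three short ones (2 tokens), two slots.
const workload = [8, 2, 2, 2];
const staticSteps = staticBatchingSteps(workload, 2);
const continuousSteps = continuousBatchingSteps(workload, 2);
```

For this workload the static schedule needs 10 steps and the continuous one needs 8, because the short requests reuse the slot next to the long request instead of waiting for it. The gap widens as output lengths become more uneven, which is the typical case for chat traffic.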
| Feature | Ollama | llama.cpp | vLLM | TGI |
|---|---|---|---|---|
| Setup | Minimal | Medium | Medium | Medium |
| GPU required | No | No | NVIDIA GPU | NVIDIA GPU |
| Continuous batching | No | Server mode | Yes | Yes |
| Throughput (multi-user) | Low | Low | High (2-4x) | High |
| Quantization | GGUF | GGUF | AWQ, GPTQ, FP8 | AWQ, GPTQ |
| Best for | Dev | Edge/IoT | Production | Production |
A service: 50 concurrent requests to Llama 8B. Ollama: avg latency 800ms. What happens when switching to vLLM?
Quantization: GGUF, AWQ, GPTQ - Compressing Models
Llama 8B in float16 weighs 16 GB. Llama 70B - 140 GB. An RTX 4090 has 24 GB of VRAM. Simple math: most interesting models physically don't fit in a consumer GPU. **Quantization** is the engineering answer to this constraint.
The idea is simple: store 4 or 8 bits per weight instead of 16. Precision is lost, but neural networks are surprisingly resilient to this - proper quantization gives 1-3% quality loss at 75% memory savings:
| Format | Engine | Method | When to Use |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | Post-training, CPU-friendly | CPU, macOS, edge, Ollama |
| GPTQ | vLLM, TGI, HuggingFace | Post-training, GPU | NVIDIA GPU production |
| AWQ | vLLM, TGI | Activation-aware, GPU | NVIDIA GPU (slightly better than GPTQ) |
| FP8 | vLLM, TensorRT-LLM | 8-bit float | Modern GPUs (H100, Ada) |
| BitsAndBytes | HuggingFace | On-the-fly | QLoRA fine-tuning |
Inside GGUF there are gradations. Q4_K_M means 4-bit quantization, the K-quant method (which better preserves important weights), and the M size (Medium, a balance between quality and size). This is the gold standard for almost all tasks:
| Model | FP16 | Q8 (GGUF) | Q4_K_M (GGUF) | AWQ 4-bit | Min GPU |
|---|---|---|---|---|---|
| Phi-3 3.8B | 7.6 GB | 4.0 GB | 2.3 GB | 2.5 GB | RTX 3060 12GB |
| Llama 8B | 16 GB | 8.5 GB | 4.7 GB | 5.0 GB | RTX 4060 Ti 16GB |
| Llama 70B | 140 GB | 74 GB | 40 GB | 38 GB | 2x A100 80GB |
| Mixtral 8x7B | 94 GB | 50 GB | 27 GB | 25 GB | A100 80GB |
| Llama 405B | 810 GB | 428 GB | 237 GB | 220 GB | 8x A100 80GB |
**Quantization doesn't make a model smarter.** If a model can't handle a task in FP16, the quantized version won't either. Quantization is about deployment (less VRAM, faster), not about quality.
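The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below is an approximation under stated assumptions: the function names are hypothetical, the effective bits per weight for Q8 and Q4_K_M (8.5 and 4.7) are assumed values chosen to land near the table's figures, and the 2 GB headroom for KV cache and runtime overhead is a placeholder.

```typescript
// Approximate weight size in GB: parameters (in billions) x bits per weight / 8.
// Ignores the KV cache and runtime overhead, which need extra memory on top.
function weightSizeGB(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

// Assumed effective bits per weight. Quantized formats store scales alongside
// the weights, so they average slightly more than their nominal bit width.
const BITS_PER_WEIGHT = {
  fp16: 16,
  q8: 8.5,   // assumption
  q4km: 4.7, // assumption, chosen to roughly match the table
};

// Rough fit check: weights plus a fixed headroom must fit into VRAM.
function fitsInVram(
  paramsBillions: number,
  bitsPerWeight: number,
  vramGB: number,
  headroomGB: number = 2, // hypothetical allowance for KV cache and buffers
): boolean {
  return weightSizeGB(paramsBillions, bitsPerWeight) + headroomGB <= vramGB;
}

const llama8bFp16 = weightSizeGB(8, BITS_PER_WEIGHT.fp16);  // 16 GB
const llama8bQ4 = weightSizeGB(8, BITS_PER_WEIGHT.q4km);    // about 4.7 GB
const llama70bQ4 = weightSizeGB(70, BITS_PER_WEIGHT.q4km);  // about 41 GB
const fits70bOn4090 = fitsInVram(70, BITS_PER_WEIGHT.q4km, 24);
const fits70bOnA100 = fitsInVram(70, BITS_PER_WEIGHT.q4km, 80);
```

Under these assumptions a 70B model at Q4_K_M does not fit a 24 GB RTX 4090 but does fit a single 80 GB A100. Real headroom depends on context length and batch size, so long contexts or many concurrent users need a larger allowance than the placeholder used here.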
Key Takeaways
- Self-hosted is justified when compliance demands it (HIPAA/GDPR), traffic exceeds 30-50K req/day, or TTFT must stay under 100ms
- Ollama: `ollama run llama3` - 2 minutes from zero to a running LLM with an OpenAI-compatible API
- llama.cpp: the C/C++ core under Ollama, and often the most practical option for CPU/ARM/edge without NVIDIA
- vLLM: continuous batching + PagedAttention = 2-4x throughput under concurrent load vs Ollama
- Q4_K_M is the gold standard: 75% memory savings, 1-3% quality loss, 7B model = ~4-5 GB VRAM
- Apple MLX: on M-chips, a further 20-30% speed boost over Ollama on the same hardware
What's Next
Running a model is the first step. Production requires serving infrastructure.
- Model Serving — TGI, vLLM in Docker, GPU autoscaling, monitoring
- Fine-tuning — A fine-tuned model is deployed via Ollama/vLLM
- Distillation — Distilled model → GGUF → Ollama
Related Lessons
- aie-03-llm-fundamentals — Local serving needs model internals knowledge
- aie-37-open-source-models — Local inference needs open weights
- aie-40-model-serving — Local models scale via serving infrastructure
- aie-38-distillation — Distilled models fit consumer hardware
- ml-46-model-serving — Same self-hosted inference engineering
- sd-03-scalability