AI Engineering

Local LLM: Ollama, llama.cpp, vLLM - Running Models on Self-Hosted Hardware

Lesson Objectives

Understand when a self-hosted LLM is justified: privacy, cost breakeven, latency
Set up Ollama and integrate with NestJS via the OpenAI-compatible API
Distinguish engines: Ollama (dev), llama.cpp (edge), vLLM (production)
Understand continuous batching - why vLLM is 2-4x faster under concurrency
Choose quantization (GGUF Q4_K_M vs AWQ vs GPTQ) for specific hardware

`ollama run llama3` - one command, and a model is running locally. No API keys, no bills, no data leaving the machine. Q4_K_M quantization shrinks a 70B-parameter model to about 40 GB, which fits on a single 80 GB A100. vLLM with continuous batching squeezes throughput out of one card that used to require a cluster. Apple MLX runs Llama 8B on a MacBook Pro at 60 tok/s. Self-hosted LLMs are no longer a hobby - they are production infrastructure at Cloudflare, Discord, and Apple.

Cloudflare Workers AI: Llama 3 on edge servers in 300+ locations, latency <50ms globally
Discord: self-hosted models for content moderation on 100M+ messages per day - without sending data to OpenAI
Apple Intelligence: on-device LLM 3B parameters directly on iPhone - complete privacy, no cloud
Ollama: 5M+ downloads, became the standard for local LLM development in 2 years
vLLM from UC Berkeley: 2-4x throughput vs Ollama under concurrent load, production at dozens of companies

The Evening That Brought LLMs to the Laptop

In **early 2023**, Meta's LLaMA weights leaked, and a powerful model was suddenly in the wild - but running it meant PyTorch, CUDA and multi-GPU rigs. In **March 2023**, Georgi Gerganov released **llama.cpp**, a pure C/C++ implementation with no dependencies; his stated goal was to run the model with 4-bit quantization on a MacBook. It worked, and the local-inference movement took off. The project later introduced the **GGUF** single-file model format (August 2023) to package weights for easy distribution. On top of that engine, **Ollama** (2023) wrapped everything in a single binary with a model registry and an OpenAI-compatible API, so a developer could go from nothing to a running model with `ollama run`. Serious LLMs no longer required a data center.

Prerequisites

How LLMs Work: Tokens, Embeddings, Attention

Why Run LLMs Locally: Privacy, Cost, Latency

Every call to the OpenAI API sends your tokens to their servers. The same is true of Anthropic and Google. For most tasks that's fine. But some scenarios leave no room for negotiation: medical records (HIPAA), financial transactions (PCI DSS), proprietary source code, government systems. In those cases, a self-hosted LLM isn't an option - it's a requirement.

**Healthcare (HIPAA)** - medical data cannot be sent to a third-party provider without a BAA (Business Associate Agreement)
**Finance (PCI DSS)** - card data requires strict control
**Government / Defense** - classified information cannot leave the perimeter
**EU (GDPR / AI Act)** - data residency, transparency
**Corporate IP** - source code, patents, internal documents

The second driver is economics: GPT-4o at real scale is not pocket change.

The third reason is latency. A cloud API may be hosted on another continent, and every request pays for the network round trip. A server on the same network gives a time to first token (TTFT) of 20-50ms versus 200-800ms from the cloud. In streaming UIs, users notice that difference.

| Setup | TTFT | Throughput | Notes |
|---|---|---|---|
| GPT-4o API | 200-800ms | 50-80 tok/s | Depends on OpenAI load |
| GPT-4o-mini API | 100-400ms | 80-120 tok/s | Also varies |
| Self-hosted 8B (A100) | 20-50ms | 100-150 tok/s | Stable |
| Self-hosted 8B (RTX 4090) | 30-80ms | 60-100 tok/s | Consumer GPU |
| Self-hosted 8B (M3 Pro) | 50-100ms | 40-60 tok/s | Dev, testing |
| Self-hosted 70B (2xA100) | 50-100ms | 30-50 tok/s | Production |

**Self-hosted is not free.** GPU servers cost USD 1-10/hr + DevOps. For a small project (<10K req/day), cloud API is almost always cheaper.

Myth: a local model means a slow model

Reality: on modern hardware, a 7B model delivers roughly 15-20 tok/s on a desktop CPU and 80-120 tok/s on an RTX 4090 - the GPU figure matches or beats GPT-4o over the internet

Generation speed is bounded by memory bandwidth, not by 'locality': A100 - about 2 TB/s, RTX 4090 - about 1 TB/s, and even Apple M3 Pro - 150-200 GB/s. A network round trip to OpenAI adds 100-500ms regardless of how fast their servers generate, so for TTFT local infrastructure wins almost every time.
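The bandwidth argument can be turned into a rough rule of thumb. The sketch below assumes single-stream decoding, where each generated token requires reading all model weights from memory once, so bandwidth divided by model size gives an upper bound on tok/s. The function name is illustrative, and the 4.7 GB figure is the Llama 8B Q4_K_M size quoted later in this lesson; KV-cache reads and compute are ignored, so real numbers land below the bound.

```typescript
// Upper bound on single-stream decode speed, assuming every generated token
// requires streaming all weights from memory exactly once.
function maxDecodeTokensPerSecond(bandwidthGBps: number, modelSizeGB: number): number {
  if (bandwidthGBps <= 0 || modelSizeGB <= 0) {
    throw new Error("bandwidth and model size must be positive");
  }
  return bandwidthGBps / modelSizeGB;
}

// Llama 8B in Q4_K_M (about 4.7 GB of weights, per this lesson's size table).
const llama8bQ4GB = 4.7;

// Bandwidth figures from the text above: A100 ~2 TB/s, RTX 4090 ~1 TB/s.
const a100Bound = maxDecodeTokensPerSecond(2000, llama8bQ4GB);
const rtx4090Bound = maxDecodeTokensPerSecond(1000, llama8bQ4GB);

console.log(`A100 upper bound:     ${a100Bound.toFixed(0)} tok/s`);
console.log(`RTX 4090 upper bound: ${rtx4090Bound.toFixed(0)} tok/s`);
```

The throughput ranges quoted in this lesson for these cards (100-150 and 80-120 tok/s) sit below these ceilings, which is what a bandwidth-bound workload with real-world overhead would look like.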

A startup: 5,000 requests/day via GPT-4o-mini. Monthly bill ~USD 50. Should it switch to self-hosted?
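The question above can be checked with back-of-the-envelope arithmetic. This sketch uses only the figures from the lesson (USD 50/month for 5,000 req/day, GPU servers from USD 1/hr) and assumes a 30-day month, a constant per-request API price, and a GPU that runs around the clock; DevOps cost is left out, so the real breakeven is higher. Function names are illustrative.

```typescript
const HOURS_PER_MONTH = 24 * 30; // assumption: 30-day month, GPU always on

// Monthly cost of renting one GPU server at a given hourly rate.
function monthlyGpuCostUSD(hourlyRateUSD: number): number {
  return hourlyRateUSD * HOURS_PER_MONTH;
}

// Daily request volume at which the API bill equals the GPU bill,
// assuming the API cost scales linearly with the number of requests.
function breakEvenRequestsPerDay(
  apiMonthlyUSD: number,
  apiRequestsPerDay: number,
  gpuMonthlyUSD: number,
): number {
  return apiRequestsPerDay * (gpuMonthlyUSD / apiMonthlyUSD);
}

const apiBillUSD = 50;                    // from the scenario above
const cheapestGpu = monthlyGpuCostUSD(1); // low end of the USD 1-10/hr range
const breakEven = breakEvenRequestsPerDay(apiBillUSD, 5000, cheapestGpu);

console.log(`API: USD ${apiBillUSD}/month, cheapest GPU: USD ${cheapestGpu}/month`);
console.log(`Breakeven at roughly ${Math.round(breakEven)} req/day`);
```

Under these assumptions the cheapest GPU server costs many times the API bill, so at 5,000 req/day the cloud API is the cheaper choice unless compliance or latency forces self-hosting.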

Ollama: Setup, API, and Node.js Integration

**Ollama** turns local inference into a single command: `ollama run llama3`. It handles downloading GGUF weights, managing GPU memory, and hot-swapping between models, and it exposes an OpenAI-compatible HTTP server on port 11434. For development and for private data, this changes everything.

The key detail: Ollama exposes an OpenAI-compatible API under `/v1`. Existing OpenAI client code needs almost no changes: only `baseURL` switches (most clients still require an API key, but Ollama ignores its value).
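A minimal sketch of the `baseURL` switch, using plain `fetch` (Node 18+) instead of a client library so that it stays self-contained. The `LlmEndpoint` type, the helper names, and the placeholder API keys are illustrative assumptions; the `/v1/chat/completions` path and the request body follow the OpenAI chat completions format that Ollama mirrors.

```typescript
interface LlmEndpoint {
  baseURL: string;
  apiKey: string;
  model: string;
}

// The two configurations differ only in data, not in code.
const OPENAI: LlmEndpoint = {
  baseURL: "https://api.openai.com/v1",
  apiKey: "<OPENAI_API_KEY>", // placeholder, load from configuration in real code
  model: "gpt-4o-mini",
};

const OLLAMA: LlmEndpoint = {
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // any non-empty string, Ollama does not check it
  model: "llama3",
};

// Builds an OpenAI-style chat completion request for either endpoint.
function buildChatRequest(endpoint: LlmEndpoint, prompt: string) {
  return {
    url: `${endpoint.baseURL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${endpoint.apiKey}`,
      },
      body: JSON.stringify({
        model: endpoint.model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Sends the request and returns the first choice's text.
async function chat(endpoint: LlmEndpoint, prompt: string): Promise<string> {
  const { url, init } = buildChatRequest(endpoint, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`LLM request failed: HTTP ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

Calling `chat(OLLAMA, "Hello")` instead of `chat(OPENAI, "Hello")` is the whole migration in this sketch; the same idea applies when an OpenAI SDK client is constructed with a different `baseURL`.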

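**NestJS integration** usually adds two things on top of the client: a health check against the Ollama server and a fallback path for when the local server is unavailable.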
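One way the health check and fallback could look is sketched below as a plain TypeScript class, so that it runs without the NestJS packages; in a NestJS application it would typically be registered as an injectable provider. The class, type, and endpoint names are hypothetical. The check calls Ollama's `GET /api/tags` endpoint, which lists installed models and therefore answers only when the server is up.

```typescript
interface LlmEndpoint {
  name: string;
  baseURL: string;
  model: string;
}

// Minimal fetch-like signature so the check can be exercised without a network.
type HealthFetch = (url: string) => Promise<{ ok: boolean }>;

// Pure decision: prefer the local server whenever it is healthy.
function chooseEndpoint(localHealthy: boolean, local: LlmEndpoint, fallback: LlmEndpoint): LlmEndpoint {
  return localHealthy ? local : fallback;
}

class LlmEndpointSelector {
  private readonly local: LlmEndpoint;
  private readonly fallback: LlmEndpoint;
  private readonly ollamaRoot: string;
  private readonly fetchFn: HealthFetch;
  private readonly timeoutMs: number;

  constructor(
    local: LlmEndpoint,
    fallback: LlmEndpoint,
    ollamaRoot: string, // e.g. "http://localhost:11434"
    fetchFn: HealthFetch,
    timeoutMs = 2000,
  ) {
    this.local = local;
    this.fallback = fallback;
    this.ollamaRoot = ollamaRoot;
    this.fetchFn = fetchFn;
    this.timeoutMs = timeoutMs;
  }

  // Healthy means: the server answered /api/tags with a 2xx status in time.
  async isLocalHealthy(): Promise<boolean> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("health check timed out")), this.timeoutMs);
    });
    try {
      const res = await Promise.race([this.fetchFn(`${this.ollamaRoot}/api/tags`), timeout]);
      return res.ok;
    } catch {
      return false; // connection refused, DNS failure, or timeout
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }

  // Picks the local endpoint when healthy, the fallback otherwise.
  async pick(): Promise<LlmEndpoint> {
    return chooseEndpoint(await this.isLocalHealthy(), this.local, this.fallback);
  }
}
```

In production the `fetchFn` argument would be something like `(url) => fetch(url)`. Note that a successful `/api/tags` response shows the server is reachable, not that a particular model is already loaded into memory, so the first real request may still be slow.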

**Apple Silicon:** Ollama uses the Metal API. M3 Pro: ~50-70 tok/s for 8B. Unified memory (32-128 GB) allows loading even 70B models. Apple MLX is an alternative framework from Apple that gives another 20-30% speed boost on M-chips.

Docker-compose: Ollama + NestJS. On startup, Ollama hasn't loaded the model yet. What happens with the first requests?

llama.cpp: Inference on CPU and Edge Devices

Ollama is a wrapper. Inside it runs **llama.cpp** - a C/C++ engine that Georgi Gerganov first hacked together in an evening in March 2023. The goal was simple: run LLaMA with 4-bit quantization on a MacBook, without a discrete GPU. The result is a project with 60K+ GitHub stars and the core of half the local LLM ecosystem.

The key feature of llama.cpp: it runs **on CPU without a GPU** and supports ARM (Raspberry Pi, Android), Apple Metal, and Vulkan. Where there is no NVIDIA hardware, llama.cpp is usually the first option to reach for.

**Edge deployment** - Raspberry Pi, embedded, mobile
**Maximum control** - batch sizes, threads, KV-cache
**C/C++ integration** - direct calls without HTTP overhead
**Custom builds** - SIMD optimizations for a specific CPU

| Device | RAM | Prompt (tok/s) | Generation (tok/s) | Suitability |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | 8 GB | 2-3 | 3-5 | IoT demo |
| Intel i7-12700 | 32 GB | 25-35 | 15-20 | Development |
| Apple M3 Pro | 36 GB | 80-120 | 50-70 | Dev + small prod |
| RTX 4090 | 24 GB | 200-300 | 80-120 | Production, single user |
| A100 80GB | 80 GB | 300-500 | 100-150 | Production, multi-user |

**GGUF format** is a binary format for quantized models developed for llama.cpp. It contains metadata and weights in a single file. LM Studio, Ollama, Jan - all use GGUF under the hood. Hugging Face stores thousands of ready-made GGUF files.
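To make "metadata and weights in a single file" concrete, here is a small sketch that reads the fixed-size header at the start of a GGUF file. It assumes the version 2 and later layout from the GGUF specification: a 4-byte `GGUF` magic, a little-endian 32-bit version, then 64-bit tensor and metadata key-value counts. The function and type names are illustrative; the metadata entries and tensor data that follow the header are not parsed.

```typescript
interface GgufHeader {
  version: number;
  tensorCount: number;
  metadataKvCount: number;
}

// Reads a little-endian uint64 as a JS number (exact for values below 2^53).
function readUint64LE(view: DataView, offset: number): number {
  const low = view.getUint32(offset, true);
  const high = view.getUint32(offset + 4, true);
  return high * 2 ** 32 + low;
}

// Parses the first 24 bytes of a GGUF file (layout assumed: GGUF v2 or later).
function parseGgufHeader(bytes: Uint8Array): GgufHeader {
  if (bytes.length < 24) {
    throw new Error("need at least 24 bytes for a GGUF header");
  }
  const magic = String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]);
  if (magic !== "GGUF") {
    throw new Error(`not a GGUF file (magic was "${magic}")`);
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: view.getUint32(4, true),
    tensorCount: readUint64LE(view, 8),
    metadataKvCount: readUint64LE(view, 16),
  };
}
```

In Node.js the input could come from reading the first 24 bytes of a downloaded `.gguf` file; the counts tell a loader how many metadata entries and tensor descriptors to expect next.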

An LLM is needed on a Raspberry Pi 5 (8GB) for IoT. What tool and model?

vLLM: Production Serving with Continuous Batching

Out of the box, Ollama is tuned for a single user and handles requests largely one at a time. When fifty simultaneous requests arrive, they queue up, and the GPU is underutilized because each decoding step serves only one sequence. **vLLM** takes a different approach.

Two key innovations: **PagedAttention** (the KV-cache is managed in fixed-size pages, like virtual memory, which avoids fragmentation) and **continuous batching** (new requests are inserted into an already-running batch as soon as a slot frees up, instead of waiting for the whole batch to finish).
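The effect of continuous batching can be illustrated with a toy simulation. This is not vLLM's scheduler: it assumes a fixed number of batch slots, one token per active request per decoding step, and no prefill cost, and all names are made up for the example. It compares static batching (a batch runs until its longest request finishes) with continuous batching (a freed slot is refilled immediately).

```typescript
// Static batching: requests are grouped in arrival order, and each group
// occupies the GPU until its longest request has finished.
function staticBatchingSteps(outputLengths: number[], slots: number): number {
  let steps = 0;
  for (let i = 0; i < outputLengths.length; i += slots) {
    const batch = outputLengths.slice(i, i + slots);
    steps += Math.max(...batch);
  }
  return steps;
}

// Continuous batching: after every decoding step, finished requests leave
// the batch and waiting requests take the free slots right away.
function continuousBatchingSteps(outputLengths: number[], slots: number): number {
  const queue = [...outputLengths];
  const active: number[] = []; // tokens still to generate per occupied slot
  let steps = 0;
  while (queue.length > 0 || active.length > 0) {
    while (active.length < slots && queue.length > 0) {
      active.push(queue.shift()!);
    }
    steps += 1; // one decoding step produces one token per active request
    for (let i = active.length - 1; i >= 0; i--) {
      active[i] -= 1;
      if (active[i] <= 0) active.splice(i, 1);
    }
  }
  return steps;
}

// A mix of long and short answers, served with 2 batch slots.
const lengths = [8, 2, 2, 2, 8, 2, 2, 2];
console.log("static:    ", staticBatchingSteps(lengths, 2), "steps");
console.log("continuous:", continuousBatchingSteps(lengths, 2), "steps");
```

With this workload the short requests no longer wait for the long ones, so the same tokens are produced in fewer decoding steps; the gap grows as output lengths become more uneven.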

| Feature | Ollama | llama.cpp | vLLM | TGI |
|---|---|---|---|---|
| Setup | Minimal | Medium | Medium | Medium |
| GPU required | No | No | NVIDIA GPU | NVIDIA GPU |
| Continuous batching | No | Partial (server mode) | Yes | Yes |
| Throughput (multi-user) | Low | Low | High (2-4x) | High |
| Quantization | GGUF | GGUF | AWQ, GPTQ, FP8 | AWQ, GPTQ |
| Best for | Dev | Edge/IoT | Production | Production |

A service: 50 concurrent requests to Llama 8B. Ollama: avg latency 800ms. What happens when switching to vLLM?

Quantization: GGUF, AWQ, GPTQ - Compressing Models

Llama 8B in float16 weighs 16 GB. Llama 70B - 140 GB. An RTX 4090 has 24 GB of VRAM. Simple math: most interesting models physically don't fit in a consumer GPU. **Quantization** is the engineering answer to this constraint.
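The "simple math" is parameters times bits per weight, divided by eight. The sketch below reproduces the float16 figures from the paragraph above; the function names are illustrative, and the estimate covers weights only (no KV-cache or activations).

```typescript
// Weight memory in GB for a model with the given parameter count (in billions).
// 1 billion parameters at 8 bits per weight is 1 GB.
function weightSizeGB(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

// Fraction of memory saved when going from one bit width to another.
function memorySavings(fromBits: number, toBits: number): number {
  return 1 - toBits / fromBits;
}

const rtx4090VramGB = 24;

console.log("Llama 8B  FP16:", weightSizeGB(8, 16), "GB");  // 16 GB
console.log("Llama 70B FP16:", weightSizeGB(70, 16), "GB"); // 140 GB
console.log("Llama 70B 4-bit:", weightSizeGB(70, 4), "GB"); // 35 GB
console.log("70B FP16 fits in an RTX 4090:", weightSizeGB(70, 16) <= rtx4090VramGB);
console.log("Savings 16-bit -> 4-bit:", memorySavings(16, 4)); // 0.75
```

Real 4-bit files come out somewhat larger than this ideal figure: the lesson's size table lists 40 GB for Llama 70B in Q4_K_M rather than 35 GB, which is consistent with the format storing scaling factors and keeping some tensors at higher precision.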

The idea is simple: store 4 or 8 bits per weight instead of 16. Precision is lost, but neural networks are surprisingly resilient to this: well-done 4-bit quantization costs roughly 1-3% in quality for 75% memory savings. The main formats:

| Format | Engine | Method | When to Use |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | Post-training, CPU-friendly | CPU, macOS, edge, Ollama |
| GPTQ | vLLM, TGI, HuggingFace | Post-training, GPU | NVIDIA GPU production |
| AWQ | vLLM, TGI | Activation-aware, GPU | NVIDIA GPU (slightly better than GPTQ) |
| FP8 | vLLM, TensorRT-LLM | 8-bit float | Modern GPUs (H100, Ada) |
| BitsAndBytes | HuggingFace | On-the-fly | QLoRA fine-tuning |

Inside GGUF there are gradations. Q4_K_M means: 4-bit quantization, the K-quant method (which better preserves important weights), and size M (Medium, a balance between quality and size). This is the gold standard for most tasks:

| Model | FP16 | Q8 (GGUF) | Q4_K_M (GGUF) | AWQ 4-bit | Min GPU |
|---|---|---|---|---|---|
| Phi-3 3.8B | 7.6 GB | 4.0 GB | 2.3 GB | 2.5 GB | RTX 3060 12GB |
| Llama 8B | 16 GB | 8.5 GB | 4.7 GB | 5.0 GB | RTX 4060 Ti 16GB |
| Llama 70B | 140 GB | 74 GB | 40 GB | 38 GB | 2x A100 80GB |
| Mixtral 8x7B | 94 GB | 50 GB | 27 GB | 25 GB | A100 80GB |
| Llama 405B | 810 GB | 428 GB | 237 GB | 220 GB | 8x A100 80GB |
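The table can be used as a simple lookup when choosing a quantization for a given card. The sketch below takes the Llama 8B row and picks the largest variant that fits into the available VRAM. The 20% headroom for KV-cache and runtime overhead is an assumption for illustration, not a figure from this lesson, and the names are hypothetical.

```typescript
interface QuantOption {
  name: string;
  sizeGB: number;
}

// Llama 8B sizes taken from the table above.
const LLAMA_8B_OPTIONS: QuantOption[] = [
  { name: "FP16", sizeGB: 16 },
  { name: "Q8 (GGUF)", sizeGB: 8.5 },
  { name: "Q4_K_M (GGUF)", sizeGB: 4.7 },
];

// Returns the largest (least compressed) option that fits into VRAM with
// headroom left for KV-cache and runtime overhead, or null if none fits.
function pickQuantization(
  options: QuantOption[],
  vramGB: number,
  headroom = 0.2, // assumed safety margin, tune for context length and batch size
): QuantOption | null {
  const largestFirst = [...options].sort((a, b) => b.sizeGB - a.sizeGB);
  for (const option of largestFirst) {
    if (option.sizeGB * (1 + headroom) <= vramGB) {
      return option;
    }
  }
  return null;
}

for (const vram of [24, 12, 8, 4]) {
  const choice = pickQuantization(LLAMA_8B_OPTIONS, vram);
  console.log(`${vram} GB VRAM ->`, choice ? choice.name : "does not fit");
}
```

Preferring the largest variant that fits reflects the trade-off described above: quantization saves memory at some cost in quality, so there is little reason to compress further than the hardware requires.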

**Quantization doesn't make a model smarter.** If a model can't handle a task in FP16, the quantized version won't either. Quantization is about deployment (less VRAM, faster), not about quality.

Key Takeaways

Self-hosting is justified when there is a compliance requirement (HIPAA/GDPR), traffic above 30-50K req/day, or a TTFT requirement below 100ms
Ollama: `ollama run llama3` - 2 minutes from zero to a running LLM with an OpenAI-compatible API
llama.cpp: the C/C++ core under Ollama and the default choice for CPU/ARM/edge without NVIDIA
vLLM: continuous batching + PagedAttention = 2-4x throughput under concurrent load vs Ollama
Q4_K_M is the gold standard: 75% memory savings, 1-3% quality loss, and a 7B model needs ~4-5 GB of VRAM
Apple MLX - for M-chips gives +20-30% speed boost over Ollama on the same hardware

What's Next

Running a model is the first step. Production requires serving infrastructure.

Model Serving — TGI, vLLM in Docker, GPU autoscaling, monitoring
Fine-tuning — A fine-tuned model is deployed via Ollama/vLLM
Distillation — Distilled model → GGUF → Ollama

Related Lessons

aie-03-llm-fundamentals — Local serving needs model internals knowledge
aie-37-open-source-models — Local inference needs open weights
aie-40-model-serving — Local models scale via serving infrastructure
aie-38-distillation — Distilled models fit consumer hardware
ml-46-model-serving — Same self-hosted inference engineering
sd-03-scalability