Guides: LLM hosting, perf, RAG & observability
What changed on the technical blog—and where to start reading

The updates on www.glukhov.org run from LLM runtimes to production metrics: four hubs, one site.
The technical blog at www.glukhov.org has been reorganised around four pillars that mirror how teams actually build: pick a runtime, understand limits, wire retrieval, then operate the thing with metrics and logs. This post is a short map of that release—what each pillar is for, and which deeper articles are worth opening first if you are designing or hardening an LLM stack.
1 LLM hosting: where models run
The LLM hosting hub compares local, self-hosted, and cloud paths in one place: Ollama and friends for quick iteration, heavier servers when throughput matters, and managed APIs when you want someone else to own GPUs. The point is not to crown a single winner—it is to match deployment style to control, cost, and operational appetite.
If you are still choosing a local stack, the wide Ollama vs LocalAI vs Jan vs LM Studio vs vLLM comparison walks through API shape, hardware expectations, and how far each option scales before you outgrow it. For Docker-centric shops debating the official route, Docker Model Runner vs Ollama focuses on integration, GPU support, and day-to-day ergonomics. When you already know you need a throughput-oriented server, the vLLM quickstart is a practical on-ramp to OpenAI-compatible serving with the optimisations that matter in production.
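One reason the vLLM route is a gentle on-ramp is that its server speaks the standard OpenAI chat-completions schema, so moving off a managed API is largely a base-URL change. A minimal sketch of the request body involved (the model name and port below are illustrative placeholders, not values from the articles):

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Build an OpenAI-compatible /v1/chat/completions payload.

    vLLM's OpenAI-compatible server accepts the same JSON schema as the
    managed API, so this body can be POSTed to either backend unchanged.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

# Point any OpenAI-compatible client at the local server's /v1 base URL
# (vLLM defaults to port 8000) and POST this payload to /chat/completions.
payload = chat_request("my-local-model", "Summarise this document.")
```

The practical payoff is that swapping local and cloud backends becomes a configuration change rather than a rewrite.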
2 Performance: throughput, memory, and honest benchmarks
Hosting and performance are inseparable: the same model on two runtimes or two GPUs can tell completely different stories. The LLM performance pillar pulls that into one narrative—latency versus throughput, VRAM pressure, parallel requests, and where time actually goes during inference.
For a runtime shootout in one sitting, Ollama vs llama.cpp vs vLLM vs SGLang is a useful anchor. When you are tuning a single server rather than comparing frameworks, how Ollama handles parallel requests explains concurrency behaviour in terms you can turn into configuration and capacity planning.
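The concurrency behaviour described there turns into capacity planning via Little's law: requests in flight equal arrival rate times average latency. A back-of-the-envelope sketch (the function and numbers are illustrative; in Ollama the parallel-slot count is governed by settings such as `OLLAMA_NUM_PARALLEL`, so check the article for the exact knobs):

```python
import math

def required_parallel_slots(requests_per_sec: float, avg_latency_sec: float) -> int:
    """Little's law: concurrent requests in flight = arrival rate x latency.

    If the server exposes fewer parallel slots than this, requests queue,
    and observed latency grows beyond the model's own inference time.
    """
    return math.ceil(requests_per_sec * avg_latency_sec)

# e.g. 2 requests/sec at 3 s average generation time -> 6 slots needed
slots = required_parallel_slots(2.0, 3.0)
```

The same arithmetic also works in reverse: given a fixed slot count, it bounds the sustainable request rate before queueing dominates.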
3 RAG: retrieval design, not just “add a vector DB”
The RAG tutorial hub treats retrieval-augmented generation as an end-to-end system: embeddings, chunking, stores, reranking, and the failure modes that only show up under real document load.
Two articles pair well once you move past “hello world” retrieval. Chunking strategies in RAG compares fixed, semantic, and hierarchical approaches with trade-offs spelled out for evaluation—not just diagram aesthetics. Vector stores for RAG compared maps Pinecone, Chroma, Weaviate, Milvus, Qdrant, FAISS, and pgvector to the features that matter when you promote a prototype to something on-call engineers must reason about. When you need a wider architectural lens, advanced RAG: LongRAG, Self-RAG, and GraphRAG explains how far the pattern can stretch before you redesign.
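The chunking trade-offs in those articles all start from the same baseline: fixed-size windows with overlap. A minimal sketch of that baseline (sizes here are in characters for brevity; real pipelines usually chunk by tokens, and the function is an illustration, not code from the articles):

```python
def fixed_chunks(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap` chars.

    The overlap keeps a sentence that straddles a chunk boundary fully
    retrievable from at least one chunk -- the classic failure mode of
    naive fixed splitting that semantic chunking tries to avoid entirely.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# 500 chars with step 150 -> windows starting at 0, 150, 300, 450
chunks = fixed_chunks("a" * 500, size=200, overlap=50)
```

Semantic and hierarchical strategies replace the fixed `step` with boundaries derived from the text itself, which is exactly the trade-off space the chunking article evaluates.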
4 Observability: metrics, logs, and LLM-specific signals
The observability guide connects classic monitoring material—Prometheus, Grafana, structured logging—to the kinds of questions LLM services raise: queueing, token rates, GPU saturation, and whether slowness is the model, the network, or the retrieval path.
For inference specifically, observability for LLM systems is the conceptual backbone: what to measure, how traces and logs complement metrics, and how to talk about SLOs when outputs are stochastic. When you are ready to paste queries into dashboards, monitor LLM inference with Prometheus and Grafana turns that into PromQL-oriented examples across common serving stacks.
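Before those dashboards, it helps to see the arithmetic behind the token-rate signal. A common practice is to separate time-to-first-token (queueing plus prefill) from steady-state decode speed; a sketch under that assumption (names and numbers are illustrative, not from the articles):

```python
def decode_tokens_per_sec(completion_tokens: int, total_sec: float, ttft_sec: float) -> float:
    """Decode-phase throughput, excluding time-to-first-token (TTFT).

    The first token lands at TTFT, so the remaining tokens are spread
    over the decode window. Tracking this separately from TTFT tells you
    whether slowness is the queue/prompt path or the generation loop.
    """
    decode_time = total_sec - ttft_sec
    if decode_time <= 0:
        raise ValueError("total time must exceed TTFT")
    return (completion_tokens - 1) / decode_time

# 129 tokens in 5.0 s total, 1.0 s to first token -> 128 tokens over 4.0 s
rate = decode_tokens_per_sec(129, 5.0, 1.0)
```

Exported per request and aggregated with Prometheus-style rates, this is the kind of signal the PromQL-oriented article turns into dashboard panels.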
5 How to use this map
Read the four pillar pages first if you want the curated table of contents; use the deep links above when you have a concrete decision or incident in front of you. Together they are meant to read as one stack story—from choosing where the model lives, to measuring it honestly once users depend on it.