From Offline to Online Inference: Why Serving Is Hard and How vLLM Helps

Date:

Resources: [Slides]

Abstract: As large language models (LLMs) transition from research prototypes to production systems, inference becomes a primary bottleneck, and the gap between “it runs” and “it serves” widens quickly. While offline inference can be optimized for throughput, online inference must handle concurrency, variable-length prompts, and unpredictable workloads, making strong performance much harder to achieve.

In this talk, we begin with a quick introduction to LLM inference, then contrast the offline and online settings and highlight why online serving is challenging. We then compare Ollama and vLLM on a small benchmark of 50 requests sampled from the Alpaca dataset, showing how vLLM can achieve a significantly lower runtime. We also explain the two core features behind vLLM, PagedAttention and continuous batching, and relate them to hardware behavior such as GPU utilization and memory usage. Finally, we briefly explore additional techniques to accelerate inference in practice, such as prompt engineering and speculative decoding.
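To give a feel for why continuous batching matters when output lengths vary, here is a minimal toy simulation. It is a sketch under simplifying assumptions, not vLLM's actual scheduler: each request is reduced to the number of tokens it must generate, every decode step produces one token per active sequence, and prefill cost and memory limits are ignored. The function names and the sample lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch waits for its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active sequence emits one token,
        # and sequences that have finished leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 3]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With these sample lengths the static schedule needs 11 decode steps and the continuous one needs 8, because short requests no longer hold a slot idle while the longest request in their batch finishes.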