LLM Inference 2: vLLM Source Code Study
vLLM, developed at UC Berkeley, redefines LLM serving efficiency with PagedAttention. Without altering the model architecture, this technique raises throughput to many times that of HuggingFace Transformers. vLLM is implemented in Python, C++, and CUDA.
At the heart of vLLM lies PagedAttention, which addresses the memory bottleneck in LLM serving. During autoregressive decoding with traditional self-attention, the key and value tensors of all previous tokens (the KV cache) must stay in GPU memory, so memory rather than computation becomes the limiting factor. PagedAttention borrows the ideas of virtual memory and paging: keys and values that are logically contiguous are stored in non-contiguous physical memory. Each sequence's KV cache is divided into blocks, and the attention kernel locates and fetches those blocks as it computes. Because only the last block of a sequence can be partially empty, memory usage is near-optimal, with waste under 4%. Blocks can also be shared between sequences, which reduces memory overhead in complex sampling algorithms and thus raises throughput.
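The block mapping described above can be illustrated with a toy block table. This is a minimal sketch, not vLLM's actual code: `BlockAllocator`, `Sequence`, and the block size of 16 are hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one sequence's logical-to-physical block mapping."""

    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block, offset).
        block = self.block_table[token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE

    def wasted_slots(self) -> int:
        # Only the tail of the last block can be unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()
for _ in range(20):
    # Interleaved growth, so neither sequence gets contiguous blocks.
    seq_a.append_token(allocator)
    seq_b.append_token(allocator)

print(seq_a.block_table)        # [7, 5]
print(seq_b.block_table)        # [6, 4]
print(seq_a.physical_slot(17))  # (5, 1)
print(seq_a.wasted_slots())     # 12
```

Each sequence sees a contiguous range of token positions, while the physical blocks behind it are scattered. The unused slots are confined to the last block of each sequence, which is why waste stays small relative to reserving the maximum length up front.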
Continuous batching, which is not obvious at first, became clear thanks to @哦哦啊's explanation. It is a system-level batching optimization that can yield multi-fold or greater performance improvements in real-world workloads. Most optimization work focuses on model quantization and custom CUDA kernels, yet in LLM inference IO and memory issues typically outweigh compute concerns.
LLM inference is memory-bound, not compute-bound: loading data into the GPU's compute cores often takes longer than the computation itself. Throughput therefore hinges largely on how large a batch fits into high-bandwidth GPU memory. With static batching, however, larger batches and higher max-token limits widen the gap in completion times among the sequences in a batch. Finished sequences leave their slots idle until the whole batch completes, which lowers GPU utilization.
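The utilization problem, and how continuous batching avoids it, can be seen in a small simulation. The sketch below is a simplification under stated assumptions: every decoding iteration costs the same regardless of batch contents, prefill is ignored, and both function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for a waiting one at the next step."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots before each decoding iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One decoding iteration generates one token per running sequence.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Number of tokens each request still has to generate (made-up workload).
output_lengths = [2, 8, 3, 8, 2, 1]

print(static_batching_steps(output_lengths, batch_size=2))      # 18
print(continuous_batching_steps(output_lengths, batch_size=2))  # 13
```

With two slots and 24 tokens in total, 12 iterations is the lower bound. Continuous batching comes close to it because short requests no longer wait for the longest request in their batch.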
vLLM stands out in benchmark tests, more than doubling the performance of naive continuous batching. The likely reason is that vLLM reserves KV cache space dynamically rather than ahead of time, which allows much larger batch sizes.
In llm.py, the _run_engine() function loops until every pending request has produced its result. On each iteration it calls self.llm_engine.step(), which obtains the sequences that need inference from the scheduler's _schedule() function. _schedule() in turn moves waiting sequences into the running state.
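The control flow just described can be sketched with a toy engine. This is not vLLM's implementation: `ToyEngine`, `Request`, and `run_engine` are hypothetical, the "model" simply appends a placeholder token, and only the method names `step()` and `_schedule()` echo the ones mentioned above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    max_tokens: int
    output: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output) >= self.max_tokens


class ToyEngine:
    def __init__(self, max_running: int) -> None:
        self.max_running = max_running
        self.waiting = deque()
        self.running = []

    def add_request(self, request: Request) -> None:
        self.waiting.append(request)

    def has_unfinished_requests(self) -> bool:
        return bool(self.waiting or self.running)

    def _schedule(self) -> list:
        # Promote waiting requests to running while capacity remains.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def step(self) -> list:
        # One iteration: schedule, generate one token each, collect finished.
        scheduled = self._schedule()
        for request in scheduled:
            # Stand-in for a real model forward pass.
            request.output.append(f"tok{len(request.output)}")
        finished = [r for r in scheduled if r.finished]
        self.running = [r for r in self.running if not r.finished]
        return finished


def run_engine(engine: ToyEngine) -> list:
    # Keep stepping until every request has completed.
    outputs = []
    while engine.has_unfinished_requests():
        outputs.extend(engine.step())
    return sorted(outputs, key=lambda r: r.request_id)


engine = ToyEngine(max_running=2)
for request_id, max_tokens in [(0, 3), (1, 1), (2, 2)]:
    engine.add_request(Request(request_id, max_tokens))

results = run_engine(engine)
for r in results:
    print(r.request_id, r.output)
```

Request 2 starts waiting and is promoted as soon as request 1 finishes, so the three requests complete in three iterations instead of four.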
There are several ways to install and run vLLM, and installation may need adjustments when the local CUDA and PyTorch versions do not match. Running examples/offline_inference.py is a straightforward way to try it from the command line.
The LLM class encapsulates model loading, tokenizer creation, worker and scheduler setup, and memory allocation, including the block-based allocation strategy that PagedAttention enables. The LlamaModel class consists of an embedding layer, N decoder layers, and a final normalization layer. The RMSNorm class uses a CUDA kernel for acceleration, and each LlamaDecoderLayer combines LlamaAttention and LlamaMLP. Within the attention layer, PagedAttention is what keeps memory usage low during inference.
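For reference, the computation that an RMSNorm layer performs can be written in a few lines of plain Python. This is a readable sketch of the standard RMSNorm formula, not vLLM's CUDA kernel. The function name `rms_norm` and the default `eps` value are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


hidden = [3.0, 4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0])

# After normalization with unit weights, the mean of squares is close to 1.
print(sum(v * v for v in normed) / len(normed))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs a single reduction over the hidden dimension. That makes it a natural candidate for a fused GPU kernel.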
The sampling_params.py file defines the default sampling parameters for inference, which generally do not need to be changed. vLLM's core innovation remains PagedAttention, which optimizes memory management to raise throughput.
For single-request inference (batch size 1), vLLM may not outperform HuggingFace Transformers. Its advantage shows clearly when many requests are batched. The differences between vLLM and HuggingFace (HF) inference outputs are worth investigating further to understand the system's finer details.