Lesson 02: vLLM Configuration
Learn to deploy local models with vLLM, optimize inference, and connect AVA SDK
At GDC 2026, Razer unveiled new agentic capabilities for Project AVA. vLLM and local models remain the perfect foundation to run these new agentic workflows privately with zero recurring cost.
Introduction
vLLM is the inference engine that powers AVA SDK. Based on PagedAttention, it delivers industrial-grade performance for language models on local hardware. In this lesson you will learn to install, configure, and optimize vLLM to run models like Llama 3, Mistral, and Qwen directly on your GPU.
How vLLM works
vLLM uses PagedAttention, a memory management technique that stores the KV cache in fixed-size blocks instead of one contiguous region per request, which nearly eliminates KV cache fragmentation. This enables:
Higher throughput
Serve more concurrent requests on the same GPU
Lower latency
Optimized inference loop for faster responses
Efficient memory
Near-optimal KV cache management thanks to PagedAttention
OpenAI-compatible API
Same /v1/chat/completions interface as the OpenAI API, so existing OpenAI clients work after changing the base URL
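To make the fragmentation point concrete, here is a toy sketch of block-based KV cache allocation in the spirit of PagedAttention. It is not vLLM's actual implementation: the class name `ToyBlockAllocator`, its method names, and the block size of 16 tokens are illustrative assumptions.

```python
import math


class ToyBlockAllocator:
    """Toy model of a paged KV cache: memory is split into fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_tokens(self, request_id, count):
        """Grow a request by `count` tokens, taking new blocks only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        total = self.num_tokens.get(request_id, 0) + count
        needed = math.ceil(total / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache is full")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = total

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def wasted_slots(self):
        """Token slots allocated but unused (at most block_size - 1 per request)."""
        return sum(
            len(blocks) * self.block_size - self.num_tokens[request_id]
            for request_id, blocks in self.block_tables.items()
        )


allocator = ToyBlockAllocator(num_blocks=8)
allocator.append_tokens("a", 20)   # takes 2 blocks
allocator.append_tokens("b", 5)    # takes 1 block
allocator.free("a")                # both blocks return to the pool
allocator.append_tokens("c", 100)  # takes 7 blocks, reusing the freed ones
print(allocator.wasted_slots())    # 23 (11 unused slots for b, 12 for c)
```

Because any free block can serve any request, request "c" fits even though the free blocks are not adjacent. The only waste is the partially filled last block of each request.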
Step 1: Start the vLLM server
Make sure Docker Engine and NVIDIA Container Toolkit are running before continuing (Lesson 01). With Docker already configured, launching vLLM takes a single command:
```bash
# Assumes HF_TOKEN holds your HuggingFace access token (needed for gated models such as Llama 3.1).
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct
```
The first time, vLLM downloads the model from HuggingFace (several GB). Subsequent runs use the local cache mounted from ~/.cache/huggingface. Make sure you have about 16 GB of free disk space.
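Since the first start can take a while because of the model download, it helps to wait until the server responds before sending requests. The sketch below polls the server's `/health` endpoint on the port mapped above. The function names, the 10-minute timeout, and the 5-second interval are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def is_ready(url="http://localhost:8000/health"):
    """Return True if the server answers the probe URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_for_server(probe=is_ready, timeout_s=600.0, interval_s=5.0):
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server()` with the defaults blocks until the container is serving or ten minutes have passed. The `probe` parameter exists so the polling loop can be exercised without a running server.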
Step 2: Choose a model
vLLM supports dozens of open-source models. For AVA SDK we recommend:
| Model | Required VRAM | Recommended use |
|---|---|---|
| Llama 3.1 8B | 16 GB | Best quality/performance balance |
| Mistral 7B | 12 GB | Fast and efficient for simple tasks |
| Qwen 2.5 7B | 14 GB | Excellent for structured responses |
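You need a HuggingFace account and must accept the Llama 3.1 terms of use on the model page before you can download the model. Alternatively, use Mistral 7B, which at the time of writing required no authentication (check the model card, since access terms can change).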
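A rough way to sanity-check VRAM figures like these is to multiply the parameter count by the bytes stored per parameter. The helper below is a back-of-the-envelope sketch, not a vLLM calculation. It counts weights only, so the KV cache and runtime overhead come on top. The function name and the nominal 8-billion parameter count are assumptions.

```python
def weight_memory_gb(params_billion, bytes_per_param=2.0):
    """Approximate memory for model weights in GB (10^9 bytes).

    bytes_per_param: 2.0 for FP16/BF16 weights, 0.5 for 4-bit quantized weights.
    """
    return params_billion * bytes_per_param


# Nominal size for Llama 3.1 8B (assumed: 8 billion parameters).
print(f"FP16:  ~{weight_memory_gb(8.0):.0f} GB")
print(f"4-bit: ~{weight_memory_gb(8.0, 0.5):.0f} GB")
```

For an 8B model this gives about 16 GB in FP16 and about 4 GB at 4 bits, which is why quantization is the usual first step on smaller GPUs.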
Step 3: Connect vLLM with AVA SDK
Once the vLLM server is running, configure AVA SDK to use it as the local inference backend:
```python
from openai import OpenAI

# AVA SDK: local inference configuration
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are AVA, a tactical assistant."},
        {"role": "user", "content": "Which basic AVA commands do you know?"},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)
```
Although vLLM does not require an API key by default, the OpenAI client requires the api_key parameter (or the OPENAI_API_KEY environment variable). Use any value.
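Because the server speaks the OpenAI wire format, the same call can be made without the openai package. The following sketch builds the equivalent HTTP request with the standard library only, which shows what actually travels to /v1/chat/completions. The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("Hello"))` returns the reply as a string. The `model` value must match the name passed to `--model` when the server was started.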
Benchmark: vLLM vs OpenAI
Comparative results on an NVIDIA RTX 4090 (24 GB VRAM) with Llama 3.1 8B vs GPT-4o mini:
| Metric | vLLM (Local) | OpenAI API |
|---|---|---|
| First response latency | ~350ms | ~800ms |
| Throughput | ~120 req/s | ~50 req/s |
| Cost per 1M tokens | ~$0 | ~$0.15 |
| Data privacy | Total (local) | Cloud |
With vLLM you not only save on API costs, you also gain in latency and privacy: all data stays on your machine. The ~$0 figure covers per-token fees only; hardware and electricity are not included.
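Numbers like these depend heavily on hardware, prompt length, and concurrency, so it is worth measuring on your own setup. Below is a minimal timing harness, offered as a sketch rather than the method used for the table above. It measures the wall-clock time of a complete call, not time to first token, and the function name and default run counts are assumptions.

```python
import statistics
import time


def measure_latency_ms(call, runs=10, warmup=2):
    """Time `call` several times and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        call()  # warm-up runs are discarded (connection setup, caches)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }
```

Pass any zero-argument callable that performs one request, for example a lambda wrapping `client.chat.completions.create(...)`, and compare the medians for the local and the cloud endpoint using the same prompt.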
Performance optimization
Max Model Len
Lower --max-model-len if your GPU has limited VRAM, for example --max-model-len 4096. A shorter context window needs less KV cache memory
Tensor Parallel
If you have multiple GPUs, use --tensor-parallel-size 2 to distribute the model
Quantization
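Serve an AWQ or GPTQ checkpoint: vllm serve ... --quantization awq. The model weights must already be quantized in that format. Expect roughly 40% lower VRAM use, depending on the model and context length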
GPU Memory Utilization
Set --gpu-memory-utilization to the fraction of GPU memory vLLM may claim (default 0.9). Lower it, for example to 0.8, to leave memory for other processes
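The four options above can be combined in one launch command. The helper below is a small sketch that assembles a `vllm serve` argument list from them. The function name `build_vllm_args` and its defaults are assumptions; only the flags named in this section are used.

```python
def build_vllm_args(
    model,
    max_model_len=None,
    tensor_parallel_size=1,
    quantization=None,
    gpu_memory_utilization=0.9,
):
    """Assemble a `vllm serve` argument list from the options in this section."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    args = ["vllm", "serve", model,
            "--gpu-memory-utilization", str(gpu_memory_utilization)]
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    if tensor_parallel_size > 1:
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        args += ["--quantization", quantization]
    return args


print(" ".join(build_vllm_args(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", max_model_len=4096)))
```

The returned list can be passed to `subprocess.run`. When using the Docker image from Step 1, append everything after the model name to the `docker run` command instead.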
Troubleshooting
Out of Memory (OOM)
Reduce max_model_len, enable quantization, or use a smaller model (7B instead of 13B)
Server won't start
Verify nvidia-smi works inside a container: docker run --rm --gpus all ubuntu nvidia-smi (the NVIDIA Container Toolkit makes nvidia-smi available inside the container)
Very slow responses
Check the GPU is being used (nvidia-smi) and no processes are competing for VRAM
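For the out-of-memory case above, a quick estimate shows why the context length matters so much. The KV cache stores one key and one value vector per token, per layer, and per KV head. The sketch below does that arithmetic, assuming the Llama 3.1 8B shape (32 layers, 8 KV heads, head size 128) and FP16 values. Check the model's config.json for the real numbers; the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (grouped-query attention), head size 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (4096, 32768, 131072):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>6} tokens -> {gib:.1f} GiB of KV cache per full-length sequence")
```

Under these assumptions a single sequence needs 0.5 GiB at 4096 tokens but 16 GiB at 131072 tokens, so lowering max_model_len is often the cheapest fix for OOM errors.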