LESSON 02

Lesson 02: vLLM Configuration

Learn to deploy local models with vLLM, optimize inference, and connect AVA SDK

GDC 2026 Update

At GDC 2026, Razer unveiled new agentic capabilities for Project AVA. vLLM and local models remain a solid foundation for running these agentic workflows privately, with no recurring cost.

Introduction

vLLM is the inference engine that powers AVA SDK. Based on PagedAttention, it delivers industrial-grade performance for language models on local hardware. In this lesson you will learn to install, configure, and optimize vLLM to run models like Llama 3, Mistral, and Qwen directly on your GPU.

How vLLM works

vLLM uses PagedAttention, a memory management technique that stores the KV cache in fixed-size blocks, much like pages in virtual memory, and so nearly eliminates KV cache fragmentation. This enables:

Higher throughput

Serve more concurrent requests on the same GPU

Lower latency

Optimized inference loop for faster responses

Efficient memory

Near-optimal KV cache management thanks to PagedAttention

OpenAI-compatible API

Same /v1/chat/completions interface, same ergonomics
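To make the fragmentation point concrete, the toy calculation below compares two strategies: reserving one contiguous KV cache slot per request, sized for the maximum length, and allocating fixed-size blocks on demand, which is the idea behind PagedAttention. The block size of 16 tokens, the request lengths, and the function names are illustrative assumptions. This is a simplified model, not vLLM's actual allocator.

```python
import math

def contiguous_slots(request_lengths, max_len):
    """Token slots reserved when every request pre-allocates max_len tokens."""
    return len(request_lengths) * max_len

def paged_slots(request_lengths, block_size=16):
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)

def utilization(request_lengths, reserved):
    """Fraction of reserved slots that actually hold tokens."""
    return sum(request_lengths) / reserved

# Hypothetical batch: four requests with very different lengths.
lengths = [37, 512, 90, 1200]
max_len = 2048

contig = contiguous_slots(lengths, max_len)
paged = paged_slots(lengths)
print(f"contiguous: {contig} slots, {utilization(lengths, contig):.0%} used")
print(f"paged:      {paged} slots, {utilization(lengths, paged):.0%} used")
```

With paging, the only waste is the partially filled last block of each request, so the same GPU memory can hold many more concurrent requests.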

Step 1: Start the vLLM server

With Docker already configured (Lesson 01), launching vLLM is as simple as:

Warning

Make sure Docker Engine and NVIDIA Container Toolkit are running before continuing (Lesson 01).

Terminal
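# HF_TOKEN must hold your Hugging Face access token (needed for gated models such as Llama 3.1)
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct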
First download

The first time it runs, vLLM downloads the model weights from Hugging Face (several GB). Subsequent runs use the local cache mounted at ~/.cache/huggingface. Make sure you have about 16 GB of free disk space.
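Because the first start can take several minutes while the weights download, it can help to wait for the server before sending requests. The sketch below polls the /v1/models endpoint of the OpenAI-compatible API until it answers. The function names, attempt count, and delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False

def wait_for_server(url="http://localhost:8000/v1/models",
                    attempts=60, delay=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until it responds or `attempts` probes have failed."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

Calling wait_for_server() with the defaults waits up to roughly five minutes. The probe and sleep parameters exist only so the loop can be exercised without a running server.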

Step 2: Choose a model

vLLM supports dozens of open-source models. For AVA SDK we recommend:

| Model | Required VRAM | Recommended use |
| --- | --- | --- |
| Llama 3.1 8B | 16 GB | Best quality/performance balance |
| Mistral 7B | 12 GB | Fast and efficient for simple tasks |
| Qwen 2.5 7B | 14 GB | Excellent for structured responses |
Note

You need a Hugging Face account and must have accepted the Llama 3 terms of use to download the model. Alternatively, use Mistral 7B, which requires no authentication (check the model card, since access terms can change).

Step 3: Connect vLLM with AVA SDK

Once the vLLM server is running, configure AVA SDK to use it as the local inference backend:

config.py
from openai import OpenAI

# AVA SDK - local inference configuration
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM does not require an API key
)

# Send a chat request to the model served by vLLM
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are AVA, a tactical assistant."},
        {"role": "user", "content": "Which basic AVA commands do you know?"}
    ],
    temperature=0.7,
    max_tokens=512
)

# Print the assistant's reply
print(response.choices[0].message.content)
API Key

Although vLLM does not require an API key by default, the OpenAI client will not start without one. Pass any non-empty placeholder value, unless the server was started with --api-key, in which case use that key.
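Since the endpoint follows the OpenAI wire format, the same request can also be made without the openai package. The following sketch builds the POST request for /v1/chat/completions with the standard library only. The helper names build_chat_request and send_chat_request are hypothetical, and send_chat_request assumes the vLLM server from Step 1 is reachable on localhost:8000.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, **params):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_chat_request(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Building the request needs no server; sending it does.
request = build_chat_request(
    "http://localhost:8000/v1",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(request.full_url)
```

Once the server is up, send_chat_request(request) returns the same text that response.choices[0].message.content holds in the client-based version.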

Benchmark: vLLM vs OpenAI

Comparative results on an NVIDIA RTX 4090 (24 GB VRAM) with Llama 3.1 8B vs GPT-4o mini:

| Metric | vLLM (local) | OpenAI API |
| --- | --- | --- |
| First response latency | ~350 ms | ~800 ms |
| Throughput | ~120 req/s | ~50 req/s |
| Cost per 1M tokens | ~$0 | ~$0.15 |
| Data privacy | Total (local) | Cloud |
Tip

With vLLM you not only cut costs; you also gain in latency and privacy. All data stays on your machine.
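Latency figures like these depend heavily on hardware, prompt length, and load, so it is worth measuring them on your own setup. The sketch below times the first response chunk of a streamed reply. It is one possible way to measure first-response latency and not necessarily the method behind the table above; the function name and the injectable clock are assumptions made for illustration.

```python
import time

def time_to_first_token(chunks, clock=time.perf_counter):
    """Consume a stream of text chunks; return (ttft, total, text).

    `chunks` is any iterable of strings (or None for empty deltas).
    ttft is the seconds elapsed until the first non-empty chunk,
    or None if the stream produced no text at all.
    """
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty or None deltas
        if first is None:
            first = clock() - start
        parts.append(chunk)
    return first, clock() - start, "".join(parts)
```

With the OpenAI client, a stream can be obtained by passing stream=True to client.chat.completions.create and feeding the function the generator (c.choices[0].delta.content for c in stream). Because the generator is lazy, the request is timed from the moment the function starts consuming it.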

Performance optimization

Max Model Len

Lower the maximum context length if your GPU has limited VRAM, e.g. --max-model-len 4096. A shorter context needs less KV cache memory.
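To see why the context length matters so much, the KV cache size of one sequence can be estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below applies this to Llama 3.1 8B, assuming the shape from its published configuration (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache values. The function names are illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_gib(context_len, **shape):
    """KV cache size in GiB for a single sequence of `context_len` tokens."""
    return context_len * kv_cache_bytes_per_token(**shape) / 2**30

# Assumed shape for Llama 3.1 8B (grouped-query attention with 8 KV heads).
llama_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (4096, 32768, 131072):
    size = kv_cache_gib(context_len, **llama_8b)
    print(f"{context_len:>7} tokens -> {size:.1f} GiB")
```

Under these assumptions a single full-length 131072-token sequence would need about 16 GiB of cache on top of the model weights, while 4096 tokens need about 0.5 GiB.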

Tensor Parallel

If you have multiple GPUs, use --tensor-parallel-size 2 to shard the model across two of them.

Quantization

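Enable AWQ or GPTQ with a checkpoint quantized in that format: vllm serve ... --quantization awq. This can reduce VRAM use by around 40%, depending on the model and context length.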

GPU Memory Utilization

Set --gpu-memory-utilization to the fraction of GPU memory vLLM may claim (the default is 0.9). Lower it, e.g. to 0.8, to leave memory for other processes.
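The flags in this section can be combined in a single launch command. As a minimal sketch, the hypothetical ServeOptions class below collects them and produces the argument list that would follow the image name in the docker run command from Step 1. The class and its defaults are assumptions for illustration; only the flag names come from vLLM.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServeOptions:
    """Hypothetical container for the tuning flags discussed in this section."""
    model: str
    max_model_len: Optional[int] = None
    tensor_parallel_size: int = 1
    quantization: Optional[str] = None
    gpu_memory_utilization: float = 0.9

    def to_args(self):
        """Return the argument list to append after the vLLM image name."""
        args = ["--model", self.model]
        if self.max_model_len is not None:
            args += ["--max-model-len", str(self.max_model_len)]
        if self.tensor_parallel_size > 1:
            args += ["--tensor-parallel-size", str(self.tensor_parallel_size)]
        if self.quantization:
            args += ["--quantization", self.quantization]
        args += ["--gpu-memory-utilization", str(self.gpu_memory_utilization)]
        return args

# Example: a single GPU with limited VRAM shared with other processes.
opts = ServeOptions(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)
print(" ".join(opts.to_args()))
```

Keeping the options in one place makes it easier to try one change at a time, for example lowering max_model_len before resorting to quantization.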

Troubleshooting

Out of Memory (OOM)

Reduce max_model_len, enable quantization, or use a smaller model (7B instead of 13B)

Server won't start

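Verify that nvidia-smi works inside a container: docker run --rm --gpus all ubuntu nvidia-smi (the nvidia/cuda image needs an explicit tag, so a plain ubuntu image is simpler for this check)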

Very slow responses

Check the GPU is being used (nvidia-smi) and no processes are competing for VRAM
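The nvidia-smi check can also be scripted. The sketch below uses nvidia-smi's CSV query mode to report GPU utilization and free memory per GPU. The helper names and the sample output line are made up for illustration, and read_gpu_stats assumes nvidia-smi is on the PATH.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output into per-GPU dicts (memory in MiB)."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "used_mib": int(used),
                      "total_mib": int(total), "util_pct": int(util)})
    return stats

def read_gpu_stats():
    """Run nvidia-smi and return parsed stats; requires an NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)

# Sample output in the format requested by QUERY (values are made up).
sample = "0, 21504, 24564, 97\n"
for gpu in parse_gpu_stats(sample):
    free = gpu["total_mib"] - gpu["used_mib"]
    print(f"GPU {gpu['index']}: {gpu['util_pct']}% busy, {free} MiB free")
```

If utilization stays near zero while a request is in flight, the model is probably not running on the GPU. If memory is nearly full before vLLM starts, another process is competing for VRAM.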