vLLM on RunPod: Quick Start
Quick guide to running Llama 4 with vLLM on RunPod.
Create RunPod Account
Sign up at runpod.io. Add credits - GPU pods bill per hour.
Create a Pod
- Go to Pods → + Deploy
- Select GPU: 2x H200 SXM (141 GB VRAM each)
- Pick a template with CUDA (e.g., RunPod Pytorch)
- Important: increase volume storage to 100 GB+ - models are large
- Deploy
Why 2x H200? One 80GB GPU wasn't enough to hold the weights, and FP8 also needs newer hardware - A40s won't work.
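A rough back-of-the-envelope estimate shows why. The numbers below are assumptions, not measurements: Scout has roughly 109B total parameters (17B active, 16 experts), and FP8 stores about one byte per weight.

```python
# Back-of-the-envelope VRAM estimate (assumed figures, not measured):
# Llama-4-Scout is ~109B total parameters (17B active, 16 experts),
# and FP8 stores roughly 1 byte per weight.
total_params_billion = 109
weights_gb = total_params_billion * 1  # ~109 GB of weights alone, before KV cache

print(f"Weights: ~{weights_gb} GB")
print(f"Fits on one 80 GB GPU? {weights_gb <= 80}")          # False
print(f"Fits on 2x H200 (2 * 141 GB)? {weights_gb <= 282}")  # True, with room for KV cache
```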
SSH In
Once the pod is running, grab the SSH command from the pod details:
```bash
ssh root@xyz.runpod.io -p 12345 -i ~/.ssh/id_rsa
```

Install vLLM
```bash
uv pip install vllm --system
```
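A quick sanity check that the install worked - just import vLLM and print its version:

```python
# Sanity check: vLLM imports cleanly and reports its version.
import vllm
print(vllm.__version__)
```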
Set your HuggingFace token for gated models:

```bash
export HF_TOKEN="hf_xxx"
```
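Before kicking off a ~100 GB download, it can be worth confirming the token is actually picked up. A minimal check using huggingface_hub (installed as a vLLM dependency), which should read HF_TOKEN from the environment:

```python
# Sanity check: huggingface_hub can see the token (reads HF_TOKEN from the env).
from huggingface_hub import whoami
print(whoami())  # prints your account info; raises if the token is missing or invalid
```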
Run

```python
from vllm import LLM

llm = LLM(
    model="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    tensor_parallel_size=2,
    max_model_len=1024,
    enforce_eager=True,
)

output = llm.generate("Hello, how are you?")
print(output)
```

- `tensor_parallel_size=2` - splits the model across both GPUs
- `enforce_eager=True` - disables CUDA graphs for stability
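`print(output)` dumps raw `RequestOutput` objects. Here is a small sketch (reusing the `llm` object above) of pulling out just the generated text and setting basic sampling options via `SamplingParams` - the values are arbitrary:

```python
from vllm import SamplingParams

# Arbitrary sampling settings instead of vLLM's defaults.
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Hello, how are you?"], params)

# generate() returns one RequestOutput per prompt; the text lives in .outputs[0].text
for out in outputs:
    print(out.outputs[0].text)
```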
My Setup
- 2x H200 SXM
- vLLM 0.13.0
- Llama-4-Scout-17B-16E-Instruct-FP8