vLLM on RunPod: Quick Start
Quick guide to running Llama 4 with vLLM on RunPod.
Create RunPod Account
Sign up at runpod.io. Add credits - GPU pods bill per hour.
Create a Pod
- Go to Pods → + Deploy
- Select GPU: 2x H200 SXM (141 GB VRAM each)
- Pick a template with CUDA (e.g., RunPod Pytorch)
- Important: increase volume storage to 100 GB+ - models are large
- Deploy
Why 2x H200? One 80GB GPU wasn't enough to hold the weights, and FP8 also needs newer hardware - A40s won't work.
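A rough back-of-the-envelope estimate shows why. The numbers below are assumptions, not measurements: Scout has roughly 109B total parameters (17B active, 16 experts), and FP8 stores about one byte per weight.

```python
# Back-of-the-envelope VRAM estimate (assumed figures, not measured):
# Llama-4-Scout is ~109B total parameters (17B active, 16 experts),
# and FP8 stores roughly 1 byte per weight.
total_params_billion = 109
weights_gb = total_params_billion * 1  # ~109 GB of weights alone, before KV cache

print(f"Weights: ~{weights_gb} GB")
print(f"Fits on one 80 GB GPU? {weights_gb <= 80}")          # False
print(f"Fits on 2x H200 (2 * 141 GB)? {weights_gb <= 282}")  # True, with room for KV cache
```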
SSH In
Once the pod is running, grab the SSH command from the pod details:
```bash
ssh root@xyz.runpod.io -p 12345 -i ~/.ssh/id_rsa
```

Install vLLM
```bash
uv pip install vllm --system
```
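A quick sanity check that the install worked - just import vLLM and print its version:

```python
# Sanity check: vLLM imports cleanly and reports its version.
import vllm
print(vllm.__version__)
```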
Set your HuggingFace token for gated models:

```bash
export HF_TOKEN="hf_xxx"
```
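Before kicking off a ~100 GB download, it can be worth confirming the token is actually picked up. A minimal check using huggingface_hub (installed as a vLLM dependency), which should read HF_TOKEN from the environment:

```python
# Sanity check: huggingface_hub can see the token (reads HF_TOKEN from the env).
from huggingface_hub import whoami
print(whoami())  # prints your account info; raises if the token is missing or invalid
```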
Run

```python
from vllm import LLM

llm = LLM(
    model="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    tensor_parallel_size=2,
    max_model_len=1024,
    enforce_eager=True,
)

output = llm.generate("Hello, how are you?")
print(output)
```

- `tensor_parallel_size=2` - splits the model across both GPUs
- `enforce_eager=True` - disables CUDA graphs for stability
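`print(output)` dumps raw `RequestOutput` objects. Here is a small sketch (reusing the `llm` object above) of pulling out just the generated text and setting basic sampling options via `SamplingParams` - the values are arbitrary:

```python
from vllm import SamplingParams

# Arbitrary sampling settings instead of vLLM's defaults.
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Hello, how are you?"], params)

# generate() returns one RequestOutput per prompt; the text lives in .outputs[0].text
for out in outputs:
    print(out.outputs[0].text)
```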
My Setup
- 2x H200 SXM
- vLLM 0.13.0
- Llama-4-Scout-17B-16E-Instruct-FP8