16 GB VRAM LLM benchmarks with llama.cpp (speed and context)

Source: DEV Community
Here I am comparing the speed of several LLMs running on a GPU with 16GB of VRAM, and choosing the best one for self-hosting. I have run these LLMs on llama.cpp with 19K, 32K, and 64K token context windows. For the broader performance picture (throughput versus latency, VRAM limits, parallel requests, and how benchmarks fit together across hardware and runtimes), see LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization.

The quality of the responses is analysed in other articles, for instance:

- Best LLMs for OpenCode - Tested Locally
- Comparison of Hugo Page Translation quality - LLMs on Ollama

I ran a similar test for LLMs on Ollama: Best LLMs for Ollama on 16GB VRAM GPU. In this post I am recording my attempts to squeeze out as much performance, in the sense of speed, as possible.

LLM speed comparison table (tokens per second and VRAM)

| Model | Size (GB) | 19K VRAM | 19K GPU/CPU | 19K T/s | 32K VRAM | 32K Load | 32K T/s | 64K VRAM | 64K Load | 64K T/s |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-UD-IQ3_S | 13.6 | 14.3GB | 93%/100% | 136.4 | 14.6G | | | | | |
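If you want to sanity-check numbers like these on your own card, the throughput can be probed against a running llama-server instance. The Python sketch below is a minimal example, not the exact harness used for the table above: it assumes llama-server is already listening on its default port 8080 with the model and context window under test, and the address, prompt, and model path are illustrative placeholders. It measures the end-to-end generation rate over HTTP, which will come out a bit lower than the eval-time tokens per second that llama.cpp prints in its own log.

```python
"""Rough tokens-per-second probe against a local llama-server instance.

A minimal sketch under assumptions: llama-server was started separately,
for example with a 32K context window and all layers offloaded to the GPU:

    llama-server -m ./model.gguf -c 32768 -ngl 99 --port 8080

The address, prompt, and max_tokens below are illustrative placeholders.
"""
import time

import requests  # pip install requests

SERVER = "http://127.0.0.1:8080"  # default llama-server address (assumption)
PROMPT = "Explain the difference between VRAM and system RAM in ~300 words."

payload = {
    "model": "local",  # llama-server serves whatever model it was started with
    "messages": [{"role": "user", "content": PROMPT}],
    "max_tokens": 512,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(f"{SERVER}/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

# The OpenAI-compatible endpoint reports token counts in a "usage" block.
usage = resp.json()["usage"]
tps = usage["completion_tokens"] / elapsed

print(f"{usage['completion_tokens']} tokens in {elapsed:.1f}s -> {tps:.1f} t/s "
      "(end-to-end, including prompt processing)")
```

Averaging a few runs with prompts long enough to actually fill the target context window (19K, 32K, or 64K tokens) gives figures comparable to the table, and VRAM usage can be read from nvidia-smi while the request is in flight.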