I am trying to find the best model for my setup that I can run fully on GPU. I heard that vLLM lets you keep a model in system RAM and adaptively load only the active weights into VRAM, so you can run models that don't fit entirely in VRAM on the GPU. Does vLLM really work that well? If you have any recommendations for local models, I would be very grateful for the help.
Quantization is essential when using practical models on consumer-grade GPUs, but quantization and CPU offloading typically don’t mix well. GGUF is well-suited for GPU-CPU hybrid environments in this regard, so Ollama is likely a better choice than vLLM for handling larger models in this scenario. For models that fit entirely within VRAM, backends like vLLM, TGI, or SGLang should be faster.
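To make the GGUF hybrid idea concrete, here is a minimal sketch using llama-cpp-python (the same engine Ollama builds on). The model path and layer count below are placeholders you would adjust for your own hardware and model:

```python
# Minimal sketch: partial GPU offload of a quantized GGUF model.
# Requires: pip install llama-cpp-python (built with GPU support).
# The model path and n_gpu_layers value are placeholders for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,  # layers kept in VRAM; the remaining layers run on the CPU from RAM
    n_ctx=4096,       # context window size
)

output = llm(
    "Explain GPU-CPU hybrid inference in one sentence.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```

The key knob is `n_gpu_layers`: the more layers you can fit in VRAM, the faster generation gets, and the rest stay in system RAM.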
Agreed with john666. vLLM is aimed more at large GPUs; for local use on consumer-grade hardware, go with Ollama. It is safer, easier, and more reliable.
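For completeness, here is a quick sketch of how that looks from Python with the `ollama` client; the model tag is just an example and assumes you have already pulled it with `ollama pull` and have the local server running:

```python
# Minimal sketch: chatting with a locally served model through Ollama.
# Requires: pip install ollama, and a local Ollama server.
# The model tag "llama3.1:8b" is an example; use any model you have pulled.
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarize what GGUF quantization is."}],
)
print(response["message"]["content"])
```

Ollama handles the GPU/CPU split and quantized GGUF loading for you, which is why it tends to be the smoother option on a single consumer GPU.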
Thanks for the info, you are so right. I tried vLLM anyway, and it was such a pain in the ass.