I am trying to find the best model for my setup that I can run fully on GPU. I heard that vLLM lets you keep a model in system RAM and adaptively load only the active weights into VRAM, so you can run models that don't fit entirely in VRAM on the GPU. Does vLLM really work that well? If you have any recommendations for local models, I would be very grateful for the help.
Quantization is essential when using practical models on consumer-grade GPUs, but quantization and CPU offloading typically don’t mix well. GGUF is well-suited for GPU-CPU hybrid environments in this regard, so Ollama is likely a better choice than vLLM for handling larger models in this scenario. For models that fit entirely within VRAM, backends like vLLM, TGI, or SGLang should be faster.
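To make the GGUF hybrid idea concrete, here is a minimal sketch using llama-cpp-python (the same engine Ollama builds on). The model path and layer count below are placeholders you would adjust for your own hardware and model:

```python
# Minimal sketch: partial GPU offload of a quantized GGUF model.
# Requires: pip install llama-cpp-python (built with GPU support).
# The model path and n_gpu_layers value are placeholders for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,  # layers kept in VRAM; the remaining layers run on the CPU from RAM
    n_ctx=4096,       # context window size
)

output = llm(
    "Explain GPU-CPU hybrid inference in one sentence.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```

The key knob is `n_gpu_layers`: the more layers you can fit in VRAM, the faster generation gets, and the rest stay in system RAM.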
Agreed with john666. vLLM is aimed more at large GPUs; for local use on consumer-grade hardware, go with Ollama. It is safer, easier, and more reliable.
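For completeness, here is a quick sketch of how that looks from Python with the `ollama` client; the model tag is just an example and assumes you have already pulled it with `ollama pull` and have the local server running:

```python
# Minimal sketch: chatting with a locally served model through Ollama.
# Requires: pip install ollama, and a local Ollama server.
# The model tag "llama3.1:8b" is an example; use any model you have pulled.
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarize what GGUF quantization is."}],
)
print(response["message"]["content"])
```

Ollama handles the GPU/CPU split and quantized GGUF loading for you, which is why it tends to be the smoother option on a single consumer GPU.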
Thanks for the info, you are so right. I tried vLLM anyway, and it was such a pain in the ass.