How to Use the Model on Multiple GPUs?
by NaiveYan
Thank you for your quantized model!
The Llama-3_1-Nemotron-Ultra-253B-v1 model has very uneven per-layer VRAM requirements, and the default multi-GPU tensor split in the llama.cpp server frequently causes out-of-memory (OOM) errors.
For example, when launching the 92 GB Q2_K_XL model on a 10×Titan V setup (120 GB total VRAM), one GPU gets assigned a ~13 GB slice of the weights, which exceeds its 12 GB capacity and triggers an OOM even though there is enough memory overall.
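For illustration, the launch looks roughly like this (the filename and split ratios below are placeholders, not a tested configuration); `-ts` takes one relative weight per GPU, so lowering a value shrinks that GPU's share of the layers:

```
# Placeholder filename and ratios; -ts assigns relative proportions per GPU.
./llama-server \
  -m Llama-3_1-Nemotron-Ultra-253B-v1-Q2_K_XL.gguf \
  -ngl 999 \
  -ts 10,10,10,10,10,7,10,10,10,10
```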
If possible:
- Could you share per-layer memory usage figures, so users can tune the `-ts` (tensor-split) parameter themselves? (A rough way to compute this is sketched after the list.)
- Directly sharing recommended `-ts` configurations for 8/9/10-GPU setups would also be greatly appreciated.
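In case it helps others in the meantime, here is a minimal sketch of how per-block tensor sizes can be read straight from the GGUF file with the gguf-py package (`pip install gguf`); the path is a placeholder, and for multi-file GGUFs you would run it on each shard and sum the results:

```python
# Sketch: sum tensor sizes per transformer block in a GGUF file, to see
# how uneven the per-layer footprint is. Assumes `pip install gguf`.
from collections import defaultdict
import re
import sys

from gguf import GGUFReader

# Path is a placeholder; pass the GGUF file (or one shard of it) on the CLI.
reader = GGUFReader(sys.argv[1])

per_block = defaultdict(int)  # bytes per transformer block ("blk.N.*")
other = 0                     # token_embd, output, norms, etc.

for t in reader.tensors:
    m = re.match(r"blk\.(\d+)\.", t.name)
    if m:
        per_block[int(m.group(1))] += int(t.n_bytes)
    else:
        other += int(t.n_bytes)

for i in sorted(per_block):
    print(f"blk.{i:3d}: {per_block[i] / 2**30:6.2f} GiB")
print(f"non-block tensors: {other / 2**30:.2f} GiB")
```

The per-block totals can then be grouped into roughly equal buckets per GPU and passed to `-ts` as relative weights.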