How to Use the Model on Multiple GPUs?

by NaiveYan

Thank you for your quantized model!

The Llama-3_1-Nemotron-Ultra-253B-v1 model has very uneven per-layer VRAM requirements, so the default multi-GPU split in the llama.cpp server frequently causes out-of-memory (OOM) errors.

For example, when launching the 92 GB q2kxl model on a 10×Titan V setup (120 GB total VRAM), an OOM error still occurs because one GPU ends up with a 13 GB weight segment.
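In the meantime, here is a minimal sketch of how I am trying to estimate per-layer weight sizes myself from the GGUF file. It assumes the gguf Python package that ships with llama.cpp (gguf-py) and the usual `blk.<i>.*` tensor-name convention; the file name is just a placeholder.

```python
# Minimal sketch: estimate per-layer weight sizes from a GGUF file.
# Assumes the gguf-py package (pip install gguf) and llama.cpp-style
# tensor names ("blk.<i>.*" for repeating layers); adjust if needed.
import re
from collections import defaultdict

from gguf import GGUFReader


def per_layer_bytes(gguf_path: str) -> dict[str, int]:
    reader = GGUFReader(gguf_path)
    sizes: dict[str, int] = defaultdict(int)
    for tensor in reader.tensors:
        match = re.match(r"blk\.(\d+)\.", tensor.name)
        # Group repeating-block tensors by layer index; everything else
        # (embeddings, output head, final norm) goes into one bucket.
        key = f"layer {int(match.group(1)):3d}" if match else "non-repeating"
        sizes[key] += int(tensor.n_bytes)
    return dict(sizes)


if __name__ == "__main__":
    # "model-q2kxl.gguf" is a placeholder path, not the actual file name.
    for name, size in sorted(per_layer_bytes("model-q2kxl.gguf").items()):
        print(f"{name}: {size / 2**30:.2f} GiB")
```

Having these numbers published alongside the quant would still be much more convenient than every user re-deriving them.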

If possible:

  1. Could you provide per-layer memory usage so users can tune the -ts (--tensor-split) parameter themselves?
  2. Sharing recommended -ts configurations for 8-, 9-, and 10-GPU setups directly would also be greatly appreciated (a rough way to derive one is sketched below).
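For anyone else hitting this while waiting, here is a rough sketch of how per-layer sizes could be turned into a -ts value. It assumes --tensor-split distributes the repeating layers across GPUs in proportion to the given values (as I understand llama.cpp's layer split mode does), so the per-GPU layer counts it produces can be passed directly as the ratio. The VRAM budgets are hypothetical and should leave headroom for the KV cache and compute buffers.

```python
# Sketch: greedily pack whole layers onto GPUs with given VRAM budgets
# and print a candidate -ts ratio (one count per GPU).
def suggest_tensor_split(layer_bytes: list[int], gpu_budget_bytes: list[int]) -> str:
    counts = [0] * len(gpu_budget_bytes)
    used = [0] * len(gpu_budget_bytes)
    gpu = 0
    for size in layer_bytes:
        # Move to the next GPU once this one cannot hold another whole layer;
        # the last GPU absorbs any overflow.
        while gpu < len(gpu_budget_bytes) - 1 and used[gpu] + size > gpu_budget_bytes[gpu]:
            gpu += 1
        used[gpu] += size
        counts[gpu] += 1
    return ",".join(str(c) for c in counts)


# Example: 10 GPUs with ~11 GiB usable each (hypothetical headroom figure);
# layer_bytes would come from the per-layer script above.
# print(suggest_tensor_split(layer_bytes, [11 * 2**30] * 10))
```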
