How to Use the Model on Multiple GPUs?

by NaiveYan

Thank you for your quantized model!

The Llama-3_1-Nemotron-Ultra-253B-v1 model has very uneven per-layer VRAM requirements, so the default multi-GPU split in the llama.cpp server frequently causes out-of-memory (OOM) errors.

For example, when launching the 92 GB q2kxl model on a 10×Titan V setup (120 GB total VRAM), an OOM error still occurs because one GPU ends up with a 13 GB weight segment.
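In the meantime, here is a minimal sketch of how I am trying to estimate per-layer weight sizes myself from the GGUF file. It assumes the gguf Python package that ships with llama.cpp (gguf-py) and the usual `blk.<i>.*` tensor-name convention; the file name is just a placeholder.

```python
# Minimal sketch: estimate per-layer weight sizes from a GGUF file.
# Assumes the gguf-py package (pip install gguf) and llama.cpp-style
# tensor names ("blk.<i>.*" for repeating layers); adjust if needed.
import re
from collections import defaultdict

from gguf import GGUFReader


def per_layer_bytes(gguf_path: str) -> dict[str, int]:
    reader = GGUFReader(gguf_path)
    sizes: dict[str, int] = defaultdict(int)
    for tensor in reader.tensors:
        match = re.match(r"blk\.(\d+)\.", tensor.name)
        # Group repeating-block tensors by layer index; everything else
        # (embeddings, output head, final norm) goes into one bucket.
        key = f"layer {int(match.group(1)):3d}" if match else "non-repeating"
        sizes[key] += int(tensor.n_bytes)
    return dict(sizes)


if __name__ == "__main__":
    # "model-q2kxl.gguf" is a placeholder path, not the actual file name.
    for name, size in sorted(per_layer_bytes("model-q2kxl.gguf").items()):
        print(f"{name}: {size / 2**30:.2f} GiB")
```

Having these numbers published alongside the quant would still be much more convenient than every user re-deriving them.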

If possible:

  1. Could you provide per-layer memory usage so users can tune the -ts (--tensor-split) parameter themselves?
  2. Sharing recommended -ts configurations for 8-, 9-, and 10-GPU setups directly would also be greatly appreciated (a rough way to derive one is sketched below).
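For anyone else hitting this while waiting, here is a rough sketch of how per-layer sizes could be turned into a -ts value. It assumes --tensor-split distributes the repeating layers across GPUs in proportion to the given values (as I understand llama.cpp's layer split mode does), so the per-GPU layer counts it produces can be passed directly as the ratio. The VRAM budgets are hypothetical and should leave headroom for the KV cache and compute buffers.

```python
# Sketch: greedily pack whole layers onto GPUs with given VRAM budgets
# and print a candidate -ts ratio (one count per GPU).
def suggest_tensor_split(layer_bytes: list[int], gpu_budget_bytes: list[int]) -> str:
    counts = [0] * len(gpu_budget_bytes)
    used = [0] * len(gpu_budget_bytes)
    gpu = 0
    for size in layer_bytes:
        # Move to the next GPU once this one cannot hold another whole layer;
        # the last GPU absorbs any overflow.
        while gpu < len(gpu_budget_bytes) - 1 and used[gpu] + size > gpu_budget_bytes[gpu]:
            gpu += 1
        used[gpu] += size
        counts[gpu] += 1
    return ",".join(str(c) for c in counts)


# Example: 10 GPUs with ~11 GiB usable each (hypothetical headroom figure);
# layer_bytes would come from the per-layer script above.
# print(suggest_tensor_split(layer_bytes, [11 * 2**30] * 10))
```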
