CUDA acceleration not working (GPU utilization ~0% on dual A6000)
Hi,
I followed the instructions from the model card and cloned the code from PR #16095.
During the build, I explicitly enabled CUDA using the following command:
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
make -C build -j
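For reference, grepping the startup log should show whether the CUDA backend actually initializes (a minimal sketch; /path/to/model.gguf is a placeholder for the model file below):
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "hi" 2>&1 | grep -i cuda
# a CUDA-enabled build prints a line like: ggml_cuda_init: found 2 CUDA devices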
However, when running the example programs, the inference speed is extremely slow.
My machine has two NVIDIA A6000 GPUs — memory usage appears normal (GPU0: 28455MiB / 46080MiB, GPU1: 25337MiB / 46080MiB), but the GPU utilization stays around 0% in nvidia-smi.
It seems that CUDA acceleration is not actually being used.
Could you please advise on what might be causing this?
The model is Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
Use this code instead:
wget https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz
tar xf test.tar.gz
cd llama.cpp-test
# make sure nvcc is on PATH so cmake picks up the CUDA toolkit
export PATH=/usr/local/cuda/bin:$PATH
time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
This will give you the CUDA acceleration. It is not yet integrated into PR #16095, nor into the main branch.
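To try it after the build (a sketch; the model path is a placeholder, using the quant named in the first post):
./build/bin/llama-cli -ngl 100 -m /path/to/Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf -p "explain quantum computing in a paragraph"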
I get about 50 tokens/second generation on an NVIDIA L40S 48GB and 800 t/s for prompt eval.
Hi, I have an RTX A6000 and am using it with the test build, offloading all layers to the GPU. I am using the MXFP4_MOE version but only getting 22 t/s. Since the L40S and the A6000 should be roughly equal in performance, I don't understand why I get only half the speed :/
Not sure.
I tested on an RTX 6000 48GB and got slightly better token generation (60 t/s), but a lower prompt eval speed (264 t/s).
# download and unpack the test branch
time wget 'https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz'
tar xf test.tar.gz
cd llama.cpp-test
# build with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --parallel 15
# fetch the MXFP4_MOE quant of the model
pip install hf_transfer
pip install 'huggingface_hub[cli]'
cd build/bin
hf download lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF --include "MXFP4_MOE.gguf" --local-dir MXFP4_MOE
# run with all layers offloaded to the GPU
./llama-cli -ngl 100 -m MXFP4_MOE/*.gguf --no-mmap --prompt "explain quantum computing in a paragraph" -st
user
explain quantum computing in a paragraph
assistant
Quantum computing is a revolutionary approach to computation that leverages the principles of quantum mechanics—such as superposition, entanglement, and interference—to process information in ways fundamentally different from classical computers. While classical bits are either 0 or 1, quantum bits (qubits) can exist in a combination of both states simultaneously thanks to superposition, allowing quantum computers to explore many possible solutions at once. Entanglement enables qubits to be correlated in such a way that the state of one instantly influences the state of another, even over large distances, vastly increasing computational power for certain problems. Quantum interference is used to amplify correct computational paths and cancel out wrong ones. This enables quantum computers to solve specific problems—like factoring large numbers, simulating quantum systems, or optimizing complex processes—exponentially faster than classical machines, though they are not universally faster and remain highly sensitive to environmental noise, requiring extreme cooling and error correction. [end of text]
llama_perf_sampler_print: sampling time = 17.16 ms / 204 runs ( 0.08 ms per token, 11885.34 tokens per second)
llama_perf_context_print: load time = 5119.45 ms
llama_perf_context_print: prompt eval time = 53.00 ms / 14 tokens ( 3.79 ms per token, 264.13 tokens per second)
llama_perf_context_print: eval time = 3118.63 ms / 189 runs ( 16.50 ms per token, 60.60 tokens per second)
llama_perf_context_print: total time = 3223.29 ms / 203 tokens
llama_perf_context_print: graphs reused = 0
Voila
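One more note for the dual-GPU setup in the first post: it may be worth pinning the run to a single GPU, or controlling the split explicitly, to rule out cross-GPU overhead (flags as in mainline llama.cpp; check ./llama-cli --help in the test build to confirm they are wired up):
# run on GPU 0 only
CUDA_VISIBLE_DEVICES=0 ./llama-cli -ngl 100 -m MXFP4_MOE/*.gguf --no-mmap --prompt "explain quantum computing in a paragraph" -st
# or split layers across both GPUs evenly
./llama-cli -ngl 100 -sm layer -ts 1,1 -m MXFP4_MOE/*.gguf --no-mmap --prompt "explain quantum computing in a paragraph" -st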
Thanks for getting back to me. The speed is back to normal now 🌹