What speeds do you get at Q8 on the AMD Ryzen™ AI Max+ 395?
Hello, since running this at Q8 on RTX 3090s would need 5 GPUs, I want to ask: what speeds do you get with the AMD AI Max+ 395?
At the start of the context, and with the context filled up to, say, 100k tokens?
I am getting around 200 t/s prompt processing (pp) and 30 t/s text generation (tg) at 0 context; it slows to a crawl at around 20-30K context, with less than 50 pp and 7.5 tg.
Very usable up to 20-30k tokens.
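If you want to reproduce these numbers, something like this llama-bench run should do it. The model path is a placeholder, and the -d/--n-depth flag assumes a reasonably recent llama.cpp build:

    # pp512/tg128 at empty context vs ~30K tokens of prefilled depth
    # (model path is a placeholder; -d needs a recent llama.cpp build)
    llama-bench \
      -m ./model-Q8_0.gguf \
      -p 512 -n 128 \
      -d 0,30000 \
      -ngl 99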
Thank you. Coding requires at least 128k context, as the prompt in cline, roocode, or kilocode is 30K minimum: reading 4 source files (js, html, css, json) plus 4 documentation files (md files like API_DOCUMENTATION.md, README.md, technical_specification.md). So this AMD Ryzen™ AI Max+ 395 is not usable for coding :(
Sadly, a GPU is still the only way to get long context. If you only have 100 t/s prompt processing, it will take 5 minutes to process a 30k prompt...
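Back-of-the-envelope, since prefill time is roughly prompt length divided by the pp rate:

    # rough wait time for a 30k-token prompt at different pp rates
    for pp in 50 100 200 400; do
      echo "$pp t/s -> $((30000 / pp)) s to prefill 30k tokens"
    done
    # 100 t/s -> 300 s, i.e. the 5 minutes above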
I found this one to be a bit more performant on my strix halo machine: https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF. I'm using Q6_K and it's my go-to for coding tasks at the moment.
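If anyone wants to try it, something like this should pull and serve the Q6_K quant, assuming a llama.cpp build that understands the -hf repo:quant shorthand; the context size is just where I'd start, not a tuned value:

    # download the Q6_K quant from Hugging Face and serve it;
    # -c 131072 = 128K context for coding agents, -ngl 99 = full offload
    llama-server \
      -hf bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF:Q6_K \
      -c 131072 -ngl 99 --port 8080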
How did you get Kwaipilot KAT-Dev up and running?
I was able to get it working, but I had to use the prompt template here: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp. It's decent, but pretty bad at tool calling; it constantly fails with roo code, and I found that GLM Air REAP is much more consistent. Here are the llama server args I gave it: https://github.com/blake-hamm/bhamm-lab/blob/main/kubernetes/manifests/apps/ai/models/helm-green.yaml#L142 (probably still needs some tweaking; I'd love any feedback, as I'm still new to llama.cpp in general).
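Outside of Kubernetes, the gist of it is roughly this; the model path and quant are placeholders, and the .jinja file is the prompt template copied from the Kwaipilot page above:

    # placeholder path/quant; kat-dev.jinja holds the template
    # from the KAT-Dev-72B-Exp model page
    llama-server \
      -m /models/KAT-Dev-72B-Exp-Q6_K.gguf \
      --jinja \
      --chat-template-file /models/kat-dev.jinja \
      -c 65536 -ngl 99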
In case it's a helpful comparison (not Q8, but an MXFP4 quant): I got roughly 13 tok/sec running GLM-4.5-Air (specifically noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF) with 131K context on a GMKtec EVO-X2 (AMD AI Max+ 395, Windows, ROCm) using LM Studio. Not fast, but quite usable.
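If anyone wants to sanity-check the tok/sec themselves, LM Studio exposes an OpenAI-compatible server on port 1234 once you enable it; the model name below is a placeholder for whatever identifier LM Studio shows for the loaded model:

    # query LM Studio's local OpenAI-compatible endpoint
    # (model name is a placeholder; use the id LM Studio reports)
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "glm-4.5-air-derestricted-mxfp4",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 128
      }'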
Q6 is the minimum requirement for coding, I think, after testing Q8 and Q8-XL quants of other LLMs.
