What speeds do you get at Q8 on the AMD Ryzen™ AI Max+ 395?
Hello, since running this at Q8 on RTX 3090s would need 5 GPUs, I want to ask: what speeds do you get with the AMD AI Max+ 395?
At the start of the context, and with the context filled up to, say, 100k tokens?
I am getting around 200 t/s prompt processing (pp) and 30 t/s text generation (tg) at 0 context; it slows to a crawl at around 20-30K context, with less than 50 pp and 7.5 tg.
Very usable up to 20-30k tokens.
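If you want to reproduce these numbers, something like this llama-bench run should do it. The model path is a placeholder, and the -d/--n-depth flag assumes a reasonably recent llama.cpp build:

    # pp512/tg128 at empty context vs ~30K tokens of prefilled depth
    # (model path is a placeholder; -d needs a recent llama.cpp build)
    llama-bench \
      -m ./model-Q8_0.gguf \
      -p 512 -n 128 \
      -d 0,30000 \
      -ngl 99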
Thank you. Coding requires at least 128k context, as the prompt in cline, roocode, or kilocode is 30K minimum: reading 4 source files (js, html, css, json) plus 4 documentation files (md files like API_DOCUMENTATION.md, README.md, technical_specification.md). So this AMD Ryzen™ AI Max+ 395 is not usable for coding :(
Sadly, a GPU is still the only way to get long context. If you only have 100 t/s prompt processing, it will take 5 minutes to process a 30k prompt...
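Back-of-the-envelope, since prefill time is roughly prompt length divided by the pp rate:

    # rough wait time for a 30k-token prompt at different pp rates
    for pp in 50 100 200 400; do
      echo "$pp t/s -> $((30000 / pp)) s to prefill 30k tokens"
    done
    # 100 t/s -> 300 s, i.e. the 5 minutes above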
I found this one to be a bit more performant on my strix halo machine: https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF. I'm using Q6_K and it's my go-to for coding tasks at the moment.
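If anyone wants to try it, something like this should pull and serve the Q6_K quant, assuming a llama.cpp build that understands the -hf repo:quant shorthand; the context size is just where I'd start, not a tuned value:

    # download the Q6_K quant from Hugging Face and serve it;
    # -c 131072 = 128K context for coding agents, -ngl 99 = full offload
    llama-server \
      -hf bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF:Q6_K \
      -c 131072 -ngl 99 --port 8080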
How did you get Kwaipilot KAT-Dev up and running?
I was able to get it working, but I had to use the prompt template here: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp. It's decent, but pretty bad at tool calling; it constantly fails with roo code, and I found that GLM Air REAP is much more consistent. Here are the llama server args I gave it: https://github.com/blake-hamm/bhamm-lab/blob/main/kubernetes/manifests/apps/ai/models/helm-green.yaml#L142 (probably still needs some tweaking; I'd love any feedback, as I'm still new to llama.cpp in general).
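Outside of Kubernetes, the gist of it is roughly this; the model path and quant are placeholders, and the .jinja file is the prompt template copied from the Kwaipilot page above:

    # placeholder path/quant; kat-dev.jinja holds the template
    # from the KAT-Dev-72B-Exp model page
    llama-server \
      -m /models/KAT-Dev-72B-Exp-Q6_K.gguf \
      --jinja \
      --chat-template-file /models/kat-dev.jinja \
      -c 65536 -ngl 99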
In case it's a helpful comparison (not Q8, but an MXFP4 quant): I got roughly 13 tok/sec running GLM-4.5-Air (specifically noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF) with 131K context on a GMKtec EVO-X2 (AMD AI Max+ 395, Windows, ROCm) using LM Studio. Not fast, but quite usable.
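If anyone wants to sanity-check the tok/sec themselves, LM Studio exposes an OpenAI-compatible server on port 1234 once you enable it; the model name below is a placeholder for whatever identifier LM Studio shows for the loaded model:

    # query LM Studio's local OpenAI-compatible endpoint
    # (model name is a placeholder; use the id LM Studio reports)
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "glm-4.5-air-derestricted-mxfp4",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 128
      }'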
Q6 is the minimum requirement for coding, I think, after testing Q8 and Q8-XL quants of other LLMs.
