Great performance on 96 GB VRAM setup

#2
by phakio - opened

I'm getting upwards of 2800 t/s prompt processing and 110 t/s generation on the Q4 version with a 3x3090 + 1x4090 setup (ik_llama.cpp).

I'll have to play with it more, but in my testing it handled lewd requests seamlessly, without refusals. I think this model is very creative and offers a good tradeoff between speed and knowledge.

Also, I noticed that when generating, it doesn't create as much heat as other models do. Sure, there's load on the GPUs, but my fan speeds and temps are noticeably lower.

Prompt
- Tokens: 1802
- Time: 571.585 ms
- Speed: 3152.6 t/s
---
Generation
- Tokens: 814
- Time: 7246.257 ms
- Speed: 112.3 t/s
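
For anyone comparing against their own runs, the speeds above are just token count divided by elapsed time. A minimal sketch of that arithmetic in Python, using the numbers from this run (the function name `tokens_per_second` is my own for illustration and is not part of ik_llama.cpp):

```python
def tokens_per_second(tokens: int, time_ms: float) -> float:
    """Convert a token count and an elapsed time in milliseconds to tokens/second."""
    if time_ms <= 0:
        raise ValueError("time_ms must be positive")
    return tokens / (time_ms / 1000.0)


# Figures copied from the run reported in this post.
prompt_speed = tokens_per_second(1802, 571.585)
generation_speed = tokens_per_second(814, 7246.257)

print(f"Prompt:     {prompt_speed:.1f} t/s")
print(f"Generation: {generation_speed:.1f} t/s")
```

Rounded to one decimal place, this should match the reported 3152.6 t/s and 112.3 t/s.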
