From Ether to Syntax: A Meta-Analytic Exploration of Linguistic Algorithmic Landscapes

#6
by mradermacher - opened

continued....

mradermacher changed discussion status to closed

Here is a complete list of the newly added architectures.

The non-mm-archs are picked up automatically when llama is updated (rather, nothing checks for these archs, other than the script that shows me daily models).

Nice. Will do in case you forgot any vision/audio architecture.

In case you need it, the list/regex is currently in /llmjob/share/llmjob.pm - search for is_vision

Also, vision is mradermacher code for multi-modal from now on.

BERT-based architectures seem to be incredible

I might exclude them from the daily list for that reason, and them being likely not popular with the people who consume ggufs. (and most fail because small models tend to have custom tokenizers).

Nice, I just discovered an easy way to requeue previously failed architectures:

Yup, shell-greppable logs for the win.

Update: oh, it's not even the real log file, "just" the llmc why transform of it.

@RichardErkhov vision models should not be queued to rich1 unless they are not being detected as such (and then no vision extraction should happen).

The non-vision jobs are limited to 32GB ram, too. No clue what happened. Very troubling.

However, this morning, only besteffort models were queued on rich1. Who knows what nico queued...

well, good to know. usually you take like 4-8gb, but something went wrong today. Peak recorded by proxmox was 24gb (so I assume it was even higher, but due to total OOM, it might not have recorded the full number). I added swap on root just in case this happens again so at least other things on the server don't die haha

llmc audit besteffort skips the besteffort models for me.

Please restart Audio-Reasoner imatrix computation. I killed it earlier today because it ran on CPU. I'm still not sure what makes GPUs occasionally temporarily disappear, but it seems related to them being used in a different container.

llmc audit besteffort skips the besteffort models for me.

Right, arguments were not passed to llmjob audit. Should be fixed now.

@RichardErkhov

Peak recorded by proxmox was 24gb

Well, given that I was officially allowed to use 64GB, 24GB seems absolutely normal. So what is the new limit? 24GB will only allow one quant, and maybe not even that.

@nicoboss

So what do we do about this: -1800 1370 si DeepSeek-R1-0528-bf16 budget/hfd/3765G

(I assume it is the expanded version of DeepSeek-R1-0528)

So what do we do about this: -1800 1370 si DeepSeek-R1-0528-bf16 budget/hfd/3765G

We wait for FineTuned-Llama-3.1-Nemotron-Ultra-253B-v1 and Llama-3_1-Nemotron-Ultra-253B-CPT-v1 to be done, no longer schedule any models to nico1 and ignore the budget. It is the BF16 expanded version of DeepSeek-R1-0528 and so can be directly converted into GGUF without any manual interference. However, we for sure want the RPC setup for Q8 imatrix computation for this one as the model is amazing.

Please no longer schedule any models to nico1 until we have managed to convert DeepSeek-R1-0528-bf16 to GGUF.

Ah wow, it has not even downloaded the SafeTensors model yet. So it will not fit for sure. I'm now downloading the official DeepSeek-R1-0528 non-upscaled version.

Well, given that I was officially allowed to use 64GB, 24GB seems absolutely normal. So what is the new limit? 24GB will only allow one quant, and maybe not even that.

I mean it really spiked. Usually it shows as 8gb in proxmox (shows, not uses). That means the spike was much more than that, because if proxmox dies, it might not have logged the usage.
Here's the maximum weekly graph, crash highlighted with a circle

image.png

Like he is usually fine with this load, but something went really really wrong

I'm currently converting DeepSeek-R1-0528 to BF16 and then to GGUF. The resulting DeepSeek-R1-0528.gguf will be available under /cpool/DeepSeek-R1-0528.gguf in a few hours. Please make sure to whitelist the newly added cpool 4 TB NVMe SSD storage pool.

mmm just 685B model

@mradermacher @nicoboss

image.png

congrats on recognition lol

@RichardErkhov

only the best models!111

Like he is usually fine with this load, but something went really really wrong

Well, we will likely not find out. A large vision model in the past would explain it, because those were run without memory limits and loaded the whole model. But we neither quanted those, nor does current llama.cpp load the whole model into memory anymore. Furthermore, all jobs are currently limited to 32GB.

More evidence would be needed.

Also, I think somehow some jobs have been lost on rich1, any idea how that could have happened?

Also, I think somehow some jobs have been lost on rich1, any idea how that could have happened?

How does one lose a job? They got fired? I don't know, it's not my HR on my server managing jobs lol

More evidence would be needed.

He died lol, idk, if you need me to run commands to check lmk, otherwise idk

only the best models!111

Lol, and I'm not on the list hah, oh well, my skill issue, should be better haha

Please make sure to whitelist the newly added cpool 4 TB NVMe SSD storage pool.

Lucky for us, /cpool was already whitelisted (if you want to, you can check /llmjob/share/bin/llmjob - search for e.g. /bpool to get a list (in the safe-exec block))

Anyway, the situation is aggravated because we still have 1.6T of llama4. At some point, we might have to give up or make executive-level decisions.

@nicoboss xet maybe takes its revenge, we have 2TB of files in ~/.cache/huggingface/xet

I have no clue what those files are, or when it is safe to delete them. Possibly xet is caching every download with no info on which files belong to which repo.

seems xet ignores --cache-dir completely, will look for a workaround (such as HF_XET_CACHE). Or maybe it happens on upload - no clue.

It also means that xet is likely causing lots of extra I/O and disk usage :(

yeah, definitely happens during upload. xet looks like a disaster (implementation, not theory). why does it make copies of everything during upload?

I am trying to work around it by creating a temporary directory for every upload and deleting it afterwards, but that means more manual cleanup if something goes wrong, lots more I/O and probably little reduction of uploads - I have not the slightest idea why xet does this at all - it should just split the file logically and then skip uploading parts that are already uploaded.

Trying HF_XET_CHUNK_CACHE_SIZE_BYTES=0 instead, maybe that helps.
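
Roughly the idea of the per-upload workaround, sketched in Python for illustration (the actual uploader is not Python; the env var names are the ones mentioned here, and whether the xet backend actually honors them at upload time is exactly what is being tested; the helper name is made up):

```python
import os
import tempfile

from huggingface_hub import HfApi


def upload_with_isolated_xet_cache(repo_id: str, local_path: str, path_in_repo: str):
    """Upload one file while pointing the xet client at a throwaway cache directory."""
    api = HfApi()
    with tempfile.TemporaryDirectory(prefix="xet-cache-") as tmp:
        # Per-upload cache location plus a disabled chunk cache; these likely need to
        # be set before the xet backend initializes, otherwise they may have no effect.
        os.environ["HF_XET_CACHE"] = tmp
        os.environ["HF_XET_CHUNK_CACHE_SIZE_BYTES"] = "0"
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
        )
    # The temporary directory is removed here even if the upload raised, which is
    # the whole point: nothing accumulates in ~/.cache/huggingface/xet.
```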

Sorry I had to kill AlphaMed-3B-base-rl and Qwen3-8b-Base-RP-slerp imatrix computation because they were running on CPU instead of GPU. It would be awesome if you could add a command for me to restart imatrix tasks myself when this happens. Also maybe make it so pausing imatrix.nico1 doesn't block run/hfu of already generated imatrix quants.

Trying HF_XET_CHUNK_CACHE_SIZE_BYTES=0 instead, maybe that helps.

That should work. I'm a bit confused that it uses this for upload. It should only be used for downloads and exists to make it so you need to download less data by only downloading the unique chunks.

Lucky for us, /cpool was already whitelisted (if you want to, you can check /llmjob/share/bin/llmjob - search for e.g. /bpool to get a list (in the safe-exec block))

Which was great, as I started DeepSeek-R1-0528 yesterday and we have already made quite some progress.

Anyway, the situation is aggravated because we still have 1.6T of llama4. At some point, we might have to give up or make executive-level decisions.

I agree. Let's finally do all these annoying stuck Llama 4 models. I will try to somehow get imatrix computation to work properly.

DeepSeek-R1-0528 is actually the first model we do with MLA, so if it works it is safe to also do the other models waiting for MLA and start requeuing the ones we did without MLA.

However, this morning, only besteffort models were queued on rich1. Who knows what nico queued...
More evidence would be needed.

I was asleep when this happened and did not queue any models in the hours prior to the rich1 incident.

It should only be used for downloads and exists to make it so you need to download less data by only downloading the unique chunks.

Sure, but even for downloading, just caching everything forever without apparent limit...

I was asleep when this happened and did not queue any models in the hours prior to the rich1 incident.

Right, but queuing immediately before is not needed. Anyway, it was humorous :) But I can't explain it easily.

But I can't explain it easily.

It's me, whatever can go wrong will go wrong

@nicoboss

please test llmc force-restart-imatrix jobname... to test job restarting

Also maybe make it so pausing imatrix.nico1 doesn't block run/hfu of already generated imatrix quants.

gave that a try as well. turns out, for testing i force-enabled hfd while paused, so i could just use that code for hfu instead. so it might actually work...

It's me, whatever can go wrong will go wrong

You are special!

You are special!

Special needs?

that, too. very special. but in a totally endearing, adorable way.

So, HF_XET_CHUNK_CACHE_SIZE_BYTES=0 does not, as documented, disable the cache, apparently. Probably it means another cache. Also, ignoring --cache-dir is not nice. But maybe it's not growing as much(?) I'll have to watch that, and maybe wipe it every night, assuming that doesn't crash any uploads.

nico1 alone had 2.3TB of cache today.

Nope, still growing, but hopefully much slower than before. HF_XET_CACHE does not seem to affect it, so it really doesn't seem to be that cache. I'll experiment with deleting ~/.cache/huggingface/xet/ after every upload.

If XET talks about a cache they usually mean the download cache. I wasn't even aware that there is an upload cache.

they probably need at least some metadata to deduplicate chunks at client side. but it seems to be permanent.

Nope, deleting the cache creates a race condition. I'll probably have to create a per-request temporary directory. eff**

RuntimeError('Data processing error: MerkleDB Shard error: File I/O error') at /llmjob/share/bin/llmjob line 2922.

Worse: deleting the "cache" can permanently break xet uploads.

Nope, deleting the cache creates a race condition. I'll probably have to create a per-request temporary directory. eff**

RuntimeError('Data processing error: MerkleDB Shard error: File I/O error') at /llmjob/share/bin/llmjob line 2922.

Worse: deleting the "cache" can permanently break xet uploads.

This was honestly so expected. Obviously it can't work if you delete the shared cache while an upload is running. Upload cache is relatively small compared to uploaded data so maybe just put the cache into tmpfs so it hopefully stays in RAM.

I've disabled xet for uploads (hopefully), for the time being. This bullshit was taking way more time than I had.

I've disabled xet for uploads (hopefully), for the time being. This bullshit was taking way more time than I had.

Doing so is quite bad as this means we force huggingface to download and reupload every model we upload. It also is only a temporary solution as LFS will probably soon be discontinued. I also have much faster upload speeds with XET. Maybe instead ask them how to disable or limit the upload cache. They said we should ask them if we have issues with XET.

This was honestly so expected.

Only if you expect shoddy work (I don't). Files in .cache are non-essential by definition and can be deleted at any time.

Upload cache is relatively small

The cache is actually very big, because it caches whole chunks of data. And is disabled already.

The problem we are dealing with here is something else, most likely metadata such as hashes. And most likely is not a cache at all.

cache into tmpfs

As I mentioned, it was 2.3TB within a few days, on nico1 alone. And is never cleaned up, afaics. The latter is the main issue.

A per-upload "cache" is the only way I can see that can work (we alreadyx use a per-download xet "cache"). The problem is cleaning it up.

Doing so is quite bad as this means we force huggingface to download and reupload every model we upload.

If they do that, then that is their problem. There is no technical need to download and reupload every model we upload. If there was, then we would have to do that client-side, too (since we deduplicate there). If their system is not totally broken, all that happens is slightly more upload bandwidth. We can work around that another time.

If they do that, then that is their problem. There is no technical need to download and reupload every model we upload. If there was, then we would have to do that client-side, too (since we deduplicate there).

There is a technical reason for them to do so. For LFS we upload directly to S3 storage as a single file without any chunking/deduplication, so the only way for them to migrate the file to XET is to download it and then reupload it to XET by chunking and deduplicating it client-side.
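
Conceptually, the client-side step works something like this toy sketch of chunk-and-hash deduplication. It is not the real xet implementation (which uses content-defined chunk boundaries and Merkle shards), just the general idea of why only missing chunks need to be uploaded:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # toy fixed-size chunks; xet uses content-defined boundaries


def chunks_to_upload(path: str, hashes_on_server: set) -> list:
    """Split a file into chunks and keep only those whose hash the server does not know."""
    missing = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in hashes_on_server:
                missing.append(chunk)
    return missing
```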

@mradermacher I think we are just experiencing this bug: https://github.com/huggingface/xet-core/issues/350

There is a technical reason for them to do so.

It is still their choice, is it not? The client side is not hardcoded to use S3, it's the server side dictating it. Furthermore, we are not using LFS, we are using their client library, so even more under their control.

I think we are just experiencing this bug

Good find, yes, that sounds like it could be the issue. Although our issue is not that it doesn't honor the setting (it clearly does in our case).

So just downgrade

"just". I'll just enable it once this is fixed.

it seems it indeed is neither a cache, nor does it store chunks, but only metadata. That it grows so big is simply a bug caused by parallel uploads. I don't think we have good reason to believe that downgrading will even do anything about it (the single guy who claimed that might have just disabled xet).

Update: the downgrading comment in that bug report is not about that bug report, so the only fix would indeed be a per-upload cache. And the only short-term workaround is to disable it.

@mradermacher Please update to the latest llama.cpp and start DeepSeek-R1-0528 RPC imatrix computation. I already prepared everything, including /tmp/DeepSeek-R1-0528.gguf pointing to /cpool/DeepSeek-R1-0528.Q8_0.gguf

Regarding the RPC arguments https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#677e2405978ed7e72c0a8738 should still be correct:

"extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204",
"force" : 1,
"llama" : "/root/cvs/llama.cpp-nocuda",
"ngl" : "10000",
"quant" : "Q8_0",

Slightly related only: Since rpc parameters are somewhat standardized by now, I could provide a llmc command to "reconfigure" an imatrix job into a "big rpc imatrix job", and maybe also add some more imatrix management commands. Of course, timing is a bit bad, but we can discuss - I love making you more independent.

Given how many imatrix RPC tasks we will have in the following weeks, I recommend you implement this, or coordination for each of them will be quite annoying.

If possible please set them all as RPC imatrix tasks but just block all except DeepSeek-R1-0528 for now and let me know how to unblock:

  • DeepSeek-R1-0528
  • DeepSeek-V3-abliterated
  • r1-1776

DeepSeek-V3-0324-Pruned-Coder-411B we should be able to do without RPC in Q8.

The imatrix RPC setup is still ready. Please start DeepSeek-R1-0528 imatrix computation. I will be asleep now but no manual interference should be needed from my side if everything goes as planned.

hello @mradermacher ! I wanted to ask if you want to reupload top 1000 models (by downloads, or whatever you want to reupload) to modelscope.cn . It's a relatively big platform, and as I see some people ask to reupload some models, like here https://huggingface.co/mradermacher/Ling-Coder-lite-i1-GGUF/discussions/1#68428e1884a27e0a79d4c10b . I think it's a good idea, what do you think? I can do it on rich1, so just need your permission to basically use your name and models on modelscope

@RichardErkhov I don't, no. But anybody can if they want to, at least, if the license allows it.

The question therefore is if we should use the official "mradermacher" name, i.e. make an official mradermacher presence. I don't think it is necessary, because, again, anybody can upload those and say these were "mradermacher quants". If we decide yes, and you upload it, then that's as official as it gets :-)

@nicoboss what are your thoughts?

A very minor side note, we can now officially make ternary quants, and our first model was https://hf.tst.eu/model#BitCPM4-0.5B-GGUF

Quantising worked without changes, but the readme generation didn't recognize it etc. etc. And we of course don't autodetect source models, so we almost certainly have quantized ternary models before, the "wrong" way.

(llama update etc. is running)

(llama update etc. is running)

Awesome. I'm still waiting for you to start DeepSeek-R1-0528 imatrix computation. The imatrix RPC setup is still ready for you to use. Now all the RPC imatrix quants seem to be gone from the imatrix queue, which is a bit concerning.

@nicoboss what are your thoughts?

I support the idea of @RichardErkhov reuploading our most popular quants. I'm fine with him using the mradermacher brand for it. I feel it's good to have a mirror on a different platform in case anything happens with HuggingFace.

hmm, i don't know how, but I somehow killed the imatrix queue.

@nicoboss how about the other models?

@richarderkhov mradermacher decided to give it a go, you'll be the official representative :)

@nico1 for future reference, the gguf filename would have to be DeepSeek-R1-0528.Q8_0.gguf

@richarderkhov mradermacher decided to give it a go, you'll be the official representative :)

🎉🎉🎉🎉🚀🚀🚀🚀

@nico1 for future reference, the gguf filename would have to be DeepSeek-R1-0528.Q8_0.gguf

Sorry for that. I named it like that on cpool but then thought I still have to use the same name as for non-quantized models for the imatrix computation task to find the file. I will keep the quantized name the next time.

@nicoboss how about the other models?

Feel free to first imatrix compute all the non-RPC ones and then make it so all the RPC ones start automatically after that. I will prepare the Q8 quants of the other RPC imatrix models while the first one is running, so you can already queue them with the expectation that the Q8 quant will be there by the time those RPC imatrix tasks start.

@richarderkhov mradermacher decided to give it a go, you'll be the official representative :)

🎉🎉🎉🎉🎉🎉🎉
🎉🚀Amazing! 🚀🎉
🎉🎉🎉🎉🎉🎉🎉

that's a lot of rockets, and I now know how to restore the imatrix queue manually.. I think I need the same typo prevention system as llmjob. In fact, I have no idea why I didn't implement it already :(

that's a lot of rockets

yeah, that's my server's fans and faint sounds of happiness that you can hear because of the fan sounds

Sorry for that. I named it like that on cpool but then thought I still have to use the same name as for non-quantized models for the imatrix computation task to find the file.

How would you know...

I'll add the other rpc jobs once you have the files (not quite sure what happens if I add them in advance when the file is gone, probably it will just not be able to detect the size)

Do we also want to do r1-1776?

oh, i forgot to mention, DeepSeek-R1-0528 is configured and should start once its turn has come

oh, i forgot to mention, DeepSeek-R1-0528 is configured and should start once its turn has come

Are you sure? It shows blocked/budget on the status page.

I'll add the other rpc jobs once you have the files (not quite sure what happens if I add them in advance when the file is gone, probably it will just not be able to detect the size)

Great I will let you know once the other Q8 files are ready. I'm waiting for the imatrix task to start before starting Q8 quantization to make sure I have enough RAM left.

Do we also want to do r1-1776?

Yes we also want to do r1-1776. DeepSeek-V3-0324-Pruned-Coder-411B maybe doesn't require the RPC setup in Q8 but it will be super tight so we will see.

yeah, the "force" will take care of budget issues :)

Are you going to provide the q8 for some/all the models, or should I do some?

@richarderkhov mradermacher decided to give it a go, you'll be the official representative :)

I created the following modelscope account and gave access to it to Richard: https://www.modelscope.cn/profile/mradermacher

yeah, the "force" will take care of budget issues :)

Awesome it worked. It just started. I saw it successfully connecting to all the RPC servers.

Are you going to provide the q8 for all the models, or should I do some?

I will start generating them as soon as the RPC imatrix computation has finished loading the model and I know how much RAM is available.

yeah, that's my server's fans and faint sounds of happiness that you can hear because of the fan sounds

:)
Is poor rich1 really a good place to do this though? It has bandwidth issues on a good day... It is certainly the easiest option, though.

would it load faster if you set readahead to 32GB? :)

in any case, i think you can put up to 2tb of q8's on /tmp at the moment

@mradermacher They are all ready and located on /bpool with softlinks under /tmp:

RPC imatrix queue:

  • DeepSeek-V3-abliterated.Q8_0.gguf
  • r1-1776.Q8_0.gguf

Queue exclusive but without RPC as while tight it should fit on StormPeak:

  • DeepSeek-V3-0324-Pruned-Coder-411B.Q8_0.gguf

DeepSeek-R1-0528 is going great so far and should be done in a few hours:

2  713 DeepSeek-R1-0528                              run/imatrix (GPU-2d) / 270.38s/c 686.5/1419.5m(903.0-944.3) [229/315] 7.6604

I recommend we let all the other models have their imatrix computed between the different imatrix RPC tasks. I also recommend that we do DeepSeek-V3-abliterated next, as someone requested it just around 2 days ago and it is also the one I'm most excited about from the remaining models.

I currently disabled quantisation tasks on nico1 as I was generating the Q8 quants but they could be reenabled now if you can make sure that no vision models will get processed.

configured as such. will not see when the first model finishes, so i can only hope everything goes as planned :)

if you can make sure that no vision models will get processed.

that should be the default config when imatrix-"force" is in effect - you should see that it works when you unblock it and the jobs go to "blocked/vision" (or block/admin/vision maybe). if they don't do that, then it doesn't work.

well, since no vision models are queued currently, the point is moot. i've reenabled llmjobs (one hfd, one quant while force is in effect). but when i queue some next, i'll check that they get paused.

the only concern is that after the big job, normal rules go into effect (such as quanting two models), and when the next big job is started, those jobs do not get stopped.

It skipped DeepSeek-V3-abliterated despite the lower nice level and started with DeepSeek-V3-0324-Pruned-Coder-411B. It also used RPC for it which I feel like is not at all needed as it should barely fit on StormPeak without RPC. We should at least try to run it without RPC first as if it fits it will be twice as fast. I killed it for now so we do DeepSeek-V3-abliterated RPC imatrix computation first.

Sometimes I really hate llama.cpp for not fixing known issues. Our DeepSeek imatrix is affected by https://github.com/ggml-org/llama.cpp/pull/12801#issuecomment-2824767949

ggml_cuda_init: failed to initialize CUDA: OS call failed or operation not supported on this OS
================================ Have weights data with 781 entries
[   1/1086]                        output.weight - [ 7168, 129280,     1,     1], type =    f16, 
====== llama_model_quantize_impl: did not find weights for output.weight
converting to q6_K .. load_imatrix: imatrix dataset='imatrix-training-full-3'
load_imatrix: loaded 781 importance matrix entries from DeepSeek-R1-0528-i1-GGUF/imatrix.dat computed on 315 chunks
prepare_imatrix: have 781 importance matrix entries
size =  1767.50 MiB ->   724.95 MiB
[   2/1086]                   output_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   3/1086]                    token_embd.weight - [ 7168, 129280,     1,     1], type =    f16, 
====== llama_model_quantize_impl: did not find weights for token_embd.weight
converting to q2_K .. size =  1767.50 MiB ->   289.98 MiB
[   4/1086]                blk.0.attn_k_b.weight - [  128,   512,   128,     1], type =    f16, 

llama_tensor_get_type : tensor cols 128 x 512 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl

====== llama_model_quantize_impl: imatrix size 128 is different from tensor size 16384 for blk.0.attn_k_b.weight
llama_model_quantize: failed to quantize: imatrix size 128 is different from tensor size 16384 for blk.0.attn_k_b.weight
main: failed to quantize model from './DeepSeek-R1-0528.gguf'
job finished, status 47
job-done<0 DeepSeek-R1-0528 imatrix 47>

I will look into it tomorrow. I wonder how bartowski fixed it as he was able to imatrix quant it under https://huggingface.co/bartowski/deepseek-ai_DeepSeek-R1-0528-GGUF

sad morning

ooh, i see a model at modelscope. shiny!

I wonder how bartowski fixed it

Well, he will likely tell us, but I suspect it's the way things apparently need to be done with llama.cpp these days, namely by overriding the quant type for the tensor.

ah, no that won't even work in this case.

Sorry for the imatrix RPC OOM crashes. I tried moving 4 instead of 3 layers to the GPU as its GPU memory was only half full, and apparently that was enough to make it OOM. I will reset and retry the imatrix tasks in half an hour.

no issue at all. good to hear that there is a simple explanation :)

Actually I did it now to see if it works, and it did, but I will leave it paused for like 20 minutes while testing a model:

nico1 /tmp# llmc force-restart-imatrix DeepSeek-V3-abliterated
DeepSeek-V3-abliterated: cleared status
pushing...

I just realized that it takes almost an hour for it to even start using the GPUs so I just unpaused it now and will make sure to stop vLLM before it is done loading layers to CastlePeak which it always does first.

Here is quite a good explanation of why we are missing imatrix data for blk.0.attn_k_b.weight in our imatrix, which we computed with MLA activated: https://github.com/ikawrakow/ik_llama.cpp/pull/250

I pushed a patch to our custom llama.cpp fork that permanently gets rid of the following issues:

  • llama_model_quantize: failed to quantize: imatrix size X is different from tensor size Y for Z
  • Missing importance matrix for tensor XXX in a very low-bit quantization The result will be garbage, so bailing out
  • This finally unblocks Llama-4-Maverick-17B-128E-Instruct and Llama-4-Maverick-17B-128E quantization

Those issues are connected because if you fix the first one by just ignoring the imatrix, you have to fix the second one. The way I fixed it is by instead quantizing with the nearest static quant of better quality than the specified target quant, so that there should not be any quality degradation and only a minor size increase. You can review the exact code under: https://github.com/nicoboss/llama.cpp/pull/4/files
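
To make the behavior concrete, here is the selection rule sketched in Python. The real patch is C++ inside llama.cpp's quantize code; the type ordering below is illustrative, not the exact table it uses:

```python
# Illustrative quality ordering, lowest to highest; the real code works on llama.cpp's enum.
QUALITY_ORDER = ["IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS", "IQ2_S",
                 "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K", "Q8_0"]
# Very-low-bit types that llama.cpp refuses to produce without imatrix data.
NEEDS_IMATRIX = {"IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS", "IQ2_S"}


def pick_fallback(target: str, has_imatrix_data: bool) -> str:
    """Return the target type, or the nearest better static quant if imatrix data is missing."""
    if has_imatrix_data or target not in NEEDS_IMATRIX:
        return target
    for candidate in QUALITY_ORDER[QUALITY_ORDER.index(target) + 1:]:
        if candidate not in NEEDS_IMATRIX:
            return candidate  # first type that is safe to use statically
    return "Q8_0"
```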

@mradermacher Please update to the latest llama.cpp on our fork so we can start generating all the quants where llama.cpp failed us in the past. This fixes all the annoying model-specific llama.cpp issues we had and will have for future models. I'm so sick of constantly dealing with llama.cpp's garbage.

This should fix the following models based on llmc why:

cursed-ds-9b-ep2: imatrix size 128 is different from tensor size 2048 for blk.0.attn_k_b.weight
PowerMoE-3b: Missing importance matrix for tensor blk.1.ffn_gate_exps.weight in a very low-bit quantization
MoE-Girl-800MA-3BT : Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
Moe-3x7b-QA-Code-Inst: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
maid-yuzu-v5-extra : Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
maid-yuzu-v5 : Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
granite-3.0-3b-a800m-instruct: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
openmixtral-6x7b-v2: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
Azathoth-16x7B-bf16: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
Mixtral-8x22B-v0.1-resized-embeddings: Missing importance matrix for tensor blk.0.attn_q.weight in a very low-bit quantization
Lumosia-MoE-4x10.7 : Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
CollAIborate4x7B   : Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
Aura-MoE-2x4B: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
TinyJ.O.S.I.E.-3x1.1B-32k-Base: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
granite-3.1-3b-a800m-t1: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
BlackSheep-MoE-4x3B: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
granite-guardian-3.2-3b-a800m: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
Qwen3-30B-A3B-python-coder : Missing importance matrix for tensor blk.42.ffn_down_exps.weight in a very low-bit quantization

I remember we had some models in the past where we skipped some low-bit quants due to the "Missing importance matrix for tensor XXX in a very low-bit quantization The result will be garbage, so bailing out" issue. Because we did not nuke but instead skipped them, they are not inside llmc why. @mradermacher Do you have any way to check which models those were so we can add the missing quants?

I'm not happy with these changes, but willing to discuss. The reason is that I see mradermacher as the boring "default" quant, i.e., whatever llama.cpp gives, with little room for "creative" solutions. I feel this way so we can do bulk quantisations, while leaving room for other quanters to do more experimental/non-standard quants.

Of course it's difficult to draw an exact line, but for example, using a less populated imatrix seems fine, as there is no choice involved. (Although there is an open issue that I tried to discuss with you, but you were too busy at the time, namely, it seems we change the imatrix, and then continue to use it as input for further chunks).

Selecting a specific quantisation seems to me to be crossing the line, as it's not the obvious only choice.

imatrix training data is another issue, but there doesn't seem to be an alternative, and I would love to have an independent, "standard" training set.

If this patch is not in upstream, I don't think we should have it. At the very least this should be discussed with upstream first, and reasons should exist why upstream does not want to fix it, while we would want to.

I remember we had some models in the past where we skipped some low-bit quants due

Practically all qwen 2/3 "A" models for example. It might be possible to find these models - I should have the old job files somewhere.

I haven't read your link yet, but will do so soon.

the explanation by ikawrakow modulates this a bit. But why would we want to solve the issue differently than him? Again, I feel we shouldn't invent our own solutions. Although I am getting more and more open to ik_llama as upstream at some point :)

I'm not happy with these changes, but willing to discuss.

There are many reasons we encounter tensors without an importance matrix. The most common one is experts not getting covered by our imatrix training data. Now that MLA got introduced, we have a different one. There are now two paths inference can take: one using MLA and one without MLA. No matter how we train our imatrix, there will always be at least one path missing from the importance matrix.

My changes fix the much more general issue of how to handle tensors lacking imatrix data for whatever reason. Currently llama.cpp statically quants them in this case. However, they added a protection that bails out when the target quant is too low for this approach and the resulting model would be broken. With my changes it now uses the next-higher-quality static quant as the target in the cases where it would have bailed out before. This results in a working model without any quality degradation. This fixes all current and future issues of tensors without imatrix data.

Given that llama.cpp doesn't even offer an option to disable MLA, the layers responsible for the non-MLA path really don't matter anyway, as they are unused. bartowski1182 implemented an almost identical solution for his quants. bartowski1182's solution: "Ah yeah I basically just forced those tensors to Q8_0 and then also added an extra clause to that block that raises an error that, if it's targeting Q8_0, don't care if it's the wrong shape since it won't use imatrix to do it anyways"

The reason is that I see mradermacher as the boring "default" quant, i.e., whatever llama.cpp gives, with little room for "creative" solutions. I feel this way so we can do bulk quantisations, while leaving room for other quanters to do more experimental/non-standard quants.

We are just using static quants when there is no imatrix data available, like llama.cpp already does by itself. We are only fixing the cases where it would otherwise bail out, by implementing a solution similar to the one it already uses to handle the many cases where it can't quant to the requested destination quant for tensor shape reasons. Even now our quants are not all the same, as when llama.cpp encounters an edge case it already automatically changes the destination quant, as can be seen in the following example:

llama_tensor_get_type : tensor cols 128 x 512 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
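
For illustration, the shape rule behind that message, sketched minimally. 256 is the k-quant super-block size (QK_K); the real decision lives in llama.cpp's llama_tensor_get_type, and the exact fallback type varies:

```python
QK_K = 256  # k-quant super-block size in llama.cpp


def k_quant_possible(row_size: int) -> bool:
    """k-quants such as Q2_K pack each row into 256-element super-blocks,
    so the row size must be divisible by 256."""
    return row_size % QK_K == 0


# blk.0.attn_k_b.weight has rows of 128 values, so Q2_K is impossible and a
# smaller-block fallback such as IQ4_NL (block size 32) is picked instead.
assert not k_quant_possible(128)
```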

Selecting a specific quantisation seems to me to be crossing the line, as it's not the obvious only choice.

They are already doing that themselves to handle edge cases, like randomly choosing IQ4_NL if the shape is no good for Q2_K. Our implementation can be considered what makes the most sense: we just go with the smallest static quant that won't negatively affect the model's quality. Not choosing one would mean not being able to provide a quant at all. Choosing Q8_0 like bartowski1182 did seems like a massive waste of space for no reason. We could also always use IQ4_NL as a fallback, as upstream does for other edge cases, if you prefer, but that again kind of seems like a waste of space.

imatrix training data is another issue, but there doesn't seem to be an alternative, and I would love to have an independent, "standard" training set.

I'd rather properly handle all possible cases of tensors with no imatrix data, as this is something that will keep happening for multiple reasons. It seems unavoidable that there will always be some expert on some layer with no imatrix data for models with 100+ experts, no matter how good our imatrix dataset is. Our llama.cpp fork already implements many countermeasures to collect as much imatrix data as possible and it seems to work very well. I tested this fix for Llama-4-Maverick-17B-128E-Instruct and only maybe a handful of tensors had to be statically quantized, which is an almost insignificantly small portion of the model.

the explanation by ikawrakow modulates this a bit. But why would we want to solve the issue different than him? Again, I feel we shouldn't invent our own solutions.

Both bartowski1182 and I don't trust ikawrakow's solution. It seems like some hacky way of somehow creating imatrix data out of thin air by only measuring a single path and then extrapolating the measurements to the other path. It feels less clean than just statically quanting the path we don't take during imatrix training.

Although I am getting more and more open to ik_llama as upstream at some point :)

This seems like a terrible idea. It differs far too much from llama.cpp upstream. For example, for MLA they went with their own solution, which is incompatible with the one of llama.cpp upstream, so quants generated with ik_llama seem to not be guaranteed to be fully compatible with upstream llama.cpp.

I feel this way so we can do bulk quantisations, while leaving room for other quanters to do more experimental/non-standard quants.

Please keep in mind that constantly having to deal with those rare edge cases caused by this fundamental issue is really time consuming. We probably wasted over 100 hours on issues caused by this within the past year. In my opinion, getting rid of it so we can focus on more important things seems to far outweigh any negatives. As you mentioned, we do bulk quantization, so we cannot invest the time to carefully investigate and fix every single one of those rare edge cases. I recommend we proceed with updating to the latest version of our fork and use this method to add all the quants we were unable to provide in the past because of this.

Thanks for explaining it in more detail, but I think that reinforces my unhappiness, as that was basically my understanding. For example:

Given that llama.cpp doesn't even offer an option

"yet", that might or might not change.

bartowski1182 implemented an almost identical solution to solve this issue for his quants

We now have two different quant formats for the same quant.

They are already doing that themselves to handle edge cases like randomly choosing IQ4_NL if the shape is no good for Q2_K

Exactly, that is why we shouldn't invent our own format.

I'd rather properly handle all possible cases of tensors with no imatrix

Me too, but bailing out is one way to handle such possible cases. And it's safer in that it doesn't create a new quant format.

Wouldn't removing those unneeded tensors be much more optimal? It makes the quant file smaller and avoids any issues later on, because llama.cpp will never accidentally use them.

It would require some way of deciding what is unneeded (probably by not converting the tensor(s) in the first place), and would be more acceptable, as it does not create incompatible quants, but simply non-working quants if these tensors are used.

But even then, I think this should be upstreamed.

Both bartowski1182 and I don't trust ikawrakow's solution.

That is why we now have three. That is not the right path to take for mradermacher, IMHO. We should provide llama.cpp quants, not invent our own formats.

ik_llama seems to not be guaranteed to be fully compatible with upstream llama.cpp

These issues will have to be qualified, of course. I will not blindly switch to an incompatible llama implementation :) However, it would still be a more established format than cooking our own.

Please keep in mind that constantly having to deal with those rare edge cases caused by this fundamental issue is really time consuming.

That is a good point, but it is also why we exist in the first place: dealing with these issues so others don't have to. I don't want to lower the quality of what we do; I'd rather not offer those quants.

(In case you want a summary - your arguments support some kind of change, but not necessarily this specific change)

How about we just continue instead of bailing out? Then we don't invent our own quants. Especially for MLA this shouldn't matter as the tensors are unused. Really, the only alternative solution would be to not provide any low bits-per-weight quants for all models using MLA, which seems insane as we are talking about 685B models where the vast majority of users can only run low bits-per-weight quants.

I just realized that our issue seems to be attn_k_b.weight and not attn_kv_b.weight.
Both ik_llama and bartowski1182 just statically quant attn_k_b.weight at Q8.
attn_k_b.weight is insignificant in terms of size.
So maybe that is what we should do as well.

So I just opened our imatrix files inside a text editor and can confirm that attn_kv_b.weight is missing but those tensors are missing from DeepSeek-R1-0528.gguf as well and so no imatrix measurements are required. I can confirm that llama.cpp does not include the non-MLA path inside the GGUF anymore. I can also confirm that we do have imatrix measurements for the much more important attn_v_b tensors and partial measurements for the less important attn_k_b tensors.

From ikawrakow: "On the other hand, *attn_k_b.weight is stored as 128 x 8192 and is then viewed as 128 x 512 x 16 for multiplication with the query Q, so the imatrix data collection functions sees a matrix with just 128 columns, so quite useless to actually guide the quantization process." - so really we just don't have the data to make any useful decisions about this tensors so either just statically quant it with the target quant no matter how low bit per wight it is or always going for Q8 seems like the only reasonable options.

Wow, so I just found this which will properly fix all our issues: "Fix imatrix calculation for MLA models": https://github.com/ikawrakow/ik_llama.cpp/pull/411

How about we just continue instead of bailing out? Then we don't invent our own quants. Especially for MLA this shouldn’t matter as the tensors are unused.

What does continue mean? We currently manually skip the low-bit quants, but that is not what you mean, right?

So I just opened our imatrix files inside a text editor

So... that means the theory that the tensor is stored and not used for inference, but causes problems for quantize, is wrong then?

so really we just don't have the data to make any useful decisions about these tensors, so either just statically quanting them with the target quant no matter how low bits per weight it is, or always going for Q8, seem like the only reasonable options.

I would say the most reasonable choice would be to fix llama.cpp so it either properly converts the dimensions and/or fix imatrix generation to properly generate the weights as required by the model.

In other words, this seems to be simply a bug in llama.cpp, and doctoring around by reducing quant quality doesn't seem like anything more than a temporary hack, i.e. we'd have to redo the work once more once llama.cpp has fixed this.

My other concern is what this patch does to other models. From your description it seems to be rather generic, when the issues can be quite different - for example, the failing qwen a models don't use mla, but would they not also be affected by this patch?

What does continue mean? We currently manually skip the low-bit quants, but that is not what you mean, right?
My other concern is what this patch does to other models. From your description it seems to be rather generic, when the issues can be quite different - for example, the failing qwen a models don't use mla, but would they not also be affected by this patch?

We can just ignore the error that makes it stop if it tries to statically quant with too low bits per weight for attn_k_b, which only exists for MLA.

So... that means the theory that the tensor is stored and not used for inference, but causes problems for quantize, is wrong then?

Which is what I thought at first, until I realized that it is actually the attn_k_b and not the attn_kv_b tensor affected by this issue. attn_k_b is indeed used but tiny in size, which is why all other MLA imatrix quants on HuggingFace just store it in Q8.

I would say the most reasonable choice would be to fix llama.cpp so it either properly converts the dimensions and/or fix imatrix generation to properly generate the weights as required by the model.
In other words, this seems to be simply a bug in llama.cpp, and doctoring around by reducing quant quality doesn't seem like anything more than a temporary hack, i.e. we'd have to redo the work once more once llama.cpp has fixed this.

Which is why I just discovered https://github.com/ikawrakow/ik_llama.cpp/pull/411 so we can just merge that into our fork and then redo all the MLA imatrix computation. Wow, days of compute wasted because of this shit. We probably should cancel r1-1776.

@mradermacher I now properly fixed imatrix calculation for MLA models in https://github.com/nicoboss/llama.cpp/pull/5/files based on https://github.com/ikawrakow/ik_llama.cpp/pull/411. I killed the running r1-1776 imatrix RPC computation as we will have to redo it anyway once we update to a version containing this fix.

I tested and merged the above-mentioned "Fix imatrix calculation for MLA models" fix I implemented in our llama.cpp fork. I also reverted yesterday's "Static quant fallback improvements", which you didn't like and which is no longer needed now that MLA imatrix computation is fixed properly.

@mradermacher Please update to the latest version of our fork and reset all the MLA imatrix computation tasks so we can recompute them with the latest fixes applied. For a first test I recommend starting with DeepSeek-V3-0324-Pruned-Coder-411B, as it is the only one not requiring the slow RPC computation and so is ideal to make sure those changes solve our issue.

Which is why I just discovered https://github.com/ikawrakow/ik_llama.cpp/pull/411

That sounds like a great solution. Maybe the third iteration gives good results? :)

attn_k_b is indeed used but tiny in size which is why all other MLA imatrix quants

If it's tiny, why would anybody even bother to quantise it (maybe it's not that tiny).

We can just ignore the error that makes it stop if it tries to statically quant with too low bits per weight for attn_k_b, which only exists for MLA.

What will the code do if we just continue? This is for other models with imatrix problems, it would no longer apply to MLA models, right?

(updating llama.cpp and nuking repos)

I forgot, I don't think we want to maintain that patch, could you try pushing it into upstream? Shouldn't there be an open issue for that already? What are they doing up there???

ugh, i had a miscommunication between hand and brain, and nuked some non-imatrix repos :/ I wish it were more like the moon and the tides.

and you've seen it here first, wouldn't be surprised if doctoring patches from ik_llama.cpp, which gets even less oversight than llama.cpp, eventually leads to more recalculations.

/me goes into hiding

DeepSeek-V3-0324-Pruned-Coder-411B imatrix will be next

That sounds like a great solution. Maybe the third iteration gives good results? :)

I'm quite confident things will be perfect now but I tend to be too optimistic and usually everything that can go wrong will go wrong.

If it's tiny, why would anybody even bother to quantise it (maybe it's not that tiny).

We are talking about a 128 x 8192 matrix stored once per layer. So 1048576 values per layer, which is around 1 MB/layer in Q8. DeepSeek R1 has 29 layers, so we are talking about 29 MB on a model that is 713.4 GB in Q8. It really doesn't matter much if we store them in the target quant or Q8, but I agree with always striving towards perfection and doing things properly, even if recomputing the imatrix for a few days just to save a few MB seems a bit insane.
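
As a quick sanity check on those numbers (Q8_0 stores 32 weights in 34 bytes, so about 1.06 bytes per weight; the layer count is the figure quoted above):

```python
values_per_layer = 128 * 8192            # attn_k_b.weight elements per layer
bytes_per_weight = 34 / 32               # Q8_0 block: 32 int8 weights + 2-byte scale
mib_per_layer = values_per_layer * bytes_per_weight / 2**20
print(f"{mib_per_layer:.2f} MiB per layer")        # ~1.06 MiB
print(f"{29 * mib_per_layer:.1f} MiB for 29 layers")  # ~31 MiB, negligible vs. 713 GB
```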

What will the code do if we just continue? This is for other models with imatrix problems, it would no longer apply to MLA models, right?

It will just do its job and statically quant the tensor with your target quant. The bailing-out check for low bits per weight is just there to prevent inexperienced users from generating bad quants. As long as you have data for the vast majority of tensors, it is unlikely that there will be much of a quality difference. Skipping this check can be applied only to attn_k_b, so it only affects MLA, or to every model, so it can also be used for models where we lack data for certain experts. I have reverted this change for now, but it is maybe something we need to revisit if we can't improve our imatrix training data enough for Llama-4-Maverick-17B-128E/Llama-4-Maverick-17B-128E-Instruct to work. I don't think we want to just always skip this check, as we currently lack any decent log monitoring solution and so might miss some real issues, but it might make sense to add an environment variable to our llama.cpp fork so we can disable this check after we deem it safe to proceed generating low bits-per-weight quants despite lacking some imatrix data. For Llama-4-Maverick-17B-128E/Llama-4-Maverick-17B-128E-Instruct it will likely be safe to ignore this error, as it only affects the first few layers.

(updating llama.cpp and nuking repos)

Awesome. Thanks a lot. Please bring back the imatrix tasks of DeepSeek-R1-0528 and DeepSeek-V3-abliterated.

I forgot, I don't think we want to maintain that patch, could you try pushing it into upstream? Shouldn't there be an open issue for that already? What are they doing up there???

I don't think that is possible because llama.cpp doesn't want developers to implement anything they have seen on ik_llama.cpp. There are apparently some legal reasons that prevent this despite both projects being MIT licensed. It likely has something to do with who owns which code. I'm not a lawyer, but I saw pull requests being rejected because of this in the past. I think someone who has never seen the ik_llama.cpp code for this needs to independently implement it and create a PR, unless we get written permission from ikawrakow to use his code as inspiration for our PR.

Luckily that patch is really easy to maintain, so I don't mind doing so. It's also cool that this makes us the only ones on HuggingFace with proper GGUF quants for models using MLA.

Shouldn't there be an open issue for that already? What are they doing up there???

They are aware of it and someone even linked https://github.com/ikawrakow/ik_llama.cpp/pull/411, and because nobody is allowed to look at it, someone even summarized the required changes for them, but nobody has done anything to fix this for a month.

ugh, i had a miscommunication between hand and brain, and nuked some non-imatrix repos :/ I wish it were more like the moon and the tides.

No problem. Can happen. Especially with the time pressure of making a decision before llmc audit times out. Just requeue them again.

and you've seen it here first, wouldn't be surprised if doctoring patches from ik_llama.cpp, which gets even less oversight than llama.cpp, eventually leads to more recalculations.

We only have 2 patches from ik_llama.cpp so far: the one that allows storing an imatrix with missing experts as long as a certain threshold is exceeded, and now the MLA imatrix computation fix. I really hope it stays at those two. The ik_llama.cpp code base is quite different from llama.cpp's, so porting over those patches is not easy. I'm also somewhat limited in the amount of testing I can do. So far both of those ik_llama.cpp patches seem safe and unlikely to break anything, but there is always some risk involved, just as there is every time we update to the latest llama.cpp.

/me goes into hiding

No worries I will obviously do some more extensive testing and QA of the quants we create in the next few days.

DeepSeek-V3-0324-Pruned-Coder-411B imatrix will be next

Amazing. That will be a great test. I'm excited to see if that ik_llama.cpp patch actually does what it promised. It seems almost too good to be true that all that was needed was changing a few lines in the imatrix computation.

@mradermacher You accidentally started DeepSeek-V3-0324-Pruned-Coder-411B using the nocuda build of llama.cpp - that took a while for me to notice.

How are they stuck in repo create? The repository already exists?!?

-7776  804 si Llama-4-Maverick-17B-128E-Instruct           error/255 repo create
-1999  216  I Llama-4-Scout-17B-16E-Instruct-abliterated-v2 error/255 repo create

Edit: turns out that it was a HuggingFace issue: HfHubHTTPError('504 Server Error: Gateway Time-out for url: https://huggingface.co/api/repos/create') at /llmjob/share/bin/llmjob line 2889. - an llm audit retry after waiting a while fixed it.

Despite DeepSeek-V3-0324-Pruned-Coder-411B running on CPU, I was able to let it run to the first auto-save before killing it. I have to say I'm quite impressed by ik_llama.cpp, as their fix indeed works! :D

I guess let's just do r1-1776, because DeepSeek-V3-0324-Pruned-Coder-411B tries to run on CPU and the other MLA RPC imatrix tasks are still not back, so it kind of is my only option until you are available again. Luckily, doing r1-1776 now should also work great timing-wise, as I will probably be on a hike until mid-afternoon tomorrow, which should approximately be the time when it completes. I started it. I have to say implementing force-restart-imatrix might be one of the greatest additions ever. It allows me to manage imatrix tasks in quite creative ways and is really powerful together with my ability to kill and pause them. Thanks a lot for adding it. It is working great!

using the nocuda build of llama.cpp

Yeah, i've never un-rpc'd a job before...

I've removed the llama:nocuda from the job and reset the fail state.

implementing force-restart-imatrix might be one of the greatest additions

I would have never predicted that. Of course, it would be even cooler if you wouldn't have to go through a scheduler on another node you don't have access to. Ah well, design decisions.

hike

At this time??? You swiss people are weird.

until mid-afternoon tomorrow which should approximately be the time when it completes

I'm sure it has finished loading by then, yes.

just fyi, I've added an imatrix job for Llama-4-Maverick-17B-128E-Instruct and configured it for rpc

ik_llama.cpp, as their fix indeed works!

If anything, the imatrix guy should be in the best position to tackle imatrix issues like these. But good to know that's actually true for real :)

good night and, uh, nice hiking. Don't fall into a mountain cave or so - I hear sometimes they have surprise tanks inside :)

I only skimmed what's here but did want to quickly point out it's not so much that I don't trust IK's fix, I'm 95% sure it's completely correct, he's clearly very smart especially when it comes to imatrix, but considering the divergence between his branch and mainline I just wasn't positive that it would be 100% compatible, especially with MLA which seems like a can of worms

I may try to mainline my "fix" (which is not a great fix) but last time I tried to mainline llamacpp quantization fixes it was ignored so I'm not highly motivated to make a huge effort especially for these massive models I'm already annoyed at making haha..

@bartowski well, we will find out. It's not the first time we failed with MLA models :/ And yeah, these models are totally disruptive :)

Your experience with llama.cpp is, sadly, normal in my experience as well (if you don't get blamed for the bug...), but it's still the best we have. And the poor llama.cpp maintainers must feel a lot of pressure.

Essentially forking it with fixes, however, is worse for everybody in the long run.

@mradermacher Please delete the existing imatrix and reset DeepSeek-R1-0528 and DeepSeek-V3-abliterated imatrix computation tasks so we can redo them after r1-1776.

Yeah, i've never un-rpc'd a job before...
I've removed the llama:nocuda from the job and reset the fail state.

No problem. Glad it worked.

I would have never predicted that

Giving really powerful tools to someone open to creative solutions will always lead to remarkable results. I can now finally control the order of imatrix tasks by killing all the higher-priority ones I want to skip. I can also decide to only run a single one by pausing and then killing the one I don't want.

Of course, it would be even cooler if you wouldn't have to go through a scheduler on another node you don't have access to. Ah well, design decisions.

It's fine; as long as it works, design decisions don't matter too much. It is actually quite insane how much our tooling improved in the past year. The number of features we got over time is insane. You did such an amazing job developing all of this.

At this time??? You swiss people are weird.

We started in the morning, reached our destination at noon where we ate lunch, and came back home in the afternoon. If you mean the time of year: well, until very recently the mountains were still covered in snow, and now it is finally sunny and hot. While it is almost too hot in the lowlands, the temperature in the mountains is almost perfect for hiking.

good night and, uh, nice hiking. Don't fall into a mountain cave or so - I hear sometimes they have surprise tanks inside :)

Thanks. It was awesome. Realistically it will be surprise fighter airplanes, because unlike others Switzerland has military airports inside mountains. That way drones launched from trucks can't just bomb your airplanes while they are parked in the open. There are military installations everywhere in the Swiss Alps, but no worries, you are not supposed to notice them. They are usually top secret, but I have to say some are quite obvious to spot while others are impossible to notice without insider information. It's quite fascinating how much of the Second World War fortification is still there. But I have to say they don't at all distract from nature and the amazing view.

I'm sure it has finished loading by then, yes.

Haha, that had quite some truth to it. It obviously first failed by somehow choosing a really stupid way of distributing layers to the RPC nodes. When I then woke up at 06:00 I started DeepSeek-V3-0324-Pruned-Coder-411B, which surprisingly took less than 3 hours due to not using RPC. I then finally started r1-1776 again at around 09:30. It loaded really slowly and then somehow decided to do 4 quantization tasks at once, which made it almost run out of RAM on StormPeak and made RPC run even slower than it already is, but it later correctly switched to only running 1 quantization task and is currently going as fast as expected and should hopefully be done slightly after midnight. It was quite fun to manage all of this from my phone while on my hike.

just fyi, I've added an imatrix job for Llama-4-Maverick-17B-128E-Instruct and configured it for rpc

Shouldn't we first find an imatrix training dataset that covers all the tensors before we do this one? I will likely try to find one once we are done with the current imatrix tasks - or did you already find one?

If anything, the imatrix guy should be in the best position to tackle imatrix issues like these. But good to know that's actually true for real :)

I'm very relieved to know the fix worked. We already have multiple successful DeepSeek-V3-0324-Pruned-Coder-411B quants. It seems like the MLA nightmare is finally over. I'm looking forward to testing whether those quants actually work once the r1-1776 imatrix task is done.

I'm 95% sure it's completely correct, he's clearly very smart especially when it comes to imatrix

So far imatrix computation and quantization worked as expected so I can say almost for certain that his fix indeed works. His fix makes a lot of sense from a logical/mathematical perspective.

but considering the divergence between his branch and mainline I just wasn't positive that it would be 100% compatible, especially with MLA which seems like a can of worms

Just using ik_llama.cpp is almost certainly not an option. It is far too different from mainline llama.cpp. This is why I ported his MLA imatrix fix to our fork of llama.cpp, which is an exact copy of llama.cpp with only a handful of fixes and quality-of-life features added. In case you want to merge the MLA imatrix fix, here is the diff: https://github.com/nicoboss/llama.cpp/pull/5/files

I may try to mainline my "fix" (which is not a great fix) but last time I tried to mainline llamacpp quantization fixes it was ignored so I'm not highly motivated to make a huge effort especially for these massive models I'm already annoyed at making haha..

Good luck with that. Our experience trying to contribute to llama.cpp was so terrible that we decided to fork it just so we no longer have to deal with it.

It's not the first time we failed with MLA models :/ And yeah, these models are totally disruptive :)

They are extremely disruptive. Running imatrix computation on such massive models requires all 3 of my servers connected over llama.cpp RPC just to compute them in Q8, which is not only a massive pain to set up but also makes imatrix computation take around 15 hours, during which all 4 of our GPUs and 896 GiB of RAM are occupied by it. During imatrix computation those servers also cannot host any services, like my development environment or the VM I use as my PC that is attached to physical screens. This is especially devastating if the resulting imatrix turns out to not work. In addition, downloading the source model, converting to BF16 and then converting to the source GGUF requires a lot of storage. Waiting half a year for MLA meant storing all MLA models released in the meantime just so we can finally do them now. We should probably also redo all the MLA-capable models we did in the past, which are all the DeepSeek V2 and V3 based models, which seems like a massive pain as well. In any case, I'm honestly just really relieved that we finally found a proper solution to do them all.

@mradermacher I successfully tested DeepSeek-V3-0324-Pruned-Coder-411B and r1-1776. They both worked flawlessly. Not only that, I have to say I'm super impressed by the inference performance I'm now getting thanks to MLA. It feels like the model is running on GPU despite it clearly running on CPU. So it definitely seems to be worth it to requant important models that support MLA.

@mradermacher Can you please imatrix queue DeepSeek-R1-0528 and DeepSeek-V3-abliterated? Regarding Llama-4-Maverick-17B-128E-Instruct, take a look at https://github.com/ggml-org/llama.cpp/issues/12913 - so unless you found imatrix training data that activates all experts, I recommend delaying the imatrix computation for it so I can try the "Syth data generated by actual model at high temp / high rep pen" method that was mentioned.

well, llama-4 had a different issue:

........................./llmjob/llama.cpp-nocuda/ggml/src/ggml-rpc/ggml-rpc.cpp:563: GGML_ASSERT(status) failed

well, llama-4 had a different issue:

........................./llmjob/llama.cpp-nocuda/ggml/src/ggml-rpc/ggml-rpc.cpp:563: GGML_ASSERT(status) failed

Don't worry about it. It is just because I still had the DeepSeek RPC layer distribution settings in place, so the RPC server on CastlePeak crashed with an OOM. Please let's do the DeepSeek MLA models first.

[415176.609571] Out of memory: Killed process 2474276 (rpc-server) total-vm:275836788kB, anon-rss:10579580kB, file-rss:67584kB, shmem-rss:8192kB, UID:100000 pgtables:21200kB oom_score_adj:0

Please delete the existing imatrix and reset DeepSeek-R1-0528 and DeepSeek-V3-abliterated imatrix computation

Were there any?

We started in the morning, reached our destination at noon, where we ate lunch, and came back home in the afternoon. If you mean the time of year: until very recently the mountains were still covered in snow, and now it is finally sunny and hot.

Hmm, I am pretty sure it was 01:00 or so when I replied, and that was maybe 1h after your message. So it was the middle of the night when you said you would now go hiking. But maybe I was wrong.

Giving really powerful tools to someone open for creative solutions will always lead to remarkable results.

That's not how... ugh.... OK.

Realistically it will be surprise fighter airplanes

I was referring to https://www.nzz.ch/maurers_schimmelnde_panzer-ld.989355 but, yeah, wow, that's already 15-year-old news. How did I even remember that.

But yeah, realistically, that's not what you commonly find :)

4 quantization tasks at once, which made it almost run out of RAM on StormPeak and made RPC run even slower than it already is, but it later correctly switched to only running 1 quantization task and is currently going as fast as expected and should hopefully be done slightly after midnight. It was quite fun to manage all of this from my phone while on my hike.

Yeah, because there were 3 background jobs, and they ran out of limits. I interrupted them and reclassified them; that's why only one was running later.

And nico1 did run out of memory, but nothing crashed - merely the imatrix ETA was 60000 minutes.

Our experience trying to contribute to llama.cpp was so terrible that we decided to fork it just so we no longer have to deal with it

It was terrible, but our original motivation was to add some hack patches and some cosmetic details for our own use. Don't demotivate him further :)

@nicoboss DeepSeek-V3-abliterated should be ready to go

@nicoboss DeepSeek-V3-abliterated should be ready to go

Awesome. I will start it as soon as the current normal-sized models have their imatrix computed, to not repeat yesterday's situation where everyone had to wait for over 12 hours.

And yes, we should redo the other mla models. Have a list? And a plan? Want to archive the non-MLA versions somewhere?

@mradermacher XET fixed the upload cache bug a week ago in hf-xet 1.1.3 and we have since gotten confirmation that it is fixed, so we should probably re-enable it and see how it goes: https://github.com/huggingface/xet-core/issues/350

And yes, we should redo the other mla models. Have a list? And a plan? Want to archive the non-MLA versions somewhere?

Can you please rsync over all the imatrix files ever created, and I will check them for all the models important enough that we ended up generating imatrix quants. But in general we need to requant every DeepSeek V2 and DeepSeek V3 based model. The performance difference between MLA and non-MLA is massive, so doing so is probably worth it.

Want to archive the non-MLA versions somewhere?

We for sure want to archive the non-MLA imatrix. The quants themselves are likely not worth it, as there are too many massive models affected by MLA and we don't want to waste HuggingFace storage. I think for the original DeepSeek R1 we should try to preserve it by renaming the repository, then recreating it under the original name and nuking that, so we don't lose a repository of such historic value.

Can you please rsync over all the imatrix files ever

Not quite, but I am syncing the non-archived ones from nico1 over to /apool/imatrix-remote

Will take a bit (~300G).

doing so is probably worth it.

Yup :(

we should try to preserve it by renaming the repository, then recreating it

I think we should/need to create a separate account for that and move such models there. There is probably a single-line llmjob invocation that renames a model, if needed.

Ah yes, while we are chatting, Deepseek-R1 failed:

load_tensors: loading model tensors, this can take a while... (mmap = true)
imatrix failed
STATUS: failure
Terminated

as for xet, i am considering it right now, but in the long term, i think we want a separate cache. i really don't see why we need the cache at all. but if it's just a few mb, procrastination will win.

Not quite, but I am syncing the non-archived ones from nico1 over to /apool/imatrix-remote

Thanks a lot! I assume the archived ones are not currently in use by any model uploaded to HuggingFace, so we probably don't need them to determine whether a model needs to be requanted.

I think we should/need to create a separate account for that and move such models there. There is probably a single-line llmjob invocation that renames a model, if needed.

We could do so. For me the original R1 seems to be the most important to preserve. I can't really think of a reason why anyone would want to run the model without MLA. Some claim MLA made quality worse, but based on my limited testing I didn't notice any quality degradation, only a massive performance improvement. Even with the new quants I think it is still possible to run them without MLA using ik_llama.cpp, but I will test this.

Ah yes, while we are chatting, Deepseek-R1 failed:

DeepSeek-V3-abliterated finished successfully around 1.5 hours ago. I then killed DeepSeek-R1-0528 on purpose so it processes the backlog of normal-sized models and the daily normal-sized models you usually add at around noon. I plan on starting DeepSeek-R1-0528 in the early afternoon so it completes tomorrow morning when I wake up.

I assume the archived ones are not currently in use by any model uploaded

Yes, those are from nuked models, so generally trash. What's also missing from there are early models, from before nico1.

I plan on starting DeepSeek-R1-0528 in the early afternoon

good, good :)

imatrix files are synced over and xet is enabled for the moment.

not sure where i got the 300g from, they're 126GB

Thanks a lot for syncing over the imatrix files. Thanks to them I was able to determine which models we need to requant for MLA.
The following models contain the attn_kv_b layer in their imatrix and so should support MLA but were converted without MLA:

  • DeepSeek-V2-Chat-0628
  • DeepSeek-V2-Lite-XiaAi
  • DeepSeek-Coder-V2-Instruct
  • DeepSeek-V3-Base.Q8_0
  • DeepSeek-V2-Lite-Chat-Uncensored-Unbiased
  • DeepSeek-Coder-V2-Instruct-0724
  • DeepSeek-V2.5-1210
  • DeepSeek-R1.Q8_0
  • DeepSeek-V2
  • DeepSeek-V2-Lite-Chat
  • DeepSeek-Coder-V2-Lite-Base
  • DeepSeek-R1-Pruned-Coder-411B.Q8_0
  • DeepSeek-V3-Pruned-Coder-411B.Q8_0
  • DeepSeek-V2.5-236B
  • DeepSeek-V3.Q8_0
  • DeepSeek-V3-0324.Q8_0
  • DeepSeek-R1-Zero.Q8_0
  • DeepSeek-V2-Lite-Chat-Uncensored
  • DeepSeek-V2-Chat
  • DeepSeek-Coder-V2-Lite-Instruct
  • DeepSeek-Coder-V2-Base
  • DeepSeek-V2.5
  • MiniCPM3-4B
  • PLM-1.8B-Instruct
  • KwaiCoder-DS-V2-Lite-Base
  • whale-v3-base-merged
  • hpc-coder-v2-16b
  • tiny-random-minicpm3

Please keep in mind that I was only able to check models where we did imatrix quants. Models like Zireal-0, from which we only did static quants, cannot be automatically detected using this method. I know that Zireal-0 is the only DeepSeek-V3 based model from which we only did static quants. There is a possibility that there are some DeepSeek-V2 based ones, but even if there are, they were not even important enough for imatrix quants, so they are probably also not important enough to warrant MLA requants.
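For reference, the check itself was nothing fancy. The legacy imatrix.dat format stores tensor names as plain strings, so a crude scan along these lines is all it takes (the directory is just an example of wherever the synced files end up):

# list imatrix files that contain MLA-specific attn_kv_b entries
for f in /apool/imatrix-remote/*.imatrix; do
    grep -qa 'attn_kv_b' "$f" && basename "$f"
done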

Anyway, the situation is aggravated because we still have 1.6T of llama4. At some point, we might have to give up or make executive-level decisions.

Today I tried the synthetic data generation idea from David. I created a 700 KB dataset using the following configuration:

{
    "model": "Llama-4-Maverick-17B-128E-Instruct.Q5_K_M.gguf",
    "prompt": "",
    "max_tokens": 128,
    "temperature": 2.0,
    "top_p": 1,
    "top_k": 1000000,
    "presence_penalty": 1.5,
    "frequency_penalty": 1.5
}

It resulted in the following after imatrix training:

Final estimate: PPL = 3.0097 +/- 0.02713

save_imatrix: entry '             blk.45.ffn_down_exps.weight' has partial data (32.03%)
save_imatrix: 87 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '               blk.45.ffn_up_exps.weight' has partial data (32.03%)
save_imatrix: 87 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '             blk.45.ffn_gate_exps.weight' has partial data (32.03%)
save_imatrix: 87 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '              blk.3.ffn_down_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '              blk.1.ffn_down_exps.weight' has partial data (92.19%)
save_imatrix: 10 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '              blk.1.ffn_gate_exps.weight' has partial data (92.19%)
save_imatrix: 10 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '                blk.1.ffn_up_exps.weight' has partial data (92.19%)
save_imatrix: 10 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '              blk.3.ffn_gate_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '                blk.3.ffn_up_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 423 out of 432 entries

You can download the imatrix training dataset I generated from https://www.nicobosshard.ch/llama4-imatrix-train_v1.txt
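For reference, the generation loop was roughly the following (untested sketch; the server port, sample count and jq post-processing are assumptions, the sampler settings are the ones from the config above):

# repeatedly hit a local llama-server OpenAI-style completions endpoint with
# an empty prompt and very hot sampling, appending each sample to the dataset
for i in $(seq 1 5000); do
    curl -s http://127.0.0.1:8080/v1/completions \
        -H 'Content-Type: application/json' \
        -d '{"model":"Llama-4-Maverick-17B-128E-Instruct.Q5_K_M.gguf","prompt":"","max_tokens":128,"temperature":2.0,"top_p":1,"top_k":1000000,"presence_penalty":1.5,"frequency_penalty":1.5}' \
        | jq -r '.choices[0].text' >> llama4-imatrix-train_v1.txt
done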

I might try some other imatrix datasets, but in the likely case that I don't find any dataset able to cover them well enough to exceed our current threshold of 95% of all experts covered in each tensor, I recommend we temporarily lower this threshold to 30% for this model. Only 3 layers are affected by this, of which only layer 45 has tensors below 90% expert coverage. Those experts in those layers never activating is a good indication that they are not really needed in practice, so I'm perfectly fine with them having no imatrix data.

i've added that imatrix set to our repertoire and will configure the llama jobs to use it (as soon as I find out how to do this again - we surely have a conf variable for it). i assume the lowering of the threshold has to be done in code (maybe we can give it an env variable).

and wow @mla - that's a lot of big models

note to self, the key is "training_data", and the filename must be imatrix-training-.txt, which is a weird thing for me to do, but I guess it was a safety thing (so one cannot specify arbitrary filenames).

in theory, both llama 4 jobs are configured to use it. i am not happy about using a different imatrix data set, but it's likely a very small thing, so let's roll with it.

force-restart-imatrix should now remove the "override" status from a job as well. not sure if it will stay that way. note that this is a different operation than clearing the status, so if the model is both overridden and errored-out, it needs to be forced twice (and will give appropriate messages).

@mradermacher Please do the following to start Llama-4-Maverick RPC imatrix computation:

  1. Update to the latest version of our fork.
  2. Revert the Llama-4-Maverick RPC imatrix computation jobs back to our official i1 imatrix dataset. We do not want to use the imatrix training data I posted yesterday.
  3. Set the REQUIRED_GOOD_EXPERT_PERCENTAGE=5 environment variable when executing llama-imatrix for the Llama-4-Maverick RPC imatrix computation jobs.
  4. Add the --output-frequency 9999 command line argument to llama-imatrix so we don't have any intermediate imatrix saves and know how many experts we covered once it is done. Keep all other arguments as usual for the RPC setup (see the sketch after this list).
  5. Pause all running quantization tasks and wait for them to finish. One running might be fine, but it is a gamble and could very well OOM. Keep in mind that memory will be far tighter than for DeepSeek V3 based models as Llama-4-Maverick is 80 GB larger. I already set an override and interrupt for all massive quantization tasks so you don't have to wait forever for them to pause. Please delete all the quantization job overrides on nico1 if you don't want to do Llama-4-Maverick RPC imatrix computation now.
  6. Start the RPC imatrix tasks. All the RPC servers are ready and all other services are turned off.
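Put together, the invocation should end up looking roughly like this (untested sketch; the binary path, GGUF location and training data filename are assumptions, the rest mirrors the settings above and the usual RPC setup):

REQUIRED_GOOD_EXPERT_PERCENTAGE=5 \
    /llmjob/llama.cpp-nocuda/bin/llama-imatrix \
        -m /tmp/Llama-4-Maverick-17B-128E-Instruct.Q8_0.gguf \
        -f imatrix-training.txt \
        -o Llama-4-Maverick-17B-128E-Instruct.imatrix \
        -ngl 10000 \
        --output-frequency 9999 \
        --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204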

PS: Please fix the Magistral-Small-2506-abliterated imatrix task first if you can. It somehow failed to transfer from rich1 to nico1. Not sure why. Maybe because it tries to hfd from HuggingFace and doesn't use the token I specified?

Now that all other imatrix tasks are done, it would be a great time to start Llama-4-Maverick RPC imatrix computation once you are ready. If you don't want to start it now, we should likely resume quantization tasks on nico1.

That Accurate GGUF VRAM Calculator tool is quite nice: https://huggingface.co/spaces/oobabooga/accurate-gguf-vram-calculator - likely not as accurate as dryrun, but very accurate.

@mradermacher Please update to the latest llama.cpp from our fork for dots.llm1 architecture support. I'm so looking forward to trying this amazing model.

Please fix the Magistral-Small-2506-abliterated

Yes, the token is not available to the imatrix scheduler, so it couldn't download. I think a simple fix would be to somehow pass the nohfdprep flag when automatically passing jobs to the imatrix scheduler. But at the moment, I guess I can do it manually.

I originally thought it became gated between initial hfd and imatrix hfd, but that was another model.

likely not as acurate as dryrun but very accurate.

Well, it has "accurate" in its name.

     "env":{"REQUIRED_GOOD_EXPERT_PERCENTAGE":5},

I think that should do it (not sure I ever tested that). llama is updating and i will configure the jobs. I might not have time to start them. I am not 100% sure propagating env vars to nico1 has been implemented, but it seems so.

I might not have time to start them.

No problem. I will pause the llmjob tasks and then start them on my own once you have properly configured everything.

For reference (it might not be fully meaningful to you :), the jobs now look like this:

      "Llama-4-Maverick-17B-128E" : {
         "created" : 1749808316,
         "env" : {
            "REQUIRED_GOOD_EXPERT_PERCENTAGE" : 5
         },
         "extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 --output-frequency 9999",
         "force" : 1,
         "llama" : "nocuda",
         "name" : "Llama-4-Maverick-17B-128E",
         "ngl" : "10000",
         "nice" : 3,
         "disabled-training_data" : "llama4-imatrix-train_v",
         "path" : "nico1",
         "proto" : "rsync",
         "quant" : "Q8_0",
         "status" : "blocked/override"
      },
      "Llama-4-Maverick-17B-128E-Instruct" : {
         "created" : 1749424396,
         "curphase" : "imatrix",
         "env" : {
            "REQUIRED_GOOD_EXPERT_PERCENTAGE" : 5
         },
         "extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 --output-frequency 9999",
         "force" : 1,
         "gguf_size" : "801469240064",
         "llama" : "nocuda",
         "name" : "Llama-4-Maverick-17B-128E-Instruct",
         "ngl" : "10000",
         "nice" : "2",
         "disabled-training_data" : "llama4-imatrix-train_v",
         "path" : "nico1",
         "proto" : "rsync",
         "quant" : "Q8_0",
         "start" : 1749530127,
         "status" : "error/1"
      }

@nicoboss should all be updated and ready.

Expected VRAM usage: 31366 MiB
Safe estimate: 31943 MiB - 95% chance the VRAM is at most this.

measured:

31908

So, yeah, pretty good. But none of the existing calculators solve my problem - I don't want to know how much memory it uses. I know how much memory I have and want to use what fits. Ideally, llama.cpp should figure it out on its own. There have been attempts to do so, but they, too, fail - instead of being able to tell them how much memory to use, they just fill all available vram. But I want to watch 4K videos and play factorio as well...

DRYRUN gave me 30450MiB btw., so it's actually less accurate. But it adapts to llama.cpp versions, which that formula does not.

Yeah, even that one fails (text-generation-webui):

automatically sets gpu-layers for every GGUF model if you have an NVIDIA GPU (based on the free VRAM reported by nvidia-smi).

programs that automatically fill all available ram are rarely useful.

For reference (it might not be fully meaningful to you :), the jobs now look like this
"disabled-training_data" : "llama4-imatrix-train_v",

@mradermacher We don't want to use llama4-imatrix-train_v for Llama 4 but the normal imatrix training dataset we always use.

Yeah, sorry, I've put a "disabled-" in front of the key to disable it, so it's not used.

Yesterday I had to ikill the static Q5_K_M hfu of dots.llm1.base because it got stuck. How can I make it retry the upload?

you would have to requeue the job, or edit it (llmjob edit - not recommended :) and re-add a "want_static":1, so it would redo it.

Or upload it manually with e.g. hfu dots.llm1.base-GGUF --include "*Q5_K_M*" - that's what I did.

How would a static job get stuck, btw.? If you mean the upload, you should only need to kill the python process to get a retry.

How would a static job get stuck, btw.? If you mean the upload, you should only need to kill the python process to get a retry.

It got stuck during upload. It looks as if we might still be using the broken hf_xet 1.1.2 version according to /llmjob/share/bin/huggingface-cli env on nico1. HuggingFace fixed the XET shared cache limit issue and implemented resuming of failed uploads in hf_xet 1.1.3, and fixed the DNS resolution & network connectivity issues in hf_xet 1.1.4. Please update to hf_xet 1.1.4 and delete the shared cache on all the workers that were still running hf_xet 1.1.2. I was under the impression that we upgraded to hf_xet 1.1.3 when we re-enabled XET, as the reason it was disabled was the shared cache limit issue.

That means 1.1.3 wasn't released at the time you informed me about it, because every update simply reinstalls the whole env from scratch. I don't really want to force versions against either explicit dependencies, or get it from non-standard sources.

I was under the impression that we upgraded to hf_xet 1.1.3 when we reenabled XET as the reason was disabled was the shared cache limit issue.

I don't hand-update environments with manually downloaded versions. When you informed me, I simply wiped and reinstalled the env from the then current version. This will not pick up pre-releases or github-only releases. But indeed, 0.5TB is already used on nico1 once more. Sucks.

Indeed, 1.1.2 is still the latest version, according to pip, or some dependency prevents upgrading.

I've disabled xet again.

Nuking the pip and wheel cache seems to fetch 1.1.4 now. I really don't understand pip. Why would anybody prefer an old version just because it is already downloaded. I'll never understand python philosophy.

I'm running pip3 cache purge now when rebuilding the env. I hope that is enough. Why would anybody design things like this. Frustrating.

Ok, I already did run pip3 cache purge in my script, so that's not it. I guess rm -rf'ing various directories is what you have to do. Effing broken.

Hmm, and now there is another issue. xet is still 1.1.2 on nico1, but 1.1.4 elsewhere. The easy explanation is that everybody gets a copy of the env, but nico1 then gets a bunch of vision/cuda modules installed after getting a copy. So something in this process downgrades xet again. I now manually force-upgrade on nico1, but sheesh they make this hard. Probably goes hand in hand with not testing it more than superficially.

Ah right, I probably have to manually nuke the cache directories on nico1 as well. Holy frigging pile of https://github.com/AasishPokhrel/shit/issues/1
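what i ended up running is roughly this (the cache locations are from memory, so treat them as assumptions):

# purge everything that could resurrect the old wheel, then force the upgrade
pip3 cache purge
rm -rf ~/.cache/pip
# the xet shared cache nico mentioned (assumed default location under HF_HOME)
rm -rf ~/.cache/huggingface/xet
/llmjob/share/python/bin/pip3 install --no-cache-dir --force-reinstall 'hf_xet>=1.1.4'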

also, good catch, nico... :) I would have been vastly more frustrated (if this was possible) to find out on my own because the disk is full.

@mradermacher Please start Llama-4-Maverick-17B-128E and ignore any time-of-day restrictions. It somehow expected Llama-4-Maverick-17B-128E.Q8_0.gguf instead of Llama-4-Maverick-17B-128E.gguf, which I fixed, but for some reason the job disappeared and I was unable to reset this failure.

nico1 ~# llmc force-restart-imatrix Llama-4-Maverick-17B-128E
Llama-4-Maverick-17B-128E: cleared override
pushing...
nico1 ~# llmc status
stat: cannot statx '/tmp/Llama-4-Maverick-17B-128E.Q8_0.gguf': No such file or directory
Llama-4-Maverick-17B-128E: cannot find gguf size
nico1 /tmp# ln -s /dpool/Llama-4-Maverick-17B-128E.gguf /tmp/Llama-4-Maverick-17B-128E.Q8_0.gguf
nico1 /tmp# llmc force-restart-imatrix Llama-4-Maverick-17B-128E
Llama-4-Maverick-17B-128E: can't find slog and status files
pushing...

It somehow started even before the time-of-day deadline (which I had to meet due to it being nice 3):

3  801 Llama-4-Maverick-17B-128E                     run/imatrix (GPU-2d)

I would assume the force flag should override any time-of-day concerns.

It expected the Q8_0 because the job had a "quant": "Q8_0". All these big models are confusing to me :)

@RichardErkhov am I still limited to 1TB on rich1? I'd probably have to limit rich1 to small models if it's going to be permanent.

@mradermacher Llama-4-Maverick-17B-128E imatrix computation completed around an hour ago, but it did not upload the computed imatrix and will not generate most imatrix quants, as https://huggingface.co/mradermacher/Llama-4-Maverick-17B-128E-i1-GGUF was never nuked. For Llama-4-Maverick-17B-128E-Instruct you first nuked both repositories. I personally don't think it makes sense to redo the static quants (no idea why we nuked them for Llama-4-Maverick-17B-128E-Instruct), but it probably would make sense to nuke the Llama-4-Maverick-17B-128E imatrix quants, upload the latest imatrix and then redo all imatrix quants. I killed and reset it as it started on quant 8 out of 12. I know I could just nukerepo it myself, but because this is a quite unique situation, I'm awaiting your judgement about what to do. If you decide to nuke, we likely need to manually upload the new imatrix.

Is it correct that we currently don't keep imatrix computation logs anywhere unless it fails? It would have been interesting to check how many tensors got data.

even with xet 1.1.4 we already have double-digit GB caches again. if it grows further, i'll disable xet until we have per-upload caches.

Llama-4

Well, the good news is that the imatrix was downloaded, so it's safe. Indeed all you'd need to do is to nuke the repository and restart the job. The imatrix will be uploaded automatically; each time a quantize job starts it checks from scratch.

Currently it is running, though, I assume it started again and will kill/nuke/restart.

Is it correct that we currently don't keep imatrix computation logs anywhere unless it fails?

Yes. I assumed that you do as you always do and hardlink the log file as it unfolds...

We could move the logfiles to some dir and clean it up after a week or so, as with the uploads.

When in doubt, you can even delete the *-i1-GGUF/imatrix.dat file after killing the job. The job will then go into blocked/imatrix until it gets its imatrix file resupplied by kaos, which is the next time the scheduler runs (e.g. llmc push).

It's restarted, and all should be fine once it really starts (currently waiting for diskspace budget).

Well, the good news is that the imatrix was downloaded, so it's safe. Indeed all you'd need to do is to nuke the repository and restart the job. The imatrix will be uploaded automatically, each time a quantize job starts it checks form scratch.
It's restarted, and all should be fine once it really starts (currently waiting for diskspace budget).

I can confirm that the Llama-4-Maverick-17B-128E imatrix/weighted quants are nuked now. It currently runs. I wonder why it shows run/imatrix 8/24 instead of run/imatrix 1/24:

        -3998  804  I Llama-4-Maverick-17B-128E                    run/imatrix 8/24,IQ2_M [129/531]

Or is it normal that it starts with 8 out of 24? I never really looked at the order in which it usually does quants.

Currently it is running, though, I assume it started again and will kill/nuke/restart.

Great. Sorry for this mistake. I removed all overrides and probably forgot to keep this one blocked. kill/nuke/restart is exactly the way to go.

even with xet 1.1.4 we already have double-digit caches again. if it grows further, i'll disable xet until we have per-upload caches.

Low double digit is expected. The default shared cache limit is 16 GB.

Yes. I assumed that you do as you always do and hardlink the log file as it unfolds...

Which is what I intended to do but then it completed much faster than anticipated. It completed minutes before I planned on creating the hardlink. Next time I will create one immediately after starting it.

We could move the logfiles to some dir and clean it up after a week or so, as with the uploads.

That would be amazing.

Or is it normal that it starts with 8 out of 24? I never really looked at the order in which it usually does quants.

No, it would never start with 8, but if the previous quants are skipped (e.g. because they exist), you might not see any previous status strings. But now it failed, and I don't know why. I suspect I forgot to delete the status and log files. I've restarted it properly now.

Low double digit is expected. The default shared cache limit is 16 GB.

The cache size is explicitly set to zero, so I expect 0, not 27GB. It's clearly growing :(

We could move the logfiles to some dir and clean it up after a week or so, as with the uploads.

Untested, but see /tmp/imatrix-log from now on.
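the week-based cleanup will probably end up being a cron one-liner along these lines (also untested):

# drop imatrix logs older than a week, as with the upload directories
find /tmp/imatrix-log -type f -mtime +7 -delete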

btw., we don't keep logs even of failures in general. the only time logs are retained is when a job is "nuke"'d either via llmjob/llmc, or imatrixjob-remote (the hack that does the imatrix scheduling on kaos). it's only a slight difference in practice.

/tmp/imatrix-log seems to work (example: Josiefied-Qwen3-30B-A3B-abliterated-v2):

save_imatrix: entry '             blk.46.ffn_down_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data - 121 out of 122 required
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '               blk.46.ffn_up_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data - 121 out of 122 required
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '             blk.46.ffn_gate_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data - 121 out of 122 required
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 381 out of 384 entries

/tmp/imatrix-log seems to work (example: Josiefied-Qwen3-30B-A3B-abliterated-v2):

Wow amazing. Thanks a lot for implementing this so quickly.

save_imatrix: entry ' blk.46.ffn_down_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data - 121 out of 122 required
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.46.ffn_up_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data - 121 out of 122 required
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.46.ffn_gate_exps.weight' has partial data (94.53%)
save_imatrix: 7 out of 128 experts are missing data - 121 out of 122 required
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 381 out of 384 entries

I feel we should lower the default threshold to 90%. It seems quite stupid that we don't store the imatrix for the entire tensor if we have 94.53% expert coverage. This would fix all Qwen3-30B-A3B based models and, if we decide to requant, improve the quality of the current Qwen3-30B-A3B imatrix quants, as entire tensors being static is obviously worse than just having one expert in a tensor being static.

I updated our llama.cpp fork to the latest llama.cpp and changed the default value of REQUIRED_GOOD_EXPERT_PERCENTAGE from 95 to 90, as it seems more reasonable and fixes any issues with Qwen3-30B-A3B based models. It might make sense to recompute the imatrix and requant the weighted/imatrix quants of popular existing Qwen3-30B-A3B based models so we have all quants and no missing tensors. I'm also wondering if it would make sense to always only save the imatrix at the very end using --output-frequency 9999 so we get a detailed report on how many experts are missing in the final imatrix.
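Conceptually the threshold check boils down to something like this (just a sketch with made-up names, not the actual code in the fork):

#include <cstdlib>

// store a tensor's imatrix entry only if enough of its experts received data;
// the threshold is taken from the environment, defaulting to the new 90%
static bool keep_imatrix_entry(int n_experts_with_data, int n_experts) {
    const char * s = std::getenv("REQUIRED_GOOD_EXPERT_PERCENTAGE");
    const double required = s ? std::atof(s) : 90.0; // previously 95
    return 100.0 * n_experts_with_data / n_experts >= required;
}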

Important updates in latest llama.cpp:

  • They made duplicate key names inside add_key_value a warning, so I was able to remove our changes that suppressed this error.
  • Support for the following new architectures:
    • ArceeForCausalLM (Arcee AI's upcoming AFM model)
    • NeoBERT
    • NeoBERTLMHead
    • NeoBERTForSequenceClassification
  • llama.cpp support for IBM Z & LinuxONE mainframes got polished (documentation, SIMD for s390x)
  • Rework embeddings logic
  • Making SentencePiece optional due to the project being abandoned but still required for hf_to_gguf

I updated our llama.cpp fork again, this time to include https://github.com/ggml-org/llama.cpp/pull/14311, which fixes the Llama 4 mmproj extraction so we can finally provide the vision capability. For some reason both https://huggingface.co/mradermacher/Llama-4-Maverick-17B-128E-GGUF and https://huggingface.co/mradermacher/Llama-4-Maverick-17B-128E-Instruct-GGUF seem to be missing the mmproj file. Did we ever mark Llama 4 as a vision model?

It seems quite stupid that we don't store the imatrix for the entire tensor if we have 94.53% expert coverage.

I agree, although it's a slippery slope. I am not sure coverage is all that important, though - even 50% coverage should be fine, if we choose suitable default values for the missing ones. Hope that's the case.

Did we ever mark Llama 4 as vision model?

Not sure, but there is Llama4ForConditionalGeneration - can't access the repo, so don't know what they use. But if it's marked, and it fails, it would not have quantised. (the list is in is_vision_arch in llmjob.pm)

fixes any issues with Qwen3-30B-A3B

Surprisingly, nobody has complained (that e.g. all the low-bit quants are also missing for qwen-a models). It might also fix similar issues with older qwen-a models.

--output-frequency 9999

Well, I would prefer fixing the implementation to not patch the imatrix in-place and then re-use it. Then we would get these messages every time. But so far you have not reacted to any attempt of mine to discuss this issue :)

However, it was clearly the intent to not save anymore - there is even a comment in the script that I removed the "-ofreq 10". Too bad that's actually the default (and I wonder if that has changed).

In any case, we now have an "-ofreq 55555". I'd be slightly more happy with -ofreq 0, if that disables it?

They made duplicate key name inside add_key_value a warning so I was able to remove ouer changes to suppress this error.

oooh :)

llama.cpp support for IBM Z & LinuxONE mainframes got polished

That's great... but... what makes it so important?

Making SentencePiece optional due to the project being abandoned but still required for hf_to_gguf

Hmm... does this mean the python package is not pulled in anymore by the requirements.txt?

llama has been updated

today, i just wanted to make sure that patchreadme (our readme generator/updater) is still running, and I saw this - 6PB.

total TB 6000.162, uploads 60
Sat 21 Jun 2025 03:23:58 CEST

wow :)

I agree, although it's a slippery slope.

I agree. Ideally we would obviously cover them all but doing so is close to impossible or for some models likely even completely impossible.

I am not sure coverage is all that important, though - even 50% coverage should be fine, if we chose suitable default values for the missing ones. Hope that's the case.

I would argue that if we never manage to activate a specific expert in a specific tensor even once, it is unlikely for a regular user to ever activate it. And even if our users activate it, it is likely only for a few tokens at most, and even then it would not perform worse than when statically quantized.

Maybe we should make the default threshold 74% so it works if we have at least 3 out of 4 experts covered.

Not sure, but there is Llama4ForConditionalGeneration - can't access the repo, so don't know what they use. But if it's marked, and it fails, it would not have quantised. (the list is in is_vision_arch in llmjob.pm)

It uses Llama4ForConditionalGeneration. I think I know what happened. It's because I probably manually provided GGUFs, because the model is gated and was too massive to be handled automatically with the backlog of history models at the time. Can I just copy the source model, which I still have locally, to /tmp/quant and requeue it to generate the mmproj files? Ideally it would do so without also generating the GGUF, but even if it does, storage is currently not that big of an issue.

Surprisingly, nobody has complained (that e.g. all the low-bit quants are also missing for qwen-a models).

I think we should at least requant some popular ones. We only want to redo imatrix quants as the static ones are fine. We probably want to redo all imatrix quants as even the ones that were not skipped are of lower quality due to having entire static tensors inside. Is there a way for me to nuke the imatrix repo in a way that the imatrix itself gets recomputed if queued again? If I just nukerepo it I believe it would reuse the existing imatrix.

It might also fix similar issues with older qwen-a models.

It for sure will.

Well, I would prefer fixing the implementation to not patch the imatrix in-place and then re-use it. Then we would get these messages every time. But so far you have not reacted to any attempt of mine to discuss this issue :)

The intermediate save implementation is quite terrible. I'm not even sure if the intermediate saves could affect the final result, but they for sure hide the warning about missing tensors for any later saves, making it impossible to tell how well the final imatrix turned out and whether it still contains any tensors with uncovered experts. I might one day look into it, but there are currently more urgent things on my to-do list, like finding a way to get a decent version number into the metadata.

However, it was clearly the intent to not save anymore - there is even a comment in the script that I removed the "-ofreq 10". Too bad that's actually the default (and I wonder if that has changed).

I remembered so as well. Intermediate saves are completely useless for us as we will never use them anyway.

In any case, we now have an "-ofreq 55555". I'd be slightly more happy with -ofreq 0, if that disables it?

It is supposed to disable it, but based on the code it looks like ofreq=0 would happily compute mod 0 and crash - feel free to test it:

if (m_last_call % m_params.n_out_freq == 0) {
    save_imatrix();
}
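A guarded variant would presumably have to look something like this (sketch only, not what is currently in the code):

// skip intermediate saves entirely when the frequency is 0 instead of
// evaluating "% 0", which is undefined behaviour
if (m_params.n_out_freq > 0 && m_last_call % m_params.n_out_freq == 0) {
    save_imatrix();
}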

That's great... but... what makes it so important?

IBM Z & LinuxONE mainframes are the only big-endian devices supported by llama.cpp as far as I'm aware. We only create little-endian GGUFs, so they rely on the big-endian compatibility layer to convert our GGUFs to big-endian on the fly. This means we need to be careful not to do anything stupid that would break their endianness conversion layer. This is mainly important for me, as the one who maintains our llama.cpp fork, since I don't do any testing on big-endian devices and so have to be careful should I ever need to change something that could affect them.

Hmm... does this means the python package is not pulled in anymore by the requirements.txt?

So far they only made it optional inside pyproject.toml and vocab.py, so for now it doesn't affect us, but it is something to be aware of in case they decide to remove it from other components. I really don't see how they could get rid of it from convert_lora_to_gguf.py as it is quite essential for many models.

llama has been updated

Awesome. Thanks a lot.

today, i just wanted to make sure that patchreadme (our readme generator/updater) is still running, and I saw this - 6PB.
total TB 6000.162, uploads 60
Sat 21 Jun 2025 03:23:58 CEST

6 PB is so cool and insane. As some quick stats from my side I uploaded 960 TB since I last rebooted Threadripper 112 days ago.

It's because I probably manually provided GGUFs

Ah yes, exactly. mmproj extraction is a conversion step (for deep reasons).

Can I just copy the source model which I still have locally to /tmp/quant and requeue it to generate the mmproj files?

That should work, and unfortunately, it will generate the gguf, again, for somewhat deep reasons.

Is there a way for me to nuke the imatrix repo in a way that the imatrix itself gets recomputed if queued again?

Not yet. There is no 1:1 correspondence between imatrix files and the repo name. nukeall (not sure if exposed via llmc) uses a glob expression that I could use, in, say, a nukeimatrix command. I'll look into it later.

but they for sure hide the warning about missing tensors

Are you sure? I thought that was a result of your patch that fixes the missing "imatrix" weights. I think I remember that the warnings did appear on every save before. It's indeed not urgent to look into as we don't do intermediate saves.

Intermediate safes are completely useless for us as we will never use them anyways.

Indeed, but I did use it in the past when llama-imatrix would crash on every second model at some point :)

code it looks like ofreq=0 would happily compute mod 0 and crash but feel free to test it:

Classic :) Nope, why gamble the astronomically unlikely odds :)

As some quick stats from my side I uploaded 960 TB since I last rebooted Threadripper 112 days ago.

whoa! and that's during "off-season" (last month nico1 uploaded 166TB, and this month vnstat estimates 140TB).

Regarding https://huggingface.co/mradermacher/Mixtral-8x7B-Instruct-v0.1-i1-GGUF/discussions/1

I remember the 8x22bs had to be redone due to incompatibilities - do you think this is the same problem? I am pretty sure this worked back then.

There is now a new "llmc nukeimatrix model..." command. It more or less does this:

mv -v --backup=numbered /root/imatrix{,-remote}/$model{,.[iQ]*}.imatrix /root/imatrix-archive/.

You should be able to see if it succeeds. And there might be a way that it moves the wrong imatrix file (globs are so limited), but it's what "nukeall" also uses. Untested. Of course. The imatrix and imatrix-remote (nico1) dirs are where llmjob looks for them, while imatrix-archive has any old ones that should not be visible anymore.

hmm, it seems rich1 memory has silently been reduced to 44gb? i'll reduce the number of jobs to one, because i get memory allocation failures.

maybe I overlooked it, but I can't seem to see where this has been mentioned. it's fine to do it, but silently doing it is a really really really bad idea.

llmc audit broke because of rich1:

nico1 ~# llmc audit
llmjob worker nico2 disabled, skipping.
poll: protocol failure in circuit setup
rich1: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at /llmjob/share/bin/llmjob line 630

That should work, and unfortunately, it will generate the gguf, again, for somewhat deep reasons.

Cool. In the meantime I generated the missing mmproj files for the Llama 4 Mavericks.

There is now a new "llmc nukeimatrix model..." command. It more or less does this:

Awesome! Thanks a lot for implementing this!

I remember the 8x22bs had to be redone due to incompatibilities, you think this is the same problem? I am pretty sure this worked back then.

Yes, exactly. I forget exactly why, but at some point llama.cpp broke backwards compatibility with some old MoE models. I see you already requantized it, so thanks to dryrun we know the new one will almost certainly work.

hmm, it seems rich1 memory has silently been reduced to 44gb?

No worries, he will probably increase it again after the next server reboot. He is currently having some memory issues with his server. Basically, many of his services keep getting OOM killed because he accidentally assigned half of his RAM as ARC cache. He thought reducing RAM was fine because he added enough swap to compensate for the RAM he removed. He is currently traveling around the world, so it's hard to say when he will have time to take care of this issue. I would assume soon. Until then, just run only one concurrent task.

llmc audit broke again because of rich1:

nico1 ~# llmc audit
llmjob worker nico2 disabled, skipping.
poll: protocol failure in circuit setup
rich1: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at /llmjob/share/bin/llmjob line 630.

Hopefully rich1 gets more stable again once Richard reboots the server today or more likely tomorrow.

I started requanting whale-v3-base-merged as the first of the old MLA models. It surprisingly failed with error/1 TypeError not a string according to the status page. I will investigate once llmc audit is working again.

Richard finally had time to reboot rich1. It now again has 56 GiB of RAM and there should no longer be any OOM kills, as we reduced the ZFS ARC cache to 1 GiB. The reboot also fixed the above-mentioned llmc audit issue.

I also finally was able to see the whale-v3-base-merged issue and it is quite a strange one:

WARNING:gguf.gguf_writer:Duplicated key name 'deepseek2.attention.key_length', overwriting it with new value 576 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'deepseek2.attention.value_length', overwriting it with new value 512 of type UINT32
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
Traceback (most recent call last):
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 6560, in <module>
    main()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 6554, in main
    model_instance.write()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 404, in write
    self.prepare_metadata(vocab_only=False)
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 517, in prepare_metadata
    self.set_vocab()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5168, in set_vocab
    self._set_vocab_gpt2()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 838, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 603, in get_vocab_base
    tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 1032, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2025, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2063, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2278, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 198, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/llmjob/share/python/lib/python3.11/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: not a string

Yeah, if only python tracebacks included the actual values. Clearly files were missing that are in the repository - did you delete any to try to fix this? I've forced a (R)edownload and restarted it.

Nope, missing files were the problem. No clue how huggingface managed to download it without errors while leaving out files.

The issue with less ram on rich1 is not that there is less ram, but that it broke stuff and I had to debug it, not knowing what the heck is going on.

It probably makes a lot of sense to only use one job on rich1 permanently, thus freeing resources for other things, as we probably don't need more than one job at the moment.

I just updated our llama.cpp fork adding support for the following 3 architectures:

done

@nicoboss am I blind or did we never make quants for all llama-3 and 3.1?

@nicoboss am I blind or did we never make quants for all llama-3 and 3.1?

We probably did. I have them all locally and so could easily copy the base models to your container if needed. Alternatively, our friend requested access to all of them, so just use his token. I think the issue was you searching for Llama instead of Meta-Llama, as we seem to have them under that name, for example https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-i1-GGUF

@nicoboss what are your thoughts about https://huggingface.co/bullerwins/DeepSeek-TNG-R1T2-Chimera-BF16 ?

Sure, let's do it. I see you already downloaded it and are currently out of quota. I recommend you just source-GGUF the model to one of the many storage pools attached to your container, most of which currently have plenty of storage for you to use. I updated the RPC setup to the same version you currently use and will start the RPC servers for this one as soon as you are ready.

Sure lets do it. I see you already downloaded it and are currently out of quota.

No, the quota currently prevents it, because I'd need (potentially) 3.7TB of space. I will force it and see where we go.

I noticed I only queued it statically (because that was my original plan). I assume Q8_0 fits nicely?

I configured it like this; it's in override state, you can force-enable it with force-restart-imatrix

     "extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204",
     "force" : 1,
     "llama" : "nocuda",
     "ngl" : "10000",
     "quant" : "Q8_0",

wow, job well done :) already generating imatrix quants

@mradermacher Please set up and start imatrix RPC for DeepSeek-V3-0324 once you are ready. I cloned the non-MLA quants to my own account before requanting it. All the RPC servers are ready.

started!

Sorry, we forgot about Q8. Please reconfigure the job to use /tmp/DeepSeek-V3-0324.Q8_0.gguf and I will restart the imatrix RPC job once the Q8 quant is generated. If you can't reconfigure, I will just softlink. It is the same Q8 we currently generate for static quants, so we can check the status page to see when it is done.

Some updates:

  • DeepSeek-V3-0324 RPC imatrix computation completed successfully
  • UniReason-Qwen3-14B-no-think-SFT stuck at hfd on marco

Everything regarding llama-imatrix is about to change very soon:

  • The legacy imatrix.dat file will get replaced by imatrix.gguf, which uses a completely new file format
  • Official support for 3D tensors, so our MLA patch will no longer be needed
  • Official support for tensors with partial data, so our MoE fix will no longer be needed
  • Imatrix computation with a higher batch size, heavily improving imatrix computation speed when offloaded to SSD/RAM/RPC; it is now potentially much faster than prompt processing, but comes at the cost of no pipeline parallelism
  • The author of the PR is aware of our fork and looked at it.
  • An insane amount of other breaking imatrix changes, so please take a look at the original PR at https://github.com/ggml-org/llama.cpp/pull/9400

On an unrelated note, llama.cpp removed the entire kompute backend. Luckily this doesn't affect us.

UniReason-Qwen3-14B-no-think-SFT stuck at hfd on marco

python happily waiting for more data on a closed connection...

The legacy imatrix.dat file will get replaced by imatrix.gguf, which uses a completely new file format

I assume that means we can throw away all our work, too? regardless, doesn't strike me as the change we were urgently waiting for.

imatrix computation with a higher batch size, heavily improving imatrix computation speed

It was an improvement before (not a heavy one, though), but the reason it wasn't done is because the last result was that it decreases quality.

speed offloaded to SSD

what the fuck? wow!

Insane amount of other breaking imatrix changes

Will have a look and report back. I'm already scared.

There are multiple problems with imatrix which this is addressing:
Non-deterministic tensor order depending on unordered_map iteration order (makes sha256sum useless to compare imatrix files made on the same dataset)

Yes. I always calculate imatrix files twice just to see if the checksum matches. This looks like a solution looking for a problem to solve.

Well, we can't stop "progress", so we'd better get used to it.

Can't use bigger batch size than chunk size

So probably my concerns with larger chunk sizes are irrelevant.

Overall, not too many changes it seems. Maybe there aren't even user-visible changes, other than maybe we have to stop the world completely because I'd guess the new quantize can't read old imatrix files, so we first have to finish all quantizes with old imatrix files before we can switch. Should fortunately be an easy option at the moment, with practically no current releases anymore.

./requirements/requirements-convert_legacy_imatrix_to_gguf.txt

No issue at all

As a stopgap measure, I've implemented a link from the imatrix repo to the static repo for vision models, and also added a link to our overview page for new READMEs.

Example: https://huggingface.co/mradermacher/Qwen2.5-VL-32B-Instruct-abliterated-i1-GGUF

Pages will be updated over the next few days.

This implements the versioning logic, and also the database/caching logic for all the upstream metadata, but hopefully in a limited enough way for it not to explode. This will make the real change easier.

Ah, and the fact that new mmproj files weren't listed in the table was actually a bug.

and llama.cpp has been updated, too.

The ssh forwarding doesn't seem to work anymore:

ssh: connect to host castle.nico.re port 2108: Connection timed out

also, more importantly:

ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected. nvidia-smi does see both cards though.

Update: llama-imatrix sees only device 18*. I will pause the other one for the time being.

Update: found out how to pause it. The pause can be undone with llmc flags pause.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-

because I'd guess the new quantize can't read old imatrix files,

@mradermacher actually, the old format can still be read by llama-quantize in https://github.com/ggml-org/llama.cpp/pull/9400.

There's even support for bi-directional conversion of the imatrix files, by not specifying a dataset and using ./bin/llama-imatrix -m /path/to/model.gguf --in-file some.imatrix -o some-converted-imatrix.gguf. Usually conversion is not necessary (since they can still be read directly).

Old imatrix files can still be produced even after the PR is merged, the GGUF-based format is only used when the output filename ends with .gguf.

@compilade not sure why you picked an old message to reply to - we are already aware, also that it can be converted.

Old imatrix files can still be produced even after the PR is merged, the GGUF-based format is only used when the output filename ends with .gguf.

I didn't know that, though. It probably won't affect us, but this kind of shockingly bad design is going to bite somebody for sure. Sigh.

Update: Ok, shockingly bad is too harsh for this, specifically, but we have been bitten many times in the past by magic-argument-guessing problems in llama tools. If there are choices, they should make it a switch and not magically guess from filenames, which are arbitrary user-defined strings.

Update 2: especially if the new format actually changes interpretation of the imatrix data, because now we are forced to rename all imatrix files to .gguf. won-der-ful.

@nicoboss I've further reduced the model size limit on rich1 to 50B due to lack of diskspace.

this kind of shockingly bad design is going to bite somebody for sure. Sigh.

@mradermacher That was mostly done in an attempt to address concerns by ikawrakow who has scripts which read the legacy format directly.

Not sure how to make that less confusing while still allowing conversion to the old format.

EDIT:

If there are choices, they should make it a switch and not magically guess from filenames, which are arbitrary user-defined strings.

I can change that. Do you have a suggestion for the name of the switch?
(HuggingFace cannot guess that a file is GGUF when it doesn't have the .gguf extension, though, so for the best user experience, using .gguf for imatrix files in the GGUF format will be necessary anyway)

@mradermacher That was mostly done in an attempt to address concerns by ikawrakow who has scripts which read the legacy format directly.

But of course ikawrakow's concerns are not solved by this - anything that allows both formats would allow this, even a sane solution, so I don't think this really addresses his concerns specifically.

Not sure how to make that less confusing while still allowing conversion to the old format.

I am not an expert in llama switch design, but "-of gguf/imatrix" or the like, as practically any other program in this world handles file formats (well, at least those who have backwards compatibility concerns).

As I said, it's not the worst misdesign. Things like quantize insisting that filenames must not start with 0x or so are far worse.

It might be overall less confusing to always output gguf, and simply have a converter command for those (very few!) who need the old format, or those (more common) who want the new format and only have the old one (and only need a converter anyway). That breaks backwards compatibility, but in a clean way, without people far in the future having to worry about the old format all the time. The current design is so bad because it both breaks backwards compatibility and carries legacy concerns into the indefinite future, as when people wonder why they silently get worse quality or quantisation fails because they didn't name their file .gguf in years to come.

(HuggingFace cannot guess that a file is GGUF when it doesn't have the .gguf extension, though, so for the best user experience, using .gguf for imatrix files in the GGUF format will be necessary anyway)

That's the same illogical argument as with ikawrakow earlier - just because a bad design can solve a problem does not mean that the problem must be solved by bad design. Just because all humans are mammals does not mean all mammals are human. Just because ikawrakow needs the old format does not mean everybody should have to work around his problem.

Having said that, I am happy that llama.cpp didn't simply invalidate all existing imatrix files, as usually happens.

@compilade actually ikawrakow's concerns were not backwards compatibility but code bloat and needless complication. His preferred solution would be to not have imatrix files as gguf anyway (a wasteful and complicated ad-hoc file format).

(i'm not ikawrakow btw..)

@nicoboss ... and I just found out that a seemingly innocuous bugfix caused thousands of model page updates. that, even worse, might need to be undone.

bleh.

@mradermacher (sorry for the long answer)

I am not an expert in llama switch design, but "-of gguf/imatrix"

I'm not an expert in that either. -of <type> seems good. Or maybe -ofmt to match with the style of -ofreq. What I'm not sure about with such a flag (and partly why I didn't go with that approach (at least for now)) is whether or not it would cause more confusion than the file suffix.

For a proper format flag, the legacy format should have a name, but I'm not sure what it should be. Naming it imatrix could potentially confuse new users into using it, legacy doesn't feel satisfying, and dat isn't descriptive enough (although it may be appropriate).

The harder requirement to fulfill with a format flag is to not break ikawrakow's scripts. That might prevent using GGUF by default, since that could break his scripts until he uses the format flag.

A lot of other projects use the filename extension for the format of the output file, like ffmpeg, ImageMagick, etc.

Some use a format flag (e.g. pandoc), but also guess from the filename extension when no flag is provided.

It might be overall less confusing to always output gguf, and simply have a converter command for those (very few!) who need the old format

The main use case I was thinking of when adding back support for the old format (because I did make it use only GGUF at some point) was any preexisting scripts generating and reading/parsing imatrix files directly, without passing through llama-quantize, but still using a recent version of llama-imatrix.

I realize that may not be a very popular use case.

I think the use case you're most interested in is between llama-imatrix and llama-quantize, where the actual format or compatibility across versions isn't particularly important (apart maybe from reading because quantizing again can be useful after some time).

Just because ikawrakow needs the old format does not mean everybody should have to work around his problem.

Agreed. Reading the legacy format will be useful for most, but writing, maybe not so much.

actually ikawrakow's concerns were not backwards compatibility but code bloat and needless complication

You're right, it does sound like that. Generating the imatrix files certainly requires (the "bloated") libggml.so anyway to run inference, which makes his concerns about bloat confusing. But in the context of reading imatrix files, it makes sense, and I think the compromise of allowing to continue generating/converting to legacy imatrix.dat files somewhat addresses that.

The current design is so bad because it both breaks backwards compatibility and carries legacy concerns into the indefinite future, as when people wonder why they silently get worse quality

It doesn't break backwards compatibility, quite the opposite, you're complaining that it's too backwards compatible. Newly generated legacy-format imatrix files can still be read by older llama-quantize versions. However, those who use the default output filename will get imatrix.gguf instead of imatrix.dat, and those using a specific file name will need to opt-in to using the GGUF format (by using the appropriate .gguf filename extension) for imatrix at their own pace. It's certainly not as much a breaking change as silently replacing the format would be. Deprecation of the legacy format will come later, it doesn't have to be all at once.

I think what bothers you is that the new format is opt-in, while you would prefer it to be opt-out (since it's strictly better than the old format (because you're not ikawrakow)).

Currently, the only difference in quality is when there is partial data for MoE tensors.
This was already broken on master, and can't be fixed satisfyingly because the legacy format is not extensible enough (the workaround of adding fake 1 values would work, but would not allow merging properly (you know what, since the current behavior of dropping the data also makes merges weird, I guess the better solution is to add those fake 1. I will change this.)).

The change from that last parenthesis would make the old format completely equivalent to the new format in quality (for now), and the remaining advantages of GGUF-based imatrix files would be extensibility, the ability to sanely merge multiple imatrix.gguf files using different chunk sizes and/or with MoE tensors, and interoperability. (with gguf-dump, the GGUF previews on HF, etc.)

Having said that, I am happy that llama.cpp didn't simply invalidate all existing imatrix files, as usually happens.

Extensibility is one of the major things storing imatrix files with the old format does not allow. It's not versioned, the file type can't be uniquely identified from the content (no magic first bytes), stacked MoE support is a workaround, the chunk size isn't stored, but it's used to pre-scale the values, the number of chunks is stored per tensor instead of storing the number of tokens per 2D slice, there's no optional fields except if they're at the end, new fields can't be added unless they're at the end, types of the fields cannot be changed in any backwards compatible way, extra tensor data (e.g. the sums of activations (in addition to the currently-stored sums of squared activations)) cannot be added cleanly without breaking older files, etc. All of that won't be a problem anymore with the GGUF-based imatrix format.

A lot of other projects use the filename extension for the format of the output file, like ffmpeg, ImageMagick, etc.

Yes, but all of them have a way to override it, and none of these have backwards compatibility problems.

The harder requirement to fulfill with a format flag is to not break ikawrakow's scripts.

ikawrakow's concerns were that there is no old style imatrix output available. The requirement to make life difficult for everybody else so ikawrakow has to change nothing is not something he asked for, afaics.

His requirement can be fulfilled the way I outlined it, without creating action-at-a-distance hell for every other user forever in the future.

It doesn't break backwards compatibility, quite the opposite,

Of course it does break backwards compatibility. That's why ikawrakow complained.

you're complaining that it's too backwards compatible.

No, you didn't read what I wrote. I was complaining about guessing from the file name, which is non-obvious and will cause issues, and then guessing in the wrong way, by defaulting to the legacy format, causing issues for everybody in the future.

This is just extremely bad design, and I explained why.

I think what bothers you is that the new format is opt-in, while you would prefer it to be opt-out (since it's strictly better than the old format (because you're not ikawrakow)).

No, I precisely explained it. What bothers me is that this design is a trap for users. It's not an issue for me, I have worked around every single idiotic design decision in llama.cpp without issues other than grumbling to nicoboss about it.

I also can't see why you are defending it so much. User uses llama-imatrix, it works. Then she gives an output filename of her choosing, and boom, no error, but suddenly quality degrades or it fails to quantize, with no obvious reason why other than magic filename guessing. I just can't fathom why anybody thinks this is a good thing, or even a reasonable compromise. It's just bad design.

Anyway, I will not discuss this further, llama.cpp devs do what llama.cpp devs want to do, and I have explained enough already. If you want to insist on/defend some extremely bad user interface, sure, that's par for the course for llama.cpp - I have seen my share of llama.cpp devs shitting on users.

All of that won't be a problem anymore with the GGUF-based imatrix format.

Yeah, if the user does not fuck up naming her files, because, if she fails to give the "correct" filename, things silently go wrong. Great design.

@mradermacher Please update to the latest llama.cpp of our fork. Sorry for the insane pace of updates in the past days. llama.cpp progress is going absolutely crazy at the moment. I'm sure it will slow down again soon. Support for so many models got added that it probably is worth updating, so please update and queue the following models:

@compilade and mradermacher Sorry for my late and long reply. I was quite busy dealing with the insane llama.cpp progress regarding newly supported models in the past few days. I just went through all your messages chronologically and answered them.

I have to say a big thanks to @compilade for finally putting in the effort llama-imatrix deserved for quite a while. Not only did you implement an amazing new GGUF-based imatrix file format, but you also fixed the issue with partial data for MoE tensors and added missing support for 3D tensors, finally allowing imatrix computation of MLA models using mainline llama.cpp. Thank you so much for your great effort! I highly appreciate it.

I assume that means we can throw away all our work, too? Regardless, doesn't strike me as the change we were urgently waiting for.

Not at all, as there luckily is backwards compatibility. Once merged, llama-imatrix should still be able to read the legacy format. Old imatrix files can even be converted into the new format, which might be something worth considering to future-proof our imatrix files, as I have some doubts llama.cpp will maintain the legacy format forever since it seems like a massive maintenance burden with almost no benefits to me.

It was an improvement before (not heavily though), but the reason it wasn't done is because the last result was that it decreases quality.

With the legacy file format, larger batch sizes were totally broken and resulted in terrible quality. This is an issue addressed by the new imatrix file format, so we should be able to increase it for a speed improvement. For larger models that don't fit into GPU memory, and especially for models so large they need RPC or SSD streaming, the speed improvement is likely massive. Doing so is not really in the spirit of our i1 standard unless increasing the batch size has no or close to no impact on imatrix quality, so extensive testing once this is merged is required. I will also do some tests to see if it speeds up RPC imatrix computation, which currently is painfully slow.

Yes. I always calculate imatrix files twice just to see if the checksum matches. This looks like a solution looking for a problem to solve.

Having the order in which data is stored depend on the undefined unordered_map iteration order is quite a bad design choice in my opinion. I don't think anyone cares about checksums, but it is just one of many examples why the legacy imatrix format is not at all what I would consider a well-designed file format, and instead something someone just threw together with as little effort as possible to somehow make it work. It doesn't meet the high quality standards required of an open-source project as massive and widely used as llama.cpp in my opinion.
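
(For our compute-it-twice check, an order-independent comparison is easy enough with the new format anyway. A minimal sketch, assuming the gguf-py package that ships with llama.cpp; the function name is made up and the path is a placeholder:)

    import hashlib
    from gguf import GGUFReader

    def order_independent_digest(path: str) -> str:
        # hash tensor names and raw data sorted by name, so the writer's
        # iteration order no longer influences the result
        reader = GGUFReader(path)
        h = hashlib.sha256()
        for t in sorted(reader.tensors, key=lambda t: t.name):
            h.update(t.name.encode("utf-8"))
            h.update(t.data.tobytes())
        return h.hexdigest()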

Overall, not too many changes it seems. Maybe there aren't even user-visible changes, other than maybe we have to stop the world completely because
I'd guess the new quantize can't read old imatrix files, so we first have to finish all quantizes with old imatrix files before we can switch.

It can read them, but I highly recommend never relying on the backwards compatibility layer whenever possible, as given past experience I wouldn't expect it to be well tested, especially for the initial release, and I would prefer not having to beta-test it in production, which in the worst case would lead to an unknown number of corrupted imatrix quants. So let's stop the world before we update once this gets merged. We have to rename the imatrix file to end with .gguf anyway.

Should fortunately be an easy option at the moment, with practically no current releases anymore.

Wow, that statement aged so poorly. It went from a super calm period, where I even started requantizing older models to give them MLA support, to this crazy flood of newly released and mainly newly supported models, some of which are absolutely massive in size.

./requirements/requirements-convert_legacy_imatrix_to_gguf.txt
No issue at all

I probably would have preferred a conversion script over them keeping the legacy imatrix format inside llama-imatrix. It being able to read the legacy imatrix files without conversion is super nice, but I'm just happy I'm not the one having to maintain this in the future. A conversion script would probably have been so much easier to maintain.

As a stopgap measure, I've implemented a link from the imatrix repo to the static repo for vision models, and also added a link to our overview page for new READMEs.

Wow that is super nice!

This implements the versioning logic, and also the database/caching logic for all the upstream metadata, but hopefully in a limited enough way for it not to explode. This will make the real change easier.

That's so cool.

Ah, and the fact that new mmproj files weren't listed in the table was actually a bug.

I always thought you just never automated that after the initial mmproj README.md experimentation. I actually found it super annoying that it was missing from the table in the README.md, but didn't want to bother you too much about it as I thought it would all be fixed with the new README anyway.

and llama.cpp has been updated, too.

Thanks, and sorry for the many updates in the past few days. What a crazy time.

The ssh forwarding doesn't seem to work anymore:
ssh: connect to host castle.nico.re port 2108: Connection timed out

Sorry for that. My ISP did some maintenance on their TV network and it seems like it didn't go according to plan, as apparently it also affected the internet and caused public IP addresses to rotate, despite them claiming that the maintenance would only affect the TV network. I updated all the DNS entries so SSH forwarding should be working again. For next time it probably would be enough to obtain the new public IP and use it instead of the DNS name until I update the DNS entries. I will also soon set up DynDNS on my router again so it should update automatically in the future. I still have not configured DynDNS and Let's Encrypt since I switched from the Raspberry Pi 4 to the Threadripper to host the OpenWrt router, both of which are on my to-do list.

ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected. nvidia-smi does see both cards though.
llama-imatrix sees only device 18*. I will pause the other one for the time being.

I have no idea what caused/causes this. I rebooted StormPeak and the issue persisted, but now it seems like it somehow fixed itself. I'm quite confused about what happened and don't really see anything in the logs, but the issue seems to be gone now, at least in my container. Before, the issue occurred in all LXC containers and with all software using CUDA.

Update: found out how to pause it. The pause can be undone with llmc flags pause.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-

I recommend you resume it once you have some low-priority models, to test if it works again for you.

@mradermacher actually, the old format can still be read by llama-quantize in https://github.com/ggml-org/llama.cpp/pull/9400.
There's even support for bi-directional conversion of the imatrix files, by not specifying a dataset and using ./bin/llama-imatrix -m /path/to/model.gguf --in-file some.imatrix -o some-converted-imatrix.gguf. Usually conversion is not necessary (since they can still be read directly).

Which is very nice but a potential maintenance burden.

Old imatrix files can still be produced even after the PR is merged, the GGUF-based format is only used when the output filename ends with .gguf.

This is so ugly. I would prefer a flag instead of solely relying on the filename. To be fair, one really should name the imatrix file with a .gguf extension as it is a GGUF file, but how should users know about this? Most will still use the old command and unintentionally still use the legacy format. If you go this route, at least show a big warning to warn users about this behavior.

I didn't know that, though. It probably won't affect us, but this kind of shockingly bad design is going to bite somebody for sure. Sigh.

It for sure will. There will be so many users that simply don't know about this.

Update: Ok, shockingly bad is too harsh for this, specifically, but we have been bitten many times in the past by magic-argument-guessing problems in llama tools. If there are choices, they should make it a switch and not magically guess from filenames, which are arbitrary user-defined strings.

I agree. Switches are much easier for new users to get familiar with and are usually much better documented.

Update 2: especially if the new format actually changes interpretation of the imatrix data, because now we are forced to rename all imatrix files to .gguf. won-der-ful.

We should use a .gguf extension for all the new imatrix files anyway because that's what they are.

@nicoboss I've further reduced the model size limit on rich1 to 50B due to lack of diskspace.

OK I let Richard know. I'm in quite close contact with him. Not sure what's going on with his old rich1 cloud server. He is currently on holiday and busy with his new AI training server.

@mradermacher That was mostly done in an attempt to address concerns by ikawrakow who has scripts which read the legacy format directly.
Not sure how to make that less confusing while still allowing conversion to the old format.

You just add a flag.

I can change that. Do you have a suggestion for the name of the switch?

I don't really mind what you call it. I mostly care that you don't create a trap for users who are not aware of it.
--legacy - Stores the imatrix in the old legacy imatrix.dat file format instead of the new imatrix.gguf file format.

HuggingFace cannot guess that a file is GGUF when it doesn't have the .gguf extension, though, so for the best user experience, using .gguf for imatrix files in the GGUF format will be necessary anyway

I agree and we will use the .gguf extension for next generation imatrix files for this and many other reasons. It would be stupid not to use the proper extension for what clearly is a GGUF file.

But of course ikawrakow's concerns are not solved by this - anything that allows both formats would allow this, even a sane solution, so I don't think this really addresses his concerns specifically.

I think he probably would even be fine with a simple dedicated conversion script, but a flag seems like the best solution to me.

I am not an expert in llama switch design, but "-of gguf/imatrix" or the like, as practically any other program in this world handles file formats (well, at least those who have backwards compatibility concerns).

That's also a nice idea, as that way you let the user select one instead of just assuming GGUF to be the default.

As I said, it's not the worst misdesign. Things like quantize insisting that filenames must not start with 0x or so are far worse.

Sorry for making such a big deal out of this. It really is a relatively minor poor design choice compared to many other things inside llama.cpp.
The worst llama.cpp design mistake in my opinion is still that for RPC imatrix computation I can only specify -ngl 999 to offload all layers to the RPC servers, but can then not specify -ngl 0 for GPU-accelerated imatrix computation on those RPC servers. This forces us to use GPU-only RPC servers and abuse GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 so that GPU memory overflows to RAM, which is half as fast compared to -ngl 0 for imatrix computation. This terrible design has already cost us weeks of wasted compute and is what makes RPC imatrix computation of Llama 405B based models take 20 instead of 10 hours.

It might be overall less confusing to always output gguf, and simply have a converter command for those (very few!) who need the old format, or those (more common) who want the new format and only have the old one (and only need a converter anyway). That breaks backwards compatibility, but in a clean way, without people far in the future having to worry about the old format all the time. The current design is so bad because it both breaks backwards compatibility and carries legacy concerns into the indefinite future, as when people wonder why they silently get worse quality or quantisation fails because they didn't name their file .gguf in years to come.

I couldn't agree more. The initial idea with a conversion script was really nice and easy to maintain in my opinion.

That's the same illogical argument as with ikawrakow earlier - just because a bad design can solve a problem does not mean that the problem must be solved by bad design. Just because all humans are mammals does not mean all mammals are human. Just because ikawrakow needs the old format does not mean everybody should have to work around his problem.

Having said that, I am happy that llama.cpp didn't simply invalidate all existing imatrix files, as usually happens.

I as well highly appreciate that there is a way to convert the old to the new file format as we spent a ton of compute to generate them. It would have been super sad to see them all go to waste. Thank you so much for implementing this conversion!

@compilade actually ikawrakow's concerns were not backwards compatibility but code bloat and needless complication. His preferred solution would be to not have imatrix files as gguf anyway (a wasteful and complicated ad-hoc file format).
(i'm not ikawrakow btw..)

He can still use his beloved inferior legacy imatrix.dat format even if we add a switch. But even if we drop support for it entirely he maintains his own llama.cpp fork where he is free to keep the legacy format.

@nicoboss ... and I just found out that a seemingly innocuous bugfix caused thousands of model page updates. that, even worse, might need to be undone.

Oh no what happened?

@mradermacher (sorry for the long answer)

No problem. I'm going through all of them anyways.

I'm not an expert in that either. -of seems good. Or maybe -ofmt to match with the style of -ofreq.

I like -of but also don't mind -ofmt if you tell me what it stands for. Output file media type?

What I'm not sure about with such a flag (and partly why I didn't go with that approach (at least for now)) is whether or not it would cause more confusion than the file suffix.

It's less confusing, as you see flags both in the help output and they are well documented in the README. You can expect users to look at both the help and the README before using this. To make it even less confusing, add a clearly visible warning if no file type is specified and fall back on detecting the file type based on the extension, so you have the advantages of both approaches.

For a proper format flag, the legacy format should have a name, but I'm not sure what it should be. Naming it imatrix could potentially confuse new users into using it, legacy doesn't feel satisfying, and dat isn't descriptive enough (although it may be appropriate).

I would name it imatrix.dat as that's what likely everyone is calling it.

The harder requirement to fulfill with a format flag is to not break ikawrakow's scripts. That might prevent using GGUF by default, since that could break his scripts until he uses the format flag.

How come his scripts are so important that we can't under any circumstances break them? How long would it take him to change them? Probably like 1 minute. llama.cpp developers constantly break our scripts and we never complain about it. But if you fall back on using the extension when no flag is specified, then I think everyone should be happy. Just make sure to add a clearly visible warning so you don't cause issues for users that are not aware of the new file format.

A lot of other projects use the filename extension for the format of the output file, like ffmpeg, ImageMagick, etc.

I always just specify the file type using a flag and wasn't even aware ffmpeg would otherwise determine it based on the extension. This approach also wouldn't work well for me, as I often give my videos arbitrary extensions based on the video codec, like .av1 for AV1-encoded videos, since container formats like mkv and mp4 are stupid in that they hide the actual video codec, which is like the only format I care about. How would ffmpeg even know based on the extension what codec I want? Extensions just specify the container format, which can contain dozens of different video codecs. I see no way around having to at least specify the video codec using a flag when using ffmpeg.

Some use a format flag (e.g. pandoc), but also guess from the filename extension when no flag is provided.

Which is a valid approach you could implement.

The main use case I was thinking of when adding back support for the old format (because I did make it use only GGUF at some point) was any preexisting scripts generating and reading/parsing imatrix files directly, without passing through llama-quantize, but still using a recent version of llama-imatrix.

I'm quite confident the few scripts I wrote to process imatrix files will all still work without any issues with the new GGUF-based format, as they all just grep for specific tensor names to, for example, determine if the imatrix was computed with or without MLA support. In fact, having the imatrix files as GGUFs will make developing any future scripts far easier. The imatrix.dat file format was mostly undocumented and a pain to work with, while GGUF is a well-established format for which many libraries exist that can be used to, for example, read the GGUF's metadata.
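
As a rough illustration, such a script could look roughly like this with the new format. A minimal sketch assuming the gguf-py package that ships with llama.cpp; the path is a placeholder and the attn_k_b naming check is an assumption about how MLA data shows up, not a guaranteed layout of the new imatrix.gguf files:

    from gguf import GGUFReader

    # list the tensors stored in an imatrix GGUF
    reader = GGUFReader("Model.imatrix.gguf")
    names = [t.name for t in reader.tensors]
    # heuristic: MLA-aware imatrix data contains the attn_k_b projections
    has_mla = any("attn_k_b" in name for name in names)
    print(f"{len(names)} tensors, MLA data present: {has_mla}")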

I realize that may not be a very popular use case.

The number of scripts written for the imatrix.dat format is minimal, as the format itself was too big of a pain for anyone except ikawrakow, who invented it, to develop any tooling for. Was the imatrix.dat file format even documented anywhere? I don't remember finding any documentation for it the last time I had to develop a script for it, which is why I just used grep. It also was a far simpler solution that did its job very well and gave us a nice list of models to requantize to add MLA support.

I think the use case you're most interested in is between llama-imatrix and llama-quantize, where the actual format or compatibility across versions isn't particularly important (apart maybe from reading because quantizing again can be useful after some time).

Exactly. We compute imatrix files using llama-imatrix for tens of thousands of models, so they should hopefully always work with the latest llama-quantize, or be convertible to a format that works with the latest llama-quantize, without ever having to redo the computationally extremely expensive imatrix computation.

Agreed. Reading the legacy format will be useful for most, but writing, maybe not so much.

I agree. Backwards compatibility for reading the imatrix.dat file format is super nice, but writing is a very niche use case and you should ensure users don't accidentally still use the old file format just because they don't know any better.

You're right, it does sound like that. Generating the imatrix files certainly requires (the "bloated") libggml.so anyway to run inference, which makes his concerns about bloat confusing. But in the context of reading imatrix files, it makes sense, and I think the compromise of allowing to continue generating/converting to legacy imatrix.dat files somewhat addresses that.

I agree. Having a script to convert back and forth between the formats probably would make everyone happy without introducing much of a maintenance burden for the llama.cpp project.

It doesn't break backwards compatibility, quite the opposite, you're complaining that it's too backwards compatible. Newly generated legacy-format imatrix files can still be read by older llama-quantize versions. However, those who use the default output filename will get imatrix.gguf instead of imatrix.dat, and those using a specific file name will need to opt-in to using the GGUF format (by using the appropriate .gguf filename extension) for imatrix at their own pace. It's certainly not as much a breaking change as silently replacing the format would be.

Keeping backwards compatibility for now is super nice. Just in addition to that add a flag and a warning if no flag is specified and everyone should be happy.

Deprecation of the legacy format will come later, it doesn't have to be all at once.

If that is the plan to solve the maintainability issue, then everything is great. Just eventually deprecate and drop support for it, and don't wait years before doing so.

I think what bothers you is that the new format is opt-in, while you would prefer it to be opt-out (since it's strictly better than the old format (because you're not ikawrakow)).

It's less about opt-in/opt-out than about making sure someone who is not aware that there is a new imatrix file format doesn't accidentally still use the old one. It's about educating and communicating changes to your existing and future users.

Currently, the only difference in quality is when there is partial data for MoE tensors.

Which is quite huge. What about 3D tensors? Your changes also implement support for 3D tensors. I think they do so in both formats, so I guess for those it doesn't matter which one the user decides to use. 3D tensors are in my opinion the most impactful change, as they finally allow users to imatrix-quant models with MLA, which, given the popularity of DeepSeek, has become quite common. Let's also not forget about the batch size fixes.

This was already broken on master, and can't be fixed satisfyingly because the legacy format is not extensible enough (the workaround of adding fake 1 values would work, but would not allow merging properly (you know what, since the current behavior of dropping the data also makes merges weird, I guess the better solution is to add those fake 1. I will change this.)).

I never really understood what this entire imatrix merging topic is all about. I wasn't even aware that merging imatrix files is a thing. Is this documented somewhere? We are currently also using the fake 1's for partial data of MoE tensors and never experienced any issues with it, except that we had to disable intermediate saves, as they were hiding the information about which tensors are missing how much data from any future saves and, most crucially, from the final save.

The change from that last parenthesis would make the old format completely equivalent to the new format in quality (for now), and the remaining advantages of GGUF-based imatrix files would be extensibility, the ability to sanely merge multiple imatrix.gguf files using different chunk sizes and/or with MoE tensors, and interoperability. (with gguf-dump, the GGUF previews on HF, etc.)

It would be nice to have feature parity between them if it doesn't come at any quality cost, but please never sacrifice advantages of the new file format just for the sake of keeping backwards compatibility, or the new file format will forever be limited by the old one. One example is larger batch sizes, which already work correctly with the new format but can't be fixed in the old file format.

Extensibility is one of the major things storing imatrix files with the old format does not allow. It's not versioned, the file type can't be uniquely identified from the content (no magic first bytes), stacked MoE support is a workaround, the chunk size isn't stored, but it's used to pre-scale the values, the number of chunks is stored per tensor instead of storing the number of tokens per 2D slice, there's no optional fields except if they're at the end, new fields can't be added unless they're at the end, types of the fields cannot be changed in any backwards compatible way, extra tensor data (e.g. the sums of activations (in addition to the currently-stored sums of squared activations)) cannot be added cleanly without breaking older files, etc. All of that won't be a problem anymore with the GGUF-based imatrix format.

This is such a great summary of why the old imatrix.dat file format is just bad, and I don't see why anyone would still want to use it over the much better imatrix.gguf file format. imatrix.dat really doesn't meet the quality standards of modern llama.cpp anymore and I'm really happy it finally gets replaced by a decent file format.

@mradermacher Please update to the latest llama.cpp of our fork. Sorry for the insane pace of updates in the past days. llama.cpp progress is going absolutely crazy at the moment. I'm sure it will slow down again soon. Support for so many models got added that it probably is worth updating, so please update and queue the following models:

@mradermacher Any updates regarding the llama.cpp update? I see nico1 still uses the old version, which might not be ideal for Jamba imatrix RPC as the RPC servers already use the latest version. I'm mostly reminding you in case you missed my message, as it got buried in the massive imatrix.gguf discussion.

[calm]

Wow that statement aged so poorly.

Can't see that. While you may be busy updating llama.cpp, w.r.t. releases it's still extremely calm, with almost no releases per day compared to the first half of this year.

btw., we have these in the queue as well:

0 1030 si Kimi-K2-Instruct                            
0 1030 si Kimi-K2-Base                                

llama is updated

Failed to load model config from LFM2-700M: The checkpoint you are trying to load has model type lfm2 but Transformers does not recognize this architecture.

Well, llama.cpp beating transformers to the release it seems... :)

btw., we have these in the queue as well:

0 1030 si Kimi-K2-Instruct                            
0 1030 si Kimi-K2-Base

I saw it and already downloaded the model and tried manually providing GGUFs, but convert_hf_to_gguf.py unfortunately does not seem to support the Kimi variant of the DeepseekV3ForCausalLM architecture yet:

root@AI:/apool/llama.cpp# venv/bin/python convert_hf_to_gguf.py --outfile /transfer/Kimi-K2-Instruct.gguf /cpool/Kimi-K2-Instruct
INFO:hf-to-gguf:Loading model: Kimi-K2-Instruct
WARNING:hf-to-gguf:Failed to load model config from /cpool/Kimi-K2-Instruct: Loading /cpool/Kimi-K2-Instruct requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:hf-to-gguf:Model architecture: DeepseekV3ForCausalLM
WARNING:hf-to-gguf:Failed to load model config from /cpool/Kimi-K2-Instruct: Loading /cpool/Kimi-K2-Instruct requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-1-of-61.safetensors'
INFO:hf-to-gguf:token_embd.weight,            torch.bfloat16 --> F16, shape = {7168, 163840}
INFO:hf-to-gguf:blk.0.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.0.ffn_down.weight,        torch.float8_e4m3fn --> F16, shape = {18432, 7168}
Traceback (most recent call last):
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 7411, in <module>
    main()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 7405, in main
    model_instance.write()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 410, in write
    self.prepare_tensors()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 5679, in prepare_tensors
    super().prepare_tensors()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 277, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 5676, in modify_tensors
    return [(self.map_tensor_name(name), data_torch)]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 236, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.weight_scale_inv'

If you want to try as well you can find the SafeTensors variant on nico1 under /cpool/Kimi-K2-Instruct.

Ah, I see what's going on. They uploaded the experts of the model in float8 like official DeepSeek, so we first have to convert it to BF16. It being far larger than DeepSeek made me wrongly think that it already was in BF16.

Edit: It is currently converting...

Bamba-9B-v1 also isn't as well supported as we would want:

llama_model_load: error loading model: error loading model hyperparameters: key not found in model: granitehybrid.context_length

I wonder what they tested BambaForCausalLM with.

@nicoboss any ideas on how to manage the supposedly supported but not (yet?) working models that we have every time a new arch is introduced?

I am thinking about simply allowing llmc audit to specify a model name, allowing it to "un-override" a model easily. But maybe it might be more useful to keep them in failed state and simply set some flag so audit will skip them. Or we could somehow manage the failed models outside of the queue.

Example: the lfm models supposedly are supported, but I have yet to see one that works. I don't want to nuke them as long as there is hope (because that almost guarantees that we will forget them) but I also don't want them in my audit list every time :)

I think you can just set the .override flag for all the models you want to keep and llmc audit redo them to switch them from a failed to an overridden status. Automating this process would be nice, but ideally we would keep them in a failed state and exclude them from the normal audit like we do for besteffort models. That way we would keep the error message but they don't bother us during normal audits. We could also mark them on the status page so we know which models to revisit the next time we update llama.cpp.

Sorry, I had no time to look into the LFM models as I'm currently busy handling Jamba RPC imatrix jobs and dealing with the Kimi K2 mess. We again managed to somehow hit 00:17 on the second Sunday of the month and so OOM-crashed on the first attempt of AI21-Jamba-Large-1.7. This is like the 4th time we somehow run an RPC imatrix job exactly when the scheduled ZFS scrub runs, and I just keep forgetting about it. I will think of a way to pause all scheduled tasks during RPC imatrix jobs in the future. I could probably just stop the crond service on the host.

Kimi K2 is such a pain. Those are 1T parameter models, so the BF16 source model and source GGUF are 2 TB each. There are currently many different llama.cpp contributors working on fixing it in a relatively uncoordinated way, all using different approaches. I will redo the source GGUF once it is decided which PR will probably get merged. I created my own Q3_K_M quants of it for testing and they are over 450 GB in size. It will be fun to see what the largest quant we can use for RPC imatrix computation will be.
On the bright side, all the Kimi K2 effort is so worth it, as it currently is by far the best open-weight LLM. Just kind of a shame the best I can ever run without RPC will be Q3_K_M or a quant of similar size.

just a heads-up, I have a rather inconvenient case of food infection and won't be very active till I am healthy again.

just a heads-up, I have a rather inconvenient case of food infection and won't be very active till I am healthy again.

I hope you feel better again soon.

@mradermacher Please update to the latest version of our llama.cpp fork once you feel well enough to do so. Kimi-K2 support just got merged! I'm so excited to try it out. The latest update also adds support for Plamo2ForCausalLM.

@mradermacher Once you have updated llama.cpp please start Kimi-K2-Instruct. I have already updated the source GGUF.

Feeling a bit better, trying to do some simple things. Sheesh, those were two horrible days.

llama is updated, but this message is new:

WARNING: Ignoring invalid distribution ~f-xet (/llmjob/share/python/lib/python3.11/site-packages)

I've restarted kimi, but I don't know if the change invalidated the gguf or not.

Thanks a lot for updating to latest llama.cpp! Kimi-K2-Instruct is now running successfully. I'm so looking forward to this model.

If you have time, please configure Kimi-K2-Instruct to use imatrix RPC. There is obviously no way F16 or even Q8_0 will fit. Q6_K might still be too big, but Q5_K_M should work. Because we don't know yet, just make the imatrix task use the F16 naming and I will link whatever quant fits.

Edit: Q6_K seems to fit, so we are going to use it for imatrix RPC; feel free to specify this quant when configuring the Kimi K2 RPC imatrix task. I already provided /tmp/Kimi-K2-Instruct.Q6_K.gguf

Feeling a bit better, trying to do some simple things. Sheesh, those were two horrible days.

Glad you feel better again.

I've restarted kimi, but I don't know if the change invalidated the gguf or not.

It did, which is why I regenerated the Kimi-K2-Instruct SOURCE GGUF overnight using my own already updated llama.cpp build. I even had to update some files in the downloaded model and in the BF16 conversion first, as the actual model contained issues and had to be updated as well. Even now, Kimi-K2-Instruct to SOURCE GGUF conversion still requires tiktoken and arbitrary code execution, which, besides its enormous size, is why SOURCE GGUFs for this model need to be provided manually.

WARNING: Ignoring invalid distribution ~f-xet (/llmjob/share/python/lib/python3.11/site-packages)

Maybe it's time to give XET another try in the near future once XET v1.1.6 is out. Currently they are at v1.1.6-rc2. They are also implementing XET in WebAssembly, so even downloads using the HuggingFace website will likely soon use XET.

Please update llama.cpp to the latest version of our fork for https://huggingface.co/mradermacher/model_requests/discussions/1167 and so our entire RPC setup has the same version for /tmp/Kimi-K2-Instruct.Q6_K.gguf imatrix RPC.

@mradermacher Please update to the latest llama.cpp version of our fork, then remove the override from the ERNIE tasks on nico1 and configure the ERNIE 300B tasks to use RPC imatrix at F16.

Feeling a bit better, trying to do some simple things. Sheesh, those were two horrible days.

Due to you being unresponsive for over two days and probably on sick leave, I took administrative action and updated the CUDA llama.cpp binaries on nico1 to quant the newly released and highly anticipated models there. I have not touched the non-binary parts like convert_hf_to_gguf.py to minimize the risk of breaking something. For now I limited this setup to the ERNIE-4.5 series of models (which is why I currently paused nico1), so in case they turn out bad we can easily just requant them using our proper setup.

Edit: In the meantime I queued the EXAONE-4.0 and all the diffusion based LLMs as well.

If you find some time, please update llama.cpp and configure the imatrix RPC setup for the following models:

  • Kimi-K2-Instruct for Q6_K
  • ERNIE-4.5-300B-A47B-PT for F16
  • ERNIE-4.5-300B-A47B-Base-PT for F16

llama has been updated, and the jobs should be configured (and soverriden)

@nicoboss

we are doing cpu-imatrix again (nvidia-smi simply seems to hang, that's new to me)

ggml_cuda_init: failed to initialize CUDA: unknown error

I'll disable imatrix calculations for the time being

I think I will rename the imatrix files to imatrix.gguf in the repos once we switch formats. Although that might mean more work than I am willing to donate on this problem (essentially it means we'd have to handle two different file formats into eternity). I'll try to think about how to do this in an easy way, preferably without a complete stop-the-world approach.

or maybe like quants, MODELNAME.imatrix.gguf

That explains why all our source models are F16 instead of what best fits the original model:

parser.add_argument(
    "--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"], default="f16",
    help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
)

We don't specify --outtype auto, so it defaults to --outtype f16. This is causing a lot of problems for us, like https://github.com/ggml-org/llama.cpp/issues/14788, which worked when I used BF16 as the source. I assume many NaN/Inf issues are caused by this as well, but I need to confirm that first. I recommend we switch to --outtype auto for convert_hf_to_gguf.py unless you have a good reason why you want all source models to be in F16. I think bartowski once mentioned something about F16 performing better, but given that both the RTX 4090 and my Ryzen Threadripper PRO 7975WX CPU support BF16 natively, I see no reason why this would be the case.

I think I will requeue the Ernie4.5 MoE 300B models but with a manually provided BF16 source GGUF in order to generate the low-bit quants we skipped. I already have them locally to test if they indeed fix the issue referenced above.

we are doing cpu-imatrix again (nvidia-smi simply seems to hang, that's new to me)

No idea what happened but a reboot of the host fixed it. I guess just NVidia writing not so stable GPU drivers.

I think I will rename the imatrix files to imatrix.gguf in the repos once we switch formats. Although that might mean more work than I am willing to donate on this problem (essentially it means we'd have to handle two different file formats into eternity). I'll try to think about how to do this in an easy way, preferably without a complete stop-the-world approach.

Please rename them once we update. I plan on updating our fork this evening after work. The new format is far better. It fundamentally solves the issue of MoE models with non-activated experts and implements 3D tensor support for MLA in a much cleaner way. It is now also finally a GGUF and so has all the GGUF metadata, so in the future we could add custom metadata to the imatrix, like which imatrix dataset we used. The backwards compatibility is amazing, so we don't need to stop the world to update.

or maybe like quants, MODELNAME.imatrix.gguf

I like this name.

bf16 is/was not as well optimized, or lacked hardware support. I don't think this is an issue for us.

But I don't see how auto helps us; (per docs) it always quants to 16 bit ("the highest-fidelity 16-bit float type depending on the first loaded tensor type"). I don't think the first loaded tensor type is a good/stable heuristic.

and no clue what they mean with fidelity, but clearly neither precision nor range :=)

And if the first tensor type is f32? It seems we would have to specify the tensor type manually per model.

I assume many NaN/Inf issues are caused by this as well but I need to first confirm that.

converting bf16 to f16 would not result in nan. inf yes, nan no. if we get nan's on conversion from bf16 to f16, the conversion is buggy.
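
A quick numerical check of that claim (numpy, with bf16-representable values emulated as float32; just an illustration):

    import numpy as np

    # bf16 shares f32's exponent range, so these values are all representable in bf16
    x = np.array([1e38, -1e38, 1.5, 1e-30], dtype=np.float32)
    y = x.astype(np.float16)
    # out-of-range values overflow to +/-inf, tiny values underflow to 0, but no nan appears
    print(y)  # prints [inf -inf 1.5 0.]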

and if weights are >~2**16, I would assume the model is broken anyway, so this doesn't seem to be a priority.

Are there really models that require such large weights? Can these even be quantized?

The other case would be very small non-zero numbers - again, can those be represented in all quant types?

I think more research is needed, but outtype auto seems just as buggy as no outtype. Why wouldn't convert_hf_to_gguf.py simply leave them as is, and possibly convert f32 to either bf16 or f16? I suspect the only reasonable way for us is to implement a proper "outtype auto".
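
A rough sketch of what such a proper "outtype auto" could look like on our side, deciding from the dtypes actually present in the safetensors shards instead of the first loaded tensor (the helper and the decision rule are illustrative assumptions, not existing convert_hf_to_gguf.py behaviour):

    import json
    import struct
    from pathlib import Path

    def safetensors_dtypes(model_dir: str) -> set[str]:
        # read only the JSON header of each shard: 8-byte little-endian length, then JSON
        dtypes: set[str] = set()
        for shard in Path(model_dir).glob("*.safetensors"):
            with open(shard, "rb") as f:
                (header_len,) = struct.unpack("<Q", f.read(8))
                header = json.loads(f.read(header_len))
            dtypes |= {v["dtype"] for k, v in header.items() if k != "__metadata__"}
        return dtypes

    def pick_outtype(model_dir: str) -> str:
        dtypes = safetensors_dtypes(model_dir)
        # pure bf16 models (possibly with f32 norms) keep bf16 for the exponent range;
        # everything else, including mixed bf16/f16, stays at the current f16 default
        if dtypes - {"F32"} == {"BF16"}:
            return "bf16"
        return "f16"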

we could add custom metadata to the imatrix like what imatrix dataset we used.

We already do that, and this is supported by the old format already.

https://github.com/ggml-org/llama.cpp/issues/14788

CISC also claims that this is a model problem then, which is my default stance, unless convert_hf_to_gguf.py contains conversion bugs.

I can imagine that inf-inf or inf/inf is done somewhere (or similar ops such as inf*0), which would give nans. That could explain nans appearing during imatrix gen, but it does not explain nans in source ggufs - unless I am mistaken, convert_hf_to_gguf should do plain conversion of single weights, without arithmetic ops that could introduce nans.

btw., regarding the imatrix gguf switch, it's not a trivial switch, so don't count on it being done instantly.

I just updated our llama.cpp fork. We are now really close to mainline again, as thanks to the latest imatrix changes we were able to discard all our imatrix computation modifications, such as 3D tensor support for MLA and handling of partially covered experts, as those are now upstreamed. In fact, the only purpose our fork currently still serves is DRYRUN and some debug output.

btw., regarding the imatrix gguf switch, it's not a trivial switch, so don't count on it being done instantly.

We should be able to rely on their backwards compatibility. While there is no rush, I would like to switch to the new imatrix files soon. I'm surprised it's such a hard switch, as we only have to rename the imatrix file and use the new one if it exists and otherwise fall back to the old one. In the worst case you could also keep the current name by just renaming the file after it's generated, but having the .gguf extension is cleaner.

It's a hard switch because of all the places where imatrix files are handled, and the handling changes and needs to take care of both types (for example, "imatrix.dat" is easier to shell-quote than $MODELNAME.imatrix.gguf). It affects every component, and it needs to be implemented together. Yes, it would (probably) be trivial if I just switched to gguf and changed nothing else, but I intend to push through all the naming changes. The non-triviality comes from the fact that we are running a mix of operations at practically any time - I can't just experiment and implement it step by step until it works.

anyway, i did already implement some support for it in the scheduler a while ago - only the imatrixjob parts are missing (those are trivial). once nico1 is back, i'll try it out.

i've disabled the imatrix scheduler for the time being, don't be alarmed.

oh, and lastly, there is the problem of llama.cpp often embedding too much private info in generated files, so that warrants a closer look, too.

re, fp conversions:

while going through the models for today I have another case where auto conversion would be wrong: mixed bf16/f16 models - if these are converted to bf16, we lose precision for "good" models (good defined as having weights within f16 range), just so that some bad models work better.

as such, the default of converting everything to f16 is imho the sanest at the moment. It can be improved, but not by randomly throwing away bits.

to improve it, we'd need to know: what happens when we leave bf16/f16 at their original type - will mixed types cause llama-imatrix to fail? how to handle f32 models? I suspect we'd either need a switch that allows us to keep f32 for selected models, or use a heuristic based on model size. we already run into this kind of problem with models using prequantized weights (such as E4M3 or so variants for deepseek).

if mixed types work, we probably should have a mode that simply keeps 16-bit layers at their current type, a switch that converts f32 to f16 or bf16, and for extra points, convert quantized tensors back to something llama can handle (e.g. f16). how much of an improvement would that be over the current everything-to-f16 mode? maybe not enough to warrant it.

I could expose a switch that forces everything to f32 via outtype, for selected models (would need to be used manually), but again, I think we agree that models that need f32 over f16 don't exist, and even if they did, we couldn't sensibly quantize them if they already lose quality at f16 level. I mean, "good" models. "bad" models with out-of-range weights would absolutely require f32 or bf16 for the range, but I am not convinced quantized formats can handle those anyway.

And how does the heuristic deal with existing f32 layers - clearly, the documentation is wrong in claiming it converts everything to f16 by default, because I think at least the 1d-f32 tensors are left as f32, but I don't know the exact rule.

(and lastly, convert could check whether bf16 can be converted to f16 without loss - as long as it is in exponent range, this should be no issue. even "good" models might not be in exponent range though for small numbers, so exact conversion and ranges can be a bit difficult to figure out).

-o /tmp/Qwen3-1.7B-Nyx-Fusion.imatrix.gguf~

First problem. Filename type detection is just broken. Now I need to use an idiotic naming convention for temporary files because I can't use the standard ~ extension for them. Probably ".gguf~.gguf", because for idiotic reasons, the file really must end in ".gguf". Also, super annoying to clean up later.

Shameful llama.cpp devs. Of course they didn't implement a type switch, even after these problems were pointed out to them.

And of course it's not even in the help output, either. So people won't even know that their imatrix failed because they didn't guess the magical naming convention required to get llama-imatrix to produce high quality output. @compilade thanks, but no thanks to this trainwreck with its blatant disregard for its users.

anyway, i did already implement some support for it in the scheduler a while ago - only the imatrix parts are missing (those are trivial).

Great to hear.

once nico1 is back, i'll try it out.

Sorry, I was doing the Qwen3-235B-A22B-Instruct-2507 imatrix calculation overnight and mainly just paused things because there simply was no RAM to do anything else without causing the imatrix task to start streaming from SSD. I just happened to be asleep when it was done and so was only now able to re-enable everything.

We now have our first gguf imatrix. llama-imatrix defaulted to 2k batch size, so essentially did 4 chunks in parallel. That's nice, I guess.

The detection at load time works great, but the file type at save time is currently hardcoded based on the extension. To be fair, they at least warn the user on stdout that it will save in the legacy imatrix format.

// TODO: use the new format in more cases
if (!string_ends_with(fname, ".gguf")) {
    LOG_WRN("\n%s: saving to legacy imatrix format because output suffix is not .gguf\n", __func__);
    this->save_imatrix_legacy(n_chunk);
    return;
}

Now I need to use an idiotic naming convention for temporary files because I can't use the standard ~ extension for them.

We maintain our own llama.cpp fork and so could easily just change it from if (!string_ends_with(fname, ".gguf")) { to if (!string_ends_with(fname, ".gguf") && !string_ends_with(fname, ".gguf~")) {. We could even just delete this entire code block to make it always save in the new file format no matter what filename we specify.

We now have our first gguf imatrix. llama-imatrix defaulted to 2k batch size, so essentially did 4 chunks in parallel. That's nice, I guess.

Wow that is super cool. I wasn’t aware that they changed the default ubatch size. Hopefully this doesn't impact imatrix quality in any way but I think it shouldn't with the new file format. That will speed up imatrix computation by a lot especially for our imatrix RPC setup as less time is wasted transferring data from RAM to GPU memory using GGML_CUDA_ENABLE_UNIFIED_MEMORY and communicating between hosts.

Now I need to use an idiotic naming convention for temporary files because I can't use the standard ~ extension for them.

@mradermacher You know, this is exactly the kind of tangible use case that I would have liked to know about. This is making me consider adding a format flag for real now (and/or making the default GGUF, since I've also been thinking of some additional quality improvements for MoE which aren't possible with the legacy format (mostly related to https://github.com/ggml-org/llama.cpp/pull/9400#discussion_r2189019069, which I need to experiment with))

Also the main (weak) argument I had for not making GGUF the default imatrix format is not relevant anymore, see https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-13739971; ikawrakow doesn't use mainline llama.cpp.

I'll need to sleep first (time zones), but I'll try to get something ready tomorrow.

I wasn’t aware that they changed the default ubatch size.

The default ubatch size wasn't changed, it's still 512; it's simply that the default batch size of 2048 is respected, which means that like llama-perplexity, it works on 4 chunks at once by default. The chunk size is independent of the (u)batch sizes, and is controlled by -c and defaults to 512.

To really calculate 4 chunks per ubatch (instead of per batch), you can use -ub 2048.

This shouldn't impact quality (with both formats).

@mradermacher You know, this is exactly the kind of tangible use case that I would have liked to know about.

Are you fucking shitting me? I have pointed out these issues and more to you long ago and you chose to not act on them. Are you now in damage control mode? What on earth made you now claim you didn't know about that? What do you think this ugly attempt of shifting blame is going to achieve? Anybody can read the thread here and see that you were told about these issues.

Also the main (weak) argument I had for not making GGUF the default imatrix format

It was never a sound argument, as I also already pointed out to you, detailing exactly why that argument does not apply because ikawrakow did not ask for that - it was you who invented that requirement. ikawrakow only wanted a way to get the original imatrix file format, he never asked for it to be the default. Sorry, but not giving a fuck and then claiming you weren't told about it is not going to fly.

This is making me consider adding a format flag for real now

aha. "now". After unleashing this trainwreck, ignoring all the input you got before. Well, better late than never, but I am personally very disappointed by your attempt to spin these issues now.

And I thought maybe I was a bit too angry with you, considering you one of the better llama devs. But I see now, it's just the same - ignoring user concerns and then spinning things.

How about taking responsibility and not making shit up such as "if only somebody had told me". I did, and you chose to not notice?

And another unsound argument of yours (that I stumbled upon when re-reading the thread):

(HuggingFace cannot guess that a file is GGUF when it doesn't have the .gguf extension, though, so for the best user experience, using .gguf for imatrix files in the GGUF format will be necessary anyway)

First of all, huggingface would not need to guess, but assuming it does guess: huggingface does not "recognize" these files even with the gguf extension at the moment, so this argument is moot. If they add support for imatrix gguf files based on file extension, they should simply use whatever is in use on huggingface. So if everybody chose to call their imatrix files ".imatrix" or ".compilade", that is what huggingface would implement. Your claim that huggingface cannot guess the type from other file extensions is simply a non-sequitur.

@nicoboss other than some double-upload issues due to the gguf filename, and certainly some lurking bugs still in the code, introducing gguf imatrix files went quite smoothly.

I had my fair share of issues, but most of the code I prepared did work. I'm currently waiting for our first such repo to finish:

https://huggingface.co/mradermacher/Qwen3-1.7B-Nyx-Fusion-i1-GGUF

I expect some changes to the readme patcher and download page, but I guess soon I can resume the imatrix generation.

and https://huggingface.co/mradermacher/Document-Validation-Qwen2.5-VL-Simple-V2-i1-GGUF will be the second repo (it might actually overtake the nyx-fusion one), both together should exercise all new code paths.

huggingface does not "recognize" these files even with the gguf extension at the moment, so this argument is moot.

Their GGUF viewer works for them if and only if they have a .gguf extension, which is really nice:
Screenshot_20250723-090916.png

@nicoboss I stand corrected on that - but they are still not recognized on the model overview page, where they would have to implement extra support for imatrix.gguf extensions, so my original argument holds. @compilade argued that they would have to have that extension, which is obviously not true. The extension is neither required nor sufficient. It's the same kind of non-sequitur argument as with ikawrakow and old-style imatrix files.

Update: and, yeah, I am pretty pissed that compilade now claims lack of knowledge about these issues.

@nicoboss I've enabled imatrix generation again. For some reason, the readme was not generated automatically for Qwen3-1.7B-Nyx-Fusion-i1-GGUF, and I currently have no clue how that could have happened, so we'll need to watch out for that being broken somehow. The download page has also been updated now.

that argument does not apply because ikawrakow did not ask for that - it was you who invented that requirement.

I did, because the first response I got from ikawrakow to the GGUF imatrix PR was kind of frustrated, and then I tried to make compromises but didn't get a reply for a while until I asked again. So I went with fewer intrusive changes at first.

And I thought maybe I was a bit too angry with you, considering you one of the better llama devs. But I see now, it's just the same

You really seem to want to put every llama.cpp dev in the same basket. I'm doing this in my spare time, I don't have to contribute. That doesn't mean I can't make bad decisions, but it does mean I prefer not spending too much time arguing about feature requests, because that's time I could have spent having fun on actual implementation and/or experiments.

Generally, I try to be on good terms with others, which is why I'm not throwing back at you the anger you've thrown at me. This is a personal choice.

How about taking responsibility

@mradermacher Fine. I made a bad decision to solely rely on the filename for the output format, and yes I ignored your main suggestion of using a format flag.

I genuinely didn't think about why using a different extension than .gguf would be desirable, until you mentioned the use case of temporary names.

Also I'm writing this at like 3am, so sorry if this lacks coherence.

Well, it generated it for Document-Validation-Qwen2.5-VL-Simple-V2-i1-GGUF now. I have no clue what the heck happened to nyx-fusion.

You really seem to want to put every llama.cpp dev in the same basket.

As I pointed out, I gave you the benefit of the doubt. But I do judge llama.cpp devs by their behaviour, and that includes you. It's unfair of you to now try to make it into an ad-hominem. I only judge you by your actions, not by your affiliations. If you don't like it, act better?

I'm doing this in my spare time, I don't have to contribute.

I agree. Nobody is forcing you, either. Did you get the impression that anybody wanted to force you to work? Why bring it up? I suddenly get strong deja vu feelings from other llama.cpp devs who point out that they are just volunteers, but somehow forget that others are just volunteers, too. I didn't have to point out obvious issues with the quality of your contribution, I did it in my free time, unpaid, as well. Are you trying to say I should shut up because you don't care anyway? Then, yes, I could have saved time myself as well.

What point are you trying to make? That you are special and are allowed to make shit up because you are a volunteer llama.cpp dev? Does that somehow make you better than other volunteers or give you extra privileges?

It's a serious question: why bring up that you are a volunteer in a forum where everybody is a volunteer?

Generally, I try to be on good terms with others, which is why I'm not throwing back at you the anger you've thrown at me. This is a personal choice.

I am angry because I told you about every single such issue with adequate reasoning and you now claim you haven't been told. I am not angry because you did a bad job as a volunteer (everything is suboptimal, and volunteers tend to do better jobs). I am angry because you now make it seem you had to do a bad job because nobody gave you constructive criticism, which is simply not true.

It's ok to do a suboptimal job, but it's not ok to shift blame.

PS: To be clear, I am not asking for any changes. I had to work around every single one of your bad design decisions already. There is only more misery down the road for us if things change and possibly break. I still think it's an unnecessary trap to tie behavioural changes to magic undocumented file naming conventions (and yes, maybe it is documented somewhere, but it really should be in the help output and not in some obscure readme).

PPS: and to make that clear, too: the workarounds required were really minor (just very annoying). by far most of the issues were created by my decision to rename the imatrix files everywhere and were not caused by llama.cpp changes. My concerns are and always were for other users, and your attempt to shift the blame.

PPPS: @compilade seriously, just own up to it. don't shift blame by saying others want to put you into baskets, you are just a volunteer, if only you had known, call out the witch hunt etc. - just accept any valid criticism, that's all that's needed. nobody criticised you for making non-optimal decisions. you can redeem yourself easily.

@nicoboss if you look at kulyk-en-uk, maybe it's a similar issue to LFM2-350M?

so far, it's otherwise uneventful:

-rw------- 1 root root 5.4M Jul 23 10:23 imatrix-remote/AgThinker-14B-final.imatrix.gguf
-rw------- 1 root root 5.4M Jul 23 10:02 imatrix-remote/Delphermes-8B.imatrix.gguf
-rw------- 1 root root 4.6M Jul 23 09:10 imatrix-remote/Document-Validation-Qwen2.5-VL-Simple-V2.imatrix.gguf
-rw------- 1 root root  11M Jul 23 10:01 imatrix-remote/MS3.2-24B-Chaos-Skies.imatrix.gguf
-rw------- 1 root root 2.1M Jul 23 10:03 imatrix-remote/Nous-1-2B.imatrix.gguf
-rw------- 1 root root 3.9M Jul 23 09:54 imatrix-remote/Nous-1-4B.imatrix.gguf
-rw------- 1 root root 5.4M Jul 23 10:47 imatrix-remote/Nous-1-8B.imatrix.gguf
-rw------- 1 root root 2.1M Jul 23 10:58 imatrix-remote/Polaris-1.7B-stage-1.imatrix.gguf
-rw------- 1 root root 2.1M Jul 23 09:56 imatrix-remote/Qwen2.5-1.5B-MegaScience.imatrix.gguf
-rw------- 1 root root 3.4M Jul 23 10:07 imatrix-remote/Qwen2.5-3B-MegaScience.imatrix.gguf
-rw------- 1 root root 4.6M Jul 23 10:16 imatrix-remote/Qwen2.5-7B-MegaScience.imatrix.gguf
-rw------- 1 root root 4.6M Jul 23 10:29 imatrix-remote/Qwen2.5-VL-7B-Abliterated-Caption-it.imatrix.gguf
-rw------- 1 root root 2.1M Jul 23 08:05 imatrix-remote/Qwen3-1.7B-Nyx-Fusion.imatrix.gguf
-rw------- 1 root root 7.8M Jul 23 10:41 imatrix-remote/Qwen3-14B-MegaScience.imatrix.gguf
-rw------- 1 root root 6.0M Jul 23 10:14 imatrix-remote/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x.imatrix.gguf
-rw------- 1 root root 6.0M Jul 23 09:49 imatrix-remote/Qwen3-Jan-Nano-128k-6B-Brainstorm20x.imatrix.gguf
-rw------- 1 root root 5.4M Jul 23 10:10 imatrix-remote/R3-Phi-4-reasoning-plus-LoRA-5K-v1.1.imatrix.gguf
-rw------- 1 root root  16M Jul 23 10:54 imatrix-remote/RoboBrain2.0-32B-Ascend-FlagOS.imatrix.gguf
-rw------- 1 root root 4.6M Jul 23 10:53 imatrix-remote/olmOCR-7B-0725.imatrix.gguf
-rw------- 1 root root 7.0M Jul 23 10:23 imatrix-remote/rwkv7-7.2B-g0.imatrix.gguf

and there is no inconvenient metadata either (not that I expected any - in fact, I expected more, but I guess more is to come):

kaos ~# gguflayersize imatrix-remote/MS3.2-24B-Chaos-Skies.imatrix.gguf --meta
{
   "general.type" : [
      8,
      "imatrix"
   ],
   "imatrix.chunk_count" : [
      4,
      321
   ],
   "imatrix.chunk_size" : [
      4,
      512
   ],
   "imatrix.datasets" : [
      9,
      [
         "imatrix-training-full-3"
      ]
   ]
}

@compilade differences aside, do you know how much the batch_size can be pushed up? I am wondering how feasible it would be to process very large models in multiple passes, trading disk I/O for memory. that's only realistic if we can reduce the number of passes to something very low - I did that in the past to some success. (and don't feel pressured to answer, we are all volunteers, and I don't expect you to just do my homework, no bad feelings if you don't know or don't want to help).

do you know how much the batch_size can be pushed up?

@mradermacher I think the limit (assuming a chunk size of 512) is 32768 (64 sequences × 512 tokens per chunk), because LLAMA_MAX_SEQ is defined to 64 in src/llama-cparams.h. That constant can likely be bumped.

I think the ideal ubatch size for huge models would need to be big and small enough to still fit the buffers of at least any single operation in memory, to avoid extra swapping. So find the biggest tensor used in a matmul, and then add its size to its intermediate input and output. For MoE tensors, this is (n_embd * n_ff * n_experts) + (n_embd + n_ff ) * n_expert_used * n_ubatch * sizeof(float).

So with sizes in bytes, this is:
n_ubatch = (usable RAM - biggest MoE tensor size) / ((n_embd + n_ff) * n_expert_used * sizeof(float))

For https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct in F16 or BF16, this is
n_ubatch = (usable RAM - 5,033,164,800) / (278,528)

For 24GB of usable RAM, this results in 68096, which feels big and so there might be other buffers that need to be taken into account.
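(For reference, the same arithmetic as a tiny script - the numbers are the ones quoted above, with 24 GB read as 24×10^9 bytes:)

# back-of-the-envelope n_ubatch estimate from the formula above
usable_ram         = 24_000_000_000  # 24 GB of usable RAM, in bytes
biggest_moe_tensor = 5_033_164_800   # largest MoE tensor, in bytes
per_token_bytes    = 278_528         # (n_embd + n_ff) * n_expert_used * sizeof(float)

n_ubatch = (usable_ram - biggest_moe_tensor) // per_token_bytes
print(n_ubatch)  # 68096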

When bumping the ubatch size (-ub), you should also bump the batch size (-b).

--no-warmup should also be useful to avoid doing a first pass through the model weights.

(To see which tensor it's working on (to judge the speed without waiting for everything to finish), you can pass --verbose to llama-imatrix)

just fyi, never seen an inf before

ggml_validate_row_data: found inf value at block 67605656
llama_model_quantize: failed to quantize: tensor 'output.weight' has invalid data
main: failed to quantize model from './Hokkaidoben-to-hyojun-converter.gguf'
job finished, status 47
job-done<0 Hokkaidoben-to-hyojun-converter static 47>

@compilade wow, that's very promising. that means 5, maybe 3 passes could do it. @nicoboss we are not there yet, but that could mean almost unlimited model sizes within a reasonable time frame.

swapping between main memory and gpu might still be faster than disk I/O, too.

@nicoboss https://huggingface.co/DiffuOpen/MDM-1.7B looks really worthy ("We introduce MDM-1.7B, a diffusion language model with an 1.7B scale, trained entirely from scratch with open source 1.1T tokens."). alas:

File "/llmjob/share/python/lib/python3.11/site-packages/sentencepiece/init.py", line 1172, in _func
raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.

Qwen3-235B-A22B-Instruct-2507 failed on IQ1_M - I've restarted with "nolow".

/llmjob/llama.cpp-cuda512/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed

That's not exactly a new error btw., I got a lot of those last year, but it sure is an epidemic right now again.

@nicoboss did you manually assign models to marco? it currently has a 280GB model assigned, but its maximum model size is 150, that shouldn't happen.

@nicoboss also, I see again a lot of tiny-random-* models. I think we really shouldn't quant these, they will only dilute our model collection.

update: actually just one, log grepping for the win. nevertheless, I don't think they should be queued unless they aren't really random test models.

@compilade Thanks a lot for implementing the new GGUF based imatrix file format and the great explanations about batch and ubatch sizes!

@mradermacher You know, this is exactly the kind of tangible use case that I would have liked to know about. This is making me consider adding a format flag for real now (and/or making the default GGUF, since I've also been thinking of some additional quality improvements for MoE which aren't possible with the legacy format (mostly related to https://github.com/ggml-org/llama.cpp/pull/9400#discussion_r2189019069, which I need to experiment with))

I don't blame you for this. I completely forgot about the ~ suffix added to temporary imatrix files myself, or I would have brought it up. This despite me looking at them practically every day, as all of team mradermacher's imatrix computation happens on my PC.

We did say that having no format flag would be disruptive to our workflow and would require changes from our side, which is exactly what happened. I'm quite happy with how smoothly switching to the new file format went. I honestly expected far more issues. The process of switching to a completely new file format for imatrix quants would have been far more disruptive without backwards compatibility. Thank you a lot for implementing it this way!

Also the main (weak) argument I had for not making GGUF the default imatrix format is not relevant anymore, see https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-13739971; ikawrakow doesn't use mainline llama.cpp.

I thought it was obvious that ikawrakow is using his ik_llama.cpp, as your entire discussion happened there. In any case I appreciate how you tried to make it right for everyone and tried breaking as little backwards compatibility as possible. None of my personal tools broke with the switch to the new GGUF based imatrix file format.

and there is no inconvenient metadata either (not that I expected any - in fact, I expected more, but I guess more is to come):

At the time I double checked as well that the new format doesn't leak any information which it luckily doesn't.

The default ubatch size wasn't changed, it's still 512; it's simply that the default batch size of 2048 is respected, which means that like llama-perplexity, it works on 4 chunks at once by default. The chunk size is independent of the (u)batch sizes, and is controlled by -c and defaults to 512.
To really calculate 4 chunks per ubatch (instead of per batch), you can use -ub 2048.
This shouldn't impact quality (with both formats).
When bumping the ubatch size (-ub), you should also bump the batch size (-b).

Wow that is amazing to hear! Especially since this does not impact the quality of the computed imatrix, it will probably improve performance by a lot, particularly for the RPC setup. I will start doing some performance testing soon.

For 24GB of usable RAM, this results in 68096, which feels big and so there might be other buffers that need to be taken into account.

Do you mean RAM or GPU memory? On our usual imatrix computation server we can use up to 512 GiB of RAM but only have 24 GB of GPU memory.

@compilade wow, that's very promising. that means 5, maybe 3 passes could do it. @nicoboss we are not there yet, but that could mean almost unlimited model sizes within a reasonable time frame.
swapping between main memory and gpu might still be faster than disk I/O, too.

It indeed is extremely promising. RPC will probably still be the way to go for larger models, but with fewer passes it will be much faster. I'm so excited to do some testing as soon as I have time to do so.

just fyi, never seen an inf before

They do happen and I saw them in the past, but they are far rarer than NaNs.

@nicoboss if you look at kulyk-en-uk, maybe it's a similar issue to LFM2-350M?
@nicoboss https://huggingface.co/DiffuOpen/MDM-1.7B looks really worthy ("We introduce MDM-1.7B, a diffusion language model with an 1.7B scale, trained entirely from scratch with open source 1.1T tokens."). alas:

I will look into them as soon as I have time to do so. I had quite a busy week but this week will hopefully be better.

Qwen3-235B-A22B-Instruct-2507 failed on IQ1_M - I've restarted with "nolow".

Nice to see the new (nolow)i-quants-gemma-a llmc audit option already seeing its first use.

/llmjob/llama.cpp-cuda512/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
That's not exactly a new error btw., I got a lot of those last year, but it sure is an epidemic right now again.

I haven't seen that one for a really long time before recently. In case you wondered what happened to the ERNIE-4.5 MoE 300B models: I requeued them both, but with source quants in BF16 to fix this. Qwen3-235B-A22B-Instruct-2507 already had its source in BF16 as I manually converted it, so it is unfortunately not a universal solution to this error. Despite this, I plan on adding a "source" outtype option to convert_hf_to_gguf.py which keeps the source datatype except for tensors where llama.cpp demands F32. This would be cleaner and might solve some random issues.

@nicoboss did you manually assign models to marco? it currently has a 280GB model assigned, but its maximum model size is 150, that shouldn't happen.

As already discussed in https://huggingface.co/mradermacher/model_requests/discussions/1196, I figured out that just specifying all workers except nico1 is not a good way to ensure models don't get scheduled to nico1, as normal limitations will be bypassed, which led to some workers getting more models than they should. As long as workers didn't risk running out of budget/storage I didn't nuke them. There would ideally be a way to specify to which worker a model shouldn't go, or a way to stop a worker from receiving any new models without completely pausing it. But now that most massive manual models are done this is no longer something we urgently need.

@nicoboss also, I see again a lot of tiny-random-* models. I think we really shouldn't quant these, they will only dilute our model collection.
update: actually just one, log grepping for the win. nevertheless, I don't think they should be queued unless they aren't really random test models.

Today I requantized all the non-DeepSeek MLA models, one of which was tiny-random-minicpm3. We already had it before as an empty repository and it failed as expected even with MLA, so I nukeall deleted it. Surprisingly, PLM-1.8B-Instruct imatrix computation, which in the past worked without any issues, failed as well. Good thing I cloned all the non-MLA models to my account so they aren't lost.

@mradermacher Eden-L3.3-70b-0.1 on marco is stuck in hfd which in turn blocks the download of other models on marco.

@mradermacher If you have time, please update llama.cpp so we can do Kimi-K2-Base. Absolutely nothing interesting happened since the last time we updated, but they fixed some Kimi chat template issues, which is why this model is blocked until the next time we update. I kept delaying asking for an update as there are interesting PRs, but I guess we can't wait for them forever.

@mradermacher Please configure Qwen3-Coder-480B-A35B-Instruct to use RPC. It's too big to run without RPC and has already been stuck in the imatrix queue for many days.

@mradermacher Support for SmallThinkerForCausalLM got merged and I just updated our fork again, so at least the llama.cpp update is now worth it.

@mradermacher Support for VoxtralForConditionalGeneration just got merged and I just updated our fork again, so at least the llama.cpp update is now doubly worth it.

Nice to see the new (nolow)i-quants-gemma-a llmc audit option already seeing its first use.

I had to use it on dozens of models already, unfortunately.

I kept delaying asking for an update as there are interesting PRs but I guess we can't wait for them forever.

Updates might seem like a big thing, but it's just running a single command and waiting (and hoping nothing breaks during the partial update in the rsync phase).

Qwen3-Coder-480B-A35B-Instruct

done

Eden-L3.3-70b-0.1 ON marco is stuck in hfd

Thanks, I guess we need an alarm for that. hf-cli was stuck without any network connection, waiting for... the world to end probably.

tiny-random-minicpm3. We already had it before as empty repository

I don't know if it is due to that, but nuke'ing a job does not delete repos, something that I should be well aware of but certainly keep forgetting, since most jobs I nuke during audit are stuck before they create a repo. That surely creates a lot of empty, and some partial, repos as well. That's mainly a job for the background checker. So many things to do.

llama updated

[ 11/ 747] blk.0.ffn_down_exps.weight - [ 2560, 6144, 160, 1], type = bf16, converting to q4_K .. ggml_validate_row_data: found nan value at block 46

the validation that failed, is it the source tensor, or does it quantize and somehow unquantize and then get nans? I would assume the former, but it must be the latter, as it only happens on some quants. Or maybe something else is going on.

I am not sure if it's due to qwen (which always had nan issues), or something else, but at the moment, it's almost no fun quantizing with practically every model failing.

[321]236.0861,[322]236.2889,[323]236.5113,[324]236.7808,collect_imatrix: inconsistent size for blk.0.ssm_in.weight (6144 vs 1536)

Haven't seen this failure before (mamba_790_hf_qa)

How come each GPU now suddenly has two simultaneous imatrix computation tasks:

ram budget 490 use 45
-2000  141 cogito-v2-preview-llama-70B                   run/imatrix (GPU-18) 11/80 100.12s/c 67.8/131.0m(112.4-115.6) [184/314] 8.1459
-1800    4 PLM-1.8B-Instruct                             error/1 (GPU-18) (status: failure)
    0   66 MetaStone-S1-32B-0730                         run/imatrix (GPU-2d) 20/64 25.73s/c 51.2/34.1m(56.8-57.3) [284/318] 13.1319
    0   30 Qwen2.5-14B-YOYO-Average                      run/imatrix (GPU-2d) 35/48 10.78s/c 8.3/14.3m(11.6-12.6) [208/318] 9.5243
    0   16 Arabic-Law-Meta-Qwen3-8B-Base                 run/imatrix (GPU-18) 47/36 7.32s/c 3.9/9.7m(5.2-6.2) [200/318] 10.0965
    0    ? Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER run/hfdprep noquant/00003/00017
    0    ? MedraN-E4B-steered                            run/hfd (from kaos) 8% 102.38MB/s

There is nothing inherently wrong with running two imatrix tasks simultaneously on the same GPU if you use -ngl 0 and there is any reason to do so, but max-ngl is currently still set to 999, which would make them half as fast if they overflow into RAM.

the validation that failed, is it the source tensor, or does it quantize and somehow unquantize and then get nans? I would assume the former, but it must be the latter, as it only happens on some quants. Or maybe something else is going on.

It's NaNs in the results of intermediate calculations, so likely some bug in the way llama.cpp quantizes, but they keep claiming it's because of the source tensor, despite that not making much sense as it only happens to some quants.

I am not sure if it's due to qwen (which always had nan issues), or something else, but at the moment, it's almost no fun quantizing with practically every model failing.

Yes it's quite insane how often I have to set the nolow flag. Just this morning I had to again set two Qwen MoE based models to nolow.

[321]236.0861,[322]236.2889,[323]236.5113,[324]236.7808,collect_imatrix: inconsistent size for blk.0.ssm_in.weight (6144 vs 1536)

That's quite interesting. I wonder if it's because of the new GGUF based imatrix format. I might try using the legacy format if I find time.

I had to reboot kaos because nfs was stuck (again) and couldn't be restarted. That should not cause multiple jobs to start (it would not see the maybe-running jobs though and start fresh ones, but it would then not show them as running). So no clue what happened, but likely it is due to the reboot. When it happens and I don't see it, kill some.

Yes it's quite insane how often I have to set the nolow flag. Just this morning I had to again set two Qwen MoE based models to nolow.

It's not even working on many models, as even Q2_K sometimes fails. The nolow is meant only for quants that insist on imatrix data, which Q2_K is not.

Actually, were there really 4 jobs running, or was it a display error (caused by rebooting and the imatrix scheduler not knowing what is going on and playing it safe).

[321]236.0861,[322]236.2889,[323]236.5113,[324]236.7808,collect_imatrix: inconsistent size for blk.0.ssm_in.weight (6144 vs 1536)

That's quite interesting. I wonder if it's because of the new GGUF based imatrix format.

It's not caused by the new format, but how 3d tensors are handled. This affects recurrent and hybrid models. I'm investigating a fix. See https://github.com/ggml-org/llama.cpp/issues/14979#issuecomment-3138614267.

A workaround is to force computing only a single sequence at once (with -b 512, or any batch size smaller than twice the chunk size). I'm sorry I didn't test recurrent models after adding support for 3d tensors.

The problem is that src1->ne[2] (from the intermediate embeddings) is used instead of src0->ne[2] (from the model tensor) for the 3d-ness. This works for MLA, but not for recurrent models.

EDIT: a fix is implemented in https://github.com/ggml-org/llama.cpp/pull/14994

some bug in the way llama.cpp quantizes but they keep claiming it's because of the source tensor despite it not making much sense as it only happens to some quants.

Yeah. They might be right, too - this is a spec thing - first one would need to know how llama arrives at those nan's, and then one would have to find out if this is because of "out of spec" weights (if there even is a spec, likely there isn't).

It boils down to llama.cpp devs not being interested in fine tunes, only in the original models. Which in the case of qwen, of course, generate similar problems, but hey :)

These problems have been with us since qwen 1.

The frustrating part is that I failed my mission if we can't provide low-bit quants of large models.

[ 11/ 747] blk.0.ffn_down_exps.weight - [ 2560, 6144, 160, 1], type = bf16, converting to q4_K .. ggml_validate_row_data: found nan value at block 46

the validation that failed, is it the source tensor, or does it quantize and somehow unquantize and then get nans? I would assume the former, but it must be the latter, as it only happens on some quants.

@mradermacher
It is the latter, there's a validation step after quantization. See https://github.com/ggml-org/llama.cpp/blob/d6818d06a6237631523bc0f45d42e79482667948/src/llama-quant.cpp#L472

There's a chance that such NANs are caused by too small evaluation counts for some experts, in which case this patch should help:

diff --git a/tools/quantize/quantize.cpp b/tools/quantize/quantize.cpp
index 0e89a2b81..c91a2ea38 100644
--- a/tools/quantize/quantize.cpp
+++ b/tools/quantize/quantize.cpp
@@ -289,7 +289,7 @@ static int load_imatrix(const std::string & imatrix_file, std::vector<std::strin
             const float count = ((const float *) counts->data)[j];
             if (count > 0.0f) {
                 for (int64_t i = 0; i < ne0; ++i) {
-                    e[j*ne0 + i] = ((const float *) sums->data)[j*ne0 + i] / count;
+                    e[j*ne0 + i] = (((const float *) sums->data)[j*ne0 + i] + 1.0f) / (count + 1.0f);
                 }
             } else {
                 // Partial imatrix data, this tensor never got any input during calibration

See https://github.com/ggml-org/llama.cpp/pull/9400#discussion_r2189019069 for some discussion around that.
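
In other words, the change replaces the plain per-expert average with a smoothed one:

$$ e_{ji} = \frac{s_{ji}}{c_j} \quad\longrightarrow\quad e_{ji} = \frac{s_{ji} + 1}{c_j + 1} $$

where $s_{ji}$ are the accumulated sums and $c_j$ the per-expert evaluation counts stored in the imatrix. It acts like a single pseudo-observation of 1.0 per weight, so experts that saw very few tokens get pulled towards a neutral value instead of keeping extreme averages, while well-covered experts are essentially unaffected.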

I think I'll make a PR (in a few days) with a configurable prior weight to test how effective this is.

It is the latter, there's a validation step after quantization. See https://github.com/ggml-org/llama.cpp/blob/d6818d06a6237631523bc0f45d42e79482667948/src/llama-quant.cpp#L472
There's a chance that such NANs are caused by too small evaluation counts for some experts, in which case this patch should help:

@compilade You are a genius! This patch indeed seems to fix the NaN issues that have been plaguing us for low bits-per-weight imatrix quants for so many MoE models!

@mradermacher I merged @compilade 's patch, so please update to the latest llama.cpp and requeue all MoE models for which we used the nolow flag after GGUF imatrix got introduced, and restart the ones currently on workers so they recheck for missing quants. If we want to fix models that were created with the legacy imatrix format we need to recompute their imatrix, as this change only properly works with imatrix files in the GGUF format.

We also get support for LLaDA 8b Diffusion and MiniCPM-V 4.0. For MiniCPM-V 4.0 the source GGUFs will need to be provided manually as they require minicpmv-surgery.py and minicpmv-convert-image-encoder-to-gguf.py before convert_hf_to_gguf.py can be used.

It's not caused by the new format, but how 3d tensors are handled. This affects recurrent and hybrid models. I'm investigating a fix. See https://github.com/ggml-org/llama.cpp/issues/14979#issuecomment-3138614267.
a fix is implemented in https://github.com/ggml-org/llama.cpp/pull/14994

Thank you so much for your great work @compilade .
@mradermacher This also fixes imatrix generation for LFM2. Because this is a larger bugfix with the risk of breaking things I decided to let it go through normal quality control instead of merging it now.

@mradermacher This also fixes imatrix generation for LFM2. Because this is a larger bugfix with the risk of breaking things I decided to let it go through normal quality control instead of merging it now.

I concur :)

requeue all MoE models for which we used the nolow flag after GGUF imatrix got introduced and restart the ones currently on workers so they recheck for missing quants. If we want to fix models that were created with the legacy imatrix format we need to recompute their imatrix

Unfortunately, it's not a flag, and the data is not available centrally atm., but it is somewhere on the todo.

we need to recompute their imatrix

Well, it should be convertible as long as we can recreate the source gguf. Not sure how realistic that is, but it could be a one-liner at the right place (modulo lots of testing and fails)

llama has been updated

llama has been updated

Thanks a lot! :D

Well, it should be convertible as long as we can recreate the source gguf. Not sure how realistic that is, but it could be a one-liner at the right place (modulo lots of testing and fails)

As far as I'm aware, the information required for the MoE NaN fix simply doesn't exist in the old imatrix file format and so won't magically exist if we convert it to the new file format, forcing us to recompute the imatrix if we want to fix low bits-per-weight quants for old MoE models.

ah, so the converter is... essentially useless.

/llmjob/llama.cpp-cuda512/src/llama-context.cpp:804: GGML_ASSERT(cparams.n_ubatch >= n_tokens && "encoder requires n_ubatch >= n_tokens") failed

I assume that is another case where multiple chunks per batch is not working? I think llama-imatrix should probably default to one chunk per batch.

Update: seems to affect all llada models and some others

yup, i've hardcoded a default of -b 512 for llama-imatrix, seems to fix a bunch of models

the PLM-1.8 imatrix failure looks somewhat interesting:

/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:82: CUDA error
CUDA error: the requested functionality is not supported
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at /llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:1939
  cublasGemmStridedBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, src0_ptr, cu_data_type_a, nb01/nb00, nb02/nb00, src1_ptr, cu_data_type_b, s11, s12, beta, dst_t, cu_data_type, ne0, ne1*ne0, ne12*ne13, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

I wonder what functionality is missing. bf16 support? Maybe something went wrong compiling kernels? llama.cpp "recently" (months ago?) did something with which kernels are actually compiled for which archs, to save space, but I can't seem to find the issue for it atm.

It's this call:

    if (r2 == 1 && r3 == 1 && ggml_is_contiguous_2(src0) && ggml_is_contiguous_2(src1)) {
        // there is no broadcast and src0, src1 are contiguous across dims 2, 3
        // use cublasGemmStridedBatchedEx
        CUBLAS_CHECK(
        cublasGemmStridedBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N,
                ne01, ne11, ne10,
                alpha, src0_ptr, cu_data_type_a, nb01/nb00, nb02/nb00, // strideA
                       src1_ptr, cu_data_type_b, s11,       s12,       // strideB
                beta,     dst_t, cu_data_type,   ne0,       ne1*ne0,   // strideC
                ne12*ne13,
                cu_compute_type,
                CUBLAS_GEMM_DEFAULT_TENSOR_OP));
    } else {
        // use cublasGemmBatchedEx
        const int64_t ne23 = ne12*ne13;

        ggml_cuda_pool_alloc<const void *> ptrs_src(ctx.pool(), 2*ne23);
        ggml_cuda_pool_alloc<      void *> ptrs_dst(ctx.pool(), 1*ne23);

@mradermacher Please update to the latest llama.cpp version in our fork.

Our fork finally adds the --outtype source option to convert_hf_to_gguf.py. It now keeps F16, BF16 and F32 tensors in their original datatype, falls back to F16 for unknown datatypes and keeps storing tensors that should always be F32 in F32 according to the GGUF specifications. I tested this option for a few models and found no issues so far. I might even try to upstream this change as it seems really useful, so I recommend you make use of it after updating by specifying --outtype source.
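
Roughly, the selection rule looks like this (just a sketch of the behaviour described above - the function and parameter names are illustrative, not the actual convert_hf_to_gguf.py internals):

import torch
import gguf  # gguf-py, which convert_hf_to_gguf.py is built on

def pick_output_dtype(tensor: torch.Tensor, must_be_f32: bool) -> gguf.GGMLQuantizationType:
    # tensors that llama.cpp/GGUF expect in F32 (e.g. norms) always stay F32
    if must_be_f32:
        return gguf.GGMLQuantizationType.F32
    # F16, BF16 and F32 source tensors keep their original datatype
    if tensor.dtype == torch.float16:
        return gguf.GGMLQuantizationType.F16
    if tensor.dtype == torch.bfloat16:
        return gguf.GGMLQuantizationType.BF16
    if tensor.dtype == torch.float32:
        return gguf.GGMLQuantizationType.F32
    # anything else (e.g. prequantized FP8 weights) falls back to F16
    return gguf.GGMLQuantizationType.F16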

Imatrix changes:

Other important changes:

  • HunYuanDenseV1ForCausalLM support
  • Qwen3-Embedding models
  • fix tokenizer for JetBrains Mellum
  • KimiVLForConditionalGeneration (text only)
  • Glm4MoeForCausalLM support

@mradermacher If you have time, please also mark the cogito-v2-preview-llama-405B imatrix task as imatrix RPC. If you don't want to use RPC we could use /root/cogito-v2-preview-llama-405B.Q8_0.gguf but I believe the model deserves RPC.

I'm so surprised git managed to remove it from imatrix.cpp automatically during merging

If true, wouldn't that be a bug? Git is not supposed to silently remove changes on conflicts. It's the whole point of such a system to make sure changes are not silently overwritten :)

--outtype source

that seems exactly what i was asking for/what i would have expected it to do by default already. the only issues I can see are either too-big models, or issues with mixed arithmetic in kernels. Anyway, it's the default now, once llama has been updated.

updated, and using --outtype source for everything now

cogito-v2-preview-llama-405B

marked, no quant

and we have some fat glm 4.5 models in the queue

It would be nice to have some models where we can actually provide all quants for a change. Sigh. Right now, it feels like no nontrivial model survives quant creation without hacks.

PS: I've included IQ3_XXS in the quants we skip for "nolow".

PPS: especially frustrating because these big models really deserve low-bit quants.

@mradermacher Please update llama.cpp to the latest version of our fork so we can do https://huggingface.co/openai/gpt-oss-120b and https://huggingface.co/openai/gpt-oss-20b

updated and restarted

@mradermacher There will be some major upcoming infrastructural changes to rich1, nico1 and nico2

rich1

The server currently hosting rich1 will get decommissioned. We are currently in the process of migrating everything to Richard's supercomputer, currently located in Malaysia. The /tmp folder again gets excluded from backups/migrations, but I assume I can just manually rsync over its content. I don't think there is anything you need to do, but you usually prefer taking your own backups, just in case.

Here are the new specifications of Richard's new server:

  • CPUs: 2x AMD EPYC 7H12 64-Core (Total: 128 cores/256 threads)
  • RAM: 4x 64 GiB DDR4 (Total: 256 GiB) (Richard might upgrade to far more in the future)
  • GPUs: 4x NVIDIA A100-SXM4-40GB + 1x Tesla T4 currently running driver 575.51.03 and CUDA Version: 12.9 (they will be mostly used by Richard or me but could be used for imatrix should something ever happen to nico1)
  • Storage: HDD for now but plenty of it. (We can likely switch to SSD once Richard physically visits the server the next time to install some decently sized SSDs)
  • Internet: 2 Gbit/s download / 1 Gbit/s upload, relatively stable, but with a dynamic IP and forced IP rotation every few days. Port forwarding is possible.
  • Kernel: 6.8.12-11-pve (could change if Richard decides to upgrade to Proxmox 9)
  • Location: Currently located in Malaysia in an office.

nico1/nico2

  • Will be upgraded to Proxmox 9 based on Debian Trixie (13) and kernel 6.14.8-2
  • GPU drivers will be upgraded to 580.65.06, which I already checked are available using apt inside your container. Please update them as soon as you can so I can update the drivers on the host and ensure they don't cause any issues before upgrading to Proxmox 9. 580.65.06 is the first driver supporting CUDA 13.0. Driver 580.65.06 and llama.cpp built using CUDA 13.0 are already used on one of the RPC servers.

Personal health

I've been in bed with influenza since last Wednesday. First time sick in over 2 years and not an experience I want to repeat anytime soon. While I still feel quite sick, I already feel slightly better again. I hope my health will be fully restored by Tuesday.

@nicoboss kaos has been blocked by the provider (hetzner) for unrelated reasons.

I'm confident that we can get it unblocked today. In fact, they complained about ipv6 traffic and then only blocked the ipv4 address, which, unfortunately, is the main address for connectivity. I will likely not put any workarounds in place but just wait it out, unless things get really bad. this also means no quantisations for the time being.

rich1: maybe we should finish the current queue on rich1 and pause it, just in case. but, yeah, other than tunnel endpoints, the move should be pretty transparent if /tmp is also moved (which contains the queue data). and now i understand why you call it his supercomputer, it's indeed quite impressive :)

influenza: incidentally, that's why both my parents were hospitalized last year. fortunately for you, you are far younger, but it's a serious illness. all the best to you.

Rent on rich1 is until the 25th. So you can run the queue on rich1 until the 20th and then we move

kaos is back, for the time being, but i am a bit in shambles.

@RichardErkhov thanks for the heads-up :)

@mradermacher I will now start upgrading everything to Proxmox 9. Keep in mind that once the upgrade is complete you will need to upgrade the GPU drivers to 580.65.06 for imatrix computation to work again.

ok, i assume you already did and imatrix is paused waiting for this

seems to have worked with even less than the usual theatrics

not removing the imatrix pause flag since nico1 llmjob is also paused

Update: wow, that was quick

i am planning to add an audit choice like "-IQ1_S small_IQ4_NL" that allows one to just exclude specific quants

and i am not happy with the situation of practically all larger models we currently quantize failing. this feels like last year. is it the models or did something in llama break...

Thanks a lot for dealing with the GPU driver update so quickly.

ok, i assume you already did and imatrix is paused waiting for this

Yes exactly. StormPeak (nico1) and CastlePeak (nico2) are now both on Proxmox 9

seems to have worked with even less than the usual theatrics

Awesome to hear! I noticed you haven't yet rebuilt llama.cpp. Maybe some issues will arise the next time we rebuild as /usr/local/cuda-13.0/bin is not in your PATH.

not removing the imatrix pause flag since nico1 llmjob is also paused

Thanks, and sorry for the poor communication. I indeed wanted to reboot the container after you upgraded the drivers just to see if anything breaks, which nothing did, so we are good to go. I resumed nico1 and it already started doing imatrix computation.

Update: wow, that was quick

Oh yes and I obviously started writing this response when I resumed nico1 minutes after you notified me that you upgraded the GPU drivers but then got interrupted by work.

i am planning to add an audit choice like "-IQ1_S small_IQ4_NL" that allows one to just exclude specific quants

That would be quite useful. Also maybe just add an option to skip the current quant. We maybe even want to auto skip if certain known errors occur. The number of quants we have to skip is getting a bit ridiculous.

and i am not happy with the situation of practically all larger models we currently quantize failing. this feels like last year. is it the models or did something in llama break...

I really hate it as well. There are multiple reasons for this. For some, like GLM 4.5 based models, we know why. GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) is the most concerning one in my opinion, as it almost never occurred in the past and is now occurring for way too many big models. I think that one started with imatrix in GGUF file format. I really should do some research about this – I hope I will find some time for this next weekend. At least the NaN plague seems to be over now.

cuda is independent of the graphics card driver, and i will stay at 12.6 for a while, likely

That would be quite useful. Also maybe just add an option to skip the current quant.

Unfortunately, there is no such thing as a current quant. I would have to parse the progress message, which would be doable, but is too much of a hack for me yet :)

We maybe even want to auto skip if certain known errors occur. The number of quants we have to skip is getting a bit ridiculous.

I'd rather have a solution. Or insights. For any nontrivial model size, IQ3* and lower are some of the most important quant types. The number of users for a Q6 of a 1TB model is essentially nil.

GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) is the most concerning one in my opinion as it almost never occurred in the past and is now occurring for way too many big models.

That is not true. It was very common in the past, when I was still naive and opened bug reports. I was told it's my imatrix training data, and a month later, it was silently fixed upstream. It's clearly a bug in llama.cpp, and I wonder if it is a regression instead of a fully new problem.

I think that one started with imatrix in GGUF file format.

That is my suspicion, too, but if I understood @compilade correctly, there should not have been any changes yet, the different format should be completely transparent at this time (but I could be wrong). In any case, it might be a coincidence, but clearly things got much worse at the same time as imatrix gguf was introduced.

Please try to update to the latest llama.cpp when you have time for Lfm2VlForConditionalGeneration support, and queue:

cuda is independent of the graphics card driver, and i will stay at 12.6 for a while, likely

The only CUDA installation I can currently find on nico1 using find -name "nvcc" is CUDA 13.0 under /usr/local/cuda-13.0/bin, which is not added to PATH, but I'm not that familiar with your build environment so I'm probably missing something.

@nicoboss

ssh: connect to host castle.nico.re port 2108: Connection timed out

(not urgent)

ssh: connect to host castle.nico.re port 2108: Connection timed out

I know the IP changed again, as I had to reboot Threadripper, which hosts the OpenWrt router, when I upgraded to Proxmox 9. I'm updating the DNS entries now.

The only CUDA installation I can currently find on nico1

llama currently uses /usr/local/cuda-12.0 on nico1. A good method to find what a binary uses is ldd on llama-imatrix, which will show the full path of the libraries. The current build environment used on nico1 is also 12.0, see update-alternatives --display cuda

(you did scare me a bit there :)

@nicoboss llmc audit now has the ability to configure specific quants to be skipped. The syntax is not really documented as I have no good place for it, but it works like "-QUANTSPEC QUANTSPEC...", where quantspec is the exact (wrapper version) name of the quant to skip, such as "small-IQ4_NL" (I changed the status display a few weeks ago to display the wrapper quant name, not the llama quant name).

By default, this will skip imatrix quants. You can prefix a quant name with s: si: or i: to select static or imatrix quant types.

E.g.

-IQ1_S skips imatrix IQ1_S
-si:small-IQ4_NL skips both static and imatrix
-IQ1_S s:IQ3_S IQ3_XXS skips imatrix IQ1_S and IQ3_XXS and static-only IQ3_S

Update: forgot, please don't use it immediately but wait a day. I will try it out on the currently queued models.

ssh: connect to host castle.nico.re port 2108: Connection timed out

The DNS entries are now fixed and are correctly pointing to the new IP.

llama currently uses /usr/local/cuda-12.0 on nico1. A good method to find what a binary uses is ldd on llama-imatrix, which will show the full path of the libraries. The current build environment used on nico1 is also 12.0, see update-alternatives --display cuda

After further investigating this I'm even more confused:

nico1 /usr/local# find -name nvcc
./cuda-13.0/bin/nvcc
nico1 /usr/local# cd /usr/local/cuda-12.0
/usr/local/cuda-13.0
nico1 /usr/local/cuda-13.0#
nico1 /usr/local# update-alternatives --display cuda
cuda - manual mode
  link best version is /usr/local/cuda-13.0
  link currently points to /usr/local/cuda-12.9
  link cuda is /usr/local/cuda
/usr/local/cuda-12.9 - priority 129
/usr/local/cuda-13.0 - priority 130

/usr/local/cuda-12.9 does not contain nvcc.

But yes, based on ldd llama-imatrix it indeed uses the libraries inside /usr/local/cuda-12, which seems to be cuda 12.9

llama has been updated. as a side note, I recently tried out LFM2-700M in wllama of all things (for a potential translation project), and when chatting, was quite impressed with it. Very little censoring, too - couldn't even get it to refuse anything.

skipping Q2_K for Kimi-Dev did not help much, it failed again on Q4_K_S:

[ 27/ 963] blk.1.ffn_up.weight - [ 8192, 29568, 1, 1], type = bf16, converting to q4_K .. ggml_validate_row_data: found nan value at block 32

Forgot to mention, but you probably already know, the default list of quant names (that can be skipped) is defined in the quantize script, but most likely you'll just look at the status output to see which quant caused troubles. Don't use it lightly, every time we use it, we lose our mission :)

QUANTS_S="x-f16 Q4_K_S Q2_K Q8_0 Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS"
QUANTS_I="Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M IQ2_M Q6_K IQ4_XS Q2_K_S IQ1_M Q3_K_S IQ2_XXS Q3_K_L IQ2_XS Q5_K_S IQ2_S IQ1_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S"

/usr/local/cuda-12.9 does not contain nvcc.

I don't think we build anything that uses nvcc on nico1. If that changes, we can probably compile those with 13.0 (the only things we compile on nico1 are some python modules, which, I think, even come prebuilt for 12.x), as only convert_*.py will use those, which doesn't link against llama.

Most probably all this will be moot as at some point, I will upgrade cuda everywhere to use 13. Or 14. Or...

The quant-skip audit option seems to work. I also removed kimi*abliterated (only static quants for the time being) and MDM.

looks much cleaner again.

And one other difference between nolow and the -XXX is that the former overrides the quant list, while the latter uses a "skip these quants" job functionality. Probably not much difference in practice, other than a better understanding of what happened when listing the job - it's easier to parse '''iquants_skip => IQ3_XS''' than to look at the full list and see which are missing.

@nicoboss llmc audit now has the ability to configure specific quants to be skipped. The syntax is not really documented as I have no good place for it, but it works like "-QUANTSPEC QUANTSPEC...", where QUANTSPEC is the exact (wrapper version) name of the quant to skip, such as "small-IQ4_NL" (I have changed the status display a few weeks ago to display the wrapper quant name, not the llama quant name).
By default, this will skip imatrix quants. You can prefix a quant name with s:, si:, or i: to select static, static+imatrix, or imatrix quant types.

Awesome, thanks a lot. This will prove very useful if llama.cpp keeps failing for so many quants. My inability to skip just the current quant is actually why I so far always hesitated to skip them myself using nolow.

I recently tried out LFM2-700M in wllama of all things (for a potential translation project), and when chatting, was quite impressed with it. Very little censoring, too - couldn't even get it to refuse anything.

Nice to hear. I have not tried LFM2 so far as 700M is tiny, but I am really looking forward to trying LFM2-VL-1.6B as dynamic tokens for vision sound super cool. I'm really not satisfied with the vision capabilities of current models. All besides Gemma 3 are terrible so far. Maybe LFM2-VL finally has decent vision now that they improved how images are tokenized. I will try it later today.

skipping Q2_K for Kimi-Dev did not help much, it failed again on Q4_K_S:
[ 27/ 963] blk.1.ffn_up.weight - [ 8192, 29568, 1, 1], type = bf16, converting to q4_K .. ggml_validate_row_data: found nan value at block 32
I also removed kimi*abliterated (only static quants for the time being) and MDM.

Damn, another NaN. Quite sad as Kimi-Dev is a really good model. I extensively tested it and it feels like Qwen 2.5 72B but even more intelligent.

Forgot to mention, but you probably already know, the default list of quant names (that can be skipped) is defined in the quantize script, but most likely you'll just look at the status output to see which quant caused troubles. Don't use it lightly, every time we use it, we lose our mission :)

I followed that list for around 1 month because I for some reason thought it would be easier to manually see when the quant I need will get processed than to write the quant sniper script I later wrote and currently use to hardlink-snipe quants before they get deleted. I see myself rather looking at the status page and just skipping the current quant whenever I encounter an unfixable quant-specific issue like a NaN or Inf. For GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) I'm still checking if there is anything that can be done. This one is so annoying, but so far I can't think of a way for us to fix it, as similar to NaN/Inf it probably means that there are bad weights in the original model or some intermediate state which likely is NaN/Inf, and there is not really any way for us to figure out what exactly is going wrong. It only happening for imatrix quants seems to strongly indicate that applying the imatrix is causing those issues, but I don't see anything wrong with how the imatrix is getting applied. When I have time I will likely just try with the last version before GGUF-based imatrix and the first version after, just to check if that is the change that started causing all these issues. Realistically it could also just be bad luck with newly released models; especially MoE models are far more likely to have this kind of issue.
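As a sanity check along those lines, here is a minimal, purely illustrative C sketch (not llama.cpp's own validator) of scanning converted float weights for NaN/Inf before quantizing; the per-block "found nan value" message quoted above comes from llama.cpp doing essentially this kind of check:

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Minimal sketch, not llama.cpp code: count non-finite values in a
     * dequantized f32 tensor, the kind of "bad weights" suspected above. */
    static size_t count_bad_values(const float *data, size_t n) {
        size_t bad = 0;
        for (size_t i = 0; i < n; ++i) {
            if (!isfinite(data[i])) ++bad; /* catches both NaN and +/-Inf */
        }
        return bad;
    }

    int main(void) {
        float row[4] = { 0.5f, -1.25f, NAN, INFINITY }; /* toy data */
        printf("bad values: %zu of 4\n", count_bad_values(row, 4));
        return 0;
    }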

I don't think we build anything that uses nvcc on nico1. If that changes, we can probably compile those with 13.0 (the only things we compile on nico1 are some python modules, which, I think, even come prebuilt for 12.x), as only convert_*py will use those, which doesn't link against llama.

Ah, that's why it works: for everything except CUDA 13, only the runtime but not the development tools required to actually build something seems to be installed. You building llama.cpp somewhere else obviously makes the missing nvcc a non-issue.

Most probably all this will be moot as at some point, I will upgrade cuda everywhere to use 13. Or 14. Or...

As long as it works, I don't mind too much. I don't think there will be much of a performance difference. I'm now using CUDA 13 everywhere, so all RPC servers will be built using CUDA 13, but even that shouldn't matter.

The quant-skip audit option seems to work.

Great, so I will start using it tomorrow in case another model has those stupid issues.

looks much cleaner again.

It indeed does. This evening I will also do the final stuck RPC task, and now that the big GLM models are done I probably have the storage required to finally do one of the 3 blocked large models.

For GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) I still see if there is anything that can be done.

Maybe this?

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 94f6405ca..0bc2c4481 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -4266,7 +4266,7 @@ static void quantize_row_iq1_s_impl(const float * GGML_RESTRICT x, void * GGML_R
                     sumw[j+1] = sumw[j] + weight[i];
                 }
             }
-            float best_score = -FLT_MIN, scale = max;
+            float best_score = -FLT_MAX, scale = max;
             int besti1 = -1, besti2 = -1, best_shift = 0;
             for (int i1 = 0; i1 <= block_size; ++i1) {
                 for (int i2 = i1; i2 <= block_size; ++i2) {
@@ -4442,7 +4442,7 @@ static void quantize_row_iq1_m_impl(const float * GGML_RESTRICT x, void * GGML_R
                 idx[2*j] = j;
             }
             qsort(pairs, block_size, 2*sizeof(float), iq1_sort_helper);
-            float best_score = -FLT_MIN, scale = max;
+            float best_score = -FLT_MAX, scale = max;
             int besti1 = -1, besti2 = -1, best_k = -1;
             // 0: +, +
             // 1: +, -

I didn't test this yet, but that's something potentially wrong I've noticed.
FLT_MIN is the value right next to 0, and if multiplied with a sumq2 smaller than 0.5 (can happen with some imatrix), and the squared sumqx is equal to 0 (can happen with BF16 weights), that could lead to the assertion.

I'm currently trying to explore how the low-bit i-quants could be reimplemented with algorithms similar to those in https://github.com/ggml-org/llama.cpp/pull/12557. Not quite there yet.

I so far always hesitated to skip them myself using nolow.

Very good - nolow is specifically for the case of skipping quants due to missing imatrix tensors, i.e. the quants that llama thinks are "low bit quants that need imatrix".

This one is so annoying, but so far I can't think of a way for us to fix it, as similar to NaN/Inf it probably means that there are bad

Pretty sure this is a llama.cpp bug, and not a new one.

Update:

compilade is probably on to sth. Would be great news if this were a relatively simple to fix, or workaround, bug.

@compilade if best_score is what its name suggests, then probably whoever wrote that wanted the minimum float, and thus -FLT_MIN would be an easy thinko to end up with.

If best_score is not used for anything but selecting the best quantisation, then I guess the assert means that all found solution scores are negative or zero. If negative scores can't happen, anything less than 0 might even be a working initialiser. Nope, I refuse to look at the code and this is just pure speculation.

Ok, looked at the code, not much wiser, but I would feel confident trying out this change. I can't see how it can break anything.

@compilade how'd you even notice that while working on something else...

although, to be nitpicky, FLT_MIN is not the next value to 0, just the next normalised one (i.e. -FLT_MIN * 1e-6 or so will still be non-zero) . I think if -FLT_MIN was really meant to be right, then why not use 0? Clearly, the original author meant to use some other sentinel value. My open question would be if it is guaranteed that a best_score of -FLT_MAX will never overflow the calculations either? If score can never be < 0, then maybe -1 would be more suitable. Unless it can't overflow anyway. And even if it overflows, it would just end up being -inf, which would also be fine.

@nicoboss I will try it out on the three models currently on marco.

@nicoboss It doesn't assert. Whether the result is garbage is another question, but I am willing to risk it. Maybe @compilade or you could follow this up with upstream, have somebody else look at it?

@compilade how'd you even notice that while working on something else...

@mradermacher
I was reading the IQ* quantization code anyway to replace it with something saner, so I was first studying how it works, and I noticed this part of IQ1* doesn't look right.
It's a similar pattern of finding the best score as the other quantization functions, but it's different because it doesn't gracefully handle when nothing is found.

Another weird thing I've noticed is that for IQ2* quants, the search assumes the grid has values in {1, 3, 5}, while they actually are in {1, 3.125, 5.375}. Might be partly why there are a bunch of arbitrary fudge factors all around that part of the codebase.

although, to be nitpicky, FLT_MIN is not the next value to 0, just the next normalised one (i.e. -FLT_MIN * 1e-6 or so will still be non-zero)

You're right, I assumed it was subnormal but didn't check. Still, -FLT_MIN * 5.9604645e-08 and anything smaller will be zero (that other number is from the smallest non-zero subnormal divided by twice the smallest normal F32 number).

If score can never be < 0, then maybe -1 would be more suitable. Unless it can't overflow anyway. And even if it overflows, it would just end up being -inf, which would also be fine.

It's only the initial value, which is overwritten as soon as a bigger sumqx * sumqx than best_score * sumq2 is found. I think the range of acceptable initial values for best_score is likely anything from -INF to -1.0f, since a non-zero sumq2 multiplied by -1.0f shouldn't round to 0, and -INF should have the behavior of being smaller than anything else (and so overflows of -FLT_MAX * sumq2 shouldn't matter).
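To make the failure mode concrete, a minimal standalone sketch (the sumq2/sumqx values here are made up, and the comparison only mirrors the sumqx * sumqx > best_score * sumq2 update described above, not the full quantizer):

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        /* hypothetical inputs: a tiny weight sum (well below ~6e-8) and an
         * all-zero row, as described above */
        float sumq2 = 5e-8f;
        float sumqx = 0.0f;

        float bad_init  = -FLT_MIN; /* initialiser before the fix */
        float good_init = -FLT_MAX; /* compilade's fix            */

        /* -FLT_MIN * 5e-8 underflows to -0.0f, and 0.0f > -0.0f is false,
         * so no candidate is ever accepted and besti1/besti2 stay -1 */
        printf("accepted with -FLT_MIN init: %d\n", sumqx * sumqx > bad_init  * sumq2);
        /* -FLT_MAX * 5e-8 is hugely negative, so the first candidate wins */
        printf("accepted with -FLT_MAX init: %d\n", sumqx * sumqx > good_init * sumq2);
        return 0;
    }

With the old initialiser nothing is ever accepted and the -1 sentinels survive until the assertion; with -FLT_MAX the very first candidate is taken.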

Maybe @compilade or you could follow this up with upstream, have somebody else look at it?

I'll open a pull request, with also a fix for NANs in make_qp_quants from https://github.com/ggml-org/llama.cpp/pull/11773#discussion_r2066664121.

I'll open a pull request

Done in https://github.com/ggml-org/llama.cpp/pull/15379.

cogito-v2-preview-llama-405B RPC imatrix computation is now running.

llama_model_load: error loading model: vector::_M_range_check: __n (which is 4) >= this->size() (which is 4)

llama.cpp has a super unintuitive error message to tell the user that some RPC servers aren't reachable. It basically tells you that 4 >= 4, where 4 is the number of RPC servers. The only way I figured it out is because above it tells you Failed to connect to <IP_OF_RPC_SERVER>. In any case I will not forget this one as it made me seriously think about its meaning.

Done in https://github.com/ggml-org/llama.cpp/pull/15379.

Thank you so much for finding a solution to one of our most serious issues.

FLT_MIN is the value right next to 0, and if multiplied with a sumq2 smaller than 0.5 (can happen with some imatrix), and the squared sumqx is equal to 0 (can happen with BF16 weights), that could lead to the assertion.

That is such a nice find. I never even thought of that! It's quite remarkable that you even managed to find it. I'm having trouble even fully understanding the intended logic behind this code.

https://github.com/ggml-org/llama.cpp/pull/12557 (ggml-quants : weighted rounding algorithms with cumulative search)

I'm so excited for that one. I remember it very well from when it was first created. It is super cool that just by changing how you round it is possible to heavily improve the quality of quants without any disadvantages.

@nicoboss It doesn't assert. Whether the result is garbage is another question, but I am willing to risk it. Maybe @compilade or you could follow this up with upstream, have somebody else look at it?

What awesome news. The result will almost certainly not be garbage. If it doesn't trigger the assert, the model should be fine based on my limited understanding of that code. I merged it into our mradermacher branch as I consider this change relatively safe and it fixes a massive number of issues we are currently experiencing. I recommend you update to the latest version of our branch as soon as you feel comfortable adopting this change. Let's closely follow the official PR. Maybe we should requeue the models that failed because of this issue so we can provide low-bits-per-weight quants for them.

Edit: I merged the ggml-quants : avoid division by zero in make_q3_quants change as well.

Edit: I merged the ggml-quants : avoid division by zero in make_q3_quants change as well.

Well, I am quite conservative, but compilade's fix is small enough and with apparently clear consequences, so normally, I'd say let's not overdo it. Looking at that fix, it seems obvious enough as well, though.

Anyway, llama is compiling and will be updated in ~15mins or so.

(ggml-quants : weighted rounding algorithms with cumulative search) I'm so excited for that one. I remember it very well from when it was first created.

At the danger of causing discomfort again, after re-reading, I see that pattern where I pointed to ikawrakow's extensive comments on these changes so they are not forgotten, and compilade seemingly ignored almost all of them except one. I hope it ends well nevertheless, but I think ikawrakow's input should not be ignored, even if ultimately it is either wrong or irrelevant.

Well, I am quite conservative, but compilade's fix is small enough and with apparently clear consequences, so normally, I'd say let's not overdo it. Looking at that fix, it seems obvious enough as well, though.

It just got merged into mainline llama.cpp without any additional changes so we are good. This is a really impactful change in the sense that we can now do all the quants that previously failed with this error.

All hail compilade, then. Reported mar 2024, nobody gave a shit, but him, even if accidentally. That's noteworthy, even praiseworthy :)

@mradermacher reminder that rich1 needs to be stopped on the 20th for the transition. so finish up your jobs so I can take it down. if you need more days, the rent ends on the 25th, so you may run it until that day

I have to say I'm super happy about the GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) fix. I requeued Qwen3-Coder-480B-A35B-Instruct today before deleting its source GGUF, and it successfully did all the previously failed quants, which already include the following, with more to come:

  • i1-IQ2_M
  • i1-Q2_K_S
  • i1-IQ1_M
  • i1-IQ2_XXS

should have put it here in the first place to make sure you can see it:

@nicoboss vision arch list is automated now, the list originally used is /llmjob/share/convert_hf_to_gguf_models.txt, but for unnecessary speed reasons the preprocessed version in /llmjob/share/convert_hf_to_gguf_models.pm is loaded by llmjob. As always, mostly untested in production.

@RichardErkhov wow, I admit I totally forgot. I've "disabled" rich1, and it should be finished (only 3 jobs remaining). Thanks for the heads-up

@nicoboss I don't think S1-Base-671B and cogito-v2-preview-deepseek-671B-MoE should be besteffort?

@mradermacher hi again, as soon as you are done with rich1 and ready for a backup, please just pause it or even shut down the container. I will do everything else needed with nico

@nicoboss https://huggingface.co/chutesai/Qwen3-235B-A22B-Instruct-2507-1M what are your thoughts about this model? the only thing we have in it's favour is the "-1M"

@RichardErkhov done, it's shut down

thank you, moving now =)

@nicoboss https://huggingface.co/chutesai/Qwen3-235B-A22B-Instruct-2507-1M what are your thoughts about this model? the only thing we have in it's favour is the "-1M"

It's a hash identical copy of https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 so let's not waste any resources on this garbage.

Please mark DeepSeek-V3.1-Base for Q8 RPC imatrix computation so I can start it once static quants are done. That model is super exciting. It's quite a massive improvement compared to DeepSeek V3 but also slightly more censored.

marked!

And what do you think of https://huggingface.co/deepseek-ai/DeepSeek-V3.1? might be a trick question :)

And what do you think of https://huggingface.co/deepseek-ai/DeepSeek-V3.1? might be a trick question :)

It's amazing. Please do the same for it as well and mark it for Q8 RPC imatrix computation.

string_parse_kv_override: malformed KV override 'general.url=str:https://huggingface.co/mradermacher/Qwen3-42B-A3B-2507-Thinking-Abliterated-uncensored-TOTAL-RECALL-v2-Medium-MASTER-CODER-i1-GGUF', value cannot exceed 127 chars

wow, 127 is ridiculously small, especially for urls. I considered just increasing the limit (gguf only has 64-bit lengths for strings if i remember correctly), but the limit is not even a symbol, it's all over the code. stellar code (and strncpy is used everywhere, too):

    char    val_str[128];

    } else if (strncmp(sep, "str:", 4) == 0) {
        sep += 4;
        kvo.tag = LLAMA_KV_OVERRIDE_TYPE_STR;
        if (strlen(sep) > 127) {
            LOG_ERR("%s: malformed KV override '%s', value cannot exceed 127 chars\n", __func__, data);
            return false;
        }
        strncpy(kvo.val_str, sep, 127);
        kvo.val_str[127] = '\0';

            strncpy(kvo.val_str, imatrix_file.c_str(), 127);
            kvo.val_str[127] = '\0';
            strncpy(kvo.val_str, imatrix_datasets[0].c_str(), 127);
            kvo.val_str[127] = '\0';
   [and many more]

@nicoboss the only options I see are patching (the length in the header file and the length check/copy, the other occurrences are probably harmless and just cut off the value, e.g. in quantize), or simply dropping urls. (It's not the first time I've run into this limit, and last time I just dropped the key forever.)

update: and why are these not simply std::string's? all this strncpy/c-string stuff surely will bite them at one point. or, better yet, why isn't it the type they use to represent gguf values elsewhere that already must exist.
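For the patching option, a purely illustrative sketch of the same check/copy pattern with a single constant, so raising the limit means changing one line rather than hunting down every hard-coded 127/128 (the names and the 512 limit are hypothetical, this is not llama.cpp's actual code):

    #include <stdio.h>
    #include <string.h>

    #define KV_VAL_STR_MAX 512 /* hypothetical new limit; upstream uses 128 */

    struct kv_override { char val_str[KV_VAL_STR_MAX]; };

    static int kv_set_str(struct kv_override *kvo, const char *sep) {
        if (strlen(sep) > KV_VAL_STR_MAX - 1) {
            fprintf(stderr, "value cannot exceed %d chars\n", KV_VAL_STR_MAX - 1);
            return 0;
        }
        strncpy(kvo->val_str, sep, KV_VAL_STR_MAX - 1);
        kvo->val_str[KV_VAL_STR_MAX - 1] = '\0';
        return 1;
    }

    int main(void) {
        struct kv_override kvo;
        /* the >127-character URL from the error above now fits */
        const char *url = "https://huggingface.co/mradermacher/"
            "Qwen3-42B-A3B-2507-Thinking-Abliterated-uncensored-"
            "TOTAL-RECALL-v2-Medium-MASTER-CODER-i1-GGUF";
        printf("accepted: %d\n", kv_set_str(&kvo, url));
        return 0;
    }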

both deepseek are marked for rpc

@nicoboss any thoughts on the MXFP4_MOE quant type? seems not very relevant for us

a new failure (llama-imatrix):

/llmjob/llama.cpp-cuda512/ggml/src/ggml-cpu/ops.cpp:5280: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed

@mradermacher Something is seriously wrong with the system. All quantize jobs are failing with systemd-run: unrecognized option '--expand-environment=no'. This already affects DeepSeek-V3.1, capybara-math-30B, QiMing-Pantheon-Qwen3-14B, and Qwen3-4B-OpenR1Math-MARL.

I'm currently running DeepSeek-V3.1-Base RPC imatrix computation so nico1 won't be idle.

Regarding rich1 I'm currently working with Richard on creating some decent storage setup for it.

Edit: Thanks, it seems to be fixed now! :D

Yeah, one server was upgraded to debian trixie, and things went downhill from there. It indeed should be fixed, at the cost of a python downgrade to 3.10 (from 3.11). I am surprised python has zero backwards compatibility, but it is what it is.

And we are not in a hurry - there have never been as few models coming out as in the last month or so. Quite scary :(

(ideal time to tackle 2022 models, but I am also rather busy. and good to hear you being lively and not ill anymore, I presume :)

@nicoboss the llama.cpp upgrade tonight forced an unexpected upgrade to cuda 13. it should work, but I haven't had the time to test it. if things fail, don't panic and always carry your towel. i will fix it eventually.

(also, only llama uses cuda 13, python still uses cuda 12.4, as there are no cuda 13 wheel packages for torchvision &c. it should not matter)

@mradermacher CUDA 13 works perfectly fine and yes, the CUDA version for python doesn't matter, but something on marco is going very wrong which causes all models scheduled to that worker to fail:

/llmjob/llama.cpp/build/bin/llama-quantize: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/llama-quantize)
/llmjob/llama.cpp/build/bin/llama-quantize: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libllama.so)
/llmjob/llama.cpp/build/bin/llama-quantize: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libggml-base.so)
/llmjob/llama.cpp/build/bin/llama-quantize: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libggml-rpc.so)
job finished, status 47
job-done<0 Multiverse-7B imatrix 47>

error/47 1/24,Q2_K
https://huggingface.co/InfiniAILab/Multiverse-7B
{[[PROGRESS:dryrun...]]}
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/llama-cli)
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libllama.so)
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libggml-base.so)
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libggml-rpc.so)
dryrun failed
job finished, status 57
job-done<0 Seed-OSS-36B-Base-woSyn noquant 57>

error/57 dryrun...
https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base-woSyn
{[[PROGRESS:dryrun...]]}
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/llama-cli)
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libllama.so)
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libggml-base.so)
/llmjob/llama.cpp/build/bin/llama-cli: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /llmjob/llama.cpp/build/bin/libggml-rpc.so)
dryrun failed
job finished, status 57
job-done<0 Seed-OSS-36B-Instruct-abliterated noquant 57>

error/57 dryrun...
https://huggingface.co/nicoboss/Seed-OSS-36B-Instruct-abliterated

Edit: Oh nice. You already fixed it. Thanks a lot!

yeah, the problem on marco was a bit more complicated, so I left him a message yesterday. in the end result, we simply upgraded his box, which was overdue anyway :)

@nicoboss while doing a backup of nico1 I found 1.4TB of gguf files in /root, what's the deal with them?

i'm upgrading nico1 to trixie. it's not strictly needed, but I'm in the flow

@nicoboss while doing a backup of nico1 I found 1.4TB of gguf files in /root, what's the deal with them?

Models originating from QuantSniper hardlinks I'm either currently testing or plan on trying within the next few days before either deleting them or moving them to HDD. Usually they should only be there for maybe a day or two, but I'm currently having quite a massive backlog of big models due to many unfortunate reasons. I'm currently testing Qwen3-Coder-480B-A35B-Instruct.i1-Q4_K_M.gguf, which is why imatrix has been paused for the past few hours. If the storage situation gets critical I always move or delete them immediately. In any case they should all be gone within the next week.

i'm upgrading nico1 to trixie. it's not strictly needed, but I'm in the flow

Awesome. That's super cool. I love trixie. It's what I currently have on the host.

the storage situation is not critical, I just wanted to make sure somebody takes responsibility for cleaning up at some point

Awesome. That's super cool.

well, except for login, which is still from bookworm, because the util-linux version in trixie fails to login most of the time :) and systemd telling people to delete their /sbin, which obviously would break debian. and all fonts having different metrics. oh, and it disabled my monitor outputs via remote-x while upgrading (yeah). and inetutils-inetd kills off screen sessions on logout. but otherwise, trixie was a pleasant upgrade. unfortunately, debian is normally so good that this makes it one of the worst upgrades :)

turns out the old parsing to find models was not robust enough, and I found 34000 potential models we missed - mostly mistral

@nicoboss now that we have a fuller queue, the 1.4TB missing budget has actually bitten us, i'll adjust the budget and see if that works, though

update: less than 300G is free atm, though, which can quickly cause failures when quants are quantized. i will probably pause deepseek to help.

update 2: i feel in the long run, though, cheating on the scheduler by using the tasty and presumably unused free space will just cause problems again and again, forcing me to manually clean up the mess. I think we need another solution than me accidentally stumbling over these things. If you need the space, we should permanently reduce the available space on nico1.

update 3: in fact, 1.5TB of disk space were eaten up within a few hours due to a combination of imatrix being paused for the evening and uploads randomly bunching up.

update 4: i've adjusted priority slightly to hopefully free more space quickly, at the moment, nico doesn't have enough free space to continue quanting at full speed.

turns out the old parsing to find models was not robust enough, and I found 34000 potential models we missed - mostly mistral

Wow that is A LOT. A VERY lot. Hopefully we get the new rich1 running again soon. During nice weather I might also turn on nico2 again.

@nicoboss now that we have a fuller queue, the 1.4TB missing budget has actually bitten us, i'll adjust the budget and see if that works, though

It's currently only 1 TB as I already tested some of them yesterday/today. In any case I'm currently moving the remaining three GGUFs to HDD so all storage should be available again in a few hours. You will love new rich1 as there you have more storage than on nico1.

less than 300G is free atm, though, which can quickly cause failures when quants are quantized. i will probably pause deepseek to help.
i feel in the long run, though, cheating on the scheduler by using the tasty and presumably unused free space will just cause problems again and again, forcing me to manually clean up the mess.

Maybe instead of telling the scheduler how much space is supposed to be there, it could look at how much is actually available before pushing models to hosts. I guess I should also just make my QuantSniper script automatically move the models to HDD.

Wow that is A LOT. A VERY lot. Hopefully we get the new rich1 running again soon. During nice weather I might also turn on nico2 again.

Ahh, it's all background jobs, really, don't panic. But it is indeed a lot, because there is a much higher fraction of interesting models in there than during normal selection, because the boring models have mostly been filtered out.

It's currently only 1 TB as I already tested some of them yesterday/today.

It is kind of clearing up. It was an unfortunate combination of priorities, models bunching up for imatrix'ing, and most importantly, we had >1TB of not-yet-uploaded quants at some point, which have cleared up.

Also, I am fully responsible ever since I said it's OK, just in case, but it is frustrating. It looks as if there is a lot of unused space, free to use, but it's an illusion.

Maybe instead of telling the scheduler how much space is supposed to be there he could look at how much is actually available before pushing models to hosts.

We've been through this. You don't know how to do that, and neither do I. The scheduler would have to make a full disk scan every minute to see which files are there and also have an algorithm to decide whether these files are owned by itself or by somebody else. Even then, we would need time travel, because the scheduler cannot predict the situation in the future, or which files will be added.

Since you seem to disagree, answer this question: df tells you there is now F free space. How do you tell how much of that will be available for use for the next day? How much of that will be used for uploads, for ggufs from other servers? What's the upload speed in a few hours? We don't know how much the quants will use up, we don't know how much the models use up and will use up, etc.

The only way this can work is when the scheduler knows how much it can use, and this is what is done right now. We could dynamically reduce that, e.g. I could make a file where you can put some number that then decreases the budget, but you would have to reserve space a few hours before using it. And the space situation is already tight (which is why larger models often reside on other disks), and reducing it further will just end up as on rich1, where we were severely limited in what we could quantize.

But you decide, it's your hardware.

Or tell me the magical algorithm to find out how much space is actually available. df won't tell you that.

PS: the scheduler already has various emergency modes when it runs out of space despite nominally having budget, but these are emergency modes. they saved us from failing quants today, at the cost of about an hour of not quantizing.
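To make the contrast concrete, a purely illustrative sketch (not the actual llmjob logic, and the numbers are made up): with budget accounting, the scheduler only tracks what it has itself committed, so it never depends on a df snapshot that may be stale by the time uploads, incoming ggufs and other nodes' files land on the disk.

    #include <stdio.h>

    /* Illustrative only: the scheduler's view of one node's space. */
    typedef struct {
        long long budget_gb;    /* space the node owner granted the scheduler */
        long long committed_gb; /* space already promised to queued/running jobs */
    } budget_t;

    static int can_accept(const budget_t *b, long long job_gb) {
        return b->committed_gb + job_gb <= b->budget_gb;
    }

    int main(void) {
        budget_t node = { .budget_gb = 4000, .committed_gb = 3200 }; /* made-up numbers */
        printf("accept 600G job: %d\n", can_accept(&node, 600)); /* 1: fits the budget */
        printf("accept 900G job: %d\n", can_accept(&node, 900)); /* 0: would exceed it */
        return 0;
    }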

It's currently only 1 TB

That explains things, I reduced the budget by 1TB and wondered why that seemed to work out (i.e. where were the extra 400G from :)

btw., the reason why so many models were skipped was that the method to "parse" supported archs was basically grep ModelBase.register, which worked fine in the past, but degraded. Not enough to be super-noticeable, but I have had a list of models for some months now that I wanted to investigate, on why I seemingly overlooked them.

But I didn't have time to look at them till now. It came in handy that I accidentally added --print-arch output parsing a week or so ago. That is now used.

Oh, and I am only through 5000 of 34000 models, although I will likely increase my threshold for model selection the more I go back in time.

34000 models is a bit more than what I normally review in a month btw.

And yes, we could permanently e.g. decrease the budget by 1TB for example, and then allow for the occasional scheduling issue (that would hopefully not require manual intervention). It would still break under the same situation when an extra 1.4T would be used though, for example.

I tried turning on nico2 again but it doesn't seem to currently be possible. Even if I reenable it, it doesn't appear on the status page, so you must have disabled it in a different way. Maybe it requires some updates? I already updated the host to Debian trixie but nico2 is still on Debian bookworm.

Ahh, it's all background jobs, really, don't panic. But it is indeed a lot, because there is a much higher fraction of interesting models in there than during normal selection, because the boring models have mostly been filtered out.

Damn, wow, that will take forever, but it's great we do this before it turns late autumn/winter, as in the current season there is still plenty of sun.

It is kind of clearing up. It was an unfortunate combination of priorities, models bunching up for imatrix'ing, and most importantly, we had >1TB of not-yet-uploaded quants at some point, which have cleared up.

Don't worry about it anymore. They are now all on HDD.

mv: error reading 'AI21-Jamba-Large-1.7.i1-Q5_K_M.gguf': Input/output error

Another reason not to keep them on SSDs known to suffer bit rot. I keep forgetting that there is a reason you are using those particular SSDs. They are perfect for your specific use-case but not for much else. In case you wonder they are currently sitting at 40% and 44% wearout and so will be good for another 1.5 years of quantization at the current speed. Let’s hope we make it that far so I can finally upgrade them:

Percentage Used:                    40%
Data Units Read:                    6,782,996,130 [3.47 PB]
Data Units Written:                 3,756,628,810 [1.92 PB]

Percentage Used:                    44%
Data Units Read:                    6,403,960,042 [3.27 PB]
Data Units Written:                 3,674,643,656 [1.88 PB]

PS: the scheduler already has various emergency modes when it runs out of space despite nominally having budget, but these are emergency modes. they saved us from failing quants today, at the cost of about an hour of not quantizing.

That is good enough.

That explains things, I reduced the budget by 1TB and wondered why that seemed to work out (i.e. where were the extra 400G from :)

They are now all gone so you can increase it again. I might temporarily use up to around 400 GB for quant sniper hardlinks, but in the future I will quickly move them to HDD, so I don't think we should reserve any space for that.

btw., the reason why so many models were skipped was that the method to "parse" supported archs was basically grep ModelBase.register, which worked fine in the past, but degraded. Not enough to be super-noticable, but I had a list of models for some months now that I wanted to investigate, on why I seemingly overlooked them.
But I didn't have time to look at them till now. It came in handy that I accidentally added --print-arch output parsing a week or so ago. That is now used.

Interesting. I might need to switch to using --print-arch in that case as well. I'm currently using r'@ModelBase\.register\("([^"]+)"(?:, "([^"]+)")*(?:, "([^"]+)")*\)' so I might have the same issue.

Oh, and I am only through 5000 of 34000 models, although I will likely increase my threshold for model selection the more I go back in time.

You so far queued around 500, most of which are static-only, so it is not that bad. So only around 3000 models to come. We can handle that, especially once we get rich1 on the supercomputer working. The storage setup on the supercomputer is more complicated than anticipated, the server currently has some internet issues, and Richard is sick, so it takes a bit longer than anticipated, but we are working on it and should hopefully get everything working within the next few days.

34000 models is a bit more than what I normally review in a month btw.

I didn’t realize that you review 34K models per month. Wow you put so much work and effort into this!

And yes, we could permanently e.g. decrease the budget by 1TB for example, and then allow for the occasional scheduling issue (that would hopefully not require manual intervention). It would still break under the same situation when an extra 1.4T would be used though, for example.

It's fine. Just keep it the way it is or slightly decrease it, but definitely not by 1 TB. That situation was super rare and caused by many unfortunate factors coming together.

I tried turning on nico2 again but it doesn't seem to currently be possible. Even if I reenable it it doesn't appear on the status page so you must have disabled it in a different way. Maybe it requires some updates?

I've indeed disabled nico2 (and rich1) because I was annoyed at the content indicator (and by now it also requires some updates). I'll update it and activate it.

Although I would be happy with letting things chug along, depending on how things develop.

Another reason not to keep them on SSDs known to suffer bit rot. I keep forgetting that there is a reason you are using those particular SSDs.

It's again shocking. Not even the crappiest sandisk ssds I have ever showed bitrot, and some are well into their 500% usage. Wow, just wow.

Interesting. I might need to switch to using --print-arch in that case as well

Yup, no MistralForCausalLM, and no LLama 2 either, for example.

I didn’t realize that you review 34K models per month. Wow you put so much work and effort into this!

Maybe "review" is a bit overstated, but I look at about 30k names per month for this. It's kind of interesting, but I admit it did wear off a bit :-) I am secretly playing with using an llm for more pre-sorting :)

phew, kaos is not in great shape, the rsync to update nico2 has been running for almost 10 minutes now. anyway, I'll update nico2, do a test quant, then shut it down, and it should wake up tomorrow again as usual.

hmm, weird stuff, the download process on nico2 exited twice, with no error message.

also, storage has changed somehow, the config says to use 1600GB (so there had to be more), but now only 1400GB are available on the disk. i'll try to adjust. not sure what happened.

aha, trixie helpfully force-mounted tmpfs on /tmp despite this being disabled on bookworm, I've even written this down and forgot about it. that explains the download, but not the lack of an error message, and the 45gb, but not the overall storage decrease.

no complaining, just wondering, because silent changes like that are gonna bite me again :)

anyway, i've reduced the overall budget by 600GB for that. unfortunately, that puts us into the same storage problem area as rich1, but maybe that won't matter. i've configured nico2 to 70B models or smaller as a result.

no, strange things happen, the convert process was oom-killed. maybe it's the one model that requires more than 32gb?

yup, seems to be the model then. OptD_Phi4plus is the only model in the last year or so that didn't convert with 32GB ram, or somehow something is enforced from the outside on nico2.

anyway, we'll see more tomorrow morning, i'll shut it down now.

yup, seems to be the model then. OptD_Phi4plus is the only model in the last year or so that didn't convert with 32GB ram, or somehow something is enforced from the outside on nico2.

I just checked dmesg on CastlePeak and it indeed was due to it exceeding the 32 GiB cgroup limit we set. The host itself still had around 200 GiB of free memory:

[32972.884420] Memory cgroup out of memory: Killed process 268222 (pt_main_thread) total-vm:7308188kB, anon-rss:179940kB, file-rss:903816kB, shmem-rss:0kB, UID:100000 pgtables:2880kB oom_score_adj:0

@nicoboss the shutdown script (ssh -q [email protected]) also no longer shuts down nico2

anyway, we'll see more tomorrow morning, i'll shut it down now.

I just realized that I still had the shutdown handler disabled. I enabled it now.

Edit: Nice we found out it no longer works at the same time :D
Edit2: I ran the shutdown handler, it worked and CastlePeak is now turned off.

OptD_Phi4plus

Guess we won't quantize that then :)

Nice we found out it no longer works at the same time :D

this time I got proof first111!

(usually I stay quiet when I find and fix something in the hope you won't stumble over it, but you have a track record of finding every problem quickly)

I ran the shutdown handler, it worked and CastlePeak is now turned off.

It should switch on at 7am and do its thing, but I'll be there anyway if it doesn't.

This evening I used the host-pause script, rebooted the host, and then used the host-resume script like I did many times in the past. However, this time the resume script doesn't work, leaving nico1 disabled. I already tried many times but am always getting the same error.

llmjob worker nico2 disabled, skipping.
cmd_push 3
/llmjob/wdir/llmjob_slave.json~: No such file or directory at /llmjob/share/bin/llmjob line 228.
 remote error from '10.28.1.6', propagated at /llmjob/share/bin/llmjob line 553, <GEN6> line 3.
        main::__ANON__() called at /llmjob/share/bin/llmjob line 1451
        main::cmd_push() called at /llmjob/share/bin/llmjob line 1963
llmjob worker nico2 disabled, skipping.
cmd_push 4
/llmjob/wdir/llmjob_slave.json~: No such file or directory at /llmjob/share/bin/llmjob line 228.
 remote error from '10.28.1.6', propagated at /llmjob/share/bin/llmjob line 553, <GEN6> line 3.
        main::__ANON__() called at /llmjob/share/bin/llmjob line 1451
        main::cmd_push() called at /llmjob/share/bin/llmjob line 1963

Guess we won't quantize that then :)

I can just manually provide the GGUF for it.

this time I got proof first111!

I first waited a bit as I thought you might not have yet sent the shutdown command until I realized that I completely forgot to reenable the shutdown handler. I made it so I can toggle it so the automatic shutdown will only occur if we don't need CastlePeak for anything. For example, during imatrix RPC computation it shutting down would be disastrous. While nico2 was not in use I always just kept it disabled so we don't accidentally turn it off should we use it for anything else. I'm usually physically located in the same room as StormPeak and CastlePeak. I'm often even doing my job on StormPeak as I prefer it over my company notebook.

(usually I stay quiet when I find and fix something in the hope you won't stumble over it, but you have a track record of finding every problem quickly)

I'm usually quite closely monitoring things so I'm quite likely to spot issues.

It should switch on at 7am and do its thing, but I'll be there anyway if it doesn't.

Let's see if it works. Automated boot should for sure work. I'm a bit less confident about the script that gets executed on boot to set things like CPU frequency and ARC cache but it almost certainly should work as I only changed a single line since I last tested it.

Shit /tmp is gone. That explains it:

tmpfs                    252G  140K  252G   1% /tmp

aha, trixie helpfully force-mounted tmpfs on /tmp despite this being disabled on bookworm, I've even written this down and forgot about it. that explains the download, but not the lack of an error message, and the 45gb, but not the overall storage decrease.

That garbage got me. I already feared we lost /tmp like we once did on rich1. Luckily a simple umount /tmp fixed it and brought back the real /tmp, after which the host-resume script worked perfectly fine.

Not sure if I'm crazy, but I remember there were some imatrix tasks in the imatrix queue from back when I paused the host. I think they are gone now, as I can't see them inside the queue anymore and I also can't find any imatrix log showing that they were computed.

I tried to run imatrix computation of InternVL3_5-241B-A28B and, because I was unhappy with how mmap didn't cache it into RAM, tried mlock. Turns out the model is 471 GB and so, together with the 5 quantisation tasks running at the time, caused the host to run out of memory and crash reboot. This again caused /tmp to be hidden by tmpfs and, more importantly, somehow brought the imatrix task into a very strange state where the status page still shows it as running instead of failed despite it no longer running. Even llmc force-restart-imatrix InternVL3_5-241B-A28B is unable to reset it. Please reset it and mark that one as blocked for now. We will have to pause the quantisation tasks before starting it.

While we are at InternVL3_5-241B-A28B, do you have any idea why it isn't marked as a vision model? Our llama.cpp version does support mmproj extraction for it. I even went out of my way to manually try it using venv/bin/python convert_hf_to_gguf.py --mmproj /bpool/InternVL3_5-241B-A28B and it worked, yet when I queued it, it didn't appear as a vision model, so I didn't even try and just provided the GGUF. I still have a local copy of the SafeTensors model, so once you fixed it we can generate the mmproj files for it. Once it is fixed I also plan on queueing the rest of the InternVL3_5 series of models. They are extremely good models: https://huggingface.co/collections/OpenGVLab/internvl35-68ac87bd52ebe953485927fb

Edit: Nice, the InternVL3_5-241B-A28B imatrix state has fixed itself in the meantime and now properly shows as failed. Your system is genius in that it has timeouts everywhere to fix itself.

When this happens:

[1]3.9549,[2]2.9901,Read from remote host 10.28.1.6: Connection reset by peer

then the imatrix scheduler simply has no clue what is going on anymore, and rather than crash your node by starting more, it will consider the job as still running, kind of.

This again caused /tmp to be hidden by tmpfs and more importantly somehow brough the imatrix task

That is very surprising, as I yesterday mask'ed tmp.mount myself, or rather, made sure it is masked, because I had forgotten it for nico2, but it was masked. Could it be that proxmox(?) does things in /etc before booting? I think it does try to set the hostname, for example but I never really investigated. I would say no, because nico2 comes up without issues, and tmp.mount is still masked there.

on the other hand, the rsync backup i made after upgrading nico1 does show it as masked, so something must have unmasked it.

in any case, masking it and rebooting is the way to go. please reboot it once convenient. (unless you used another method to fix it, but unmounting /tmp after booting is likely not good enough as too many things have directories/sockets in there needed at runtime).

I'm now looking into why nico2 is not back, and then the rest of the issues. oh boy.

Even llmc force-restart-imatrix InternVL3_5-241B-A28B is unable to reset it.

I assume that is because you did it too early, while it was still "running". Because it works now.

In general I try to err on the safe side, and force-restart-imatrix only works when the job has failed, not when the job is in an unknown state but possibly still running. I suspect it took ssh a while to retry (probably only due to keepalive, which can take hours).

Well, found an unrelated bug, when llmc times out (and it currently does because kaos is very overloaded and enabling nico2 requires an rsync of /llmjob), I get this:

Can't locate object method "throw" via package "AnyEvent::CondVar" at /llmjob/share/llmjob.pm line 414.

shame on me, I wrote both sides of this, and couldn't remember that the method is called croak, not throw. :(

Anyway, not sure what to do other than increase the llmc timeout to something obscene. Or get a faster kaos.

it's so humbling to see a very busy rotating rust box wait 10 minutes to start a simple command

also, i really underestimated the skipped model list. normally, I go through up to 1000 models each time I do models, which is usually good enough for most days (600-1000 models is typical).

But these models are intense, because apparently most of the "uninteresting" models have already been skipped, so I end up with way too many hard to parse model names to quickly go through. Nevertheless,
I went through 16k out of the 34k models so far, so I guess it will be around 1200 models overall.

ok, everything looks much better now. llmstatusd couldn't update for... at least half an hour.

@mradermacher rich1 is ready for you to use. We created a 12 TiB BTRFS raid0 pool over 3 HDDs, from which you can reserve 8 TiB for your scheduler. It is really nice to have another worker we can use for large models. We currently assigned 200 GiB of RAM and 256 cores to your container. Please reenable it and have fun. Port 9999 can be used for SSH access using the public dynamic IP. Feel free to use rich1 as hard as you can. Your system currently only supports 1 imatrix node, but if you ever add support for a second one, we can add some GPUs to your container on rich1.

@mradermacher Please hardcode InternVLChatModel as a vision model no matter what llama.cpp thinks, as it is supported. It is one, and it not being recognized as such is starting to get quite problematic, as they have now started to transform many models into their architecture, like InternVL3_5-GPT-OSS-20B-A4B-Preview, which all have the same issue.

Please also update to latest llama.cpp of our fork for the highly anticipated Kimi VL vision extraction support, MiniCPM-V 4.5 vision extraction support and interns1-mini support. Let's also retry InternVL3_5-GPT-OSS-20B-A4B-Preview after the update.

Any status update regarding enabling rich1? We currently have some large, high-priority models in the queue that would be perfect for it. Please don't forget to remove the model size limit when reenabling it now that the previous storage constraints are gone. I also recommend doing multiple concurrent tasks, as we have plenty of RAM and many CPU cores to keep busy.

@mradermacher Can you even access rich1? I'm not sure if it is still connected to your VPN, as when I try to enable it over llmc enable rich1 the SSH connection to it over VPN times out. Maybe this is also just because you disabled it. If I need to do anything on rich1 to fix this, or should email you its current IP so you can SSH into it, just let me know.

nico1 ~# llmc enable rich1
rsync /llmjob/llama.cpp-nocuda rich1:/llmjob/.
rich1: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(232) [sender=3.2.7]
ssh: connect to host 10.28.1.7 port 22: Connection timed out

@mradermacher When I check ip addr on rich1, the wgllm interface I have on nico1 is completely missing and wg show does not return anything.

@mradermacher I created a script /root/rich1.sh on nico1 that lets you access rich1 over SSH until you fixed the VPN.

Please hardcode InternVLChatModel as a vision model

Can you first test whether mmproj extraction actually works for an unsupported model? I am a bit doubtful on why convert_hf_to_gguf.py should work on that model when it explicitly says it doesn't support it.

Please reenable it and have fun. Port 9999 can be used for SSH access using public dynamic IP.

It might be public, but I don't know its magic digits.

wgllm on rich1 is configured via systemd, so if it is missing, somebody changed the config in /etc, i.e. the container is somehow corrupted. Even with no network access it should start showing up.

I created a script /root/rich1.sh

Ah, that contains the magic digits. I removed the script, as I don't think I will ever use it, but I can log in to rich1.

@nicoboss it seems something horrible has happened when copying rich1, all permissions and ownerships are corrupted, which is why systemd couldn't read the config files. I don't think that is reasonably fixable without setting it up fresh. Is there a chance of getting the actual container data, not some weird windows copy? :)

If you don't know how to copy a filesystem: with rsync you need "-axH --numeric-ids", and with tar you need "--xattrs --numeric-owner" to copy a linux file system. anything less corrupts the system more or less. in rsync, -a makes sure ownership and permissions are copied, -x keeps it from crossing into other filesystems, -H preserves hardlinks, and --numeric-ids adjusts for the fact that your host has different uids/gids than the container.

If the original container is lost forever, I can see if I maybe have a very old backup, and rsync that over, hoping that it is complete (it's not meant to be a fully restorable backup).

imatrix node but if you ever add support for a second one

The cheapest way would be to make a copy of the script (well, or add a second "personality" and queue) - it already is a hacked copy of llmjob anyway.

At the moment, from my side, the only time I wish I had a second node would be when nico1 is down or busy. But maybe you want rich1 to also shoulder a bit of that load. That is mostly between you and richard.

The bigger issue is to decide where to distribute the model to. Would be nice if the imatrix scheduler could know about more nodes.

@nicoboss it seems something horrible has happened when copying rich1, all permissions and ownerships are corrupted, which is why systemd couldn't read the config files. I don't think that is reasonably fixable without setting it up fresh. IS there a chance of getting the actual container data, not some weird windows copy? :)

Only file/folder ownerships should be affected, but not the permissions or any other attributes. I thought I had already manually fixed all the file/folder ownerships by recreating them the way they are on nico1 - I must have missed one. That wasn't because of the backup but because Richard accidentally chowned them all to root on the original container before making the backup, as he accidentally made it recursive.

llama.cpp is updated

Only file/folder ownerships should be affected but not the permissions or any other attributes.

The permission bits themselves might be ok, but effective permissions are corrupted (i.e. networkd could access it, and now can't).

Richard accidentally chowned them all to root on the original container before making the backup, as he accidentally made it recursive.

Holy shit. I don't think it's reasonable to try to fix this manually, I'll try to restore it myself from backup then, wish me luck.

In any case, the only thing that really should be improved is telling me about it, as you obviously knew it happened, but you just let me stumble blind and forced me to investigate myself.

Yeah, or maybe you didn't realise that chowning all files would severely corrupt a unix system.

(things I didn't asked for :)

Can you first test whether mmproj extraction actually works for an unsupported model? I am a bit doubtful on why convert_hf_to_gguf.py should work on that model when it explicitly says it doesn't support it.

I already tested it for InternVL3_5-241B-A28B, as mentioned in my original request to treat InternVLChatModel as a vision model in https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/6#68ae60a635ca29ee4c4ea4bf. mmproj extraction worked without any issues.

even the ssh key has changed on rich1...

Holy shit. I don't think it's reasonable to try to fix this manually, I'll try to restore it myself from backup then, wish me luck.
In any case, the only thing that really should be improved is telling me about it, as you obviously knew it happened, but you just let me stumbled blind and forced me to investigate myself
Yeah, or maybe you didn't realise that chowning all files would severely corrupt a unix system.

I thought I had already fixed it because I ran find / \( -path /proc -o -path /sys -o -path /dev -o -path /run \) -prune -o ! -user root -exec stat -c '%U %G %n' {} \; 2>/dev/null on nico1 to find all the files/folders that aren't owned by root and then manually replicated exactly that ownership on rich1. Because I thought I had already fixed it, I felt it wasn't important to tell you, and it also seemed unrelated to the VPN issue. I will try to communicate even more information in the future, even if I feel like it might not be relevant to you.

I already tested for InternVL3_5-241B-A28B

Thanks, I'll provide a manual override

@nicoboss more than 60k files had wrong (effective, didn't care to check what exactly) permissions. Pretty sure you couldn't have fixed those by hand :)

I thought I already fixed it because I did find

Ok, wow. Well, thanks for telling me in the future. I must say, it would strike me as extremely relevant and obvious, but maybe you were not aware of how important ownership is on unix, so, fair game, things happen.

even the ssh key has changed on rich1...

This is expected behaviour as it's one of the files managed by Proxmox and we moved to a different Proxmox host. The same applies to all other files managed by the host, like the network configuration, the hostname, and some other files, as you can see in their documentation under https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_guest_operating_system_configuration:

Proxmox VE tries to detect the Linux distribution in the container, and modifies some files. Here is a short list of things done at container startup:

set /etc/hostname
    to set the container name
modify /etc/hosts
    to allow lookup of the local hostname
network setup
    pass the complete network setup to the container
configure DNS
    pass information about DNS servers
adapt the init system
    for example, fix the number of spawned getty processes
set the root password
    when creating a new container
rewrite ssh_host_keys
    so that each container has unique keys
randomize crontab
    so that cron does not start at the same time on all containers

Thanks, I'll provide a manual override

Great. Thanks a lot.

llama.cpp is updated

Perfect!

@nicoboss more than 60k files had wrong (effective, didn't care to check what exactly) permissions. Pretty sure you couldn't have fixed those by hand :)

Wow, that's impressive. On nico1 I only found 25 files (besides cached manuals) not owned by root, so manually fixing those on rich1 seemed like no big deal. Maybe my command to detect files not owned by root had a mistake, because based on your observation I should have found thousands.

but maybe you were not aware of how important ownership is on unix

I know permissions are super important, but ownership always seemed not so important for containers with only a root user. I thought in this setup everything is owned by root besides a few files from services that create their special users. It blows my mind that you had 60k files with wrong attributes, because besides ownership all other attributes must have been unchanged. I still find it really hard to believe that there are so many files not owned by root. But you are right, I probably should have known better and treated this far more seriously. Sorry for not telling you in advance.

so what could have been a dreary night suddenly got very interesting. somehow my rsync did delete the thing it shouldn't have deleted, and I could no longer run binaries on rich1.

now, that is an exciting problem :) let's see if i can fix a system with just a single bash process running, and no binaries startable.

(no, i'm serious, don't help/interfere)

rich1 looks super unhealthy at the moment. The vast majority of commands seem to be broken, and should my SSH session drop I wouldn't be able to reconnect to it:

rich1 /# ls
-bash: /bin/ls: cannot execute: required file not found
[Exit 127]

so what could have been a dreary night suddenly got very interesting. somehow my rsync did delete the thing it shouldn't have deleted, and I could no longer run binaries on rich1.
now, that is an exciting problem :) let's see if i can fix a system with just a single bash process running, and no binaries startable.
(no, i'm serious, don't help/interfere)

That explains it. No worries, I won't do anything. Not that I really can do much with like 90% of commands no longer working and no new SSH sessions being possible. By the way, the commands listed under help still work.

It blows my mind that you had 60k files with wrong attributes

I'm still fact-finding; I don't know if all those had different ownership or something else was different (other than contents). Anyways, I herewith absolve you of any wrongdoing, if that was even in question. I'll report later after my very exciting adventure. I love interesting challenges, such as how to copy a binary file via a socket with just a single bash running.

a system with just a single bash process running, and no binaries startable.

how to copy a binary file [...] with just a single bash running.

@mradermacher

https://www.qfbox.info/bashcp might be useful in that case, although it requires locally compiling a binary to produce the appropriate echo commands to paste.

(EDIT: I see you've used /dev/tcp, which is probably more convenient)

no, my backup was not very healthy. it does not even have libcrypto or a complete libc - but it had all the toplevel directories, so it looked sound. and i think i know what happened there (I had a data loss problem in my backup a few months ago, and didn't care to restore rich1 because, who needs a backup if you make one), but i can go forward from that.

for future reference, that was an interesting problem - convert a static busybox to a simple hex dump, then use /dev/tcp and ( while read hex; do printf "\\x$hex";done) >/usr/bin/who to overwrite a harmless existing binary (no chmod nor cat available), then use bash's "exec -a" feature to run a busybox shell.
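For the record, a minimal sketch of that trick (hostname, port and the sacrificial binary are placeholders, and nc flags vary by variant; the receiving loop is the one described above):

# on a healthy helper machine: serve busybox as one hex octet per line
xxd -p -c1 busybox | nc -l -p 9999

# on the broken host, with nothing but bash: pull the bytes over /dev/tcp, rebuild
# them with printf, and overwrite an already-executable binary (so no chmod needed)
( while read hex; do printf "\\x$hex"; done ) </dev/tcp/helper.example/9999 >/usr/bin/who

# busybox picks its applet from argv[0], so exec -a turns the overwritten binary into a shell
bash -c 'exec -a sh /usr/bin/who'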

https://www.qfbox.info/bashcp might be useful in that case, although it requires locally compiling a binary to produce the appropriate echo commands to paste.

Thanks, yes, that would also have helped. And indeed, what cost me the most time was getting the idea of using printf/hex. There really is no other way to connect two fds in bash, it seems.

What that page should add is how to get an executable file without chmod, though. A busybox binary is useless if you can't run it. I abused an existing binary, but a better method would be nice.

Update: oh, the page does do exactly that, sorry, only skimmed the page.

I had a data loss problem in my backup a few months ago, and didn't care to restore rich1 because, who needs a backup if you make one

The issue with that thought is that once you need it, you can no longer just make one. But don't worry, we have one, just with broken ownership.

I'll report later after my very exciting adventure. I love interesting challenges, such as how to copy a binary file via a socket with just a single bash running.

That sounds like a fun challenge. If you can't solve it I could tell Richard to add/change some files from the host or restore the backup we made during migration from old to new server using vzdump.

https://www.qfbox.info/bashcp might be useful in that case, although it requires locally compiling a binary to produce the appropriate echo commands to paste.

That's super cool. I feel bad for you getting constantly spammed with notifications because of us treating HuggingFace like a chat platform.

for future reference, that was an interesting problem - convert a static busybox to a simple hex dump, then use /dev/tcp and ( while read hex; do printf "\x$hex";done) >/usr/bin/who to overwrite a harmless existing binary (no chmod nor cat available), then use bash's "exec -a" feature to run a busybox shell.

Wow that is so nice. What a cool solution to this problem.

The issue with that thought is that once you need it, you can no longer just make one. But don't worry, we have one, just with broken ownership.

The recovery for both is about the same, though - reinstall debian, then clean up extra files, which is what I am currently effectively doing.

Wow that is so nice. What a cool solution to this problem.

Relatively speaking, I do this a lot, but I never had to copy a binary with just bash, so the only really new thing was to use printf (which seems to be easier than working around bash's echo). But yeah, @compilade cheated by actually finding a precanned solution :)

Normally, I also am more like a strict posix shell guy, so a solution with just posix sh might be cool (/dev/tcp is out, but pasting still works) - but neither echo (no escape codes) nor printf (no built-in) will be useful.

Maybe a single-octet read loop with the empty string meaning a 0-octet might work...

The lack of exec -a in posix can be worked around by compiling a different binary.

@nicoboss ok, I failed. when I did systemctl restart systemd-networkd to get the tunnel I was instantly without network.

I have not tried rebooting, but with luck, rich1 will come up when rebooted. not sure about the network, but obviously, I'm out of the game and need help.

the pre-trixie etc is in /etc-old, if that helps.

@nicoboss ok, I failed. when I did systemctl restart systemd-networkd to get the tunnel I was instantly without network.
I have not tried rebooting, but with luck, rich1 will come up when rebooted. not sure about the network, but obviously, I'm out of the game and need help.

No problem. It's quite remarkable how far you made it without losing access. I also lost connection so we unfortunately have to wait for Richard to wake up again in around 3 hours. I already asked him to reboot it. If that doesn't help Richard can pct enter it from the host and fix it.

well, at least the first part was titillating, restoring debian and upgrading to trixie was not. i also gave some more thought to posix-sh-only, but I can't come up with any portable way to write a 0-octet. so sucks.

(I once hacked an arm binary (accepting hex from input, pretty much this: https://gitlab.com/rav7teif/linux.wifatch/-/blob/master/tn/rf.c) with no 0-octets inside the elf binary itself, so I could feed it over telnet, but that is hard to do for every arch, and current kernels probably wouldn't accept it).

anyway, so much for being off-topic.

I woke up early for no reason. Reboot didn't fix it lol. What did you mess up? Any specific steps you want me to follow (preferably)?

@RichardErkhov first of all, you messed it up, thank you for taking precious hours of my night :) In any case, "didn't fix" is rather unspecific. surely it will be able to start systemd and do something, and then error out somehow? If I knew what is wrong I would have fixed it.

Basically I installed trixie packages, and when I thought it was ready for boot, I tried systemctl restart systemd-networkd to make sure the tunnel would go up, and then the network was gone. Since you broke networkd, I have no clue how it got network the first time it booted (it wasn't through networkd as on the old server, so something must have changed that is out of my control)

Maybe the old method on the old host is not compatible with the method on your new host? (looking at my backup, it simply did dhcp on eth0)

Otherwise, I'd have to get some hints (console messages) to be able to even guess what could be wrong. But if all you see is that it lacks network connectivity, then it's the network config in /etc - I copied over the config from the old box.

But without any clue why it doesn't come up, or what's going wrong, I can't say more. I would assume that it does boot to some extent though, as systemd was successfully restarting before it happened.

you messed it up, thank you for taking precious hours of my night :)

Well the network was working when I went to sleep, but sorry for whatever I did wrong lol =)

ping 8.8.8.8 says unreachable

Let me try checking configs (after exams I forget a lot, complete brain dead so sorry if I keep forgetting stuff). If not a secret, which file should I see ? =)

Well the network was working when I went to sleep, but sorry for whatever I did wrong lol =)

The fact that it wasn't working is why nico asked me to investigate in the first place. And we know why it wasn't working, you chown'ed everything to root, so the networkd didn't start.

You can look at any file, as long as you don't publish its contents (or at least not the secret keys) :)

I don't know how networking works in your configuration, so I can't tell you what to do. I can only tell you that on the old rich1 there was an eth0, and systemd tried to use dhcp on it. But when it was started the first time, it had networking without it (because networkd failed to start), so I assume the networking config changed, but that is outside of rich1.

There is also the possibility that systemd-networkd still doesn't start properly - but normally, when it simply fails to start, it doesn't touch the network config.

I've googled a bit, and proxmox possibly does some kind of guessing about how the container configures its network. Maybe that goes wrong. But if programs properly run inside (ping, systemd), then most likely the system does boot to some extent.

Again, I'd need something tangible to even guess what the problem is. Such as the journal, or boot messages.

@nicoboss before I forget:

I know permissions are super important but ownership always seemed not so important for containers with only a root user.

there might be very stripped down containers which only have a root user, but debian (or any full system that boots systemd or sysvinit, really) is never of this type. it always contains multiple users.

Hmmm. It is eth0, ip should be 192.168.2.106/24, which is correct in /etc/network/interfaces, the router should be 192.168.2.1, which is also correct. I am confused lol

Oh so you wanted to say in the evening it worked because it was even more broken lol ?

And I think system boots up properly given that I am inside ct right now looking at config files lol

Again, I'd need something tangible to even guess what the problem is. Such as the journal, or boot messages.

I'm on my phone for the next few hours, so getting that would be a pain rn

/etc/network/interfaces

this file is not used on debian unless you install some really old legacy packages. modern system configuration is done with systemd-networkd, for better or worse.

it could be that that is the problem - the package that would use that file is not even installed. at least not now, maybe it was installed before.

if dhcp doesn't work on the new host, you'd have to configure /etc/systemd/network/eth0.network (or something similar, I can't look at it right now :).

Probably this might work:

[Match]
Name=eth0

[Network]
Address=192.168.2.106/24
Gateway=192.168.2.1

I am pretty sure my backup of /etc is complete, and rich1 didn't have an /etc/network/interfaces file in july (when the backup was made), but there was an eth0.network, and it must have been used, since the wg tunnel is configured via networkd as well, and network is all or nothing.

So something changed the network config since then. Well, I can't administrate around silent changes like that, if I don't know about it, I can't keep it working.

Let's wait for nico, he is a good networking guy and knows rich1 config =)

because if I touch it then ... if something can go wrong it will go wrong

I know^W^WThat's a good idea :)

I managed to get this screenshot from Richard in case you see anything wrong there. I'm currently waiting for Richard to be home again so I can work together with him to fix this:
image.png

I accidentally repaired it. Yes, breaking it more indeed fixes the network. /etc/network/interfaces was the fix. I completely commented it out and now it works

@mradermacher Internet on rich1 works again but something with WireGuard VPN is still not working. You can use dmesg inside rich1 container to debug WireGuard VPN. Here the WireGuard dmesg log with kaos IP replaced with kaos:

[135910.920836] wireguard: wgllm: Interface created
[135910.954132] wireguard: wgllm: Peer 25 created
[135910.954202] wireguard: wgllm: Peer 26 created
[135910.954275] wireguard: wgllm: Peer 27 created
[135910.954347] wireguard: wgllm: Peer 28 created
[135910.954414] wireguard: wgllm: Peer 29 created
[135910.954486] wireguard: wgllm: Peer 30 created
[136790.912729] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 25
[136791.921307] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 25
[136792.946287] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 25
[136793.969256] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 25
[136802.748018] wireguard: wgllm: Sending handshake initiation to peer 30 (kaos:7103)
[136807.856849] wireguard: wgllm: Sending handshake initiation to peer 30 (kaos:7103)
[136813.168655] wireguard: wgllm: Handshake for peer 30 (kaos:7103) did not complete after 5 seconds, retrying (try 2)
[136813.168726] wireguard: wgllm: Sending handshake initiation to peer 30 (kaos:7103)
(...)
[136945.260983] wireguard: wgllm: Handshake for peer 30 (kaos:7103) did not complete after 20 attempts, giving up

I noticed that /etc/systemd/network/50-wgllm.netdev on rich1 is very outdated due to you restoring a really old backup, so I updated it with the one from our rich1 backup. The VPN now seems to work given that key pairs get exchanged, but it still doesn't seem to actually work:

[138614.400873] wireguard: wgllm: Interface created
[138614.426813] wireguard: wgllm: Peer 34 created
[138614.426922] wireguard: wgllm: Peer 35 created
[138614.427004] wireguard: wgllm: Peer 36 created
[138717.757475] wireguard: wgllm: Creating namespace exiting
[138717.834224] wireguard: wgllm: Peer 31 ((einval)) destroyed
[138717.834627] wireguard: wgllm: Peer 32 ((einval)) destroyed
[138717.834779] wireguard: wgllm: Peer 33 (kaos:7103) destroyed
[138717.862415] wireguard: wgllm: Interface destroyed
[140465.007778] wireguard: wgllm: Sending handshake initiation to peer 36 (kaos:7103)
[140465.211108] wireguard: wgllm: Receiving handshake response from peer 36 (kaos:7103)
[140465.211124] wireguard: wgllm: Keypair 1 created for peer 36
[140475.273413] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 35
[140475.604844] wireguard: wgllm: Receiving keepalive packet from peer 36 (kaos:7103)
[140476.303235] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 35
[140543.501407] wireguard: wgllm: Sending keepalive packet to peer 36 (kaos:7103)
[140595.212043] wireguard: wgllm: Sending keepalive packet to peer 36 (kaos:7103)
[140595.212282] wireguard: wgllm: Sending handshake initiation to peer 36 (kaos:7103)
[140595.408798] wireguard: wgllm: Receiving handshake response from peer 36 (kaos:7103)
[140595.408816] wireguard: wgllm: Keypair 2 created for peer 36
[140595.408824] wireguard: wgllm: Sending keepalive packet to peer 36 (kaos:7103)
[140628.431057] wireguard: wgllm: No valid endpoint has been configured or discovered for peer 35

good morning :)

/etc/network/interfaces was the fix. I completely commented it out and now it works

Nothing is commented out though, and now we have two network configs for eth0. So what really was done? I don't think this is a good config going forward, if two config mechanisms race each other.

/etc/network/interfaces requires the old ifupdown package to configure a static ip (and reconfigures lo?), while /etc/systemd/network/eth0.network uses dhcp.

both are running and trying to configure the interface. What is the preferred mechanism, static IP or dhcp? And does proxmox somehow muck with networkd? Because if I remember it correctly, it did so on nico1, where we had to switch to ifupdown because of proxmox.

50-wgllm.netdev on rich1 is very outdated

indeed, well caught :( I'll have a look

I have no clue what is going on with wgllm, but I see no packets arriving on kaos, nor being sent on rich1. Even with a totally wrong config it should send handshakes immediately.

Can I assume that rich1 has a fixed external IP and port 7103 goes through to rich1? If not, we can't use wireguard, as wireguard does not support dynamic ips as endpoints.

(this is unrelated to why it doesn't work right now, though)

Update: I can assume it, but it's not true, so this would need to be fixed before we can use rich1, but it's not the problem right now, I would think. Unless the same config prevents wireguard from sending data out, which is probably not the case.

@nicoboss can't reach nico2 from outside: ssh: connect to host castle.nico.re port 2111: No route to host

Update: uh, it's offline of course

Just fyi, I decided on the systemd network config and rebooted rich1, and that seems to work. However, after reboot:

-rw-r--r-- 1 root root 137 Aug 29 20:15 interfaces
-rw-r--r-- 1 root root 321 Aug 29 11:32 interfaces_old

So indeed, something (well, proxmox I guess) patches around in /etc on reboot. That also explains why @RichardErkhov commented it out with no effect, it just comes back on reboot.

I'll switch to ifupdown and reboot once more, since proxmox seems to require this specific debian setup. What does it do if some other linux is installed? That seems... broken.

Anyway, it came back up with ifupdown, and systemd now keeps its hands off of eth0.
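For reference, the ifupdown side is just a static stanza in /etc/network/interfaces - roughly like this, using the values mentioned earlier (a sketch, not the literal file):

auto eth0
iface eth0 inet static
    address 192.168.2.106/24
    gateway 192.168.2.1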

what can i say, a few reboots made wgllm work, kind of. this smells rotten. I don't think I've changed anything in the config (I regenerated it from scratch, but even if the keys would have changed, it should have sent out data)

anyway, I need some fixed ip/external port visibility for wireguard - without it, it can't work, only one node in the wireguard network can have a dynamic ip, and that is currently nico1+nico2 (which have a fixed IP internally)

I am currently trying this: Endpoint=x.x.x.229:7103

unrelated to rich1: I've added MMPROJ/InternVLChatModel to /llmjob/share/convert_hf_to_gguf_models.pm, which is what everything should currently use to make decisions.

@nicoboss I've requeued InternVL3_5-GPT-OSS-20B-A4B-Preview, but it fails during mmproj extraction. Probably this model only? You should be able to requeue other InternVLChatModel models to test. I've queued InternVL3_5-241B-A28B

ValueError: Can not map tensor 'language_model.model.layers.0.input_layernorm.weight'

Update: InternVL3_5-241B-A28B works, so it is either just that model, or incomplete support in llama.cpp

Also, for rich1 parameters, I assume / is a rotating disk (i.e. very slow), but I can use, say, 4TB? (and a few more if ever we want to do imatrix gens)? and should i try to use all the cpus i currently have, which would probably need 3-4 quant jobs.

good morning :)

Good evening! :D

Nothing is commented out though, and now we have two network configs for eth0. So what really was done? I don't think this is a good config going forward, if two config mechanisms race each other.

Exactly what is happening and depending on which won the race your container had no internet.

/etc/network/interfaces requires the old ifupdown package to configure a static ip (and reconfigures lo?), while /etc/systemd/network/eth0.network uses dhcp.
both are running and trying to configure the interface. What is the preferred mechanism, static IP or dhcp? And does proxmox somehow muck with networkd? Because if I remember it correctly, it did so on nico1, where we had to switch to ifupdown because of proxmox.

Please use 192.168.2.106 as static internal IP as that's what we set in Proxmox web interface.

Can I assume that rich1 has a fixed external IP and port 7103 goes through to rich1? If not, we can't use wireguard, as wireguard does not support dynamic ips as endpoints.

No, it has not. It is way worse. Unlike nico1/nico2, which have one that only changes if I reboot the router, his ISP performs a forceful external IP change every few days.

So indeed, something (well, proxmox I guess) patches around in /etc on reboot. That also explains why @RichardErkhov commented it out with no effect, it just comes back on reboot.

The /etc/network/interfaces networking configuration is one of the files managed by Proxmox. You don't have to use it, but keep in mind that if you don't, the IP set in the web interface won't have any effect. That's not really something we care about, but it is something to keep in mind.

I'll switch to ifupdown and reboot once more, since proxmox seems to require this specific debian setup.

That is for sure the preferred setup as then everything should work as intended. What are we using on nico1?

What does it do if some other linux is installed? That seems... broken.

It is supposed to automatically recognize which network configuration to manage, but I'm using /etc/network/interfaces everywhere so I don't know if it actually does so.

Anyway, it came back up with ifupdown, and systemd now keeps its hands off of eth0.

Perfect! Thanks a lot for fixing the network setup.

what can i say, a few reboots made wgllm work, kind of. this smells rotten. I don't think I've changed anything in the config (I regenerated it from scratch, but even if the keys would have changed, it should have sent out data)

Glad it ended up working. Should you still experience WireGuard VPN issues you can currently use dmesg to debug them, as we enabled logging for it.

anyway, I need some fixed ip/external port visibility for wireguard - without it, it can't work, only one node in the wireguard network can have a dynamic ip, and that is currently nico1+nico2 (which have a fixed IP internally)
I am currently trying this: Endpoint=x.x.x.229:7103

Can you use DNS? If so, you could use castle.nico.re for nico1 and nico2 and use the dynamic IP for rich1. Even if you can't use DNS, I would take it away from nico1/nico2, as their IP changing is relatively rare. Sometimes I don't reboot the router for half a year. There is unfortunately no easy way to get a fixed IP for rich1. He already tried getting one from his ISP 3 months ago.

unrelated to rich1: I've added MMPROJ/InternVLChatModel to /llmjob/share/convert_hf_to_gguf_models.pm, which is what everything should currently use to make decisions.

Awesome. Thanks a lot!

@nicoboss I've requeued InternVL3_5-GPT-OSS-20B-A4B-Preview, but it fails during mmproj extraction. Probably this model only? You should be able to requeue other InternVLChatModel models to test. I've queued InternVL3_5-241B-A28B

It should work but we will see. Currently at run/noquant 97/97

Also, for rich1 parameters, I assume / is a rotating disk (i.e. very slow), but I can use, say, 4TB? (and a few more if ever we want to do imatrix gens)? and should i try to use all the cpus i currently have, which would probably need 3-4 quant jobs.

It's three HDDs in RAID 0 so while slower than NVMe SSDs it is not as slow as you think. It is actually faster than a SATA SSD for my workloads due to having much faster sequential read/write performance. You can reserve 8 TB for scheduler and if you ever need more just ask. That way I can use up to 4 TB for myself. Regarding CPU just make use of all of it and take as much RAM as you need.

Unlike nico1/nico2, which have one that only changes if I reboot the router, his ISP performs a forceful external IP change every few days.

Unfortunately, that nominally rules out wireguard... There are alternatives (e.g. my own venerable gvpe), but I'd rather avoid so much set-up/maintenance trouble.

I will think about it, and since we likely only need the tunnel between nico1 and rich1 for imatrix, I can probably do that over ssh. If true, it should be relatively easy, since imatrixjob already supports the most convoluted routes (such as using ssh over multiple hops).

Can you use DNS? If so, you could use castle.nico.re for nico1 and nico2 and use the dynamic IP for rich1.

systemd-networkd can use DNS, it just doesn't support dynamic ip addresses. We'd have to reconfigure the tunnel on all endpoints on every ip change (and I don't think systemd-networkd can even do it).

It works for nico1/2 because a) nico1 and nico2 have fixed ip addresses internally, so they can see each other, and b) they aggressively send keepalive pings to every other host every 50s. the other hosts do react to ip changes, but they cannot contact nico1/2 by themselves. That works for one host, because nico1 can ping everybody else (because they have a fixed ip), but it doesn't work for two.
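For illustration, the relevant peer section in the .netdev looks roughly like this (keys and addresses are placeholders; the point is just the Endpoint/PersistentKeepalive mechanics):

[WireGuardPeer]
PublicKey=<peer public key>
AllowedIPs=10.0.0.2/32
# the endpoint must be a stable address - wireguard resolves it once and won't chase IP changes
Endpoint=kaos.example:7103
# the aggressive keepalive mentioned above, so the dynamic/NATed side stays reachable
PersistentKeepalive=50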

It is supposed to automatically recognize which network configuration to manage, but I'm using /etc/network/interfaces everywhere so I don't know if it actually does so.

Well, we already know it doesn't :)

The interesting question is why it worked on rich1 before.

Can you use DNS? If so, you could use castle.nico.re for nico1 and nico2 and use the dynamic IP for rich1.

I don't see how that would help. It would just shift the problem to nico1+nico2, which relatively frequently change their ip address, and we have failures every time it happens.

It is actually faster than a SATA SSD for my workloads due to having much faster sequential read/write performance.

Well, we have operational experience with 4-disk raids in both kaos and marco, and it's dog slow for llm quanting :)=, certainly compared to a sata ssd.

But it's fine! I'll settle for 6TB, plus maybe more if we do imatrix, since I foresee that we will want to do really big models on rich1, and nico1 regularly runs out of space for a few TB. 6TB seems ample, even for big models (that is about twice nico1), and since you want to use it, you will eventually find a better use for it. We can easily change that, but it seems like a generous config at the moment.

Regarding CPU just make use of all of it

torture that boy, don't worry about frying the electric box, we already went through that and it's fixed =)

actually, the mmproj extraction for intern.*gpt-oss did work, I must have misread the logs. What failed was the gguf conversion.

and it's dog slow for llm quanting :)=

depends on how you (ab)use it. these are quite good HDDs for some reason (nice file amount IO), they also cost like $9/TB

I assume the new rich1 will have similar network bottlenecks as the old one? I can hardly get 300Mbps at the moment, and it's very uneven. i.e. I will keep hdfprep enabled.

Ok, 300Mbps was a speedtest. huggingface is kind of kilobytes per second. This seems... too bad to be real.

And now I get a solid 1Gbps, after a long time of <10Mbps

New rich1 is 2 Gbit/s download to 1 Gbit/s upload (but I think limited to 1 Gbit/s download for some reason). While you were doing the speedtest I was downloading at full speed and reading a model to/from the same HDD you use, which is probably why you got such strange results.

depends on how you (ab)use it. these are quite good HDDs for some reason (nice file amount IO), they also cost like $9/TB

Well, you know what we do, convert_hf_to_gguf.py, which has abysmal I/O patterns, for no technical reason.

New rich1 is 2 Gbit/s download to 1 Gbit/s upload. While you were doing the speedtest I was downloading at full speed and reading a model to/from the same HDD you use, which is probably why you got such strange results.

That makes sense. In any case, don't think we can do 4 jobs then :-)

That makes sense. In any case, don't think we can do 4 jobs then :-)

We probably should just try which configuration is the fastest. Maybe 3 jobs as we have 3 HDDs in BTRFS RAID 0.

Hmm, ideally, I might want to route rich1<->nico1 via kaos somehow, I really don't feel well using nico's public dns, as it seems to be failing every other time I need it (which is not often). I don't see how to do that easily, for permission reasons, though. I will try to give it some more thought.

Hmm, ideally, I might want to route rich1<->nico1 via kaos somehow, I really don't feel well using nico's public dns, as it seems to be failing every other time I need it (which is not often). I don't see how to do that easily, for permission reasons, though. I will try to give it some more thought.

Don't worry about nico1/nico2 DNS being outdated. I want to setup DDNS on OpenWrt router anyways. I had this on my ToDo list for a while but now that it is actually used for something important, I can set it up tomorrow.

[178682.321801] Memory cgroup out of memory: Killed process 4098889 (hf) total-vm:8642940kB, anon-rss:2860144kB, file-rss:14408kB, shmem-rss:0kB, UID:0 pgtables:6924kB oom_score_adj:0

downloads get oom-killed now. and it seems hf download used 32GB of memory.

why????

any idea what could be happening here? Does hf download download every file to memory first? Holy shit, who writes such software. Or does anybody have an idea of what could be happening?

any idea what could be happening here? Does hf download download every file to memory first? Holy shit, who writes such software. Or does anybody have an idea of what could be happening?

What version of huggingface-cli are you using? Did you finally upgrade to the latest one and are now using XET for downloads?

Let's go back to huggingface-cli (I recently switched to hf). If that also fails, I'll try to limit the number of concurrent downloads, because the failing model has no file larger than 4GB. We currently use the default, whatever that is.

What version of huggingface-cli are you using? Did you finally upgrade to the latest one and are now using XET for downloads?

I always use the latest one when I update (that is, yesterday), and I've always used XET for downloads, as long as that is the default. I use this config:

export HF_HUB_DISABLE_TELEMETRY=1
export HF_HUB_CACHE="$repo/cache/hub"
export HF_XET_CACHE="$repo/cache/xet"
export HF_XET_CHUNK_CACHE_SIZE_BYTES=0
export HF_XET_RECONSTRUCT_WRITE_SEQUENTIALLY=1
export HF_HUB_DISABLE_PROGRESS_BARS=1

wow, 200% cpu for a file download. right now, with huggingface-cli, which I would assume should be the same code as hf download, it's at 3GB, slowly increasing. Maybe we are on to something and it's xet.

I think if that is the case, we need to disable xet for downloads. There doesn't seem to be much benefit to using it (I would assume there is close to zero benefit), but stark drawbacks (cpu, ram).

(UIGEN-FX-30B-08-26 also did get oom-killed during download)

I was watching in top, and the download was hovering between 2.6GB and 3.7GB RES, and then it was oom-killed.

Hmm, the actual rss seem to add up to ~4GB. I've never been good with oom-killer reports, but I don't see why it is being oom-killed.

Is there maybe an outer and very low memory limit?

[179640.561008] [ pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[179640.567732] [4143523]     0 4143523     2407      522        0      522         0    57344        0             0 bash
[179640.575399] [4143534]     0 4143534     2407      527        0      527         0    57344        0             0 hfd
[179640.583149] [4143536]     0 4143536  2258576  1053124  1049504     3620         0  9732096        0             0 huggingface-cli
[179640.599116] [4143537]     0 4143537     2031        0        0        0         0    61440        0             0 tee

Let's go back to huggingface-cli (I recently switched to hf).

I always thought they were aliases for the same thing and the new command is just more convenient as it requires less typing.

If that also fails, I'll try to limit the number of concurrent downloads, because the failing model has no file larger than 4GB. We currently use the default, whatever that is.

That could explain it, but by default there should already be a limit of 8 concurrent downloads as far as I'm aware.

wow, 200% cpu for a file download. right now, with huggingface-cli, which I would assume should be the same code as hf download, it's at 3GB, slowly increasing. Maybe we are on to something and it's xet.

I mean XET does a lot of work, so it using some CPU is expected, but it really should never use more than around 20 GB, as its entire HDD cache is limited to 16 GB plus some GB for it running. Even then, it realistically should never come anywhere close to that, as there is a reason it's an HDD cache and not an in-memory cache, but I wouldn't at all be surprised if this is related to XET.

I think if that is the case, we need to disable xet for downloads. There doesn't seem to be much benefit to using it (I would assume there is close to zero benefit), but stark drawbacks (cpu, ram).

Or just downgrade to a huggingface-cli version that did not yet have this issue.

I was watching in top, and the download was hovering between 2.6GB and 3.7GB RES, and then it was oom-killed.

That is super strange. On what host? On all of them or a specific one?

no, it's definitely the 32GB internal cgroup limit.

That is super strange. On what host? On all of them or a specific one?

Only on rich1. I'll switch back to hf, as huggingface-cli gets killed as well. The versions should be identical everywhere except on nico1, which has some cuda python libraries that the others don't have.

--max-workers

Default is 8, but netstat shows 63 https connections for one download. I don't know how a "worker" maps on http connections, but the config should be identical between all hosts, even on nico1.

With "--max-workers 1" still see ast least 51 https connections. But a lot less cpu (~less than half). And still the same download speed at the moment. RAM usage in top seems about halved.

I always thought they were aliases for the same thing and the new command is just more convenient as it requires less typing.

It's definitely different code (different help outputs), but I would assume both commands call into the same library code that we also use for uploads.

Disk full - seems the 13TB were a lie, I only got 0.8TB :)

Ha! tmpfs /tmp mount is back!

I have systemctl mask tmp.mount still on my screen, though (if I scroll back to what I typed yesterday). Something must have unmasked it.

That also explains the OOM, obviously.

Update: well, not obviously. It's clearly not limited by my 32GB cgroup. But hey, I take it.

Anyway, all these unexpected changes (I'm not pointing fingers here, since proxmox patches around as well) that I don't know about make it really really difficult to use rich1.

Sucks, one model job has been lost due to the OOM. I remember there were 4 jobs on rich1, but I can only find 3 in the log. @nicoboss didn't you say you keep some archive of the status? Would you possibly have one that shows the jobs on rich1 when it had 4 jobs?

I decided to transfer imatrix from rich1 to nico1 via ssh via kaos, that seems to me to be the most stable solution.

Sucks, one model job has been lost due to the OOM. I remember there were 4 jobs on rich1, but I can only find 3 in the log. @nicoboss didn't you say you keep some archive of the status? Would you possibly have one that shows the jobs on rich1 when it had 4 jobs?

Unlike nico1/nico2, we set up rich1 as a privileged container, so you can just grep dmesg to see which jobs got OOM-killed and requeue them:

rich1 ~# dmesg | grep oom-kill
[178682.287787] hf-xet-26 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[178682.320910] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-1,oom_memcg=/lxc/106/ns/system.slice/llmjob-wrap-GLM-Steam-106B-A12B-v1-hfd-1477.scope,task_memcg=/lxc/106/ns/system.slice/llmjob-wrap-GLM-Steam-106B-A12B-v1-hfd-1477.scope,task=hf,pid=4098889,uid=0
[178991.975295] hf invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[178992.016437] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-1,oom_memcg=/lxc/106/ns/system.slice/llmjob-wrap-UIGEN-FX-30B-08-26-hfd-1834.scope,task_memcg=/lxc/106/ns/system.slice/llmjob-wrap-UIGEN-FX-30B-08-26-hfd-1834.scope,task=hf,pid=4120077,uid=0
[179639.862561] hf-xet-19 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[179640.606725] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-1,oom_memcg=/lxc/106/ns/system.slice/llmjob-wrap-GLM-Steam-106B-A12B-v1-hfd-2136.scope,task_memcg=/lxc/106/ns/system.slice/llmjob-wrap-GLM-Steam-106B-A12B-v1-hfd-2136.scope,task=huggingface-cli,pid=4143536,uid=0

So it should be those two, both of which you already queued:

  • GLM-Steam-106B-A12B-v1
  • UIGEN-FX-30B-08-26

I unfortunately don't have a status page capture of this exact moment.

these are quite good HDDs for some reason

If these are the DC HC 550 that the kernel lists them as (which they might well not be) then these are bog-average. I have some of those. But then, mostly these are at physical limits, so I don't think a rotating disk can really differentiate itself much (seagate 2x's can, although I am not sure they are really useful, and yes, you can have snails that rotate slower, but most drives are pretty much the same speed these days, at the same capacity).

Anyway, it's at 300MB/s throughput when copying, which is about right, and puts it almost at the speed of a sata ssd disk (one that doesn't crap out at 80MBps during stress :)

In practice that means I can hardly keep the cpu busy at 5% (that will change a bit once we hit iq-quants, I hope)

Also, I am happy with a single disk, if you ever want to kick me to something slower.

So it should be those two, both of which you already queued:

I found three (one is sitting in the submit queue) - queued jobs will not show up in dmesg, unfortunately, only killed jobs. But I am pretty sure there were four. Sucks, it was a prio 0 job, too.

rich1 seems to work, and has successfully quantized its first quants :)

Regarding speed - it seems the disk is the absolute bottleneck, i.e. it cannot download+convert models faster than it can quantize them, even at (relatively) low cpu usage. I'd be surprised if we can run more than one or maybe two quant jobs. I'll make a throughput calculation once GLM has converted. Could be off by a factor of two.

Ok, statistics time. Downloading+conversion of GLM-Steam-106B-A12B-v1 ran at 24MB/s together. Conversion alone had an average I/O speed of 32MB/s (model size divided by time, not actual I/O).

Yup, there is no danger of making rich1's cpus busy from my side. Which is good and bad :)

nice, I can read cpu power usage. That's the most busy I saw it, at around 30%. But even that is likely too fast for the uplink (when nothing is converting)

(one concurrent download reduces that to 50W at 3%).

 143.28
 161.44
 168.54
 177.04
 180.67
 187.03
 189.95
 177.97
 171.56
 157.00
 163.32
 159.66
 153.76
 155.07
 154.82
 139.38
 151.57
 155.29
 149.14
 187.46
 199.05
 196.39
 198.78
 195.84
 201.58
 191.02
 161.70

The practical recommendation would be to gather a bit more experience and then reduce the number of cores associated with quanting to, say, 20 or so. Anything faster would probably saturate the disk, and faster disks would probably saturate upload bandwidth.

@nicoboss can you have a look at why ROLEPL-AI-v2-Qwen2.5-72B is failing without exit status on nico2? i assume it's the oom killer. but why does this suddenly affect multiple models when this was never an issue for almost a year? did things in llama.cpp change? or was it not the oom killer? i have no clue.

Also, I am happy with a single disk, if you ever want to kick me to something slower.

No, we went through the pain of making that pool for you, so you can use that pool =)
It took us like 2 or 3 days, primarily because the ISP decided it is not a good day to have internet lol

@RichardErkhov Our hoster sends us "we will block your host in 5 hours because of abuse" mails almost every day at the moment. Usually, but not always, followed by "ignore this issue, the ticket is closed." Tell me about bad internet days :)

Also interesting - faster than a raid0 might be a smaller raid0 with a separate disk for temporary files, especially for conversion. It helps marco a lot. Even better would be to read the models from one disk and write the quants to another, but llmjob can't do that yet. In any case, hard to get, so that's just a theoretical thing. Maybe consulting with me would not be as bad as it apparently sounds? But if both nico and you also use it, it's not exactly "for me" either, so maybe even harder to change... :)

In other news, the fantastic storage array is currently copying ggufs for imatrix gen to nico1, at an average speed of 2MBps. Yay. It would probably be a bit faster if a bunch of jobs hadn't bunched together due to, like, half a dozen bugs in imatrixjob I had to fix tonight, after implementing them in the first place.

In other news, weirdly enough, with 7 quant jobs I imagine I see better cpu usage than with two. I would have expected the I/O fight to be unbearable, but apparently, it's not that bad. Next I will try with 3 jobs again.

maybe an ld_preload wrapper, or a llama.cpp patch that takes a lock while reading a tensor might do wonders. or simply much higher readahead.

definitely, with "only" three quant jobs, I hardly get more than 5% cpu usage (5% of all cores) . with 7 it was easily twice as much. That's not good. But probably simply due to 7 cpu jobs fighting with 7 I/O jobs evens out the odds.

Well, good night. Maybe when I wake up one model will be successfully verified via rsync, potentially, possibly.

great, all the rsyncs from rich1 timed out. time to start over. at 2 mb per second. is it even feasible... sigh.

The reason they were so slow before was because we likely maxed out the 1 Gbit/s upload and obviously against the hundreds of parallel HuggingFace upload connections rsync only gets a very small share of bandwidth. I checked multiple times today and it was maxed out most of the times. Instead of transferring it might be faster to just redownload them from HuggingFace on nico1. That would save the HDD bandwidth and upload bandwidth on rich1 for more important things. Ideally we would obviously do imatrix computation for reasonably sized models directly on rich1.

The reason they all timed out was because of a power outage caused by the new air conditioning installation at the location the server is hosted. I'm quite surprised it has to start over. Shouldn't rsync be able to resume the transfer?

I also noticed many tasks being blocked/frozen/timeofday on rich1. Any idea why this is the case? rich1 is a node supposed to run 24/7 and so have no timeofday limit.

maybe an ld_preload wrapper, or a llama.cpp patch that takes a lock while reading a tensor might do wonders. or simply much higher readahead.

I just did some research and there indeed is /sys/fs/btrfs/<UUID>/bdi/read_ahead_kb for BTRFS to adjust the read-ahead buffer size. We have plenty of RAM so we likely should increase that.
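Something like this should do it (the mount path and the 16 MiB value are assumptions; the sysfs knob is the one mentioned above):

# look up the filesystem UUID of the btrfs pool and bump its readahead to 16 MiB
UUID=$(findmnt -no UUID /path/to/pool)
echo 16384 > /sys/fs/btrfs/$UUID/bdi/read_ahead_kb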

Also interesting - faster than a raid0 might be a smaller raid0 with a separate disk for temporary files, especially for conversion. It helps marco a lot.

Maybe we could check if we can arrange for that. How large should that temporary disk be? rich1 has some SATA SSD storage and also a 4th HDD that is usually IOPS-maxed by Richard. We could also do 8 TB RAID 0 over 2 HDDs and 4 TB on the 3rd HDD, but a single RAID 0 seemed like the better setup.

But if both nico and you also use it, it's not exactly "for me" either, so maybe even harder to change... :)

It is easy to change as Richard is only using the 4th HDD and the ZFS partition on the other disks. I only use the BTRFS pool as temporary storage for models I download/checkpoints/models waiting to upload while the rootfs of my container is on the SATA SSD. I can always easily free all the space I use on the BTRFS pool so recreating it in an improved way would be no issue. It also would be no issue to backup your container before redoing it and restoring it afterwards exactly the way it is as the backup/restore system on supercomputer uses Proxmox Backup Server which is an established backup solution.

Once we start permanently maxing out the 1 Gbit/s upload bandwidth (which in reality seems more like a 1.2 Gbit/s upload limit) on rich1 there is little reason to further improve our setup. Currently it is fully upload bandwidth limited and basically just piles up upload tasks until the HDD slows down. Maybe it would make sense to limit the upload tasks to avoid it piling up too much:

           40   17 si openthoughts3_10k_llama3                     run/static 10/12,Q5_K_S [280/292] (hfu Q3_K_L Q3_K_M Q3_K_S Q4_K_M Q6_K Q8_0 f16)
           40   17 si L3.1-Pneuma-8B-0429                          run/static 8/12,Q3_K_L [235/292] (hfu Q3_K_M Q3_K_S Q6_K Q8_0 f16)
           40   15 si Ice0.107-04.05-RP-ORPO-v2                    run/static 7/12,Q3_K_S [219/291] (hfu Q2_K Q3_K_M Q6_K Q8_0 f16)
         9999    7    SmolLM2-1.7B-magpie-ultra-v1.0-random        blocked/nonempty 362m

The reason they were so slow before was because we likely maxed out the 1 Gbit/s upload and obviously against the hundreds of parallel HuggingFace upload connections rsync only gets a very small share of bandwidth

Interesting how you state this with such certainty when it's 100% false :) The files already exist on nico1, so all the rsyncs do is read the file on both sides and compare checksums. The limiting factor is the disk bandwidth.

And possibly the network quality, I get 30% packet loss at the moment from kaos to rich1.

Shouldn't rsync be able to resume the transfer?

Again, the bottleneck is disk bandwidth. It has to read the files to know what's inside. And no, rsync has no way to resume a transfer - for that, it would have to save state, but it has no such functionality afaik. It can "resume" in the sense of saving network bandwidth, but it has to start analyzing the files from scratch (or be told to blindly trust it, e.g. with --append, in which case we won't need the rsync as it can only lead to corruption).
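For context, the verification pass is conceptually just something like this (host and paths are placeholders), where rsync reads both copies, checksums them block-wise and only ships the blocks that differ - which is why the disks, not the network, set the pace:

rsync -av --partial --inplace rich1:/tmp/quant/Model.gguf /tmp/quant/Model.gguf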

I will not dispense with the rsync - we already had multiple cases where the redownloaded version on nico1 was different (version differences, races, bugs), and I really don't want another source of errors for imatrix generation.

I also noticed many tasks being blocked/frozen/timeofday on rich1. Any idea why this is the case? rich1 is a node supposed to run 24/7 and so have no timeofday limit.

Good catch! I had copied over large parts of /etc from nico1 to restore the box, and forgot about the cron jobs. Also, on the fifth rsync I forgot about vnstat, so that statistic has been lost. And probably more.

Also, discussing configuration - while I can double cpu usage by running 7+ jobs, this is a very inefficient way of using rich1 - nice for experiment, not nice for being a minor user. Unfortunately, the scaling is not linear, so reducing the jobs will considerably slow it down.

On the other hand, the disks are extremely slow, pretty much what is expected of the hardware, of course. I have ample experience with that on my other boxes, and it works there because their cpu is also considerably slower. on marco, I have a separate spindle for temp files, which helps a lot, but it's also limited by the disk speed.

So, going forward, I guess at most two jobs is reasonable. In fact, probably only one quant job is reasonable. It could easily be reduced to a much smaller number of cores, which as an advantage gives more freedom for other uses of the box, since the majority of cores are guaranteed to be idle then.

Things will be skewed, for example, if the quantizer works on a cpu-intensive IQ2 quant, then cpu will be the bottleneck (and/or llama.cpp's parallelising architecture), but that's only the case for comparatively little time.

So, we should be realistic - the disk storage is by far the bottleneck, and then comes network I/O.

And that's fine, we should then work on finding a config that uses the maximum of what is available (i.e. disk) and then uses the least amount of other resources to use it.

ah, adding to the previous, we might want to reduce the number of parallel uploads. They are at 80 hfu's on rich1 due to its historically bad network connection. The new connection is also abysmal, but far better and certainly more even.

The parallel uploads further make the situation worse for the disk, so maybe we should still allow a huge number of quants "uploading" but limit the number of actual active uploads. That should be relatively easy to implement. We have something like that in place, but it simply limits the number of hfu's and waits; what we want is something that doesn't pause quantisations.

The reason why we want such a queue is because I suspect that we will generate quants in bursts (sometimes an hour without uploads because of model conversion, then a lot of quants in quick succession).

Of course, most of this is currently "for fun", because we will run out of models again, fortunately.

update: also, limiting the number of parallel disk reads by quantising processes. that would help nico1, too, but will require support inside the quantize process.

I think I forgot to mention another detail: two of the rsyncs were actually transferring data, but they had the same speed. It's UIGEN* and Q3-*.

And after UIGEN started at 50+MBps, it's now at 10kBps, probably because it reached the end of the portion that is already on nico1, and the packet loss at the moment effectively prohibits much activity.

And no, the packet loss cannot be explained by hundreds of parallel uploads. If it was fair, we'd need 15000 active uploads to explain the bandwidth, and there still should be <1% packet loss, due to tcp's dynamic adjustments. tcp is not fair, but should be within a reasonably small factor, and should never cause this much packet loss.

right, the packet loss is between frankfurt de-cix and ... some russian carrier. hoola, that's wild :)

Interesting how you state this with such certainty when it's 100% false :) The files already exist on nico1, so all the rsyncs do is read the file on both sides and compare checksums. The limiting factor is the disk bandwidth.
I will not dispense with the rsync - we already had multiple cases where the redownloaded version on nico1 was different (version differences, races, bugs), and I really don't want another source of errors for imatrix generation.

So you download from HuggingFace and run convert on both sides and then use rsync to patch the file on nico1 to match the one on the source in case it is different? That makes a lot of sense.

Again, the bottleneck is disk bandwidth. It has to read the files to know what's inside. And no, rsync has no way to resume a transfer - for that, it would have to save state, but it has no such functionality afaik. It can "resume" in the sense of saving network bandwidth, but it has to start analyzing the files from scratch (or be told to blindly trust it, e.g. with --append, in which case we won't need the rsync as it can only lead to corruption).

Sorry, I thought the bottleneck was bandwidth, as it is currently maxed out most of the time as well. Obviously it has to read the entire file on both sides to compare it.

Good catch! I had copied over large parts of /etc from nico1 to restore the box, and forgot about the cron jobs.

Great. Thanks for resolving this issue.

Also, on the fifth rsync I forgot about vnstat, so that statistic has been lost. And probably more.

We still have a backup of old rich1 so we could upload a renamed copy back to rich1 in case we ever want to look at the historical statistics.

Also, discussing configuration - while I can double cpu usage by running 7+ jobs, this is a very inefficient way of using rich1 - nice for experiment, not nice for being a minor user. Unfortunately, the scaling is not linear, so reducing the jobs will considerably slow it down.

Richard doesn't at all care about efficiency. The CPU is pretty much idle all the time as we mainly care about the A100 GPUs. We also don't have to pay for electricity cost as that's on the company where the server is currently located. Your use of it being inefficient is perfectly fine.

On the other hand, the disks are extremely slow, pretty much what is expected of the hardware, of course. I have ample experience with that on my other boxes, and it works there because their cpu is also considerably slower.

We might be able to install some large SATA SSDs once Richard physically visits supercomputer in a few weeks.

On marco, I have a separate spindle for temp files, which helps a lot, but it's also limited by the disk speed.

As mentioned before, we might be able to replicate this setup if you can tell us how large that disk needs to be. We can also give you some SATA SSD storage.

So, going forward, I guess at most two jobs is reasonable. In fact, probably only one quant job is reasonable. It could easily be reduced to a much smaller number of cores, which as an advantage gives more freedom for other uses of the box, since the majority of cores are guaranteed to be idle then.

2 jobs are way too little. You really should just choose the number of jobs that result in the highest throughput. I'm sure that's the way Richard wants you to use his server.

Things will be skewed, for example, if the quantizer works on a cpu-intensive IQ2 quant, then cpu will be the bottleneck (and/or llama.cpp's parallelising architecture), but that's only the case for comparatively little time.

I yet have to see you maxing out that CPU. Hopefully some parallel I-quant jobs will do.

So, we should be realistic - the disk storage is by far the bottleneck, and then comes network I/O.

When I check nload we are currently pretty much always maxing out the upload bandwidth, so even if we keep optimizing the disk side, at some point we will just switch to being upload bandwidth bottlenecked.

And that's fine, we should then work on finding a config that uses the maximum of what is available (i.e. disk) and then uses the least amount of other resources to use it.

I agree. To what should we increase the BTRFS read-ahead size?

ah, adding to the previous, we might want to reduce the number of parallel uploads. They are at 80 hfu's on rich1 due to its historically bad network connection. The new connection is also abysmal, but far better and certainly more even.

While the internet connection on the new rich1 is quite stable, please make sure rich1 has a good local scheduler. It sometimes has internet outages. There is a construction site across the street that really hates the fiber optic cables, and the equipment from the ISP already required some replacements in the past after it broke, but it's now hopefully all resolved. I think the local scheduler should already do quite well on rich1, as we will likely often use it for massive models. Maybe we need to make it so the upload limit only applies if there is internet connectivity, or we would stop all progress on an internet outage.

The parallel uploads further make the situation worse for the disk, so maybe we should still allow a huge number of quants "uploading" but limit the number of actual active uploads. That should be relatively easy to implement. We have something like that in place, but it simply limits the number of hfu's and waits; what we want is something that doesn't pause quantisations.

I agree. Many concurrent upload connections are not that useful on HDD, and in our case many parallel uploads but with fewer upload connections probably make the most sense.

The reason why we want such a queue is because I suspect that we will generate quants in bursts (sometimes an hour without uploads because of model conversion, then quickly many quants).

I expect such bursts as well but if we do many models in parallel they should be way less likely to occur.

Of course, most of this is currently "for fun", because we will run out of models again, fortunately.

We always have the old MLA models we need to requant for which rich1 would be perfect. But eventually we might even run out of those.

update: also, the number of parallel disk reads by quantising processes. that would help nico1, too, but will require support inside the quantize process.

There are parallel disk reads for quantization? I thought it reads a tensor, quantizes it, and repeats.

I think I forgot to mention another detail: two of the rsync's were actually transferring data, but they had the same speed. It's UIGEN* and Q3-*.
And after UIGEN started at 50+MBps, it's now at 10kBps, probably because it reached the end of the portion that is already on nico1, and the packet loss at the moment effectively prohibits much activity.
And no, the packet loss cannot be explained by hundreds of parallel uploads. If it was fair, we'd need 15000 active uploads to explain the bandwidth, and there still should be <1% packet loss, due to tcp's dynamic adjustments. tcp is not fair, but should be within a reasonably small factor, and should never cause this much packet loss.

So rsync is packet loss and not disk bottlenecked?

right, the packet loss is between frankfurt de-cix and ... some russian carrier. hoola, that's wild :)

Damn I didn't think it would go over Russia on a connection between Germany and Malaysia. Maybe it would be worth trying to directly sync to nico1 without going over your server.

So you download from HuggingFace and run convert on both sides and then use rsync to patch the file on nico1 to match the one on the source in case it is different? That makes a lot of sense.

Yeah, it's the "hfdprep" state in imatrix, it's used for hosts which are severely bandwidth-limited. You forgot about it, because we have been doing it for so long (for marco and rich1).

Your use of it being inefficient is perfectly fine.

very good :) concurrent quants still use memory though. what's the situation there?

We might be able to install some large SATA SSDs once Richard physically visits the supercomputer in a few weeks.

Just to be sure, I am not complaining, not begging for more. Just trying to get the most efficient use out of everything that is most compatible with our donor(s) :)

It sometimes has internet outages

Problem right now is more spurious dns resolution failures due to packet loss. As for outages, there is not that much I can do about it, unfortunately, as in many cases, when a connection is lost, I can't say why that is - e.g. when i run "ssh rich1 rsync" I don't know what the rsync is doing when ssh fails.

stuff that runs locally can often be retried (currently, that is only the uploads).

Maybe we need to make it so the upload limit only applies if there is internet connectivity or we would stop all progress on an internet outage.

It's ok to stop progress during an outage. If you mean stop forever, I don't see how that would happen - either it fails, or retries. Right now, uploads are retried forever for conditions I feel safe about (not dns failure, but maybe that can be added), mostly as a failsafe; we don't want endless loops, as that relies on somebody noticing that something might be wrong, as has happened e.g. when you noticed an upload goes on forever because python is stuck again.

When I check nload we are currently pretty much always upload bandwidth bottlenecked, so even if we keep optimizing elsewhere, we will just stay upload bandwidth bottlenecked.

That is definitely not due to me then, though. While I bump into ~1.5GBps often, the majority of the time I upload at less than 600MBps, sometimes drastically less.

And yes, that is what I mentioned as well: if we get the disk subsystem faster, we will run into network next. But there is no danger of running into that right now, at least if I was the only user.

I expect such bursts as well but if we do many models in parallel they should be way less likely to occur.

But that is extremely inefficient because they fight for disk.

There are parallel disk reads for quantization? I thought it reads a tensor, quantizes it, and repeats.

For a single llama-quantize process, I think so, yes. But we are always running more than one, and the pattern is that they fight for disk, then quantize, or one quantizes while the other reads, or some combination of that.

It would be closer to trivial to have an fcntl lock around the tensor read to serialise disk reads. But(!) it's not trivial to have a limit that isn't ONE parallel read, so that means the single quantize read will then compete with all other I/O, which might slow it down. But indeed, that could be compensated (to an extent only) by having many quantize processes.
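
For illustration, a minimal sketch of that idea in Python (the lock path and function are made up; the real change would have to live inside llama-quantize or a wrapper around its reads):

    import fcntl

    LOCK_PATH = "/tmp/quantize-read.lock"   # hypothetical shared lock file

    def read_tensor_serialized(f, nbytes):
        """Do one large sequential read while holding an exclusive advisory
        lock, so concurrent quantize processes don't interleave their reads
        and turn the HDD access pattern into seeks."""
        with open(LOCK_PATH, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)       # blocks until no other reader holds it
            try:
                return f.read(nbytes)              # the tensor read itself
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)   # released on close anyway, but be explicit

As said, this gives exactly ONE reader at a time; allowing, say, two concurrent readers would need a counting semaphore (e.g. several lock files tried in turn), which is the non-trivial part.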

Even limiting active uploads is nontrivial already.

So rsync is packet loss and not disk bottlenecked?

Those two only, yes, because due to when I implemented it, hfdprep was not in place yet. The others, and all following ones, should be disk I/O-limited because they will mostly transfer checksums, if all goes well.

Damn I didn't think it would go over Russia on a connection between Germany and Malaysia.

He, I worded it carefully to say russian carrier, not russia. I don't know where it was routed physically (malaysian telecom doesn't like to be traced). But yesterday, the route was very different, it went from helsinki to malaysia telekom and then I lost track. Today it goes to de-cix first. Fascinating.

The reverse path is also fascinating, it goes through 192., then the public IP, then a series of 10. addresses, which is... holy shit, which carrier uses private home-network addresses instead of the 100.64/10 space reserved for that. Maybe it's a big company that routes internally first, that would explain it.

None of that is relevant, just fascinating.

routing via nico might randomly work better, but neither nico nor rich has a fixed ip, again. besides, it tends to use the same routes (right now, nico goes via hurricane electric (as did kaos yesterday), a very nice carrier which provides free high-quality ipv6 tunnels for decades now, and more: https://tunnelbroker.net/ - I'm a certified he ipv6 guru, haha).

But some things are difficult to explain:

0    ? UIGEN-FX-30B-08-26                            run/hfd (from rich1) 93% 222.61kB/s
0    ? Qwen3-42B-A3B-2507-YOYO2-TOTAL-RECALL-Instruct run/hfd (from rich1) 59% 102.69kB/s
0    ? Seed-OSS-36B-Base-Instruct-Karcher-Merge      run/hfd (from rich1) 24% 58.03MB/s
0    ? OpenAi-GPT-oss-36B-BrainStorm20x-uncensored   run/hfd (from rich1) 13% 82.93MB/s

UIGEN actually transfers data (== file is not on destination), the others not. The first two (<1MBps) are running for relatively long, the bottom two have been started recently. All go via the tunnel to kaos, then nico1.

The only explanation I have is that tcp gets slower and slower due to uneven packet loss, so long-running tcp connections basically starve and take a long time to get up to speed again when the connection is better. But there are lots of others, involving meddling somewhere.

Oh, and the problem of active uploads is even harder, as of course there is a hashing phase first, which only does disk I/O.

very good :) concurrent quants still use memory though. what's the situation there?

We currently have 256 GiB of RAM in Richard's supercomputer. We dynamically allocated 200 GiB of it to your container. The vast majority of the time, we need very little RAM. In the rare instances we do need more we can just pause rich1 if there isn't enough. Please make use of the 200 GB you have. We are considering upgrading to more RAM in the future. The server currently uses 4x64 GB and if I remember correctly Richard’s supercomputer has 32 RAM slots so we technically could upgrade it up to 2 TiB just by adding more RAM sticks.

routing via nico might randomly work better, but neither nico nor rich has a fixed ip, again. besides, it tends to use the same routes (right now, nico goes via hurricane electric (as did kaos yesterday), a very nice carrier which provides free high-quality ipv6 tunnels for decades now, and more: https://tunnelbroker.net/ - I'm a certified he ipv6 guru, haha).

I love IPv6. In the past I sometimes even disabled IPv4 on the router because I feel it is outdated and should no longer exist. Unfortunately it is still mandatory for some websites and services. Both nico1 and rich1 have native IPv6 support.

I wonder if our ISPs ever rotate IPv6. I will check next time rich1 rotates, and if it doesn't change I will reboot the router and see if it rotates for me. If not maybe we could just use IPv6 to transfer data between nico1 and rich1.

The only explanation I have is that tcp gets slower and slower due to uneven packet loss, so long-running tcp connections basically starve and take a long time to get up to speed again when the connection is better.

If I remember correctly TCP keeps doubling speed until it discovers significant packet loss (Slow Start). But after that it uses factors like packet loss events, latency/RTT variation and ACK arrival patterns depending on the congestion control algorithm used. On rich1 that is BBR, as we set net.ipv4.tcp_congestion_control = bbr, which, unlike loss-based algorithms such as CUBIC, mainly relies on estimated bottleneck bandwidth and RTT rather than packet loss. Just in case it is relevant, here are all of rich1's network-related sysctl settings. Feel free to suggest any changes.

net.ipv4.tcp_mem = 4096 87380 67108864
net.ipv4.udp_mem = 4096 87380 33554432
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.core.rmem_default = 33554432
net.core.wmem_default = 33554432
net.ipv4.udp_wmem_min = 16384
net.ipv4.udp_rmem_min = 16384
net.core.wmem_max = 134217728
net.core.rmem_max = 134217728
net.core.busy_poll = 50
net.core.busy_read = 50
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535
net.core.netdev_budget = 1000
net.core.optmem_max = 65535
net.ipv4.tcp_frto = 0
net.core.somaxconn = 32768
net.core.netdev_max_backlog = 32768
net.core.dev_weight = 64
net.core.default_qdisc = fq
net.ipv4.ip_forward = 1

Oh, and the problem of active uploads is even harder, as of course there is a hashing phase first, which only does disk I/O.

I completely forgot about that one. We probably want to try to upload relatively quickly after hashing so the file is likely to still be cached in RAM unless we have an insane amount of concurrent uploads.

If not maybe we could just use IPv6 to transfer data between nico1 and rich1.

ipv6 overhead is not entirely free, but yeah, that would essentially give a fixed ip. i don't think it helps much, since normally, rsync transfers little. right now, i have a shamefully hacky socat call in kaos's inetd.conf that simply forwards port 16something to nico1:22. and if kaos fails, everything fails :-)

still, just having the tunnel would be useful. can't use my script to create the tunnel, though, but with four nodes, that's not an issue :)

If I remember correctly TCP keeps doubling speed until it discovers significant packet loss

It could be entirely different, too; for example, rsync has trouble with large files with very few scattered changes, and can scan for a long time, resulting in very low transfer speeds. I.e. if rsync takes a minute to find the next chunk, it might display 1KBps average speed during that time. This is data-dependent, and I have seen cases where whole-file copies were faster than rsync's incremental transfers, and these are algorithm-related, and not I/O speed related.

Doesn't look like this is the case here, but it could explain some.

BBR

well, congestion algorithms will not magically fix problems, but bbr is, I think, a very good one.

IPv6

Can't say I hate ipv6, but it's clearly second-system syndrome, and very badly designed (on a human level). And I think the people who have to implement it in networks mostly agree, as ipv6 had horrible adoption rates, far worse than what a stupid hack that simply increased address sizes would have had. The only reason it was adopted is because it had to be. The reason for the horrible rates was the stubbornness of the people to implement what was actually needed (dhcp6, nat, private addresses and more). We have all these things now, and ipv6 is entirely usable, but it was a decades-long war. We could have had IPv6 a decade earlier, at least, if not for these holdout positions.

I still don't have native ipv6 for my vodafone home connection in germany, because of a lack of supporting protocols that only exist for ipv4 and were deemed not needed, or bad, for ipv6. Well, the pragmatics have eventually won over the ivory tower. But too late.

settings

Defaults should be fine for a normal network, and rich1's connection I would classify as such. Not the old one, though :)

I've implemented a simple slot mechanism for active uploads, which simply wraps the api.upload_folder method, so it includes hashing. For some reason it's not active, and maybe it fails. It should be one file per slot in /dev/shm/slotlock-.hfu-active
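
Conceptually that is just "grab one of N lock files before calling api.upload_folder"; a rough Python equivalent (slot count, path prefix and naming are illustrative, not the actual implementation):

    import contextlib
    import fcntl

    N_SLOTS = 3   # illustrative; the real number of active-upload slots may differ

    @contextlib.contextmanager
    def upload_slot(prefix="/dev/shm/slotlock-"):
        """Hold one of N slot lock files for the duration of an upload
        (including the hashing phase); block if all slots are busy."""
        fh = None
        for i in range(N_SLOTS):
            f = open(f"{prefix}{i}.hfu-active", "w")
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)   # try a free slot
                fh = f
                break
            except BlockingIOError:
                f.close()
        if fh is None:
            # all slots busy: queue up on slot 0 (a simplification, not fair scheduling)
            fh = open(f"{prefix}0.hfu-active", "w")
            fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fh.close()   # closing the fd releases the flock

    # usage sketch: with upload_slot(): api.upload_folder(...)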

question: what, exactly, is stored in the local xet cache? if i have a per-upload xet cache, will this essentially erase any benefits? i am surprised it is needed at all.

50% cpu at 1% wait time, but only 50MBps upload. hmm.

update:

90% cpu, 1% wait time, 3 uploads giving 70MBps. but the 3 Q and 2 I-quants write 200MBps of quant data. (also, no conversions running, they use a lot of I/O)

sucks that we apparently need 6 upload slots to fill 1.5GBps.

question: what, exactly, is stored in the local xet cache? if i have a per-upload xet cache, will this essentially erase any benefits? i am surprised it is needed at all.

The XET shard_cache is used for deduplication and to save upload bandwidth. It caches hashes of chunks you already uploaded plus some it downloads from the content-addressed store for global deduplication. When it then splits the file to upload into chunks and discovers the hash it computed is already in the shard_cache it knows it doesn't have to upload it.

There should be no need for a per-upload XET cache as it is supposed to honor HF_XET_SHARD_CACHE_SIZE_LIMIT and expire after 1 month. Are we sure the latest version of XET has still not fixed this issue? It already matured quite a lot. I downloaded and uploaded quite a lot of repositories on my supercomputer container and my XET cache currently sits at 7.9 GiB chunk cache and 1.9 GiB shard cache using default XET settings.

But back to your question: there is technically nothing that speaks against having a per-upload shard_cache except it being kind of pointless as then there will be basically no deduplication. Please at least use the same cache for an entire model as then there will be a lot of deduplication, as many quants contain identically quantized layers. Please take a look at https://huggingface.co/spaces/xet-team/quantization-dedup - it is very fascinating how much data can be deduplicated in a HuggingFace repository containing GGUF quants thanks to XET.

Using XET the proper way would reduce upload bandwidth by quite a lot. I recommend we reenable XET for uploads as soon as feasible.

I noticed that HuggingFace fixes all issues posted under https://github.com/huggingface/xet-core/issues so if the shard_cache still grows in an out-of-control way we should create an issue there and have them fix it.

For more information, please read https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet

https://huggingface.co/spaces/xet-team/quantization-dedup is really extremely interesting:

The deduplication savings here take a 191GB repo and cut it down to 97GB, helping to shave a few hours off the upload time.

So by having a per-model XET cache we would alone cut the upload bandwidth in half. That is quite a massive saving of resources and with us having static and imatrix quants instead of just imatrix quants the savings would probably be even greater. And this is even without considering any global deduplication and compression also added by XET. Finetunes of the same model can be globally deduplicated by quite a lot as can be seen under https://huggingface.co/spaces/xet-team/finetune-dedupe and https://huggingface.co/spaces/xet-team/repo-graph.

3am insomnia update: reminder that server is in malaysia with not the best provider. The highest speed netherlands <-> malaysia was 15 MB per second. Their (isp) routing is shit, so while connections to nearby asia are relatively fast, good luck with anything else

it being kind of pointless as then there will be basically no deduplication.

That is the part that makes little sense to me. The deduplication should happen regardless. What you say means that if I upload files from different nodes (separate cache) they won't be deduplicated. That is why I wonder what is stored inside.

If it only caches checksums of local files, then it means repeated uploads of the same file will be faster. That would make sense for a cache. I can also get it if some server-side checksums are cached for similar reasons. But it should be an optional cache, and should not affect deduplication?

If it does, then xet won't be able to deduplicate between repos from different users (because bartwoski shared no cache with us), but that is what they claim they do. Therefore, the cache must not be needed for dedup.

The highest speed netherlands <-> malaysia was 15 MB per second.

It's not bad, really. I had no problem getting 50mb to kaos, while kaos was also downloading, so that was optimal. So it's probably more like the old rich1, where it's a bit spotty, but oftentimes you get good speeds (and, to be honest, much more "oftentimes" than with the old rich1 :)

The "problems" were/are clearly disk I/O speed.

So by having a per-model XET cache we would alone cut the upload bandwidth in half.

That is why I am asking, but I have been burned so often by xet, I really want a solution that works. And if a per-upload cache means no deduplication, their system is beyond broken. I can't believe that.

But likewise, how would the cache grow to many many gigabytes if they only cache checksums?

Please take a look at https://huggingface.co/spaces/xet-team/quantization-dedup

I know this page, it's not helpful.

https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet

Thanks - but nothing in there supports your claim that a per-upload cache would cause worse or no deduplication. To the contrary, it is just used for resumes and for the actual deduplication of the files being uploaded.

So I think in the medium term, I'll finally look into a per-upload cache, so we can use xet.

Anyway, thanks for pointing me at the sparse documentation. It did answer my question enough so I feel confident implementing it (as a per-upload cache) :)

Internet on new rich1 is more stable, until it's not and you can only ping 127.0.0.1 for 3 days

And it is possible that most of the size increase comes from the download cache, which has close to zero value for us (other than per-download, where we hopefully already use xet).

It's a bit unwieldy, but I'll try a 512MB tmpfs for the cache, like this. Hope that's enough.

         # point the HF caches at a private tmpfs so nothing accumulates between uploads
         local $ENV{HF_HUB_CACHE} = "/dev/shm/hffs-tmp/cache-hub";
         local $ENV{HF_XET_CACHE} = "/dev/shm/hffs-tmp/cache-xet";
         local $ENV{HF_XET_SHARD_CACHE_SIZE_LIMIT} = 400<<20; # 400 MiB, below the 512 MiB tmpfs

         # run the python upload code in its own mount namespace with a fresh tmpfs
         IPC::Open2::open2 ($r, $w,
            "/usr/bin/unshare", "-m", "--",
            "sh", "-ec", '
               mkdir -m 0 -p /dev/shm/hffs-tmp
               mount -t tmpfs -o size=512m,mode=1777 none /dev/shm/hffs-tmp
               exec /llmjob/share/python/bin/python3 -c "$1"
            ', "--", $code
         );

anyone know a way to get xet statistics for an upload?

rm -rf /*

so, if anybody would want to optimize in the future, i think the single most effective hardware optimisation would be temporary storage large enough for one model on a separate spindle (just for temp files during conversion). that also made the biggest difference on marco.

@nicoboss I don't want to harass you with this, I just assume chances are high these got forgotten: did you have an opinion on the kv-var length limit that keeps us from quantizing long-named models, and can you look at why "(the top errored out model)" on nico2 failed, I assume OOM.

@nicoboss don't think btrfs has specific read-ahead, it would be the normal read-ahead (afaik, fs-readahead and normal readahead are the same since linux 2.something, you can set them with blockdev --setra on the mounted device, and the higher, the better, ideally it would be tensor-sized, so a few GB, but I am sure multiple readaheads will also fight with each other)

shard cache size datapoint: during an upload of Superthoughts-lite-v2-MOE-Llama3.2 - all static quants, 38GB, the shard cache was 56MB till a few seconds before the end, where it increased to 70MB.

I just found a much better article explaining XET deduplication: https://huggingface.co/blog/from-chunks-to-blocks

But likewise, how would the cache grow to many many gigabytes if they only cache checksums?

It's one checksum per 64 KB of data which is quite a lot if we upload TB of data per day.

Anyway, thanks for pointing me at the sparse documentation. It did answer my question enough so I feel confident implementing it (as a per-upload cache) :)

You think it will deduplicate across multiple quants even if you do it per upload?

That is why I am asking, but I have been burned so often by xet, I really want a solution that works. And if a per-upload cache means no deduplication, their system is beyond broken. I can't believe that.

I'm really not sure about that. I try to understand their documentation but find no answers. Maybe read https://huggingface.co/blog/from-chunks-to-blocks and see if you understand it.

And it is possible that most of the size increase comes from the download cache, which has close to zero value for us (other than per-download, where we hopefully already use xet).

I mean it reduces some download bandwidth but deduplication there will be minimal anyway so not really worth it.

It's a bit unwieldy, but I'll try a 512MB tmpfs for the cache, like this. Hope that's enough.

tmpfs is perfect for this. Not sure if 512 MB will be enough but we will see.

anyone know a way to get xet statistics for an upload?

What kind of statistics do you need?

so, if anybody would want to optimize in the future, i think the single most effective hardware optimisation would be temporary storage large enough for one model on a separate spindle (just for temp files during conversion). that also made the biggest difference on marco.

How large does it need to be?

@nicoboss I don't want to harass you with this, I just assume chances are high these got forgotten: did you have an opinion on the kv-var length limit that keeps us from quantizing long-named models

I'm a bit afraid that just increasing that limit could break compatibility with any of the many libraries and applications using GGUFs but sure we could patch llama.cpp to no longer have this limit.

and can you look at why "(the top errored out model)" on nico2 failed, I assume OOM.

I will tomorrow once it is on again.

@nicoboss don't think btrfs has specific read-ahead, it would be the normal read-ahead (afaik, fs-readahead and normal readahead are the same since linux 2.something, you can set them with blockdev --setra on the mounted device, and the higher, the better, ideally it would be tensor-sized, so a few GB, but I am sure multiple readaheads will also fight with each other)

I think there is a BTRFS pool-specific read-ahead. For example on nico1 for your spool we have 4 MiB which I assume is the default as I never changed it. How much do you want to set on rich1?

root@StormPeak:~# cat /sys/fs/btrfs/15ba0231-4a65-44c2-a84d-1b8040b9e6d3/bdi/read_ahead_kb
4096

You think it will deduplicate across multiple quants even if you do it per upload?

Of course. How else would it work? They do deduplicate globally - if the cache were required for that, we would have to upload all models of all users on huggingface together in one single upload.

The way this must work is that the client splits into chunks, then asks the server for which chunks already exist, then skips their content. That's how all these systems work (and rsync, too).

What XET does is create some kind of superchunks, to reduce the amount of checksum data even further. Who knows, it might even be a hash tree, I have not looked into it.

Sure, they could do it differently, but then they couldn't deduplicate between repositories or files when we upload manually - if you look, files get converted one by one to XET on the server side if they are not uploaded via xet.

I have not really looked into the code, and their docs are... hardly helpful marketing material. So I could be phenomenally wrong. But from a technical standpoint, the design must be broken if it stores identical chunks twice on the server, and since the server needs to know about the chunks in advance anyway, it would be idiotic to design it so that the client uploads anyway and the server knows in advance it goes to dev null because it already has the data.
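
A minimal sketch of that "hash locally, ask the server what it already has, upload only the rest" idea (the fixed-size chunks and the server object are made up for illustration; real xet uses content-defined chunk boundaries and its own wire protocol):

    import hashlib

    CHUNK_SIZE = 64 * 1024   # roughly the ~64 KB chunk granularity mentioned below

    def upload_deduplicated(path, server):
        """Chunk a file, ask the (hypothetical) content-addressed store which
        chunk hashes it already knows, and upload only the missing chunks."""
        chunks = []
        with open(path, "rb") as f:
            while data := f.read(CHUNK_SIZE):
                chunks.append((hashlib.sha256(data).hexdigest(), data))

        known = server.query_chunks([h for h, _ in chunks])   # hashes already in the CAS
        for h, data in chunks:
            if h not in known:
                server.put_chunk(h, data)                      # only new data crosses the wire
        # the uploaded file is then just an ordered list of chunk references
        return [h for h, _ in chunks]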

What kind of statistics do you need?

Would be nice to see how much of the upload was saved, really that's all.

How large does it need to be?

It needs to hold one model. With the current config 700GB, which is good for ~350B typically.

I think there is a BTRFS pool-specific read-ahead. For example on nico1 for your spool we have 4 MiB which I assume is the default as I never changed it. How much do you want to set on rich1?

I don't think there is something like a "btrfs pool" - I think you mean filesystem. The closest to a btrfs pool would be lvm2. The bdi entry you see is just a symlink to /sys/class/bdi, i.e. the backing device(s), which is the generic vfs readahead setting which exists for all devices. You can probably just echo to it, it might be simpler because it's more stable (fs-uuid) than a device name.

In any case, you can't go wrong, all of these readahead settings are one and the same internally.

I would say try 8GB and we'll see how far we get.
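
For reference, a small sketch of bumping that value via the bdi entry (it just writes the KiB value the same way an echo would; this sketch hits every mounted btrfs filesystem it finds, in practice you would pick just the spool filesystem's uuid):

    from pathlib import Path

    READ_AHEAD_KB = 8 * 1024 * 1024   # 8 GiB, expressed in KiB as the kernel expects

    # each mounted btrfs filesystem exposes its backing-device settings via a bdi symlink
    for ra in Path("/sys/fs/btrfs").glob("*/bdi/read_ahead_kb"):
        old = ra.read_text().strip()
        ra.write_text(str(READ_AHEAD_KB))   # same knob blockdev --setra tweaks, just in KiB here
        print(f"{ra}: {old} -> {READ_AHEAD_KB}")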

Another option (not implemented yet, though it would be a small hack) would be to either mlock small models, or use a tmpfs for small models. I'd prefer tuning though, less work :)

I've re-read that article, and it just doesn't say. We get:

the content-addressed store (CAS) backing all repositories

But that doesn't mean much. Of course, literally it means there is ONE cas for ALL repos, but whoever wrote that might not have cared for that level of detail and meant the "cas implementation/code backing all repos", for example.

can you look at why "(the top errored out model)" on nico2 failed, I assume OOM.

It did 32 GiB cgroup OOM. It currently blocks any other model on nico2 so I marked it as override. I will manually convert it to GGUF but first I have to create the setup to do so on nico2 as that currently only exists on nico1. Maybe we should consider raising the cgroup limit if all hosts can handle it so we have fewer such cases.

[  595.801814] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/111/ns/system.slice/llmjob-wrap-ROLEPL-AI-v2-Qwen2.5-72B-noquant-576.scope,task_memcg=/lxc/111/ns/system.slice/llmjob-wrap-ROLEPL-AI-v2-Qwen2.5-72B-noquant-576.scope,task=pt_main_thread,pid=5646,uid=100000
[  595.801885] Memory cgroup out of memory: Killed process 5646 (pt_main_thread) total-vm:7386280kB, anon-rss:178332kB, file-rss:593588kB, shmem-rss:0kB, UID:100000 pgtables:2300kB oom_score_adj:0

I will manually convert it to GGUF but first I have to create the setup to do so on nico2 as that currently only exists on nico1.

Or I can change the 32GB limit temporarily. Strange that it blocked other models, it was starting quant jobs successfully this morning (it did block models because it had no status file, which I created manually). If you retry the job and it gets killed, it would explain the blocking, but I don't know if you did :)

Maybe we should consider raising the cgroup limit if all hosts

Most hosts have 32GB, so it's already too large, kind of. It's more a runaway protection thing. I just find it fishy that we suddenly need so much memory for so many models when this wasn't an issue before. Smells like a change in llama.cpp

Great, switched to XET and suddenly we have lots of hanging uploads, all still with a progressbar.

It needs to hold one model. With the current config 700GB, which is good for ~350B typically.

I just talked with Richard and we will provide you a 2 TB HDD. We used it with ZFS but stopped using it due to reliability issues. I recommend not to store any important files on it and keep in mind that there is a possibility of I/O errors. The HDD is working perfectly fine but we are too scared to trust it for any important data making it the perfect candidate for what you need.

I've disabled XET again :( 9 uploads hanging in 12 hours is too much.

For reference, python is not doing anything, and output looks like this (it's at 100% here because it's my last example, but usually it's less):

Processing Files (1 / 1)                : 100%|██████████| 3.28GB / 3.28GB, 2.15MB/s  
New Data Upload                         : 100%|██████████| 3.06GB / 3.06GB, 2.15MB/s  
  ....107-04.05-RP-ORPO-v2.i1-IQ3_M.gguf: 100%|██████████| 3.28GB / 3.28GB            

strace shows various python threads blocking in epoll_wait, some futexes, and of course the obligatory python busy-waiting task looping in a loop calling nanosleep. Bah.

Incidentally, since this is already a retry (the first try uploaded 3.126G), it does show some resume capability - it's now "only" uploading 3.06G rather than just the missing 0.15GB. (Yes, there are my statistics, too :)

As for statistics, it does something. For the single-quant successful GLM IQ4_XS quant upload, we do get savings. Not much, but would definitely be worth it. Pity it's unusably unstable.

Processing Files (2 / 2)                : 100%|██████████| 66.9GB / 66.9GB,  0.00B/s  
New Data Upload                         : 100%|██████████| 62.5GB / 62.5GB,  0.00B/s  
  ...06B-A12B-v1.i1-Q4_K_S.gguf.part2of2: 100%|██████████| 32.5GB / 32.5GB            
  ...06B-A12B-v1.i1-Q4_K_S.gguf.part1of2: 100%|██████████| 34.4GB / 34.4GB            

we will provide you a 2 TB HDD.

Wow, cool :)

I/O errors. The HDD is working perfectly fine but we are too scared to trust it for any important data making it the perfect candidate for what you need.

Ehe, not sure unreliability is what I need, but we can try. In my experience, llama.cpp has working error detection, so we will see how many jobs will fail (it should only affect noquant phases, and unless metadata is damaged, it will be immediately retriable).

You can even enable it yourself - once the disk is there, and no noquant jobs are running, you can rm /llmjob/tdir; ln -s /new/disk/ /llmjob/tdir and it should take effect immediately. That should take a full write-read cycle off the main disk for every model, and the writes are especially "damaging" because they starve reads.

(if something goes wrong, on rich1, the default for tdir is to link to wdir/)

Some more XET hang reports, for reference. They seem to mostly hang at 100%, but the byte counter is often less (e.g. 100%|██████████| 1.54GB / 1.55GB, 6.34MB/s). The python process seems to mostly idle, the process has a few tcp sockets in closed state, likely due to the LD_PRELOAD wrapper we use to enable keepalive. Looks like whatever http library huggingface uses can't cope with the other side closing the connection. Or else.

Sucks. The only thing I can imagine now is to monitor stdout/stderr of the upload process and kill it when it has a period of inactivity. That's not trivial and would have to wait :( I had no idea it was so hard to write an http post in python.
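
A rough sketch of that watchdog idea (the timeout and the way the uploader is invoked are placeholders; the point is just "no output for N seconds means kill the process group"):

    import os
    import select
    import signal
    import subprocess
    import sys

    STALL_TIMEOUT = 600   # seconds with no output before we consider the upload hung (placeholder)

    def run_with_watchdog(cmd):
        """Run an upload command, mirror its output, and kill the whole process
        group if it prints nothing for STALL_TIMEOUT seconds (as the xet hangs do)."""
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, start_new_session=True)
        while True:
            ready, _, _ = select.select([proc.stdout], [], [], STALL_TIMEOUT)
            if not ready:                             # no output for too long: assume it is stuck
                os.killpg(proc.pid, signal.SIGKILL)
                proc.wait()
                return -1
            chunk = proc.stdout.read1(65536)
            if not chunk:                             # EOF: the process finished on its own
                return proc.wait()
            sys.stdout.buffer.write(chunk)
            sys.stdout.buffer.flush()

    # usage sketch: status = run_with_watchdog(["python3", "upload-helper.py"])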

update: experimenting on leia, on other hosts it is off

ROLEPL-AI-v2-Qwen2.5-72B needed 96GiB RAM, according to cgtop. why???? and it slowly increased while loading the model, so for some reason it loaded a considerable amount of tensor data (presumably) into memory, despite --use-temp-file.

but it's indeed very weird that this model does that - on disk, it's not larger than other qwen2.5 72b derivatives.

also, looked it up, quantize no longer limits jobs itself, the llmjob scheduler does. and the limit is 32G everywhere except on nico1, where it is 430G. I will bump up the limit on nico2 to 64G (which did not help, because I tried that first :), but that is pretty meaningless, because the scheduler has no clue in advance of whether a model fits.

interestingly, on rich1, the 6 quant jobs together supposedly consume 180GiB already, and all but one are pretty much at the 32GiB limit. I wonder, is it mmapping and simply using all available memory? I'm not seeing that on "my" nodes, but maybe memory pressure is simply higher.

that must be it, because the ROLEPL-AI quant and the L3.3-Joubutsu2000 noquant jobs are both at 96GiB on nico2. I wonder if both would also fail with the normal 32GiB limit.

could it be some peculiarity of nico2's config (could also be nico1, but there the limit is so high it shouldn't matter), so it calls the oom killer despite not strictly being necessary, while on our other nodes, linux just throws away page cache?

I experimented a bit on other nodes where I can "echo 3 > /proc/sys/vm/drop_caches". It seems the page cache is accounted in the cgroup memory size (we suspected that in the past as well).

That means we don't know if the job really needed 96GiB - all we know is that it crashed with 64GiB, and I think at 22 of 31 layers, so...

as for XET, apparently, the xet library prints an error to stdout (not even stderr), and then just idles, instead of returning an error or throwing an exception. i can probably work with that, somehow.

and for something completely different, --outtype source shows its ugly (but expected) aspect:

40+ 282 Forbidden-Fruit-L3.3-70b-0.2a run/imatrix (GPU-2d) 4/80 38.47s/c 185.7/201.3m(216.4-217.5) [268/314] 10.0971

some big source models are now in f32, and get imatrix'ed as such. it's exactly what we want, except, usually not :) but it's better to err on the f32 side than on the f16 one. also nice to see that llama-imatrix has no issues with f32 weights.

about the 2tb drive: it will be working fine with no errors until it completely freezes, and it will not work until you run zfs clear. nico asked to make it btrfs, so I assume the behavior will be different, and I assume it will not freeze and rather just give you an io error. so there will not be missing bytes, just it either works or fails lol. no data lost. I am a bit busy, but hopefully you will get it today in the evening or tomorrow, depending on how it goes

@RichardErkhov : not sure what freezes means, it will just take very very long? then you can adjust the I/O timeout in /sys/block/DEV/device/timeout, the default is probably 2 minutes.

there is another timeout at the same place, called eh_timeout, which is used for exception handling, and might also have to be increased (it's usually 10 seconds).

if the problem is that the disk does long error handling (what does smartctl -x say?), then setting those to, say, 300 and 180 will probably "fix" it.
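
Something along those lines, as a sketch (the device name is a placeholder for whatever the 2TB disk shows up as under /sys/block; values as suggested above):

    from pathlib import Path

    DEV = "sdX"   # placeholder: the 2TB disk's entry under /sys/block

    dev = Path("/sys/block") / DEV / "device"
    (dev / "timeout").write_text("300")      # per-command I/O timeout in seconds (30s by default here)
    (dev / "eh_timeout").write_text("180")   # SCSI error-handling timeout (usually 10s)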

smartctl can tell you if the number of reallocated blocks goes up, and if it's a seagate, you can get a lot more info using the farm log (smartctl -t farm, if new enough). but it's probably wd :)

and yes, if it's an I/O error (due to timeout or else), btrfs will simply hand the error through for data reads. the file can then simply be deleted. it will silently ignore write errors (and cause checksum failures later), and the default metadata profile of dup should usually help with that, but if the disk times out multiple requests, it will cause filesystem corruption.

(although the block layer usually retries errors a few times as well)

in my usage, it will simply have unlinked temporary files on it, so if it fails, the job can be restarted

@mradermacher There currently is an ongoing outage of your system. All workers are running on the local scheduler which so far is doing a great job.
All workers except kaos show as offline on the status page:

worker back unreachable, skipping.
worker nico2 unreachable, skipping.
worker nico1 unreachable, skipping.
worker marco unreachable, skipping.
worker leia unreachable, skipping.
worker rich1 unreachable, skipping.
worker rain unreachable, skipping.

All imatrix jobs stopped in the middle of computing without any error in the log:

-2000   30 InternVL3_5-14B                               error/255 (GPU-2d) 28/40 3.94s/c 177.8/20.9m(281.4-269.2) [210/318] 40.3656
-2000   66 InternVL3_5-38B                               error/255 (GPU-18) 20/64 5.89s/c 170.4/31.2m(4826.6-3869.8) [14/318] 5.1631

InternVL3_5-14B completed at 16:11 CEST but the status page has no idea about this
InternVL3_5-38B stopped computation at 16:24 CEST

I can't queue any models:

nico1 ~# llmc add -2000 si https://huggingface.co/aifeifei798/QiMing-v1.0-14B
Connection timed out at /llmjob/share/llmjob.pm line 463.

Yes, our ISP has blocked kaos. It's an ongoing problem where Hetzner wrongly claims we do network scans. I can elaborate if anybody is interested, but Hetzner is not very helpful in fixing their shitty infrastructure management.

We have to wait till they unblock it.

All workers except kaos show as offline on the status page:

Ah right, their competency is so great that they didn't block IPv6 (again), this is why you can reach the webserver.

Again, unblocked without any comment or any attempt to investigate the issue. Tonight, I will go through the ripe database and extract all iraqi networks and manually block them.

should all be fixed, until it happens again.

the imatrix jobs (once kaos has network) should be restartable via llmc as soon as they have a numerical error status. wouldn't have helped you of course, since kaos didn't have (ipv4) internet.

reminds me to somehow remove the remaining traces of nfs usage, because that is why jobs on leia/marco etc. failed after each quant. (when writing the log on kaos).

well, xet first.

@nicoboss regarding xet, I have downloaded a 20GB file from our testrepo, and uploaded it into a new repository:

Processing Files (0 / 1)                :  98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌  | 20.0GB / 20.4GB,  178MB/s  
New Data Upload                         : |                                                                                                                                          |  0.00B /  0.00B            

it did read the file twice, though, not great. Once before it started and printed anything, and then it went through a second time for the top progressbar. I can't imagine why. There was essentially no upload at all.

However, deduplication works between uploads and between repos. All is well. If we can get it to work.

rich1 now also has xet uploads enabled again, for testing

great, now hetzner told us that we should look for another provider if we don't like it

@mradermacher currently working on rich1 hard drive and rich1 got accidentally rebooted, very sorry for that, I hope nothing broke ...

@mradermacher Yesterday we added the 2 TB temporary disk and softlinked /llmjob/tdir to /twotb_pool/. Can you confirm that your system is actually using this disk because every time I checked so far it was empty.

@mradermacher currently working on rich1 hard drive and rich1 got accidentally rebooted, very sorry for that, I hope nothing broke ...

Nothing broke, I did have to manually upload two model quants that were interrupted, but everything else apparently restarted on its own.

Can you confirm that your system is actually using this disk because every time I checked so far it was empty.

Did you check with df or ls? ls will practically never show anything, but df should. I'll do a more thorough check soon.

In any case, all I do is set TMPDIR, which works on other nodes, and should no doubt work on rich1. Otherwise it's a bit hard to test other than running lsof while it's converting. But I do notice that rich1 doesn't have these long stretches of low activity during a noquant job (possibly imagined, but at least I suspected you already added the disk before I looked here).

good timing, yes, it seems to use it:

pt_main_t 2527693 2531710 python3              root   4u      REG              0,131   1374793728      292 /twotb_pool/tmp867av0lm (deleted)
pt_main_t 2527693 2531710 python3              root   5u      REG              0,131   1374793728      292 /twotb_pool/tmp867av0lm (deleted)
pt_main_t 2527693 2531711 python3              root   4u      REG              0,131   1374793728      292 /twotb_pool/tmp867av0lm (deleted)
pt_main_t 2527693 2531711 python3              root   5u      REG              0,131   1374793728      292 /twotb_pool/tmp867av0lm (deleted)
pt_main_t 2527693 2531712 python3              root   4u      REG              0,131   1374793728      292 /twotb_pool/tmp867av0lm (deleted)
pt_main_t 2527693 2531712 python3              root   5u      REG              0,131   1374793728      292 /twotb_pool/tmp867av0lm (deleted)

And it is converting, with 6 in-use upload slots, at >50% cpu usage, instead of, like 2-5% cpu as usual.

You seem to be maxing out the gigabit upload as well, tried to use google cloud yesterday and it showed 24 hour eta for 1gb of file lol

nice :) nico should have no trouble adding some traffic shaping if it becomes an issue

I don't know why it seems so effective, either - it only saves one model write and one read, and quants are larger (all of them together) than the source model. i suspect it's somehow seeky, while quantize is strictly sequential.

i did the same split on marco, which helped a lot. i was even wondering if it helps to let some of the slower nodes (rain, back) convert over nfs, using the other node as /tmp, but haven't tried that yet.

@RichardErkhov Since we added that temporary HDD we indeed seem to max out the 1 Gbit/s upload bandwidth most of the time. If you ever need the bandwidth for yourself just rate limit the mradermacher container by editing the network interface of it inside the Proxmox web interface. Keep in mind that this will rate limit both download and upload traffic to the specified MB/s (not Mbit/s) amount. It would be amazing if we could enable XET again as that would allow us to upload quants while using less upload bandwidth.

@RichardErkhov or it might have been internet weather. For a while now, uploads are very slow, not for lack of files to upload, nor for lack of disk bandwidth. It seems new rich1 has a similar cadence as the old one, with good times where quants bunch up, and bad times. A simple rate limit might not be very effective in that case, and neither would be traffic shaping, when the problem is outside of rich1 itself.

@nicoboss Pretty sure I already mentioned that XET uploads are enabled on rich1 and leia. On rich1 for ~3 days and on leia for ~4 days. And since it works out mostly fine so far, soon everywhere. next feature on the list is imatrix-rpc parametrisation via llmc. but my brain budget is mostly used up at the moment due to rich1 and the problems with kaos. In any case, rule of thumb, if your provider shits on you, block iraq.

@mradermacher Please update to the latest llama.cpp of our fork if you have time. This is for Gemma3TextModel support and to fix some Hermes chat template tool calling issues I prefer to have fixed before doing the massive 405B Hermes models.

A simple rate limit might not be very effective in that case, and neither would be traffic shaping, when the problem is outside of rich1 itself.

If the goal just is to allow Richard to use the internet when he needs it for himself it should do the job. He always wants to max out everything so he will for sure remove it again once he is done using the internet.

Pretty sure I already mentioned that XET uploads are enabled on rich1 and leia. On rich1 for ~3 days and on leia for ~4 days.

I was aware of you enabling it for testing; I also saw it in the upload logs where you suddenly started showing how much data was actually uploaded vs. how much data would otherwise be uploaded, but now that the upload logs returned to their normal form I thought the test was over. It's also quite incredible that we still reach the bandwidth limit on rich1 despite XET.

And since it works out mostly fine so far, soon everywhere.

I'm definitely looking forward to having it enabled on nico1/nico2. My ISP would love us for doing so. He honestly deserves it, as I have never received any complaints from him in the 20 years I have been his customer, and never had my internet blocked. Even after using over 1 PB of upload bandwidth earlier this year.

In any case, rule of thumb, if your provider shits on you, block iraq.

I find it very problematic that some fully automated erroneous system can lead to customers losing internet access. I would at least expect a company to perform a manual review before taking actions against their own customers. Thanks for warning me about Hetzner. I will for sure never rent any of their servers and not recommend them to any colleagues. The reason they only block IPv4 is likely because those are the only addresses they care about being blacklisted, as they probably have a near infinite amount of IPv6 addresses to burn. No idea how blocking Iraq should help. It's not even a country I usually geoblock as I never noticed any significant amount of malicious traffic coming from Iraq.

next feature on the list is imatrix-rpc parametrisation via llmc. but my brain budget is mostly used up at the moment due to rich1 and the problems with kaos.

No hurry but be prepared for a lot of upcoming imatrix RPC jobs. We have many massive models stuck inside the queue and I'm currently preparing a batch of them.

If the goal just is to allow Richard to use the internet when he needs it for himself it should do the job.

I don't see how. For example, if he tries to reach, say, aws, and me, too, and there is a bottleneck of 10MBps then throttling me to 10MBps will not really do the job. Worse, if he tries to reach something else, and that is bottlenecked, then even shutting me off won't do the job. I made these remarks in the light of rich1's provider supposedly having bad "overseas" connectivity, as the malaysians would say :)

now that the upload logs returned to their normal form I thought the test was over.

That is not good, and unfortunately, XET was indeed disabled everywhere due to a logic error in my brain when trying to re-enable it everywhere. Yay :) I've re-enabled it everywhere now. For sure. This time. Yes.

I find it very problematic that some fully automated erroneous system can lead to customers losing internet access.

It is not fully automated. So from what I understand, hetzner's outgoing routers have a default route to some host where they run a tool that creates an alarm if too many packets originate from one host. The idea is that only invalid/non-organic traffic goes to unroutable networks. They then generate an abuse report, giving us a few hours to "fix" it and submit a statement that we "fixed" it. If nothing is submitted, a human will review and then potentially block.

That's not an entirely unreasonable thing to do. The problem is that a) hetzner attributes all traffic to nonroutable destinations to "network/port scans" and b) this is not negotiable.

In our case, kaos is a member of a distributed hash table with 44 million nodes that know about each other. There are all kinds of invalid addresses - some obvious (e.g. bogons) and some not so obvious. For example, there is a node in IBM's 9.0/8 network that knows about other nodes in the 9/8 network, but only a tiny portion of that network is exposed to the public internet (via routing), so contacting these nodes looks like probing unroutable ip space.

We tried to work with hetzner on these issues, but I suspect they just think we are doing evil script kiddie things, and/or their checker was written by the boss and is sacrosanct; in any case, no help was forthcoming. Or, really, internet just means https for hetzner, everything else is a network scan apparently.

I think for us filtering bogons is entirely reasonable, which is what we implemented first. To catch cases like IBM, we are now fetching the routing table from one of RIPE's RIS collectors every 6 hours and block all unroutable traffic (1.2 million routes).

That did not reduce the abuse messages, but their contents have changed - obviously, we no longer send traffic to the addresses we block, so I had a closer look at the lists they send us. Interestingly enough, last week, most of the abuse cases were closed by hetzner themselves as "false alarm", but obviously that depends on the person investigating, so we have been blocked on friday (I think) once more. So after having a good look at all the provided address list excerpts, it dawned on me - once a day, usually at 5am, hetzner loses a route to iraq, and since iraq is pretty centrally connected (everything has to go through the state surveillance infrastructure), that cuts off a LOT of clients. https://stat.ripe.net/resource/5.1.106.220#tab=routing&overview_routing-status.resource=5.1.106.220&overview_routing-status.source=overview&overview_routing-status.min_peers_seeing=0 shows these nicely, although they have also reduced since then.

I have tried to reason with their technical department, but they just don't care, this is one of their replies (in full, without hello/footer) after me sending some example address that is routable most of the day and triggered the abuse:

as already said, your server will keep running into this problem as long as these scans continue. Moving this service to a different provider is the most sustainable solution.

Hetzner unfortunately also has us by the balls; marco already stopped reading our email correspondence out of fear he might read that they will just cancel us. That would be a major disaster - moving services is a lot of work, and renting the same infrastructure would easily double the price or more.

In any case, ever since blocking iraq, the abuse reports are absent. Marco suggested getting another server somewhere else for this, but my honor does not quite allow me to rent another full server for this. So we decided to get a 1€ v-server at strato, and I will probably switch the traffic to it. Probably strato will have a similar system, or maybe not, but if they kick us out, it's not such a biggy.

Oy, that got long.

No hurry but be prepared for a lot of upcoming imatrix RPC jobs.

That is why I am trying to get this done before, preferably :( It's just a matter of writing it, documenting it, and testing it. I.e. work and time. And I'm already behind queuing models (about 7000 out of 34000 are left btw.).

llama

Writing this took so long, the llama update is through.

Guess the disk rears its ugly head:

    return eager.tofile(*args, **kwargs)
OSError: 469762048 requested and 0 written
job finished, status 1

So it's not that data is lost, it's that the disk goes away, or simply takes too long (as you actually reported, too):

[Fri Sep  5 16:07:18 2025] sd 6:0:5:0: attempting task abort!scmd(0x00000000ee57cbcb), outstanding for 30354 ms & timeout 30000 ms

Unfortunately, this will cause all following jobs to fail until the disk is unmounted and remounted, which makes sense, as metadata writes are involved. Anyway, I've reverted /llmjob/tdir so rich1 can continue.

What is a bit weird is that the kernel apparently did not retry - but it looks like it's attached to an lsi hba, and its driver does not retry, or at least, not like the ata drivers.

Anyway, like this it is not workable - even with ext4 and "continue after errors", I think it would just corrupt metadata more and more over time.

Maybe setting timeouts to much higher than 30s might help, and the disk recovers, if indeed it's "just" the disk retrying internally. But if it's a different problem, and the disk simply forgot the outstanding requests, they will just continue to timeout.

@mradermacher It seems impossible to upload any Kimi-K2 quants with the current per-upload shard cache limitation. Please increase it or larger models simply can't be uploaded. The failure to upload Kimi-K2 caused nico1 to run out of storage due to so many Kimi-K2 quants accumulating:

nico1    nice size (static/imatrix) -- jobs 8/8-40 maxm 700 free 391 budget -701 uploads 2865 hfd 1029 29c
        -2000 1030 si Kimi-K2-Instruct-0905                        error/47 8/12,Q3_K_L [424/1096] (hfu Q3_K_M Q3_K_S Q6_K Q8_0)
nico1 /tmp/hfu-log# grep -r "MerkleDB Shard error"
Kimi-K2-Instruct-0905-GGUF-Kimi-K2-Instruct-0905.Q8_0.gguf*.log:"RuntimeError('Data processing error: MerkleDB Shard error: File I/O error')"
Kimi-K2-Instruct-0905-GGUF-Kimi-K2-Instruct-0905.Q8_0.gguf*.log:RuntimeError('Data processing error: MerkleDB Shard error: File I/O error') at /llmjob/share/bin/llmjob line 3042.
(the same pair of lines repeats many more times, for the Q8_0, Q6_K, Q3_K_M and Q3_K_S logs alike)

Llama-3.2-8X3B-GATED-MOE-Reasoning-Dark-Champion-Instruct-uncensored-abliterated-18.4B is another model that ran into the hardcoded 127 char kv limit.

The failure to upload Kimi-K2 caused nico1 to run out of storage due to so many Kimi-K2 quants accumulating:

Hmm.. it can eat most free space, but it shouldn't eat all space. The heuristic currently requires 60% of the gguf size to be available in free space, otherwise it should stop quanting.

I guess the gguf was a symlink, and I used stat without -L? I've fixed that so it should follow symlinks.
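(For reference, plain stat reports the size of the symlink itself and needs -L to report the size of the file it points to; the paths below are made up for illustration:)

ln -s /bpool/Kimi-K2-Instruct-0905.Q8_0.gguf model.gguf   # made-up example symlink
stat -c %s model.gguf      # size of the symlink itself, a few dozen bytes
stat -L -c %s model.gguf   # -L dereferences and reports the size of the real gguf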

I've doubled the size of the tmpfs to 1024m. If that isn't enough, you can adjust it temporarily in /llmjob/share/llmjob.pm: look for "tmpfs" and you should see the size=1024m parameter. The file gets overridden whenever I do an update of /llmjob/share, which only happens automatically when a host gets re-enabled (e.g. nico2 every morning).

(The limit can be increased and will take effect without having to restart the hfu jobs).
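(For completeness, increasing the limit on an already-mounted tmpfs is just a remount; assuming the per-upload tmpfs sits at /dev/shm/hffs-tmp, something like this should do:)

mount -o remount,size=2048m /dev/shm/hffs-tmp   # takes effect immediately, running hfu jobs keep working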

As a sidenote, that means XET still ignores the cache size limit :(

In addition the job size for Kimi is also wrong (1030GB). I wonder how that happened. I guess it took it from the repository, and since it never ran a noquant phase, it never updated it to the gguf size(?)

Nope, it needs 2G, apparently. I've updated it.

Wow, 20MB/s, that's a new low for AWS. (I wish it was faster, because I am watching the kimi upload :)

Also, weird things are happening that I don't understand. There is a "/xet-cache" directory (containing "cache-xet"), which is actually used by the upload, and I have no clue why it is created, or used. It only exists on nico1 (but leia had a /xet directory). Even if it ignored all env variables, it should not put directories in /. Even if it thought HOME was "" or /, it should create a .config there, not a xet-cache.

lrwxrwxrwx 1 root root 11 Sep 6 14:00 /dev/shm/hffs-tmp -> /xet-cache/

Thanks a lot for fixing this issue.

Wow, 20MB/s, that's a new low for AWS. (I wish it was faster, because I am watching the kimi upload :)

I'm as well watching it :D

lrwxrwxrwx 1 root root 11 Sep 6 14:00 /dev/shm/hffs-tmp -> /xet-cache/

I tried making it not use the tmpfs file system but failed miserably, because obviously the mount command also follows symlinks. I also tried specifying HF_HUB_DISABLE_XET=1 both globally and via .env files, but it had no effect either. I will clean up once the Kimi backlog is uploaded, as changing that back while uploads are running seems like a terrible idea.
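For reference, the global attempt was nothing more than exporting the variable before the uploads start; pointing the cache elsewhere via HF_XET_CACHE would be the other obvious knob, though I have not verified that hf_xet actually honours it:

export HF_HUB_DISABLE_XET=1             # supposed to disable the xet upload path entirely
export HF_XET_CACHE=/dev/shm/hffs-tmp   # untested guess at redirecting the xet cache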

... ... ...

This is not funny, nico. The absolute disregard for my time you display is staggering. I debugged this issue for quite a while, instead of going to sleep. Because unlike you, I care about not leaving scorched earth behind and informing you of the situation. I told you precisely how this system works, by posting the shell script fragment that sets it up. You didn't even mention you mucked around with it.

And this is after you claimed you'd improve communication and tell me about changes you make.

I'm not annoyed that you broke something, or changed something. I am annoyed because, again, you didn't tell me and just let me debug it needlessly. It's just my time that is wasted, after all, apparently.

So where do we go from here? I don't have confidence in letting you do things in the vm, as long as you don't tell me about them.

If you don't know what you are doing, you can still do them. But you absolutely have to tell me about them.

Anyway, the Q8_0 upload does not use a tmpfs, it simply uses /xet-cache at the moment. After the first scan, hf created 140MB of cache. Then it slowly grew to 860MB, and it's probably at half of what it is going to use in the end.

I'm so sorry. I was trying to fix the issue myself and, while still in the middle of trying to find a solution, got interrupted with an urgent matter. I wrote you immediately after I returned to my computer. I would have reverted and communicated all the things I tried if it wasn't for me getting interrupted in the middle of it. There are unfortunately quite a lot of unplanned activities I have to do so urgently that they leave me without the opportunity to communicate or clean up whatever I was doing before. In the future I will try to communicate things before doing them to avoid situations like this, but this is hard for issues where I have no clue how to fix them. I really shouldn't even have wasted 2 hours trying to fix this myself, but it bothered me that it broke and I had no clue that you would be available to fix it yourself so quickly.

Sigh. If you classify it as urgent enough, I'll just have to accept it then.

I am surprised it didn't make progress, though. Even 512MB should have been enough to upload a sizable chunk before it ran into trouble. As in, 1GB was enough for 400GB of uploads. I guess the real issue was the disk full condition (and the progress would not have been great anyways).

Anyway, tmpfs is configured at size=4096m now. That should be way overkill, but the extra gigabytes of memory per upload is a concern (not so much for nico1, but I am concerned about wild-running xet, because it has been exploding so many times already). Moving it to disk is probably not a good idea, though, as I suspect the I/O pattern is not friendly.

Anyway, the Q8_0 upload does not use a tmpfs, it simply uses /xet-cache at the moment. After the first scan, hf created 140MB of cache. Then it slowly grew to 860MB, and it's probably at half of what it is going to use in the end.

So my "fix" worked? I tested my symlink back when I created it and it didn't seam to work but my testing got cut short due to me getting interruped... In any case please revert it back to using the now increased tmpfs. Otherwise I will pause nico1 somewhere this evening and revert it myself.

Sigh. If you classify it as urgent enough, I'll just have to accept it then.

No, you shouldn't just accept it. I definitely need to improve my communication. I really should just inform you before even attempting to fix something myself. That way we also avoid the situation of both of us trying to fix an issue at the same time.

While we are on the topic of good communication, I should tell you that I created /tmp/Kimi-K2-Instruct-0905.Q6_K.gguf using zero-copy concatenation (cat Kimi-K2-Instruct-0905.Q6_K.gguf.part01of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part02of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part03of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part04of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part05of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part06of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part07of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part08of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part09of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part10of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part11of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part12of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part13of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part14of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part15of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part16of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part17of18 Kimi-K2-Instruct-0905.Q6_K.gguf.part18of18 > /root/Kimi-K2-Instruct-0905.Q6_K.gguf).

/tmp/Kimi-K2-Instruct-0905.Q6_K.gguf is the GGUF that will later be used for RPC imatrix computation. If you have time, you can already configure the imatrix task to use RPC and this GGUF.

I am surprised it didn't make progress, though. Even 512MB should have been enough to upload a sizable chunk before it ran into trouble. As in, 1GB was enough for 400GB of uploads. I guess the real issue was the disk full condition (and the progress would not have been great anyways).

It did make some progress but always crashed at around 50 GB uploaded and restarted the upload. I was watching it this morning.

Anyway, tmpfs is configured at size=4096m now. That should be way overkill, but the extra gigabytes of memory per upload is a concern (not so much for nico1, but I am concerned about wild-running xet, because it has been exploding so many times already).

Thanks a lot. 4 GiB will for sure be enough for any current and future model. While during normal operation RAM is not really a concern, it might be during RPC imatrix computation, but as far as I'm aware 4 GiB is just the upper limit the tmpfs can use, while in reality it will only use as much RAM as it actually needs.

Moving it to disk is probably not a good idea, though, as I suspect the I/O pattern is not friendly.

I never intended to permanently move to disk. I just wanted to find a temporary solution to get the Kimi-K2 quants uploaded and free up some storage so nico1 can work on other models while waiting for you to properly fix the issue.

damn we need faster storage. my drives are getting completely abused and cpu is sitting at 5% lol. well I will be there eventually, so I guess some ssd is coming. sadly not m.2 unless I manage to get an m.2 card for this server

just so you know I am waiting for a terminal to open on one of the hard drives on your pool lmao. ah yes, waiting for 5 minutes already

I just paused nico1, reverted xet-cache to use tmpfs by removing the symlink and resumed nico1. All uploads should now hopefully use tmpfs for per-upload XET cache.

damn we need faster storage. my drives are getting completely abused and cpu is sitting at 5% lol.

@RichardErkhov At least the disks are emptying now instead of filling quickly with generated quants as before :-)

Even though the extra disk was not heavily used, it did make a very significant difference. So rich1 did excel, even with rotating disks...

I still wonder if, with some setting, the mpt driver could be made to retry instead of timing out after 30s. The 30s timeout might even be hardcoded into the driver. I will look around on the only system where I have an LSI controller in IT mode (which I assume is the setup in rich1).
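Before touching anything, the current values can be read back from sysfs (sdf used as the example device); on a stock kernel I would expect 30 and 10 seconds:

cat /sys/block/sdf/device/timeout      # SCSI command timeout in seconds, default 30
cat /sys/block/sdf/device/eh_timeout   # error-handler timeout in seconds, default 10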

So my "fix" worked? I tested my symlink back when I created it and it didn't seam to work but my testing got cut short due to me getting interruped...

Not sure what you wanted to achieve; it worked in the sense of not causing an error, and otherwise not having an effect, other than making me track down why it wouldn't use the xet cache dir I told it to use.

4 GiB will for sure be enough for any current and future model.

I was running a 60s du on it, btw., and it grew to 2.4GB, and in the last 60s it quickly grew to >3.4GB. So maybe even 4096m would not be enough. On the other hand, since it seems to grow only once finished, it might not be such an issue. Arguably, we could make per-host limits, and probably should. Also for the job memory limit, although the latter does not correlate well with model size.
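(For the record, the observation was nothing more sophisticated than polling du once a minute; the mount point below is an assumption:)

watch -n 60 du -sm /dev/shm/hffs-tmp   # report the tmpfs usage in MB every 60 seconds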

Anyway, the disk-full condition did point out two problems: one was not stat'ing the gguf file correctly, the other was that llmjob has no concept of updating the model size if noquant is skipped, because there is no phase between noquant and quant. The only thing going wrong would be the scheduler starting the job when there isn't enough space, but since the gguf was not on the disk, that couldn't happen. quantize always stats the gguf itself, and then applies the 60% minimum-disk-free requirement, which was a bitch to do in shell.
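For the curious, the shape of that shell check is roughly the following; the paths are examples and the real variable names differ:

gguf=/tmp/quant/Kimi-K2-Instruct-0905.Q8_0.gguf          # example path
need=$(( $(stat -L -c %s "$gguf") * 60 / 100 / 1024 ))   # 60% of the (dereferenced) gguf size, in KiB
free=$(df -k --output=avail "$(dirname "$gguf")" | tail -n 1)
if [ "$free" -lt "$need" ]; then
    echo "not enough free space for quanting, waiting" >&2
fi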

It did make some progress but always crashed at around 50 GB uploaded and restarted the upload. I was watching it this morning.

That is painful.

Moving it to disk is probably not a good idea, though, as I suspect the I/O pattern is not friendly.

I know, and you didn't manage to. Well, you kind of did manage to, because I manually unmounted the tmpfs once the job had started to watch it, and it would have ended up in /dev/shm without the symlink. I was only thinking out loud whether we maybe should move it to disk. Even 0.5GB is significant on the other nodes, which is why I want to study the behaviour so critically instead of just blindly yanking up the limit. Also, I am concerned about xet running wild, which is why I absolutely require a low, hard limit :)

Moving it persistently to disk, e.g. into a model-specific directory, like is done with the download, might also skip the hashing phase on a retry. Might be something worth investigating.

I will make an attempt at an imatrix-rpc command today. It would allow you to switch a hardcoded "rpc" mode on/off and set a quant (which merely influences the source file name).

@nicoboss, @RichardErkhov so what if you do:

echo 600 >/sys/block/sdf/device/timeout
echo 40 >/sys/block/sdf/device/eh_timeout
umount /twotb_pool
mount /twotb_pool

Seems worth a try. btrfs should just revert to the last successful transaction, but it will be empty either way.

@nicoboss btw., the job scheduler has no issues using a custom llama.cpp version. if you want to supply your own for special models or experiments, i could add an option.

Seems worth a try. btrfs should just revert to the last successful transaction, but it will be empty either way.

We tried that, but mount /twotb_pool failed with:

mount: /twotb_pool: wrong fs type, bad option, bad superblock on /dev/sdf, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.

and dmesg showed the following errors:

[295216.849122] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[295216.899489] sd 6:0:5:0: [sdf] tag#740 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[295216.949091] sd 6:0:5:0: [sdf] tag#740 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[297016.806080] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[297016.807198] sd 6:0:5:0: [sdf] tag#1461 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[297016.807774] sd 6:0:5:0: [sdf] tag#1461 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[298816.513505] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[298816.514522] sd 6:0:5:0: [sdf] tag#1764 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[298816.515127] sd 6:0:5:0: [sdf] tag#1764 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[300616.471020] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[300616.471806] sd 6:0:5:0: [sdf] tag#2108 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[300616.472392] sd 6:0:5:0: [sdf] tag#2108 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[300910.190450] sd 6:0:5:0: Power-on or device reset occurred
[300910.293275] BTRFS error (device sdf state EMA): remounting read-write after error is not allowed
[300944.035935] BTRFS error (device sdf state EMA): remounting read-write after error is not allowed

@nicoboss btw., the job scheduler has no issues using a custom llama.cpp version. if you want to supply your own for special models or experiments, i could add an option.

That would be great. Currently I always manually supply GGUFs for special models, which is somewhat time consuming. All pre-quantized models like DeepSeek- and Kimi-based ones, for example, currently require the compilade/convert-prequant branch and --outtype=f16, because that branch lacks --outtype=source, and even if it had it, it would wrongly choose f32 because that's how llama.cpp internally stores pre-quantized datatypes. So maybe there also needs to be a way to override the command line arguments used to call convert, if that is possible.
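Concretely, the manual conversion for those pre-quantized models looks roughly like this (paths are examples, run from a checkout of the compilade/convert-prequant branch of llama.cpp):

python convert_hf_to_gguf.py /path/to/Kimi-K2-Instruct-0905 \
    --outtype f16 \
    --outfile /tmp/Kimi-K2-Instruct-0905.gguf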

[300944.035935] BTRFS error (device sdf state EMA): remounting read-write after error is not allowed

This cannot be the result of a real mount. Either it was mounted already and you tried to mount it a second time as read-write (quite common with all the bind mount trickery going on), or you remounted it, or the messages are old and unrelated.

For example, every time we create a new mount namespace it gets a copy of that fs, so if you umount it, all you do is decrease a reference and the fs stays mounted. If you then mount it again the kernel interprets that as an attempt to change the existing mount into a read-write mount, which btrfs won't allow because it knows the current vfs state does not match the transaction state on disk.

I am not sure this is what happened, though, because /twotb_pool/ is still mounted in my vm on rich1, and if you really umounted it and it failed to mount again, it should still be umounted in my vm. Also, there is a bash running inside, which also would make it impossible to umount. And while running jobs (hfu) get a copy, umounting it in the main namespace might get propagated down into the slave namespaces (depending on how the container is configured, and what mood systemd is in).

So my guess here is that you never actually umounted the fs, just removed a reference to it (similar to how unlink doesn't delete files).

The power-on or device reset is a bit more worrying: that basically means the disk has rebooted (maybe it has power problems or similar), and usually the kernel severs the connection to any users, because it could essentially be a different disk, and/or an unknown number of I/Os might have been ignored.

That would be great. Currently I always manually supply GGUFs for special models which is somewhat time consuming.

Right, that would be true for the convert step as well. I'll put it on my todo. Switches are a bit harder and require a bit more modifications, but essentially it could just be an env variable that quantize uses.
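As a sketch of what I have in mind (none of this exists yet, all names are made up), the quantize wrapper would simply prefer an override when one is set:

# hypothetical sketch, nothing of this exists yet
QUANTIZE="${LLMJOB_QUANTIZE_OVERRIDE:-llama-quantize}"
CONVERT="${LLMJOB_CONVERT_OVERRIDE:-convert_hf_to_gguf.py}"
"$QUANTIZE" "$SRC_GGUF" "$DST_GGUF" "$QUANT_TYPE"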

The real blocker is always that, to keep the scheduler politically correct, it has to do input validation, to protect it from evil people like ******* who would instantly use it to hack all nodes. Otherwise I would have long exposed env variable overrides for llmc add :)

Not sure when all that materialises, but now I know it's useful.

Didn't get to do much today.

Ah, in my experience, the most common source for unexpected filesystem references is namespaced systemd services. E.g. you mount /abc, then restart journald, and suddenly you can't really umount /abc because journald has a private copy since the restart. Systemd maintainers would say you are doing it wrong, but...

grep twotb_pool /proc/[1-9]*/mounts might give a hint on who the culprit is (but that should be done once twotb is umounted in my vm, so you don't get all processes running inside). Pretty likely you will find that journald has its own copy....

This cannot be the result of a real mount. Either it was mounted already and you tried to mount it a second time as read-write (quite common with all the bind mount trickery going on), or you remounted it, or the messages are old and unrelated.

Thanks a lot for the explanation. The error was quite unclear to us. It probably still had a reference because we kept rich1 running while trying to fix it.

Your recommended commands worked once rich1 was off, and everything in dmesg looks great. We have now turned rich1 back on and re-enabled it as the temp disk.

[333119.439232] BTRFS: device label twotb_pool devid 1 transid 1526 /dev/sdf (8:80) scanned by mount (2931928)
[333119.459367] BTRFS info (device sdf): first mount of filesystem cf6870f9-e792-436a-a6c7-558646e09eaa
[333119.460443] BTRFS info (device sdf): using crc32c (crc32c-x86) checksum algorithm
[333119.461254] BTRFS info (device sdf): using free-space-tree

Now I am curious what will happen. My prediction is that it will now simply time out after 600s instead, so no real change, though :( But with luck, something more interesting happens :)

I unfolded the hidden messages, and my browser hated me again. Discussion continued in https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/7
