From Ether to Syntax: A Meta-Analytic Exploration of Linguistic Algorithmic Landscapes
continued....
Here is a complete list of the newly added architectures.
The non-mm archs are picked up automatically when llama.cpp is updated (rather, nothing checks for these archs, other than the script that shows me the daily models).
Nice. Will do, in case you forgot any vision/audio architecture.
In case you need it, the list/regex is currently in /llmjob/share/llmjob.pm - search for is_vision.
Also, vision is mradermacher code for multi-modal from now on.
BERT-based architectures seem to be incredibly common.
I might exclude them from the daily list for that reason, and because they're likely not popular with the people who consume GGUFs (and most fail, because small models tend to have custom tokenizers).
Nice, I just discovered an easy way to requeue previously failed architectures:
Yup, shell-greppable logs for the win.
Update: oh, it's not even the real log file, "just" the llmc why transform of it.
@RichardErkhov vision models should not be queued to rich1 unless they are not being detected as such (and then no vision extraction should happen).
The non-vision jobs are limited to 32GB RAM, too. No clue what happened. Very troubling.
However, this morning, only besteffort models were queued on rich1. Who knows what nico queued...
Well, good to know. Usually you take like 4-8GB, but something went wrong today. Peak recorded by proxmox was 24GB (so I assume it was even higher; due to the total OOM it might not have recorded the full number). I added swap on root just in case this happens again, so at least the other things on the server don't die haha
llmc audit besteffort skips the besteffort models for me.
Please restart the Audio-Reasoner imatrix computation. I killed it earlier today because it ran on the CPU. I'm still not sure what makes GPUs occasionally temporarily disappear, but it seems related to them being used in a different container.
> llmc audit besteffort skips the besteffort models for me.
Right, arguments were not passed to llmjob audit. Should be fixed now.
> Peak recorded by proxmox was 24GB
Well, given that I was officially allowed to use 64GB, 24GB seems absolutely normal. So what is the new limit? 24GB will only allow one quant, and maybe not even that.
@nicoboss, @RichardErkhov so what if you do:
```
echo 600 >/sys/block/sdf/device/timeout
echo 40 >/sys/block/sdf/device/eh_timeout
umount /twotb_pool
mount /twotb_pool
```
Seems worth a try. btrfs should just revert to the last successful transaction, but it will be empty either way.
@nicoboss btw., the job scheduler has no issues using a custom llama.cpp version. If you want to supply your own for special models or experiments, I could add an option.
> Seems worth a try. btrfs should just revert to the last successful transaction, but it will be empty either way.
We tried that, but `mount /twotb_pool` failed with `mount: /twotb_pool: wrong fs type, bad option, bad superblock on /dev/sdf, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call.` and dmesg showed the following errors:
```
[295216.849122] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[295216.899489] sd 6:0:5:0: [sdf] tag#740 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[295216.949091] sd 6:0:5:0: [sdf] tag#740 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[297016.806080] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[297016.807198] sd 6:0:5:0: [sdf] tag#1461 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[297016.807774] sd 6:0:5:0: [sdf] tag#1461 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[298816.513505] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[298816.514522] sd 6:0:5:0: [sdf] tag#1764 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[298816.515127] sd 6:0:5:0: [sdf] tag#1764 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[300616.471020] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[300616.471806] sd 6:0:5:0: [sdf] tag#2108 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[300616.472392] sd 6:0:5:0: [sdf] tag#2108 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[300910.190450] sd 6:0:5:0: Power-on or device reset occurred
[300910.293275] BTRFS error (device sdf state EMA): remounting read-write after error is not allowed
[300944.035935] BTRFS error (device sdf state EMA): remounting read-write after error is not allowed
```
> @nicoboss btw., the job scheduler has no issues using a custom llama.cpp version. If you want to supply your own for special models or experiments, I could add an option.
That would be great. Currently I always manually supply GGUFs for special models, which is somewhat time-consuming. All pre-quantized models (DeepSeek- and Kimi-based ones, for example) currently require the compilade/convert-prequant branch and --outtype=f16, because that branch lacks --outtype=source, and even if it had it, it would wrongly choose f32, because that's how llama.cpp internally stores pre-quantized data types. So maybe there also needs to be a way to override the command line arguments used to call convert, if that is possible.
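For reference, the manual workflow described above boils down to roughly the following. This is a sketch only: the paths and output file name are illustrative, and the commands are echoed rather than executed so the exact invocation can be reviewed first; the branch name and the --outtype f16 reasoning are taken from the message.

```shell
# Check out the branch that can convert pre-quantized checkpoints
# (branch name as given above; repository path is illustrative).
branch_checkout="git -C llama.cpp checkout compilade/convert-prequant"

# Convert the HF checkpoint; --outtype f16 instead of source, because
# that branch lacks --outtype=source and would otherwise pick f32.
convert="python llama.cpp/convert_hf_to_gguf.py --outtype f16 --outfile model.f16.gguf /path/to/hf-model"

echo "$branch_checkout"
echo "$convert"
```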
> [300944.035935] BTRFS error (device sdf state EMA): remounting read-write after error is not allowed
This cannot be the result of a real mount. Either it was mounted already and you tried to mount it a second time read-write (quite common with all the bind mount trickery going on), or you remounted it, or the messages are old and unrelated.
For example, every time we create a new mount namespace it gets a copy of that fs, so if you umount it, all you do is decrease a reference and the fs stays mounted. If you then mount it again, the kernel interprets that as an attempt to change the existing mount into a read-write mount, which btrfs won't allow, because it knows the current vfs state does not match the transaction state on disk.
I am not sure this is what happened, though, because /twotb_pool/ is still mounted in my vm on rich1, and if you had really umounted it and it failed to mount again, it should still be umounted in my vm. Also, there is a bash running inside, which would also make it impossible to umount. And while running jobs (hfu) get a copy, umounting it in the main namespace might get propagated down into the slave namespaces (depending on how the container is configured, and what mood systemd is in).
So my guess here is that you never actually umounted the fs, just removed a reference to it (similar to how unlink doesn't delete files).
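The unlink comparison can be demonstrated directly in an unprivileged shell: removing the last *name* of a file doesn't free its data while an open file descriptor still references it, just like umount in one namespace only drops one of several references to a filesystem. A minimal sketch (all paths are temporary):

```shell
tmp=$(mktemp -d)
echo "still here" > "$tmp/f"
exec 3< "$tmp/f"   # take an extra reference (an open fd)
rm "$tmp/f"        # unlink: the name is gone...
ls "$tmp"          # ...the directory is empty now...
cat <&3            # ...but the data is still reachable: prints "still here"
exec 3<&-          # closing the last reference finally frees it
rmdir "$tmp"
```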
The power-on or device reset is a bit more worrying: it basically means the disk has rebooted (maybe it has power problems or similar), and usually the kernel severs the connection to any users, because it could essentially be a different disk, and/or an unknown number of I/Os might have been ignored.
> That would be great. Currently I always manually supply GGUFs for special models, which is somewhat time-consuming.
Right, that would be true for the convert step as well. I'll put it on my todo. Switches are a bit harder and require a few more modifications, but essentially it could just be an env variable that quantize uses.
The real blocker is always that, to keep the scheduler politically correct, it has to do input validation, to protect it from evil people like ******* who would instantly use it to hack all nodes. Otherwise I would have long ago exposed env variable overrides for llmc add :)
Not sure when all that materialises, but now I know it's useful.
Didn't get to do much today.
Ah, in my experience, the most common source for unexpected filesystem references is namespaced systemd services. E.g. you mount /abc, then restart journald, and suddenly you can't really umount /abc because journald has a private copy since the restart. Systemd maintainers would say you are doing it wrong, but...
`grep twotb_pool /proc/[1-9]*/mounts` might give a hint as to who the culprit is (but that should be done once twotb is umounted in my vm, so you don't catch all the processes running inside). Pretty likely you will find that journald has its own copy...
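If you want the process names next to the PIDs, that one-liner can be wrapped in a small function; a sketch (the function name is made up, the /proc layout is standard Linux):

```shell
# who_mounts PATTERN: list PID and command name of every process whose
# mount namespace still contains a mount matching PATTERN.
who_mounts() {
  for m in /proc/[1-9]*/mounts; do
    grep -q -- "$1" "$m" 2>/dev/null || continue
    pid=${m#/proc/}; pid=${pid%/mounts}
    printf '%s\t%s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
  done
}

who_mounts twotb_pool
```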
> This cannot be the result of a real mount. Either it was mounted already and you tried to mount it a second time read-write (quite common with all the bind mount trickery going on), or you remounted it, or the messages are old and unrelated.
Thanks a lot for the explanation. The error was quite unclear to us. It probably still had a reference because we kept rich1 running while trying to fix it.
Your recommended commands worked once rich1 was off, and everything in dmesg looks great. We have now turned rich1 on again and reenabled it for use as a temp disk.
```
[333119.439232] BTRFS: device label twotb_pool devid 1 transid 1526 /dev/sdf (8:80) scanned by mount (2931928)
[333119.459367] BTRFS info (device sdf): first mount of filesystem cf6870f9-e792-436a-a6c7-558646e09eaa
[333119.460443] BTRFS info (device sdf): using crc32c (crc32c-x86) checksum algorithm
[333119.461254] BTRFS info (device sdf): using free-space-tree
```
Now I am curious what will happen. My prediction is that it now times out after 600s with no change, though :( But with luck, something more interesting happens :)
I unfolded the hidden messages, and my browser hated me again. Discussion continued in https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/7