convert it to GGUF.

#1 opened by Lockout

The MTP tensors are skipped in that inference backend.

Tried to; there are complications. Even after copying in the mtp.safetensors from the original GLM-4.6 repo and modifying the model.safetensors.index.json, I got the following error:

INFO:hf-to-gguf:gguf: loading model part 'model-mtp.safetensors'
INFO:hf-to-gguf:blk.92.nextn.eh_proj.weight,          torch.bfloat16 --> BF16, shape = {10240, 5120}
INFO:hf-to-gguf:blk.92.nextn.enorm.weight,            torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.92.nextn.hnorm.weight,            torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.92.attn_norm.weight,              torch.bfloat16 --> F32, shape = {5120}
Traceback (most recent call last):
  File "/home/jarvis/development/llama.cpp/convert_hf_to_gguf.py", line 9544, in <module>
    main()
  File "/home/jarvis/development/llama.cpp/convert_hf_to_gguf.py", line 9538, in main
    model_instance.write()
  File "/home/jarvis/development/llama.cpp/convert_hf_to_gguf.py", line 432, in write
    self.prepare_tensors()
  File "/home/jarvis/development/llama.cpp/convert_hf_to_gguf.py", line 7332, in prepare_tensors
    super().prepare_tensors()
  File "/home/jarvis/development/llama.cpp/convert_hf_to_gguf.py", line 303, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jarvis/development/llama.cpp/convert_hf_to_gguf.py", line 7311, in modify_tensors
    datas.append(self._experts[bid][ename])
                 ~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'model.layers.92.mlp.experts.7.down_proj.weight'

So it will either require more work, or some custom shenanigans with llama.cpp, to get the HF-to-GGUF conversion to work correctly. It's something I'm still looking into as time permits.

@ddh0 is also helping explore this.

Isn't this multi-token prediction? It's not even supported. Can you not add dummy keys into the HF model?

Agreed, that was my thought too. That error was after adding the following to the HF model.safetensors.index.json:

    "model.layers.92.eh_proj.weight": "mtp.safetensors",
    "model.layers.92.enorm.weight": "mtp.safetensors",
    "model.layers.92.hnorm.weight": "mtp.safetensors",
    "model.layers.92.input_layernorm.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.0.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.0.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.0.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.1.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.1.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.1.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.10.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.10.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.10.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.100.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.100.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.100.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.101.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.101.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.101.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.102.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.102.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.102.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.103.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.103.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.103.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.104.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.104.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.104.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.105.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.105.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.105.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.106.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.106.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.106.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.107.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.107.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.107.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.108.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.108.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.108.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.109.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.109.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.109.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.11.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.11.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.11.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.110.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.110.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.110.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.111.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.111.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.111.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.112.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.112.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.112.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.113.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.113.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.113.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.114.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.114.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.114.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.115.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.115.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.115.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.116.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.116.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.116.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.117.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.117.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.117.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.118.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.118.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.118.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.119.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.119.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.119.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.12.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.12.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.12.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.13.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.13.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.13.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.14.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.14.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.14.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.15.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.15.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.15.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.16.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.16.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.16.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.17.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.17.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.17.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.18.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.18.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.18.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.19.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.19.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.19.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.2.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.2.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.2.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.20.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.20.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.20.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.21.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.21.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.21.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.22.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.22.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.22.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.23.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.23.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.23.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.24.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.24.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.24.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.25.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.25.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.25.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.26.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.26.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.26.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.27.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.27.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.27.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.28.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.28.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.28.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.29.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.29.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.29.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.3.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.3.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.3.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.30.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.30.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.30.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.31.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.31.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.31.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.32.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.32.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.32.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.33.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.33.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.33.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.34.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.34.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.34.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.35.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.35.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.35.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.36.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.36.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.36.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.37.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.37.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.37.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.38.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.38.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.38.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.39.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.39.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.39.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.4.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.4.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.4.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.40.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.40.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.40.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.41.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.41.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.41.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.42.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.42.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.42.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.43.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.43.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.43.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.44.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.44.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.44.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.45.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.45.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.45.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.46.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.46.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.46.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.47.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.47.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.47.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.48.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.48.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.48.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.49.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.49.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.49.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.5.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.5.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.5.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.50.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.50.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.50.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.51.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.51.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.51.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.52.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.52.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.52.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.53.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.53.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.53.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.54.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.54.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.54.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.55.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.55.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.55.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.56.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.56.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.56.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.57.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.57.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.57.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.58.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.58.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.58.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.59.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.59.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.59.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.6.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.6.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.6.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.60.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.60.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.60.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.61.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.61.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.61.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.62.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.62.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.62.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.63.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.63.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.63.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.64.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.64.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.64.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.65.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.65.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.65.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.66.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.66.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.66.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.67.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.67.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.67.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.68.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.68.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.68.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.69.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.69.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.69.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.7.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.7.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.7.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.70.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.70.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.70.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.71.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.71.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.71.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.72.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.72.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.72.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.73.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.73.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.73.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.74.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.74.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.74.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.75.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.75.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.75.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.76.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.76.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.76.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.77.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.77.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.77.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.78.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.78.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.78.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.79.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.79.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.79.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.8.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.8.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.8.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.80.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.80.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.80.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.81.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.81.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.81.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.82.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.82.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.82.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.83.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.83.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.83.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.84.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.84.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.84.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.85.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.85.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.85.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.86.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.86.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.86.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.87.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.87.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.87.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.88.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.88.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.88.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.89.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.89.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.89.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.9.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.9.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.9.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.90.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.90.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.90.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.91.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.91.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.91.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.92.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.92.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.92.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.93.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.93.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.93.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.94.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.94.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.94.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.95.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.95.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.95.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.96.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.96.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.96.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.97.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.97.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.97.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.98.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.98.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.98.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.99.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.99.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.experts.99.up_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.gate.e_score_correction_bias": "mtp.safetensors",
    "model.layers.92.mlp.gate.weight": "mtp.safetensors",
    "model.layers.92.mlp.shared_experts.down_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.shared_experts.gate_proj.weight": "mtp.safetensors",
    "model.layers.92.mlp.shared_experts.up_proj.weight": "mtp.safetensors",
    "model.layers.92.post_attention_layernorm.weight": "mtp.safetensors",
    "model.layers.92.self_attn.k_norm.weight": "mtp.safetensors",
    "model.layers.92.self_attn.k_proj.bias": "mtp.safetensors",
    "model.layers.92.self_attn.k_proj.weight": "mtp.safetensors",
    "model.layers.92.self_attn.o_proj.weight": "mtp.safetensors",
    "model.layers.92.self_attn.q_norm.weight": "mtp.safetensors",
    "model.layers.92.self_attn.q_proj.bias": "mtp.safetensors",
    "model.layers.92.self_attn.q_proj.weight": "mtp.safetensors",
    "model.layers.92.self_attn.v_proj.bias": "mtp.safetensors",
    "model.layers.92.self_attn.v_proj.weight": "mtp.safetensors",
    "model.layers.92.shared_head.norm.weight": "mtp.safetensors"

My idea was to copy over the original GLM-4.6 safetensors file that has the MTP tensors, then copy the corresponding MTP entries into the index verbatim.

I can’t try this now because I’m afk, but I can in a few days.

I tried that yesterday, both with the original index (with the 160 MTP experts) and with the index re-sized to 120 experts (as pasted above). Both forms failed in the HF to GGUF conversion. I've been busy the last couple of days but I will try to follow up more with this tonight and see if I can get something working.

Hmm, you could get the llama.cpp conversion script to print ‘self._experts’ to see what's actually in the object. It looks like it iterated through a few before erroring out at 7.

Steps:

  1. Rename mtp.safetensors to model-mtp.safetensors, and change the corresponding lines in model.safetensors.index.json
  2. (might not be necessary?) Remove all model.layers.92.mlp.experts.$EXPERT.$TYPE.weight lines from model.safetensors.index.json for all $EXPERT >= 80
  3. Apply the patch below to llama.cpp's convert_hf_to_gguf.py. Note that I'm almost positive it takes the first 80 experts from model-mtp.safetensors rather than the 80 non-pruned experts, but since there's no MTP support in llama.cpp your GGUF will still work. Once there's MTP support it'll stop working.
  4. Run the conversion.
  5. Quantize the output.
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index ed99dc847..0ee0e0b48 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -230,9 +230,10 @@ class ModelBase:
                 raise ValueError(f"Missing or incomplete model files: {missing_files}\n"
                                  f"Missing tensors: {missing}")
             else:
-                raise ValueError("Mismatch between weight map and model parts for tensor names:\n"
-                                 f"Missing tensors: {missing}\n"
-                                 f"Extra tensors: {extra}")
+                # raise ValueError("Mismatch between weight map and model parts for tensor names:\n"
+                #                  f"Missing tensors: {missing}\n"
+                #                  f"Extra tensors: {extra}")
+                pass
 
     def format_tensor_name(self, key: gguf.MODEL_TENSOR, bid: int | None = None, suffix: str = ".weight") -> str:
         if key not in gguf.MODEL_TENSORS[self.model_arch]:
@@ -299,6 +300,10 @@ class ModelBase:
                 if part.isdecimal():
                     bid = int(part)
                     break
+            if len(name.split('.')) > 5 and name.split('.')[4] == 'experts':
+                print(name)
+                if int(name.split('.')[5]) >= 80:
+                    continue
 
             for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                 # TODO: why do we squeeze here?
@@ -7279,6 +7284,8 @@ class Glm4MoeModel(TextModel):
     def modify_tensors(
         self, data_torch: Tensor, name: str, bid: int | None
     ) -> Iterable[tuple[str, Tensor]]:
+        print("modify_tensors", name)
+        
         if name.startswith("model.visual."):  # ignore visual part
             return []
         elif name.startswith("model.language_model."):
@@ -7297,8 +7304,15 @@ class Glm4MoeModel(TextModel):
                 self._experts = [{} for _ in range(self.block_count)]
 
             self._experts[bid][name] = data_torch
+            # expert_idxs = []
+            # for key in self._experts[bid].keys():
+            #     expert_idx = int(key.split('.')[5])
+            #     if expert_idx not in expert_idxs:
+            #         expert_idxs.append(expert_idx)
+            # print(len(expert_idxs))
 
             if len(self._experts[bid]) >= n_experts * 3:
+                print(len(self._experts[bid]), n_experts*3)
                 tensors: list[tuple[str, Tensor]] = []
 
                 # merge the experts into a single 3d tensor
@@ -7307,15 +7321,21 @@ class Glm4MoeModel(TextModel):
 
                     for xid in range(n_experts):
                         ename = f"model.layers.{bid}.mlp.experts.{xid}.{w_name}.weight"
-                        datas.append(self._experts[bid][ename])
-                        del self._experts[bid][ename]
+                        try:
+                            datas.append(self._experts[bid][ename])
+                            del self._experts[bid][ename]
+                        except KeyError:
+                            continue
 
-                    data_torch = torch.stack(datas, dim=0)
+                    try:
+                        data_torch = torch.stack(datas, dim=0)
 
-                    merged_name = f"model.layers.{bid}.mlp.experts.{w_name}.weight"
+                        merged_name = f"model.layers.{bid}.mlp.experts.{w_name}.weight"
 
-                    new_name = self.map_tensor_name(merged_name)
-                    tensors.append((new_name, data_torch))
+                        new_name = self.map_tensor_name(merged_name)
+                        tensors.append((new_name, data_torch))
+                    except RuntimeError:
+                        continue
                 return tensors
             else:
                 return []

You can also refer to the discussion on the Cerebras GLM-4.5-Air prune, but I'm not sure how they generated their updated mtp.safetensors, so it may not be trivial to reproduce.

So even if you add it to the index file of the HF model, the state dict won't have those keys. You have to dump the model's state dict, add the keys (with zero-sized dummy tensors, or whatever structure they originally had), and re-save the PyTorch model. If I could get the full weights, that would be my way to go about it. The mtp.safetensors can be generated this way through Python; as long as the keys match, GGUF shouldn't care.
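
If someone does have the full weights on disk, a rough sketch of that approach might look like the following (the checkpoint path and the layer-92 prefix are assumptions taken from this thread, not verified against the actual repo layout):

# Sketch (untested): pull every layer-92 (MTP) tensor out of the original,
# unpruned GLM-4.6 shards and re-save them as a single mtp.safetensors
# whose keys match what the converter will look up.
import json
from safetensors.torch import load_file, save_file

ORIG_DIR = "/path/to/original/GLM-4.6"  # assumed: full, unpruned checkpoint
MTP_PREFIX = "model.layers.92."         # MTP block lives in layer 92

with open(f"{ORIG_DIR}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Group the MTP keys by the shard file that holds them.
shards: dict[str, list[str]] = {}
for key, fname in weight_map.items():
    if key.startswith(MTP_PREFIX):
        shards.setdefault(fname, []).append(key)

mtp_tensors = {}
for fname, keys in shards.items():
    tensors = load_file(f"{ORIG_DIR}/{fname}")  # loads the whole shard
    for key in keys:
        mtp_tensors[key] = tensors[key]

save_file(mtp_tensors, "mtp.safetensors")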

I did quant a couple and uploaded them at jmoney54378256438905/AesSedai_GLM-4.6-REAP-178B-A32B_GGUF, but at 3- and 4-bit the model feels a lot dumber. May need an imatrix for these pruned models.

I did two different imatrix quants and they're both astoundingly dumb.

@jmoney54378256438905 : Thanks for trying! I tried your 178B Q3K model and its responses were coherent enough, but it seems not to know a lot of stuff even GLM 4.5 Air knows at 3-bit AWQ or 4.x bpw quants. I asked about a few obscure people both the Air and full models knew about, but this quant didn't. It was able to tell me about the Amiga computer system, though.

I'm currently downloading a Q2K of the larger REAP 268B GLM model. Will report back my findings, unless anyone has tried this already?

I am trying the bigger model. It's definitely strange. Guess tonight I will download this one. I wouldn't say so much that it's dumber as that it's more twisted. Bizarro-GLM.

The dataset used to determine which experts to prune must have a big impact on the "behavior".

Yeah, REAP definitely seems like a bust for anything except specialized tasks (and even then I'm very skeptical), but I wonder if it would be useful for selectively offloading experts from VRAM to system RAM? Like, this model is 80/160 experts pruned, 8 active per token, right? If you kept those 80 non-pruned experts in VRAM and moved the 80 pruned ones to system RAM, how often could you run a layer's forward pass completely on the GPU, and how much time would you save by doing that vs the time you lose copying intermediate data across the PCIe bus? I think it would just be token embeddings (?) and maybe some sort of attention information that you'd need to copy rather than any model weights, but my working knowledge of transformers is very much breaking down by now. Also, I'd imagine the router quite often selects mostly non-pruned experts, and I haven't the faintest idea whether having to run a CPU forward pass on one or two offloaded experts would slow down the entire layer's forward pass. Questions I wish I was smart enough to answer.

Experts are per layer? Hard to keep individual pieces in RAM vs VRAM. There are 3 new REAP models from Cerebras itself.

I've now tried both a 178B Q3K quant and a 218B MLX 3-bit quant, and both display a pretty catastrophic loss of general knowledge. I assume the calibration dataset's focus on specialist topics like coding means only the experts activated by those specialties are preserved, and the ones that are thrown away are important for general knowledge. That's frustrating, because my primary use for models is creative writing and roleplaying, which it's not likely to be very good at if it doesn't "know" anything about the characters or the universes they're part of.

The responses themselves are coherent and the model seems mostly to understand what's being asked, but it does get confused and exhibits some bizarre behaviour in responding, like noticing a typo in a prompt and assuming that every word of the name I typed was therefore a typo, with the corrected version being spelt exactly the same, and not knowing who Bill Clinton is - assuming it was a typo and that "he" is known as Hillary Clinton and was Secretary of State... for Guam!! "He" was also mostly famous for being in a film called The Adventures of Hillary Clinton, an "absurdist comedy movie."

Has anyone else tested for general knowledge, especially with the larger-parameter ones, and gotten better results? I don't use GLM for coding so I can't really test how well it does with that, but I presume Cerebras wouldn't have uploaded quants that fail to do anything right.

[Four screenshots of the responses described above, taken 2025-10-25, attached.]

I used the larger AesSedai quant and it forgot who Fillian was. Most of its Chinese was lost. The model stopped having a positivity bias. This much gone doesn't look good, though. We need a REAP calibrated for creative writing with a tiny bit of code and tool calling. I'm about 20 GB away from trying out the 218B. Note that I do not use reasoning, so this is what it says without it.

The best quant for GLM 4.6 I've used so far under 100 GB is Unsloth's UD IQ1_S quant. I bristle at the thought of using a quant under 3 bpw, but it's surprisingly functional as an RP model - very capable, with full knowledge retention. I was hoping to "step up" to a higher quant via REAP, since going by the numbers, there should be less degradation from pruning experts than the drop-off you're supposed to experience below 3 bpw quantisation.

Note that it says it's IQ1_S, but the naming is a bit misleading. Bartowski's quants go up to about IQ2_XS, I think, at the same size. Unsloth selectively 'upgrades' various layers that are more sensitive to quantisation and lowers those that aren't critical. So in theory a 96 GB quant of theirs should be better than a 96 GB standard quant from someone else, regardless of what the name suggests.

Yeah, and Unsloth also allocates more to the "dense" layers of the MoE.

But IQ1 is a huge KLD hit. Even a 12% REAP to squeeze it into 100 GB could be a massive boost.

Also, you should look into the ik_llama.cpp IQ1_KT/IQ2_KT quants. They're less devastating than mainline llama.cpp's. I can even cook one for you, if you wish.

The 268B doesn't know Bill Clinton either. I want to put this model out to pasture, but it's so unhinged. I think REAP might be good at getting rid of stilted replies and alignment... just gotta make the model not so stupid.

After trying several quants, the models don't really run faster either. I am getting the same speeds from IQ3_XXS as from Q3_K_XL despite more of it being in system RAM. I get that prompt processing should be the same, and it roughly is... but token generation should be much faster given the lower total memory footprint.

I'm of the opinion that AesSedai's came out better than Cerebras's. At least the bigger one.

Yep, same conclusion here. Tested and quanted multiple REAP models, but they all just seem much, much stupider than normal 4.6, even at Q2_K. I'll just wait "2 more weeks" until 4.6 Air releases.
