GLM-4.5 MLX 8-bit
Model Description
This is an 8-bit quantized MLX version of zai-org/GLM-4.5, optimized for Apple Silicon with high unified memory configurations.
Key Features
- 8-bit quantization (8.502 bits per weight) for memory efficiency
- MLX optimized for Apple Silicon's unified memory architecture
- High-memory optimized: Designed for systems with 512GB+ unified memory
- Long context capable: Tested with multiple 6,500+ word documents and 30K-token input chunks
- Performance: ~11.75 tokens/second on a Mac Studio with 512GB unified memory
 
Model Details
- Base Model: GLM-4.5 by ZhipuAI
- Architecture: Mixture of Experts (MoE)
- Quantization: 8-bit MLX with group size 64 (a conversion sketch follows this list)
- MLX-LM Version: 0.26.3
- Model Size: ~375GB
- Context Length: 131,072 tokens (tested stable up to 132K+ tokens)
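
For reference, a conversion with these settings can be reproduced with mlx-lm's `convert()` helper. This is a sketch under the assumption that the repo was produced via the standard mlx-lm conversion path, not a record of the exact command used. As a rough size check: GLM-4.5 has about 355B total parameters, and 355e9 × 8.502 bits ÷ 8 ≈ 377 GB, in line with the ~375GB figure above.

```python
from mlx_lm import convert

# A sketch of the conversion, assuming mlx-lm's standard quantization path;
# not necessarily the exact command used to produce this repo.
# convert() downloads the original weights first, so budget ample disk space.
convert(
    hf_path="zai-org/GLM-4.5",
    mlx_path="GLM-4.5-MLX-8bit",  # local output directory (hypothetical name)
    quantize=True,
    q_bits=8,         # 8-bit weights, matching the listed quantization
    q_group_size=64,  # group size 64, matching the listed setting
)
```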
 
System Requirements
- Hardware: Mac Studio or Mac Pro with Apple Silicon (tested on M3 Ultra)
- Memory: 512GB+ unified memory strongly recommended (a quick check sketch follows this list)
- Storage: ~400GB free space
- Software: macOS with the MLX framework
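
If you are unsure whether a machine meets the memory bar, total unified memory can be checked before attempting a load. A minimal sketch (`hw.memsize` is the standard macOS sysctl key; the 512GB threshold mirrors the recommendation above):

```python
import subprocess

# Total unified memory in bytes, as reported by macOS
mem_bytes = int(
    subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
)

mem_gb = mem_bytes / 1024**3
print(f"Unified memory: {mem_gb:.0f} GB")
if mem_gb < 512:
    print("Warning: below the 512GB recommended for this ~375GB model.")
```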
 
Performance Benchmarks
Test Configuration: 2025 Mac Studio M3 Ultra with 512GB unified memory
Context Length Performance
- Short Context (6.5K tokens): 11.75 tokens/second
- Long Context (72K tokens): 5.0 tokens/second, 86% memory usage
- Extended Context (121K tokens, including a 30K-token input prompt): 2.53 tokens/second, 92% memory usage
- Beyond Stated Limit (132K tokens, including an 11K-token input prompt): 5.74 tokens/second, 85% peak memory
- Proven Capability: Successfully runs past the stated 131K context window (102.2% of capacity)
- Quality: Full comprehension and analysis of complex, sprawling content at maximum context
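
These throughput numbers can be sanity-checked with a simple timing loop. A minimal sketch, assuming the mlx_lm Python API shown in the Usage section (the prompt is a placeholder, and the timing includes prompt processing, so treat the result as a rough figure):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

prompt = "Summarize the trade-offs of 8-bit quantization."  # placeholder prompt
start = time.perf_counter()
response = generate(model, tokenizer, prompt, max_tokens=200)
elapsed = time.perf_counter() - start

# Rough generation throughput; includes prompt-processing time
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens / elapsed:.2f} tokens/second")
```

Alternatively, passing `verbose=True` to `generate()` prints prompt and generation speeds directly.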
 
Recommended Generation Settings
- Temperature: 0.8
- Top K: 100
- Repeat Penalty: 1.1
- Min P: Default/unset
- Top P: Default/unset
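
In mlx_lm these knobs are set through the sampling helpers rather than as direct `generate()` arguments. A sketch, assuming the `make_sampler`/`make_logits_processors` helpers in `mlx_lm.sample_utils` as of the 0.26.x series:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# Temperature 0.8 and top-k 100; top-p and min-p stay at their defaults
sampler = make_sampler(temp=0.8, top_k=100)
# Repeat penalty of 1.1
logits_processors = make_logits_processors(repetition_penalty=1.1)

response = generate(
    model, tokenizer, "Your prompt here",
    max_tokens=500,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```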
 
Comparison with GGUF
- MLX Version: System remains responsive during inference; performance stays stable
- GGUF Version: System becomes unusable, with frequent crashes once the context reaches roughly 30-40K tokens
 
Usage
With MLX-LM
```python
from mlx_lm import load, generate

# Download (if needed) and load the 8-bit weights
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")
# Generate up to 500 new tokens from a single prompt
response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
```
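
For long inputs like the 30K-token chunks benchmarked above, streaming makes progress visible as tokens arrive. A sketch assuming mlx_lm's `stream_generate`, which yields incremental response chunks (`document.txt` is a placeholder):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

long_prompt = open("document.txt").read()  # placeholder long input
for chunk in stream_generate(model, tokenizer, long_prompt, max_tokens=1000):
    print(chunk.text, end="", flush=True)
```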
With LM Studio
- Download the model files
- Load the model in LM Studio
- Set the context length based on your available memory
- Apply the generation settings from the Recommended Generation Settings section above
 
Limitations
- Requires substantial unified memory (512GB+ recommended)
- Optimized specifically for Apple Silicon; may not perform well on other architectures
- Quantization may introduce minor quality differences compared to the full-precision model
 
Training Data & Bias
Please refer to the original GLM-4.5 model card for information about training data, intended use, and potential biases.
Citation
If you use this model, please cite both the original GLM-4.5 work and acknowledge this MLX conversion:
```bibtex
@misc{glm45-mlx-8bit,
  title={GLM-4.5 MLX 8-bit},
  author={Onceler},
  year={2025},
  howpublished={\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}},
}
```
Acknowledgments
- Original model by ZhipuAI (zai-org/GLM-4.5)
- MLX framework by Apple
- Conversion performed on a Mac Studio with 512GB unified memory
 
License
This model inherits the license from the original GLM-4.5 model. Please refer to the original model repository for license details.