GLM-4.5 MLX 8-bit
Model Description
This is an 8-bit quantized MLX version of zai-org/GLM-4.5, optimized for Apple Silicon with high unified memory configurations.
Key Features
- 8-bit quantization (8.502 bits per weight) for memory efficiency
- MLX optimized for Apple Silicon's unified memory architecture
- High-memory optimized: Designed for systems with 512GB+ unified memory
- Long context capable: Tested with multiple 6,500+ word documents and 30K-token input chunks
- Performance: ~11.75 tokens/second on a Mac Studio with 512GB unified memory
 
Model Details
- Base Model: GLM-4.5 by ZhipuAI
- Architecture: Mixture of Experts (MoE)
- Quantization: 8-bit MLX with group size 64 (a conversion sketch follows this list)
- MLX-LM Version: 0.26.3
- Model Size: ~375GB
- Context Length: 131,072 tokens (tested stable up to 132K+ tokens)
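
For reference, a conversion with these settings can be reproduced with mlx-lm's `convert()` helper. This is a sketch under the assumption that the repo was produced via the standard mlx-lm conversion path, not a record of the exact command used. As a rough size check: GLM-4.5 has about 355B total parameters, and 355e9 × 8.502 bits ÷ 8 ≈ 377 GB, in line with the ~375GB figure above.

```python
from mlx_lm import convert

# A sketch of the conversion, assuming mlx-lm's standard quantization path;
# not necessarily the exact command used to produce this repo.
# convert() downloads the original weights first, so budget ample disk space.
convert(
    hf_path="zai-org/GLM-4.5",
    mlx_path="GLM-4.5-MLX-8bit",  # local output directory (hypothetical name)
    quantize=True,
    q_bits=8,         # 8-bit weights, matching the listed quantization
    q_group_size=64,  # group size 64, matching the listed setting
)
```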
 
System Requirements
- Hardware: Mac Studio or Mac Pro with Apple Silicon (tested on M3 Ultra)
- Memory: 512GB+ unified memory strongly recommended (a quick check sketch follows this list)
- Storage: ~400GB free space
- Software: macOS with the MLX framework
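
If you are unsure whether a machine meets the memory bar, total unified memory can be checked before attempting a load. A minimal sketch (`hw.memsize` is the standard macOS sysctl key; the 512GB threshold mirrors the recommendation above):

```python
import subprocess

# Total unified memory in bytes, as reported by macOS
mem_bytes = int(
    subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
)

mem_gb = mem_bytes / 1024**3
print(f"Unified memory: {mem_gb:.0f} GB")
if mem_gb < 512:
    print("Warning: below the 512GB recommended for this ~375GB model.")
```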
 
Performance Benchmarks
Test Configuration: 2025 Mac Studio M3 Ultra with 512GB unified memory
Context Length Performance
- Short Context (6.5K tokens): 11.75 tokens/second
- Long Context (72K tokens): 5.0 tokens/second, 86% memory usage
- Extended Context (121K tokens, including a 30K-token input prompt): 2.53 tokens/second, 92% memory usage
- Beyond Stated Limit (132K tokens, including an 11K-token input prompt): 5.74 tokens/second, 85% peak memory
- Proven Capability: Successfully runs past the stated 131K context window (102.2% of capacity)
- Quality: Full comprehension and analysis of complex, sprawling content at maximum context
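
These throughput numbers can be sanity-checked with a simple timing loop. A minimal sketch, assuming the mlx_lm Python API shown in the Usage section (the prompt is a placeholder, and the timing includes prompt processing, so treat the result as a rough figure):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

prompt = "Summarize the trade-offs of 8-bit quantization."  # placeholder prompt
start = time.perf_counter()
response = generate(model, tokenizer, prompt, max_tokens=200)
elapsed = time.perf_counter() - start

# Rough generation throughput; includes prompt-processing time
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens / elapsed:.2f} tokens/second")
```

Alternatively, passing `verbose=True` to `generate()` prints prompt and generation speeds directly.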
 
Recommended Generation Settings
- Temperature: 0.8
- Top K: 100
- Repeat Penalty: 1.1
- Min P: Default/unset
- Top P: Default/unset
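
In mlx_lm these knobs are set through the sampling helpers rather than as direct `generate()` arguments. A sketch, assuming the `make_sampler`/`make_logits_processors` helpers in `mlx_lm.sample_utils` as of the 0.26.x series:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# Temperature 0.8 and top-k 100; top-p and min-p stay at their defaults
sampler = make_sampler(temp=0.8, top_k=100)
# Repeat penalty of 1.1
logits_processors = make_logits_processors(repetition_penalty=1.1)

response = generate(
    model, tokenizer, "Your prompt here",
    max_tokens=500,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```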
 
Comparison with GGUF
- MLX Version: System remains responsive during inference; performance stays stable
- GGUF Version: System becomes unusable, with frequent crashes once the context reaches roughly 30-40K tokens
 
Usage
With MLX-LM
```python
from mlx_lm import load, generate

# Download (if needed) and load the 8-bit weights
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")
# Generate up to 500 new tokens from a single prompt
response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
```
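
For long inputs like the 30K-token chunks benchmarked above, streaming makes progress visible as tokens arrive. A sketch assuming mlx_lm's `stream_generate`, which yields incremental response chunks (`document.txt` is a placeholder):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

long_prompt = open("document.txt").read()  # placeholder long input
for chunk in stream_generate(model, tokenizer, long_prompt, max_tokens=1000):
    print(chunk.text, end="", flush=True)
```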
With LM Studio
- Download the model files
- Load the model in LM Studio
- Set the context length based on your available memory
- Apply the generation settings from the Recommended Generation Settings section above
 
Limitations
- Requires substantial unified memory (512GB+ recommended)
- Optimized specifically for Apple Silicon; may not perform well on other architectures
- Quantization may introduce minor quality differences compared to the full-precision model
 
Training Data & Bias
Please refer to the original GLM-4.5 model card for information about training data, intended use, and potential biases.
Citation
If you use this model, please cite both the original GLM-4.5 work and acknowledge this MLX conversion:
```bibtex
@misc{glm45-mlx-8bit,
  title={GLM-4.5 MLX 8-bit},
  author={Onceler},
  year={2025},
  howpublished={\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}},
}
```
Acknowledgments
- Original model by ZhipuAI (zai-org/GLM-4.5)
- MLX framework by Apple
- Conversion performed on a Mac Studio with 512GB unified memory
 
License
This model inherits the license from the original GLM-4.5 model. Please refer to the original model repository for license details.