Qwen3 Distilled Student (0.6B GPT2-style)
A compact, CPU-friendly student model distilled from Qwen3-0.6B, optimized for lightweight deployment and real-time chat. It is designed for resource-constrained environments such as the browser, Colab, or mobile devices.
Architecture
- Based on the `GPT2Config` schema for compatibility
- Patches applied: `n_inner`, `layer_norm_epsilon`, `activation_function`, etc.
- Handles missing dropout attributes gracefully
- Supports attention streaming and assistant-style prompting
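The config patching described above might look roughly like the following. This is a minimal sketch, not the project's actual code: `patch_config` and `PATCH_DEFAULTS` are hypothetical names, the default values are assumptions, and a `SimpleNamespace` stands in for a real `GPT2Config` so the example runs without extra dependencies. The field names themselves (`n_inner`, `layer_norm_epsilon`, `activation_function`, `resid_pdrop`, `embd_pdrop`, `attn_pdrop`) are standard `GPT2Config` attributes.

```python
from types import SimpleNamespace

# Assumed defaults for illustration; the real project may use different values.
PATCH_DEFAULTS = {
    "layer_norm_epsilon": 1e-5,
    "activation_function": "gelu_new",
    "resid_pdrop": 0.0,  # dropout fields may be absent on a converted config
    "embd_pdrop": 0.0,
    "attn_pdrop": 0.0,
}


def patch_config(config):
    """Fill in GPT-2 style fields that are missing or None, in place."""
    for name, default in PATCH_DEFAULTS.items():
        if getattr(config, name, None) is None:
            setattr(config, name, default)
    # GPT-2 conventionally uses a feed-forward width of 4 * n_embd
    # when n_inner is not set.
    if getattr(config, "n_inner", None) is None:
        config.n_inner = 4 * config.n_embd
    return config


# Stand-in for a config that was loaded without dropout fields.
# The sizes here are hypothetical, not the student's real dimensions.
cfg = patch_config(SimpleNamespace(n_embd=256, n_layer=4, n_head=4))
print(cfg.n_inner, cfg.resid_pdrop, cfg.activation_function)
```

Reading attributes through `getattr(..., None)` rather than direct access is what lets a config with missing dropout fields load without raising `AttributeError`.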
Training Setup
- Source: Qwen3-0.6B
- Distillation: direct next-token distillation using custom prompt logic
- Platform: Kaggle GPU (A100, 40GB)
- Framework: TensorFlow / PyTorch hybrid flow, minimal dependencies
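The exact distillation objective is not stated here. One common formulation of next-token distillation is a temperature-scaled KL divergence between the teacher's and the student's next-token distributions. The sketch below shows that formulation in plain Python for a single position. The function names, the temperature value, and the toy logits are all assumptions for illustration, and it presumes teacher and student share a vocabulary.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) for one next-token position, scaled by T^2."""
    s = log_softmax(student_logits, temperature)
    t = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))
    return kl * temperature ** 2


# Toy logits over a 3-token vocabulary (hypothetical values).
teacher = [2.0, 0.5, -1.0]
same = distill_loss(teacher, teacher)          # student matches teacher
diff = distill_loss([0.0, 0.0, 0.0], teacher)  # uniform student
print(same, diff)
```

In a real training loop this quantity would be averaged over all positions in a batch and computed on tensors rather than Python lists.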