---
license: apache-2.0
language:
- en
datasets:
- wikitext
- glue
pipeline_tag: text-generation
tags:
- transformer
- attention
- mla
- research
---

# DeepSeek Tiny v0.1

A 6-layer DeepSeek-V3 model with Multi-head Latent Attention (MLA), trained for research on shared subspaces in Transformer attention mechanisms.

## Model Description

- **Model Type**: Transformer decoder (DeepSeek-V3 based)
- **Architecture**: 6-layer decoder with Mixture of Experts
- **Parameters**: 16.26M
- **Hidden Size**: 256
- **Attention Heads**: 8
- **Head Dimension**: 32
- **Sequence Length**: 1,024 tokens
- **Query Latent Dimension**: 96
- **Key-Value Latent Dimension**: 64

An illustrative sketch of how these latent dimensions shape the attention projections is given in the appendix below.

## Performance

- **SST-2 Accuracy**: 87.96%
- **WikiText-103 Perplexity**: 28.89

## Research Context

This model is part of the [shared-subspaces](https://github.com/chrisjmccormick/shared-subspaces) research project investigating the impact of shared output latent spaces in Transformer attention mechanisms.

## Usage

```python
from transformers import DeepseekV3ForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

# Generate text; do_sample=True is required for temperature to take effect
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Pre-training Dataset**: WikiText-103
- **Fine-tuning Dataset**: SST-2 (GLUE)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-4 (pre-training), 5e-5 (fine-tuning)
- **Weight Decay**: 0.01 (pre-training), 0.05 (fine-tuning)
- **Precision**: bfloat16
- **Compilation**: `torch.compile` with the inductor backend
- **Training Steps**: 12,500 (pre-training), 1,500 (fine-tuning)

An approximate reconstruction of these settings as a `TrainingArguments` configuration is also included in the appendix.

## Limitations

- Small-scale model (16M parameters) intended for research purposes
- Trained on limited data compared to production models
- May require custom loading code for output-subspace variants

## Citation

```bibtex
@misc{mccormick2025sharedsubspaces,
  title={Shared Subspaces in Transformer Attention: Investigating Output Latent Spaces},
  author={McCormick, Chris},
  year={2025},
  howpublished={\url{https://github.com/chrisjmccormick/shared-subspaces}}
}
```

## License

Apache 2.0
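
## Appendix: Illustrative Sketches

The following is a minimal sketch of the attention projection shapes implied by the configuration above (hidden size 256, 8 heads of dimension 32, query latent 96, key-value latent 64). The layer names here are hypothetical and the code is for orientation only; the actual implementation is the DeepSeek-V3 modeling code in `transformers`, which also includes decoupled RoPE dimensions omitted here.

```python
import torch
import torch.nn as nn

hidden_size, num_heads, head_dim = 256, 8, 32
q_latent, kv_latent = 96, 64

# Queries: compress to a 96-dim latent, then expand to per-head queries.
q_down = nn.Linear(hidden_size, q_latent, bias=False)
q_up = nn.Linear(q_latent, num_heads * head_dim, bias=False)

# Keys/values: compress once to a shared 64-dim latent (the quantity an
# MLA KV cache stores), then expand to per-head keys and values.
kv_down = nn.Linear(hidden_size, kv_latent, bias=False)
k_up = nn.Linear(kv_latent, num_heads * head_dim, bias=False)
v_up = nn.Linear(kv_latent, num_heads * head_dim, bias=False)

x = torch.randn(1, 16, hidden_size)    # (batch, seq, hidden)
q = q_up(q_down(x))                    # (1, 16, 256), reshaped per-head to (1, 16, 8, 32)
kv_c = kv_down(x)                      # (1, 16, 64): the compressed KV latent
k, v = k_up(kv_c), v_up(kv_c)          # (1, 16, 256) each
```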
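
Similarly, the training hyperparameters listed above can be expressed, approximately, as Hugging Face `TrainingArguments`. This is a hedged reconstruction, not the project's actual training script (see the shared-subspaces repository for that); the `output_dir` values are placeholders.

```python
from transformers import TrainingArguments

# Approximate pre-training configuration based on the values listed above.
pretrain_args = TrainingArguments(
    output_dir="deepseek-tiny-pretrain",  # placeholder
    max_steps=12_500,
    learning_rate=5e-4,
    weight_decay=0.01,
    optim="adamw_torch",
    bf16=True,
    torch_compile=True,
    torch_compile_backend="inductor",
)

# The SST-2 fine-tuning stage used a lower learning rate and higher weight decay.
finetune_args = TrainingArguments(
    output_dir="deepseek-tiny-sst2",  # placeholder
    max_steps=1_500,
    learning_rate=5e-5,
    weight_decay=0.05,
    optim="adamw_torch",
    bf16=True,
)
```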