# VelocityLM
A high-performance, custom transformer language model trained from scratch using modern architectural innovations. VelocityLM combines state-of-the-art techniques including RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE) to deliver efficient and scalable language modeling.
## Quick Links
- Try the Model: Interactive Demo Space
- Source Code: [GitHub Repository](https://github.com/dixisouls/VelocityLM)
## Model Architecture
VelocityLM features a custom transformer architecture optimized for performance and efficiency. The key hyperparameters are listed below, with a configuration sketch following the list.
### Model Specifications
- Parameters: ~2B parameters
- Architecture: Decoder-only transformer with causal attention
- Hidden Size: 2,048
- Layers: 24 transformer layers
- Attention Heads: 32 heads per layer
- Vocabulary: 50,257 tokens (GPT-2 tokenizer compatible)
- Context Length: 2,048 tokens
- Intermediate Size: 8,192 (4x hidden size)
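For reference, these hyperparameters can be collected into a small configuration object. The sketch below is illustrative; the class and field names are hypothetical rather than taken from the repository. Note that with 32 heads and a hidden size of 2,048, each attention head operates on a 64-dimensional subspace.

```python
from dataclasses import dataclass

@dataclass
class VelocityLMConfig:
    """Illustrative config mirroring the specification above (names are hypothetical)."""
    vocab_size: int = 50_257          # GPT-2-compatible tokenizer
    hidden_size: int = 2_048
    num_layers: int = 24
    num_heads: int = 32               # 2,048 / 32 = 64-dim heads
    intermediate_size: int = 8_192    # 4x hidden size
    max_seq_len: int = 2_048          # context length
```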
## Key Innovations
### RMSNorm (Root Mean Square Normalization)
- Replaces LayerNorm for improved training stability and efficiency
- Better gradient flow compared to traditional normalization
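For concreteness, a minimal RMSNorm layer in PyTorch might look like the sketch below. It follows the standard formulation (scale by the reciprocal RMS and a learned gain, with no mean subtraction or bias) and is not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```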
### SwiGLU Activation Function
- Gated Linear Unit with Swish activation
- Superior performance compared to standard ReLU/GELU for language modeling
- Enhanced expressivity and gradient flow
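A minimal SwiGLU feed-forward block is sketched below; the projection names (gate/up/down) are illustrative, and the default sizes simply mirror the specification above rather than anything read from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN with a SiLU (Swish) gated linear unit: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size: int = 2_048, intermediate_size: int = 8_192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```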
### Rotary Position Embeddings (RoPE)
- Relative position encoding applied by rotating query and key vectors
- Better extrapolation capabilities to longer sequences
- More efficient than learned absolute position embeddings
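The sketch below shows the standard RoPE computation applied to a query or key tensor; the function name and tensor layout are assumptions for illustration, not the repository's API.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate consecutive (even, odd) feature pairs of q or k by position-dependent angles.
    x: (batch, num_heads, seq_len, head_dim), with even head_dim."""
    *_, seq_len, head_dim = x.shape
    # One rotation frequency per feature pair, one angle per (position, pair).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()            # (seq_len, head_dim // 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # split into rotation pairs
    rotated = torch.stack((x1 * cos - x2 * sin,      # 2-D rotation of each pair
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                       # re-interleave the pairs
```

Because the dot product of two rotated vectors depends only on the difference of their rotation angles, attention scores depend on relative rather than absolute positions.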
## Training Details
- Dataset: Falcon RefinedWeb - high-quality web text
- Training Steps: 5,000+ completed
- Optimization: AdamW with cosine annealing schedule
- Hardware: Trained on 4x NVIDIA A100 (80GB) GPUs
- Features: Mixed precision (FP16), gradient checkpointing, distributed training
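A rough sketch of that optimization setup is shown below; the learning rate, weight decay, restart period, and dummy model/loss are placeholders, not the values actually used to train VelocityLM.

```python
import torch
import torch.nn as nn

model = nn.Linear(2_048, 2_048)  # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1_000)

for step in range(10):                                   # one scheduler step per training step
    loss = model(torch.randn(8, 2_048)).pow(2).mean()    # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    scheduler.step()
```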
## Usage
### Basic Text Generation
```python
# Note: this model requires custom loading code.
# See the GitHub repository for the complete implementation.
import torch
from transformers import AutoTokenizer

# Load the tokenizer (GPT-2 compatible)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# For complete usage examples and model loading, visit:
# https://github.com/dixisouls/VelocityLM
```
### Interactive Demo
Try the model immediately in our Hugging Face Space - no setup required!
## Performance Features
### Generation Strategies
- Greedy decoding for deterministic output
- Top-k and top-p (nucleus) sampling
- Temperature control for creativity adjustment
- Repetition penalty to reduce repetitive text
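As an illustration of how these strategies combine, the sketch below filters a logits vector with temperature, top-k, and nucleus (top-p) sampling; the function and its default values are illustrative, not the repository's generation API. A repetition penalty would additionally down-weight the logits of tokens already generated before this step, and greedy decoding is simply `torch.argmax(logits)`.

```python
import torch

def sample_next_token(logits: torch.Tensor,
                      temperature: float = 0.8,
                      top_k: int = 50,
                      top_p: float = 0.9) -> torch.Tensor:
    """Sample one token id from a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-5)               # temperature scaling
    if top_k > 0:                                          # keep only the k best logits
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the probability mass before them already exceeds top_p.
    sorted_probs = sorted_probs.masked_fill(cumulative - sorted_probs > top_p, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]

next_token = sample_next_token(torch.randn(50_257))        # example call on random logits
```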
### Memory Optimizations
- Gradient checkpointing (40% memory reduction)
- Efficient causal attention implementation
- Streaming data processing
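Of these, gradient checkpointing is the easiest to illustrate: each block's activations are recomputed during the backward pass instead of being stored. The wrapper below is a generic sketch built on `torch.utils.checkpoint`, not the repository's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Recompute the wrapped block's activations in the backward pass to save memory."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and x.requires_grad:
            return checkpoint(self.block, x, use_reentrant=False)
        return self.block(x)
```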
## Technical Implementation
This model implements several cutting-edge techniques:
- Distributed Training: Multi-GPU support with PyTorch DDP
- Mixed Precision: FP16 training with automatic loss scaling
- Advanced Scheduling: Cosine annealing with warm restarts
- Memory Efficiency: Gradient checkpointing and parameter grouping
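The snippet below sketches how these pieces typically fit together with PyTorch DDP and FP16 autocast plus a gradient scaler; the model, batch, loss, and hyperparameters are placeholders, and the repository's actual training loop will differ in detail.

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker():
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=4 train.py`.
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(2_048, 2_048).cuda(local_rank),   # stand-in for VelocityLM
                device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scaler = torch.cuda.amp.GradScaler()                     # automatic loss scaling

    for step in range(100):
        batch = torch.randn(8, 2_048, device=f"cuda:{local_rank}")
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(batch).pow(2).mean()                # dummy loss
        optimizer.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()                        # scaled FP16 backward
        scaler.step(optimizer)
        scaler.update()

if __name__ == "__main__":
    train_worker()
```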
## Installation & Setup
For detailed installation instructions, training scripts, and advanced usage:
[Visit the GitHub Repository](https://github.com/dixisouls/VelocityLM)
The repository includes:
- Complete training pipeline
- Inference utilities
- Configuration management
- Multi-GPU training support
- Comprehensive documentation
## Roadmap
Future enhancements planned:
- Flash Attention 2.0 integration
- Extended context length support (4K+)
- Model quantization for efficient deployment
- Fine-tuning capabilities for downstream tasks
- ONNX export for production inference