VelocityLM πŸš€

A high-performance, custom transformer language model trained from scratch using modern architectural innovations. VelocityLM combines state-of-the-art techniques including RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE) to deliver efficient and scalable language modeling.

πŸ—οΈ Model Architecture

VelocityLM features a custom transformer architecture optimized for performance and efficiency. The key specifications are listed below, followed by an illustrative configuration sketch.

Model Specifications

  • Parameters: ~2 billion
  • Architecture: Decoder-only transformer with causal attention
  • Hidden Size: 2,048
  • Layers: 24 transformer layers
  • Attention Heads: 32 heads per layer
  • Vocabulary: 50,257 tokens (GPT-2 tokenizer compatible)
  • Context Length: 2,048 tokens
  • Intermediate Size: 8,192 (4x hidden size)
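
These numbers are summarized in the minimal configuration sketch below; the dataclass and field names are illustrative, not the repository's actual config API.

from dataclasses import dataclass

@dataclass
class VelocityLMConfig:
    # Illustrative field names; the actual config class in the repository may differ.
    vocab_size: int = 50257            # GPT-2 compatible tokenizer
    hidden_size: int = 2048
    num_layers: int = 24
    num_attention_heads: int = 32
    intermediate_size: int = 8192      # 4x hidden size (SwiGLU feed-forward)
    max_seq_length: int = 2048         # context length

config = VelocityLMConfig()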

πŸ”¬ Key Innovations

RMSNorm (Root Mean Square Normalization)

  • Replaces LayerNorm: normalizes by the root mean square of the features only, with no mean subtraction and no bias term
  • Improves training stability and efficiency, with better gradient flow than traditional LayerNorm (minimal sketch below)
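
A minimal PyTorch sketch of RMSNorm; the repository's implementation may differ in details such as the epsilon value or dtype handling.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square normalization: scale by the RMS of the features,
    with a learned gain but no mean subtraction or bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight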

SwiGLU Activation Function

  • Gated Linear Unit with Swish activation
  • Superior performance compared to standard ReLU/GELU for language modeling
  • Enhanced expressivity and gradient flow (see the sketch below)
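
A minimal SwiGLU feed-forward sketch using the dimensions listed above; the module and projection names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down by W_down."""
    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))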

Rotary Position Embeddings (RoPE)

  • Encodes relative positions by rotating query/key vectors, so attention scores depend on token offsets rather than absolute indices
  • Better extrapolation capabilities to longer sequences
  • More efficient than learned absolute position embeddings (minimal sketch below)
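
A simplified RoPE sketch showing how query/key vectors are rotated as a function of position; the repository's version may precompute and cache the sin/cos tables differently.

import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Swap the two halves of the last dimension and negate the second half.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q: torch.Tensor, k: torch.Tensor, base: float = 10000.0):
    # q, k: (batch, heads, seq_len, head_dim)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim / 2)
    emb = torch.cat((angles, angles), dim=-1)      # (seq_len, head_dim)
    cos, sin = emb.cos(), emb.sin()
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot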

🎯 Training Details

  • Dataset: Falcon RefinedWeb - high-quality web text
  • Training Steps: 5,000+ completed
  • Optimization: AdamW with a cosine annealing schedule (setup sketched below)
  • Hardware: Trained on 4x NVIDIA A100 (80GB) GPUs
  • Features: Mixed precision (FP16), gradient checkpointing, distributed training
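
A sketch of the optimizer and learning-rate schedule setup; the stand-in model, learning rate, weight decay, and step counts are placeholders rather than the published training configuration.

import torch
import torch.nn as nn

model = nn.Linear(256, 256)  # stand-in for the actual VelocityLM module
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Cosine annealing with warm restarts; T_0 is a placeholder value.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000)

for step in range(10):
    loss = model(torch.randn(8, 256)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()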

πŸš€ Usage

Basic Text Generation

# Note: this model requires custom loading code.
# See the GitHub repository for the complete implementation.

from transformers import AutoTokenizer
import torch

# Load the tokenizer (GPT-2 compatible)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a prompt into a tensor of input IDs for the model
input_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
print(input_ids)  # tensor of GPT-2 token IDs, shape (1, sequence_length)

# For complete usage examples and model loading:
# https://github.com/dixisouls/VelocityLM

Interactive Demo

Try the model immediately in our Hugging Face Space - no setup required!

πŸ“Š Performance Features

Generation Strategies

  • Greedy decoding for deterministic output
  • Top-k and top-p (nucleus) sampling
  • Temperature control for creativity adjustment
  • Repetition penalty to reduce repetitive text (a combined sampling sketch follows this list)
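
A sketch of how temperature, top-k, and top-p (nucleus) filtering are typically combined on a single step's logits before sampling; greedy decoding would instead take the argmax, and a repetition penalty would first down-weight previously generated tokens. Parameter names here are illustrative, not the repository's generation API.

import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Filter one step's logits with temperature, top-k, and top-p, then sample."""
    logits = logits / temperature
    if top_k is not None and top_k > 0:
        # Mask everything below the k-th largest logit.
        kth_value = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if top_p is not None and top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        sorted_remove = cumulative > top_p
        sorted_remove[..., 1:] = sorted_remove[..., :-1].clone()
        sorted_remove[..., 0] = False  # always keep the most likely token
        remove = sorted_remove.scatter(-1, sorted_indices, sorted_remove)
        logits = logits.masked_fill(remove, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)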

Memory Optimizations

  • Gradient checkpointing (40% memory reduction; see the sketch below)
  • Efficient causal attention implementation
  • Streaming data processing
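
A sketch of activation checkpointing with torch.utils.checkpoint, which recomputes intermediate activations during the backward pass instead of storing them; the stand-in layers are placeholders for the model's transformer blocks.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layers = nn.ModuleList(nn.Linear(256, 256) for _ in range(4))  # stand-in transformer layers
x = torch.randn(8, 256, requires_grad=True)

# Each layer's activations are recomputed on backward, trading extra compute
# for lower peak memory during training.
for layer in layers:
    x = checkpoint(layer, x, use_reentrant=False)
x.sum().backward()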

πŸ”§ Technical Implementation

This model implements several cutting-edge techniques:

  • Distributed Training: Multi-GPU support with PyTorch DDP
  • Mixed Precision: FP16 training with automatic loss scaling (training-step sketch below)
  • Advanced Scheduling: Cosine annealing with warm restarts
  • Memory Efficiency: Gradient checkpointing and parameter grouping
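
A sketch of an FP16 training step with automatic loss scaling via torch.cuda.amp; the stand-in model and loss are placeholders, DDP wrapping is shown only as a comment, and a CUDA device is assumed.

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# model = torch.nn.parallel.DistributedDataParallel(model)  # multi-GPU training via DDP
model = nn.Linear(256, 256).cuda()  # stand-in model (requires a CUDA device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = GradScaler()

def training_step(batch):
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):    # run the forward pass in FP16
        loss = model(batch).pow(2).mean()  # dummy loss for illustration
    scaler.scale(loss).backward()          # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                 # unscales grads; skips the step on inf/NaN
    scaler.update()                        # adjust the loss scale for the next step
    return loss.detach()

training_step(torch.randn(8, 256).cuda())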

πŸ› οΈ Installation & Setup

For detailed installation instructions, training scripts, and advanced usage:

πŸ‘‰ Visit the GitHub Repository

The repository includes:

  • Complete training pipeline
  • Inference utilities
  • Configuration management
  • Multi-GPU training support
  • Comprehensive documentation

πŸ“ˆ Roadmap

Future enhancements planned:

  • Flash Attention 2.0 integration
  • Extended context length support (4K+)
  • Model quantization for efficient deployment
  • Fine-tuning capabilities for downstream tasks
  • ONNX export for production inference