i3-4096ctx
Model Description
i3-4096ctx is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel Latent Context Compression system. This architecture lets the model efficiently process contexts eight times larger than its 512-token kernel window.
Architecture Overview
The model employs a unique two-tier context processing strategy:
- Base Processing: 512-token kernel window for direct token-level computation
- Extended Context: 4096-token effective context through latent compression
- Hybrid Layers: 12 RWKV layers for efficient sequential processing + 2 attention layers for high-level reasoning
- Model Size: 1,180-dimensional embeddings, ~340M parameters
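The numbers above can be tied together in a small configuration sketch. The class and field names below are illustrative, not the repository's actual API:

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    # Illustrative hyperparameters mirroring the overview above;
    # names are assumptions, not the repository's actual config class.
    embed_dim: int = 1180
    rwkv_layers: int = 12
    attention_layers: int = 2
    attention_heads: int = 8
    kernel_window: int = 512       # tokens processed directly per chunk
    latents_per_chunk: int = 32    # latent tokens each chunk is compressed into
    max_chunks: int = 8            # compressed chunks kept as history
    vocab_size: int = 32_000

    @property
    def effective_context(self) -> int:
        return self.max_chunks * self.kernel_window      # 8 x 512 = 4,096 tokens

    @property
    def max_latent_tokens(self) -> int:
        return self.max_chunks * self.latents_per_chunk  # 8 x 32 = 256 latents

cfg = I3Config()
assert cfg.effective_context == 4096
assert cfg.max_latent_tokens == 256
assert cfg.kernel_window // cfg.latents_per_chunk == 16  # 16:1 compression ratio
```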
Key Innovation: Latent Context Compression
The model's distinguishing feature is its compression mechanism that allows it to "remember" contexts 8× larger than its kernel window:
- Compression Ratio: 512:32 (16:1 compression)
- Max Compressed Chunks: 8 chunks
- Effective Context: 4,096 tokens
- Latent Tokens per Chunk: 32 tokens
How it works:
- Input text is processed in 512-token chunks
- Each chunk is compressed into 32 latent tokens using cross-attention
- Up to 8 compressed chunks (256 latent tokens) are maintained as context
- New chunks attend to both current tokens and compressed history
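A minimal sketch of this chunked loop, assuming hypothetical `compress_chunk` and `model_forward` callables standing in for the model's compression and backbone passes:

```python
import torch

def process_long_input(tokens, model_forward, compress_chunk,
                       kernel_window=512, max_chunks=8):
    """Process a long token sequence in 512-token chunks, keeping at most
    8 compressed chunks (8 x 32 = 256 latent tokens) as rolling context."""
    history = []   # each entry: (32, embed_dim) latent tokens for one past chunk
    outputs = []
    for start in range(0, tokens.size(0), kernel_window):
        chunk = tokens[start:start + kernel_window]
        # The current chunk attends to its own tokens plus the compressed history.
        context = torch.cat(history, dim=0) if history else None
        outputs.append(model_forward(chunk, context))
        # Compress the finished chunk into 32 latent tokens; detach so gradients
        # do not flow back through earlier chunks.
        history.append(compress_chunk(chunk).detach())
        if len(history) > max_chunks:
            history.pop(0)   # drop the oldest chunk beyond the 8-chunk cap
    return outputs
```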
This approach provides several advantages:
- Memory Efficient: Stores 4K tokens in just 256 latent representations
- Computationally Efficient: Avoids quadratic attention over long sequences
- Semantically Rich: Learned compression preserves relevant information
Model Specifications
| Attribute | Value |
|---|---|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |
Performance
Final Training Metrics (Iteration 270):
- Loss: 0.0933
- Perplexity: 1.14
- Training Speed: 202 tokens/second
- Compression: 256 latent tokens active
The model achieved convergence at a perplexity of 1.14, demonstrating strong language modeling capabilities while maintaining efficient context compression.
Architecture Details
Layer Configuration
RWKV Layers (12 layers):
- Linear-time complexity for sequential processing
- Time-mixing and channel-mixing mechanisms
- JIT-optimized parallel implementation
- Efficient for base token processing
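For orientation, here is a minimal sketch of the channel-mixing half referenced above, following the public RWKV-4 formulation (the time-mixing/WKV recurrence is omitted, and this is not the repository's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMix(nn.Module):
    """Simplified RWKV-style channel mixing: token shift + squared-ReLU FFN."""
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.time_mix_k = nn.Parameter(torch.full((1, 1, dim), 0.5))
        self.time_mix_r = nn.Parameter(torch.full((1, 1, dim), 0.5))
        self.key = nn.Linear(dim, dim * hidden_mult, bias=False)
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim * hidden_mult, dim, bias=False)

    def forward(self, x):  # x: (batch, seq, dim)
        # Token shift: blend each position with the previous one.
        x_prev = F.pad(x, (0, 0, 1, -1))
        xk = x * self.time_mix_k + x_prev * (1 - self.time_mix_k)
        xr = x * self.time_mix_r + x_prev * (1 - self.time_mix_r)
        k = torch.square(torch.relu(self.key(xk)))
        r = torch.sigmoid(self.receptance(xr))
        return r * self.value(k)
```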
Attention Layers (2 layers):
- Full multi-head attention with 8 heads
- 4× FFN expansion ratio
- Causal masking for autoregressive generation
- High-level reasoning and long-range dependencies
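A generic sketch of such an attention layer; the pre-norm placement and module names are assumptions (note that `nn.MultiheadAttention` requires the embedding dimension to be divisible by the head count):

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Pre-norm transformer block: causal multi-head attention + 4x FFN."""
    def __init__(self, dim: int, heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_mult),
            nn.GELU(),
            nn.Linear(dim * ffn_mult, dim),
        )

    def forward(self, x):  # x: (batch, seq, dim)
        seq = x.size(1)
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))
```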
Compression Module:
- Learnable latent query vectors (32 per chunk)
- Cross-attention based compression
- Layer normalization and feedforward refinement
- Automatic head count adjustment for dimension compatibility
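A sketch of such a compression module, assuming cross-attention over learnable latent queries (module and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Sketch of cross-attention compression: 32 learnable latent queries
    attend over a 512-token chunk and summarize it into 32 latent tokens."""
    def __init__(self, dim: int, num_latents: int = 32, heads: int = 4):
        super().__init__()
        # heads must divide dim; the card mentions an automatic head-count
        # adjustment for dimension compatibility.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, chunk_hidden):  # (batch, 512, dim) -> (batch, 32, dim)
        batch = chunk_hidden.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(queries, chunk_hidden, chunk_hidden)
        compressed = self.norm(compressed)
        return compressed + self.ffn(compressed)
```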
Training Configuration
- Sequence Length: 512 tokens (aligned with kernel window)
- Batch Size: 4
- Gradient Accumulation: 8 steps
- Learning Rate: 4e-4 (cosine schedule with warmup)
- Compression Warmup: enabled after 100 iterations, with a 50-iteration warmup period
- Optimization: AdamW with gradient clipping, mixed precision training
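A small sketch of the compression warmup schedule implied by these settings; the linear ramp shape and function name are assumptions:

```python
def compression_warmup_factor(iteration: int,
                              start_iter: int = 100,
                              warmup_iters: int = 50) -> float:
    """Scale applied to the compression pathway: off before iteration 100,
    then ramped linearly to full strength over the next 50 iterations."""
    if iteration < start_iter:
        return 0.0
    return min(1.0, (iteration - start_iter) / warmup_iters)

# Effective tokens per optimizer step:
# batch 4 x gradient accumulation 8 x sequence length 512 = 16,384 tokens.
```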
Tokenizer
- Type: Byte-Pair Encoding (BPE)
- Vocabulary Size: 32,000 tokens
- Special Tokens: `<UNK>`, `<PAD>`, `<BOS>`, `<EOS>`, `<|im_start|>`, `<|im_end|>`, `<|system|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`, `<|eot_id|>`, `[INST]`, `[/INST]`
- Pre-tokenizer: ByteLevel encoding
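As a hedged example, an equivalent tokenizer could be configured with the Hugging Face `tokenizers` library roughly as follows (the repository's actual training script may differ):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = [
    "<UNK>", "<PAD>", "<BOS>", "<EOS>",
    "<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>",
    "<|endoftext|>", "<|eot_id|>", "[INST]", "[/INST]",
]

tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=special_tokens)
# tokenizer.train(files=[...], trainer=trainer)  # train on the text corpus
```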
Intended Use
This model is designed for:
- Research into efficient long-context language modeling
- Applications requiring extended context understanding with limited compute
- Exploration of hybrid RWKV-attention architectures
- Investigation of learned compression techniques for language models
Limitations
- Context beyond 4,096 tokens is not accessible even through compression
- Compression is lossy and may not preserve all fine-grained details from distant context
- Generation incurs overhead from maintaining and attending over the compressed history
- Trained primarily on English text (FineWeb-Edu)
Technical Notes
Memory Management:
- Compressed history is detached from the computation graph to prevent backpropagation through time
- Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
- Automatic pruning when history exceeds capacity
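A minimal sketch of this history update, with illustrative names:

```python
from typing import Optional
import torch

def append_and_prune(history: Optional[torch.Tensor],
                     new_latents: torch.Tensor,
                     max_latents: int = 256) -> torch.Tensor:
    """Append a freshly compressed chunk (32 latent tokens) to the history buffer,
    detached from the graph, then prune to at most 256 latent tokens (8 chunks)."""
    new_latents = new_latents.detach()
    history = new_latents if history is None else torch.cat([history, new_latents], dim=0)
    if history.size(0) > max_latents:
        history = history[-max_latents:]  # keep only the most recent 8 chunks
    return history
```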
Inference Behavior:
- During generation, compressed history accumulates progressively
- Each 512-token chunk adds 32 latent tokens to context
- Oldest chunks are dropped once the history exceeds the 4,096-token equivalent (8 chunks)
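A small helper illustrating the arithmetic of this accumulation (names are illustrative):

```python
def active_latent_tokens(tokens_processed: int,
                         kernel_window: int = 512,
                         latents_per_chunk: int = 32,
                         max_chunks: int = 8) -> int:
    """Number of compressed latent tokens in context after a given number of
    processed tokens, capped at 8 chunks (256 latents)."""
    completed_chunks = min(tokens_processed // kernel_window, max_chunks)
    return completed_chunks * latents_per_chunk

assert active_latent_tokens(1024) == 64     # two full chunks compressed
assert active_latent_tokens(10_000) == 256  # capped at 8 chunks
```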
Base Model: i3-lab/i3-500m