i3-4096ctx

Model Description

i3-4096ctx is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel Latent Context Compression system. This architecture enables the model to efficiently process extended contexts far beyond its base kernel window.

Architecture Overview

The model employs a unique two-tier context processing strategy:

  • Base Processing: 512-token kernel window for direct token-level computation
  • Extended Context: 4096-token effective context through latent compression
  • Hybrid Layers: 12 RWKV layers for efficient sequential processing + 2 attention layers for high-level reasoning
  • Model Size: 1,180-dimensional embeddings, ~340M parameters
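
The sizing above can be collected into a small configuration object. The sketch below is illustrative only; the names (`ModelConfig`, `kernel_window`, etc.) are assumptions, not the repository's actual code.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Illustrative configuration mirroring the numbers quoted above."""
    embed_dim: int = 1180          # embedding dimension
    n_rwkv_layers: int = 12        # linear-time sequential layers
    n_attn_layers: int = 2         # full-attention layers
    n_heads: int = 8               # attention heads
    kernel_window: int = 512       # tokens processed directly per chunk
    latents_per_chunk: int = 32    # latent tokens each chunk is compressed into
    max_chunks: int = 8            # compressed chunks retained as history
    vocab_size: int = 32_000       # BPE vocabulary

    @property
    def effective_context(self) -> int:
        return self.kernel_window * self.max_chunks  # 512 * 8 = 4,096 tokens
```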

Key Innovation: Latent Context Compression

The model's distinguishing feature is its compression mechanism that allows it to "remember" contexts 8× larger than its kernel window:

Compression Ratio: 512:32 (16:1 compression)
Max Compressed Chunks: 8 chunks
Effective Context: 4,096 tokens
Latent Tokens per Chunk: 32 tokens
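
A quick consistency check of these figures (plain arithmetic, no project code assumed):

```python
kernel_window, latents_per_chunk, max_chunks = 512, 32, 8

assert kernel_window // latents_per_chunk == 16   # 16:1 compression ratio
assert latents_per_chunk * max_chunks == 256      # latent tokens held in memory
assert kernel_window * max_chunks == 4096         # effective context, 8x the kernel window
```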

How it works:

  1. Input text is processed in 512-token chunks
  2. Each chunk is compressed into 32 latent tokens using cross-attention
  3. Up to 8 compressed chunks (256 latent tokens) are maintained as context
  4. New chunks attend to both current tokens and compressed history
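
The chunk/latent bookkeeping implied by steps 1-4 can be traced with a plain-Python sketch; it illustrates the accounting only, and names such as `history` are hypothetical rather than taken from the model's code.

```python
KERNEL_WINDOW, LATENTS_PER_CHUNK, MAX_CHUNKS = 512, 32, 8

def chunk_sizes(n_tokens: int) -> list[int]:
    """Split a document of n_tokens into 512-token chunks (step 1)."""
    return [min(KERNEL_WINDOW, n_tokens - start) for start in range(0, n_tokens, KERNEL_WINDOW)]

history = []  # each entry stands for one compressed chunk of 32 latent tokens
for size in chunk_sizes(3000):
    # step 4: the current chunk would attend to its own tokens plus `history`
    visible_latents = len(history) * LATENTS_PER_CHUNK
    print(f"chunk of {size:>3} tokens sees {visible_latents:>3} latent tokens of history")
    history.append("latents")          # step 2: compress the chunk and append it
    history = history[-MAX_CHUNKS:]    # step 3: keep at most 8 chunks (256 latents)
```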

This approach provides several advantages:

  • Memory Efficient: Stores 4K tokens in just 256 latent representations
  • Computationally Efficient: Avoids quadratic attention over long sequences
  • Semantically Rich: Learned compression preserves relevant information

Model Specifications

| Attribute | Value |
|---|---|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |

Performance

Final Training Metrics (Iteration 270):

  • Loss: 0.0933
  • Perplexity: 1.14
  • Training Speed: 202 tokens/second
  • Compression: 256 latent tokens active

The model achieved convergence at a perplexity of 1.14, demonstrating strong language modeling capabilities while maintaining efficient context compression.

Architecture Details

Layer Configuration

RWKV Layers (12 layers):

  • Linear-time complexity for sequential processing
  • Time-mixing and channel-mixing mechanisms
  • JIT-optimized parallel implementation
  • Efficient for base token processing
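
As a rough illustration of this layer family, the snippet below sketches a channel-mixing sub-block in the RWKV-v4 style (token shift followed by a squared-ReLU gated MLP). It is a generic sketch under that assumption, not this repository's implementation, and it omits the time-mixing (WKV) recurrence.

```python
import torch
import torch.nn as nn

class RWKVChannelMix(nn.Module):
    """Channel-mixing sub-block in the RWKV-v4 style (illustrative only)."""

    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))   # shift the sequence right by one token
        self.time_mix_k = nn.Parameter(torch.full((1, 1, dim), 0.5))
        self.time_mix_r = nn.Parameter(torch.full((1, 1, dim), 0.5))
        self.key = nn.Linear(dim, dim * hidden_mult, bias=False)
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim * hidden_mult, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        x_prev = self.time_shift(x)                       # token shift: previous position
        k = x * self.time_mix_k + x_prev * (1 - self.time_mix_k)
        r = x * self.time_mix_r + x_prev * (1 - self.time_mix_r)
        k = torch.square(torch.relu(self.key(k)))         # squared-ReLU activation
        return torch.sigmoid(self.receptance(r)) * self.value(k)
```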

Attention Layers (2 layers):

  • Full multi-head attention with 8 heads
  • 4× FFN expansion ratio
  • Causal masking for autoregressive generation
  • High-level reasoning and long-range dependencies
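
A conventional block with those properties, assuming pre-norm residual wiring (the exact wiring used in i3-4096ctx is not specified here), might look like this:

```python
import torch
import torch.nn as nn

class CausalAttentionBlock(nn.Module):
    """Pre-norm causal multi-head attention block with a 4x feed-forward expansion (illustrative)."""

    def __init__(self, dim: int, n_heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        seq = x.size(1)
        # boolean upper-triangular mask: True positions are blocked (causal masking)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))
```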

Compression Module:

  • Learnable latent query vectors (32 per chunk)
  • Cross-attention based compression
  • Layer normalization and feedforward refinement
  • Automatic head count adjustment for dimension compatibility
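
Following that description, a compressor of this shape could be sketched as below. The class and argument names (`LatentCompressor`, `n_latents`) are assumptions, and the head-count fallback is only a guess at what "automatic head count adjustment" means in practice.

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Compress a 512-token chunk into 32 latent tokens via cross-attention (illustrative)."""

    def __init__(self, dim: int, n_latents: int = 32, n_heads: int = 8):
        super().__init__()
        while dim % n_heads != 0:          # fall back to a head count that divides the dimension
            n_heads -= 1
        self.latents = nn.Parameter(torch.randn(1, n_latents, dim) * 0.02)  # learnable latent queries
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:  # chunk: (batch, 512, dim)
        queries = self.latents.expand(chunk.size(0), -1, -1)
        # the latent queries attend over the whole chunk and summarize it
        latents, _ = self.cross_attn(queries, chunk, chunk, need_weights=False)
        latents = self.norm(latents)
        return latents + self.ffn(latents)                   # (batch, 32, dim)
```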

Training Configuration

  • Sequence Length: 512 tokens (aligned with kernel window)
  • Batch Size: 4
  • Gradient Accumulation: 8 steps
  • Learning Rate: 4e-4 (cosine schedule with warmup)
  • Compression Warmup: compression enabled after 100 iterations, then ramped in over a 50-iteration warmup period
  • Optimization: AdamW with gradient clipping, mixed precision training
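
A training loop with these ingredients might be wired up roughly as follows. The hyperparameters match the list above, but `model`, `batches`, and the warmup/total step counts are placeholders rather than the project's actual script.

```python
import math
import torch

def make_cosine_with_warmup(warmup: int, total: int):
    """LR multiplier: linear warmup, then cosine decay (illustrative schedule)."""
    def multiplier(step: int) -> float:
        if step < warmup:
            return step / max(1, warmup)
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return multiplier

def train(model: torch.nn.Module, batches, device: str = "cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, make_cosine_with_warmup(warmup=50, total=2_000)  # placeholder step counts
    )
    scaler = torch.cuda.amp.GradScaler()   # mixed precision
    accum_steps = 8                        # gradient accumulation

    for step, batch in enumerate(batches):  # batch: (4, 512) token ids
        with torch.autocast(device_type=device, dtype=torch.float16):
            loss = model(batch.to(device)) / accum_steps  # model assumed to return the LM loss
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            scheduler.step()
```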

Tokenizer

  • Type: Byte-Pair Encoding (BPE)
  • Vocabulary Size: 32,000 tokens
  • Special Tokens: Includes <UNK>, <PAD>, <BOS>, <EOS>, <|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|>, <|endoftext|>, <|eot_id|>, [INST], [/INST]
  • Pre-tokenizer: ByteLevel encoding
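
A tokenizer of this shape can be built with the Hugging Face `tokenizers` library. The recipe below is a generic BPE setup matching the stated vocabulary and special tokens, not necessarily the exact script used for this model; `corpus.txt` is a placeholder path.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = [
    "<UNK>", "<PAD>", "<BOS>", "<EOS>",
    "<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>",
    "<|endoftext|>", "<|eot_id|>", "[INST]", "[/INST]",
]

tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=special_tokens)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```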

Intended Use

This model is designed for:

  • Research into efficient long-context language modeling
  • Applications requiring extended context understanding with limited compute
  • Exploration of hybrid RWKV-attention architectures
  • Investigation of learned compression techniques for language models

Limitations

  • Context beyond 4,096 tokens is not accessible even through compression
  • Compression is lossy and may not preserve all fine-grained details from distant context
  • Generation speed is affected by the overhead of building and attending to the compressed history
  • Trained primarily on English text (FineWeb-Edu)

Technical Notes

Memory Management:

  • Compressed history is detached from computation graph to prevent backpropagation through time
  • Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
  • Automatic pruning when history exceeds capacity
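
A minimal version of that buffer management, assuming the history is kept as a single tensor of shape (batch, n_latents, dim); the function name is hypothetical:

```python
import torch

MAX_LATENTS = 256  # 8 chunks x 32 latent tokens

def update_history(history: torch.Tensor | None, new_latents: torch.Tensor) -> torch.Tensor:
    """Append freshly compressed latents and prune the oldest beyond capacity (illustrative)."""
    # detach so gradients never flow back through earlier chunks (no BPTT across chunks)
    new_latents = new_latents.detach()
    history = new_latents if history is None else torch.cat([history, new_latents], dim=1)
    # automatic pruning: keep only the most recent 256 latent tokens
    return history[:, -MAX_LATENTS:, :]
```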

Inference Behavior:

  • During generation, compressed history accumulates progressively
  • Each 512-token chunk adds 32 latent tokens to context
  • Oldest chunks are dropped once the history exceeds the 4,096-token equivalent
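
Put together, generation with a rolling compressed history could proceed along the lines below. The `fake_decode_step` and `fake_compress` functions are stand-ins for the real model forward pass and compressor, included only so the chunk accounting is runnable end to end.

```python
import torch

KERNEL_WINDOW, LATENTS_PER_CHUNK, MAX_LATENTS = 512, 32, 256

def fake_decode_step(window, history):           # stand-in for the real model forward
    return int(torch.randint(0, 32_000, (1,)))

def fake_compress(chunk):                        # stand-in for the latent compressor
    return torch.zeros(1, LATENTS_PER_CHUNK, 8)  # toy latent dimension

generated, history = [1], None                   # start from a BOS-like token
for _ in range(1200):
    window = torch.tensor(generated[-KERNEL_WINDOW:])
    generated.append(fake_decode_step(window, history))
    if len(generated) % KERNEL_WINDOW == 0:      # a full 512-token chunk has accumulated
        latents = fake_compress(torch.tensor(generated[-KERNEL_WINDOW:])).detach()
        history = latents if history is None else torch.cat([history, latents], dim=1)
        history = history[:, -MAX_LATENTS:, :]   # drop oldest beyond the 4,096-token equivalent
print(history.shape)                             # (1, 64, 8) after two full chunks
```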