i3-4096ctx

Model Description

i3-4096ctx is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel Latent Context Compression system. This architecture enables the model to efficiently process extended contexts far beyond its base kernel window.

Architecture Overview

The model employs a unique two-tier context processing strategy:

  • Base Processing: 512-token kernel window for direct token-level computation
  • Extended Context: 4096-token effective context through latent compression
  • Hybrid Layers: 12 RWKV layers for efficient sequential processing + 2 attention layers for high-level reasoning
  • Model Size: 1,180-dimensional embeddings, ~340M parameters
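
The sizing above can be collected into a small configuration object. The sketch below is illustrative only; the names (`ModelConfig`, `kernel_window`, etc.) are assumptions, not the repository's actual code.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Illustrative configuration mirroring the numbers quoted above."""
    embed_dim: int = 1180          # embedding dimension
    n_rwkv_layers: int = 12        # linear-time sequential layers
    n_attn_layers: int = 2         # full-attention layers
    n_heads: int = 8               # attention heads
    kernel_window: int = 512       # tokens processed directly per chunk
    latents_per_chunk: int = 32    # latent tokens each chunk is compressed into
    max_chunks: int = 8            # compressed chunks retained as history
    vocab_size: int = 32_000       # BPE vocabulary

    @property
    def effective_context(self) -> int:
        return self.kernel_window * self.max_chunks  # 512 * 8 = 4,096 tokens
```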

Key Innovation: Latent Context Compression

The model's distinguishing feature is its compression mechanism that allows it to "remember" contexts 8× larger than its kernel window:

Compression Ratio: 512:32 (16:1 compression)
Max Compressed Chunks: 8 chunks
Effective Context: 4,096 tokens
Latent Tokens per Chunk: 32 tokens
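
A quick consistency check of these figures (plain arithmetic, no project code assumed):

```python
kernel_window, latents_per_chunk, max_chunks = 512, 32, 8

assert kernel_window // latents_per_chunk == 16   # 16:1 compression ratio
assert latents_per_chunk * max_chunks == 256      # latent tokens held in memory
assert kernel_window * max_chunks == 4096         # effective context, 8x the kernel window
```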

How it works:

  1. Input text is processed in 512-token chunks
  2. Each chunk is compressed into 32 latent tokens using cross-attention
  3. Up to 8 compressed chunks (256 latent tokens) are maintained as context
  4. New chunks attend to both current tokens and compressed history
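
The chunk/latent bookkeeping implied by steps 1-4 can be traced with a plain-Python sketch; it illustrates the accounting only, and names such as `history` are hypothetical rather than taken from the model's code.

```python
KERNEL_WINDOW, LATENTS_PER_CHUNK, MAX_CHUNKS = 512, 32, 8

def chunk_sizes(n_tokens: int) -> list[int]:
    """Split a document of n_tokens into 512-token chunks (step 1)."""
    return [min(KERNEL_WINDOW, n_tokens - start) for start in range(0, n_tokens, KERNEL_WINDOW)]

history = []  # each entry stands for one compressed chunk of 32 latent tokens
for size in chunk_sizes(3000):
    # step 4: the current chunk would attend to its own tokens plus `history`
    visible_latents = len(history) * LATENTS_PER_CHUNK
    print(f"chunk of {size:>3} tokens sees {visible_latents:>3} latent tokens of history")
    history.append("latents")          # step 2: compress the chunk and append it
    history = history[-MAX_CHUNKS:]    # step 3: keep at most 8 chunks (256 latents)
```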

This approach provides several advantages:

  • Memory Efficient: Stores 4K tokens in just 256 latent representations
  • Computationally Efficient: Avoids quadratic attention over long sequences
  • Semantically Rich: Learned compression preserves relevant information

Model Specifications

| Attribute | Value |
|---|---|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |

Performance

Final Training Metrics (Iteration 270):

  • Loss: 0.0933
  • Perplexity: 1.14
  • Training Speed: 202 tokens/second
  • Compression: 256 latent tokens active

The model achieved convergence at a perplexity of 1.14, demonstrating strong language modeling capabilities while maintaining efficient context compression.

Architecture Details

Layer Configuration

RWKV Layers (12 layers):

  • Linear-time complexity for sequential processing
  • Time-mixing and channel-mixing mechanisms
  • JIT-optimized parallel implementation
  • Efficient for base token processing
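
As a rough illustration of this layer family, the snippet below sketches a channel-mixing sub-block in the RWKV-v4 style (token shift followed by a squared-ReLU gated MLP). It is a generic sketch under that assumption, not this repository's implementation, and it omits the time-mixing (WKV) recurrence.

```python
import torch
import torch.nn as nn

class RWKVChannelMix(nn.Module):
    """Channel-mixing sub-block in the RWKV-v4 style (illustrative only)."""

    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))   # shift the sequence right by one token
        self.time_mix_k = nn.Parameter(torch.full((1, 1, dim), 0.5))
        self.time_mix_r = nn.Parameter(torch.full((1, 1, dim), 0.5))
        self.key = nn.Linear(dim, dim * hidden_mult, bias=False)
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim * hidden_mult, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        x_prev = self.time_shift(x)                       # token shift: previous position
        k = x * self.time_mix_k + x_prev * (1 - self.time_mix_k)
        r = x * self.time_mix_r + x_prev * (1 - self.time_mix_r)
        k = torch.square(torch.relu(self.key(k)))         # squared-ReLU activation
        return torch.sigmoid(self.receptance(r)) * self.value(k)
```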

Attention Layers (2 layers):

  • Full multi-head attention with 8 heads
  • 4× FFN expansion ratio
  • Causal masking for autoregressive generation
  • High-level reasoning and long-range dependencies
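
A conventional block with those properties, assuming pre-norm residual wiring (the exact wiring used in i3-4096ctx is not specified here), might look like this:

```python
import torch
import torch.nn as nn

class CausalAttentionBlock(nn.Module):
    """Pre-norm causal multi-head attention block with a 4x feed-forward expansion (illustrative)."""

    def __init__(self, dim: int, n_heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        seq = x.size(1)
        # boolean upper-triangular mask: True positions are blocked (causal masking)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))
```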

Compression Module:

  • Learnable latent query vectors (32 per chunk)
  • Cross-attention based compression
  • Layer normalization and feedforward refinement
  • Automatic head count adjustment for dimension compatibility
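
Following that description, a compressor of this shape could be sketched as below. The class and argument names (`LatentCompressor`, `n_latents`) are assumptions, and the head-count fallback is only a guess at what "automatic head count adjustment" means in practice.

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Compress a 512-token chunk into 32 latent tokens via cross-attention (illustrative)."""

    def __init__(self, dim: int, n_latents: int = 32, n_heads: int = 8):
        super().__init__()
        while dim % n_heads != 0:          # fall back to a head count that divides the dimension
            n_heads -= 1
        self.latents = nn.Parameter(torch.randn(1, n_latents, dim) * 0.02)  # learnable latent queries
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:  # chunk: (batch, 512, dim)
        queries = self.latents.expand(chunk.size(0), -1, -1)
        # the latent queries attend over the whole chunk and summarize it
        latents, _ = self.cross_attn(queries, chunk, chunk, need_weights=False)
        latents = self.norm(latents)
        return latents + self.ffn(latents)                   # (batch, 32, dim)
```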

Training Configuration

  • Sequence Length: 512 tokens (aligned with kernel window)
  • Batch Size: 4
  • Gradient Accumulation: 8 steps
  • Learning Rate: 4e-4 (cosine schedule with warmup)
  • Compression Warmup: compression enabled after 100 iterations, then ramped in over a 50-iteration warmup period
  • Optimization: AdamW with gradient clipping, mixed precision training
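
A training loop with these ingredients might be wired up roughly as follows. The hyperparameters match the list above, but `model`, `batches`, and the warmup/total step counts are placeholders rather than the project's actual script.

```python
import math
import torch

def make_cosine_with_warmup(warmup: int, total: int):
    """LR multiplier: linear warmup, then cosine decay (illustrative schedule)."""
    def multiplier(step: int) -> float:
        if step < warmup:
            return step / max(1, warmup)
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return multiplier

def train(model: torch.nn.Module, batches, device: str = "cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, make_cosine_with_warmup(warmup=50, total=2_000)  # placeholder step counts
    )
    scaler = torch.cuda.amp.GradScaler()   # mixed precision
    accum_steps = 8                        # gradient accumulation

    for step, batch in enumerate(batches):  # batch: (4, 512) token ids
        with torch.autocast(device_type=device, dtype=torch.float16):
            loss = model(batch.to(device)) / accum_steps  # model assumed to return the LM loss
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            scheduler.step()
```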

Tokenizer

  • Type: Byte-Pair Encoding (BPE)
  • Vocabulary Size: 32,000 tokens
  • Special Tokens: Includes <UNK>, <PAD>, <BOS>, <EOS>, <|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|>, <|endoftext|>, <|eot_id|>, [INST], [/INST]
  • Pre-tokenizer: ByteLevel encoding
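
A tokenizer of this shape can be built with the Hugging Face `tokenizers` library. The recipe below is a generic BPE setup matching the stated vocabulary and special tokens, not necessarily the exact script used for this model; `corpus.txt` is a placeholder path.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = [
    "<UNK>", "<PAD>", "<BOS>", "<EOS>",
    "<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>",
    "<|endoftext|>", "<|eot_id|>", "[INST]", "[/INST]",
]

tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=special_tokens)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```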

Intended Use

This model is designed for:

  • Research into efficient long-context language modeling
  • Applications requiring extended context understanding with limited compute
  • Exploration of hybrid RWKV-attention architectures
  • Investigation of learned compression techniques for language models

Limitations

  • Context beyond 4,096 tokens is not accessible even through compression
  • Compression is lossy and may not preserve all fine-grained details from distant context
  • Generation speed is affected by the overhead of building and attending to the compressed history
  • Trained primarily on English text (FineWeb-Edu)

Technical Notes

Memory Management:

  • Compressed history is detached from computation graph to prevent backpropagation through time
  • Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
  • Automatic pruning when history exceeds capacity
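
A minimal version of that buffer management, assuming the history is kept as a single tensor of shape (batch, n_latents, dim); the function name is hypothetical:

```python
import torch

MAX_LATENTS = 256  # 8 chunks x 32 latent tokens

def update_history(history: torch.Tensor | None, new_latents: torch.Tensor) -> torch.Tensor:
    """Append freshly compressed latents and prune the oldest beyond capacity (illustrative)."""
    # detach so gradients never flow back through earlier chunks (no BPTT across chunks)
    new_latents = new_latents.detach()
    history = new_latents if history is None else torch.cat([history, new_latents], dim=1)
    # automatic pruning: keep only the most recent 256 latent tokens
    return history[:, -MAX_LATENTS:, :]
```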

Inference Behavior:

  • During generation, compressed history accumulates progressively
  • Each 512-token chunk adds 32 latent tokens to context
  • Oldest chunks are dropped once the history exceeds the 4,096-token equivalent
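
Put together, generation with a rolling compressed history could proceed along the lines below. The `fake_decode_step` and `fake_compress` functions are stand-ins for the real model forward pass and compressor, included only so the chunk accounting is runnable end to end.

```python
import torch

KERNEL_WINDOW, LATENTS_PER_CHUNK, MAX_LATENTS = 512, 32, 256

def fake_decode_step(window, history):           # stand-in for the real model forward
    return int(torch.randint(0, 32_000, (1,)))

def fake_compress(chunk):                        # stand-in for the latent compressor
    return torch.zeros(1, LATENTS_PER_CHUNK, 8)  # toy latent dimension

generated, history = [1], None                   # start from a BOS-like token
for _ in range(1200):
    window = torch.tensor(generated[-KERNEL_WINDOW:])
    generated.append(fake_decode_step(window, history))
    if len(generated) % KERNEL_WINDOW == 0:      # a full 512-token chunk has accumulated
        latents = fake_compress(torch.tensor(generated[-KERNEL_WINDOW:])).detach()
        history = latents if history is None else torch.cat([history, latents], dim=1)
        history = history[:, -MAX_LATENTS:, :]   # drop oldest beyond the 4,096-token equivalent
print(history.shape)                             # (1, 64, 8) after two full chunks
```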