English ↔ Egyptian Arabic Transformer

This project develops a translation system specifically for the Egyptian dialect (Masri), built entirely from scratch. Rather than fine-tuning an existing model, it implements a custom Encoder-Decoder architecture with stability improvements—like replacing LayerNorm with RMSNorm—and a BPE tokenizer designed to handle irregular spelling and slang. The system produces a robust model that reaches a BLEU score of 28.5, showing that reliable translation is achievable even without huge standard datasets.

Data Collection

Egyptian Arabic Sources

  • Reddit Scraping: Data was collected from 30+ Egyptian subreddits (e.g., r/Egypt, r/Cairo, r/AlexandriaEgy) using a multi-threaded script to stay within Reddit API rate limits.
    To avoid duplication, a shared SQLite database ensured that no two workers scraped the same post or comment (a sketch of this guard follows the list). This yielded approximately 250,000 posts and comments.
  • YouTube Transcription: Over 1,700 hours of Egyptian videos were processed, extracting VTT subtitles to capture the real spoken dialect rather than just formal text.
  • Generated Parallel Data: Gemini was used with tailored prompt engineering to create parallel sentences for tricky slang and dialect-specific words, boosting the total count to 700k pairs.
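
The deduplication guard mentioned above can be sketched roughly as follows. This is a minimal illustration of the SQLite idea only; the database file, table name, and claim_id() helper are hypothetical, not taken from the project's scraper.

import sqlite3
import threading

_db_lock = threading.Lock()
conn = sqlite3.connect("seen_items.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")
conn.commit()

def claim_id(item_id: str) -> bool:
    """Return True only for the first worker that claims this post/comment ID."""
    with _db_lock:
        cur = conn.execute("INSERT OR IGNORE INTO seen (id) VALUES (?)", (item_id,))
        conn.commit()
        return cur.rowcount == 1  # 0 means another worker already claimed it

Each worker calls claim_id() with the Reddit post or comment ID before downloading it, so the PRIMARY KEY constraint, rather than the workers themselves, decides who owns an item.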

English Sources

Subsets of standard English datasets (C4, FineWeb, OpenSubtitles) were selected to cover a wide range of topics and writing styles.

Preprocessing & Cleaning

Data Sanitation & Filtering

This pipeline employs rigorous cleaning steps to ensure high-quality input (a minimal filtering sketch follows the list):

  • HTML & Entity Decoding: Decodes HTML entities and splits bilingual text lines into separate lists.
  • Delimiter Standardization: Standardizes various dataset delimiters (e.g., pipes |, tabs \t) into a unified format for consistent parsing.
  • Noise Removal: Removes sentences with excessive emojis, spam repetition, or mostly numeric content (e.g., 😂😂😂, 12345).
  • Alignment Quality: Calculates strict length ratios between source and target sentences to remove bad alignments (e.g., rejecting pairs where English length > 3x Arabic length).
  • Artifact Removal: Removes subtitle artifacts (e.g., [music]) and promotional phrases (e.g., اشترك في القناة).
  • Deduplication: Merges cleaned files, removes exact duplicates, and downsamples frequent duplicates to prevent overfitting.
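
As a rough illustration of the filtering logic, the sketch below implements two of the rules above: the mostly-numeric check and the 3x length-ratio cutoff. The 50% digit threshold and the word-level length measure are illustrative assumptions; only the 3x ratio comes from the description above.

def mostly_numeric(text: str, threshold: float = 0.5) -> bool:
    """Flag lines whose characters are mostly digits (e.g., '12345')."""
    digits = sum(ch.isdigit() for ch in text)
    return len(text) > 0 and digits / len(text) > threshold

def bad_alignment(en: str, ar: str) -> bool:
    """Reject pairs where the English side is more than 3x the Arabic length."""
    return len(en.split()) > 3 * max(len(ar.split()), 1)

def keep_pair(en: str, ar: str) -> bool:
    if mostly_numeric(en) or mostly_numeric(ar):
        return False
    if bad_alignment(en, ar):
        return False
    return True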

Linguistic Engineering

  • Normalization: Converted Eastern Arabic numerals (٠-٩) to Western (0-9).
  • Dialect Mapping: Created a dictionary of 2,000+ entries to correct common typos and map Modern Standard Arabic (MSA) to Egyptian (a tiny excerpt is sketched after this list).
    • Example: لماذا → ليه
    • Example: كيف → إزاي
    • Example: أريد → عايز
    • Example (Typo): مصطفي → مصطفى
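
A toy excerpt of such a mapping, using only the examples listed above (the real table holds 2,000+ entries, and the naive substring replacement here ignores word boundaries):

DIALECT_MAP = {
    "لماذا": "ليه",     # "why"
    "كيف": "إزاي",      # "how"
    "أريد": "عايز",     # "I want"
    "مصطفي": "مصطفى",   # common typo fix
}

def map_dialect(text: str) -> str:
    for msa, masri in DIALECT_MAP.items():
        text = text.replace(msa, masri)
    return text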

Model Architecture

The model uses a custom Encoder-Decoder (BART configuration) built from scratch with the following specifications (summarized as a BartConfig sketch at the end of this section):

Core Specs:

  • Architecture: Transformer Encoder-Decoder (BART)
  • Parameters: ~98.4 Million
  • Embedding Dimension ($d_{model}$): 384
  • Layers: 8 Encoder / 8 Decoder
  • Attention Heads: 12 (Encoder) / 12 (Decoder)
  • Feed-Forward Dimension: 1152
  • Activation Function: GELU

Positional & Token Embeddings:

  • Max Position Embeddings: 1024 (Learned)
  • Scale Embedding: True
  • Vocab Size: 90,000 (Custom BPE)

Stability Patches:

  • Normalization: Root Mean Square Normalization (RMSNorm) replaces standard LayerNorm.
  • Embedding Tying: Encoder, Decoder, and LM Head share weights to prevent "lobotomy" during fine-tuning.
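
Taken together, these specs map roughly onto Hugging Face's standard BartConfig as in the sketch below. Field names follow the transformers API; the project's actual config.json may differ in detail, and the RMSNorm swap is a runtime patch rather than a config option.

from transformers import BartConfig

config = BartConfig(
    vocab_size=90_000,              # custom BPE vocabulary
    d_model=384,                    # embedding dimension
    encoder_layers=8,
    decoder_layers=8,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=1152,
    decoder_ffn_dim=1152,
    activation_function="gelu",
    max_position_embeddings=1024,   # learned positions
    scale_embedding=True,
    tie_word_embeddings=True,       # tie input embeddings with the LM head
)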

Resources

Key Engineering Challenges & Solutions

This project addresses critical failure points common in training dialectal models:

1. The "RMSNorm" Stability Fix

  • Problem: During pre-training with mixed-precision (FP16), the standard Transformer LayerNorm caused gradient explosions, leading to NaN losses.
  • Solution: A dynamic architecture patch was implemented to recursively replace all LayerNorm layers with RMSNorm (Root Mean Square Normalization) at runtime. This removes the mean-centering operation, stabilizing the gradients on T4 GPUs without sacrificing performance.
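
Concretely, where LayerNorm computes $(x - \mu)/\sigma$ per feature vector, RMSNorm drops the mean-centering term and rescales by the root mean square alone, which matches the RMSNorm class in the inference code below:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^{2} + \epsilon}} \odot g$$

where $d$ is the feature dimension and $g$ is a learned per-dimension scale.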

2. Morphological Pre-Tokenization

  • Problem: Arabic attaches clitic prefixes such as the definite article and prepositions (e.g., ال, لل, وال) to words. A standard BPE tokenizer therefore treats alkitab ("the book") and kitab ("book") as entirely different tokens, bloating the vocabulary and increasing sparsity.
  • Solution: A pre-tokenization rule was applied to separate these sticky prefixes before BPE training (a minimal sketch follows this list).
    • Input: الكتاب
    • Pre-tokenized: ال كتاب
    • Result: The model learns the root word effectively.
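
A minimal sketch of this splitting step is shown below. The prefix list mirrors the one used by fix_arabic() in the inference code further down; the project's actual pre-tokenization rule may cover more cases.

import re

PREFIXES = ("وال", "بال", "لل", "ال")  # try longer prefixes first

def split_prefixes(text: str) -> str:
    pattern = r"\b(" + "|".join(PREFIXES) + r")(?=\S)"
    return re.sub(pattern, r"\1 ", text)

print(split_prefixes("الكتاب"))  # -> "ال كتاب"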

3. The "Lobotomy" Embedding Fix

  • Problem: During fine-tuning, reloading weights via safetensors can sometimes untie the input/output embeddings, causing the model to lose language association ("lobotomy").
  • Solution: The fine-tuning script explicitly forces weight sharing between the Encoder, Decoder, and LM Head (model.shared.weight) during the loading state, ensuring semantic alignment.
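
A rough sketch of that re-tying step, assuming the standard BartForConditionalGeneration module layout (model.model.shared, the encoder/decoder embed_tokens, and lm_head); adapt the attribute paths if the project's module names differ:

def retie_embeddings(model):
    """Point encoder, decoder, and LM head at the single shared embedding table."""
    shared = model.model.shared
    model.model.encoder.embed_tokens = shared
    model.model.decoder.embed_tokens = shared
    model.lm_head.weight = shared.weight
    return model

Calling a helper like this right after load_state_dict() restores the tie if checkpoint loading broke it.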

4. Data Imbalance (Weighted Sampling)

  • Problem: High-quality transcript data is scarce compared to noisier web-scraped data.
  • Solution: The dataset was split into quality tiers (S-tier transcripts, A-tier social media) and weighted sampling was applied during training to prioritize high-quality conversational data while maintaining stylistic diversity.
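
One way to express that weighting is PyTorch's WeightedRandomSampler, as sketched below; the tier labels and the 3:1 weight ratio are illustrative assumptions, not the project's exact values.

from torch.utils.data import DataLoader, WeightedRandomSampler

TIER_WEIGHTS = {"S": 3.0, "A": 1.0}  # e.g. draw S-tier transcripts 3x as often

def make_loader(dataset, tiers, batch_size=32):
    """tiers holds one tier label ('S' or 'A') per example in dataset."""
    weights = [TIER_WEIGHTS[t] for t in tiers]
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)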

Performance

| Metric          | Score | Note               |
|-----------------|-------|--------------------|
| BLEU Score      | 28.5  | View Kaggle Logs   |
| Validation Loss | 2.18  | Stable convergence |
| Training Loss   | 2.33  |                    |

Translation Examples

| Type | English Input | Model Output (Masri) | Notes |
|------|---------------|----------------------|-------|
| Good | "Get in the car, we have to go now!" | "ادخل العربية، لازم نمشي دلوقتي!" | Captures urgency and dialect terms. |
| Good | "I have a very bad feeling about this." | "عندي إحساس وحش أوي بخصوص الموضوع ده." | Natural phrasing. |
| Good | "Why are you doing this?" | "انت بتعمل كده ليه؟" | Correct question structure. |
| Bad  | "The mitochondria is the powerhouse of the cell." | "الأرياريا هو كتلة الخلايا الجذعية" | Limitation: struggles with scientific terms. |
| Bad  | "Complex philosophical prose with archaic terms." | "الأفكار الفلسفية بالمصطلحات القديمة" | Limitation: acceptable but not optimized for complex phrasing. |

Usage

This model requires a specific patching procedure to load the custom RMSNorm architecture and ensure the correct special tokens are used.

Requirements

torch>=2.0.0, transformers>=4.30.0

Inference Code

import torch
import torch.nn as nn
import re
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# 1. Define Architecture Components
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x / rms * self.scale

def load_patched_model(repo_id):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    # Load config; override special-token IDs here if config.json needs corrections
    config = AutoConfig.from_pretrained(repo_id)

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_config(config)
    
    # Patch LayerNorm -> RMSNorm recursively across the whole model
    def patch_recursive(m):
        for n, c in list(m.named_children()):
            if isinstance(c, nn.LayerNorm):
                dim = c.normalized_shape[0] if isinstance(c.normalized_shape, (tuple, list)) else c.normalized_shape
                setattr(m, n, RMSNorm(dim))
            else:
                patch_recursive(c)
    patch_recursive(model)

    # Load Weights
    try:
        f = hf_hub_download(repo_id, "model.safetensors")
        model.load_state_dict(load_file(f), strict=False)
    except Exception:
        # Fall back to the legacy PyTorch checkpoint if no safetensors file is available
        f = hf_hub_download(repo_id, "pytorch_model.bin")
        model.load_state_dict(torch.load(f, map_location="cpu"), strict=False)
        
    return model.to(device).eval(), tokenizer

def fix_arabic(text):
    if not text: return text
    # Re-connect prefixes and fix punctuation
    text = re.sub(r'(^|\s)(ال|لل|وال|بال)\s+(?=\S)', r'\1\2', text)
    text = re.sub(r'\s+([،؟!.,])', r'\1', text)
    return text.strip()

# 2. Run Inference
REPO_NAME = "Shams03/EgyLated" 
model, tokenizer = load_patched_model(REPO_NAME)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    if "token_type_ids" in inputs: del inputs["token_type_ids"]
    
    with torch.no_grad():
        out = model.generate(
            **inputs, 
            max_new_tokens=128, 
            num_beams=5, 
            early_stopping=True,
        )
    
    raw = tokenizer.decode(out[0], skip_special_tokens=True)
    return fix_arabic(raw)

print(translate("I am really happy because the model works."))
# Output: "أنا مبسوط جدا عشان الموديل شغال"

Limitations & Warnings

  • Scientific & Complex Text: The model struggles with scientific terminology and complex phrasing, often producing literal or inaccurate translations.
  • Names: Personal names may be mistranslated or inconsistently handled.
  • Content Warning: Due to the unfiltered nature of the training data (social media), the model may produce offensive or inappropriate language.
  • Planned Improvements (V2): The next version of the model is planned to be trained on a dataset of over 1M parallel sentences. This will specifically target the current model's weaknesses, improving performance on complex text, handling of names, and overall translation quality.

License

This project is licensed under the MIT License.
