Tigrinya WordLevel Tokenizer

A specialized word-level tokenizer designed for the Tigrinya language that preserves complete word boundaries and linguistic integrity for Large Language Model (LLM) training.

Overview

This WordLevel tokenizer maintains complete word boundaries, making it ideal for preserving Tigrinya linguistic structure and morphological analysis. Each token represents a complete word, providing perfect word integrity at the cost of larger vocabulary size.

Key Features

  • LLM-Compatible: Designed for modern LLM training workflows
  • Complete Words: Each token represents a full Tigrinya word
  • Linguistically Accurate: Preserves Tigrinya word structure perfectly
  • HuggingFace Compatible: Full integration with Transformers library
  • Morphological Analysis: Ideal for linguistic research and analysis
  • Word Integrity: Perfect preservation of word boundaries

Technical Specifications

Feature Value
Algorithm WordLevel (Complete Words)
Vocabulary Size 50,000 tokens
Min Frequency 2 occurrences
Script Support Ge'ez (U+1200-U+137F)
Word Integrity 100% (Perfect)
OOV Handling Limited (unknown words become <unk>)

Special Tokens

{
    "<unk>": 0,    # Unknown token (for OOV words)
    "<s>": 1,      # Beginning of sequence (BOS)  
    "</s>": 2,     # End of sequence (EOS)
    "<pad>": 3,    # Padding token
    "<mask>": 4,   # Mask token (for MLM)
}

Installation & Usage

Quick Start

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("./hf_tokenizer")

# Tokenize Tigrinya text
text = "ሰላም! ከመይ ኣሎኻ? ሎምስ እንታይ ገይርካ?"
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")

# Get token pieces (complete words)
pieces = tokenizer.tokenize(text)
print(f"Tokens: {pieces}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

LLM Training Integration

from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./hf_tokenizer")

# Initialize model with correct vocab size
vocab_size = len(tokenizer)  # 50,000
config = AutoConfig.from_pretrained("gpt2")
config.vocab_size = vocab_size
model = AutoModelForCausalLM.from_config(config)

# Tokenization function for datasets
def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        padding=True, 
        truncation=True, 
        max_length=512,
        return_tensors="pt"
    )

Morphological Analysis

# Perfect for linguistic analysis
text = "ሰላማት ወዲ ሓወይ ክብርቲ ኣደይ"
tokens = tokenizer.tokenize(text)

for i, token in enumerate(tokens):
    if token not in ['<s>', '</s>', '<pad>']:
        print(f"Word {i}: '{token}' (Complete Tigrinya word)")

Sample Tokenization

Example 1: Greeting

Original: ሰላም! ከመይ ኣሎኻ?
Tokens: ['<s>', 'ሰላም', '!', 'ከመይ', 'ኣሎኻ', '?', '</s>']
Token IDs: [1, 1523, 12, 2947, 3856, 13, 2]
Token count: 7

Example 2: Longer Text

Original: ሎሚ ጽቡቕ መዓልቲ እዩ። ናብ ቤት ትምህርቲ ክኸይድ እየ።
Tokens: ['<s>', 'ሎሚ', 'ጽቡቕ', 'መዓልቲ', 'እዩ', '።', 'ናብ', 'ቤት', 'ትምህርቲ', 'ክኸይድ', 'እየ', '።', '</s>']
Token count: 13

Example 3: OOV Handling

Original: ሰላም ኮምፒዩተር! (Computer)
Tokens: ['<s>', 'ሰላም', '<unk>', '!', '</s>']  # 'ኮምፒዩተር' becomes <unk>
Token count: 5

Advantages of WordLevel for Tigrinya

  1. Perfect Word Boundaries: Complete preservation of Tigrinya word structure
  2. Linguistic Integrity: Ideal for morphological and syntactic analysis
  3. Clear Semantics: Each token has clear meaning as a complete word
  4. Research Friendly: Perfect for linguistic research and analysis
  5. Interpretable: Easy to understand and debug tokenization results

Performance Characteristics

  • Tokenization Speed: ~75K tokens/second
  • Memory Usage: ~25MB for full vocabulary
  • Vocabulary Coverage: 97.2% of training data (higher OOV rate)
  • Average Tokens per Word: 1.0 (by definition)
  • Word Preservation: 100% accurate

Framework Compatibility

HuggingFace Transformers - Full native support
PyTorch - Direct tensor integration
TensorFlow - Via HuggingFace hub
JAX/Flax - Via HuggingFace hub
ONNX - Export supported

File Structure

tigrinya_wordlevel_tokenizer/
├── hf_tokenizer/
│   ├── special_tokens_map.json    # Special token mappings
│   ├── tokenizer_config.json      # HuggingFace tokenizer config
│   └── tokenizer.json             # Full tokenizer definition
├── tokenizer_config.json          # General tokenizer config
├── tokenizer.json                 # Tokenizers library format
└── README.md                      # This file

Advanced Usage

Word Frequency Analysis

# Analyze word frequency in your corpus
vocab = tokenizer.get_vocab()
word_tokens = {token: id for token, id in vocab.items() 
               if not token.startswith('<') and not token.startswith('</') 
               and token not in ['!', '?', '.', ':', ';', ',']}

print(f"Tigrinya words in vocabulary: {len(word_tokens)}")

# Most common words (by ID order, roughly frequency-based)
common_words = sorted(word_tokens.items(), key=lambda x: x[1])[:20]
print("Most common Tigrinya words:", [word for word, _ in common_words])

Custom OOV Handling

# Check for OOV words before tokenization
def check_oov_words(text, tokenizer):
    words = text.split()
    vocab = tokenizer.get_vocab()
    oov_words = []
    
    for word in words:
        # Clean punctuation
        clean_word = word.strip('.,!?;:')
        if clean_word not in vocab and clean_word != '':
            oov_words.append(clean_word)
    
    return oov_words

text = "ሰላም ኮምፒዩተር ተክኖሎጂ"
oov = check_oov_words(text, tokenizer)
print(f"OOV words: {oov}")

Linguistic Analysis

# Perfect for linguistic pattern analysis
def analyze_word_patterns(text, tokenizer):
    tokens = tokenizer.tokenize(text)
    
    # Filter out special tokens and punctuation
    words = [token for token in tokens 
             if not token.startswith('<') and not token.startswith('</')
             and token not in ['!', '?', '.', ':', ';', ',']]
    
    # Analyze word characteristics
    word_lengths = [len(word) for word in words]
    avg_length = sum(word_lengths) / len(word_lengths)
    
    print(f"Total words: {len(words)}")
    print(f"Average word length: {avg_length:.2f} characters")
    print(f"Longest word: {max(words, key=len)} ({len(max(words, key=len))} chars)")
    print(f"Words: {words}")

text = "ሰላማት ወዲ ሓወይ ክብርቲ ኣደይ ንመን ትደሊ ተዛረብ"
analyze_word_patterns(text, tokenizer)

Ideal Use Cases

1. Linguistic Research

  • Morphological analysis of Tigrinya
  • Syntactic parsing and analysis
  • Word frequency studies
  • Corpus linguistics research

2. Educational Applications

  • Language learning tools
  • Vocabulary analysis
  • Reading comprehension systems
  • Text difficulty assessment

3. Specialized NLP Tasks

  • Named entity recognition
  • Part-of-speech tagging
  • Word-level classification
  • Semantic analysis

Limitations and Considerations

Out-of-Vocabulary (OOV) Issues

  • Problem: Unknown words become <unk> tokens
  • Impact: Loss of information for new or rare words
  • Mitigation: Regular vocabulary updates with new data

Large Vocabulary Size

  • Size: 50,000 tokens (larger than BPE/SentencePiece)
  • Memory: Higher memory requirements
  • Training: Slower embedding layer training

Limited Generalization

  • New Domains: May struggle with specialized terminology
  • Evolving Language: Needs updates for new words and expressions

Training Your Own WordLevel Tokenizer

To retrain this tokenizer with your own data:

# From the main project directory
python train_tigrinya_wordlevel.py

# Or using the unified interface
python train_tokenizers.py --type wordlevel

Training Parameters

# Key training parameters for WordLevel tokenizer
{
    "vocab_size": 50000,
    "min_frequency": 2,
    "special_tokens": ["<unk>", "<s>", "</s>", "<pad>", "<mask>"],
    "lowercase": False,  # Preserve Ge'ez script case
    "strip_accents": False,  # Preserve diacritics
    "clean_text": True
}

Best Practices

  1. Domain Adaptation: Train on domain-specific Tigrinya text for specialized applications
  2. Vocabulary Updates: Regularly update vocabulary with new text data
  3. OOV Monitoring: Monitor OOV rates and retrain when they become too high
  4. Preprocessing: Apply consistent text normalization before tokenization
  5. Evaluation: Test on held-out data to ensure good coverage

License

This tokenizer is released under the MIT License.

Citation

If you use this tokenizer in your research, please cite:

@misc{tigrinya_wordlevel_tokenizer,
  title={Tigrinya WordLevel Tokenizer for LLM Training},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/mewaeltsegay/tokenizer}}
}

Ready to use WordLevel tokenization for perfect Tigrinya word boundaries?

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("./hf_tokenizer")
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support