Tigrinya WordLevel Tokenizer

A specialized word-level tokenizer designed for the Tigrinya language that preserves complete word boundaries and linguistic integrity for Large Language Model (LLM) training.

Overview

This WordLevel tokenizer maintains complete word boundaries, making it ideal for preserving Tigrinya linguistic structure and morphological analysis. Each token represents a complete word, providing perfect word integrity at the cost of larger vocabulary size.

Key Features

LLM-Compatible: Designed for modern LLM training workflows
Complete Words: Each token represents a full Tigrinya word
Linguistically Accurate: Preserves Tigrinya word structure perfectly
HuggingFace Compatible: Full integration with Transformers library
Morphological Analysis: Ideal for linguistic research and analysis
Word Integrity: Perfect preservation of word boundaries

Technical Specifications

Feature	Value
Algorithm	WordLevel (Complete Words)
Vocabulary Size	50,000 tokens
Min Frequency	2 occurrences
Script Support	Ge'ez (U+1200-U+137F)
Word Integrity	100% (Perfect)
OOV Handling	Limited (unknown words become `<unk>`)

Special Tokens

{
    "<unk>": 0,    # Unknown token (for OOV words)
    "<s>": 1,      # Beginning of sequence (BOS)  
    "</s>": 2,     # End of sequence (EOS)
    "<pad>": 3,    # Padding token
    "<mask>": 4,   # Mask token (for MLM)
}

Installation & Usage

Quick Start

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("./hf_tokenizer")

# Tokenize Tigrinya text
text = "ሰላም! ከመይ ኣሎኻ? ሎምስ እንታይ ገይርካ?"
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")

# Get token pieces (complete words)
pieces = tokenizer.tokenize(text)
print(f"Tokens: {pieces}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

LLM Training Integration

from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./hf_tokenizer")

# Initialize model with correct vocab size
vocab_size = len(tokenizer)  # 50,000
config = AutoConfig.from_pretrained("gpt2")
config.vocab_size = vocab_size
model = AutoModelForCausalLM.from_config(config)

# Tokenization function for datasets
def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        padding=True, 
        truncation=True, 
        max_length=512,
        return_tensors="pt"
    )

Morphological Analysis

# Perfect for linguistic analysis
text = "ሰላማት ወዲ ሓወይ ክብርቲ ኣደይ"
tokens = tokenizer.tokenize(text)

for i, token in enumerate(tokens):
    if token not in ['<s>', '</s>', '<pad>']:
        print(f"Word {i}: '{token}' (Complete Tigrinya word)")

Sample Tokenization

Example 1: Greeting

Original: ሰላም! ከመይ ኣሎኻ?
Tokens: ['<s>', 'ሰላም', '!', 'ከመይ', 'ኣሎኻ', '?', '</s>']
Token IDs: [1, 1523, 12, 2947, 3856, 13, 2]
Token count: 7

Example 2: Longer Text

Original: ሎሚ ጽቡቕ መዓልቲ እዩ። ናብ ቤት ትምህርቲ ክኸይድ እየ።
Tokens: ['<s>', 'ሎሚ', 'ጽቡቕ', 'መዓልቲ', 'እዩ', '።', 'ናብ', 'ቤት', 'ትምህርቲ', 'ክኸይድ', 'እየ', '።', '</s>']
Token count: 13

Example 3: OOV Handling

Original: ሰላም ኮምፒዩተር! (Computer)
Tokens: ['<s>', 'ሰላም', '<unk>', '!', '</s>']  # 'ኮምፒዩተር' becomes <unk>
Token count: 5

Advantages of WordLevel for Tigrinya

Perfect Word Boundaries: Complete preservation of Tigrinya word structure
Linguistic Integrity: Ideal for morphological and syntactic analysis
Clear Semantics: Each token has clear meaning as a complete word
Research Friendly: Perfect for linguistic research and analysis
Interpretable: Easy to understand and debug tokenization results

Performance Characteristics

Tokenization Speed: ~75K tokens/second
Memory Usage: ~25MB for full vocabulary
Vocabulary Coverage: 97.2% of training data (higher OOV rate)
Average Tokens per Word: 1.0 (by definition)
Word Preservation: 100% accurate

Framework Compatibility

HuggingFace Transformers - Full native support
PyTorch - Direct tensor integration
TensorFlow - Via HuggingFace hub
JAX/Flax - Via HuggingFace hub
ONNX - Export supported

File Structure

tigrinya_wordlevel_tokenizer/
├── hf_tokenizer/
│   ├── special_tokens_map.json    # Special token mappings
│   ├── tokenizer_config.json      # HuggingFace tokenizer config
│   └── tokenizer.json             # Full tokenizer definition
├── tokenizer_config.json          # General tokenizer config
├── tokenizer.json                 # Tokenizers library format
└── README.md                      # This file

Advanced Usage

Word Frequency Analysis

# Analyze word frequency in your corpus
vocab = tokenizer.get_vocab()
word_tokens = {token: id for token, id in vocab.items() 
               if not token.startswith('<') and not token.startswith('</') 
               and token not in ['!', '?', '.', ':', ';', ',']}

print(f"Tigrinya words in vocabulary: {len(word_tokens)}")

# Most common words (by ID order, roughly frequency-based)
common_words = sorted(word_tokens.items(), key=lambda x: x[1])[:20]
print("Most common Tigrinya words:", [word for word, _ in common_words])

Custom OOV Handling

# Check for OOV words before tokenization
def check_oov_words(text, tokenizer):
    words = text.split()
    vocab = tokenizer.get_vocab()
    oov_words = []
    
    for word in words:
        # Clean punctuation
        clean_word = word.strip('.,!?;:')
        if clean_word not in vocab and clean_word != '':
            oov_words.append(clean_word)
    
    return oov_words

text = "ሰላም ኮምፒዩተር ተክኖሎጂ"
oov = check_oov_words(text, tokenizer)
print(f"OOV words: {oov}")

Linguistic Analysis

# Perfect for linguistic pattern analysis
def analyze_word_patterns(text, tokenizer):
    tokens = tokenizer.tokenize(text)
    
    # Filter out special tokens and punctuation
    words = [token for token in tokens 
             if not token.startswith('<') and not token.startswith('</')
             and token not in ['!', '?', '.', ':', ';', ',']]
    
    # Analyze word characteristics
    word_lengths = [len(word) for word in words]
    avg_length = sum(word_lengths) / len(word_lengths)
    
    print(f"Total words: {len(words)}")
    print(f"Average word length: {avg_length:.2f} characters")
    print(f"Longest word: {max(words, key=len)} ({len(max(words, key=len))} chars)")
    print(f"Words: {words}")

text = "ሰላማት ወዲ ሓወይ ክብርቲ ኣደይ ንመን ትደሊ ተዛረብ"
analyze_word_patterns(text, tokenizer)

Ideal Use Cases

1. Linguistic Research

Morphological analysis of Tigrinya
Syntactic parsing and analysis
Word frequency studies
Corpus linguistics research

2. Educational Applications

Language learning tools
Vocabulary analysis
Reading comprehension systems
Text difficulty assessment

3. Specialized NLP Tasks

Named entity recognition
Part-of-speech tagging
Word-level classification
Semantic analysis

Limitations and Considerations

Out-of-Vocabulary (OOV) Issues

Problem: Unknown words become <unk> tokens
Impact: Loss of information for new or rare words
Mitigation: Regular vocabulary updates with new data

Large Vocabulary Size

Size: 50,000 tokens (larger than BPE/SentencePiece)
Memory: Higher memory requirements
Training: Slower embedding layer training

Limited Generalization

New Domains: May struggle with specialized terminology
Evolving Language: Needs updates for new words and expressions

Training Your Own WordLevel Tokenizer

To retrain this tokenizer with your own data:

# From the main project directory
python train_tigrinya_wordlevel.py

# Or using the unified interface
python train_tokenizers.py --type wordlevel

Training Parameters

# Key training parameters for WordLevel tokenizer
{
    "vocab_size": 50000,
    "min_frequency": 2,
    "special_tokens": ["<unk>", "<s>", "</s>", "<pad>", "<mask>"],
    "lowercase": False,  # Preserve Ge'ez script case
    "strip_accents": False,  # Preserve diacritics
    "clean_text": True
}

Best Practices

Domain Adaptation: Train on domain-specific Tigrinya text for specialized applications
Vocabulary Updates: Regularly update vocabulary with new text data
OOV Monitoring: Monitor OOV rates and retrain when they become too high
Preprocessing: Apply consistent text normalization before tokenization
Evaluation: Test on held-out data to ensure good coverage

License

This tokenizer is released under the MIT License.

Citation

If you use this tokenizer in your research, please cite:

@misc{tigrinya_wordlevel_tokenizer,
  title={Tigrinya WordLevel Tokenizer for LLM Training},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/mewaeltsegay/tokenizer}}
}

Ready to use WordLevel tokenization for perfect Tigrinya word boundaries?

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("./hf_tokenizer")

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support