---
license: apache-2.0
base_model: llama
library_name: transformers
pipeline_tag: text-generation
tags:
- one-way-polyglot
- japanese
- english
- bilingual
- small-model
---

# one-way-polyglot-8m-tied

A one-way polyglot language model trained to understand Japanese but generate only English.

## Model Details

- **Architecture**: LLaMA-based transformer
- **Parameters**: 8,519,936 (8.5M)
- **Vocabulary**: 16,384 tokens (bilingual SentencePiece)
- **Context Length**: 512 tokens
- **Embedding Strategy**: Tied (input and output embeddings share weights)

## Capabilities

- **Semantic Transfer**: Understands Japanese input and generates contextually appropriate English
- **One-Way Constraint**: Strong bias toward English-only generation
- **Name Transliteration**: Can transliterate Japanese names to English (context-dependent)

## Training Data

Trained on bilingual Japanese-English story data. The loss is masked over the Japanese prefix tokens, so the model learns to read Japanese but is only ever penalized on its English predictions; this enforces one-way generation. A sketch of this masking scheme appears at the end of this card.

## Usage

```python
from transformers import LlamaForCausalLM, AutoTokenizer

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-8m-tied")
tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-8m-tied")

# Japanese input → English output (primary use case)
# Prompt: "Once upon a time, there was a girl with a red umbrella."
prompt = "昔々、赤い傘を持った少女がいました。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Mixed-language name transliteration
# Japanese part: "Taro was playing with Hanako in the park."
prompt = "太郎は公園で花子と遊んでいました。After playing, Taro told Hanako that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# English input (case-folded: normalized to lowercase by the tokenizer)
prompt = "Hello World"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that `temperature` only takes effect together with `do_sample=True`; without it, `generate` falls back to greedy decoding.

### Tokenizer Features

- **✅ Case Folding**: "Hello", "hello", and "HELLO" produce identical tokenization (a quick check appears at the end of this card)
- **✅ Japanese Support**: Full Japanese text support with proper normalization
- **✅ No UNK Tokens**: Uppercase and lowercase English text both tokenize without unknown tokens
- **✅ SentencePiece Compatibility**: Built as a Unigram model with normalization

## Model Variants

This model is part of a series exploring one-way polyglot capabilities:

- 1.25M parameters (tied embeddings)
- 8.5M parameters (tied embeddings, this model)
- 12.7M parameters (untied embeddings)
- 15.7M parameters (tied embeddings)

## License

Apache 2.0
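
## Appendix: Loss Masking Sketch

The masked loss described under Training Data can be implemented by setting label positions inside the Japanese prefix to the ignore index of the cross-entropy loss, so gradients flow only from English tokens. The sketch below is a minimal illustration under assumed shapes; the function name, prefix boundary, and toy tensors are hypothetical and not taken from the actual training pipeline.

```python
import torch
from torch.nn import functional as F

def prefix_masked_loss(logits, input_ids, japanese_prefix_len):
    """Causal LM loss over English tokens only; the Japanese prefix is ignored."""
    labels = input_ids.clone()
    labels[:, :japanese_prefix_len] = -100  # -100 is skipped by cross_entropy
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Toy example: batch of 1, sequence of 8 tokens, vocab of 16,384,
# with the first 5 tokens forming the (hypothetical) Japanese prefix.
logits = torch.randn(1, 8, 16384)
input_ids = torch.randint(0, 16384, (1, 8))
print(prefix_masked_loss(logits, input_ids, japanese_prefix_len=5))
```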
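
## Appendix: Tokenizer Case-Folding Check

The case-folding and no-UNK claims under Tokenizer Features can be checked directly, assuming the tokenizer loads from the same repo id used in the Usage section:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-8m-tied")

# Case folding: all three casings should map to the same token IDs.
id_lists = [tokenizer(v)["input_ids"] for v in ("Hello", "hello", "HELLO")]
assert id_lists[0] == id_lists[1] == id_lists[2], "case folding failed"

# Japanese text should tokenize without unknown tokens.
jp_ids = tokenizer("昔々、少女がいました。")["input_ids"]
assert tokenizer.unk_token_id not in jp_ids, "unexpected UNK token"
print("tokenizer checks passed")
```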