Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT

A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for English → Spanish on the OPUS Books dataset. It uses Hugging Face transformers, datasets, and evaluate; logs to TensorBoard; and reports sacreBLEU and chrF. Results and training details are reported below.

Model Details

Model Description

This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using text_target=, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs/datasets or models (e.g., T5, mBART).

  • Developed by: Amir Hossein Yousefi (GitHub: amirhossein-yousefi)
  • Shared by: Hugging Face user Amirhossein75
  • Model type: Transformer encoder–decoder (MarianMT) for machine translation
  • Language(s) (NLP): Source: English (en) → Target: Spanish (es) by default (configurable)
  • License: Not explicitly specified in the repository. The base checkpoint Helsinki-NLP/opus-mt-en-es is released under CC-BY-4.0, and the OPUS Books dataset card lists license “other”; verify compatibility for your use case.
  • Finetuned from model: Helsinki-NLP/opus-mt-en-es (MarianMT)

Model Sources

  • Repository: https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation

Uses

Direct Use

  • Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
  • Prototyping translation systems for English→Spanish (or other pairs after configuration changes).

Downstream Use

  • Fine-tune on domain-specific parallel corpora for production MT.
  • Replace the base model with T5/mBART/other OPUS-MT variants by changing TrainConfig.model_name.
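
Swapping the base model amounts to pointing TrainConfig.model_name at a different seq2seq checkpoint. As a quick sanity check before fine-tuning, a candidate checkpoint can be loaded directly; the en–fr checkpoint below is only an illustration:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any seq2seq translation checkpoint can stand in for the default; this one is illustrative.
checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("The sea extended to the horizon.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))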

Out-of-Scope Use

  • Safety‑critical or high‑stakes scenarios without human review.
  • Zero-shot translation to/from languages not covered by the checkpoint or dataset.
  • Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine‑tuning.

Bias, Risks, and Limitations

  • Domain & recency mismatch: OPUS Books contains copyright‑free books and is dated; performance may degrade on contemporary, conversational, or domain‑specific text.
  • Language & register: Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code‑switching, or technical jargon, expect errors.
  • General MT caveats: Typical MT biases (gendered forms, named entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.

Recommendations

  • Evaluate on your domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); a scoring sketch follows this list.
  • Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
  • If deploying, log sources and predictions; implement quality thresholds and fallback to human translation as needed.
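
A minimal scoring sketch with the evaluate library, using the same sacreBLEU and chrF metrics reported in this card (the sentences are toy placeholders; substitute your own sources, references, and model outputs):

import evaluate

# Toy placeholders; replace with your own domain predictions and references.
predictions = ["El mar se extendía hasta el horizonte."]
references = [["El mar se extendía hasta el horizonte."]]

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

print("sacreBLEU:", sacrebleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])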

How to Get Started with the Model

Option A — Quick inference (baseline checkpoint):

from transformers import pipeline

# Baseline (not fine-tuned) MarianMT checkpoint for English -> Spanish.
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
print(translator("The sea extended to the horizon.")[0]["translation_text"])

Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):

git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
python -m src.train  # or: python src/train.py

Artifacts (model, tokenizer) are saved under the configured outputs directory; you can then push them to the Hub.
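
A minimal sketch for publishing those artifacts to the Hub (the output directory and repo id below are placeholders; use the paths and names from your own run):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path; substitute the outputs directory your training run wrote to.
output_dir = "outputs/opus-mt-en-es-finetuned"

model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Requires prior authentication, e.g. `huggingface-cli login` or the HF_TOKEN environment variable.
model.push_to_hub("your-username/your-model-name")
tokenizer.push_to_hub("your-username/your-model-name")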

Training Details

Training Data

  • Dataset: OPUS Books (Helsinki-NLP/opus_books) English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
  • Preprocessing: Tokenization uses Hugging Face tokenizers with text_target= for the target (labels), avoiding leakage and ensuring correct special‑token handling.
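
A minimal preprocessing sketch in that spirit (the function is illustrative, not the repository's code; OPUS Books rows expose a "translation" dict keyed by language code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_length=128):
    # Each OPUS Books row looks like {"translation": {"en": "...", "es": "..."}}.
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["es"] for ex in batch["translation"]]
    # text_target= tokenizes the labels with the correct target-side settings.
    return tokenizer(sources, text_target=targets, max_length=max_length, truncation=True)

Applied with datasets' map(preprocess, batched=True), this produces input_ids, attention_mask, and labels for training.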

Training Procedure

Implemented with Hugging Face Trainer and TrainingArguments. Mixed precision (fp16) is enabled automatically when CUDA is available. Logging is written to TensorBoard under outputs/.../logs.
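
A hedged sketch of how such a setup is typically wired (the values below are placeholders, not the repository's defaults, which live in src/config.py):

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/opus-mt-en-es-finetuned",   # placeholder path
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    eval_strategy="epoch",
    fp16=torch.cuda.is_available(),                 # mixed precision only when CUDA is present
    logging_dir="outputs/opus-mt-en-es-finetuned/logs",
    report_to=["tensorboard"],
)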

Preprocessing

  • Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
  • Max sequence lengths (source/target) and batch size are configurable in TrainConfig.

Training Hyperparameters

  • Training regime: Automatic mixed precision (fp16) when CUDA is available; standard fp32 otherwise.
  • Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in src/config.py and can be overridden in your script.
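
For orientation only, a hypothetical sketch of what such a dataclass-style config might look like; apart from model_name (referenced above), the field names and defaults are invented, so consult src/config.py for the real ones:

from dataclasses import dataclass

@dataclass
class TrainConfig:  # hypothetical sketch; not the repository's actual definition
    model_name: str = "Helsinki-NLP/opus-mt-en-es"
    dataset_name: str = "Helsinki-NLP/opus_books"
    source_lang: str = "en"
    target_lang: str = "es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_train_batch_size: int = 16
    num_train_epochs: int = 3
    learning_rate: float = 2e-5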

Speeds, Sizes, Times

  • Hardware: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
  • Total FLOPs (training): 4,945,267,757,416,448
  • Training runtime: 2,449.291 seconds (≈ 40:45 wall‑clock)
  • Throughput: train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • OPUS Books test split for EN→ES.

Factors

  • Reported metrics are aggregate; you may wish to break down by category (named entities, numbers, sentence length) for your domain.
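
For instance, a small sketch that buckets source sentences by length and reports sacreBLEU per bucket (the boundary is arbitrary):

import evaluate

sacrebleu = evaluate.load("sacrebleu")

def bleu_by_length(sources, references, predictions, boundary=15):
    # Arbitrary split into "short" vs "long" sources by whitespace token count.
    buckets = {"short": ([], []), "long": ([], [])}
    for src, ref, pred in zip(sources, references, predictions):
        key = "short" if len(src.split()) <= boundary else "long"
        buckets[key][0].append(pred)
        buckets[key][1].append([ref])
    return {
        name: sacrebleu.compute(predictions=preds, references=refs)["score"]
        for name, (preds, refs) in buckets.items()
        if preds
    }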

Metrics

  • sacreBLEU (higher is better)
  • chrF (higher is better)
  • Average generated length (tokens)

Results

  • BLEU (val/test): 23.41 / 23.41
  • chrF (val/test): 48.20 / 48.21
  • Loss (train/val/test): 1.854 / 1.883 / 1.859
  • Avg generation length (val/test): 30.27 / 29.88 tokens
  • Wall‑clock: train 40:45 · val 5:16 · test 5:18

Summary

The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.

Model Examination

  • Qualitative samples (EN→ES) and loss curves are included under assets/ and TensorBoard logs in outputs/.../logs.
  • Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
  • Hours used: ~0.68 hours (≈ 2,449 seconds) for the reported training run
  • Cloud Provider: N/A (local laptop)
  • Compute Region: N/A
  • Carbon Emitted: Not estimated; depends on local energy mix

Technical Specifications

Model Architecture and Objective

  • Transformer encoder–decoder (MarianMT): 6‑layer encoder and 6‑layer decoder with static sinusoidal positional embeddings, ≈77.5M parameters (stored as F32 safetensors); optimized for translation as conditional generation.
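
These figures can be double-checked against the base checkpoint's config and parameter count (a quick local check, not part of the training pipeline):

from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(config.encoder_layers, config.decoder_layers)   # expected: 6 6

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(round(sum(p.numel() for p in model.parameters()) / 1e6, 1), "M parameters")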

Compute Infrastructure

Hardware

  • Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).

Software

  • Python 3.13+, transformers 4.42+, datasets 3.0+, evaluate 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.

Citation

If you use this model or code, please consider citing the OPUS‑MT work and Marian:

BibTeX (OPUS‑MT):

@inproceedings{tiedemann-thottingal-2020-opus,
  title = "{OPUS}-{MT} -- Building open translation services for the World",
  author = "Tiedemann, J{"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year = "2020"
}

BibTeX (Democratizing NMT with OPUS‑MT):

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal={Language Resources and Evaluation},
  year={2023}
}

Glossary

  • BLEU: Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
  • chrF: Character n‑gram F‑score; more sensitive to morphological correctness.

More Information

  • See the repository README for project structure, defaults, and customization tips.
  • A model repo exists on the Hugging Face Hub; before loading it directly, confirm that the weights and model card have been pushed and are up to date.

Model Card Authors

  • Amir Hossein Yousefi (project author)
  • This model card was drafted for users of the repository and the Hub model.

Model Card Contact

  • Open an issue in the repository or contact the Hugging Face user Amirhossein75.