Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT

A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for English → Spanish on the OPUS Books dataset. It uses Hugging Face transformers, datasets, and evaluate; logs to TensorBoard; and reports sacreBLEU and chrF. Results and training details are reported below.

Model Details

Model Description

This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using text_target=, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs/datasets or models (e.g., T5, mBART).

  • Developed by: Amir Hossein Yousefi (GitHub: amirhossein-yousefi)
  • Shared by: Hugging Face user Amirhossein75
  • Model type: Transformer encoder–decoder (MarianMT) for machine translation
  • Language(s) (NLP): Source: English (en) → Target: Spanish (es) by default (configurable)
  • License: Not explicitly specified in the repository. The base checkpoint Helsinki-NLP/opus-mt-en-es is released under CC-BY-4.0, and the OPUS Books dataset card lists license “other”; verify compatibility for your use case.
  • Finetuned from model: Helsinki-NLP/opus-mt-en-es (MarianMT)

Model Sources

  • Repository: https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation

Uses

Direct Use

  • Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
  • Prototyping translation systems for English→Spanish (or other pairs after configuration changes).

Downstream Use

  • Fine-tune on domain-specific parallel corpora for production MT.
  • Replace the base model with T5/mBART/other OPUS-MT variants by changing TrainConfig.model_name.
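
Swapping the base model amounts to pointing TrainConfig.model_name at a different seq2seq checkpoint. As a quick sanity check before fine-tuning, a candidate checkpoint can be loaded directly; the en–fr checkpoint below is only an illustration:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any seq2seq translation checkpoint can stand in for the default; this one is illustrative.
checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("The sea extended to the horizon.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))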

Out-of-Scope Use

  • Safety‑critical or high‑stakes scenarios without human review.
  • Zero-shot translation to/from languages not covered by the checkpoint or dataset.
  • Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine‑tuning.

Bias, Risks, and Limitations

  • Domain & recency mismatch: OPUS Books contains copyright‑free books and is dated; performance may degrade on contemporary, conversational, or domain‑specific text.
  • Language & register: Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code‑switching, or technical jargon, expect errors.
  • General MT caveats: Typical MT biases (gendered forms, named entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.

Recommendations

  • Evaluate on your domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); a scoring sketch follows this list.
  • Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
  • If deploying, log sources and predictions; implement quality thresholds and fallback to human translation as needed.
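
A minimal scoring sketch with the evaluate library, using the same sacreBLEU and chrF metrics reported in this card (the sentences are toy placeholders; substitute your own sources, references, and model outputs):

import evaluate

# Toy placeholders; replace with your own domain predictions and references.
predictions = ["El mar se extendía hasta el horizonte."]
references = [["El mar se extendía hasta el horizonte."]]

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

print("sacreBLEU:", sacrebleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])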

How to Get Started with the Model

Option A — Quick inference (baseline checkpoint):

from transformers import pipeline

# Baseline (not fine-tuned) MarianMT checkpoint for English -> Spanish.
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
print(translator("The sea extended to the horizon.")[0]["translation_text"])

Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):

git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
python -m src.train  # or: python src/train.py

Artifacts (model, tokenizer) are saved under the configured outputs directory; you can then push them to the Hub.
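
A minimal sketch for publishing those artifacts to the Hub (the output directory and repo id below are placeholders; use the paths and names from your own run):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path; substitute the outputs directory your training run wrote to.
output_dir = "outputs/opus-mt-en-es-finetuned"

model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Requires prior authentication, e.g. `huggingface-cli login` or the HF_TOKEN environment variable.
model.push_to_hub("your-username/your-model-name")
tokenizer.push_to_hub("your-username/your-model-name")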

Training Details

Training Data

  • Dataset: OPUS Books (Helsinki-NLP/opus_books) English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
  • Preprocessing: Tokenization uses Hugging Face tokenizers with text_target= for the target (labels), avoiding leakage and ensuring correct special‑token handling.
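
A minimal preprocessing sketch in that spirit (the function is illustrative, not the repository's code; OPUS Books rows expose a "translation" dict keyed by language code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_length=128):
    # Each OPUS Books row looks like {"translation": {"en": "...", "es": "..."}}.
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["es"] for ex in batch["translation"]]
    # text_target= tokenizes the labels with the correct target-side settings.
    return tokenizer(sources, text_target=targets, max_length=max_length, truncation=True)

Applied with datasets' map(preprocess, batched=True), this produces input_ids, attention_mask, and labels for training.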

Training Procedure

Implemented with Hugging Face Trainer and TrainingArguments. Mixed precision (fp16) is enabled automatically when CUDA is available. Logging is written to TensorBoard under outputs/.../logs.
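
A hedged sketch of how such a setup is typically wired (the values below are placeholders, not the repository's defaults, which live in src/config.py):

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/opus-mt-en-es-finetuned",   # placeholder path
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    eval_strategy="epoch",
    fp16=torch.cuda.is_available(),                 # mixed precision only when CUDA is present
    logging_dir="outputs/opus-mt-en-es-finetuned/logs",
    report_to=["tensorboard"],
)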

Preprocessing

  • Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
  • Max sequence lengths (source/target) and batch size are configurable in TrainConfig.

Training Hyperparameters

  • Training regime: Automatic mixed precision (fp16) when CUDA is available; standard fp32 otherwise.
  • Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in src/config.py and can be overridden in your script.
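
For orientation only, a hypothetical sketch of what such a dataclass-style config might look like; apart from model_name (referenced above), the field names and defaults are invented, so consult src/config.py for the real ones:

from dataclasses import dataclass

@dataclass
class TrainConfig:  # hypothetical sketch; not the repository's actual definition
    model_name: str = "Helsinki-NLP/opus-mt-en-es"
    dataset_name: str = "Helsinki-NLP/opus_books"
    source_lang: str = "en"
    target_lang: str = "es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_train_batch_size: int = 16
    num_train_epochs: int = 3
    learning_rate: float = 2e-5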

Speeds, Sizes, Times

  • Hardware: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
  • Total FLOPs (training): 4,945,267,757,416,448
  • Training runtime: 2,449.291 seconds (≈ 40:45 wall‑clock)
  • Throughput: train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • OPUS Books test split for EN→ES.

Factors

  • Reported metrics are aggregate; you may wish to break down by category (named entities, numbers, sentence length) for your domain.
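
For instance, a small sketch that buckets source sentences by length and reports sacreBLEU per bucket (the boundary is arbitrary):

import evaluate

sacrebleu = evaluate.load("sacrebleu")

def bleu_by_length(sources, references, predictions, boundary=15):
    # Arbitrary split into "short" vs "long" sources by whitespace token count.
    buckets = {"short": ([], []), "long": ([], [])}
    for src, ref, pred in zip(sources, references, predictions):
        key = "short" if len(src.split()) <= boundary else "long"
        buckets[key][0].append(pred)
        buckets[key][1].append([ref])
    return {
        name: sacrebleu.compute(predictions=preds, references=refs)["score"]
        for name, (preds, refs) in buckets.items()
        if preds
    }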

Metrics

  • sacreBLEU (higher is better)
  • chrF (higher is better)
  • Average generated length (tokens)

Results

  • BLEU (val/test): 23.41 / 23.41
  • chrF (val/test): 48.20 / 48.21
  • Loss (train/val/test): 1.854 / 1.883 / 1.859
  • Avg generation length (val/test): 30.27 / 29.88 tokens
  • Wall‑clock: train 40:45 · val 5:16 · test 5:18

Summary

The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.

Model Examination

  • Qualitative samples (EN→ES) and loss curves are included under assets/ and TensorBoard logs in outputs/.../logs.
  • Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
  • Hours used: ~0.68 hours (≈ 2,449 seconds) for the reported training run
  • Cloud Provider: N/A (local laptop)
  • Compute Region: N/A
  • Carbon Emitted: Not estimated; depends on local energy mix

Technical Specifications

Model Architecture and Objective

  • Transformer encoder–decoder (MarianMT): 6‑layer encoder and 6‑layer decoder with static sinusoidal positional embeddings, ≈77.5M parameters (stored as F32 safetensors); optimized for translation as conditional generation.
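
These figures can be double-checked against the base checkpoint's config and parameter count (a quick local check, not part of the training pipeline):

from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(config.encoder_layers, config.decoder_layers)   # expected: 6 6

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(round(sum(p.numel() for p in model.parameters()) / 1e6, 1), "M parameters")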

Compute Infrastructure

Hardware

  • Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).

Software

  • Python 3.13+, transformers 4.42+, datasets 3.0+, evaluate 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.

Citation

If you use this model or code, please consider citing the OPUS‑MT work and Marian:

BibTeX (OPUS‑MT):

@inproceedings{tiedemann-thottingal-2020-opus,
  title = "{OPUS}-{MT} -- Building open translation services for the World",
  author = "Tiedemann, J{"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year = "2020"
}

BibTeX (Democratizing NMT with OPUS‑MT):

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal={Language Resources and Evaluation},
  year={2023}
}

Glossary

  • BLEU: Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
  • chrF: Character n‑gram F‑score; more sensitive to morphological correctness.

More Information

  • See the repository README for project structure, defaults, and customization tips.
  • A model repo exists on the Hugging Face Hub; before loading it directly, confirm that the weights and model card have been pushed and are up to date.

Model Card Authors

  • Amir Hossein Yousefi (project author)
  • This model card was drafted for users of the repository and the Hub model.

Model Card Contact

  • Open an issue in the repository or contact the Hugging Face user Amirhossein75.