Model Card for ABDUL-Ar_20k

ABDUL-{VARIANT} is a BERT-base masked language model pretrained on phoneme-normalized Modern Standard Arabic (MSA) and, depending on the variant, normalized French. It targets North African Arabic dialects (e.g., Algerian, Moroccan Darija) even though it is trained only on formal data (MSA + French). Variants differ in vocabulary size (20k/30k/40k) and training mix (Ar | Ar+Fr | Ar+Fr+CS).


Model Details

Model Description

  • Developed by: Yassine Toughrai, Kamel Smaili, David Langlois (LORIA)
  • Funded by: ANR
  • Shared by: YassineToughrai
  • Model type: BERT encoder (MLM objective only)
  • Language(s): Arabic (dialects + MSA), French
  • License: Apache 2.0
  • Finetuned from: None (trained from scratch)

Model Sources

  • Repository: YassineToughrai/Ar_20k
  • Paper: Modeling North African Dialects from Standard Languages (ArabicNLP 2025)

Uses

Direct Use

  • As a pretrained encoder for feature extraction (hidden states, embeddings).
  • Fill-mask experiments on normalized MSA / dialect input.

Downstream Use

  • Fine-tuning for NER (e.g., DzNER, DarNER, WikiFANE).
  • Fine-tuning for sentiment / polarity classification (e.g., TwiFil).
  • Other token-level classification tasks where North African dialects or MSA are involved.
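For token-level fine-tuning such as NER, word-level labels have to be mapped onto subword tokens. The sketch below shows one common convention: label the first subword of each word and mask continuation subwords and special tokens with -100, the index that PyTorch's cross-entropy loss ignores by default. It is an illustration, not the authors' confirmed setup. `align_labels` and the example label IDs are hypothetical, and `word_ids` is assumed to come from a fast tokenizer's `word_ids()` method.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level labels onto subword tokens.

    word_ids: one entry per subword token; None marks special tokens
              such as [CLS] and [SEP], integers index into word_labels.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)      # continuation subword
        previous = wid
    return aligned


# Hypothetical example: 3 words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 3, 0]  # made-up label IDs
print(align_labels(example_word_ids, example_labels))
# → [-100, 0, 3, -100, 0, -100]
```

Labeling only the first subword keeps the loss per word rather than per subword, so words that split into many pieces do not dominate training.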

Out-of-Scope Use

  • Unnormalized raw dialect text: performance drops significantly unless the input is preprocessed with the normalization used for pretraining.
  • Text generation, speech, ASR, and diacritized Arabic: not evaluated.

Bias, Risks, and Limitations

  • Bias: Training data is OSCAR web text (Arabic + French), which may contain social, political, or cultural biases.
  • Risks: Applying the model to input that has not been normalized can lead to high out-of-vocabulary (OOV) rates and poor predictions.
  • Limitations: Evaluated mainly on NER and sentiment; generalization to other tasks is untested.
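One way to spot the OOV risk early is to measure how often the tokenizer falls back to its unknown token on a sample of target-domain text. This is a minimal sketch, assuming a BERT-style tokenizer whose unknown token is `[UNK]`. That token is not confirmed for this checkpoint, so in practice read it from `tokenizer.unk_token`. `unk_rate` is a hypothetical helper, and the token lists are made up.

```python
from typing import Iterable, List


def unk_rate(tokenized: Iterable[List[str]], unk_token: str = "[UNK]") -> float:
    """Fraction of tokens equal to unk_token across tokenized sentences."""
    total = 0
    unknown = 0
    for tokens in tokenized:
        total += len(tokens)
        unknown += sum(1 for t in tokens if t == unk_token)
    return unknown / total if total else 0.0


# Made-up token sequences standing in for tokenizer.tokenize(...) output.
sample = [["a", "[UNK]", "b"], ["[UNK]"]]
print(unk_rate(sample))  # 2 unknown tokens out of 4 → 0.5
```

Comparing this rate on raw versus normalized input gives a quick signal of whether the preprocessing step is being applied correctly.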

Recommendations

  • Always apply the same normalization procedure used for pretraining before tokenization, at both fine-tuning and inference time.
  • Evaluate on your target domain before deployment in real-world applications.
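The first recommendation can be enforced by routing every input through one wrapper, so the tokenizer never sees raw text. The sketch below is only an illustration of that wiring. `placeholder_normalize` strips Arabic diacritics and tatweel and is NOT the phoneme normalization used to pretrain ABDUL, which this card does not specify; replace it with the authors' procedure. All function names here are hypothetical.

```python
import re
from typing import Callable, List

# Arabic short-vowel and related marks (U+064B..U+0652) plus tatweel (U+0640).
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")


def placeholder_normalize(text: str) -> str:
    """Stand-in only: NOT the phoneme normalization used to pretrain ABDUL."""
    text = _DIACRITICS_AND_TATWEEL.sub("", text)
    return " ".join(text.split())  # collapse runs of whitespace


def tokenize_normalized(
    text: str,
    tokenize: Callable[[str], List[str]],
    normalize: Callable[[str], str] = placeholder_normalize,
) -> List[str]:
    """Apply normalize first, then tokenize the normalized text."""
    return tokenize(normalize(text))


# str.split stands in for a real tokenizer's tokenize method.
print(tokenize_normalized("كِتَاب   جديد", str.split))
```

Passing the normalizer as a parameter keeps the same wrapper usable for fine-tuning and inference once the real procedure is plugged in.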

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
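A minimal fill-mask sketch, assuming the Hub repository ID is `YassineToughrai/Ar_20k` (the name shown on the model page) and that the checkpoint loads with the standard `transformers` `pipeline` API. Neither assumption is confirmed by the authors. `build_fill_mask` and `top_predictions` are hypothetical helper names, and the input must already be normalized.

```python
REPO_ID = "YassineToughrai/Ar_20k"  # assumed repository ID, taken from the model page


def build_fill_mask(repo_id: str = REPO_ID):
    """Return a fill-mask pipeline; downloads the checkpoint on first call."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=repo_id)


def top_predictions(fill_mask, normalized_text: str, k: int = 5):
    """Return (token, score) pairs for the single mask in normalized_text.

    normalized_text must already be normalized and contain exactly one
    occurrence of the tokenizer's mask token.
    """
    results = fill_mask(normalized_text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


# Usage (requires network access and the transformers library):
# fill_mask = build_fill_mask()
# mask = fill_mask.tokenizer.mask_token  # read the mask token instead of hardcoding it
# sentence = normalized_prefix + " " + mask + " " + normalized_suffix
# print(top_predictions(fill_mask, sentence))
```

Reading the mask token from the tokenizer avoids assuming a particular special-token string for this checkpoint.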

Training Details

Training Data

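OSCAR web text, phoneme-normalized before pretraining: MSA, plus normalized French for the variants whose training mix includes it.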

Training Procedure

Preprocessing

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

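BERT-base encoder (about 0.1B parameters, F32 weights) trained from scratch with a masked language modeling (MLM) objective only.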

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary

[More Information Needed]

More Information

[More Information Needed]

Model Card Authors

[More Information Needed]

Model Card Contact

[More Information Needed]
