Model Card for ABDUL-Ar_20k

ABDUL-Ar_20k belongs to the ABDUL family of BERT-base masked language models, which are pretrained on phoneme-normalized Modern Standard Arabic (MSA) and, depending on the variant, normalized French. The family targets North African Arabic dialects (e.g., Algerian, Moroccan Darija) although it is trained only on formal data (MSA + French). Variants differ by vocabulary size (20k/30k/40k) and training mix (Ar | Ar+Fr | Ar+Fr+CS); as the name indicates, this card covers the Ar variant with a 20k vocabulary.

Model Details

Model Description

Developed by: Yassine Toughrai, Kamel Smaili, David Langlois (LORIA)
Funded by: ANR
Shared by: YassineToughrai
Model type: BERT encoder (MLM objective only)
Language(s): Arabic (dialects + MSA), French
License: Apache 2.0
Finetuned from: None (trained from scratch)

Model Sources

Repository: YassineToughrai/Ar_20k
Paper: Modeling North African Dialects from Standard Languages (ArabicNLP 2025)

Uses

Direct Use

As a pretrained encoder for feature extraction (hidden states, embeddings).
Fill-mask experiments on normalized MSA / dialect input.

Downstream Use

Fine-tuning for NER (e.g., DzNER, DarNER, WikiFANE).
Fine-tuning for sentiment / polarity classification (e.g., TwiFil).
Other token-level classification tasks where North African dialects or MSA are involved.
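For the NER fine-tuning case listed above, word-level tags usually have to be aligned with the subword tokens a BERT tokenizer produces. The sketch below shows one common convention (label the first subword of each word, ignore the rest); it is not taken from the ABDUL paper. The function name `align_labels` is hypothetical, `word_ids` is assumed to come from a fast tokenizer's `BatchEncoding.word_ids()`, and `-100` is the index PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
) -> List[int]:
    """Map word-level label ids onto subword positions.

    word_ids holds, for each subword token, the index of the word it came
    from, or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special token: excluded from the loss.
            aligned.append(ignore_index)
        elif wid != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[wid])
        else:
            # Continuation subword: excluded from the loss.
            aligned.append(ignore_index)
        previous = wid
    return aligned
```

The resulting list can be passed as `labels` alongside the tokenizer output when fine-tuning a token-classification head; whether first-subword labeling is the best choice for DzNER, DarNER, or WikiFANE would need to be checked against those datasets.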

Out-of-Scope Use

Raw, unnormalized dialect text: performance drops significantly, so input must be normalized first.
Not evaluated for text generation, speech, ASR, or diacritized Arabic.

Bias, Risks, and Limitations

Bias: Training data is OSCAR web text (Arabic + French), which may contain social, political, or cultural biases.
Risks: Applying the model without preprocessing can lead to high out-of-vocabulary (OOV) rates and poor predictions.
Limitations: Evaluated mainly on NER and sentiment; generalization to other tasks is untested.
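The preprocessing risk noted above can be checked empirically before committing to a pipeline. Below is a minimal sketch, assuming the tokenizer marks unknown pieces with `[UNK]` and continuation pieces with a `##` prefix (the usual BERT WordPiece defaults, not confirmed for this model). `unk_rate` and `fertility` are hypothetical helper names.

```python
from typing import List


def unk_rate(tokens: List[str], unk_token: str = "[UNK]") -> float:
    """Fraction of tokens that the tokenizer mapped to the unknown token."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t == unk_token) / len(tokens)


def fertility(tokens: List[str], continuation_prefix: str = "##") -> float:
    """Average number of subword tokens per word.

    Words are counted as tokens that do not start with the continuation
    prefix; a higher value means words are split into more pieces.
    """
    words = sum(1 for t in tokens if not t.startswith(continuation_prefix))
    return len(tokens) / words if words else 0.0
```

Comparing both numbers on the same sample of text, once raw and once normalized, gives a rough indication of how much the vocabulary depends on the normalization step.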

Recommendations

Always apply the same normalization procedure used in pretraining before tokenization.
Evaluate on your target domain before deployment in real-world applications.
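One way to make the first recommendation hard to forget is to bundle normalization and tokenization into a single callable, so that no code path can tokenize raw text. The sketch below is only a pattern: `make_preprocessor` is a hypothetical helper, and the normalizer passed to it must be the ABDUL normalization, which this card does not specify and which is not reproduced here.

```python
from typing import Callable, List


def make_preprocessor(
    normalize: Callable[[str], str],
    tokenize: Callable[[str], List[str]],
) -> Callable[[str], List[str]]:
    """Return a function that always normalizes text before tokenizing it."""

    def preprocess(text: str) -> List[str]:
        return tokenize(normalize(text))

    return preprocess


# Stand-ins for illustration only: str.lower is NOT the ABDUL normalization
# and str.split is NOT the model's tokenizer.
demo = make_preprocessor(normalize=str.lower, tokenize=str.split)
```

Using the same `preprocess` function for fine-tuning data and for inference keeps the two in step.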

How to Get Started with the Model

Use the code below to get started with the model.
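A minimal fill-mask sketch using the Hugging Face `transformers` library, assuming the checkpoint is published on the Hub as `YassineToughrai/Ar_20k` and loads with the standard `fill-mask` pipeline. `load_fill_mask` and `mask_word` are hypothetical helper names, and the input words are assumed to be normalized already, since the normalization procedure itself is not given in this card.

```python
from typing import List

MODEL_ID = "YassineToughrai/Ar_20k"  # Hub id as listed on the model page


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline; downloads the weights on first use."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def mask_word(words: List[str], index: int, mask_token: str = "[MASK]") -> str:
    """Replace the word at `index` with the mask token and join with spaces."""
    masked = list(words)
    masked[index] = mask_token
    return " ".join(masked)


# Usage (requires network access and the transformers package):
#   fill_mask = load_fill_mask()
#   text = mask_word(normalized_words, 1, fill_mask.tokenizer.mask_token)
#   for candidate in fill_mask(text, top_k=5):
#       print(candidate["token_str"], candidate["score"])
```

Reading the mask token from `fill_mask.tokenizer.mask_token` avoids hard-coding `[MASK]`, which is only the usual BERT default.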

[More Information Needed]

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: [More Information Needed]
Hours used: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]

Safetensors

Model size: 0.1B params
Tensor type: F32