---
datasets:
- oscar-corpus/OSCAR-2201
language:
- ar
- fr
---

# Model Card for ABDUL-Ar_20k

ABDUL is a family of **BERT-base masked language models** pretrained on **phoneme-normalized Modern Standard Arabic (MSA)** and, depending on the variant, **normalized French**. Although trained **only on formal data (MSA + French)**, it is designed to transfer to **North African Arabic dialects (e.g., Algerian and Moroccan Darija)**. Variants differ by **vocabulary size (20k / 30k / 40k)** and **training mix (Ar | Ar+Fr | Ar+Fr+CS)**; this card describes the **Ar_20k** variant (Arabic-only mix, 20k vocabulary).

---

## Model Details

### Model Description

- **Developed by:** Yassine Toughrai, Kamel Smaili, David Langlois (LORIA)
- **Funded by:** ANR
- **Shared by:** YassineToughrai
- **Model type:** BERT encoder (MLM objective only)
- **Language(s):** Arabic (dialects + MSA), French
- **License:** Apache 2.0
- **Finetuned from:** None (trained from scratch)

### Model Sources

- **Repository:** [Ar_20k](https://huggingface.co/YassineToughrai/Ar_20k)
- **Paper:** *Modeling North African Dialects from Standard Languages* (ArabicNLP 2025)

---

## Uses

### Direct Use

- As a pretrained encoder for **feature extraction** (hidden states, embeddings).
- Fill-mask experiments on normalized MSA / dialect input.

### Downstream Use

- Fine-tuning for **NER** (e.g., DzNER, DarNER, WikiFANE); a fine-tuning sketch is given under *How to Get Started*.
- Fine-tuning for **sentiment / polarity classification** (e.g., TwiFil).
- Other token-level classification tasks involving **North African dialects** or **MSA**.

### Out-of-Scope Use

- Performance drops significantly on **unnormalized raw dialect text**; the same preprocessing used in pretraining is required.
- Not evaluated for **text generation, speech, ASR, or diacritized Arabic**.

---

## Bias, Risks, and Limitations

- **Bias:** The training data is OSCAR web text (Arabic + French), which may contain social, political, or cultural biases.
- **Risks:** Applying the model without preprocessing can lead to high OOV rates and poor predictions.
- **Limitations:** Evaluated mainly on NER and sentiment; generalization to other tasks is untested.

### Recommendations

- Always apply the same **normalization procedure** before tokenization.
- Evaluate on your target domain before deployment in real-world applications.

---

## How to Get Started with the Model

Use the code below to get started with the model.
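The snippet below is a minimal sketch, assuming the standard `transformers` API and the repository id listed under *Model Sources*. The placeholder sentences are not real normalized input; substitute text processed by the same normalization procedure described in the *Recommendations* section.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
import torch

model_id = "YassineToughrai/Ar_20k"  # repository id from the Model Sources section

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask on input that has already been normalized the same way as the
# pretraining data (placeholder sentence below; replace with normalized text).
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"normalized input text with a {tokenizer.mask_token} token"))

# Feature extraction: hidden states of the underlying BERT encoder.
inputs = tokenizer("normalized input text", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # shape: (batch, seq_len, hidden_size)
print(embeddings.shape)
```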
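The following is a hedged fine-tuning sketch for the token-classification tasks listed under *Downstream Use* (e.g., NER). The tiny in-memory dataset, label inventory, and hyperparameters are illustrative placeholders only; substitute a real annotated corpus (e.g., DzNER or DarNER) after applying the same normalization used in pretraining.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model_id = "YassineToughrai/Ar_20k"
label_list = ["O", "B-LOC", "I-LOC"]  # placeholder label inventory

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=len(label_list)
)

# Toy word-level example; real data must use the pretraining normalization.
raw = Dataset.from_dict({
    "tokens": [["token1", "token2", "token3"]],
    "ner_tags": [[1, 0, 0]],
})

def tokenize_and_align(example):
    # Align word-level tags with subword tokens; special tokens get -100.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        -100 if wid is None else example["ner_tags"][wid]
        for wid in enc.word_ids()
    ]
    return enc

train_ds = raw.map(tokenize_and_align, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="abdul-ner",
        num_train_epochs=1,
        per_device_train_batch_size=8,
        learning_rate=3e-5,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```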
## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary

[More Information Needed]

## More Information

[More Information Needed]

## Model Card Authors

[More Information Needed]

## Model Card Contact

[More Information Needed]