From Manual Annotation to LLM-Aided Pipelines: A Retrospective on Dataset Curation at Typica.ai
Hicham Assoudi
Founder, Typica.ai | External Research Associate, UQAM (CRIA AI Lab)
Email: [email protected]; [email protected]
Abstract
In this retrospective, we present the evolution of dataset curation at Typica.ai, from early manual annotation efforts to modern AI-assisted pipelines for building high-quality NLP datasets. In 2020, we bootstrapped bilingual NER datasets (Moroccan Arabic and French) using Doccano, Heroku, and an active learning loop involving a small team of local annotators. These datasets enabled the training of MAGBERT-based NER models, one of which is publicly available. We reflect on the challenges faced—such as data loss due to a Heroku account issue—and the lessons learned. More recently, we transitioned to using Oracle Cloud Infrastructure (OCI) for cloud-based labeling and adopted tools like Distilabel and LLMs such as DeepSeek to generate instruction-tuning datasets. This paper summarizes five years of practical experience in NLP dataset curation across different stages of maturity and tooling.
1. Introduction
While large language models (LLMs) capture global headlines, the foundational work of dataset creation remains a bottleneck, especially for low-resource languages and domains. At Typica.ai, our journey began with a commitment to building NLP for the Moroccan context (MSA Arabic, French, and Darija). In this article, we trace our trajectory from grassroots annotation projects to modern AI-assisted data pipelines, and share the practices and insights that shaped our evolution.
2. Dataset Goals and Curation Strategy (2020–2025)
Over the past five years, our goal at Typica.ai has been to build high-quality, culturally grounded NLP datasets for Modern Standard Arabic (MSA), French as used in formal Moroccan contexts, and Moroccan Colloquial Arabic (Darija). These datasets have supported multiple generations of models—from early Named Entity Recognition (NER) systems to more recent instruction-tuned large language models (LLMs).
Our curation strategy evolved through three major phases:
Manual Bootstrapping for NER (2020)
We started by developing two core NER datasets focused on:
- MSA (Modern Standard Arabic): Using formal Arabic found in Moroccan media.
- Moroccan French: Capturing region-specific linguistic and named entity patterns.
These datasets were manually annotated by a small team of Moroccan students. Text was collected from publicly available online sources such as Moroccan news websites, ensuring a diverse mix of domains and writing styles.
Iterative Annotation with Automation
Using Doccano, we implemented an active learning pipeline to accelerate the annotation process:
- Begin with a manually labeled seed sample.
- Train a lightweight model to pre-label the next batch.
- Manually review and correct those labels.
- Retrain and iterate.
This approach enabled scalable NER annotation with consistent quality and faster turnaround.
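To make the loop concrete, here is a minimal sketch of one iteration, assuming spaCy as the lightweight model and Doccano's JSONL span format for the review hand-off. The model choice, language code, example texts, and file names are illustrative, not a record of the exact 2020 stack.

```python
# One iteration of the active-learning loop: seed -> train -> pre-label -> review.
import json
import random

import spacy
from spacy.training import Example

def train_lightweight_ner(train_data, n_iter=20):
    """Train a small spaCy NER model on (text, [(start, end, label)]) pairs."""
    nlp = spacy.blank("fr")  # "ar" for the MSA dataset
    ner = nlp.add_pipe("ner")
    for _, ents in train_data:
        for _, _, label in ents:
            ner.add_label(label)
    examples = [
        Example.from_dict(nlp.make_doc(text), {"entities": ents})
        for text, ents in train_data
    ]
    nlp.initialize(get_examples=lambda: examples)
    for _ in range(n_iter):
        random.shuffle(examples)
        nlp.update(examples, losses={})
    return nlp

def prelabel_batch(nlp, texts):
    """Pre-label the next batch as Doccano-style JSONL rows for human review."""
    rows = []
    for text in texts:
        doc = nlp(text)
        spans = [[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]
        rows.append({"text": text, "label": spans})
    return rows

seed = [("Rabat est la capitale du Maroc.", [(0, 5, "LOC"), (25, 30, "LOC")])]
nlp = train_lightweight_ner(seed)
batch = prelabel_batch(nlp, ["Casablanca accueille le salon de l'agriculture."])
with open("to_review.jsonl", "w", encoding="utf-8") as f:
    for row in batch:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
# Corrected labels exported from Doccano are appended to the seed set,
# and the model is retrained on the next iteration.
```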
LLM-Era Curation and Darija Expansion (2023–2025)
In recent years, we expanded our focus to:
- Moroccan Colloquial Arabic (Darija): Creating curated and synthetic datasets for low-resource generative tasks.
- Instruction-tuning datasets: Covering Arabic, French, and Darija.
We adopted:
- Cloud-based labeling via Oracle Cloud Infrastructure (OCI) for supervised annotation workflows.
- Distilabel for building reliable, scalable pipelines for LLM-based synthetic data generation.
- LLMs (e.g., DeepSeek and OpenAI models) for synthetic data generation and evaluation.
Throughout this evolution, our dataset labeling work has spanned multiple NLP tasks, including Named Entity Recognition (NER), sentiment analysis, part-of-speech tagging, and instruction-based generation, all tailored to the linguistic and cultural nuances of the Moroccan context.
What began as a focused NER effort in formal languages has evolved into a broader, scalable pipeline for building multilingual and culturally relevant datasets for the Moroccan AI ecosystem.
3. Challenges and Lessons: Infrastructure Failures and Data Loss
At one point in our early workflow, we experienced a platform-related issue that led to the sudden loss of access to a critical annotation environment. In this case, our setup was hosted on Heroku, but the lesson applies broadly: when using third-party services or cloud platforms, failure to implement robust backup and monitoring strategies can result in irreversible data loss.
Without automated backups in place, we lost several weeks of manually labeled data—highlighting just how fragile early-stage datasets can be when treated informally.
Key Lessons:
- Always automate backups. Manual exports and ad hoc saves are not sufficient for high-value data.
- Treat annotation projects like production systems. Labeling environments deserve the same level of reliability and observability as any critical application.
- Version, monitor, and protect your datasets. Labeled data should be stored securely, version-controlled, and regularly audited.
This experience reshaped how we think about dataset infrastructure. Since then, we’ve adopted best practices for data resilience, with cloud-native pipelines that include automatic backups, storage redundancy, and tight version control.
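As a concrete illustration of the "automate backups" lesson, the sketch below snapshots every annotation export with a timestamp and SHA-256 checksum and records it in an auditable manifest. The directory layout and file names are assumptions for illustration; the same script can run from cron right after each export.

```python
# Timestamped, checksummed backups of annotation exports (illustrative paths).
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

BACKUP_DIR = Path("backups")

def snapshot(export_path: str) -> Path:
    """Copy a fresh annotation export into a timestamped, checksummed backup."""
    src = Path(export_path)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = BACKUP_DIR / f"{src.stem}_{stamp}{src.suffix}"
    BACKUP_DIR.mkdir(exist_ok=True)
    shutil.copy2(src, dest)
    # Append to a manifest so every snapshot is auditable.
    with (BACKUP_DIR / "manifest.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps({"file": dest.name, "sha256": digest,
                            "taken_at": stamp}) + "\n")
    return dest

# Scheduled right after each export, e.g.:
#   snapshot("ner_msa_export.jsonl")
```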
4. Cloud-Based Labeling and Model Development with OCI
After facing the limitations of earlier self-hosted tools, we migrated our entire data labeling and model development workflow to Oracle Cloud Infrastructure (OCI). This transition marked a turning point: it allowed us to consolidate annotation, training, and deployment into a cohesive, scalable, and secure environment.
Using OCI Data Labeling and OCI Data Science, as described in detail in my book Natural Language Processing on Oracle Cloud Infrastructure, we were able to:
- Centralize dataset management in an enterprise-grade cloud environment.
- Scale training using cost-effective GPU/CPU compute instances on demand.
- Integrate model deployment and monitoring directly into the development lifecycle using OCI’s built-in MLOps tools.
- Enforce secure storage, versioning, and team collaboration, eliminating earlier risks tied to fragile self-hosted platforms.
This cloud-native approach allowed us to move from exploratory workflows to reproducible, production-grade NLP pipelines—all while maintaining control over data privacy, cost, and governance.
Moreover, using OCI helped us bridge the gap between manual annotation loops and automated LLM-era pipelines by giving us the infrastructure needed to iterate faster, track quality, and operationalize our models confidently.
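For dataset storage specifically, a few lines of the official OCI Python SDK are enough to push each versioned snapshot into Object Storage. The bucket name and object naming scheme below are illustrative assumptions; in practice, bucket-level versioning and lifecycle policies add further protection.

```python
# Upload dataset snapshots to OCI Object Storage (pip install oci).
from pathlib import Path

import oci

config = oci.config.from_file()  # reads ~/.oci/config
client = oci.object_storage.ObjectStorageClient(config)
namespace = client.get_namespace().data

def upload_snapshot(local_path: str, bucket: str = "nlp-datasets") -> None:
    """Upload a dataset file; the object name encodes its dataset family."""
    path = Path(local_path)
    with path.open("rb") as f:
        client.put_object(namespace, bucket, f"ner/{path.name}", f)

upload_snapshot("backups/ner_msa_export_20250101T000000Z.jsonl")
```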
5. Enter the LLM Era: Using Distilabel with Both Open-Source and Commercial Models
As we moved into training instruction-tuned models, we adopted:
- Distilabel for building reliable, scalable pipelines for LLM-based synthetic data generation
- LLMs (e.g., DeepSeek and OpenAI models) for synthetic data generation and evaluation
Distilabel allows us to:
- Generate thousands of data samples programmatically
- Automate feedback with multiple LLMs
- Maintain high data quality with selective human review
This shift marked our transition from human-driven to AI-augmented dataset development—dramatically reducing turnaround time and annotation costs.
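As an illustration, here is a minimal Distilabel (1.x) pipeline of the kind we run: seed instructions in, LLM completions out, with human review applied to a sample of the output. The seed data and model choice are illustrative assumptions; a DeepSeek model can be plugged in through its OpenAI-compatible endpoint via the LLM's base_url parameter.

```python
# Minimal synthetic-generation pipeline with Distilabel 1.x.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="darija-instruction-tuning") as pipeline:
    # Seed instructions (in practice, loaded from our curated seed sets).
    load = LoadDataFromDicts(
        data=[{"instruction": "Explain in Moroccan Darija what a SIM card is."}]
    )
    # Generate responses with the chosen LLM; requires OPENAI_API_KEY.
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    # The resulting dataset is then sampled for selective human review.
    print(distiset)
```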
6. Reflections and Takeaways
| Phase | Tools & Infrastructure | Outcome |
|---|---|---|
| 2020 Manual Annotation | Doccano + self-hosted infrastructure (e.g., Heroku) | NER datasets for MSA and French; early Darija experimentation |
| Infrastructure Challenges | Limited backup & monitoring | Significant annotation loss; led to stricter data governance |
| OCI Transition | OCI Data Labeling + Data Science | End-to-end reproducible training and deployment workflows |
| 2023–2025 LLM Era | Distilabel + LLMs (e.g., DeepSeek, OpenAI) | Scalable pipelines for synthetic data and instruction tuning |
Core Lessons
These are the lessons we learned, sometimes the hard way, from five years of hands-on dataset curation at Typica.ai. They continue to shape how we approach NLP for underrepresented languages like Darija and how we build robust AI systems.
- Your data is your moat. Models are everywhere—what makes you competitive is your unique, well-curated data.
- Automate smartly. Use tools like Doccano and Distilabel to scale, but keep humans involved where it matters most.
- Auto-label iteratively. Start small, train, label more, refine—this loop saves time and improves quality with each cycle.
- Train your annotators. Well-trained, culturally aware annotators outperform generic labeling teams every time.
- Garbage in, garbage out. Bad data leads to biased, brittle, or untrustworthy models.
- Human-in-the-loop is essential. Even when using LLMs to generate or rate data, human oversight ensures trust and accuracy.
- Datasets are code. Version them, back them up, log changes. They’re your IP.
- Context is critical. For languages like Darija, you need local and cultural understanding baked into the data from day one.
- Hybrid pipelines win. Combine expert knowledge, automation, LLMs, and feedback loops for best results.
7. Conclusion
Our dataset curation journey mirrors the broader evolution of NLP—from small-scale, hand-labeled datasets to scalable, model-assisted pipelines capable of supporting cutting-edge LLM applications. What began as a grassroots effort with manual annotation has matured into a fully integrated ecosystem combining human expertise, cloud infrastructure, and AI-driven automation.
For teams working in low-resource languages or culturally specific domains, our experience provides a practical blueprint: start lean, automate where it matters, and treat your data as your most valuable asset. Every dataset decision—from annotation strategy to infrastructure choice—has a lasting impact on model quality and system trustworthiness.
We hope this retrospective helps others facing similar challenges—whether in open-source NLP efforts, responsible AI development, or national-scale digital transformation initiatives.
Looking back, we’ve learned that even the smallest experiments—if designed with care and purpose—can evolve into robust, future-proof pipelines. Foundational work only becomes obsolete if you stop building on it.
Resources
- MAGBERT-NER (French) Demo: https://huggingface.co/spaces/TypicaAI/MagBERT-NER-Fr
- Book: Natural Language Processing on Oracle Cloud Infrastructure, Apress, 2025
Author Bio
Hicham Assoudi is an AI researcher, Oracle expert, author, and founder of Typica.ai, a startup committed to building NLP tools for low-resource languages. He holds a Ph.D. in Artificial Intelligence and is an External Research Associate at UQAM's AI Lab (CRIA) in Montreal.
Contact
For questions, collaborations, or feedback, feel free to reach out:
📧 Email: [email protected]
🌐 Website: https://typica.ai
🔗 LinkedIn: linkedin.com/in/assoudi