# Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition
Gilbert-FR-Source is a French automatic speech recognition (ASR) model used as the research foundation for the Gilbert project.
It is designed as an internal scientific baseline enabling controlled experimentation, reproducible evaluation, and rigorous comparison across ASR architectures, datasets, and adaptation methods.
This model is not a fine-tuned derivative, but a curated research anchor used to support systematic studies in:
- domain adaptation,
- robustness to spontaneous and long-form speech,
- accented and low-resource linguistic profiles,
- telephony and bandwidth-constrained speech,
- multi-speaker and meeting transcription.
## 1. Research Motivation
The Gilbert project aims to build highly specialized ASR systems optimized for:
- professional meeting transcription (hybrid/remote),
- long-form multi-speaker discourse,
- institutional environments (education, public sector),
- constrained audio conditions (telephony, VoIP, low SNR),
- sociolinguistic diversity (African, Canadian, Belgian and other French accents).
While Whisper Large V3 provides strong baseline performance, its behavior under domain shift (spontaneous interactions, overlapping speech, degraded microphones) requires systematic study. Gilbert-FR-Source provides the frozen starting point for this line of research, ensuring controlled comparisons between experiments.
## 2. Scientific Goals and Research Questions
This model is used to answer a series of research questions:
### Q1. Long-form modeling
How does Whisper-L3 behave on meetings lasting 30–120 minutes, with natural topic shifts, interruptions, and pragmatic markers?
### Q2. Accent robustness
Which classes of French accents induce the strongest WER degradation?
How does robustness vary across FLEURS, African French, and Common Voice subsets?
### Q3. Telephony adaptation
What is the degradation curve as audio moves from 16 kHz source quality to 8 kHz narrowband and μ-law-compressed telephony conditions?
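These conditions can be approximated offline with standard resampling and companding operations. A minimal sketch, assuming torchaudio and an illustrative input file:

```python
# A rough sketch of the three conditions in Q3, assuming torchaudio;
# the input file name is illustrative.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("call_sample.wav")  # hypothetical recording

# Reference condition: 16 kHz, Whisper's native input rate.
wav_16k = F.resample(waveform, orig_freq=sr, new_freq=16_000)

# Narrowband condition: downsample to 8 kHz, then back to 16 kHz for the
# model, which preserves the telephony bandwidth loss.
wav_8k = F.resample(wav_16k, orig_freq=16_000, new_freq=8_000)
wav_8k_input = F.resample(wav_8k, orig_freq=8_000, new_freq=16_000)

# Compressed condition: 8-bit mu-law companding on the 8 kHz signal,
# approximating G.711-style telephony codecs.
codes = F.mu_law_encoding(wav_8k, quantization_channels=256)
wav_mulaw = F.mu_law_decoding(codes, quantization_channels=256)
wav_mulaw_input = F.resample(wav_mulaw, orig_freq=8_000, new_freq=16_000)
```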
### Q4. Domain adaptation efficiency
What is the marginal gain of targeted fine-tuning on professional meeting datasets (education, administration, healthcare)?
### Q5. Multilingual side-effects
To what extent does French fine-tuning affect cross-lingual generalization?
These research axes structure the development of future specialized Gilbert models.
## 3. Benchmark Reference Results
The following WER scores originate from established open benchmarks and serve as a reference baseline for future experiments:
| Dataset | WER |
|---|---|
| MLS (FR) | 3.98 % |
| Common Voice FR (v13.0) | 7.28 % |
| VoxPopuli (FR) | 8.91 % |
| FLEURS (FR) | 4.84 % |
| African Accented French | 4.20 % |
These results serve as the pre-fine-tuning reference: each score is an upper bound on the WER that targeted fine-tuning is expected to improve upon.
Future Gilbert variants will be evaluated using:
- internal meeting datasets,
- domain-specific corpora (administration, higher education, healthcare),
- accented speech corpora,
- telephony datasets,
- long-form evaluation methods (> 1 hour of audio), as sketched below.
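For the long-form track, a minimal chunked-decoding sketch using the Hugging Face Transformers ASR pipeline (the checkpoint shown is the upstream base model; the file name is illustrative):

```python
# A minimal long-form transcription sketch with the Transformers ASR
# pipeline; audio longer than 30 s is split into chunks and re-stitched.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # upstream base checkpoint
    chunk_length_s=30,                # Whisper's native window size
    return_timestamps=True,           # timestamps help measure long-context drift
)

result = asr("meeting_90min.wav")  # hypothetical >1 h recording
print(result["text"])
```

The returned timestamps also make it possible to compare hypothesis segments against reference segmentations when evaluating recordings longer than an hour.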
## 4. Architecture
The model is based on the Whisper Large V3 encoder–decoder architecture, offering:
- large multilingual pretraining,
- long-context modeling capacity,
- robust cross-lingual alignment,
- stable decoding for long outputs,
- strong zero-shot performance on French.
It is compatible with the following runtimes (a CTranslate2 example follows the list):
- Hugging Face Transformers,
- CTranslate2,
- ONNX Runtime,
- MLX (Apple Silicon),
- quantization-based acceleration pipelines.
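For example, the CTranslate2 path is commonly exercised through the faster-whisper wrapper. A minimal sketch, where the model identifier, device, and audio file are illustrative:

```python
# A minimal CTranslate2 inference sketch via faster-whisper; the model
# identifier, device, and audio file are illustrative.
from faster_whisper import WhisperModel

# int8_float16 is one common quantized compute type on GPU.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("audio_fr.wav", language="fr")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
```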
## 5. Methodology and Reproducibility
Gilbert-FR-Source is used in strict research settings emphasizing:
### Reproducible training protocols
- frozen weights for baseline comparison,
- controlled hyperparameter schedules,
- consistent evaluation datasets,
- deterministic decoding configurations (see the sketch below).
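For illustration, a deterministic decoding configuration can be pinned with Transformers as follows; the checkpoint ID is illustrative, and greedy decoding plus fixed seeds removes run-to-run sampling variance:

```python
# A minimal deterministic decoding sketch with Transformers; the
# checkpoint ID is illustrative.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, set_seed

set_seed(42)  # fixes Python, NumPy, and Torch RNG state

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model.eval()

def transcribe(waveform_16k):
    """Transcribe a 16 kHz mono waveform (loading/resampling omitted)."""
    features = processor(
        waveform_16k, sampling_rate=16_000, return_tensors="pt"
    ).input_features
    with torch.no_grad():
        ids = model.generate(
            features,
            language="fr",
            task="transcribe",
            do_sample=False,  # greedy decoding: no sampling
            num_beams=1,      # single beam: no tie-breaking variance
        )
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```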
### Evaluation methodology
WER is computed with standard normalization (lowercasing, punctuation removal).
More advanced metrics (diarization error rate, long-context drift) are included in internal research pipelines.
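A minimal sketch of this WER protocol, assuming the jiwer package (the example strings are illustrative):

```python
# A minimal WER computation following the normalization described above
# (lowercasing, punctuation removal), assuming the jiwer package.
import string
import jiwer

def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse repeated whitespace

reference = "Bonjour, bienvenue à la réunion."
hypothesis = "bonjour bienvenus à la réunion"

# 1 substitution over 5 reference words -> WER = 0.20
score = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {score:.2%}")
```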
### Versioning policy
This repository represents version 0.1 of the research baseline.
All future fine-tuned models will explicitly reference this version for traceability.
## 6. Limitations
This baseline inherits the known limitations of Whisper and of the underlying datasets:
- sensitivity to overlapping speech,
- occasional hallucinations in long-form decoding,
- domain shift on spontaneous dialogue,
- potential biases related to accent distribution in training data,
- suboptimal performance in telephony bandwidth.
Understanding and quantifying these limitations is one of the core objectives of the Gilbert research roadmap.
## 7. Future Work (Planned Research Directions)
The following models will be developed as independent checkpoints:
### Gilbert-FR-Longform-v1
Long meetings, multi-speaker interaction, discourse-level context stability.

### Gilbert-FR-Accents-v1
Robustness to regional and international French accents.

### Gilbert-FR-Telephone-v1
Optimized for 8 kHz VoIP/call-center speech.

### Gilbert-Multilingual-v1
Extended cross-lingual performance with optimized French anchors.
Each model will include detailed evaluation reports and will adhere to research reproducibility standards.
## 8. License
This repository includes files originally released under the MIT License; a copy of that license is included in the repository.
All future Gilbert models built on top of this baseline are the exclusive property of Lexia France.
## 9. Contact
For research collaboration, evaluation access, or technical inquiries:
- Website: https://gilbert-assistant.fr
- Email: [email protected]
Model: MEscriva/gilbert-fr-source
Base model: openai/whisper-large-v3