Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition

Gilbert-FR-Source is a French automatic speech recognition (ASR) model used as the research foundation for the Gilbert project.
It is designed as an internal scientific baseline enabling controlled experimentation, reproducible evaluation, and rigorous comparison across ASR architectures, datasets, and adaptation methods.

This model is not a fine-tuned derivative, but a curated research anchor used to support systematic studies in:

  • domain adaptation,
  • robustness to spontaneous and long-form speech,
  • accented and low-resource linguistic profiles,
  • telephony and bandwidth-constrained speech,
  • multi-speaker and meeting transcription.

1. Research Motivation

The Gilbert project aims to build highly specialized ASR systems optimized for:

  • professional meeting transcription (hybrid/remote),
  • long-form multi-speaker discourse,
  • institutional environments (education, public sector),
  • constrained audio conditions (telephony, VoIP, low SNR),
  • sociolinguistic diversity (African, Canadian, Belgian and other French accents).

While Whisper Large V3 provides strong baseline performance, its behavior under domain shifts (spontaneous interactions, overlapping speech, degraded microphones) requires systematic study.
Gilbert-FR-Source provides the frozen starting point for this line of research, ensuring controlled comparisons between experiments.


2. Scientific Goals and Research Questions

This model is used to answer a series of research questions:

Q1. Long-form modeling

How does Whisper Large V3 behave on meetings lasting 30–120 minutes, with natural topic shifts, interruptions, and pragmatic markers?

Q2. Accent robustness

Which classes of French accents induce the strongest WER degradation?
How does robustness vary across FLEURS, African French, and Common Voice subsets?

Q3. Telephony adaptation

What is the degradation curve across telephony conditions: native 16 kHz audio, audio downsampled to 8 kHz, and μ-law compressed audio?
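
A minimal sketch of that condition ladder, assuming torchaudio and a placeholder input file (the internal degradation pipeline is not published):

```python
import torchaudio
import torchaudio.functional as F

# Load a 16 kHz source recording (placeholder path).
waveform, sr = torchaudio.load("meeting.wav")

# Condition 1 -> 2: band-limit to telephony bandwidth (16 kHz -> 8 kHz).
narrowband = F.resample(waveform, orig_freq=sr, new_freq=8000)

# Condition 2 -> 3: mu-law companding round-trip (256 quantization levels).
companded = F.mu_law_decoding(F.mu_law_encoding(narrowband, 256), 256)

# Whisper consumes 16 kHz input, so upsample the degraded signal back.
degraded = F.resample(companded, orig_freq=8000, new_freq=16000)
torchaudio.save("meeting_telephony.wav", degraded, 16000)
```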

Q4. Domain adaptation efficiency

What is the marginal gain of targeted fine-tuning on professional meeting datasets (education, administration, healthcare)?

Q5. Multilingual side-effects

To what extent does French fine-tuning affect cross-lingual generalization?

These research axes structure the development of future specialized Gilbert models.


3. Benchmark Reference Results

The following WER scores originate from established open benchmarks and serve as a reference baseline for future experiments:

  Dataset                       WER
  ---------------------------------
  MLS (FR)                      3.98 %
  Common Voice FR (v13.0)       7.28 %
  VoxPopuli (FR)                8.91 %
  FLEURS (FR)                   4.84 %
  African Accented French       4.20 %

Because lower WER is better, these scores act as upper bounds that targeted fine-tuning is expected to improve upon.
Future Gilbert variants will be evaluated using:

  • internal meeting datasets,
  • domain-specific corpora (administration, higher education, healthcare),
  • accented speech corpora,
  • telephony datasets,
  • long-form evaluation methods (> 1 hour audio).
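
For illustration, a hedged sketch of how one public row above (Common Voice FR) might be reproduced; the dataset id, split, and slice size are assumptions, and matching the reported numbers also requires the text normalization described in section 5:

```python
import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MEscriva/gilbert-fr-source",
    generate_kwargs={"language": "french", "task": "transcribe"},
)

# Common Voice 13.0 is gated: accept its terms on the Hub before loading.
ds = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

wer = evaluate.load("wer")
preds, refs = [], []
for sample in ds.select(range(100)):  # small slice for illustration
    preds.append(asr(sample["audio"]["array"])["text"])
    refs.append(sample["sentence"])

print(f"WER: {wer.compute(predictions=preds, references=refs):.2%}")
```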

4. Architecture

The model is based on the Whisper Large V3 encoder–decoder architecture, offering:

  • large multilingual pretraining,
  • long-context modeling capacity,
  • robust cross-lingual alignment,
  • stable decoding for long outputs,
  • strong zero-shot performance on French.

It is compatible with:

  • Hugging Face Transformers,
  • CTranslate2,
  • ONNX Runtime,
  • MLX (Apple Silicon),
  • quantization-based acceleration pipelines.
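
As an example of the first route, a minimal Transformers loading sketch, assuming the checkpoint is published under the repository id MEscriva/gilbert-fr-source:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MEscriva/gilbert-fr-source",  # Whisper Large V3 weights
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Chunked long-form decoding with French transcription forced.
result = asr(
    "meeting.wav",  # placeholder path
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```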

5. Methodology and Reproducibility

Gilbert-FR-Source is used in strict research settings emphasizing:

Reproducible training protocols

  • frozen weights for baseline comparison,
  • controlled hyperparameter schedules,
  • consistent evaluation datasets,
  • deterministic decoding configurations.
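
As an illustration of the last point, one possible deterministic decoding configuration for Whisper-style checkpoints in Transformers (assumed values, not the project's published settings):

```python
# Illustrative deterministic decoding settings for Whisper-style models in
# Transformers (assumed values, not the project's published configuration).
deterministic_generate_kwargs = {
    "language": "french",
    "task": "transcribe",
    "num_beams": 1,      # greedy search
    "do_sample": False,  # no stochastic sampling
    "temperature": 0.0,  # keep Whisper's temperature fallback disabled
}
```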

Evaluation methodology

WER is computed with standard normalization (lowercasing, punctuation removal).
More advanced metrics (diarization error rate, long-context drift) are included in internal research pipelines.
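
A minimal sketch of that normalization and scoring step, using the jiwer library (an assumption; the internal scoring code is not published):

```python
import re

import jiwer

def normalize(text: str) -> str:
    """Apply the card's normalization: lowercase, strip punctuation."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

reference = "Bonjour, bienvenue à la réunion."
hypothesis = "bonjour bienvenue à la réunion"

print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0
```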

Versioning policy

This repository represents version 0.1 of the research baseline.
All future fine-tuned models will explicitly reference this version for traceability.


6. Limitations

This baseline inherits the known limitations of Whisper and of the underlying datasets:

  • sensitivity to overlapping speech,
  • occasional hallucinations in long-form decoding,
  • domain shift on spontaneous dialogue,
  • potential biases related to accent distribution in training data,
  • suboptimal performance in telephony bandwidth.

Understanding and quantifying these limitations is one of the core objectives of the Gilbert research roadmap.


7. Future Work (Planned Research Directions)

The following models will be developed as independent checkpoints:

  • Gilbert-FR-Longform-v1
    Long meetings, multi-speaker interaction, discourse-level context stability.

  • Gilbert-FR-Accents-v1
    Robustness to regional and international French accents.

  • Gilbert-FR-Telephone-v1
    Optimized for 8 kHz VoIP/call-center speech.

  • Gilbert-Multilingual-v1
    Extended cross-lingual performance with optimized French anchors.

Each model will include detailed evaluation reports and will adhere to research reproducibility standards.


8. License

This repository includes files originally distributed under the MIT License; a copy of the license is included.

All future Gilbert models built on top of this baseline are the exclusive property of Lexia France.


9. Contact

For research collaboration, evaluation access, or technical inquiries, contact the Lexia France team.
