Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.
What it is: MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi MiMo, MIT-licensed and available on HuggingFace, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.
The problem: most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.
The solution: staged training that combines large-scale mid-training, supervised fine-tuning, and a reinforcement learning algorithm specifically targeting the scenarios where conventional models break down. Native punctuation derived from prosody means transcripts arrive ready to use.
What makes it different: on the Open ASR Leaderboard, MiMo-V2.5-ASR posts a 5.73% average word error rate (WER) on English, below Whisper large-v3 at 7.44%. On Wu dialect it scores 19.55% vs FunASR-1.5 at 29.08%. On lyrics, it scores 3.95% on m4singer vs Gemini 2.5 Pro at 4.25%. These are not cherry-picked scenarios; they are the hard ones.
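For readers who want to check what a WER gap means in practice, here is a minimal sketch of the metric: word-level edit distance divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative and not part of any MiMo or leaderboard tooling. The sketch also assumes plain whitespace tokenization, whereas real evaluations typically normalize casing and punctuation first and often score Chinese at the character level.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction of the baseline's errors that the candidate removes."""
    return (baseline_wer - candidate_wer) / baseline_wer


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # English figures quoted above: Whisper large-v3 7.44% vs MiMo-V2.5-ASR 5.73%.
    print(f"{relative_reduction(7.44, 5.73):.1%}")
```

Read this way, the English figures quoted above correspond to roughly 23% fewer word errors than Whisper large-v3, which is a more intuitive framing than the 1.71-point absolute difference.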
Key features:
- 8 Chinese dialects natively supported, including Wu, Cantonese, Hokkien, and Sichuanese
- Chinese-English code-switching with no language tags
- Lyrics transcription under accompaniment and pitch variation
- Multi-speaker and noisy-environment robustness
- Native punctuation, no post-processing needed
- MIT license, Python API, Gradio demo, self-hostable
Benefits:
- Production-grade accuracy under the audio conditions that actually exist in the field
- One model replaces multiple regional or domain-specific ASR solutions
- Self-hosting eliminates per-call API costs and keeps data on your own infrastructure
- Ready-to-use punctuated output cuts one step from every downstream pipeline
Who it's for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.
Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point suggesting that the gap is now very small, and in some scenarios gone.



