Bilingual ASR for dialects, code-switching, and songs – MiMo-V2.5 Voice


Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.

What it is: MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi MiMo, MIT-licensed and available on HuggingFace, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.

The problem: most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.

The solution: staged training that combines large-scale mid-training, supervised fine-tuning, and a reinforcement learning algorithm specifically targeting the scenarios where conventional models break down. Native punctuation derived from prosody means transcripts arrive ready to use.

What makes it different: on the Open ASR Leaderboard, MiMo-V2.5-ASR posts 5.73% average WER on English, below Whisper large-v3 at 7.44%. On Wu dialect it scores 19.55% vs FunASR-1.5 at 29.08%. On lyrics, it scores 3.95% on m4singer vs Gemini 2.5 Pro at 4.25%. These are not cherry-picked scenarios; they are the hard ones.
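For readers who want to sanity-check numbers like these, here is a minimal sketch of how word error rate is typically computed: word-level edit distance divided by the reference length. It is not the leaderboard's scoring code, which also normalizes text before comparing; the function names are illustrative. The source does not say which metric the dialect and lyrics figures use, so the relative-reduction helper simply treats each pair as two error rates on the same scale.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline error rate removed by the candidate."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Error-rate pairs quoted in the post: (baseline, MiMo-V2.5-ASR).
    quoted = {
        "English (Open ASR Leaderboard)": (7.44, 5.73),
        "Wu dialect": (29.08, 19.55),
        "Lyrics (m4singer)": (4.25, 3.95),
    }
    for name, (baseline, candidate) in quoted.items():
        print(f"{name}: {relative_reduction(baseline, candidate):.1%} relative reduction")
```

Run on the quoted figures, the relative reductions come out to roughly 23% on English, 33% on Wu dialect, and 7% on lyrics, which puts the three comparisons on one scale.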

Key features:

  • 8 Chinese dialects natively supported, including Wu, Cantonese, Hokkien, and Sichuanese

  • Chinese-English code-switching with no language tags

  • Lyrics transcription under accompaniment and pitch variation

  • Robustness to multi-speaker and noisy environments

  • Native punctuation, no post-processing needed

  • MIT license, Python API, Gradio demo, self-hostable

Benefits:

  • Production-grade accuracy under the audio conditions that actually exist in the field

  • One model replaces multiple regional or domain-specific ASR solutions

  • Self-hosting eliminates per-call API costs and keeps data on your own infrastructure

  • Ready-to-use punctuated output removes one step from every downstream pipeline

Who it's for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.

Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point suggesting that the gap is now very small, and in some scenarios gone.

