
2nd MLC-SLM Challenge Launches, Advancing Multilingual Conversational Speech Understanding

From: Nexdata | Date: 04/13/2026

As large language models continue to advance speech AI, Speech LLMs are evolving beyond basic speech recognition toward deeper understanding of real-world conversations. For researchers, the key bottleneck to further progress often lies not only in model architecture itself, but also in the availability of high-quality, diverse, and realistic training and evaluation data.

The Second Multilingual Dialogue Speech Language Model Challenge (MLC-SLM Challenge 2026) is centered on this issue. Compared with the first edition, this year’s challenge will release a larger-scale multilingual dialogue speech dataset with broader language coverage and richer accent diversity, further supporting research and evaluation of Speech LLMs in speaker diarization, speech recognition, acoustic understanding, and semantic understanding.

Why Data Remains a Key Bottleneck for Speech LLMs

In recent years, Speech LLMs have made significant progress in automatic speech recognition tasks, and challenge results show that transcription-centered modeling capabilities are maturing rapidly. However, when research moves into real-world dialogue scenarios, the problems become more complex: who is speaking, when speakers change turns, how speech conveys meaning, and how the entire dialogue is understood in context. These challenges are all far more difficult than single-utterance transcription.

The first MLC-SLM workshop validated the importance of real-world multilingual dialogue data. The results showed that speech recognition performance is relatively strong, but speaker diarization remains a core challenge in complex multilingual, multi-turn dialogue scenarios. Therefore, the focus of the second competition is no longer limited to “hearing the content clearly,” but instead further pushes models to understand the interplay among dialogue structure, acoustic cues, and semantic information.

A Dataset More Suitable for Real-World Research Problems

This year’s challenge training set covers 14 languages, including English, French, German, Italian, Portuguese, Spanish, Japanese, Korean, Russian, Thai, and Vietnamese, as well as the newly added Tagalog, Urdu, and Turkish. The dataset totals approximately 2,100 hours, including about 500 hours of English, around 100 hours for most other languages, and approximately 200 hours each for French, Portuguese, and Spanish.
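The stated per-language figures are consistent with the roughly 2,100-hour total. A quick sketch of the budget (the numbers below are the approximate values quoted above, not official dataset statistics):

```python
# Approximate training-set hour budget as stated in the announcement.
# These figures are rough approximations, not official statistics.
hours = {
    "English": 500,       # ~500 h, five accent varieties
    "French": 200,        # ~200 h each for French, Portuguese, Spanish
    "Portuguese": 200,
    "Spanish": 200,
}
# Remaining languages at roughly 100 hours each.
for lang in ["German", "Italian", "Japanese", "Korean", "Russian",
             "Thai", "Vietnamese", "Tagalog", "Urdu", "Turkish"]:
    hours[lang] = 100

print(len(hours), "languages,", sum(hours.values()), "hours")
```

Summing gives 14 languages and about 2,100 hours, matching the announced total.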

More importantly, this is not a “laboratory-style” speech dataset, but a data resource built around real-world dialogue. All recordings consist of natural two-speaker conversations on randomly assigned topics, with an emphasis on natural fluency, semantic coherence, and authentic communication characteristics. As a result, the dataset more closely reflects the real-world input conditions that future voice interaction systems are likely to encounter.

Not Only Multilingual, but Also Multi-Accent

For global Speech LLM research, multilinguality alone is not sufficient. In real-world applications, many of the key challenges models face arise from regional variation, pronunciation differences, and speaking-style diversity within the same language.

This dataset further strengthens that dimension. The English portion not only remains large in scale, but also covers five accent varieties: American, British, Australian, Indian, and Filipino English. In addition, the dataset includes Canadian French, Mexican Spanish, and Brazilian Portuguese. This design makes the dataset valuable not only for cross-lingual training, but also for studying a model’s ability to generalize across regional varieties.

The value of this dataset also lies in the fact that it is not designed for a single benchmark, but explicitly supports two key task directions.

Task 1: Multilingual Conversational Speech Diarization and Recognition

Task 2: Multilingual Conversational Speech Understanding

For more details, please visit the challenge page.

From “Transcription” to “Dialogue Understanding”

If the previous stage of research focused on improving speech recognition performance, the key question at this stage has become: can a model truly understand a multi-speaker, multi-turn, multilingual dialogue?

The second MLC-SLM Challenge sends a clear signal: future Speech LLMs should not only pursue lower transcription error rates, but also develop a more complete and unified set of capabilities across speaker modeling, acoustic cue extraction, semantic reasoning, and contextual understanding. The multilingual dialogue speech dataset released for this purpose provides the foundation for advancing the next stage of research.

From broader language coverage and more realistic dialogue scenarios to support for the joint modeling of speaker, acoustic, and semantic information, the dataset released in the second MLC-SLM Challenge is not only a core resource for the challenge itself, but also provides a more reliable experimental foundation for the next stage of Speech LLM research. We expect this dataset to help more research teams develop new methods, establish new baselines, and jointly advance multilingual Speech LLMs toward a more complete and practical stage of development.