Speech Lab

Mixat: A Data Set of Bilingual Emirati-English Speech

15 hours
5+ male speakers (train), 1 female speaker (test)
5,316 utterances
ASR, Code-switching

Mixat

@inproceedings{al-ali-aldarmaki-2024-mixat,
    title = "Mixat: A Data Set of Bilingual Emirati-{E}nglish Speech",
    author = "Al Ali, Maryam Khalifa and Aldarmaki, Hanan",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    year = "2024",
}

PolyWER

@inproceedings{kadaoui-etal-2024-polywer,
    title = "{P}oly{WER}: A Holistic Evaluation Framework for Code-Switched Speech Recognition",
    author = "Kadaoui, Karima and Ali, Maryam Al and Toyin, Hawau Olamide and Mohammed, Ibrahim and Aldarmaki, Hanan",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.356/",
}

Mixat is an ASR dataset of Emirati Arabic speech code-mixed with English. It consists of 15 hours of speech from two public podcasts featuring native Emirati speakers, used with the hosts' permission for research purposes. The first podcast (the training set) consists of conversations between the host and various guests, from which we extracted 3,728 utterances. The second podcast (the test set) is a single-speaker podcast, from which we extracted 1,588 utterances. See the Mixat paper for more details on dataset construction.
In addition to the original reference transcriptions, where code-switched English segments are transcribed in Latin script, we provide two further transcription types: one in which the English segments are transliterated into Arabic script, and one in which they are translated into Emirati Arabic. The transliterated and translated segments are enclosed in brackets so they can be easily separated from the original Arabic text. These annotations were added to support the implementation of the PolyWER metric. ❗ Note that the transcriptions on Hugging Face differ from the original ones provided in Mixat: transcription errors were found and corrected. If you use the dataset or transcriptions provided on Hugging Face, please cite both papers.
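Because the transliterated and translated segments are bracketed, they can be pulled apart from the surrounding Arabic text with a simple pattern match. The following is a minimal sketch, not the authors' tooling: it assumes the brackets are square brackets (`[...]`) and are not nested, and the function name `split_segments` and the placeholder tokens in the example are made up for illustration. Check the actual transcripts for the bracket style before relying on it.

```python
import re

# Assumption: annotated segments are wrapped in square brackets and never nested.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def split_segments(transcript: str) -> tuple[str, list[str]]:
    """Separate bracketed segments from the rest of a transcript.

    Returns the text outside the brackets (whitespace-normalized) and
    the list of bracketed segments in order of appearance.
    """
    inside = [segment.strip() for segment in BRACKETED.findall(transcript)]
    # Replace each bracketed span with a space, then collapse repeated spaces.
    outside = " ".join(BRACKETED.sub(" ", transcript).split())
    return outside, inside


# Placeholder tokens stand in for real words; real transcripts are in Arabic script.
outside, inside = split_segments("w1 w2 [seg one] w3 [seg two]")
print(outside)
print(inside)
```

The same function works unchanged on Arabic-script text, since the pattern only looks at the bracket characters.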