Speech Lab

Classical Arabic Text-to-Speech Corpus

12 hours
1 male speaker
9,705 utterances
TTS, ASR

BibTeX

@inproceedings{kulkarni2023clartts,
  title={ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus},
  author={Kulkarni, Ajinkya and Kulkarni, Atharva and Shatnawi, Sara Abedalmon'em Mohammad and Aldarmaki, Hanan},
  booktitle={Proc. Interspeech 2023},
  pages={5511--5515},
  year={2023},
  doi={10.21437/Interspeech.2023-2224}
}

The Classical Arabic Text-to-Speech corpus (ClArTTS) is constructed from public-domain audio from the LibriVox project. Specifically, we used a single audiobook, Kitab Adab al-Dunya w'al-Din by Al-Mawardi (972-1058 AD), recorded by a male speaker. The audio is sampled at 40100 Hz. Before segmentation, we recruited native Arabic speakers to manually transcribe and validate the audio, including full diacritics. We then processed and segmented the original audio into shorter segments of 2 to 10 seconds, and discarded samples that diverged in speaking style. In total, we kept around 12 hours of audio (9,705 utterances), split into train and test subsets of 9,500 and 205 utterances, respectively. The dataset has been used for research on Arabic text-to-speech, ASR, and diacritic restoration. See the paper for more details on dataset construction and text-to-speech baselines.
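The duration filter and train/test split described above can be sketched roughly as follows. This is a minimal illustration and not the pipeline used to build the corpus: the helper names `filter_by_duration` and `split_train_test`, the `(utterance_id, duration)` tuple layout, and the seeded random split are all assumptions made for the example. Only the 2 to 10 second bounds and the 205-utterance test size come from the description above.

```python
import random

MIN_SEC, MAX_SEC = 2.0, 10.0   # duration bounds stated in the description
TEST_SIZE = 205                # test-set size stated in the description


def filter_by_duration(segments, lo=MIN_SEC, hi=MAX_SEC):
    """Keep (utt_id, duration_sec) pairs whose duration lies in [lo, hi].

    Hypothetical helper: the real corpus also drops segments by speaking
    style, which is not modeled here.
    """
    return [(utt_id, dur) for utt_id, dur in segments if lo <= dur <= hi]


def split_train_test(segments, test_size=TEST_SIZE, seed=0):
    """Shuffle with a fixed seed and hold out `test_size` items for testing.

    Assumption: a random split. How the actual split was chosen is not
    stated in the description.
    """
    if test_size > len(segments):
        raise ValueError("test_size exceeds the number of segments")
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


if __name__ == "__main__":
    # Synthetic durations standing in for real segment metadata.
    rng = random.Random(1)
    fake = [(f"utt_{i:05d}", rng.uniform(0.5, 12.0)) for i in range(12000)]
    kept = filter_by_duration(fake)
    train, test = split_train_test(kept)
    print(len(kept), len(train), len(test))
```

With the released corpus, the same split helper applied to all 9,705 utterances would yield the 9,500:205 proportions, though not necessarily the same utterances as the official subsets.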