ArTST: Unified Arabic Speech Recognition and Synthesis Models

Our group develops datasets and models for Arabic speech processing, including Automatic Speech Recognition (ASR), Text-to-Speech synthesis (TTS), and diacritic restoration. We aim to widen coverage of spoken varieties by including regional dialects and code-switching. The Arabic Speech and Text Transformer (ArTST) project is built on the premise that optimizing performance for Arabic speech requires building models for Arabic from the get-go, rather than fine-tuning English-centric or multilingual models. While multilingual models are impressive, they tend to underperform on specific target languages compared to monolingual models trained with the same amount of data.
Our models are currently all based on the SpeechT5 architecture, notable for its unified multi-modal pre-training approach. It can handle both speech and text as input and output, allowing a single pre-trained foundation model to be effectively fine-tuned for diverse tasks like Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). This offers flexibility compared to many frameworks limited to a single output modality (e.g. Whisper, at the time of this writing, only supports text output). Our first version of ArTST (v1), which received the best paper award at ArabicNLP 2023, supported only Modern Standard Arabic (MSA). Subsequent versions incorporate multiple dialects (v2) and additional languages (v3) to support dialectal speech and code-switching. We also recently added a version pre-trained with diacritized text (v1.5).
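As a quick illustration of this unified design, here is a minimal sketch (using the fine-tuned checkpoints listed in the tables below) that loads the same SpeechT5 architecture with two different task heads through the standard `transformers` classes:

```python
from transformers import SpeechT5ForSpeechToText, SpeechT5ForTextToSpeech, SpeechT5Processor

# Same SpeechT5 backbone, two task-specific heads fine-tuned from ArTST:
asr_model = SpeechT5ForSpeechToText.from_pretrained("MBZUAI/artst_asr")                 # speech -> text
tts_model = SpeechT5ForTextToSpeech.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")   # text -> speech

# A single processor handles both text tokenization and audio feature extraction.
processor = SpeechT5Processor.from_pretrained("MBZUAI/artst_asr")
```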
ArTST Versions: Choosing the Right Pre-Trained Model
| Version | Pre-training Data Focus | Key Highlight | Recommended Use Case |
| --- | --- | --- | --- |
| v1 | Modern Standard Arabic (MSA) | Foundational MSA model (12.8% WER on MGB2) | High-quality MSA ASR/TTS (undiacritized) |
| v2 | MSA + Arabic Dialects (11+) | Best average dialectal ASR performance | General Dialectal ASR |
| v3 | MSA + Dialects + EN/FR | Handles Arabic-EN/FR code-switching | Code-Switching ASR |
| v1.5 | MSA + Diacritized Text | Optimized for tasks requiring diacritics | High-quality Diacritized MSA TTS |
Getting Started with ArTST
Selecting a task-specific ArTST model
| Version (Fine-tuned Model ID) | Based On (Pre-trained) | Fine-tuning Focus | Task |
| --- | --- | --- | --- |
| MBZUAI/artst_asr | v1 (MSA Pre-trained) | MGB2 dataset (MSA) | ASR |
| MBZUAI/artst_asr_v2 | v2 (Dialectal Pre-trained) | MGB2 dataset (MSA) | ASR |
| MBZUAI/artst_asr_v3 | v3 (Multilingual Pre-trained) | MGB2 dataset (MSA) | ASR |
| MBZUAI/artst_asr_v2_qasr | v2 (Dialectal Pre-trained) | QASR dataset (Dialectal/MSA) | ASR |
| MBZUAI/artst_asr_v3_qasr | v3 (Multilingual Pre-trained) | QASR dataset (Dialectal/MSA) | ASR |
| MBZUAI/speecht5_tts_clartts_ar | v1 (MSA Pre-trained) | MGB2 + ClArTTS dataset (MSA) | TTS |
These fine-tuned models can be used directly with the Hugging Face `transformers` library, as shown in the examples below. If you need the Fairseq toolkit methods used in our papers, you can find code examples in the demo notebooks in our GitHub repository.
Example: Automatic Speech Recognition (ASR)
First, install the required dependencies with `pip install transformers torch datasets soundfile librosa`.
from transformers import pipeline
import soundfile as sf
import librosa # Needed for resampling
import torch
import os
# --- Configuration ---
# 1. Select the appropriate ArTST ASR model ID from Hugging Face Hub
# Example: Using v1 for MSA ASR. Find more: https://huggingface.co/collections/MBZUAI/artst-arabic-text-speech-transformer-672cb44bb4215fd38814aeef
model_id = "MBZUAI/artst_asr"
# 2. Specify the path to your audio file
audio_path = "path/to/your/arabic_audio.wav" # IMPORTANT: Replace with your audio file path
# 3. Set target sample rate (ArTST models require 16kHz)
TARGET_SR = 16000
# --- End Configuration ---
# --- Audio Loading and Preprocessing ---
speech = None
if not os.path.exists(audio_path):
    print(f"Error: Audio file not found at {audio_path}")
else:
    try:
        speech, sample_rate = sf.read(audio_path)
        print(f"Loaded audio: {audio_path}, Sample Rate: {sample_rate}Hz, Duration: {len(speech)/sample_rate:.2f}s")
        # Ensure mono audio
        if speech.ndim > 1:
            print("Audio appears to be stereo, converting to mono...")
            speech = speech.mean(axis=1)
        # Resample if necessary
        if sample_rate != TARGET_SR:
            print(f"Resampling audio from {sample_rate}Hz to {TARGET_SR}Hz...")
            speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=TARGET_SR)
            sample_rate = TARGET_SR  # Update sample rate after resampling
            print("Resampling complete.")
    except Exception as e:
        print(f"Error loading or processing audio file: {e}")
        print("Ensure 'libsndfile' (Linux: sudo apt-get install libsndfile1) and 'ffmpeg' are installed for broader format support.")
        speech = None  # Discard partially processed audio so transcription is skipped below
# --- End Audio Processing ---
# --- ASR Inference ---
if speech is not None:
    print(f"\nInitializing ASR pipeline with model: {model_id}")
    # Use GPU if available (device=0), otherwise CPU (device=-1)
    device_id = 0 if torch.cuda.is_available() else -1
    print(f"Using device: {'cuda:0' if device_id == 0 else 'cpu'}")
    asr_pipeline = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        device=device_id
    )
    print("Transcribing audio (this may take a moment for longer files)...")
    # Chunk long audio files to manage memory; overlapping strides smooth the
    # transcription at chunk boundaries.
    transcription_result = asr_pipeline(
        speech.copy(),          # Pass a copy if you need the original array later
        chunk_length_s=30,      # Process the audio in 30-second chunks
        stride_length_s=(5, 0)  # Overlap chunks slightly (5s here) for smoother transcription
    )
    print("\n--- Transcription Result ---")
    print(transcription_result["text"])
    print("---------------------------")
else:
    print("\nCannot proceed with transcription due to audio loading/processing errors.")
# --- End ASR Inference ---
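If you prefer to work with the model classes directly rather than the pipeline (e.g., to control generation), the following minimal sketch uses the standard SpeechT5 ASR classes from `transformers`; the audio path and `max_length` value are placeholders:

```python
import librosa
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

model_id = "MBZUAI/artst_asr"
processor = SpeechT5Processor.from_pretrained(model_id)
model = SpeechT5ForSpeechToText.from_pretrained(model_id)

# Load the audio as 16 kHz mono, as in the pipeline example above.
speech, _ = librosa.load("path/to/your/arabic_audio.wav", sr=16000, mono=True)

inputs = processor(audio=speech, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(**inputs, max_length=450)  # placeholder generation length
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```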
Example: Text-to-Speech (TTS)
First, install the required dependencies with `pip install transformers datasets torch soundfile`.
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf
import time
# --- Configuration ---
# 1. Select the ArTST TTS model ID (e.g., v1 for MSA)
# Find more: https://huggingface.co/models?search=mbzuai-nlp/artst
model_id = "MBZUAI/speecht5_tts_clartts_ar"
vocoder_id = "microsoft/speecht5_hifigan" # Standard HiFi-GAN vocoder for SpeechT5
# 2. Input text (use diacritized text for v1.5)
text_input = "لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر"
# 3. Output audio file path
output_filename = "artst_tts_output.wav"
# --- End Configuration ---
# --- Model Loading ---
start_load_time = time.time()
print(f"Loading TTS components: {model_id} & {vocoder_id}")
try:
    processor = SpeechT5Processor.from_pretrained(model_id)
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_id)
    # Move models to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    vocoder.to(device)
    print(f"Models loaded to {device} in {time.time() - start_load_time:.2f}s")
except Exception as e:
    print(f"Error loading models: {e}")
    exit()
# --- End Model Loading ---
# --- Speaker Embedding Loading ---
print("Loading speaker embeddings (required by SpeechT5)...")
try:
    # Arabic x-vector speaker embeddings; SpeechT5 requires a speaker embedding input.
    embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
    # Index 105 is just one example speaker; experiment with different indices.
    speaker_embeddings = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0).to(device)
    print("Speaker embeddings loaded.")
except Exception as e:
    print(f"Warning: Could not load speaker embeddings dataset: {e}. Using random embeddings as fallback.")
    speaker_embeddings = torch.randn((1, 512)).to(device)  # Fallback
# --- End Speaker Embedding Loading ---
# --- Speech Generation ---
print("Processing text and generating speech...")
start_gen_time = time.time()
inputs = processor(text=text_input, return_tensors="pt").to(device)
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
generation_time = time.time() - start_gen_time
print(f"Speech generated in {generation_time:.2f}s")
# --- End Speech Generation ---
# --- Saving Audio ---
print(f"Saving generated audio to {output_filename}...")
try:
    sf.write(output_filename, speech.cpu().numpy(), samplerate=16000)
    print(f"Successfully saved audio file. Duration: {len(speech)/16000:.2f}s")
except Exception as e:
    print(f"Error saving audio file: {e}")
# --- End Saving Audio ---
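Alternatively, recent versions of `transformers` provide a "text-to-speech" pipeline that bundles the processor, model, and (for SpeechT5 checkpoints) a default HiFi-GAN vocoder. The sketch below assumes this pipeline handles the ArTST checkpoint like any other SpeechT5 TTS model:

```python
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import pipeline

# Wraps processor + model + vocoder in one call (assumes a recent transformers version).
tts = pipeline("text-to-speech", model="MBZUAI/speecht5_tts_clartts_ar")

# Speaker embedding, as in the example above.
embeddings = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings[105]["speaker_embeddings"]).unsqueeze(0)

out = tts("لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر",
          forward_params={"speaker_embeddings": speaker_embedding})
sf.write("artst_tts_pipeline_output.wav", out["audio"], samplerate=out["sampling_rate"])
```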
Advanced Usage with Fairseq
For specific research experiments, particularly those involving Language Model (LM) fusion during ASR decoding (as described in our papers), the Fairseq toolkit was employed. This generally requires a deeper setup, including cloning the ArTST repository and running specific command-line scripts provided therein.
If your work requires these advanced capabilities, please consult the detailed instructions and scripts within the ArTST GitHub repository.
# Note: This is a conceptual example. Refer to the ArTST repo for actual commands.
# Example: Running Fairseq generation for ASR with an external LM
# fairseq-generate /path/to/audio/manifest.tsv \
# --config-yaml config_asr.yaml \
# --gen-subset test_set_name \
# --task speech_to_text \
# --path /path/to/artst_model_checkpoint.pt \
# --max-tokens 1000000 \
# --beam 10 \
# --scoring wer \
# --lm-path /path/to/external_language_model.pt \
# --lm-weight 0.5 \
# --word-score -1 \
# --results-path /path/to/output/results
Research Highlights & Key Findings
Our research with ArTST has yielded several key insights into Arabic speech processing:
MSA Performance (ArTST v1)
ASR: Fine-tuned on the MGB2 dataset, ArTST v1 achieves a word error rate (WER) of 12.8% on MGB2.
TTS: We also fine-tuned the model for Text-to-Speech synthesis on the ClArTTS dataset. One of our interesting findings is that reasonable TTS performance can be achieved without the inclusion of diacritics. Prior efforts on Arabic TTS rely on diacritic restoration, as TTS systems are generally trained with relatively small amounts of data and rely on short contexts for sound synthesis. The lack of diacritics means that the model has to infer the pronunciation of short vowels from context, which is unrealistic unless large amounts of data (and large models) are used. Because ArTST is pre-trained on ~1000 hours of Arabic speech, we were able to achieve decent TTS performance without any diacritics, using roughly 12 hours of TTS training data. Furthermore, we experimented with using ASR data from MGB2 for "pre-fine-tuning" for TTS, where we first fine-tune on MGB2 data, then on ClArTTS. Note that ASR data are generally not suitable for TTS training since they are typically noisy and inconsistent in style, speaking rate, emotion, etc. We find that this process yields further improvements in TTS quality. We refer to this model as ArTST*; you can listen to some samples here.
Dialectal ASR and Code-Switching (ArTST v2 & v3)
- For MSA, v1 and v2 perform equally well.
- For dialects, the joint v2 model with dialect inference achieves the best performance on average.
- Neither v1 nor v2 can handle code-switching with other languages.
- v3 performs well on Arabic-English and Arabic-French code-switching, at the cost of somewhat lower performance on monolingual Arabic (see the snippet below for mapping these use cases to checkpoints).
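Based on these findings and the fine-tuning table above, one illustrative way to pick a checkpoint per scenario (the mapping below is our suggestion, not an official API) is:

```python
from transformers import pipeline

# Illustrative mapping from use case to the fine-tuned checkpoints listed earlier.
ASR_MODEL_FOR_USE_CASE = {
    "msa": "MBZUAI/artst_asr",                # v1: MSA
    "dialectal": "MBZUAI/artst_asr_v2_qasr",  # v2: best average dialectal performance
    "code_switching": "MBZUAI/artst_asr_v3",  # v3: Arabic-EN/FR code-switching
}

asr = pipeline("automatic-speech-recognition", model=ASR_MODEL_FOR_USE_CASE["dialectal"])
```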
Diacritics in Speech Processing (ArTST v1.5 & Multimodal Methods)
Multimodal Diacritic Restoration: Subsequently, we explored the potential of speech as an additional signal for diacritic restoration. As a use case, consider the resources available for Arabic ASR, such as MGB2 (1000 hours) and QASR (2000 hours); these datasets contain speech and text transcripts, but most of the text contains no diacritics. Could we use both the speech and the text for accurate diacritic restoration? Indeed, our paper describing such a model was published at NAACL 2024. The proposed model incorporates a Whisper model fine-tuned on our ClArTTS dataset to produce diacritized transcripts, in addition to the raw text input. Using cross-attention, the network integrates the ASR predictions with the correct undiacritized reference text to restore the missing diacritics. Our findings reveal that this approach is effective, cutting the diacritic error rate (DER) in half on the ClArTTS test set. We also tested on out-of-domain MSA data and observed some reduction in DER, but the effect was smaller. Overall, DER on MSA data remained rather high, even when using popular open- and closed-source text-based diacritic restoration models.
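The exact architecture is detailed in the NAACL 2024 paper; the following is only a conceptual PyTorch sketch of the cross-attention fusion idea, with invented dimensions, layer counts, and label inventory: the undiacritized reference text attends over the diacritized ASR hypothesis, and a per-character classifier predicts the diacritic.

```python
import torch
import torch.nn as nn

class MultimodalDiacritizer(nn.Module):
    """Conceptual sketch: fuse undiacritized reference text with a (possibly noisy)
    diacritized ASR hypothesis via cross-attention, then label each character of
    the reference text with a diacritic class."""
    def __init__(self, text_vocab, asr_vocab, n_diacritics, d_model=256, n_heads=4):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.asr_emb = nn.Embedding(asr_vocab, d_model)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.asr_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        # Reference-text positions query the ASR hypothesis for acoustic evidence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_diacritics)

    def forward(self, text_ids, asr_ids):
        text_h = self.text_enc(self.text_emb(text_ids))   # (B, T_text, D)
        asr_h = self.asr_enc(self.asr_emb(asr_ids))       # (B, T_asr, D)
        fused, _ = self.cross_attn(query=text_h, key=asr_h, value=asr_h)
        return self.classifier(fused + text_h)            # (B, T_text, n_diacritics)

# Toy usage with random ids, just to show the tensor shapes.
model = MultimodalDiacritizer(text_vocab=100, asr_vocab=120, n_diacritics=16)
logits = model(torch.randint(0, 100, (2, 30)), torch.randint(0, 120, (2, 40)))
print(logits.shape)  # torch.Size([2, 30, 16])
```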
Data Augmentation: Our subsequent paper explored a data augmentation technique to improve the generalization of the system. The proposed augmentation applies random diacritics to text, then synthesizes speech from this randomly diacritized text using a commercial TTS system. The intuition behind this approach is to reduce the dependency of the model's predictions on textual context and push the model towards better modeling of the acoustic properties that correspond to a given diacritic. This technique consistently improved diacritic recognition performance across all models and datasets.
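As a rough illustration of the augmentation step (the actual sampling scheme and the commercial TTS system are described in the paper; the probability, character ranges, and example sentence below are assumptions), random diacritics can be attached to Arabic letters like this; the resulting text would then be sent to a TTS system to produce matched speech:

```python
import random

# Arabic diacritic characters (harakat) used for random re-diacritization.
DIACRITICS = ["\u064E", "\u064F", "\u0650", "\u0652",   # fatha, damma, kasra, sukun
              "\u064B", "\u064C", "\u064D", "\u0651"]   # tanween forms, shadda
ARABIC_LETTERS = set(chr(c) for c in range(0x0621, 0x064B))  # base Arabic letters

def randomly_diacritize(text: str, p: float = 0.8) -> str:
    """Attach a randomly chosen diacritic to each Arabic letter with probability p."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in ARABIC_LETTERS and random.random() < p:
            out.append(random.choice(DIACRITICS))
    return "".join(out)

undiacritized = "ذهب الولد الى المدرسة"  # toy example sentence
augmented = randomly_diacritize(undiacritized)
print(augmented)
# The randomly diacritized text is then synthesized with a TTS system to obtain
# (speech, diacritized-text) training pairs for the diacritic restoration model.
```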
Pre-training with Diacritics: Since our ArTST models are pre-trained without diacritics (as most speech resources don't include diacritics), this may affect fine-tuning performance when diacritics are included, as that requires some unlearning of the patterns acquired in the pre-training phase. Indeed, we find that this effect is most evident for TTS, where undiacritized pre-training hampers the model's performance when diacritics are included in fine-tuning. That is one reason why our original TTS models based on ArTST were undiacritized. Since pre-training does not require alignment between the speech and text data, we pre-trained another version of ArTST (v1.5) using MGB2 audio and diacritized text from Tashkeela. We then fine-tuned TTS with diacritics and compared with v1. Our hypothesis was supported: the model fine-tuned from v1.5 was significantly better than the one fine-tuned from v1, everything else being equal.
Conclusion
The ArTST project provides open-source models specifically designed for the complexities of Arabic speech. By focusing on Arabic data from the start and tailoring versions for MSA, dialects, code-switching, and diacritics, ArTST offers state-of-the-art performance across various tasks. We encourage researchers and developers to explore the models on the Hugging Face Hub and contribute to the ongoing development on GitHub.