Multilingual Automatic Speech Recognition for Kinyarwanda, Swahili, and Luganda: Advancing ASR in Select East African Languages

This paper presents a multilingual Automatic Speech Recognition (ASR) model for three East African languages: Kinyarwanda, Swahili, and Luganda. The ASR model was trained on a curated code-switched dataset of 3,900 hours, produced from the Common Voice project's African-language datasets. The work included validating the Kinyarwanda dataset and developing a Kinyarwanda model that achieves a Word Error Rate (WER) of 17.57 on that language. The Kinyarwanda model was then fine-tuned on all three languages and achieved an overall WER of 21.91 on the three curated datasets, with a WER of 25.48 for Kinyarwanda, 17.22 for Swahili, and 21.95 for Luganda. The paper emphasizes the need to account for the African context when developing effective ASR systems, and the value of supporting multiple languages when building ASR for low-resource languages.
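For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the scaling to a percentage are assumptions made for illustration; the abstract does not specify the scoring tool or the text normalization used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: words are separated by whitespace and no further text
    normalization (casing, punctuation) is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref_words)


# Made-up example, not taken from the datasets described above:
# one substituted word out of three reference words.
print(word_error_rate("habari ya asubuhi", "habari za asubuhi"))
```

Under this definition, a corpus-level WER is usually obtained by summing the edit distances and the reference word counts over all utterances before dividing, rather than by averaging per-utterance or per-language scores.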