Frame-Level Speech Recognition
In this project, we apply feedforward neural networks to the task of speech recognition. The dataset comprises phoneme state (subphoneme) labels derived from audio recordings of Wall Street Journal (WSJ) articles read aloud. Each recording, referred to as an utterance, has been labeled using the original text, providing a rich dataset for training and evaluation.
The core objective is to predict the phoneme state label for each frame in the test dataset. The task is complicated by the variable length of utterances: a feedforward network expects a fixed-size input, so the data must be reorganized so that every frame becomes an independent, fixed-size training example.
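One common way to handle variable-length utterances in a frame-level setup is sketched below: flatten each utterance into individual frames and concatenate each frame with K neighbouring frames of context on each side, so every example has the same input dimension. All names and sizes here (K, FEAT_DIM, NUM_STATES, the helper functions) are illustrative assumptions, not part of the project specification.

```python
import numpy as np

K = 2            # context frames per side (assumed hyperparameter)
FEAT_DIM = 40    # e.g. 40 filterbank coefficients per frame (assumed)
NUM_STATES = 71  # assumed number of phoneme state labels

def frames_with_context(utterance, k=K):
    """Zero-pad a (T, FEAT_DIM) utterance and return a (T, (2k+1)*FEAT_DIM)
    array in which each row is a frame concatenated with its context."""
    padded = np.pad(utterance, ((k, k), (0, 0)))
    T = utterance.shape[0]
    return np.stack([padded[i:i + 2 * k + 1].reshape(-1) for i in range(T)])

# A one-hidden-layer forward pass, just to show that the context-expanded
# frames give the network a fixed-size input regardless of utterance length.
rng = np.random.default_rng(0)
W1 = rng.standard_normal(((2 * K + 1) * FEAT_DIM, 128)) * 0.01
W2 = rng.standard_normal((128, NUM_STATES)) * 0.01

def predict_states(utterance):
    x = frames_with_context(utterance)
    h = np.maximum(x @ W1, 0)     # ReLU hidden layer
    logits = h @ W2
    return logits.argmax(axis=1)  # one phoneme state label per frame

# Two utterances of different lengths are handled uniformly.
short = rng.standard_normal((7, FEAT_DIM))
long = rng.standard_normal((13, FEAT_DIM))
print(predict_states(short).shape)  # (7,)
print(predict_states(long).shape)   # (13,)
```

Because each frame is classified independently, utterances of any length reduce to the same per-frame prediction problem; the context window is the usual compensation for the loss of sequential information.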
By applying feedforward neural networks to this problem, we aim to improve the accuracy and efficiency of phoneme state recognition. The project demonstrates a practical application of deep learning and underscores the value of high-quality labeled datasets in building speech recognition systems.