Paper ID | SPE-51.6 | ||
Paper Title | TOWARDS AN ASR APPROACH USING ACOUSTIC AND LANGUAGE MODELS FOR SPEECH ENHANCEMENT | ||
Authors | Khandokar Md. Nayem, Donald S. Williamson, Indiana University, United States | ||
Session | SPE-51: Speech Enhancement 7: Single-channel Processing | ||
Location | Gather.Town | ||
Session Time | Friday, 11 June, 13:00 - 13:45 | ||
Presentation Time | Friday, 11 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-ENHA] Speech Enhancement and Separation | ||
Abstract | Recent work has shown that deep-learning-based speech enhancement performs best when a time-frequency mask is estimated. Unlike speech, these masks take values in a small range, which better facilitates regression-based learning. The question remains whether neural-network-based speech estimation should be treated as a regression problem at all. In this work, we propose to modify the speech estimation process by treating speech enhancement as a classification problem, in an ASR-style manner. More specifically, we propose a quantized speech prediction model that classifies speech spectra into corresponding quantized classes. We then train and apply a language-style model that learns the transition probabilities of the quantized classes to ensure more realistic speech spectra. We compare our approach against time-frequency masking approaches, and the results show that our quantized spectra approach leads to improvements. |
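The two ideas in the abstract, mapping spectral values to quantized classes and learning class-to-class transition probabilities, can be illustrated with a small sketch. This is not the authors' model: the abstract does not specify the quantization scheme or the form of the language-style model, so the uniform bins, the first-order (bigram) transition counts over adjacent time frames, the add-one smoothing, and all function names (`quantize`, `transition_matrix`, `dequantize`) are assumptions made here for illustration. The input is random toy data, not speech.

```python
import numpy as np

def quantize(mag, n_classes=8, lo=0.0, hi=1.0):
    """Map each magnitude value to an integer class using uniform bins (assumed scheme)."""
    edges = np.linspace(lo, hi, n_classes + 1)[1:-1]  # interior bin edges only
    return np.digitize(np.clip(mag, lo, hi), edges)

def transition_matrix(classes, n_classes=8, smoothing=1.0):
    """Estimate P(class at frame t+1 | class at frame t), pooled over frequency bins."""
    counts = np.full((n_classes, n_classes), smoothing)  # additive smoothing (assumption)
    prev, nxt = classes[:, :-1].ravel(), classes[:, 1:].ravel()
    np.add.at(counts, (prev, nxt), 1.0)  # accumulate bigram counts
    return counts / counts.sum(axis=1, keepdims=True)

def dequantize(classes, n_classes=8, lo=0.0, hi=1.0):
    """Replace each class index with the centre of its bin."""
    width = (hi - lo) / n_classes
    return lo + (classes + 0.5) * width

rng = np.random.default_rng(0)
mag = rng.random((4, 50))          # toy "spectrogram": 4 frequency bins x 50 frames
classes = quantize(mag)            # integer classes in 0..7
trans = transition_matrix(classes) # 8 x 8 row-stochastic matrix
recon = dequantize(classes)        # values within half a bin width of the input
```

In a classification-based enhancer, a network would predict a class distribution for each time-frequency unit from noisy input, and a transition model of this kind could be used to favour class sequences that resemble clean speech. How the paper combines the two is not stated in the abstract.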