Paper ID | SPE-44.2 |
Paper Title |
END-TO-END DEREVERBERATION, BEAMFORMING, AND SPEECH RECOGNITION WITH IMPROVED NUMERICAL STABILITY AND ADVANCED FRONTEND |
Authors |
Wangyou Zhang, Shanghai Jiao Tong University, China; Christoph Boeddeker, Paderborn University, Germany; Shinji Watanabe, Johns Hopkins University, United States; Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, NTT Corporation, Japan; Reinhold Haeb-Umbach, Paderborn University, Germany; Yanmin Qian, Shanghai Jiao Tong University, China |
Session | SPE-44: Speech Recognition 16: Robust Speech Recognition 2 |
Location | Gather.Town |
Session Time: | Thursday, 10 June, 16:30 - 17:15 |
Presentation Time: | Thursday, 10 June, 16:30 - 17:15 |
Presentation |
Poster |
Topic |
Speech Processing: [SPE-ROBU] Robust Speech Recognition |
Abstract |
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and a large performance gap remains between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks, including voice-activity-detection-like masks. These techniques significantly stabilize the end-to-end training process. Experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves a relative WER reduction of about 35% compared to our conventional multichannel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition while being trained only with the ASR criterion. |
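The abstract reports two metrics: a relative WER reduction and an SDR in dB. The sketch below illustrates how such figures are commonly computed. It is not the authors' evaluation code. The function names and all input numbers are hypothetical, and the SDR here is the plain energy-ratio definition, which is simpler than the BSS-Eval variant often used in separation papers.

```python
import math


def relative_wer_reduction(baseline_wer, proposed_wer):
    """Relative WER reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: plain energy ratio with no allowed distortion filter,
    so values are not directly comparable to BSS-Eval SDR.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Hypothetical WERs (in percent), chosen only to illustrate the arithmetic.
print(relative_wer_reduction(20.0, 13.0))

# Hypothetical toy signals: the estimate is the reference scaled by 0.9.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(round(simple_sdr_db(reference, estimate), 2))
```

With these made-up inputs, a drop from 20% to 13% WER is a 35% relative reduction, and an estimate equal to the reference scaled by 0.9 gives an SDR of about 20 dB under this simplified definition.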