Paper ID | SPE-28.1
Paper Title | REDAT: Accent-Invariant Representation for End-to-End ASR by Domain Adversarial Training with Relabeling
Authors | Hu Hu, Georgia Institute of Technology, United States; Xuesong Yang, Zeynab Raeesy, Jinxi Guo, Gokce Keskin, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Roland Maas, Amazon Alexa Speech, United States
Session | SPE-28: Speech Recognition 10: Robustness to Human Speech Variability |
Location | Gather.Town |
Session Time | Wednesday, 09 June, 16:30 - 17:15
Presentation Time | Wednesday, 09 June, 16:30 - 17:15
Presentation | Poster
Topic | Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search
Abstract | Accent mismatch is a critical problem for end-to-end ASR. This paper addresses it by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions. Motivated by this equivalence, we introduce reDAT, a novel technique based on DAT that relabels data using either unsupervised clustering or soft labels. Experiments on 23K hours of multi-accent data show that DAT achieves results competitive with accent-specific baselines on both native and non-native English accents, and up to 13% relative WER reduction on unseen accents; reDAT yields further improvements over DAT of 3% and 8% relative on non-native accents of American and British English, respectively.
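For reference, the Jensen-Shannon divergence mentioned in the abstract is the standard symmetric divergence JSD(P‖Q) = ½ KL(P‖M) + ½ KL(Q‖M), where M = ½(P + Q) is the mixture of the two domain output distributions; the paper's equivalence result says that the gradient-reversal objective in DAT implicitly minimizes this quantity across accents.

The gradient reversal operation at the core of DAT can be sketched in a few lines. Below is a minimal PyTorch illustration following the general DAT recipe, not the authors' implementation; names such as GradReverse, grad_reverse, and lam are illustrative assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass,
    scales gradients by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam        # stash the reversal strength for backward
        return x.view_as(x)  # pass activations through unchanged

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoder;
        # return None for the non-tensor `lam` argument.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical usage in a DAT setup: encoder features feed the ASR task
# head directly, and the accent classifier through the reversal layer,
# so the classifier learns to predict accent while the encoder is pushed
# toward accent-invariant representations.
feats = torch.randn(4, 80, requires_grad=True)  # toy encoder outputs
accent_logits = torch.nn.Linear(80, 5)(grad_reverse(feats, lam=0.5))
```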