Paper ID | SPE-21.1
Paper Title | TOP-DOWN ATTENTION IN END-TO-END SPOKEN LANGUAGE UNDERSTANDING
Authors | Yixin Chen, University of California, Los Angeles, United States; Weiyi Lu, Alejandro Mottini, Li Erran Li, Jasha Droppo, Zheng Du, Belinda Zeng, Amazon Alexa, United States
Session | SPE-21: Speech Recognition 7: Training Methods for End-to-End Modeling
Location | Gather.Town
Session Time | Wednesday, 09 June, 15:30 - 16:15
Presentation Time | Wednesday, 09 June, 15:30 - 16:15
Presentation | Poster
Topic | Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search
Abstract
Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascade of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to suboptimal overall performance. More recently, End-to-End SLU (E2E SLU) has been proposed: it performs SLU directly from speech by jointly optimizing the modules, addressing some of the shortcomings of the traditional approach. A key challenge of this approach is how best to integrate the feature learning of the ASR and NLU sub-tasks to maximize performance. While ASR models generally focus on low-level acoustic features and NLU models need higher-level contextual information, ASR models can nonetheless leverage top-down syntactic and semantic information to improve their recognition. Based on this insight, we propose Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features. We have validated our model on the FluentSpeech set and a large internal dataset. Results show that TD-SLU outperforms the selected baselines on both ASR and NLU quality metrics, and suggest that the added high-level information improves the model's performance.
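The abstract does not spell out the fusion mechanism, so the following is only a minimal PyTorch sketch of how top-down attention with an attention gate could combine high-level NLU features with low-level ASR features. The class name TopDownFusion, the dimensions, and the sigmoid gating form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class TopDownFusion(nn.Module):
    """Hypothetical sketch: fuse high-level (NLU-side) features into
    low-level (ASR-side) features via cross-attention plus a learned
    attention gate. Not the paper's specified architecture."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Top-down attention: ASR features (queries) attend to
        # NLU features (keys/values).
        self.top_down_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True
        )
        # Attention gate: a sigmoid over the concatenated streams decides,
        # per position and channel, how much top-down signal to admit.
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(
        self, asr_feats: torch.Tensor, nlu_feats: torch.Tensor
    ) -> torch.Tensor:
        # asr_feats: (batch, T_asr, d_model); nlu_feats: (batch, T_nlu, d_model)
        top_down, _ = self.top_down_attn(
            query=asr_feats, key=nlu_feats, value=nlu_feats
        )
        g = self.gate(torch.cat([asr_feats, top_down], dim=-1))
        # Gated residual fusion of the two streams.
        return asr_feats + g * top_down


# Usage with random tensors standing in for intermediate encoder states.
fusion = TopDownFusion(d_model=256, n_heads=4)
asr = torch.randn(8, 120, 256)  # e.g. frame-level ASR features
nlu = torch.randn(8, 20, 256)   # e.g. token-level semantic features
fused = fusion(asr, nlu)        # shape: (8, 120, 256)
```

The gated residual form lets the model fall back to the plain bottom-up ASR features when the high-level context is uninformative, which matches the abstract's motivation that top-down information should help rather than override recognition.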