2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Paper Detail

Paper ID SPE-21.1
Paper Title TOP-DOWN ATTENTION IN END-TO-END SPOKEN LANGUAGE UNDERSTANDING
Authors Yixin Chen, University of California, Los Angeles, United States; Weiyi Lu, Alejandro Mottini, Li Erran Li, Jasha Droppo, Zheng Du, Belinda Zeng, Amazon Alexa, United States
Session SPE-21: Speech Recognition 7: Training Methods for End-to-End Modeling
Location Gather.Town
Session Time: Wednesday, 09 June, 15:30 - 16:15
Presentation Time: Wednesday, 09 June, 15:30 - 16:15
Presentation Poster
Topic Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search
Abstract Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascade of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to suboptimal overall performance. Recently, End-to-End SLU (E2E SLU) was proposed; it performs SLU directly from speech through joint optimization of the modules, addressing some shortcomings of traditional SLU. A key challenge of this approach is how best to integrate the feature learning of the ASR and NLU sub-tasks to maximize performance. Although ASR models generally focus on low-level features and NLU models need higher-level contextual information, ASR models can also leverage top-down syntactic and semantic information to improve recognition. Based on this insight, we propose Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features. We validated our model on the FluentSpeech set and a large internal dataset. Results show that TD-SLU outperforms selected baselines on ASR and NLU quality metrics, and suggest that the added high-level information can improve the model's performance.
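
The abstract does not specify how the top-down attention or the attention gate is computed, so the following is only an illustrative sketch of the general idea, not the authors' TD-SLU architecture. It assumes scaled dot-product attention in which each low-level (ASR-side) frame queries a set of high-level (NLU-side) vectors, and a scalar sigmoid gate per frame that controls how much of the attended context is added back. All function names, the gate parameters (w_low, w_ctx, bias), and the residual form of the fusion are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def top_down_attention(low, high):
    """For each low-level frame, return a weighted average of the
    high-level vectors (scaled dot-product attention, assumed form)."""
    dim = len(low[0])
    contexts = []
    for query in low:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in high])
        context = [sum(w * h[j] for w, h in zip(weights, high)) for j in range(dim)]
        contexts.append(context)
    return contexts


def gated_fusion(low, contexts, w_low, w_ctx, bias):
    """Hypothetical gate: g = sigmoid(w_low . l + w_ctx . c + bias),
    fused = l + g * c, computed independently for every frame."""
    fused = []
    for frame, context in zip(low, contexts):
        gate = sigmoid(dot(w_low, frame) + dot(w_ctx, context) + bias)
        fused.append([l + gate * c for l, c in zip(frame, context)])
    return fused


# Toy inputs: two low-level frames and three high-level vectors, dimension 2.
low_feats = [[1.0, 0.0], [0.0, 1.0]]
high_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

contexts = top_down_attention(low_feats, high_feats)
fused = gated_fusion(low_feats, contexts, [0.0, 0.0], [0.0, 0.0], 0.0)
```

In this sketch, a strongly negative gate bias leaves the low-level features essentially unchanged, while a strongly positive bias adds the full attended context; a trained gate would sit between these extremes per frame.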