Paper ID | HLT-6.2 | ||
Paper Title | END-TO-END SPOKEN LANGUAGE UNDERSTANDING USING TRANSFORMER NETWORKS AND SELF-SUPERVISED PRE-TRAINED FEATURES | ||
Authors | Edmilson Morais, Hong Kwang J Kuo, Samuel Thomas, IBM, Brazil | ||
Session | HLT-6: Language Understanding 2: End-to-end Speech Understanding 2 | ||
Location | Gather.Town | ||
Session Time: | Wednesday, 09 June, 13:00 - 13:45 | ||
Presentation Time: | Wednesday, 09 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Human Language Technology: [HLT-UNDE] Spoken Language Understanding and Computational Semantics | ||
IEEE Xplore Open Preview | Click here to view in IEEE Xplore | ||
Abstract | Transformer networks and self-supervised pre-training have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU transformer network based architecture which allows the use of self-supervised pre-trained acoustic features, pre-trained model initialization and multi-task training. Several SLU experiments for predicting intent and entity labels/values using the ATIS dataset are performed. These experiments investigate the interaction of pre-trained model initialization and multi-task training with either traditional filterbank or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all the experiments, but also that when these features are used in combination with multi-task training, they almost eliminate the necessity of pre-trained model initialization. |