2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SPE-24.1
Paper Title	A NOVEL END-TO-END SPEECH EMOTION RECOGNITION NETWORK WITH STACKED TRANSFORMER LAYERS
Authors	Xianfeng Wang, Min Wang, Wenbo Qi, Wanqi Su, Xiangqian Wang, Huan Zhou, Artificial Intelligence Application Research Center, Huawei Technologies, China
Session	SPE-24: Speech Emotion 2: Neural Networks for Speech Emotion Recognition
Location	Gather.Town
Session Time:	Wednesday, 09 June, 15:30 - 16:15
Presentation Time:	Wednesday, 09 June, 15:30 - 16:15
Presentation	Poster
Topic	Speech Processing: [SPE-ANLS] Speech Analysis
Abstract	Speech emotion recognition (SER) aims to automatically recognize the emotional category of a given speech utterance. The performance of an SER system relies heavily on the effectiveness of the global representation expressed at the utterance level. To extract such a global feature effectively, most recent SER architectures adopt a pipeline with two key modules: feature extraction and aggregation. Although various module designs have brought impressive progress, SER is still a challenging task. In contrast with previous work, we propose a novel strategy for global SER feature extraction that applies an additional enhancement module on top of the current SER pipeline. To verify its effect, we propose an end-to-end SER architecture in which multiple stacked transformer layers are explored to enhance the aggregated global feature. This architecture is evaluated on IEMOCAP, and the results strongly substantiate the effectiveness of our proposal. In terms of weighted accuracy on four emotion categories, our proposed SER system outperforms prior art by a large margin, a relative improvement of 20%. Our code and the pre-trained SER models are made publicly available.
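The abstract reports its result as weighted accuracy and as a relative improvement. The following is a minimal sketch of how these quantities are commonly computed, not the authors' evaluation code. It assumes the convention, common in SER work, that weighted accuracy is the overall fraction of correctly classified utterances and unweighted accuracy is the mean of per-class recalls. The function names, the four label names, and all numbers are hypothetical placeholders, not values from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances.

    Assumed convention: larger classes contribute more, hence "weighted".
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score


# Hypothetical toy labels for four emotion categories (not real data).
y_true = ["neutral"] * 4 + ["happy"] * 2 + ["sad", "angry"]
y_pred = ["neutral", "neutral", "neutral", "happy",
          "happy", "sad", "sad", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 5 correct out of 8 utterances
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 3/4, 1/2, 1, 0
gain = relative_improvement(0.75, 0.625)  # hypothetical scores
```

A relative improvement is measured against the baseline score, so in this toy case an absolute gain of 0.125 over a baseline of 0.625 is a relative gain of 20%.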