Paper ID | SPE-24.1 |
Paper Title | A NOVEL END-TO-END SPEECH EMOTION RECOGNITION NETWORK WITH STACKED TRANSFORMER LAYERS |
Authors | Xianfeng Wang, Min Wang, Wenbo Qi, Wanqi Su, Xiangqian Wang, Huan Zhou, Artificial Intelligence Application Research Center, Huawei Technologies, China |
Session | SPE-24: Speech Emotion 2: Neural Networks for Speech Emotion Recognition |
Location | Gather.Town |
Session Time: | Wednesday, 09 June, 15:30 - 16:15 |
Presentation Time: | Wednesday, 09 June, 15:30 - 16:15 |
Presentation | Poster |
Topic | Speech Processing: [SPE-ANLS] Speech Analysis |
Abstract | Speech emotion recognition (SER) aims to automatically recognize the emotional category of a given speech utterance. The performance of an SER system relies heavily on the effectiveness of the global representation expressed at the utterance level. To extract such a global feature effectively, most recent SER architectures adopt a pipeline with two key modules: feature extraction and aggregation. Although various module designs have brought impressive progress, SER remains a challenging task. In contrast to previous work, we propose a novel strategy for global SER feature extraction that applies an additional enhancement module on top of the current SER pipeline. To verify its effect, we propose an end-to-end SER architecture in which multiple stacked transformer layers enhance the aggregated global feature. The architecture is evaluated on IEMOCAP, and the results strongly substantiate the effectiveness of our proposal. In terms of weighted accuracy on four emotion categories, the proposed SER system outperforms prior art by a large margin, with a relative improvement of 20%. Our code and pre-trained SER models are publicly available. |
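
The abstract reports results as weighted accuracy and as a relative improvement, without defining either. The sketch below shows how these quantities are commonly computed in SER work, assuming weighted accuracy means overall accuracy across all utterances and unweighted accuracy means the average of per-class recalls. The function names, the four label names, and all numbers are illustrative assumptions; none of them are taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correctly classified utterances / all utterances."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def relative_improvement(baseline, proposed):
    """Relative gain of `proposed` over `baseline`, as a fraction."""
    return (proposed - baseline) / baseline


# Hypothetical labels for eight utterances (not data from the paper).
y_true = ["angry", "angry", "happy", "neutral", "neutral", "neutral", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "neutral", "neutral", "sad", "sad", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
# Hypothetical scores: moving from 0.60 to 0.72 is a 20% relative gain.
gain = relative_improvement(0.60, 0.72)
```

Under these assumptions, a relative improvement of 20% means the new score is 1.2 times the baseline score, which is different from an absolute gain of 20 percentage points.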