Paper ID | MMSP-6.3 | ||
Paper Title | MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION | ||
Authors | Licai Sun, University of Chinese Academy of Sciences, China; Bin Liu, Jianhua Tao, Zheng Lian, Institute of Automation, Chinese Academy of Sciences, China | ||
Session | MMSP-6: Human Centric Multimedia 2 | ||
Location | Gather.Town | ||
Session Time: | Thursday, 10 June, 14:00 - 14:45 | ||
Presentation Time: | Thursday, 10 June, 14:00 - 14:45 | ||
Presentation | Poster | ||
Topic | Multimedia Signal Processing: Human Centric Multimedia | ||
IEEE Xplore Open Preview | Click here to view in IEEE Xplore | ||
Abstract | Speech Emotion Recognition (SER) requires a thorough understanding of both the linguistic content of an utterance (i.e., textual information) and how the speaker utters it (i.e., acoustic information). The one vital challenge in SER is how to effectively fuse these two kinds of information. In this paper, we propose a novel Multimodal Cross- and Self-Attention Network (MCSAN) to tackle this problem. The core of MCSAN is to employ the parallel cross- and selfattention modules to explicitly model both inter- and intra-modal interactions of audio and text. Specifically, the cross-attention module utilizes the cross-attention mechanism to guide one modality to attend to the other modality and update the features accordingly. Similarly, the self-attention module employs the self-attention mechanism to propagate information within each modality. We evaluate MCSAN on two benchmark datasets, IEMOCAP and MELD. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on both datasets. |