Paper ID | SPE-25.5
Paper Title | Multimodal Emotion Recognition with Capsule Graph Convolutional Based Representation Fusion
Authors | Jiaxing Liu, Sen Chen, Longbiao Wang, Zhilei Liu, Yahui Fu, Lili Guo, Tianjin University, China; Jianwu Dang, Japan Advanced Institute of Science and Technology, Japan
Session | SPE-25: Speech Emotion 3: Emotion Recognition - Representations, Data Augmentation
Location | Gather.Town
Session Time | Wednesday, 09 June, 15:30 - 16:15
Presentation Time | Wednesday, 09 June, 15:30 - 16:15
Presentation | Poster
Topic | Speech Processing: [SPE-ANLS] Speech Analysis
IEEE Xplore Open Preview | Available in IEEE Xplore
Abstract | Compared to unimodal approaches, audio-video multimodal emotion recognition (MER) is more robust and has therefore attracted considerable attention. The performance of MER is often determined by the efficiency of the representation fusion algorithm. Although many fusion algorithms exist, they usually ignore information redundancy and information complementarity. In this paper, we propose a novel representation fusion method, the Capsule Graph Convolutional Network (CapsGCN). First, after unimodal representation learning, the extracted audio and video representations are distilled by a capsule network and encapsulated into multimodal capsules. Through the dynamic routing algorithm, the multimodal capsules effectively reduce data redundancy. Second, the multimodal capsules, together with their inter-relations and intra-relations, are treated as a graph. This graph is learned by a Graph Convolutional Network (GCN) to obtain a hidden relational representation, which serves as a good supplement for information complementarity. Finally, the multimodal capsules and the hidden relational representation learned by the CapsGCN are fed to multihead self-attention to balance the contributions of the source and relational representations. To verify the performance, we provide visualizations of the representations, comparisons with commonly used fusion methods, and ablation studies of the proposed CapsGCN. Our fusion method achieves 80.83% accuracy and an 80.23% F1 score on eNTERFACE'05.
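The three stages named in the abstract — dynamic routing into multimodal capsules, a GCN over the capsule relation graph, and multihead self-attention over the combined representations — can be sketched with standard textbook forms of each component. This is a minimal NumPy illustration, not the authors' implementation: all shapes (6 input capsules, 4 multimodal capsules of dimension 8), the thresholded dot-product adjacency, and the parameter-free attention are illustrative assumptions.

```python
import numpy as np

def squash(s, axis=-1):
    # Capsule squashing nonlinearity: preserves direction, maps norm into (0, 1).
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat, iters=3):
    # u_hat: (num_in, num_out, dim) prediction vectors from lower-level capsules.
    # Routing-by-agreement concentrates each input on the higher-level capsules
    # it agrees with, which is how capsules reduce redundancy.
    b = np.zeros(u_hat.shape[:2])                                  # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)       # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                     # weighted sum
        v = squash(s)                                              # (num_out, dim)
        b = b + (u_hat * v[None]).sum(axis=-1)                     # agreement update
    return v

def gcn_layer(X, A, W):
    # One GCN layer with self-loops and symmetric normalization:
    # relu(D^-1/2 (A + I) D^-1/2 X W).
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

def multihead_self_attention(X, num_heads=2):
    # Parameter-free sketch of scaled dot-product self-attention per head
    # (a real implementation would learn Q/K/V projections).
    n, d = X.shape
    hd = d // num_heads
    outs = []
    for h in range(num_heads):
        Xh = X[:, h * hd:(h + 1) * hd]
        scores = Xh @ Xh.T / np.sqrt(hd)
        attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        outs.append(attn @ Xh)
    return np.concatenate(outs, axis=1)

# Toy pipeline with hypothetical shapes: 6 unimodal capsules -> 4 multimodal
# capsules of dim 8, relation graph from thresholded capsule similarity.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 4, 8))
caps = dynamic_routing(u_hat)                       # (4, 8) multimodal capsules
A = (caps @ caps.T > 0).astype(float)               # crude relation graph
np.fill_diagonal(A, 0.0)
W = rng.normal(size=(8, 8))
hidden = gcn_layer(caps, A, W)                      # hidden relational representation
fused = multihead_self_attention(np.concatenate([caps, hidden], axis=0))
print(fused.shape)                                  # (8, 8)
```

The final attention step mirrors the abstract's last stage: source capsules and relational representations are stacked and attention reweights their contributions before classification.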