Paper ID | MMSP-7.5
Paper Title | ROBUST LATENT REPRESENTATIONS VIA CROSS-MODAL TRANSLATION AND ALIGNMENT
Authors | Vandana Rajan, Queen Mary University of London, United Kingdom; Alessio Brutti, FBK, Italy; Andrea Cavallaro, Queen Mary University of London, United Kingdom
Session | MMSP-7: Multimodal Perception, Integration and Multisensory Fusion
Location | Gather.Town
Session Time | Friday, 11 June, 13:00 - 13:45
Presentation Time | Friday, 11 June, 13:00 - 13:45
Presentation | Poster
Topic | Multimedia Signal Processing: Human Centric Multimedia
Abstract | Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available at test time, which is a limitation when signals from some modalities are unavailable or severely degraded. To address this limitation, we aim to improve the test-time performance of uni-modal systems by using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of the worse performing (or weaker) modality. Translating from the weaker to the better performing (or stronger) modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representation in a shared latent space. We validate the proposed framework on the AVEC 2016 dataset (RECOLA) for continuous emotion recognition and show that the framework achieves state-of-the-art uni-modal performance for weaker modalities.
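To illustrate the training-versus-testing asymmetry described in the abstract, below is a minimal PyTorch sketch of a multi-modal training setup with uni-modal testing. All module names, feature dimensions, and the simplified per-dimension correlation loss (used here as a stand-in for the paper's correlation-based latent alignment) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: train with both modalities, test with the weaker one only.
import torch
import torch.nn as nn

class WeakEncoder(nn.Module):          # e.g. the weaker modality (assumed audio)
    def __init__(self, in_dim=40, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, x):
        return self.net(x)

class StrongEncoder(nn.Module):        # e.g. the stronger modality (assumed video)
    def __init__(self, in_dim=512, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, x):
        return self.net(x)

class Translator(nn.Module):
    """Maps the weak-modality encoding towards the strong modality,
    yielding a multi-modal intermediate encoding."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, latent_dim))
    def forward(self, z):
        return self.net(z)

def correlation_loss(z_a, z_b, eps=1e-8):
    """Simplified alignment objective: maximise per-dimension Pearson
    correlation between the two latent representations (an assumed
    stand-in for the paper's correlation-based alignment)."""
    a = z_a - z_a.mean(dim=0, keepdim=True)
    b = z_b - z_b.mean(dim=0, keepdim=True)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()

# --- training step: both modalities available ---
weak_enc, strong_enc, translator = WeakEncoder(), StrongEncoder(), Translator()
regressor = nn.Linear(64, 2)           # e.g. arousal/valence targets for RECOLA
params = (list(weak_enc.parameters()) + list(strong_enc.parameters())
          + list(translator.parameters()) + list(regressor.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

audio = torch.randn(32, 40)            # dummy weak-modality features
video = torch.randn(32, 512)           # dummy strong-modality features
labels = torch.randn(32, 2)            # dummy continuous emotion targets

z_weak = weak_enc(audio)
z_multi = translator(z_weak)           # multi-modal intermediate encoding
z_strong = strong_enc(video)

loss = (nn.functional.mse_loss(regressor(z_multi), labels)   # task loss
        + correlation_loss(z_multi, z_strong))                # alignment loss
opt.zero_grad(); loss.backward(); opt.step()

# --- test step: only the weaker modality is available ---
with torch.no_grad():
    prediction = regressor(translator(weak_enc(audio)))
```

The key design point the sketch tries to capture is that the stronger-modality encoder and the alignment loss are used only during training; at test time the weaker-modality path (encoder, translator, regressor) runs on its own.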