Paper ID | MMSP-7.5 |
Paper Title | ROBUST LATENT REPRESENTATIONS VIA CROSS-MODAL TRANSLATION AND ALIGNMENT |
Authors | Vandana Rajan, Queen Mary University of London, United Kingdom; Alessio Brutti, FBK, Italy; Andrea Cavallaro, Queen Mary University of London, United Kingdom |
Session | MMSP-7: Multimodal Perception, Integration and Multisensory Fusion |
Location | Gather.Town |
Session Time | Friday, 11 June, 13:00 - 13:45 |
Presentation Time | Friday, 11 June, 13:00 - 13:45 |
Presentation | Poster |
Topic | Multimedia Signal Processing: Human Centric Multimedia |
Abstract | Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available at test time. This is a limitation when signals from some modalities are unavailable or severely degraded. To address this limitation, we aim to improve the testing performance of uni-modal systems by using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of a worse performing (or weaker) modality. The translation from the weaker to the better performing (or stronger) modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representation in a shared latent space. We validate the proposed framework on the AVEC 2016 dataset (RECOLA) for continuous emotion recognition and show that the framework achieves state-of-the-art (uni-modal) performance for the weaker modalities. |