2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information



Paper Detail

Paper ID SPE-36.4
Paper Title AUDIO-VISUAL SPEECH ENHANCEMENT METHOD CONDITIONED ON THE LIP MOTION AND SPEAKER-DISCRIMINATIVE EMBEDDINGS
Authors Koichiro Ito, Masaaki Yamamoto, Kenji Nagamatsu, Hitachi, Ltd., Japan
Session SPE-36: Speech Enhancement 6: Multi-modal Processing
Location Gather.Town
Session Time: Thursday, 10 June, 14:00 - 14:45
Presentation Time: Thursday, 10 June, 14:00 - 14:45
Presentation Poster
Topic Speech Processing: [SPE-ENHA] Speech Enhancement and Separation
Abstract We propose an audio-visual speech enhancement (AVSE) method conditioned on both the speaker's lip motion and speaker-discriminative embeddings. In particular, we explore extracting the embeddings directly from noisy audio in the AVSE setting, without an enrollment procedure. We aim to improve speech-enhancement performance by conditioning the model on the embedding. To achieve this goal, we devise an AV voice activity detection (AV-VAD) module and a speaker identification module for the AVSE model. The AV-VAD module identifies reliable frames from which the identification module can extract a robust embedding, which is then used together with the lip motion for enhancement. To train our modules effectively, we propose multi-task learning across the AVSE, speaker identification, and VAD tasks. Experimental results show that our method (1) directly extracted robust speaker embeddings from the noisy audio without an enrollment procedure and (2) improved enhancement performance compared with conventional AVSE methods.
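
The abstract does not give implementation details, so the following is only an illustrative sketch of two ideas it mentions, not the authors' method. It assumes that frame reliability can be expressed as a per-frame VAD score used to weight frame-level embeddings into one speaker embedding, and that multi-task learning combines the three task losses as a weighted sum. The function names, the pooling rule, and the weights w_spk and w_vad are all hypothetical.

```python
from typing import List, Sequence


def vad_weighted_pooling(frame_embeddings: Sequence[Sequence[float]],
                         vad_probs: Sequence[float]) -> List[float]:
    """Average frame-level embeddings, weighting each frame by its VAD score.

    Assumption: frames with a higher VAD score are more reliable and should
    contribute more to the pooled speaker embedding.
    """
    if len(frame_embeddings) != len(vad_probs):
        raise ValueError("need exactly one VAD score per frame")
    total = sum(vad_probs)
    if total <= 0.0:
        raise ValueError("no frame was judged reliable")
    dim = len(frame_embeddings[0])
    pooled = [0.0] * dim
    for emb, p in zip(frame_embeddings, vad_probs):
        for i in range(dim):
            pooled[i] += p * emb[i]
    # Normalize by the total weight so the result is a weighted mean.
    return [v / total for v in pooled]


def multitask_loss(enh_loss: float, spk_loss: float, vad_loss: float,
                   w_spk: float = 0.1, w_vad: float = 0.1) -> float:
    """Weighted sum of the enhancement, speaker-ID, and VAD losses.

    Assumption: the weights are hypothetical hyperparameters, not values
    reported in the paper.
    """
    return enh_loss + w_spk * spk_loss + w_vad * vad_loss
```

In this sketch a frame with a VAD score of zero has no influence on the pooled embedding, which is one simple way to keep noise-only frames out of the speaker representation.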