Paper ID | SPE-49.3 |
Paper Title | SPEECH PREDICTION IN SILENT VIDEOS USING VARIATIONAL AUTOENCODERS |
Authors |
Ravindra Yadav, Indian Institute of Technology, Kanpur, India; Ashish Sardana, NVIDIA, India; Vinay P Namboodiri, University of Bath, United Kingdom; Rajesh M Hegde, Indian Institute of Technology, Kanpur, India |
Session | SPE-49: Speech Synthesis 7: General Topics |
Location | Gather.Town |
Session Time: | Friday, 11 June, 11:30 - 12:15 |
Presentation Time: | Friday, 11 June, 11:30 - 12:15 |
Presentation | Poster |
Topic |
Speech Processing: [SPE-SYNT] Speech Synthesis and Generation |
Abstract |
Understanding the relationship between auditory and visual signals is crucial for many applications, ranging from computer-generated imagery (CGI) and video editing automation to assisting people with hearing or visual impairments. This is challenging, however, because the distributions of both the audio and visual modalities are inherently multimodal. Most existing methods ignore this multimodal aspect and assume a deterministic one-to-one mapping between the two modalities, which can lead to low-quality predictions as the model collapses to optimizing for average behavior rather than learning the full data distribution. In this paper, we present a stochastic model for generating speech from a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal given the visual signal. We demonstrate the performance of our model on the GRID dataset using standard benchmarks. |
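The core idea described in the abstract, learning a conditional distribution of audio given video with a variational model rather than a deterministic mapping, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses recurrent networks over video frames, while the sketch below stands in random linear maps for the trained encoder and decoder, and toy dimensions (`D_VIS`, `D_AUD`, `D_LAT`) are assumptions. It shows only the conditional-VAE mechanics: encode q(z | audio, visual), sample z via the reparameterization trick, decode p(audio | z, visual), and form the reconstruction and KL loss terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature sizes (assumed for illustration, not from the paper)
D_VIS, D_AUD, D_LAT = 16, 8, 4

# Random weights standing in for trained encoder/decoder layers
W_mu  = rng.normal(size=(D_VIS + D_AUD, D_LAT)) * 0.1   # encoder mean head
W_lv  = rng.normal(size=(D_VIS + D_AUD, D_LAT)) * 0.1   # encoder log-variance head
W_dec = rng.normal(size=(D_VIS + D_LAT, D_AUD)) * 0.1   # conditional decoder

def cvae_step(visual, audio):
    """One conditional-VAE training step: encode q(z | audio, visual),
    sample z with the reparameterization trick, decode p(audio | z, visual),
    and return the reconstruction and KL loss terms."""
    h = np.concatenate([visual, audio])
    mu, logvar = h @ W_mu, h @ W_lv
    eps = rng.normal(size=D_LAT)
    z = mu + np.exp(0.5 * logvar) * eps           # reparameterization trick
    recon = np.concatenate([visual, z]) @ W_dec   # decoder is conditioned on video
    rec_loss = np.mean((recon - audio) ** 2)      # reconstruction term
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))  # KL(q || N(0, I))
    return rec_loss, kl

rec, kl = cvae_step(rng.normal(size=D_VIS), rng.normal(size=D_AUD))
```

Because the decoder receives both a stochastic latent z and the visual conditioning, repeated samples of z yield different plausible audio for the same silent video, which is exactly what a deterministic one-to-one mapping cannot express.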