2021 IEEE International Conference on Acoustics, Speech and Signal Processing

Technical Program

Paper ID	MMSP-3.4
Paper Title	SHOW AND SPEAK: DIRECTLY SYNTHESIZE SPOKEN DESCRIPTION OF IMAGES
Authors	Xinsheng Wang, Xi’an Jiaotong University, China; Siyuan Feng, Delft University of Technology, Netherlands; Jihua Zhu, Xi’an Jiaotong University, China; Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, United States; Odette Scharenborg, Delft University of Technology, Netherlands
Session	MMSP-3: Multimedia Synthesis and Enhancement
Location	Gather.Town
Session Time	Wednesday, 09 June, 14:00 - 14:45
Presentation Time	Wednesday, 09 June, 14:00 - 14:45
Presentation	Poster
Topic	Multimedia Signal Processing: Emerging Areas in Multimedia
Abstract	This paper proposes a new model, referred to as the show and speak (SAS) model, that is able, for the first time, to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
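
The data flow described in the abstract (image, then encoder, then spectrogram decoder, then vocoder) can be illustrated with a toy, shape-level sketch. This is not the authors' implementation: the function names, the sizes `N_MELS` and `FEAT_DIM`, the random weights, and the autoregressive decoding loop are all assumptions made for illustration, and `vocode` is a trivial placeholder standing in for WaveNet rather than a neural vocoder.

```python
import math
import random

N_MELS = 4     # assumed number of spectrogram bins per frame (toy value)
FEAT_DIM = 8   # assumed size of the image feature vector (toy value)


def encode_image(image):
    """Toy encoder: average-pool the flattened pixels into FEAT_DIM values."""
    pixels = [p for row in image for p in row]
    chunk = max(1, len(pixels) // FEAT_DIM)
    return [sum(pixels[i * chunk:(i + 1) * chunk]) / chunk for i in range(FEAT_DIM)]


def decode_spectrogram(features, n_frames, seed=0):
    """Toy decoder: each frame depends on the image features and the previous frame.

    The weights are random and untrained; only the data flow is meaningful.
    """
    rng = random.Random(seed)
    w_feat = [[rng.uniform(-1, 1) for _ in range(FEAT_DIM)] for _ in range(N_MELS)]
    w_prev = [[rng.uniform(-1, 1) for _ in range(N_MELS)] for _ in range(N_MELS)]
    prev = [0.0] * N_MELS
    frames = []
    for _ in range(n_frames):
        frame = [
            math.tanh(
                sum(w * f for w, f in zip(w_feat[m], features))
                + sum(w * p for w, p in zip(w_prev[m], prev))
            )
            for m in range(N_MELS)
        ]
        frames.append(frame)
        prev = frame
    return frames


def vocode(spectrogram, hop_length=4):
    """Placeholder vocoder (NOT WaveNet): hold each frame's mean for hop_length samples."""
    waveform = []
    for frame in spectrogram:
        waveform.extend([sum(frame) / len(frame)] * hop_length)
    return waveform


# A 4x4 grayscale "image" goes straight to audio samples; no text or phonemes appear.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
spectrogram = decode_spectrogram(encode_image(image), n_frames=5)
waveform = vocode(spectrogram)
print(len(spectrogram), len(spectrogram[0]), len(waveform))
```

With these toy settings the sketch yields 5 spectrogram frames of 4 bins each and 20 waveform samples. The point is only that no intermediate text or phoneme representation appears anywhere between the image and the audio.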