Paper ID: HLT-10.4
Paper Title: ALIGN OR ATTEND? TOWARD MORE EFFICIENT AND ACCURATE SPOKEN WORD DISCOVERY USING SPEECH-TO-IMAGE RETRIEVAL
Authors: Liming Wang, University of Illinois, Urbana-Champaign, United States; Xinsheng Wang, Delft University of Technology, Netherlands; Mark Hasegawa-Johnson, University of Illinois, Urbana-Champaign, United States; Odette Scharenborg, Delft University of Technology, Netherlands; Najim Dehak, Johns Hopkins University, United States
Session: HLT-10: Multi-modality in Language
Location: Gather.Town
Session Time: Wednesday, 09 June, 16:30 - 17:15
Presentation Time: Wednesday, 09 June, 16:30 - 17:15
Presentation: Poster
Topic: Speech Processing: [SPE-GASR] General Topics in Speech Recognition
Abstract: Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural machine translation (MT) with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% alignment F1, respectively.
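The abstract reports alignment F1 as the word discovery metric. As a rough illustration (not taken from the paper), alignment F1 can be computed by comparing a predicted set of alignment links against a gold set; the link representation below, pairs of (speech-segment index, image concept), is an assumption for the sketch.

```python
# Hypothetical sketch: alignment F1 over sets of predicted vs. gold
# (speech_segment, image_concept) alignment links. The link format is
# an illustrative assumption, not the paper's exact evaluation protocol.
def alignment_f1(predicted, gold):
    """Return (precision, recall, F1) between two collections of links."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    correct = len(predicted & gold)          # links present in both sets
    precision = correct / len(predicted)     # fraction of predictions correct
    recall = correct / len(gold)             # fraction of gold links found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Example: 2 of 3 predicted links match 4 gold links.
gold = {(0, "dog"), (1, "ball"), (2, "park"), (3, "run")}
pred = {(0, "dog"), (1, "ball"), (2, "tree")}
p, r, f = alignment_f1(pred, gold)  # p = 2/3, r = 1/2, f = 4/7
```

Reporting F1 rather than raw accuracy balances over-segmentation (many spurious links hurt precision) against under-segmentation (missed words hurt recall).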