Paper ID | MLSP-44.1
Paper Title | MULTIMODAL PUNCTUATION PREDICTION WITH CONTEXTUAL DROPOUT
Authors | Andrew Silva, Georgia Institute of Technology, United States; Barry-John Theobald, Nicholas Apostoloff, Apple, United States
Session | MLSP-44: Multimodal Data and Applications
Location | Gather.Town
Session Time | Friday, 11 June, 13:00 - 13:45
Presentation Time | Friday, 11 June, 13:00 - 13:45
Presentation | Poster
Topic | Machine Learning for Signal Processing: [MLR-LMM] Learning from multimodal data
Abstract |
Automatic speech recognition (ASR) is widely used in consumer electronics. ASR greatly improves the utility and accessibility of technology, but the output is usually an unpunctuated word sequence, which can make user intent ambiguous. We first present a transformer-based approach for punctuation prediction that achieves an 8% improvement on the IWSLT 2012 TED Task, beating the previous state of the art [1]. We next describe a multimodal model that learns from both text and audio, which achieves an 8% improvement over the text-only algorithm on an internal dataset for which we have both the audio and transcriptions. Finally, we present an approach to training with contextual dropout that allows the model to handle variable amounts of future context at test time.
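To make the contextual-dropout idea concrete, the sketch below (not the authors' implementation; all function and parameter names are assumptions for illustration) shows one way to mask a random amount of future context during training, so that a single model can be run with whatever lookahead is available at inference.

```python
import torch


def drop_future_context(token_ids, boundary_idx, max_future, pad_id=0):
    """Mask a random amount of future context for each example in a batch.

    token_ids    : (batch, seq_len) LongTensor of input token ids.
    boundary_idx : index of the word whose trailing punctuation is predicted;
                   tokens up to and including this index are always kept.
    max_future   : maximum number of lookahead tokens in the window.
    pad_id       : id used to mask out the dropped future tokens.
    """
    masked = token_ids.clone()
    batch_size = masked.size(0)
    # Sample how many future tokens each example may see (0..max_future).
    visible = torch.randint(0, max_future + 1, (batch_size,))
    for b in range(batch_size):
        cutoff = boundary_idx + 1 + int(visible[b])
        masked[b, cutoff:] = pad_id  # hide everything past the sampled cutoff
    return masked


if __name__ == "__main__":
    # Toy batch: 2 sequences of 10 token ids, punctuation predicted after token index 4.
    batch = torch.arange(1, 21).reshape(2, 10)
    print(drop_future_context(batch, boundary_idx=4, max_future=5))
```

Training on inputs masked this way exposes the model to every lookahead length from zero to the full window, which is one plausible way to realize the variable-context behavior the abstract describes.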