2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	MMSP-2.4
Paper Title	Rule-embedded network for audio-visual voice activity detection in live musical video streams
Authors	Yuanbo Hou, Ghent University, Belgium; Yi Deng, New York University, United States; Bilei Zhu, Zejun Ma, Bytedance AI Lab, China; Dick Botteldooren, Ghent University, Belgium
Session	MMSP-2: Deep Learning for Multimedia Analysis and Processing
Location	Gather.Town
Session Time	Tuesday, 08 June, 14:00 - 14:45
Presentation Time	Tuesday, 08 June, 14:00 - 14:45
Presentation	Poster
Topic	Multimedia Signal Processing: Emerging Areas in Multimedia
Abstract	Detecting the anchor’s voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. This paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to better detect the target voice. The rule's core role in the model is to coordinate the relation between the two modalities and to use visual representations as a mask that filters out non-target sound. Experiments show that: 1) with cross-modal fusion using the proposed rule, the detection results of the A-V branch outperform those of the audio branch within the same model framework; 2) the bimodal A-V model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
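
The idea of using visual information as a mask over audio-based detections can be illustrated with a toy sketch. This is not the paper's network or its rule: it assumes each modality has already been reduced to one probability per frame, and the function name `mask_audio_with_visual`, the element-wise product, and the `threshold` value are all hypothetical choices made for illustration only.

```python
def mask_audio_with_visual(audio_probs, visual_probs, threshold=0.5):
    """Toy cross-modal gate (illustrative assumption, not the paper's rule).

    audio_probs:  per-frame probability that some voice is audible
    visual_probs: per-frame probability that the target (anchor) is
                  visibly vocalizing
    Returns the fused per-frame scores and the thresholded decisions.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("audio and visual streams must have the same number of frames")

    # Element-wise product: the visual score acts as a soft mask that
    # suppresses audio detections not supported by the video.
    fused = [a * v for a, v in zip(audio_probs, visual_probs)]

    # Frame-level decision on the fused score.
    decisions = [p >= threshold for p in fused]
    return fused, decisions


# Frame 2 has a strong audio response (for example, a background singer),
# but the visual stream gives no support, so the fused score is suppressed.
audio = [0.9, 0.8, 0.9, 0.1]
visual = [1.0, 0.0, 1.0, 1.0]
fused, decisions = mask_audio_with_visual(audio, visual)
```

In this simplified form the gate can only remove audio detections, never add them, which mirrors the filtering role the abstract attributes to the visual representations.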