2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SPE-41.7
Paper Title	Robust Voice Activity Detection Using A Masked Auditory Encoder Based Convolutional Neural Network
Authors	Nan Li, Longbiao Wang, Tianjin University, China; Masashi Unoki, Japan Advanced Institute of Science and Technology, Japan; Sheng Li, National Institute of Information and Communications Technology, Japan; Rui Wang, Japan Advanced Institute of Science and Technology, Japan; Meng Ge, Tianjin University, China; Jianwu Dang, Japan Advanced Institute of Science and Technology and Tianjin University, Japan
Session	SPE-41: Voice Activity and Disfluency Detection
Location	Gather.Town
Session Time:	Thursday, 10 June, 15:30 - 16:15
Presentation Time:	Thursday, 10 June, 15:30 - 16:15
Presentation	Poster
Topic	Speech Processing: [SPE-VAD] Voice Activity Detection and End-pointing
Abstract	Voice activity detection (VAD) based on deep learning has achieved remarkable success. However, when traditional features (e.g., raw waveforms and MFCCs) are fed directly to a deep neural network model, performance degrades because of noise interference. Here, we propose a robust VAD approach using a masked auditory encoder based convolutional neural network (M-AECNN). First, we analyze the effectiveness of using auditory features as a deep learning encoder. These features can roughly simulate the transmission of sound to human inner-ear hair cells; thus, they are more robust than the raw waveform and frequency-domain features designed as encoders. Second, similar to the human ear’s masking effect for different speech frequencies, the proposed auditory encoder can further improve the robustness of VAD by increasing the gain for cleaner speech frequencies. Extensive experimental results demonstrate that this approach achieves an absolute improvement of about 10.5% in the area under the curve (AUC) on the AURORA-2J dataset compared with a VAD method based on a CNN and MFCCs.
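The abstract reports its headline result as an absolute improvement in the area under the curve. As a reading aid, the following is a minimal sketch of how a frame-level AUC and an absolute difference between two detectors could be computed. It is not the authors' evaluation code: the function name `roc_auc`, the toy labels, and the toy scores are all hypothetical, and the sketch assumes the curve in question is the ROC curve over per-frame speech/non-speech decisions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    labels: 1 for a speech frame, 0 for a non-speech frame.
    scores: detector output per frame (higher means more speech-like).
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both speech and non-speech frames")

    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Group frames with identical scores so ties share an average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Hypothetical toy data: four frames, two detectors (not results from the paper).
frame_labels = [0, 0, 1, 1]
baseline_scores = [0.1, 0.4, 0.35, 0.8]
improved_scores = [0.1, 0.2, 0.7, 0.8]

baseline_auc = roc_auc(frame_labels, baseline_scores)
improved_auc = roc_auc(frame_labels, improved_scores)

# "Absolute improvement" is the plain difference in AUC, in percentage points.
absolute_gain_points = 100.0 * (improved_auc - baseline_auc)
print(baseline_auc, improved_auc, absolute_gain_points)
```

An absolute improvement is the difference between the two AUC values themselves, so a gain of about 10.5% here means roughly 10.5 percentage points of AUC rather than a 10.5% relative change.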