Paper ID | SPE-37.1 |
Paper Title |
AN EFFECTIVE DEEP EMBEDDING LEARNING METHOD BASED ON DENSE-RESIDUAL NETWORKS FOR SPEAKER VERIFICATION |
Authors |
Ying Liu, Yan Song, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, China; Ian McLoughlin, ICT Cluster, Singapore Institute of Technology, Singapore; Lin Liu, iFLYTEK Research, iFLYTEK CO., LTD., China; Li-rong Dai, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, China |
Session | SPE-37: Speaker Recognition 5: Neural Embedding |
Location | Gather.Town |
Session Time: | Thursday, 10 June, 14:00 - 14:45 |
Presentation Time: | Thursday, 10 June, 14:00 - 14:45 |
Presentation | Poster |
Topic |
Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization |
Abstract |
In this paper, we present an effective end-to-end deep embedding learning method based on Dense-Residual networks, which combine the advantages of a densely connected convolutional network (DenseNet) and a residual network (ResNet), for speaker verification (SV). Unlike a model ensemble strategy, which merges the results of multiple systems, the proposed Dense-Residual networks perform feature fusion within every basic DenseR building block. Specifically, two types of DenseR blocks are designed. A sequential-DenseR block is constructed by densely connecting the stacked basic units inside a residual block of ResNet. A parallel-DenseR block comprises split and concatenation operations on residual and dense components via their corresponding skip connections. These building blocks are stacked into deep networks to exploit complementary information across different receptive field sizes and growth rates. Extensive experiments have been conducted on the VoxCeleb1 dataset to evaluate the proposed methods. The SV performance achieved by the proposed Dense-Residual networks significantly outperforms that of the corresponding ResNet, DenseNet, and their fusion, at similar model complexity. |
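The sequential-DenseR idea described in the abstract (dense connectivity among the stacked basic units inside a residual block, with an identity skip around the whole block) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the `unit` function stands in for a convolutional basic unit, the `growth` parameter plays the role of the DenseNet growth rate, and the final projection (restoring the channel count before the residual add) is an assumption about how the dimensions are matched.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(x, growth):
    # Stand-in for a conv basic unit: maps (C, T) features to `growth` new channels.
    w = rng.standard_normal((growth, x.shape[0])) * 0.1
    return np.maximum(w @ x, 0.0)  # ReLU nonlinearity

def sequential_denser_block(x, num_units=3, growth=8):
    # Hypothetical sequential-DenseR block: units inside a residual block are
    # densely connected (each unit sees the concatenation of all earlier
    # outputs), then a 1x1-style projection restores the input channel count
    # so the identity skip connection can be added.
    feats = [x]
    for _ in range(num_units):
        concat = np.concatenate(feats, axis=0)  # dense connectivity
        feats.append(unit(concat, growth))
    concat = np.concatenate(feats, axis=0)
    proj = rng.standard_normal((x.shape[0], concat.shape[0])) * 0.1
    return proj @ concat + x  # residual skip over the whole block

x = rng.standard_normal((16, 10))   # (channels, frames)
y = sequential_denser_block(x)
print(y.shape)                      # same shape as the input: (16, 10)
```

A parallel-DenseR block would instead split the input channels, route one part through a residual path and the other through a dense path, and concatenate the results; the key point either way is that residual and dense information are fused inside each block rather than by ensembling separate networks.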