Paper ID | SPE-17.6 |
Paper Title |
Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals |
Authors |
Meng Ge, Tianjin University, China; Chenglin Xu, National University of Singapore, Singapore; Longbiao Wang, Tianjin University, China; Eng Siong Chng, Nanyang Technological University, Singapore; Jianwu Dang, Tianjin University, China; Haizhou Li, National University of Singapore, Singapore |
Session | SPE-17: Speech Enhancement 3: Target Speech Extraction |
Location | Gather.Town |
Session Time: | Wednesday, 09 June, 14:00 - 14:45 |
Presentation Time: | Wednesday, 09 June, 14:00 - 14:45 |
Presentation | Poster |
Topic |
Speech Processing: [SPE-ENHA] Speech Enhancement and Separation |
Abstract |
Speaker extraction uses a pre-recorded speech sample of the target speaker as the reference signal. In real-world applications, enrolling a speaker with a long recording is not practical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample: the speech extracted in early stages serves as the reference speech for later stages. Furthermore, for the first time, we use a frame-level sequential speech embedding as the reference for the target speaker, a departure from the traditional utterance-level speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy and reverberant versions (WHAM! and WHAMR!) show that the proposed SpEx++ consistently outperforms other state-of-the-art baselines. |
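The abstract's signal fusion scheme combines decoded signals from multiple scales using automatically learned weights. The sketch below illustrates one plausible form of such a fusion: a softmax-normalized weighted sum, where the weight logits stand in for learnable parameters. The function name, the softmax normalization, and the toy signals are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_multiscale(signals, logits):
    """Illustrative fusion of decoded signals from multiple scales.

    signals: list of 1-D arrays of equal length, one per decoding scale.
    logits:  unnormalized fusion weights (learned parameters in practice;
             softmax normalization is an assumption made for this sketch).
    """
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()  # softmax: weights are positive and sum to 1
    # Weighted sum across scales, sample by sample
    return sum(wi * s for wi, s in zip(w, signals))

# Toy usage: three "scales" decoding the same 4-sample signal
s = np.array([1.0, 2.0, 3.0, 4.0])
fused = fuse_multiscale([s, s, s], logits=np.zeros(3))
# Equal logits give equal weights, so the fused output equals each input.
```

In a trained model the logits would be network parameters optimized jointly with the extraction objective, letting the model decide how much each scale contributes.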