2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SPE-5.1
Paper Title	Dual-Path Modeling for Long Recording Speech Separation in Meetings
Authors	Chenda Li, Shanghai Jiao Tong University, China; Zhuo Chen, Microsoft, United States; Yi Luo, Cong Han, Columbia University, United States; Tianyan Zhou, Microsoft, United States; Keisuke Kinoshita, Marc Delcroix, NTT Corporation, Japan; Shinji Watanabe, Johns Hopkins University, United States; Yanmin Qian, Shanghai Jiao Tong University, China
Session	SPE-5: Speech Enhancement 1: Speech Separation
Location	Gather.Town
Session Time	Tuesday, 08 June, 14:00 - 14:45
Presentation Time	Tuesday, 08 June, 14:00 - 14:45
Presentation	Poster
Topic	Speech Processing: [SPE-ENHA] Speech Enhancement and Separation
Abstract	Continuous speech separation (CSS) is the task of separating the speech sources in a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model the long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could remedy this problem, thanks to its ability to jointly model the cross-window dependency and the local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real-recorded multi-talker dataset, and a consistent WER reduction is observed in the ASR evaluation of the separated speech. A dual-path transformer equipped with convolutional layers is also proposed. It reduces the computation by 30% while achieving a better WER. Furthermore, online-processing dual-path models are investigated, which show a 10% relative WER reduction compared to the baseline.
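
The windowing and dual-path idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`segment`, `local_path`, `global_path`, `dual_path_block`), the window and hop sizes, and the mean-based operations that stand in for the learned layers are all assumptions chosen only to show how processing alternates between the within-window axis and the cross-window axis.

```python
from typing import List

Chunks = List[List[float]]


def segment(signal: List[float], window: int, hop: int) -> Chunks:
    """Split a long recording into fixed-size windows, zero-padding the tail."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    chunks: Chunks = []
    start = 0
    while True:
        chunk = signal[start:start + window]
        chunk = chunk + [0.0] * (window - len(chunk))  # pad the last window
        chunks.append(chunk)
        if start + window >= len(signal):
            break
        start += hop
    return chunks


def local_path(chunks: Chunks) -> Chunks:
    """Hypothetical stand-in for local-window modeling.

    Each window is processed on its own (here: its mean is removed),
    so no information crosses window boundaries.
    """
    return [[x - sum(c) / len(c) for x in c] for c in chunks]


def global_path(chunks: Chunks) -> Chunks:
    """Hypothetical stand-in for cross-window modeling.

    For each position inside a window, information is shared across all
    windows (here: the cross-window mean at that position is added).
    """
    n = len(chunks)
    width = len(chunks[0])
    cross = [sum(c[i] for c in chunks) / n for i in range(width)]
    return [[x + cross[i] for i, x in enumerate(c)] for c in chunks]


def dual_path_block(chunks: Chunks) -> Chunks:
    """One dual-path pass: local processing, then cross-window processing."""
    return global_path(local_path(chunks))


recording = [float(i) for i in range(1, 11)]   # toy 10-sample "recording"
windows = segment(recording, window=4, hop=2)  # assumed sizes, 50% overlap
processed = dual_path_block(windows)
```

Applying only `local_path` corresponds to the straightforward per-window extension, where each window is handled in isolation. Adding `global_path` is what lets every window see context from the rest of the recording.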