Paper ID | SPE-5.1 | ||
Paper Title | Dual-Path Modeling for Long Recording Speech Separation in Meetings | ||
Authors | Chenda Li, Shanghai Jiao Tong University, China; Zhuo Chen, Microsoft, United States; Yi Luo, Cong Han, Columbia University, United States; Tianyan Zhou, Microsoft, United States; Keisuke Kinoshita, Marc Delcroix, NTT Corporation, Japan; Shinji Watanabe, Johns Hopkins University, United States; Yanmin Qian, Shanghai Jiao Tong University, China | ||
Session | SPE-5: Speech Enhancement 1: Speech Separation | ||
Location | Gather.Town | ||
Session Time: | Tuesday, 08 June, 14:00 - 14:45 | ||
Presentation Time: | Tuesday, 08 June, 14:00 - 14:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-ENHA] Speech Enhancement and Separation | ||
Abstract | Continuous speech separation (CSS) is the task of separating speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could remedy this problem, thanks to its ability to jointly model cross-window dependencies and local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real recorded multi-talker dataset, and a consistent WER reduction is observed in the ASR evaluation of the separated speech. In addition, a dual-path transformer equipped with convolutional layers is proposed. It reduces the amount of computation by 30% while achieving a better WER. Furthermore, online-processing dual-path models are investigated, showing a 10% relative WER reduction compared to the baseline. |
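
The windowed baseline mentioned in the abstract, in which the long recording is cut into windows of equal size and each window is processed separately, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `segment_fixed_windows`, the `window` and `hop` parameters, and the zero-padding of the final window are all assumptions made for this example.

```python
from typing import List, Sequence


def segment_fixed_windows(
    samples: Sequence[float], window: int, hop: int
) -> List[List[float]]:
    """Split a long recording into fixed-size windows (hypothetical helper).

    Consecutive windows start `hop` samples apart, so they overlap when
    hop < window. The last window is zero-padded to the full window size
    (an assumption of this sketch).
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")

    chunks: List[List[float]] = []
    start = 0
    while True:
        chunk = list(samples[start:start + window])
        # Pad the final, possibly shorter, window with zeros.
        chunk.extend([0.0] * (window - len(chunk)))
        chunks.append(chunk)
        # Stop once this window reaches the end of the recording.
        if start + window >= len(samples):
            break
        start += hop
    return chunks


# Toy "recording" of 10 samples, windows of 4 samples, hop of 2 samples.
recording = [float(i) for i in range(10)]
windows = segment_fixed_windows(recording, window=4, hop=2)
```

In this baseline, a separation model would be applied to each entry of `windows` independently. That independence is the limitation the abstract points to: no information flows between windows, which is what the cross-window path of a dual-path model is meant to provide.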