Paper ID | AUD-7.2
Paper Title | Surrogate Source Model Learning for Determined Source Separation
Authors | Robin Scheibler, Masahito Togami (LINE Corporation, Japan)
Session | AUD-7: Audio and Speech Source Separation 3: Deep Learning |
Location | Gather.Town |
Session Time | Wednesday, 09 June, 13:00 - 13:45
Presentation Time | Wednesday, 09 June, 13:00 - 13:45
Presentation | Poster
Topic | Audio and Acoustic Signal Processing: [AUD-SEP] Audio and Speech Source Separation
Abstract |
We propose to learn surrogate functions of universal speech priors for determined blind speech separation. Deep speech priors are highly desirable due to their superior modelling power, but they are not compatible with state-of-the-art independent vector analysis based on majorization-minimization (AuxIVA), since deriving the required surrogate function is difficult and not always possible. Instead, we do away with exact majorization and directly approximate the surrogate. Taking advantage of iterative source steering (ISS) updates, we backpropagate the permutation-invariant separation loss through multiple iterations of AuxIVA. ISS lends itself well to this task owing to its low complexity and absence of matrix inversion. Experiments show large improvements in scale-invariant signal-to-distortion ratio (SI-SDR) and word error rate (WER) compared to baseline methods. Training is done on mixtures of two speakers, and we experiment with two losses, SI-SDR and coherence. We find that the learnt approximate surrogate generalizes well to mixtures of three and four speakers without any modification. We also demonstrate generalization to a different variant of the AuxIVA update equations. The SI-SDR loss leads to the fastest convergence in iterations, while the coherence loss leads to the lowest WER. We obtain as much as a 36% reduction in WER.
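The abstract's claim that ISS avoids matrix inversion can be illustrated with a minimal sketch of one ISS pass for a single frequency bin, written in NumPy. This is not the paper's implementation: the function name `iss_update` is mine, and the source-model weights `phi` are shown with a fixed Laplace-prior choice, whereas in the paper a learned surrogate model would supply them. Each source is subtracted from all channels via scalar gains (rank-1 updates), so no matrix is ever inverted.

```python
import numpy as np

def iss_update(Y, phi):
    """One pass of iterative source steering (ISS) rank-1 updates.

    Y   : (n_src, n_frames) complex separated signals, one frequency bin
    phi : (n_src, n_frames) non-negative source-model weights; here they
          could be 1 / |Y| for a Laplace prior (the paper would instead
          obtain them from a learned surrogate model -- an assumption
          of this sketch)
    Returns the updated separated signals. No matrix inversion occurs:
    source s is removed from every channel k by a scalar gain v[k].
    """
    n_src, _ = Y.shape
    Y = Y.copy()
    eps = 1e-10
    for s in range(n_src):
        # weighted correlation of each channel k with source s
        num = (phi * Y * Y[s].conj()).mean(axis=1)      # shape (n_src,)
        # weighted power of source s under each channel's weights
        den = (phi * np.abs(Y[s]) ** 2).mean(axis=1)    # shape (n_src,)
        v = num / np.maximum(den, eps)
        # channel s itself is rescaled to unit weighted power
        v[s] = 1.0 - 1.0 / np.sqrt(np.maximum(den[s], eps))
        # rank-1 update: subtract v[k] * Y[s] from every channel k
        Y = Y - np.outer(v, Y[s])
    return Y

# Usage sketch on random data standing in for STFT frames
rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 64)) + 1j * rng.standard_normal((3, 64))
phi = 1.0 / np.maximum(np.abs(Y), 1e-10)  # Laplace-prior weights
Z = iss_update(Y, phi)
```

Because every update is elementwise or an outer product, the whole pass is differentiable, which is what allows the separation loss to be backpropagated through multiple AuxIVA iterations as the abstract describes.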