Paper ID | SPE-53.4 | ||
Paper Title | MULTI-SCALE SPEAKER DIARIZATION WITH NEURAL AFFINITY SCORE FUSION | ||
Authors | Taejin Park, Manoj Kumar, Shrikanth Narayanan, University of Southern California, United States | ||
Session | SPE-53: Speaker Diarization | ||
Location | Gather.Town | ||
Session Time | Friday, 11 June, 13:00 - 13:45 | ||
Presentation Time | Friday, 11 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization | ||
Abstract | Predicting the speaker's identity of short speech segments in human dialogue has been considered one of the most challenging problems in speech signal processing. Speaker representations of short speech segments tend to be unreliable, resulting in poor fidelity of speaker representations in tasks requiring speaker recognition. In this paper, we propose an unconventional method that tackles the trade-off between temporal resolution and the quality of the speaker representations. To find a set of weights that balance the scores from multiple temporal scales of segments, a neural affinity score fusion model is presented. Using the CALLHOME dataset, we show that our proposed multi-scale segmentation and integration approach can achieve a state-of-the-art diarization performance. |
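The score fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says the weights come from a neural model, whereas here they are fixed, hypothetical values. The sketch also assumes that the embeddings from every scale have already been mapped onto a common grid of base segments, so that all affinity matrices have the same shape. The function names (`cosine`, `affinity_matrix`, `fuse_affinities`) and the toy embeddings are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_matrix(embeddings):
    """Pairwise cosine affinities for the segments of one temporal scale."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def fuse_affinities(matrices, weights):
    """Weighted sum of same-shaped affinity matrices.

    Assumption: the weights are given. In the paper they are estimated
    by a neural model; here they are only normalized to sum to 1.
    """
    total = sum(weights)
    n = len(matrices[0])
    return [
        [
            sum((w / total) * m[i][j] for w, m in zip(weights, matrices))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy data (hypothetical): three base segments, embeddings at two scales.
scale_embeddings = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],  # shorter segments, noisier
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # longer segments, smoother
]
weights = [0.4, 0.6]  # hypothetical fixed weights, one per scale

fused = fuse_affinities(
    [affinity_matrix(e) for e in scale_embeddings], weights
)
```

The fused matrix could then be passed to a clustering step to assign speaker labels. Fixed weights are used only to keep the sketch short; the trade-off described in the abstract is what motivates estimating them instead.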