2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Paper Detail

Paper ID SPE-40.1
Paper Title ENSEMBLE COMBINATION BETWEEN DIFFERENT TIME SEGMENTATIONS
Authors Jeremy Heng Meng Wong, Dimitrios Dimitriadis, Kenichi Kumatani, Yashesh Gaur, George Polovets, Partha Parthasarathy, Eric Sun, Jinyu Li, Yifan Gong, Microsoft, United States
Session SPE-40: Speech Recognition 14: Acoustic Modeling 2
Location Gather.Town
Session Time: Thursday, 10 June, 15:30 - 16:15
Presentation Time: Thursday, 10 June, 15:30 - 16:15
Presentation Poster
Topic Speech Processing: [SPE-RECO] Acoustic Modeling for Automatic Speech Recognition
Abstract Hypothesis-level combination between multiple models can often yield gains in speech recognition. However, all models in the ensemble are usually restricted to use the same audio segmentation times. This paper proposes to generalise hypothesis-level combination, allowing the use of different audio segmentation times between the models, by splitting and re-joining the hypothesised N-best lists in time. A hypothesis tree method is also proposed to distribute hypothesis posteriors among the constituent words, to facilitate such splitting when per-word scores are not available. The approach is assessed on a Microsoft meeting transcription task, by performing combination between a streaming first-pass recognition and an offline second-pass recognition. The experimental results show that the proposed approach can yield gains when combining over different segmentation times. Furthermore, the results also show that a combination between a hybrid model and an end-to-end neural network model yields a greater improvement than a combination between two hybrid models.
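
The idea of splitting and re-joining N-best lists in time can be illustrated with a minimal Python sketch. This is not the authors' implementation, and the abstract does not give these details. It assumes that per-word timestamps are available, that a word belongs to the side of the split containing its midpoint, and that the two sides can be re-joined by treating their posteriors as independent. The function names `split_nbest` and `join_nbest` and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Assumed data layout (not from the paper):
#   word       = (text, start_sec, end_sec)
#   hypothesis = (list_of_words, posterior)


def split_nbest(nbest, t):
    """Split every hypothesis at time t and merge duplicates on each side.

    Returns two dicts mapping a tuple of word texts to the summed posterior
    of all hypotheses that share that word sequence on that side.
    """
    left, right = defaultdict(float), defaultdict(float)
    for words, posterior in nbest:
        # Assumption: a word is assigned to the side containing its midpoint.
        l_words = tuple(w[0] for w in words if (w[1] + w[2]) / 2 < t)
        r_words = tuple(w[0] for w in words if (w[1] + w[2]) / 2 >= t)
        left[l_words] += posterior
        right[r_words] += posterior
    return dict(left), dict(right)


def join_nbest(left, right, top_n=None):
    """Re-join two split N-best lists into one ranked list.

    Assumption: the two sides are independent, so the joined posterior is
    the product of the per-side posteriors.
    """
    joined = defaultdict(float)
    for (l_words, l_post), (r_words, r_post) in product(left.items(), right.items()):
        # Different splits can concatenate to the same sequence, so sum them.
        joined[l_words + r_words] += l_post * r_post
    ranked = sorted(joined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical 3-best list for one audio segment.
    nbest = [
        ([("hello", 0.0, 0.4), ("world", 0.5, 0.9)], 0.6),
        ([("hello", 0.0, 0.4), ("word", 0.5, 0.9)], 0.3),
        ([("yellow", 0.0, 0.4), ("world", 0.5, 0.9)], 0.1),
    ]
    left, right = split_nbest(nbest, t=0.45)
    for words, posterior in join_nbest(left, right, top_n=2):
        print(" ".join(words), round(posterior, 3))
```

Once two models' N-best lists have been split at a common set of times, pieces covering the same time span could be combined with any standard hypothesis-level method before re-joining. The choice of split times and the handling of words that straddle a split are left open here.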