2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper IDSPE-1.5
Paper Title Phoneme based Neural Transducer for Large Vocabulary Speech Recognition
Authors Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney, RWTH Aachen University, Germany
SessionSPE-1: Speech Recognition 1: Neural Transducer Models 1
LocationGather.Town
Session Time:Tuesday, 08 June, 13:00 - 13:45
Presentation Time:Tuesday, 08 June, 13:00 - 13:45
Presentation Poster
Topic Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search
IEEE Xplore Open Preview  Click here to view in IEEE Xplore
Virtual Presentation  Click here to watch in the Virtual Conference
Abstract To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.