2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information


Paper Detail

Paper ID: SPE-28.2
Paper Title: AISPEECH-SJTU ASR System for the Accented English Speech Recognition Challenge
Authors: Tian Tan, Aispeech, China; Yizhou Lu, Rao Ma, Shanghai Jiao Tong University, China; Sen Zhu, Jiaqi Guo, Aispeech, China; Yanmin Qian, Shanghai Jiao Tong University, China
Session: SPE-28: Speech Recognition 10: Robustness to Human Speech Variability
Location: Gather.Town
Session Time: Wednesday, 09 June, 16:30 - 17:15
Presentation Time: Wednesday, 09 June, 16:30 - 17:15
Presentation: Poster
Topic: Speech Processing: [SPE-ROBU] Robust Speech Recognition
IEEE Xplore Open Preview: available in IEEE Xplore
Abstract: This paper describes the AISpeech-SJTU ASR system for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC). The task is challenging due to the diversity in pronunciation accuracy, intonation, speaking rate, and the pronunciation of some syllables across accents. All participants were restricted to developing their systems with the speech and text corpora provided by the organizer. To work around the data-scarcity problem, data augmentation was first explored, including noise simulation, SpecAugment, speed perturbation, and TTS-based simulation. Moreover, a state-of-the-art CNN-Transformer-based joint CTC-attention system was built, and accent adaptation was proposed to train an accent-robust system. Finally, the first-pass recognition hypotheses generated from the CTC head were rescored by forward and backward LSTM language models and the attention head. The system with the best configuration achieved second place in the challenge, with a word error rate (WER) of 4.00% on the dev set and 4.47% on the test set; the test-set WERs of the top-performing, second runner-up, and official baseline systems are 4.06%, 4.52%, and 8.29%, respectively.
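Among the augmentation methods the abstract lists, SpecAugment masks random frequency bands and time spans of a log-mel spectrogram during training. The sketch below is a minimal NumPy illustration of that idea only; the mask counts and widths are common LibriSpeech-style defaults, not values reported by the AISpeech-SJTU system, and each mask is drawn with width at least 1 for simplicity.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=100, rng=None):
    """SpecAugment-style masking on a (time, freq) log-mel spectrogram.

    A copy of the input is returned with `num_freq_masks` random
    frequency bands and `num_time_masks` random time spans zeroed out.
    Mask widths are drawn uniformly from [1, max_width]; the exact
    hyperparameters used in the challenge system are assumptions here.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    T, F = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(1, max_freq_width + 1))
        f0 = int(rng.integers(0, max(F - w, 0) + 1))
        out[:, f0:f0 + w] = 0.0          # zero a random frequency band
    for _ in range(num_time_masks):
        w = int(rng.integers(1, min(max_time_width, T) + 1))
        t0 = int(rng.integers(0, max(T - w, 0) + 1))
        out[t0:t0 + w, :] = 0.0          # zero a random time span
    return out

# Example: mask a dummy 500-frame, 80-bin spectrogram.
spec = np.ones((500, 80))
masked = spec_augment(spec, rng=np.random.default_rng(0))
```

In practice the masking is applied on the fly inside the training data loader, so every epoch sees a differently masked copy of each utterance.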