2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information


Technical Program

Paper Detail

Paper ID: SPE-9.4
Paper Title: RECENT DEVELOPMENTS ON ESPNET TOOLKIT BOOSTED BY CONFORMER
Authors: Pengcheng Guo, Northwestern Polytechnical University; Johns Hopkins University, China; Florian Boyer, LaBRI, University of Bordeaux; Airudit, France; Xuankai Chang, Johns Hopkins University, United States; Tomoki Hayashi, Nagoya University; Human Dataware Lab. Co., Ltd., Japan; Yosuke Higuchi, Waseda University, Japan; Hirofumi Inaguma, Kyoto University, Japan; Naoyuki Kamo, NTT Corporation, Japan; Chenda Li, Shanghai Jiao Tong University, China; Daniel Garcia-Romero, Jiatong Shi, Johns Hopkins University, United States; Jing Shi, Institute of Automation, Chinese Academy of Sciences, China and Johns Hopkins University, United States; Shinji Watanabe, Johns Hopkins University, United States; Kun Wei, Northwestern Polytechnical University, China; Wangyou Zhang, Shanghai Jiao Tong University, China; Yuekai Zhang, Johns Hopkins University, United States
Session: SPE-9: Speech Recognition 3: Transformer Models 1
Location: Gather.Town
Session Time: Tuesday, 08 June, 16:30 - 17:15
Presentation Time: Tuesday, 08 June, 16:30 - 17:15
Presentation: Poster
Topic: Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search
Abstract: In this study, we present recent developments on ESPnet, the end-to-end speech processing toolkit, which mainly involve a recently proposed architecture called Conformer, a convolution-augmented Transformer. This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with, or even outperform, the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all of the above tasks, together with pre-trained models. Our aim with this work is to contribute to the research community by reducing the burden of preparing state-of-the-art research environments, which usually require substantial resources.
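
As an illustration of how such released pre-trained models are typically consumed, the following is a minimal sketch using the espnet_model_zoo package and the ESPnet2 ASR inference interface; the model tag and the audio file name are placeholders, not specific models or data from the paper:

# Minimal sketch: run ASR with a pre-trained ESPnet2 Conformer model.
# Assumes `pip install espnet espnet_model_zoo soundfile`; the model tag below
# is a placeholder to be replaced by a published ESPnet Conformer ASR model tag.
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

downloader = ModelDownloader()
speech2text = Speech2Text(
    **downloader.download_and_unpack("<conformer-asr-model-tag>"),  # placeholder tag
    device="cpu",
)

# Most ESPnet ASR recipes expect 16 kHz mono audio.
speech, rate = soundfile.read("sample.wav")
nbests = speech2text(speech)
text, tokens, token_ids, hypothesis = nbests[0]
print(text)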