Paper ID | SPE-47.3 | ||
Paper Title | EXTENDING PARROTRON: AN END-TO-END, SPEECH CONVERSION AND SPEECH RECOGNITION MODEL FOR ATYPICAL SPEECH | ||
Authors | Rohan Doshi, Youzheng Chen, Liyang Jiang, Xia Zhang, Fadi Biadsy, Bhuvana Ramabhadran, Fang Chu, Andrew Rosenberg, Google, United States; Pedro J. Moreno, Google Inc., United States | ||
Session | SPE-47: Speech Recognition 17: Speech Adaptation and Normalization | ||
Location | Gather.Town | ||
Session Time | Friday, 11 June, 11:30 - 12:15 | ||
Presentation Time | Friday, 11 June, 11:30 - 12:15 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-ADAP] Speech Adaptation/Normalization | ||
Abstract | We present an extended Parrotron model: a single, end-to-end network that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in a target vocabulary. We study the performance of this novel architecture, which jointly predicts speech and text, on atypical (e.g. dysarthric) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield a 77% relative reduction in Word Error Rate (WER), measured by ASR performance on the converted speech. We also show that data augmentation using a customized synthesizer built on atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show how these methods generalize across 8 types of atypical speech for a range of speech impairment severities. |
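The abstract reports its results as relative reductions in Word Error Rate. As a minimal sketch of how such a figure is typically computed, the Python below assumes the standard definition of WER (word-level edit distance divided by the number of reference words). The function names `word_error_rate` and `relative_reduction` and the example sentences and numbers are hypothetical illustrations. They are not taken from the paper and do not reproduce its evaluation pipeline, which measures WER by running ASR on the converted speech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by the adapted model."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    wer = word_error_rate("turn on the light", "turn of the lights")
    print(f"WER: {wer:.2f}")

    # Hypothetical WER values, not results from the paper.
    print(f"Relative reduction: {relative_reduction(0.50, 0.25):.0%}")
```

Under this definition, a relative reduction is stated against the baseline model's WER, so the same relative figure can correspond to very different absolute changes depending on the starting error rate.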