Paper ID | SPE-3.6 | ||
Paper Title | A NEW HIGH QUALITY TRAJECTORY TILING BASED HYBRID TTS IN REAL TIME | ||
Authors | Feng-Long Xie, Xin-Hui Li, Wen-Chao Su, Li Lu, Tencent, China; Frank K. Soong, Microsoft, China | ||
Session | SPE-3: Speech Synthesis 1: Architecture | ||
Location | Gather.Town | ||
Session Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation | ||
Abstract | This study revisits a trajectory-tiling-based hybrid TTS system to improve its synthesis performance. A Transformer encoder combined with an RNN-based decoder exploits a two-level linguistic representation, at the word level and the Chinese phonetic alphabet letter level, to generate a cogent and smooth speech parameter trajectory. A segment candidate lattice is then constructed by minimizing the log spectral distortion of mel-spectrograms and the RMSE of F0 between the generated trajectory and the candidates. Normalized cross-correlation is used to find the best sequence of “waveform tiles” in the lattice for synthesizing the final speech waveforms. Subjective A/B preference tests show that the new hybrid system outperforms our earlier trajectory-tiling hybrid baseline TTS (67% vs 11%) and the state-of-the-art real-time TTS system built with Tacotron 2 and LPCNet (56% vs 27%). |
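
The three measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the assumption that mel-spectrograms are already log-magnitude values stored as frames × bins lists, the per-frame RMS form of the log spectral distortion, and the lag search used to join two waveform tiles are all hypothetical choices made for illustration.

```python
import math


def log_spectral_distortion(mel_a, mel_b):
    """Mean over frames of the RMS difference between two log-mel
    spectrograms (assumed: same shape, values already in the log domain)."""
    if len(mel_a) != len(mel_b):
        raise ValueError("spectrograms must have the same number of frames")
    per_frame = []
    for frame_a, frame_b in zip(mel_a, mel_b):
        sq = sum((x - y) ** 2 for x, y in zip(frame_a, frame_b))
        per_frame.append(math.sqrt(sq / len(frame_a)))
    return sum(per_frame) / len(per_frame)


def f0_rmse(f0_a, f0_b):
    """Root-mean-square error between two equal-length F0 contours."""
    if len(f0_a) != len(f0_b):
        raise ValueError("F0 contours must have the same length")
    sq = sum((x - y) ** 2 for x, y in zip(f0_a, f0_b))
    return math.sqrt(sq / len(f0_a))


def normalized_cross_correlation(x, y):
    """Zero-lag normalized cross-correlation of two equal-length signals."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0 else 0.0


def best_concat_lag(tail, head, max_lag):
    """Hypothetical helper: slide the end of one tile (tail) along the start
    of the next tile (head) and return the (lag, score) with the highest
    normalized cross-correlation."""
    n = len(tail)
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        segment = head[lag:lag + n]
        if len(segment) < n:
            break  # not enough samples left in head for a full overlap
        score = normalized_cross_correlation(tail, segment)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

In a sketch like this, the two distance functions would score each candidate segment against the generated trajectory when building the lattice, while the lag search would score how well adjacent candidates join. How the paper weights or combines these scores is not stated in the abstract and is left open here.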