Paper ID | SPE-3.6 |
Paper Title |
A NEW HIGH QUALITY TRAJECTORY TILING BASED HYBRID TTS IN REAL TIME |
Authors |
Feng-Long Xie, Xin-Hui Li, Wen-Chao Su, Li Lu, Tencent, China; Frank K. Soong, Microsoft, China |
Session | SPE-3: Speech Synthesis 1: Architecture |
Location | Gather.Town |
Session Time | Tuesday, 08 June, 13:00 - 13:45 |
Presentation Time | Tuesday, 08 June, 13:00 - 13:45 |
Presentation | Poster |
Topic |
Speech Processing: [SPE-SYNT] Speech Synthesis and Generation |
Abstract |
This study revisits a trajectory-tiling-based hybrid TTS to improve its synthesis performance. A Transformer encoder combined with an RNN-based decoder exploits a two-level linguistic representation, at both the word level and the Chinese phonetic alphabet letter level, to generate a cogent and smooth speech parameter trajectory. A segment candidate lattice is then constructed by minimizing the log spectral distortion of mel-spectrograms and the RMSE of F0 between the generated trajectory and the candidates. Normalized cross-correlation is used to find the best sequence of “waveform tiles” in the lattice for synthesizing the final speech waveform. Subjective A/B preference tests show that the new hybrid system outperforms our earlier trajectory-tiling hybrid baseline TTS (67% vs. 11%) and a state-of-the-art real-time TTS system built with Tacotron 2 and LPCNet (56% vs. 27%). |
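The two selection stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the equal weighting of the spectral and F0 terms (`w_f0`), the number of candidates kept per segment (`keep`), the fixed overlap length, and the Viterbi-style search over the lattice are all assumptions made for illustration. The abstract states only which distances are used, not how they are combined or searched.

```python
import numpy as np


def log_spectral_distortion(target_mel, cand_mel):
    """Mean over frames of the RMS difference between log-mel frames.

    Both inputs are (frames, bins) arrays assumed to be in the log domain.
    """
    diff = target_mel - cand_mel
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))


def f0_rmse(target_f0, cand_f0):
    """Root-mean-square error between two F0 contours of equal length."""
    return float(np.sqrt(np.mean((target_f0 - cand_f0) ** 2)))


def target_cost(target_mel, target_f0, cand_mel, cand_f0, w_f0=1.0):
    """Combined distance to the generated trajectory (the weighting is an assumption)."""
    return (log_spectral_distortion(target_mel, cand_mel)
            + w_f0 * f0_rmse(target_f0, cand_f0))


def prune_candidates(target_mel, target_f0, candidates, keep=2):
    """Return indices of the `keep` candidates closest to the generated trajectory.

    `candidates` is a list of (mel, f0) pairs for one segment.
    """
    costs = [target_cost(target_mel, target_f0, mel, f0) for mel, f0 in candidates]
    return sorted(range(len(candidates)), key=costs.__getitem__)[:keep]


def normalized_cross_correlation(a, b):
    """Zero-lag normalized cross-correlation of two equal-length waveforms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def best_tile_path(lattice, overlap):
    """Pick one tile per segment, maximizing summed NCC at the joins.

    `lattice` is a list of segments; each segment is a list of 1-D waveforms.
    The join score compares the last `overlap` samples of one tile with the
    first `overlap` samples of the next (dynamic programming, assumed here).
    """
    scores = [0.0] * len(lattice[0])
    backpointers = []
    for prev_tiles, next_tiles in zip(lattice, lattice[1:]):
        new_scores, pointers = [], []
        for nxt in next_tiles:
            joined = [scores[i] + normalized_cross_correlation(p[-overlap:], nxt[:overlap])
                      for i, p in enumerate(prev_tiles)]
            best = int(np.argmax(joined))
            new_scores.append(joined[best])
            pointers.append(best)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tile to recover the full sequence.
    path = [int(np.argmax(scores))]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

In this sketch, `prune_candidates` would be called once per segment to build the lattice, and `best_tile_path` would then choose the tile sequence whose adjacent waveforms correlate best at the joins. A real system would likely also search over a small range of time lags at each join rather than only zero lag.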