2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID SPE-3.6
Paper Title A NEW HIGH QUALITY TRAJECTORY TILING BASED HYBRID TTS IN REAL TIME
Authors Feng-Long Xie, Xin-Hui Li, Wen-Chao Su, Li Lu, Tencent, China; Frank K. Soong, Microsoft, China
Session SPE-3: Speech Synthesis 1: Architecture
Location Gather.Town
Session Time: Tuesday, 08 June, 13:00 - 13:45
Presentation Time: Tuesday, 08 June, 13:00 - 13:45
Presentation Poster
Topic Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract This study revisits a trajectory-tiling-based hybrid TTS system to improve its synthesis performance. An architecture combining a Transformer encoder with an RNN-based decoder exploits a two-level linguistic representation, at both the word level and the Chinese phonetic alphabet letter level, to generate a cogent and smooth speech parameter trajectory. A segment candidate lattice is then constructed by minimizing the log spectral distortion of mel-spectrograms and the RMSE of F0 between the generated trajectory and the candidates. Normalized cross-correlation is used to find the best sequence of “waveform tiles” in the lattice for synthesizing the final speech waveforms. Subjective A/B preference tests show that the new hybrid system outperforms both our earlier trajectory-tiling hybrid baseline TTS (67% vs 11%) and a state-of-the-art real-time TTS system built with Tacotron 2 and LPCNet (56% vs 27%).
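The three measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no weights, window sizes, or search ranges, so the function names, the `window` and `max_lag` parameters, and the assumption that mel-spectrogram frames are already in dB are all hypothetical choices made for illustration.

```python
import math

def f0_rmse(f0_a, f0_b):
    """Root-mean-square error between two equal-length F0 contours (Hz)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f0_a, f0_b)) / len(f0_a))

def log_spectral_distortion(mel_db_a, mel_db_b):
    """Mean over frames of the RMS difference between log-mel frames.

    Assumption: each frame is a list of mel-band magnitudes already in dB.
    """
    per_frame = [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))
        for fa, fb in zip(mel_db_a, mel_db_b)
    ]
    return sum(per_frame) / len(per_frame)

def normalized_cross_correlation(x, y):
    """NCC of two equal-length sample windows; 0.0 if either window is silent."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0 else 0.0

def best_concat_lag(left_tail, right_head, window, max_lag):
    """Find the lag into right_head whose window best matches the end of left_tail.

    Hypothetical helper: a higher NCC suggests a smoother join between two tiles.
    """
    ref = left_tail[-window:]
    best_lag, best_score = 0, -2.0  # NCC is never below -1
    for lag in range(max_lag + 1):
        cand = right_head[lag:lag + window]
        if len(cand) < window:
            break
        score = normalized_cross_correlation(ref, cand)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

In a system of this kind, the first two functions would plausibly score how well each candidate segment follows the generated trajectory when building the lattice, while the NCC search would score how well adjacent tiles join. How the scores are weighted and combined is left open here because the abstract does not say.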