Paper ID | SPE-30.5 | ||
Paper Title | How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers? | ||
Authors | Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Junichi Yamagishi, National Institute of Informatics, Japan | ||
Session | SPE-30: Speech Processing 2: General Topics | ||
Location | Gather.Town | ||
Session Time | Wednesday, 09 June, 16:30 - 17:15 | ||
Presentation Time | Wednesday, 09 June, 16:30 - 17:15 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation | ||
Abstract | We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy), with the goal of speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluation methodology using synthesized rakugo speech and real rakugo speech uttered by professional performers of three different ranks. The naturalness of the synthesized speech was comparable to that of the human speech, but the synthesized speech entertained listeners less than the performers of any rank. However, we obtained some interesting insights into the challenges that must be solved to achieve a truly entertaining rakugo synthesizer. For example, naturalness was not the most important factor, even though it is generally emphasized as the most important evaluation criterion in conventional speech synthesis. More important factors were the understandability of the content and the distinguishability of the characters in the rakugo story, in both of which the synthesized rakugo speech was relatively inferior to the professional performers. We also found that fundamental frequency (fo) modeling should be further improved to better entertain audiences. These results indicate important steps toward authentically entertaining speech synthesis. |