Paper ID | SPE-11.3
Paper Title | Non-Parallel Many-to-Many Voice Conversion Using Local Linguistic Tokens
Authors | Chao Wang, Yibiao Yu, Soochow University, China
Session | SPE-11: Voice Conversion 1: Non-parallel Conversion
Location | Gather.Town
Session Time | Tuesday, 08 June, 16:30 - 17:15
Presentation Time | Tuesday, 08 June, 16:30 - 17:15
Presentation | Poster
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract | VQ-VAE-based models have recently received increasing attention for non-parallel many-to-many voice conversion, where the encoder extracts speaker-invariant linguistic content from the input speech via vector quantization and the decoder generates the target speech from the encoder output, conditioned on the target speaker representation. However, it is challenging for the encoder to strike a proper balance between removing speaker information and preserving linguistic content, which degrades the quality of the converted speech. To address this issue, we propose the Local Linguistic Tokens (LLTs) model, which learns high-quality speaker-invariant linguistic embeddings with a multi-head attention module, an approach that has shown great success in extracting speaking-style embeddings in Global Style Tokens (GSTs). By replacing vector quantization, the multi-head attention module lets the encoder preserve more linguistic content and thereby improves the quality of the converted speech. Both objective and subjective experimental results show that, compared with the state-of-the-art VQ-VAE model, the proposed LLTs model achieves significantly better speech quality and comparable speaker similarity. Converted samples are available online for listening.
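The abstract describes the core mechanism only at a high level: a bank of learnable token embeddings that each encoder frame attends over via multi-head attention, as a soft, differentiable replacement for the hard VQ codebook lookup. Below is a minimal PyTorch sketch of such a GST-style token layer; the module name `LocalLinguisticTokens`, all dimensions, and the `tanh` applied to the token bank (a convention borrowed from the GST paper) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LocalLinguisticTokens(nn.Module):
    """Hedged sketch of a GST-style token layer applied per frame.

    A bank of learnable "linguistic token" embeddings is attended over by
    each encoder frame via multi-head attention, in place of the hard
    vector-quantization bottleneck of VQ-VAE. All sizes are illustrative.
    """

    def __init__(self, num_tokens=64, token_dim=128, query_dim=128, num_heads=4):
        super().__init__()
        # Learnable token bank (analogous to the GST style tokens).
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Queries are encoder frames; keys/values are the tokens.
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=num_heads,
            kdim=token_dim, vdim=token_dim, batch_first=True)

    def forward(self, enc_out):
        # enc_out: (batch, frames, query_dim) encoder outputs.
        batch = enc_out.size(0)
        # tanh-squashed token bank, broadcast across the batch (GST convention).
        tokens = torch.tanh(self.tokens).unsqueeze(0).expand(batch, -1, -1)
        # Each frame attends over the token bank, yielding a per-frame
        # (hence "local") linguistic embedding instead of a global one.
        out, weights = self.attn(query=enc_out, key=tokens, value=tokens)
        return out, weights

# Usage: 2 utterances of 100 frames each, 128-dim encoder features.
llt = LocalLinguisticTokens()
emb, w = llt(torch.randn(2, 100, 128))  # emb: (2, 100, 128), w: (2, 100, 64)
```

Because the attention weights are soft rather than a one-hot codebook assignment, each frame's embedding is a learned mixture of tokens, which is one plausible reading of how such a layer could retain more linguistic detail than hard vector quantization.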