Paper ID | SPE-11.3
Paper Title | Non-Parallel Many-to-Many Voice Conversion Using Local Linguistic Tokens
Authors | Chao Wang, Yibiao Yu, Soochow University, China
Session | SPE-11: Voice Conversion 1: Non-parallel Conversion
Location | Gather.Town
Session Time | Tuesday, 08 June, 16:30 - 17:15
Presentation Time | Tuesday, 08 June, 16:30 - 17:15
Presentation | Poster
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract | VQ-VAE-based models have recently received increasing attention for non-parallel many-to-many voice conversion, where the encoder extracts speaker-invariant linguistic content from the input speech via vector quantization and the decoder generates the target speech from the encoder output, conditioned on the target speaker representation. However, it is challenging for the encoder to strike a proper balance between removing speaker information and preserving linguistic content, which degrades the quality of the converted speech. To address this issue, we propose the Local Linguistic Tokens (LLTs) model, which learns high-quality speaker-invariant linguistic embeddings with a multi-head attention module, an approach that has shown great success in extracting speaking-style embeddings in Global Style Tokens (GSTs). By replacing vector quantization, the multi-head attention module lets the encoder preserve more linguistic content and thereby improves the quality of the converted speech. Both objective and subjective experimental results show that, compared with the state-of-the-art VQ-VAE model, the proposed LLTs model achieves significantly better speech quality and comparable speaker similarity. Converted samples are available online for listening.
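The abstract describes the core mechanism only at a high level: a bank of learnable token embeddings that each encoder frame attends over via multi-head attention, as a soft, differentiable replacement for the hard VQ codebook lookup. Below is a minimal PyTorch sketch of such a GST-style token layer; the module name `LocalLinguisticTokens`, all dimensions, and the `tanh` applied to the token bank (a convention borrowed from the GST paper) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LocalLinguisticTokens(nn.Module):
    """Hedged sketch of a GST-style token layer applied per frame.

    A bank of learnable "linguistic token" embeddings is attended over by
    each encoder frame via multi-head attention, in place of the hard
    vector-quantization bottleneck of VQ-VAE. All sizes are illustrative.
    """

    def __init__(self, num_tokens=64, token_dim=128, query_dim=128, num_heads=4):
        super().__init__()
        # Learnable token bank (analogous to the GST style tokens).
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Queries are encoder frames; keys/values are the tokens.
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=num_heads,
            kdim=token_dim, vdim=token_dim, batch_first=True)

    def forward(self, enc_out):
        # enc_out: (batch, frames, query_dim) encoder outputs.
        batch = enc_out.size(0)
        # tanh-squashed token bank, broadcast across the batch (GST convention).
        tokens = torch.tanh(self.tokens).unsqueeze(0).expand(batch, -1, -1)
        # Each frame attends over the token bank, yielding a per-frame
        # (hence "local") linguistic embedding instead of a global one.
        out, weights = self.attn(query=enc_out, key=tokens, value=tokens)
        return out, weights

# Usage: 2 utterances of 100 frames each, 128-dim encoder features.
llt = LocalLinguisticTokens()
emb, w = llt(torch.randn(2, 100, 128))  # emb: (2, 100, 128), w: (2, 100, 64)
```

Because the attention weights are soft rather than a one-hot codebook assignment, each frame's embedding is a learned mixture of tokens, which is one plausible reading of how such a layer could retain more linguistic detail than hard vector quantization.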