IEEE ICASSP 2021 || Toronto, Ontario, Canada || 6-11 June 2021

Paper Detail

Paper ID

SPE-34.6

Paper Title

INVESTIGATION OF FAST AND EFFICIENT METHODS FOR MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION

Authors

Yibin Zheng, Xinhui Li, Li Lu, Tencent Inc, China

Session

SPE-34: Speech Synthesis 6: Data Augmentation & Adaptation

Location

Gather.Town

Session Time:

Thursday, 10 June, 13:00 - 13:45

Presentation Time:

Thursday, 10 June, 13:00 - 13:45

Presentation

Poster

Topic

Speech Processing: [SPE-SYNT] Speech Synthesis and Generation

Abstract

In this paper, we propose a novel method for fast and efficient few-shot TTS that disentangles linguistic and speaker representations. Specifically, an adversarial training strategy is first employed to remove speaker information from the linguistic representations. The speaker representations are then extracted from audio signals by a speaker encoder with a random sampling mechanism and a speaker classifier, aiming to extract speaker embedding features that are independent of content information (such as prosody and style). Meanwhile, for faster and more efficient adaptation, we further introduce prior alignment knowledge between the text and audio pairs and propose a multi-alignment guided attention to help the attention learning. Experimental results show that, when adapting to new speakers with 20 utterances, the proposed method not only generates speech with higher quality and speaker similarity, with average absolute MOS improvements of 0.26 and 0.30 respectively, but also converges much faster and more efficiently. Moreover, for a premium voice with sufficient training data, we achieve a MOS of 4.45, which outperforms a single-speaker model at 4.23.
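
The abstract does not give the formulation of the multi-alignment guided attention, so the following is only a minimal sketch of the general idea of guiding attention with a prior alignment, not the authors' method. It assumes a single prior alignment given as per-token frame counts and a Gaussian-shaped penalty; the function names, the `sigma` parameter, and the penalty shape are all hypothetical choices made for illustration.

```python
import math


def prior_centers_from_durations(durations):
    """Expand per-token frame counts into one expected token index per frame.

    Assumption: the prior alignment is available as a duration per input token.
    """
    centers = []
    for token_index, n_frames in enumerate(durations):
        centers.extend([token_index] * n_frames)
    return centers


def guided_attention_loss(attention, centers, sigma=1.0):
    """Mean attention weight, scaled by its distance from the prior alignment.

    attention: one row per decoder frame, each row a list of weights over tokens
    centers:   expected token index for each decoder frame
    sigma:     hypothetical tolerance (in tokens) around the prior alignment
    """
    if len(attention) != len(centers):
        raise ValueError("need one prior center per decoder frame")
    total = 0.0
    count = 0
    for row, center in zip(attention, centers):
        for token_index, weight in enumerate(row):
            # Penalty is 0 on the prior alignment and approaches 1 far from it.
            distance_sq = (token_index - center) ** 2
            penalty = 1.0 - math.exp(-distance_sq / (2.0 * sigma ** 2))
            total += weight * penalty
            count += 1
    return total / count


# Two tokens: the first spans two frames, the second spans one frame.
centers = prior_centers_from_durations([2, 1])
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
loss_aligned = guided_attention_loss(aligned, centers)
loss_misaligned = guided_attention_loss(misaligned, centers)
```

In this toy case the attention that follows the prior alignment incurs no penalty, while the misaligned attention is penalized. A term of this kind would typically be added to the main training loss so that attention is pushed toward the known alignment early in training.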

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information
