Paper ID | CHLG-3.3
Paper Title | DIAN: DURATION INFORMED AUTO-REGRESSIVE NETWORK FOR VOICE CLONING
Authors | Wei Song, Xin Yuan, Zhengchen Zhang, Chao Zhang, Youzheng Wu, Xiaodong He, Bowen Zhou, JD Technology Group, China
Session | CHLG-3: Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC)
Location | Zoom
Session Time | Monday, 07 June, 15:30 - 17:45
Presentation Time | Monday, 07 June, 15:30 - 17:45
Presentation | Poster
Topic | Grand Challenge: Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC)
Abstract | In this paper, we propose a novel end-to-end speech synthesis approach, the Duration Informed Auto-regressive Network (DIAN), which consists of an acoustic model and a separate duration model. Unlike other auto-regressive TTS methods, phoneme duration information is provided as part of the input to the acoustic model, which allows the attention mechanism between its encoder and decoder to be removed. This eliminates the commonly seen skipping and repeating issues and improves speech intelligibility while maintaining high speech quality. A Transformer-based duration model is used to predict phoneme durations for the attention-free acoustic model. We developed our TTS systems for M2VoC using the proposed DIAN approach. In our procedure, a multi-speaker attention-free acoustic model and its Transformer-based duration model are first trained separately on the training data released by M2VoC. Next, the multi-speaker models are adapted into speaker-specific models using the speaker-dependent data and transfer learning. Finally, a speaker-specific LPCNet is estimated and used to synthesize the speech of the corresponding speaker. The M2VoC results showed that our proposed approach ranked 3rd in speech quality and 4th in both speaker similarity and style similarity in the Track 1-a task.
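
To make the attention-free design concrete, below is a minimal sketch of the duration-informed expansion step implied by the abstract: each phoneme encoding is repeated for its predicted number of frames, so the autoregressive decoder receives an explicit per-frame input and no longer needs attention to learn the phoneme-to-frame alignment. The function name, tensor shapes, and PyTorch usage here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of duration-informed expansion (assumed names/shapes),
# illustrating how provided durations can replace encoder-decoder attention.
import torch


def expand_by_duration(phoneme_encodings: torch.Tensor,
                       durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding for its number of frames.

    phoneme_encodings: (num_phonemes, hidden_dim) encoder outputs
    durations:         (num_phonemes,) integer frame counts per phoneme
    returns:           (total_frames, hidden_dim) frame-aligned decoder inputs
    """
    # repeat_interleave aligns each phoneme vector with its output frames,
    # giving the autoregressive decoder one input per frame it must predict.
    return torch.repeat_interleave(phoneme_encodings, durations, dim=0)


# Toy usage: 3 phonemes with hidden size 4 and durations of 2/1/3 frames.
enc = torch.randn(3, 4)
dur = torch.tensor([2, 1, 3])
frames = expand_by_duration(enc, dur)
assert frames.shape == (6, 4)
```

At inference time, the durations would come from the Transformer-based duration model described in the abstract; during training they can be taken from forced alignments, which is a common choice in duration-informed TTS.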