Paper ID | SPE-45.5 |
Paper Title | ENCODER-DECODER BASED PITCH TRACKING AND JOINT MODEL TRAINING FOR MANDARIN TONE CLASSIFICATION |
Authors | Hao Huang, Kai Wang, Ying Hu, Xinjiang University, China; Sheng Li, National Institute of Information and Communications Technology, Japan |
Session | SPE-45: Speech Analysis |
Location | Gather.Town |
Session Time: | Thursday, 10 June, 16:30 - 17:15 |
Presentation Time: | Thursday, 10 June, 16:30 - 17:15 |
Presentation | Poster |
Topic | Speech Processing: [SPE-ANLS] Speech Analysis |
Abstract |
We pursue an interpretable pitch tracking model and a jointly trained tone model for Mandarin tone classification. For pitch tracking, existing deep learning based pitch models seldom consider the Viterbi decoding commonly implemented in prevalent hand-designed pitch tracking algorithms. We propose an RNN-based encoder-decoder framework with a gating mechanism that implicitly models both the state cost estimation and the Viterbi back-tracing pass implemented in the RAPT algorithm. We then apply the pitch extractor to a downstream Mandarin tone classification task. The basic motivation is to combine the two conventional components of tone classification (i.e., the pitch extractor and the tone classifier) so that the whole network can be trained simultaneously in an end-to-end fashion. Various cascade methods are evaluated. We carry out pitch extraction and tone classification experiments on a Mandarin continuous speech database to show the superiority of the proposed models. The pitch extraction results show that the proposed pitch tracking model outperforms the DNN-RNN and bi-directional variants. The tone classification results show that the composite model outperforms the traditional cascade tone classification framework, which relies on pitch-related features and a back-end classifier. |
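The abstract describes a cascade of an encoder-decoder pitch tracker and a tone classifier trained end-to-end. The paper itself does not include code here; the following PyTorch sketch is only an illustration of that general structure, not the authors' implementation. All module names, feature dimensions, the number of pitch states, and the use of soft pitch posteriors to keep the cascade differentiable are assumptions for the example.

```python
# Illustrative sketch (assumed design, not the authors' code): a GRU encoder-decoder
# produces per-frame pitch-state posteriors, and a tone classifier consumes those
# posteriors, so the whole cascade can be trained jointly end-to-end.
import torch
import torch.nn as nn


class EncoderDecoderPitchTracker(nn.Module):
    """Frame-level pitch-state tagger: the encoder summarizes acoustic frames and a
    gated RNN decoder plays a role analogous to the Viterbi back-tracing pass."""

    def __init__(self, feat_dim=40, hidden=128, num_pitch_states=68):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_pitch_states)  # quantized F0 states + unvoiced (assumed)

    def forward(self, frames):                 # frames: (B, T, feat_dim)
        enc, h = self.encoder(frames)          # per-frame encodings, final hidden state
        dec, _ = self.decoder(enc, h)          # decode conditioned on the encoder summary
        return self.out(dec)                   # (B, T, num_pitch_states) logits


class JointToneClassifier(nn.Module):
    """Pitch tracker and tone classifier combined into one trainable network."""

    def __init__(self, feat_dim=40, num_pitch_states=68, num_tones=5):
        super().__init__()
        self.pitch = EncoderDecoderPitchTracker(feat_dim, num_pitch_states=num_pitch_states)
        self.tone_rnn = nn.GRU(num_pitch_states, 64, batch_first=True)
        self.tone_out = nn.Linear(64, num_tones)   # 4 lexical tones + neutral tone (assumed)

    def forward(self, frames):
        pitch_logits = self.pitch(frames)
        pitch_post = pitch_logits.softmax(-1)      # soft posteriors keep the cascade differentiable
        _, h = self.tone_rnn(pitch_post)
        return pitch_logits, self.tone_out(h[-1])  # per-frame pitch logits, per-segment tone logits


if __name__ == "__main__":
    model = JointToneClassifier()
    x = torch.randn(2, 120, 40)                    # 2 utterances, 120 frames, 40-dim features (assumed)
    pitch_logits, tone_logits = model(x)
    print(pitch_logits.shape, tone_logits.shape)   # torch.Size([2, 120, 68]) torch.Size([2, 5])
```

In such a setup, the pitch branch could be supervised with frame-level pitch-state targets while the tone branch is supervised with tone labels, and both losses backpropagate through the shared pitch tracker; the specific loss weighting and cascade variants evaluated in the paper are not reproduced here.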