2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID: SPE-45.5
Paper Title: ENCODER-DECODER BASED PITCH TRACKING AND JOINT MODEL TRAINING FOR MANDARIN TONE CLASSIFICATION
Authors: Hao Huang, Kai Wang, Ying Hu, Xinjiang University, China; Sheng Li, National Institute of Information and Communications Technology, Japan
Session: SPE-45: Speech Analysis
Location: Gather.Town
Session Time: Thursday, 10 June, 16:30 - 17:15
Presentation Time: Thursday, 10 June, 16:30 - 17:15
Presentation: Poster
Topic: Speech Processing: [SPE-ANLS] Speech Analysis
Abstract: We pursue an interpretable pitch tracking model and a jointly trained tone model for Mandarin tone classification. For pitch tracking, existing deep-learning-based pitch models seldom consider the Viterbi decoding commonly implemented in prevalent hand-designed pitch tracking algorithms. We propose an RNN-based encoder-decoder framework with a gating mechanism that implicitly models both the state cost estimation and the Viterbi back-tracing pass of the RAPT algorithm. We then apply the pitch extractor to a downstream Mandarin tone classification task. The basic motivation is to combine the two conventional components of tone classification (i.e., the pitch extractor and the tone classifier) so that the whole network can be trained simultaneously in an end-to-end fashion. Various cascade methods are evaluated. We carry out pitch extraction and tone classification experiments on a Mandarin continuous speech database to show the superiority of the proposed models. Pitch extraction results show that the proposed pitch tracking model outperforms the DNN-RNN and bi-directional variants. Tone classification results show that the composite model outperforms the traditional cascade framework, which relies on pitch-related features and a back-end classifier.
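For context on the classical component the paper's encoder-decoder emulates: RAPT-style trackers score per-frame pitch candidates with a local (state) cost, penalize frame-to-frame pitch jumps with a transition cost, and recover the best trajectory with a Viterbi back-tracing pass. The sketch below is an illustrative, simplified dynamic program, not the paper's model or the exact RAPT cost functions; the quadratic jump penalty and the `jump_weight` parameter are assumptions made for clarity.

```python
def viterbi_pitch_track(candidates, local_costs, jump_weight=0.01):
    """candidates: per-frame lists of candidate pitch values (Hz);
    local_costs: matching per-frame candidate costs (lower = better).
    Returns the pitch trajectory with minimum total cost."""
    n_frames = len(candidates)
    # Forward pass: cumulative best cost and back-pointers per candidate.
    cum = [list(local_costs[0])]
    back = [[0] * len(candidates[0])]
    for t in range(1, n_frames):
        row_cum, row_back = [], []
        for f, c in zip(candidates[t], local_costs[t]):
            # Pick the predecessor minimizing cumulative + transition cost
            # (illustrative quadratic penalty on pitch jumps).
            best_i, best_cost = 0, float("inf")
            for i, fp in enumerate(candidates[t - 1]):
                cost = cum[t - 1][i] + jump_weight * (f - fp) ** 2
                if cost < best_cost:
                    best_i, best_cost = i, cost
            row_cum.append(best_cost + c)
            row_back.append(best_i)
        cum.append(row_cum)
        back.append(row_back)
    # Back-tracing pass: follow back-pointers from the cheapest final state.
    j = min(range(len(cum[-1])), key=cum[-1].__getitem__)
    path = [candidates[-1][j]]
    for t in range(n_frames - 1, 0, -1):
        j = back[t][j]
        path.append(candidates[t - 1][j])
    path.reverse()
    return path
```

For example, even if an octave-error candidate has the lowest local cost in one frame, the transition penalty keeps the smooth track: `viterbi_pitch_track([[100, 200], [105, 210], [110, 220]], [[0, 1], [1, 0], [0, 1]])` returns the continuous trajectory `[100, 105, 110]`. The paper's point is that a gated RNN encoder-decoder can learn this cost estimation and back-tracing behavior end to end, making the extractor differentiable and trainable jointly with the tone classifier.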