Paper ID | HLT-15.2 | ||
Paper Title | ATTENTION-BASED MULTI-ENCODER AUTOMATIC PRONUNCIATION ASSESSMENT | ||
Authors | Binghuai Lin, Liyuan Wang, Tencent Technology Co., Ltd, China | ||
Session | HLT-15: Language Assessment | ||
Location | Gather.Town | ||
Session Time | Thursday, 10 June, 16:30 - 17:15 | ||
Presentation Time | Thursday, 10 June, 16:30 - 17:15 | ||
Presentation | Poster | ||
Topic | Human Language Technology: [HLT-LACL] Language Acquisition and Learning | ||
Abstract | Automatic pronunciation assessment plays an important role in Computer-Assisted Pronunciation Training (CAPT). Traditional methods for pronunciation assessment of read-aloud tasks use features derived from automatic speech recognition (ASR) and are therefore sensitive to both the accuracy of the ASR and the effectiveness of the features. Moreover, the representation capability of the features is also affected by the inconsistent optimization goals of the ASR and scoring tasks. In this paper, we propose an end-to-end (E2E) pronunciation scoring network based on an attention mechanism and a multi-encoder architecture consisting of audio and text encoders. The network, optimized within a multi-task learning (MTL) framework, provides sentence-level scoring as well as detailed word-level scoring. Because data for pronunciation scoring are scarce, we pre-train the network in two steps using ASR data and synthetic data, and then fine-tune it on the limited high-quality scoring data. Experimental results on a dataset recorded by Chinese English-as-a-second-language (ESL) learners and labeled by three experts demonstrate that the proposed model outperforms the baseline in Pearson correlation coefficient (PCC). |
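
The abstract reports results as a Pearson correlation coefficient (PCC) between model scores and expert labels. As a minimal sketch of how such a metric can be computed, the Python snippet below implements the standard PCC formula. The function name `pearson_correlation` and the example scores are hypothetical and chosen for illustration only; they are not taken from the paper or its dataset.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores, for illustration only (not data from the paper).
expert_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.2, 1.9, 3.4, 3.8, 4.9]
pcc = pearson_correlation(model_scores, expert_scores)
print(f"PCC = {pcc:.3f}")
```

A PCC close to 1 means the model's scores rise and fall with the expert scores. How the three experts' labels are combined into one reference score is not stated in the abstract, so the sketch takes a single reference list as given.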