Paper ID | SPE-55.3 | ||
Paper Title | SPOKEN LANGUAGE IDENTIFICATION IN UNSEEN TARGET DOMAIN USING WITHIN-SAMPLE SIMILARITY LOSS | ||
Authors | Muralikrishna H, Indian Institute of Technology Mandi, India; Shantanu Kapoor, Manipal Institute of Technology Manipal, India; Dileep Aroor Dinesh, Padmanabhan Rajan, Indian Institute of Technology Mandi, India | ||
Session | SPE-55: Language Identification and Low Resource Speech Recognition | ||
Location | Gather.Town | ||
Session Time: | Friday, 11 June, 14:00 - 14:45 | ||
Presentation Time: | Friday, 11 June, 14:00 - 14:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-MULT] Multilingual Recognition and Identification | ||
Abstract | State-of-the-art spoken language identification (LID) networks are vulnerable to channel mismatch, which arises from differences between the channels used to obtain the training and testing samples. The effect of channel mismatch is severe when the training dataset contains very limited channel diversity. One way to address channel mismatch is to learn a channel-invariant representation of the speech using adversarial multi-task learning (AMTL). However, the AMTL approach cannot be used when the training samples do not come with corresponding channel labels. To address this, we propose an auxiliary within-sample similarity loss (WSSL), which encourages the network to suppress the channel-specific content in the speech and does not require any channel labels. Specifically, WSSL measures the similarity between a pair of embeddings of the same sample obtained by two separate embedding extractors. These embedding extractors are designed to capture similar information about the channel, but dissimilar LID-specific information in the speech. Furthermore, the proposed WSSL improves the noise robustness of the LID network by suppressing the background noise in the speech to some extent. We demonstrate the effectiveness of the proposed approach in both seen and unseen channel conditions using a set of datasets with significant channel mismatch. |
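
The idea of an auxiliary similarity term between two embeddings of the same sample can be illustrated with a minimal sketch. The abstract does not state which similarity measure, loss weighting, or network architecture the authors use, so everything below is an assumption for illustration only: cosine similarity as the measure, a hypothetical weight parameter named `weight`, and hypothetical function names. It is not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors.
    Assumption: the abstract does not name the similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero

def cross_entropy(probs, label):
    """Standard LID classification loss for one sample."""
    return -math.log(probs[label] + 1e-12)

def total_loss(probs, label, emb_a, emb_b, weight=0.5):
    """Classification loss plus a weighted within-sample similarity term.
    emb_a and emb_b stand for embeddings of the SAME utterance produced by
    two separate extractors. `weight` is a hypothetical hyperparameter."""
    return cross_entropy(probs, label) + weight * cosine_similarity(emb_a, emb_b)

# Example: one utterance, three candidate languages, true label 0.
probs = [0.7, 0.2, 0.1]
emb_a = [1.0, 0.0, 0.5]
emb_b = [0.9, 0.1, 0.4]
print(total_loss(probs, 0, emb_a, emb_b))
```

In this sketch, minimizing the similarity term penalizes whatever the two embeddings share. Under the abstract's premise that the extractors share channel information but differ in LID-specific information, that shared part is the channel content, and no channel labels appear anywhere in the computation.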