Paper ID | DEMO-1.2 | ||
Paper Title | Speech Data Explorer: Interactive Analysis Tool for Speech Datasets | ||
Authors | Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg, NVIDIA, United States | ||
Session | DEMO-1: Show and Tell Demonstrations 1 | ||
Location | Zoom | ||
Session Time | Wednesday, 09 June, 08:00 - 09:45 | ||
Presentation Time | Wednesday, 09 June, 08:00 - 09:45 | ||
Presentation | Poster | ||
Topic | Show and Tell Demonstration: Demo | ||
Abstract | Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) models require large labeled speech datasets for training. Accurate reference transcripts that correspond to the audio recordings are essential. Otherwise, models might learn errors from the training data and reproduce those errors during inference. We have developed Speech Data Explorer (SDE) to help examine the quality of speech datasets and to perform interactive error analysis of ASR models’ predictions. Its core strengths include the following: - an interactive table that contains a dataset’s utterances and supports filtering (thresholding) and sorting; - interactive visualization of metrics and of a signal in the time and frequency domains (with a built-in audio player); - easy extensibility (it is straightforward to add new metrics as table columns that support all interactive features). To the best of our knowledge, SDE is the first open-source tool for interactive exploration of speech datasets and error analysis of ASR models’ predictions. It is implemented as a web application based on the Plotly Dash framework. SDE is an essential tool for the analysis of speech datasets and ASR models in our own research. It has already helped us to quickly identify labeling issues in many public and commercial speech datasets, analyze the accuracy of ASR models, and construct new datasets (for example, Russian LibriSpeech [http://www.openslr.org/96/]). We believe that SDE, with its interactivity and extensibility, could be beneficial to the wider speech processing community. We will demonstrate how SDE can be used for: - interactive analysis of a speech dataset; - interactive error analysis of transcripts generated by an ASR model; - analysis with custom metrics useful for different tasks (for example, long utterance segmentation). |
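
The workflow the abstract describes (per-utterance metrics stored as table columns, then thresholding and sorting on them) can be illustrated with a minimal sketch. This is not SDE's actual code or API: the `Utterance` class, the `build_table` and `filter_and_sort` helpers, and the column names are hypothetical, and word error rate is used only as one plausible example of a metric.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Assumed convention: an empty reference scores 0 only for an empty hypothesis.
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class Utterance:
    """Hypothetical record: one audio file, its reference, and an ASR prediction."""
    audio_path: str
    reference: str
    hypothesis: str


def build_table(utterances, metrics):
    """Return one dict per utterance; each metric becomes an extra column."""
    rows = []
    for u in utterances:
        row = {
            "audio_path": u.audio_path,
            "reference": u.reference,
            "hypothesis": u.hypothesis,
        }
        for name, fn in metrics.items():
            row[name] = fn(u)
        rows.append(row)
    return rows


def filter_and_sort(rows, column, threshold):
    """Keep rows whose metric is at least `threshold`, highest values first."""
    kept = [r for r in rows if r[column] >= threshold]
    return sorted(kept, key=lambda r: r[column], reverse=True)


if __name__ == "__main__":
    data = [
        Utterance("a.wav", "the cat sat on the mat", "the cat sat on the mat"),
        Utterance("b.wav", "hello world", "hello word"),
        Utterance("c.wav", "speech data explorer", "speech explorer"),
    ]
    # Adding a new metric means adding one entry to this dict.
    metrics = {
        "wer": lambda u: word_error_rate(u.reference, u.hypothesis),
        "num_words": lambda u: len(u.reference.split()),
    }
    for row in filter_and_sort(build_table(data, metrics), "wer", 0.3):
        print(row["audio_path"], round(row["wer"], 3))
```

Sorting the highest-error utterances to the top is one way such a table can surface likely labeling problems for manual review; an interactive tool would apply the same filter and sort in response to user input.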