Paper ID | HLT-10.5 | ||
Paper Title | TOWARDS PRACTICAL LIPREADING WITH DISTILLED AND EFFICIENT MODELS | ||
Authors | Pingchuan Ma, Imperial College London, United Kingdom; Brais Martinez, Samsung AI Research Center, United Kingdom; Stavros Petridis, Maja Pantic, Imperial College London, United Kingdom | ||
Session | HLT-10: Multi-modality in Language | ||
Location | Gather.Town | ||
Session Time | Wednesday, 09 June, 16:30 - 17:15 | ||
Presentation Time | Wednesday, 09 June, 16:30 - 17:15 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-GASR] General Topics in Speech Recognition | ||
Abstract | Lipreading has witnessed considerable progress due to the resurgence of neural networks. Recent work has emphasized improving performance by finding the optimal architecture or by improving generalization. However, there is still a significant gap between current methodologies and the requirements for effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly narrow that gap. First, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000, to 88.6% and 46.6%, respectively, through careful optimization. Second, we propose a series of architectural changes, including a novel depthwise-separable TCN head, that slash the computational cost to a fraction of that of the (already quite efficient) original model. Third, we show that knowledge distillation is a very effective tool for recovering the performance of the lightweight models. This results in a range of models with different accuracy-efficiency trade-offs. In particular, our most promising lightweight models are on par with the current state of the art while reducing computational cost by 8x and the number of parameters by 4x, which we hope will enable the deployment of lipreading models in practical applications. |
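
The abstract credits part of the efficiency gain to a depthwise-separable TCN head. As a rough illustration of why that kind of layer is cheaper, the sketch below counts the weights in a standard 1D (temporal) convolution and in a depthwise-separable one (a per-channel temporal filter followed by a 1x1 pointwise convolution). The channel width of 512 and kernel size of 3 are hypothetical values chosen for the example, not figures taken from the paper, and the helper names are illustrative only.

```python
def standard_conv1d_params(c_in, c_out, kernel_size, bias=False):
    """Weights in a standard 1D convolution:
    one kernel_size-wide filter per (input channel, output channel) pair."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=False):
    """Weights in a depthwise-separable 1D convolution:
    a depthwise stage (one temporal filter per input channel)
    followed by a pointwise 1x1 stage that mixes channels."""
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, assumed for illustration only.
    c_in, c_out, k = 512, 512, 3
    std = standard_conv1d_params(c_in, c_out, k)
    sep = depthwise_separable_conv1d_params(c_in, c_out, k)
    print(f"standard conv weights:            {std}")
    print(f"depthwise-separable conv weights: {sep}")
    print(f"reduction factor:                 {std / sep:.2f}x")
```

For a single layer the saving approaches the kernel size as the channel count grows, so the overall reductions quoted in the abstract necessarily depend on the other architectural changes as well, not on this substitution alone.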