Paper ID | AUD-30.4 |
Paper Title | A TWO-STAGE APPROACH TO DEVICE-ROBUST ACOUSTIC SCENE CLASSIFICATION |
Authors | Hu Hu, Chao-Han Yang, Georgia Institute of Technology, United States; Xianjun Xia, Tencent Media Lab, China; Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, University of Science and Technology of China, China; Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Tencent Media Lab, China; Sabato Marco Siniscalchi, University of Enna Kore, Italy; Yannan Wang, Tencent Media Lab, China; Jun Du, University of Science and Technology of China, China; Chin-Hui Lee, Georgia Institute of Technology, United States |
Session | AUD-30: Detection and Classification of Acoustic Scenes and Events 5: Scenes |
Location | Gather.Town |
Session Time | Friday, 11 June, 13:00 - 13:45 |
Presentation Time | Friday, 11 June, 13:00 - 13:45 |
Presentation | Poster |
Topic | Audio and Acoustic Signal Processing: [AUD-CLAS] Detection and Classification of Acoustic Scenes and Events |
Abstract | To improve device robustness, a highly desirable feature of a competitive data-driven acoustic scene classification (ASC) system, we propose a novel two-stage system based on fully convolutional neural networks (CNNs). The system relies on an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, the proposed ASC system attains state-of-the-art accuracy on the development set: our best system, a two-stage fusion of CNN ensembles, delivers an 81.9% average accuracy on multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models. |
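The abstract does not spell out how the scores of the two classifiers are combined. As a rough illustration only, the sketch below shows one plausible form of such a combination: each of the ten fine-grained scores is weighted by the score of its parent broad class, and the result is renormalised. The grouping of scenes into indoor, outdoor, and transportation, the multiplicative rule, and the names `COARSE_OF`, `combine_scores`, and `predict` are all assumptions made for this example, not details confirmed by the abstract.

```python
# Assumed grouping of ten fine-grained scenes into three broad classes.
# The abstract does not list the classes; this mapping is hypothetical.
COARSE_OF = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "metro_station": "indoor",
    "park": "outdoor",
    "public_square": "outdoor",
    "street_pedestrian": "outdoor",
    "street_traffic": "outdoor",
    "bus": "transportation",
    "metro": "transportation",
    "tram": "transportation",
}


def combine_scores(coarse_probs, fine_probs):
    """Weight each fine-class score by the score of its parent broad class.

    coarse_probs: dict mapping broad class -> probability (3-class CNN output)
    fine_probs:   dict mapping scene -> probability (10-class CNN output)
    The multiplicative rule is an assumption, not the paper's stated method.
    """
    combined = {
        scene: fine_probs[scene] * coarse_probs[COARSE_OF[scene]]
        for scene in fine_probs
    }
    total = sum(combined.values())
    # Renormalise so the combined scores sum to 1.
    return {scene: score / total for scene, score in combined.items()}


def predict(coarse_probs, fine_probs):
    """Return the scene with the highest combined score."""
    combined = combine_scores(coarse_probs, fine_probs)
    return max(combined, key=combined.get)


if __name__ == "__main__":
    # Toy posteriors: the 10-class model slightly prefers "metro", while the
    # 3-class model is fairly confident that the recording is indoor.
    fine = {scene: 0.03125 for scene in COARSE_OF}
    fine["metro"] = 0.40
    fine["metro_station"] = 0.35
    coarse = {"indoor": 0.7, "outdoor": 0.1, "transportation": 0.2}
    print(predict(coarse, fine))
```

In this toy case the broad-class classifier overrides a narrow fine-grained preference for a confusable scene, which is the kind of correction a two-stage combination is meant to provide.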