Paper ID | MMSP-7.1
Paper Title | CROSS-MODAL KNOWLEDGE DISTILLATION FOR FINE-GRAINED ONE-SHOT CLASSIFICATION
Authors | Jiabao Zhao, Xin Lin, East China Normal University, China; Yifan Yang, Transwarp Technology, China; Jing Yang, Liang He, East China Normal University, China
Session | MMSP-7: Multimodal Perception, Integration and Multisensory Fusion
Location | Gather.Town
Session Time | Friday, 11 June, 13:00 - 13:45
Presentation Time | Friday, 11 June, 13:00 - 13:45
Presentation | Poster
Topic | Multimedia Signal Processing: Human Centric Multimedia
Abstract | Few-shot learning can recognize a novel category from only a few samples because it learns how to learn from a large number of labeled samples during training. However, its performance degrades when data is insufficient, and obtaining a large-scale annotated fine-grained dataset is expensive. In this paper, we adopt domain-specific knowledge to compensate for the lack of annotated data. We propose a cross-modal knowledge distillation (CMKD) framework for fine-grained one-shot classification, along with a Spatial Relation Loss (SRL) that transfers cross-modal information and bridges the semantic gap between multimodal features. The teacher network distills the spatial relationships among samples as soft targets for training a unimodal student network. Notably, at inference time the student network makes predictions based only on a few samples, without any external knowledge. The framework is model-agnostic and adapts readily to other few-shot models. Extensive experimental results on benchmarks demonstrate that CMKD makes full use of cross-modal knowledge in both image and text few-shot classification, and that it significantly improves the performance of student networks, even when the student is a state-of-the-art model.
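The abstract does not give the exact formulation of SRL. As a rough, non-authoritative sketch, the snippet below assumes a relation-distillation-style loss in which the teacher's pairwise similarity matrix over a batch serves as the soft target for the student's relation matrix; the function name spatial_relation_loss, the weight lambda_srl, and the cosine-similarity choice are all illustrative assumptions, not the paper's definition.

import torch
import torch.nn.functional as F

def spatial_relation_loss(teacher_feats: torch.Tensor,
                          student_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a spatial-relation distillation loss.

    teacher_feats: (B, D_t) embeddings from the multimodal teacher.
    student_feats: (B, D_s) embeddings from the unimodal student.
    The feature dimensions may differ; only the B x B relation
    matrices are matched, so no projection layer is required.
    """
    # Pairwise cosine-similarity matrices over the batch capture the
    # spatial relationships among samples in each embedding space.
    t = F.normalize(teacher_feats, dim=-1)
    s = F.normalize(student_feats, dim=-1)
    rel_t = t @ t.t()  # (B, B) teacher relation matrix (soft target)
    rel_s = s @ s.t()  # (B, B) student relation matrix

    # Match the student's relations to the teacher's soft targets.
    return F.mse_loss(rel_s, rel_t)

# Usage (illustrative): combine with the few-shot classification loss,
# e.g. total_loss = task_loss + lambda_srl * spatial_relation_loss(t_feats, s_feats)

Because only batch-level relation matrices are compared, a sketch like this stays model-agnostic: it places no constraint on the student's architecture or feature dimensionality, which is consistent with the abstract's claim that the framework adapts to other few-shot models.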