Paper ID | MLSP-16.3
Paper Title | CHANNEL-WISE MIX-FUSION DEEP NEURAL NETWORKS FOR ZERO-SHOT LEARNING
Authors | Guowei Wang, Tianjin University, China; Naiyang Guan, National Innovation Institute of Defense Technology, China; Hanjia Ye, Nanjing University, China; Xiaodong Yi, Hang Cheng, Junjie Zhu, National Innovation Institute of Defense Technology, China
Session | MLSP-16: ML and Graphs
Location | Gather.Town
Session Time | Wednesday, 09 June, 14:00 - 14:45
Presentation Time | Wednesday, 09 June, 14:00 - 14:45
Presentation | Poster
Topic | Machine Learning for Signal Processing: [MLR-TRL] Transfer learning
Abstract | Zero-shot learning (ZSL), with the assistance of seen-class images and additional semantic knowledge, generalizes its classification ability to unseen classes by aligning visual and semantic space embeddings. Previous methods have mainly studied whether discriminative visual features help recognize different classes, while neglecting the rich semantic information in the surrounding background. This paper proposes a channel-wise mix-fusion ZSL model (CMFZ) that contextualizes the ZSL classifier's discriminative information by incorporating much richer visual semantic information from both objects and their surrounding semantic environments. In particular, a channel-wise connection module (CCM) learns to model the relationship between the object and its surroundings. A collaborative channel-wise activation module (CAM) learns from a finer-scale image obtained from the cropping module; it highlights the most distinct channels, which represent the object's discriminative regions, to eliminate inadvertently introduced background noise. Furthermore, the representation ability of the learned mapping is enhanced by integrating the visual semantic features processed by CCM and CAM. Experimental results show that CMFZ outperforms state-of-the-art ZSL methods, verifying the effectiveness of incorporating visual semantic information.
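To make the fusion idea concrete, below is a minimal PyTorch sketch of how channel-wise connection, channel-wise activation, and mix-fusion into a semantic (attribute) space could fit together. This is not the authors' implementation: the module designs (squeeze-and-excitation style gating), class/function names (ChannelConnection, ChannelActivation, MixFusionHead), feature dimensions, and the compatibility-score readout are all illustrative assumptions based only on the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelConnection(nn.Module):
    """Hypothetical CCM sketch: relates object-region features to full-image
    (surroundings) features via per-channel gating."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, obj_feat, ctx_feat):
        # obj_feat, ctx_feat: (B, C, H, W) feature maps from a CNN backbone
        b, c, _, _ = obj_feat.shape
        pooled = torch.cat(
            [F.adaptive_avg_pool2d(obj_feat, 1).view(b, c),
             F.adaptive_avg_pool2d(ctx_feat, 1).view(b, c)], dim=1)
        gate = self.fc(pooled).view(b, c, 1, 1)            # per-channel weights in (0, 1)
        return obj_feat * gate + ctx_feat * (1.0 - gate)   # channel-wise mixing

class ChannelActivation(nn.Module):
    """Hypothetical CAM sketch: re-weights channels of the cropped
    (finer-scale) features to emphasise the most discriminative ones."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, crop_feat):
        b, c, _, _ = crop_feat.shape
        weights = self.fc(F.adaptive_avg_pool2d(crop_feat, 1).view(b, c))
        return crop_feat * weights.view(b, c, 1, 1)

class MixFusionHead(nn.Module):
    """Fuses CCM and CAM features and maps them into the semantic
    (attribute) space for compatibility scoring against class embeddings."""
    def __init__(self, channels, attr_dim):
        super().__init__()
        self.ccm = ChannelConnection(channels)
        self.cam = ChannelActivation(channels)
        self.proj = nn.Linear(2 * channels, attr_dim)

    def forward(self, obj_feat, ctx_feat, crop_feat, class_attrs):
        # class_attrs: (num_classes, attr_dim) semantic vectors, e.g. attributes
        f1 = F.adaptive_avg_pool2d(self.ccm(obj_feat, ctx_feat), 1).flatten(1)
        f2 = F.adaptive_avg_pool2d(self.cam(crop_feat), 1).flatten(1)
        visual_sem = self.proj(torch.cat([f1, f2], dim=1))  # (B, attr_dim)
        return visual_sem @ class_attrs.t()                 # (B, num_classes) scores

# Smoke test with random tensors; 2048-channel ResNet-like features and
# CUB-like dimensions (200 classes, 312 attributes) are assumed for illustration.
head = MixFusionHead(channels=2048, attr_dim=312)
obj = ctx = crop = torch.randn(4, 2048, 7, 7)
attrs = torch.randn(200, 312)
print(head(obj, ctx, crop, attrs).shape)  # torch.Size([4, 200])
```

At inference, classification would amount to taking the argmax of these compatibility scores over the candidate (unseen) class embeddings; training details such as the cropping module and loss functions are described in the paper itself.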