Paper ID | SS-5.1 | ||
Paper Title | EXPLORING VISUAL-AUDIO COMPOSITION ALIGNMENT NETWORK FOR QUALITY FASHION RETRIEVAL IN VIDEO | ||
Authors | Yanhao Zhang, Jianmin Wu, Xiong Xiong, Dangwei Li, Chenwei Xie, Yun Zheng, Pan Pan, Yinghui Xu, Alibaba Group, China | ||
Session | SS-5: Domain Adaptation for Multimedia Signal Processing | ||
Location | Gather.Town | ||
Session Time | Wednesday, 09 June, 13:00 - 13:45 | ||
Presentation Time | Wednesday, 09 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Special Sessions: Domain Adaptation for Multimedia Signal Processing | ||
Abstract | Fashion retrieval in video suffers from imperfect visual representations and low-quality search results in e-commerce settings. Previous works generally focus on searching for identical images from a visual perspective only and do not leverage multi-modal information to find high-quality commodities. The task is a cross-domain problem, and instructional or exhibiting audio reveals rich semantic information that can facilitate the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) for quality fashion retrieval in video. First, we introduce a visual-audio composition module in VACANet that aims to distinguish attentive and residual entities by learning semantic embeddings from both the visual and audio streams. Second, we design a quality alignment training scheme that combines quality-aware triplet mining with a domain alignment constraint for video-to-image adaptation. Finally, extensive experiments on challenging video datasets demonstrate the scalable effectiveness of our model for quality fashion retrieval. |