Paper ID | SS-5.1 |
Paper Title |
EXPLORING VISUAL-AUDIO COMPOSITION ALIGNMENT NETWORK FOR QUALITY FASHION RETRIEVAL IN VIDEO |
Authors |
Yanhao Zhang, Jianmin Wu, Xiong Xiong, Dangwei Li, Chenwei Xie, Yun Zheng, Pan Pan, Yinghui Xu, Alibaba group, China |
Session | SS-5: Domain Adaptation for Multimedia Signal Processing |
Location | Gather.Town |
Session Time: | Wednesday, 09 June, 13:00 - 13:45 |
Presentation Time: | Wednesday, 09 June, 13:00 - 13:45 |
Presentation |
Poster
|
Topic |
Special Sessions: Domain Adaptation for Multimedia Signal Processing |
IEEE Xplore Open Preview |
Click here to view in IEEE Xplore |
Virtual Presentation |
Click here to watch in the Virtual Conference |
Abstract |
Fashion retrieval in video suffers from the issues of imperfect visual representation and low quality of search results under the E-commercial circumstance. Previous works generally focus on searching the identical images from visual perspective only, but lack of leveraging multi-modal information for high quality commodities. As a cross-domain problem, instructional or exhibiting audio reveals rich semantic information to facilite the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) to deal with quality fashion retrieval in video. Firstly, we introduce the visual-audio composition module in VACANet aiming to distinguish attentive and residual entities by learning semantic embedding from both visual and audio streams. Secondly, a quality alignment training scheme is then designed by quality-aware triplet mining and domain alignment constraint for video-to-image adaptation. Finally, extensive experiments conducted on challenging video datasets demonstrate the scalable effectiveness of our model in alleviating quality fashion retrieval. |