2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information



Paper Detail

Paper ID: MMSP-8.4
Paper Title: BIDIRECTIONAL FOCUSED SEMANTIC ALIGNMENT ATTENTION NETWORK FOR CROSS-MODAL RETRIEVAL
Authors: Shuli Cheng, Liejun Wang, Anyu Du, Yongming Li; Xinjiang University, China
Session: MMSP-8: Multimedia Retrieval and Signal Detection
Location: Gather.Town
Session Time: Friday, 11 June, 13:00 - 13:45
Presentation Time: Friday, 11 June, 13:00 - 13:45
Presentation: Poster
Topic: Multimedia Signal Processing: Multimedia Applications
IEEE Xplore: Open Preview available
Abstract: Cross-modal retrieval is a challenging and significant task in intelligent understanding. Researchers have tried to capture modal semantic information through weighted attention mechanisms, but these approaches neither eliminate the negative effects of irrelevant semantic information nor capture fine-grained modal semantics. To capture multi-modal semantic information more accurately, a bidirectional focused semantic alignment attention network (BFSAAN) is proposed for cross-modal retrieval. The core ideas of BFSAAN are as follows: 1) a bidirectional focused attention mechanism shares semantic information across modalities, further suppressing the negative influence of irrelevant semantics; 2) strip pooling, a lightweight spatial attention mechanism, is applied to both image and text modalities to capture spatial semantic information; 3) second-order covariance pooling is explored to obtain the multi-modal semantic representation, capturing channel semantic information and achieving semantic alignment between image and text modalities. Experiments are conducted on two standard cross-modal retrieval datasets (Flickr30K and MS COCO) and cover four aspects: performance comparison, ablation analysis, algorithm convergence, and visual analysis. The results show that BFSAAN achieves better cross-modal retrieval performance.
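The paper itself holds the details of BFSAAN; purely as a rough illustration of two of the building blocks named in the abstract, the NumPy sketch below shows generic strip pooling (pooling a feature map along each spatial axis separately) and second-order covariance pooling (a channel-by-channel covariance of spatial positions). The function names and the additive fusion of the two strips are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def strip_pool(x):
    """Strip pooling sketch: average a (C, H, W) feature map along each
    spatial axis separately, then broadcast the two strips back together."""
    h_strip = x.mean(axis=2, keepdims=True)   # (C, H, 1): pooled over width
    w_strip = x.mean(axis=1, keepdims=True)   # (C, 1, W): pooled over height
    return h_strip + w_strip                  # broadcasts back to (C, H, W)

def covariance_pool(x):
    """Second-order pooling sketch: treat the H*W positions of a (C, H, W)
    map as samples and return the C x C channel covariance matrix."""
    c = x.shape[0]
    flat = x.reshape(c, -1)                   # (C, N) with N = H * W
    flat = flat - flat.mean(axis=1, keepdims=True)
    n = flat.shape[1]
    return flat @ flat.T / (n - 1)            # (C, C), symmetric

# Example on a random 4-channel 8x8 feature map
feat = np.random.default_rng(0).standard_normal((4, 8, 8))
print(strip_pool(feat).shape)       # (4, 8, 8)
print(covariance_pool(feat).shape)  # (4, 4)
```

Strip pooling keeps the spatial resolution while summarizing long horizontal and vertical context cheaply; covariance pooling discards spatial layout entirely and keeps pairwise channel statistics, which is what makes it a "second-order" representation.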