MAIN: Multi-Attention Instance Network for video segmentation

J. León Alcázar, Maria A. Bravo, G. Jeanneret, A. Thabet, Thomas Brox, P. Arbeláez, B. Ghanem
Computer Vision and Image Understanding, 210: 103240, 2021
Abstract: Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on the integration of generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Network (MAIN), overcomes challenging segmentation scenarios over arbitrary videos without modeling sequence- or instance-specific knowledge. We design MAIN to segment multiple instances in a single forward pass, and optimize it with a novel loss function that favors class-agnostic predictions and assigns instance-specific penalties. We achieve state-of-the-art performance on the challenging YouTube-VOS dataset and benchmark, improving the unseen Jaccard and F-Metric by 6.8% and 12.7% respectively, while operating in real time (30.3 FPS).

Other associated files: MAIN_paper.pdf [5.9 MB]

BibTeX reference

@Article{BB21,
  author       = "J. Le{\'o}n Alc{\'a}zar and M. Bravo and G. Jeanneret and A. Thabet and T. Brox and P. Arbel{\'a}ez and B. Ghanem",
  title        = "{MAIN}: Multi-Attention Instance Network for video segmentation",
  journal      = "Computer Vision and Image Understanding",
  volume       = "210",
  pages        = "103240",
  year         = "2021",
  keywords     = "Video object segmentation",
  url          = "http://lmb.informatik.uni-freiburg.de/Publications/2021/BB21"
}