Questions for the seminar paper "Video Object Segmentation with Language Referring Expressions"
-----------------------------------------------------------------------------------------------
Please send your answers to: tatarchm@cs.uni-freiburg.de by 10:00 on 29.05.19

1. Describe all input data types and ground truth data types required to train the proposed approach. (~ 2-3 sentences)

2. What is the purpose of the proposed temporal consistency module? Is the box proposal re-ranking performed taking into account frames from the entire video or only frames adjacent to the current one? (~ 2 sentences)

3. The authors transform the bounding boxes predicted by the grounding model into binary images and feed those to the segmentation network. Can you suggest an alternative way of feeding the bounding boxes to the network? How would this change the network architecture? (~ 2-3 sentences)
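For reference on question 3, the box-to-binary-image step mentioned there might look roughly like the sketch below. It is only an illustration and not the authors' code: the function name `box_to_binary_image`, the `(x1, y1, x2, y2)` pixel box format with exclusive right/bottom edges, and the use of NumPy are all assumptions made here.

```python
import numpy as np


def box_to_binary_image(box, height, width):
    """Rasterize one bounding box into a binary image (hypothetical helper).

    Assumes box = (x1, y1, x2, y2) in integer pixel coordinates, with
    x2 and y2 exclusive. Pixels inside the box are 1, all others 0.
    """
    x1, y1, x2, y2 = box
    # Clip the box to the image bounds so slicing stays valid.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    image = np.zeros((height, width), dtype=np.uint8)
    image[y1:y2, x1:x2] = 1
    return image


# Example: a 3x3 box inside a 6x8 frame.
mask = box_to_binary_image((2, 1, 5, 4), height=6, width=8)
```

The resulting array has the same spatial size as the frame, which is what allows it to be passed to a convolutional network alongside the image.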