Questions for the seminar paper "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning"
-----------------------------------------------------------------------------------------------

Please send your answers to: jesslen@cs.uni-freiburg.de

1) The authors train the model using a masked latent feature prediction objective rather than pixel-level reconstruction. How does this choice contribute to learning representations suited for the downstream tasks of understanding, prediction, and planning? (~2 sentences)

2) In the action-conditioned world-model stage, the model is trained on about 62 hours of raw robot video and end-effector states, without any task-specific rewards. What is special about this data and training setup, and how does it allow the model to perform new manipulation tasks in a zero-shot way? (~2 sentences)

3) The paper emphasizes that V-JEPA 2 learns from raw videos at scale, without labels or rewards. What architectural or training mechanisms prevent the model from collapsing to trivial solutions? (~1 sentence)