Block-Seminar on Deep Learning
apl. Prof. Olaf Ronneberger (Google DeepMind)
In this seminar you will learn about recent developments in deep learning with a focus on images and videos and their combination with other modalities like language. The surprising emergent capabilities of large language models (like GPT-4) open up new design spaces. Many classic computer vision tasks can be translated into the language domain and can be (partially) solved there. Understanding the current capabilities, shortcomings, and approaches in the language domain will be essential for future computer vision research. The papers selected this year therefore focus on the key concepts used in today's large language models and on approaches to combining computer vision with language.
Each paper is assigned to three students, who investigate the paper and its background in more detail and give a presentation. The presentation is followed by a discussion with all participants about the merits, limitations, and perspectives of the respective paper. You will learn to read and understand contemporary research papers, to give a good oral presentation, to ask questions, and to openly discuss a research problem.
Note that the mode of the seminar changes this semester to accommodate more slots for students. Rather than one student presenting a paper, three students will cover three different aspects, typically (but not necessarily) (1) the paper's historical background, related work, and motivation, (2) the core methodology including background methodology, and (3) experimental results and their assessment. To deliver a strong overall presentation, you will also learn to work in a team whose composition is out of your control. The success of the team effort is part of the grading. The maximum number of students that can participate in the seminar is 30.
The introduction meeting (together with Thomas Brox's seminar) will be in person, while the mid-semester meeting will be online. The block seminar itself will be in person to give you the chance to practise your real-world presentation skills and to have more lively discussions.
Contact person: Karim Farid
Figure: Vision model architecture from Qwen3-VL
Material
from Thomas Brox's seminar:
- Giving a good presentation
- Proper scientific behavior
- Powerpoint template for your presentation (optional)
Papers:
The seminar has space for 30 students
| No | Paper title and link | Comments | Students | Advisor |
|---|---|---|---|---|
| B1 | VGGT: Visual Geometry Grounded Transformer | | Adhithya Rajagopalan Florian Brügel Julian Krippes | Jelena Bratulić |
| B2 | Efficiently Reconstructing Dynamic Scenes One D4RT at a Time | | Justin Kinn Sepuh Hovhannisyan Panagiotis Drivas | Sudhanshu Mittal |
| B3 | 360Anything: Geometry-Free Lifting of Images and Videos to 360° | | Khushi Bisani Levin Ben Heining Johannes Christof Weisbarth | Artur Jesslen |
| B4 | Robot Learning from a Physical World Model | | Dron Dasgupta Egor Alekseev Emad Alkhashab | Sudhanshu Mittal |
| B5 | TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment | | Rahul Gopinath Urmika Bhattacharya Anurag Parag Riswadkar | Simon Schrodi |
| B6 | LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels | | Andi Alidema Maryam Azimpour Tobias Hoffmann | Karim Farid |
| B7 | Recurrent Video Masked Autoencoders | | Ioan Oleksii Kelier Thejaswini Raju Sandra Elizabeth Sabu | Rajat Sahay |
| B8 | T5Gemma 2: Seeing, Reading, and Understanding Longer | | Philipp Bähr Prashanth Premakumar SHOBHIT KUMAR | Simon Ging |
| B9 | Kimi K2.6 | | Henri Oberpaur Jan Sander Surbhi Nair | Karim Farid |
| B10 | Qwen3.5-Omni Technical Report | | Damien Leu Franka Müller Robin Sonner | Elias Kempf |


