
Block-Seminar on Deep Learning

apl. Prof. Olaf Ronneberger (Google DeepMind)

In this seminar you will learn about recent developments in deep learning, with a focus on images and videos and their combination with other modalities such as language. The surprising emerging capabilities of large language models (like GPT-4) open up new design spaces. Many classic computer vision tasks can be translated into the language domain and can be (partially) solved there. Understanding the current capabilities, shortcomings, and approaches in the language domain will be essential for future computer vision research. The papers selected this year therefore focus on the key concepts used in today's large language models as well as on approaches for combining computer vision with language.

Each paper is assigned to three students, who investigate the paper and its background in more detail and give a presentation. The presentation is followed by a discussion with all participants about the merits, limitations, and perspectives of the respective paper. You will learn to read and understand contemporary research papers, to give a good oral presentation, to ask questions, and to openly discuss a research problem.

Note that the mode of the seminar changes this semester to accommodate more slots for students. Rather than one student presenting a paper, three students will cover three different aspects, typically (but not necessarily) (1) the paper's historical background, related work, and motivation, (2) the core methodology including background methodology, and (3) experimental results and their assessment. To deliver a strong overall presentation, you will also learn to work in a team whose composition is out of your control. The success of the team effort is part of the grading. The maximum number of students that can participate in the seminar is 30.

The introduction meeting (together with Thomas Brox's seminar) will be in person, while the mid-semester meeting will be online. The block seminar itself will be in person to give you the chance to practise your real-world presentation skills and to have more lively discussions.


Contact person: Karim Farid

Block seminar:
(2 SWS)
In person.
Date: Thursday, 17 September 2026, 9:30 to 15:30
Friday, 18 September 2026, 9:30 to 15:30

Room: Building 106, Room 00-007

Beginning: If you want to participate, attend the mandatory introduction meeting (held jointly with the Seminar on Current Works in Computer Vision) on April 22, 14:00, register in HisInOne, and submit your paper preferences before April 27.

Mid-Semester Lecture: (tbd, via video conference) Introduction to Generative models by apl. Prof. Olaf Ronneberger (Google DeepMind)

Recommended semester: 6 (Bachelor), any (Master)
Requirements: Background in computer vision

Remarks: This course is offered to both Bachelor and Master students. The language of this course is English. All presentations must be given in English.

Topics will be assigned for both seminars via preference voting. If there are more interested students than places, first priority will be given to students who attended the introduction meeting. Afterwards, we follow the assignments of the HisInOne system. We want to avoid students claiming a topic and then dropping out during the semester. Please take a rough look at all available papers to make an informed decision before you commit. If you do not attend the meeting (or do not send a paper preference) but choose this seminar together with only other overbooked seminars in HisInOne, you may end up without a seminar place this semester.

Students who only need to attend (failed SL from a previous semester) need not send a paper preference; just reply with "SL only".

All participants must read all papers and answer a few questions. The questions will be available here. The answers must be sent to the corresponding advisor before the beginning of the seminar. We highly recommend reading and understanding all papers before you start preparing your presentation.

   
Vision model architecture from Qwen3-VL

Material

from Thomas Brox's seminar:

Papers:

The seminar has space for 30 students

| No | Paper title and link | Comments | Students | Advisor |
|---|---|---|---|---|
| B1 | VGGT: Visual Geometry Grounded Transformer | | Adhithya Rajagopalan, Florian Brügel, Julian Krippes | Jelena Bratulić |
| B2 | Efficiently Reconstructing Dynamic Scenes One D4RT at a Time | | Justin Kinn, Sepuh Hovhannisyan, Panagiotis Drivas | Sudhanshu Mittal |
| B3 | 360Anything: Geometry-Free Lifting of Images and Videos to 360° | | Khushi Bisani, Levin Ben Heining, Johannes Christof Weisbarth | Artur Jesslen |
| B4 | Robot Learning from a Physical World Model | | Dron Dasgupta, Egor Alekseev, Emad Alkhashab | Sudhanshu Mittal |
| B5 | TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment | | Rahul Gopinath, Urmika Bhattacharya, Anurag Parag Riswadkar | Simon Schrodi |
| B6 | LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels | | Andi Alidema, Maryam Azimpour, Tobias Hoffmann | Karim Farid |
| B7 | Recurrent Video Masked Autoencoders | | Ioan Oleksii Kelier, Thejaswini Raju, Sandra Elizabeth Sabu | Rajat Sahay |
| B8 | T5Gemma 2: Seeing, Reading, and Understanding Longer | | Philipp Bähr, Prashanth Premakumar, SHOBHIT KUMAR | Simon Ging |
| B9 | Kimi K2.6 | | Henri Oberpaur, Jan Sander, Surbhi Nair | Karim Farid |
| B10 | Qwen3.5-Omni Technical Report | | Damien Leu, Franka Müller, Robin Sonner | Elias Kempf |