Novel view synthesis (NVS), a fundamental problem in computer vision, seeks to generate renderings from novel target viewpoints given a set of input viewpoints. Achieving this requires addressing several complex challenges: (1) inferring the geometric structure of a scene from 2D observations, (2) rendering the inferred 3D reconstruction from new viewpoints in a physically plausible manner, and (3) inpainting or extrapolating missing regions that are not observed in the input viewpoints. To tackle these challenges, diverse 3D representations, along with classical geometric constraints, advanced optimization techniques, and deep stereo priors, have been extensively studied. In recent years, diffusion generative models for 2D images and videos have demonstrated remarkable capabilities in generating photorealistic images. These advancements have opened new avenues for enhancing NVS by leveraging the priors encoded in these models. This project aims to investigate the types of prior knowledge encoded within 2D generative models that can most effectively benefit NVS. Unlike many contemporary approaches that fine-tune pretrained generative models for specific NVS tasks, this research adopts a zero-shot framework.
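To make the zero-shot idea concrete, the sketch below illustrates one possible way to probe challenge (3): an off-the-shelf latent-diffusion inpainting model from the Hugging Face diffusers library fills regions of a target view that remain empty after warping a source view. This is only an illustration under assumptions, not the project's prescribed method; the files warped.png and mask.png, the prompt, and the checkpoint name are placeholders, and the geometric warping step is assumed to exist separately.

# Minimal sketch: query a pretrained 2D inpainting diffusion model, zero-shot,
# to extrapolate regions of a target view unobserved in the source views.
# Assumes warped.png (source view reprojected into the target camera) and
# mask.png (white where no source pixel lands) come from a separate geometry step.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # example checkpoint, an assumption
    torch_dtype=torch.float16,
).to("cuda")

warped = Image.open("warped.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))

# The diffusion prior hallucinates the missing content; no fine-tuning is involved.
result = pipe(prompt="a photo of the scene", image=warped, mask_image=mask).images[0]
result.save("novel_view.png")

The key point of the sketch is that the generative prior is queried as-is, which is what distinguishes the zero-shot setting from fine-tuning-based NVS approaches.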
Supervisors: yutong.chen@inf.ethz.ch
Recent advancements in 3D reconstruction, such as neural radiance fields (NeRF) and 3D Gaussian splatting, have led to impressive results in high-quality novel view synthesis. However, these techniques still face challenges when it comes to extracting accurate geometry, particularly in scenes with reflective or transparent surfaces. At the same time, monocular depth estimation using data-driven or diffusion-based models has shown great promise in inferring depth from a single image, and in certain controlled scenarios, access to ground-truth depth information further enables a more precise understanding of scene geometry. This project aims to investigate how depth or normal cues can be integrated into 3D reconstruction pipelines to improve geometric accuracy. The student will explore various methods for incorporating monocular geometric cues, either through direct supervision or indirectly by leveraging depth-aware features, and evaluate the effectiveness of these approaches in challenging scenarios.
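As one concrete example of direct supervision with monocular cues, the sketch below aligns a monocular depth prediction to the depth rendered by a NeRF- or Gaussian-splatting-style pipeline using a per-view scale and shift (monocular depth is only defined up to an affine ambiguity) and penalizes the remaining residual. This is a minimal illustration under assumed conventions, not the project's mandated approach; the L1 penalty, variable names, and loss weighting are illustrative choices.

# Minimal sketch of a depth-supervision term for a radiance-field pipeline.
import torch

def depth_supervision_loss(rendered_depth: torch.Tensor,
                           mono_depth: torch.Tensor,
                           valid: torch.Tensor) -> torch.Tensor:
    """rendered_depth, mono_depth, valid: flattened per-pixel tensors of one view."""
    d = mono_depth[valid]
    r = rendered_depth[valid]
    # Closed-form least-squares fit of scale s and shift t so that s*d + t ~ r.
    A = torch.stack([d, torch.ones_like(d)], dim=-1)       # (N, 2)
    sol = torch.linalg.lstsq(A, r.unsqueeze(-1)).solution  # (2, 1)
    aligned = (A @ sol).squeeze(-1)                        # (N,)
    return torch.abs(aligned - r).mean()

This term would be added, with a small weight, to the usual photometric loss of the reconstruction pipeline; both the weight and the alignment strategy are assumptions for illustration.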
Supervisors: johannes.weidenfeller@ai.ethz.ch, lilian.calvet@balgrist.ch
In the era of autonomy, the creation of a 3D digital world that faithfully replicates our physical reality becomes increasingly critical. Central to this endeavor is the incorporation of realistic human behaviors. Moreover, human behaviors are intricately rooted in their environments: our movements are influenced by our interactions with various objects and by the spatial arrangement of our surroundings. Therefore, it is essential not only to model human motion itself but also to model how humans interact with the surrounding environment. Creating human motions within diverse environments has significant applications across numerous fields, including augmented reality (AR), virtual reality (VR), assistive robotics, biomechanics, filmmaking, and the gaming industry. However, capturing human motion in real environments requires expensive devices, complicated hardware setups, and significant manual effort, and therefore does not scale to creating large human-scene interaction datasets. In this project, we explore how to leverage 2D foundation models to synthesize 3D human motions in various environments in an efficient and scalable way. The project starts in December 2024 or January 2025.
Supervisor: siwei.zhang@inf.ethz.ch
Pre-trained large language models (LLMs) and vision-language models (VLMs) have demonstrated the ability to understand and autoregressively complete complex token sequences, enabling them to capture both the physical and semantic properties of a scene. By leveraging in-context learning, these models can function as general sequence modelers without requiring additional training. This project aims to explore how these zero-shot capabilities can be applied to human motion analysis tasks, such as motion prediction, generation, and denoising. By converting human motion data into token sequences, the project will assess the effectiveness of pre-trained foundation models in digital human modeling. Students will conduct a literature review, design experimental pipelines, and run tests to evaluate the feasibility of using LLMs and VLMs for motion analysis, while exploring optimal tokenization schemes and input modalities.
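One possible tokenization scheme, sketched below purely as an illustration (the bin count, coordinate range, and text layout are assumptions, not part of the project description), quantizes each joint coordinate into an integer bin so that a motion sequence becomes a plain-text token stream that a pretrained LLM can continue in-context.

# Minimal sketch of a text-based motion tokenization for in-context LLM prompting.
import numpy as np

N_BINS = 1024
RANGE = (-2.0, 2.0)  # assumed working volume in meters

def motion_to_tokens(motion: np.ndarray) -> str:
    """motion: (T, J, 3) joint positions -> one whitespace-separated line per frame."""
    lo, hi = RANGE
    bins = np.clip(((motion - lo) / (hi - lo) * (N_BINS - 1)).round(), 0, N_BINS - 1)
    return "\n".join(" ".join(str(int(v)) for v in frame.reshape(-1)) for frame in bins)

def tokens_to_motion(text: str, n_joints: int) -> np.ndarray:
    """Inverse mapping: decode an LLM completion back into joint positions."""
    lo, hi = RANGE
    frames = [list(map(int, line.split())) for line in text.strip().splitlines()]
    bins = np.array(frames, dtype=np.float64).reshape(len(frames), n_joints, 3)
    return bins / (N_BINS - 1) * (hi - lo) + lo

In this scheme, the serialized motion history is placed in the prompt of an off-the-shelf LLM, the model is asked to continue the sequence, and the completion is decoded with tokens_to_motion; comparing such schemes is exactly the kind of design question the project would study.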
Supervisors: sergey.prokudin@inf.ethz.ch
Sign language is a visual means of communication that uses hand shapes, facial expressions, body movements, and gestures to convey meaning. It serves as the primary language for the deaf and hard-of-hearing communities. Technologies that capture and generate sign language can bridge communication gaps by enabling real-time translation to text or speech, providing educational tools for non-signers, and improving accessibility in public services like healthcare. This project aims to develop a generative model that can convert spoken language to 3D sign language performance by a human avatar.
Supervisors: kaifeng.zhao@inf.ethz.ch
This project aims to evaluate the point tracking performance of state-of-the-art dynamic 3D reconstruction methods on multi-view videos from the TAPVid-3D benchmark. In addition to performance evaluation, failure cases will be analyzed, and improvements will be explored based on the time available during the project.
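For intuition, the snippet below sketches a simplified 3D position-accuracy metric in the spirit of point-tracking evaluation. The actual TAPVid-3D metrics (e.g., depth-scaled thresholds and occlusion handling) are defined by the benchmark itself and are not reproduced here; the thresholds and shapes below are illustrative assumptions.

# Simplified sketch of a 3D point-tracking accuracy metric.
import numpy as np

def position_accuracy(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray,
                      thresholds=(0.05, 0.10, 0.20, 0.40)) -> float:
    """pred, gt: (N, T, 3) tracks in meters; visible: (N, T) boolean visibility mask."""
    err = np.linalg.norm(pred - gt, axis=-1)                 # per-point, per-frame error
    accs = [(err[visible] < t).mean() for t in thresholds]   # fraction within each threshold
    return float(np.mean(accs))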
Supervisor: frano.rajic@inf.ethz.ch
The goal of this project is to investigate methods to learn human-scene interaction skills from 2D observations.
Supervisors: kaifeng.zhao@inf.ethz.ch, siwei.zhang@inf.ethz.ch
The goal of this project is to investigate methods to generate 3D facial animations leveraging diffusion models. Diffusion models have shown compelling results in human motion generation. Recent work leverages these models to synthesize full-body motions from sparse inputs (e.g., head and hand tracking signals). This project will explore extensions of this approach to facial animation, e.g., synthesizing face motion from sparse 2D/3D keypoints.
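As a rough illustration of how such a conditional diffusion model could be trained (an assumed setup, not the specific method the project will follow), the sketch below shows a single denoising training step in which a network predicts the noise added to a window of facial motion parameters, conditioned on sparse keypoints; the tensor shapes, the model interface, and the noise schedule are placeholders.

# Minimal sketch of one conditional denoising (DDPM-style) training step.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, face_motion, keypoints, alphas_cumprod):
    """face_motion: (B, T, D) expression params; keypoints: (B, T, K, 3) sparse condition;
    alphas_cumprod: (num_steps,) noise schedule on the same device."""
    B = face_motion.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=face_motion.device)
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(face_motion)
    noisy = a.sqrt() * face_motion + (1 - a).sqrt() * noise   # forward diffusion
    pred = model(noisy, t, keypoints)                          # network conditioned on keypoints
    return F.mse_loss(pred, noise)                             # standard noise-prediction loss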
Supervisors: qianli.ma@inf.ethz.ch, fbogo@meta.com
This project aims to leverage the recent 3D human motion dataset CIRCLE to develop a generative human motion model that synthesizes highly complex human-scene interactions.
Supervisors: gen.li@inf.ethz.ch, yan.zhang@inf.ethz.ch
This project aims to build a system to capture interactions between people and the environment.
Supervisors: yan.zhang@inf.ethz.ch, kraus@ibk.baug.ethz.ch
This project attempts to learn object geometry and appearance from a set of 2D images while allowing for scale-specific control. Recent advances in volume rendering have brought great progress in realistic, controllable 2D image synthesis and compelling 3D-aware generation results. The core idea of this project is to extend a recent 3D generator to enable a level of control over both appearance and geometry.
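For reference, the sketch below shows the standard volume-rendering compositing step that such 3D-aware generators build on: densities and colors sampled along each ray are alpha-composited into pixel colors. Tensor shapes and names here are illustrative assumptions, not code from the project.

# Minimal sketch of per-ray alpha compositing (NeRF-style quadrature).
import torch

def composite(sigma: torch.Tensor, rgb: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """sigma: (R, S) densities, rgb: (R, S, 3) colors, deltas: (R, S) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                        # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                          # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                  # (R, 3) pixel colors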
Supervisors: anpei.chen@inf.ethz.ch
Supervisors: kkarunrat@inf.ethz.ch
This project attempts to reconstruct the geometry and appearance of 4D scenes (a static scene plus moving objects). We will start with decomposable radiance field reconstruction in a specific setting: a medium-scale static environment (a room or an outdoor street) and one class of objects (humans or cars).
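One possible way to decompose the field, sketched below as an assumption rather than the project's final design, is to keep a static radiance field and a time-conditioned dynamic field, add their densities, and blend their colors by each branch's share of the total density before the usual volume-rendering step.

# Minimal sketch of composing a static and a dynamic radiance field.
import torch

def composed_field(static_field, dynamic_field, xyz, t):
    """static_field(xyz) -> (sigma_s, rgb_s); dynamic_field(xyz, t) -> (sigma_d, rgb_d)."""
    sigma_s, rgb_s = static_field(xyz)
    sigma_d, rgb_d = dynamic_field(xyz, t)
    sigma = sigma_s + sigma_d                                  # densities add
    w = sigma_s / (sigma + 1e-10)                              # static branch's share
    rgb = w.unsqueeze(-1) * rgb_s + (1 - w).unsqueeze(-1) * rgb_d
    return sigma, rgb  # fed into the usual volume-rendering compositing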
Supervisors: anpei.chen@inf.ethz.ch
Supervisors: Francis Engelmann (francisengelmann@ai.ethz.ch)
Supervisors: Shengyu Huang (shengyu.huang@geod.baug.ethz.ch), Xuyang Bai (xbaiad@connect.ust.hk), Dr. Theodora Kontogianni (theodora.kontogianni@inf.ethz.ch), Prof. Dr. Konrad Schindler (konrad.schindler@geod.baug.ethz.ch)