Novel view synthesis (NVS), a fundamental problem in computer vision, seeks to generate renderings from novel target viewpoints given a set of input viewpoints. Achieving this requires addressing several complex challenges: (1) inferring the geometric structure of a scene from 2D observations, (2) rendering the inferred 3D reconstruction from new viewpoints in a physically plausible manner, and (3) inpainting or extrapolating missing regions that are not observed in the input viewpoints. To tackle these challenges, diverse 3D representations, along with classical geometric constraints, advanced optimization techniques, and deep stereo priors, have been extensively studied. In recent years, diffusion generative models for 2D images and videos have demonstrated remarkable capabilities in generating photorealistic images. These advancements have opened new avenues for enhancing NVS by leveraging the priors encoded in these models. This project aims to investigate the types of prior knowledge encoded within 2D generative models that can most effectively benefit NVS. Unlike many contemporary approaches that fine-tune pretrained generative models for specific NVS tasks, this research adopts a zero-shot framework.
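To make the zero-shot idea concrete, the sketch below illustrates one possible way to probe challenge (3): an off-the-shelf latent-diffusion inpainting model from the Hugging Face diffusers library fills regions of a target view that remain empty after warping a source view. This is only an illustration under assumptions, not the project's prescribed method; the files warped.png and mask.png, the prompt, and the checkpoint name are placeholders, and the geometric warping step is assumed to exist separately.

# Minimal sketch: query a pretrained 2D inpainting diffusion model, zero-shot,
# to extrapolate regions of a target view unobserved in the source views.
# Assumes warped.png (source view reprojected into the target camera) and
# mask.png (white where no source pixel lands) come from a separate geometry step.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # example checkpoint, an assumption
    torch_dtype=torch.float16,
).to("cuda")

warped = Image.open("warped.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))

# The diffusion prior hallucinates the missing content; no fine-tuning is involved.
result = pipe(prompt="a photo of the scene", image=warped, mask_image=mask).images[0]
result.save("novel_view.png")

The key point of the sketch is that the generative prior is queried as-is, which is what distinguishes the zero-shot setting from fine-tuning-based NVS approaches.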
Supervisors: yutong.chen@inf.ethz.ch
Recent advancements in 3D reconstruction, such as neural radiance fields (NeRF) and 3D Gaussian splatting, have led to impressive results in high-quality novel view synthesis. However, these techniques still face challenges when it comes to extracting accurate geometry, particularly in scenes with reflective or transparent surfaces. At the same time, monocular depth estimation using data-driven or diffusion-based models has shown great promise in inferring depth from a single image, and in certain controlled scenarios, access to ground-truth depth information further enables a more precise understanding of scene geometry. This project aims to investigate how depth or normal cues can be integrated into 3D reconstruction pipelines to improve geometric accuracy. The student will explore various methods for incorporating monocular geometric cues, either through direct supervision or indirectly by leveraging depth-aware features, and evaluate the effectiveness of these approaches in challenging scenarios.
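As one concrete example of direct supervision with monocular cues, the sketch below aligns a monocular depth prediction to the depth rendered by a NeRF- or Gaussian-splatting-style pipeline using a per-view scale and shift (monocular depth is only defined up to an affine ambiguity) and penalizes the remaining residual. This is a minimal illustration under assumed conventions, not the project's mandated approach; the L1 penalty, variable names, and loss weighting are illustrative choices.

# Minimal sketch of a depth-supervision term for a radiance-field pipeline.
import torch

def depth_supervision_loss(rendered_depth: torch.Tensor,
                           mono_depth: torch.Tensor,
                           valid: torch.Tensor) -> torch.Tensor:
    """rendered_depth, mono_depth, valid: flattened per-pixel tensors of one view."""
    d = mono_depth[valid]
    r = rendered_depth[valid]
    # Closed-form least-squares fit of scale s and shift t so that s*d + t ~ r.
    A = torch.stack([d, torch.ones_like(d)], dim=-1)       # (N, 2)
    sol = torch.linalg.lstsq(A, r.unsqueeze(-1)).solution  # (2, 1)
    aligned = (A @ sol).squeeze(-1)                        # (N,)
    return torch.abs(aligned - r).mean()

This term would be added, with a small weight, to the usual photometric loss of the reconstruction pipeline; both the weight and the alignment strategy are assumptions for illustration.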
Supervisors: johannes.weidenfeller@ai.ethz.ch, lilian.calvet@balgrist.ch
In the era of autonomy, the creation of a 3D digital world that faithfully replicates our physical reality becomes increasingly critical. Central to this endeavor is the incorporation of realistic human behaviors. Moreover, human behaviors are intricately rooted in their environments: our movements are influenced by our interactions with various objects and by the spatial arrangement of our surroundings. Therefore, it is essential not only to model human motion itself but also to model how humans interact with the surrounding environment. Creating human motions within diverse environments has significant applications across numerous fields, including augmented reality (AR), virtual reality (VR), assistive robotics, biomechanics, filmmaking, and the gaming industry. However, capturing human motion in real environments requires expensive devices, complicated hardware setups, and significant manual effort, and therefore does not scale to creating large human-scene interaction datasets. In this project, we explore how to leverage 2D foundation models to synthesize 3D human motions in various environments in an efficient and scalable way. The project starts in December 2024 or January 2025.
Supervisor: siwei.zhang@inf.ethz.ch
Pre-trained large language models (LLMs) and vision-language models (VLMs) have demonstrated the ability to understand and autoregressively complete complex token sequences, enabling them to capture both the physical and semantic properties of a scene. By leveraging in-context learning, these models can function as general sequence modelers without requiring additional training. This project aims to explore how these zero-shot capabilities can be applied to human motion analysis tasks, such as motion prediction, generation, and denoising. By converting human motion data into token sequences, the project will assess the effectiveness of pre-trained foundation models in digital human modeling. Students will conduct a literature review, design experimental pipelines, and run tests to evaluate the feasibility of using LLMs and VLMs for motion analysis, while exploring optimal tokenization schemes and input modalities.
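One possible tokenization scheme, sketched below purely as an illustration (the bin count, coordinate range, and text layout are assumptions, not part of the project description), quantizes each joint coordinate into an integer bin so that a motion sequence becomes a plain-text token stream that a pretrained LLM can continue in-context.

# Minimal sketch of a text-based motion tokenization for in-context LLM prompting.
import numpy as np

N_BINS = 1024
RANGE = (-2.0, 2.0)  # assumed working volume in meters

def motion_to_tokens(motion: np.ndarray) -> str:
    """motion: (T, J, 3) joint positions -> one whitespace-separated line per frame."""
    lo, hi = RANGE
    bins = np.clip(((motion - lo) / (hi - lo) * (N_BINS - 1)).round(), 0, N_BINS - 1)
    return "\n".join(" ".join(str(int(v)) for v in frame.reshape(-1)) for frame in bins)

def tokens_to_motion(text: str, n_joints: int) -> np.ndarray:
    """Inverse mapping: decode an LLM completion back into joint positions."""
    lo, hi = RANGE
    frames = [list(map(int, line.split())) for line in text.strip().splitlines()]
    bins = np.array(frames, dtype=np.float64).reshape(len(frames), n_joints, 3)
    return bins / (N_BINS - 1) * (hi - lo) + lo

In this scheme, the serialized motion history is placed in the prompt of an off-the-shelf LLM, the model is asked to continue the sequence, and the completion is decoded with tokens_to_motion; comparing such schemes is exactly the kind of design question the project would study.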
Supervisors: sergey.prokudin@inf.ethz.ch
Sign language is a visual means of communication that uses hand shapes, facial expressions, body movements, and gestures to convey meaning. It serves as the primary language for the deaf and hard-of-hearing communities. Technologies that capture and generate sign language can bridge communication gaps by enabling real-time translation to text or speech, providing educational tools for non-signers, and improving accessibility in public services like healthcare. This project aims to develop a generative model that can convert spoken language to 3D sign language performance by a human avatar.
Supervisors: kaifeng.zhao@inf.ethz.ch
This project aims to evaluate the point tracking performance of state-of-the-art dynamic 3D reconstruction methods on multi-view videos from the TAPVid-3D benchmark. In addition to performance evaluation, failure cases will be analyzed, and improvements will be explored based on the time available during the project.
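For intuition, the snippet below sketches a simplified 3D position-accuracy metric in the spirit of point-tracking evaluation. The actual TAPVid-3D metrics (e.g., depth-scaled thresholds and occlusion handling) are defined by the benchmark itself and are not reproduced here; the thresholds and shapes below are illustrative assumptions.

# Simplified sketch of a 3D point-tracking accuracy metric.
import numpy as np

def position_accuracy(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray,
                      thresholds=(0.05, 0.10, 0.20, 0.40)) -> float:
    """pred, gt: (N, T, 3) tracks in meters; visible: (N, T) boolean visibility mask."""
    err = np.linalg.norm(pred - gt, axis=-1)                 # per-point, per-frame error
    accs = [(err[visible] < t).mean() for t in thresholds]   # fraction within each threshold
    return float(np.mean(accs))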
Supervisor: frano.rajic@inf.ethz.ch
The goal of this project is to investigate methods to learn human-scene interaction skills from 2D observations.
Supervisors: kaifeng.zhao@inf.ethz.ch, siwei.zhang@inf.ethz.ch
The goal of this project is to investigate methods to generate 3D facial animations leveraging diffusion models. Diffusion models have shown compelling results in human motion generation. Recent work leverages these models to synthesize full-body motions from sparse inputs (e.g., head and hand tracking signals). This project will explore extensions of this approach to facial animation, e.g., synthesizing face motion from sparse 2D/3D keypoints.
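As a rough illustration of how such a conditional diffusion model could be trained (an assumed setup, not the specific method the project will follow), the sketch below shows a single denoising training step in which a network predicts the noise added to a window of facial motion parameters, conditioned on sparse keypoints; the tensor shapes, the model interface, and the noise schedule are placeholders.

# Minimal sketch of one conditional denoising (DDPM-style) training step.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, face_motion, keypoints, alphas_cumprod):
    """face_motion: (B, T, D) expression params; keypoints: (B, T, K, 3) sparse condition;
    alphas_cumprod: (num_steps,) noise schedule on the same device."""
    B = face_motion.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=face_motion.device)
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(face_motion)
    noisy = a.sqrt() * face_motion + (1 - a).sqrt() * noise   # forward diffusion
    pred = model(noisy, t, keypoints)                          # network conditioned on keypoints
    return F.mse_loss(pred, noise)                             # standard noise-prediction loss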
Supervisors: qianli.ma@inf.ethz.ch, fbogo@meta.com
This project aims to leverage the recent 3D human motion dataset CIRCLE to develop a generative human motion model that synthesizes highly complex human-scene interactions.
Supervisors: gen.li@inf.ethz.ch, yan.zhang@inf.ethz.ch
This project aims to build a system to capture interactions between people and the environment.
Supervisors: yan.zhang@inf.ethz.ch, kraus@ibk.baug.ethz.ch
This project attempts to learn object geometry and appearance from a set of 2D images while allowing for scale-specific control. Recent advances in volume rendering have brought great progress in realistic, controllable 2D image synthesis and compelling 3D-aware generation results. The core idea of this project is to extend a recent 3D generator to enable a level of control over both appearance and geometry.
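For reference, the sketch below shows the standard volume-rendering compositing step that such 3D-aware generators build on: densities and colors sampled along each ray are alpha-composited into pixel colors. Tensor shapes and names here are illustrative assumptions, not code from the project.

# Minimal sketch of per-ray alpha compositing (NeRF-style quadrature).
import torch

def composite(sigma: torch.Tensor, rgb: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """sigma: (R, S) densities, rgb: (R, S, 3) colors, deltas: (R, S) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                        # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                          # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                  # (R, 3) pixel colors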
Supervisors: anpei.chen@inf.ethz.ch
Supervisors: kkarunrat@inf.ethz.ch
This project attempts to reconstruct the geometry and appearance of 4D scenes (a static scene plus moving objects). We will start with decomposable radiance field reconstruction in a specific setting: a medium-scale static environment (a room or an outdoor street) and one class of objects (humans or cars).
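One possible way to decompose the field, sketched below as an assumption rather than the project's final design, is to keep a static radiance field and a time-conditioned dynamic field, add their densities, and blend their colors by each branch's share of the total density before the usual volume-rendering step.

# Minimal sketch of composing a static and a dynamic radiance field.
import torch

def composed_field(static_field, dynamic_field, xyz, t):
    """static_field(xyz) -> (sigma_s, rgb_s); dynamic_field(xyz, t) -> (sigma_d, rgb_d)."""
    sigma_s, rgb_s = static_field(xyz)
    sigma_d, rgb_d = dynamic_field(xyz, t)
    sigma = sigma_s + sigma_d                                  # densities add
    w = sigma_s / (sigma + 1e-10)                              # static branch's share
    rgb = w.unsqueeze(-1) * rgb_s + (1 - w).unsqueeze(-1) * rgb_d
    return sigma, rgb  # fed into the usual volume-rendering compositing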
Supervisors: anpei.chen@inf.ethz.ch
Supervisors: Francis Engelmann (francisengelmann@ai.ethz.ch)
Supervisors: Shengyu Huang (shengyu.huang@geod.baug.ethz.ch), Xuyang Bai (xbaiad@connect.ust.hk), Dr. Theodora Kontogianni (theodora.kontogianni@inf.ethz.ch), Prof. Dr. Konrad Schindler (konrad.schindler@geod.baug.ethz.ch)