Intelligent Systems


2024


Stable Video Portraits

Ostrek, M., Thies, J.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (inproceedings) Accepted

Abstract
Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.
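
To illustrate the conditioning and temporal denoising idea described above, here is a minimal sketch, not the authors' code: the denoiser, the 3DMM encoder, and all shapes are hypothetical placeholders standing in for the fine-tuned Stable Diffusion components.

```python
# Minimal sketch (not the released method): per-frame denoising conditioned on 3DMM
# renderings, plus a simple temporal blending of latents for smoothness.
import torch

T, C, H, W = 16, 4, 64, 64                    # frames, latent channels, latent resolution
unet = lambda x, t, cond: x - 0.01 * cond     # stand-in for the fine-tuned denoising UNet
encode_3dmm = lambda render: render           # stand-in for the 3DMM conditioning encoder

latents = torch.randn(T, C, H, W)             # shared initial noise across frames
dmm_renders = torch.zeros(T, C, H, W)         # rasterized 3DMM sequence (placeholder)

for step in reversed(range(50)):              # reverse diffusion
    t = torch.full((T,), step)
    cond = encode_3dmm(dmm_renders)
    latents = unet(latents, t, cond)          # per-frame, 3DMM-controlled denoising
    # temporal denoising: blend each latent with its neighbours
    smoothed = latents.clone()
    smoothed[1:-1] = 0.25 * latents[:-2] + 0.5 * latents[1:-1] + 0.25 * latents[2:]
    latents = smoothed
```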

link (url) [BibTex]


Synthesizing Environment-Specific People in Photographs

Ostrek, M., O’Sullivan, C., Black, M., Thies, J.

In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (inproceedings) Accepted

Abstract
We present ESP, a novel method for context-aware full-body generation that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and on contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, so that no changes are made to the original background. Our models are trained on a dataset of in-the-wild photographs of people covering a wide range of environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state of the art on the task of contextual full-body generation.
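
A minimal sketch of the masking idea above (assumed array shapes, not the released ESP code): the human-parsing mask acts as a tight inpainting mask, so background pixels are passed through unchanged.

```python
# Illustrative compositing with a human-parsing mask (HPM): only the person region is
# replaced by generated content; the original background stays bit-exact.
import numpy as np

def composite_with_hpm(background, generated_person, hpm):
    """background, generated_person: HxWx3 float arrays; hpm: HxW mask in {0, 1}."""
    mask = hpm[..., None].astype(np.float32)        # 1 inside the person region
    return mask * generated_person + (1.0 - mask) * background

bg = np.zeros((256, 256, 3), np.float32)            # placeholder scene photograph
person = np.ones((256, 256, 3), np.float32)         # placeholder generated person
hpm = np.zeros((256, 256), np.float32)
hpm[64:192, 96:160] = 1.0                           # tight guiding mask
out = composite_with_hpm(bg, person, hpm)
```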

link (url) [BibTex]


Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles

Sklyarova, V., Zakharov, E., Hilliges, O., Black, M. J., Thies, J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

Abstract
We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that can be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by relying on 2D priors, they are intrinsically limited to recovering only the visible parts. Highly occluded hair structures cannot be reconstructed with those methods, which only model the "outer shell" and are therefore not ready to be used in physics-based rendering or simulation pipelines. In contrast, we propose the first text-guided generative method that uses 3D hair strands as the underlying representation. Leveraging 2D visual question-answering (VQA) systems, we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies, we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches.
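
A sketch of what a strand-based hairstyle UV representation could look like, under stated assumptions: every scalp-UV texel stores a latent strand code that an MLP decodes into a 3D polyline. The decoder architecture, latent size, and texel resolution here are hypothetical.

```python
# Hairstyle-as-texture sketch: latent codes on a scalp UV grid, decoded into 3D strands.
import torch
import torch.nn as nn

LATENT, POINTS = 64, 100                       # code size, points per strand (assumed)
uv_latents = torch.randn(1, LATENT, 32, 32)    # what a latent diffusion model would sample

strand_decoder = nn.Sequential(                # latent code -> flattened xyz polyline
    nn.Linear(LATENT, 256), nn.ReLU(),
    nn.Linear(256, POINTS * 3),
)

codes = uv_latents.permute(0, 2, 3, 1).reshape(-1, LATENT)   # one code per scalp texel
strands = strand_decoder(codes).reshape(-1, POINTS, 3)       # (num_strands, 100, 3)
```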

ArXiv Code link (url) [BibTex]


Neuropostors: Neural Geometry-aware 3D Crowd Character Impostors

Ostrek, M., Mitra, N. J., O’Sullivan, C.

In 2024 27th International Conference on Pattern Recognition (ICPR), Springer, 2024 27th International Conference on Pattern Recognition (ICPR), June 2024 (inproceedings) Accepted

Abstract
Crowd rendering and animation was a very active research area over a decade ago, but activity has since lessened, mainly due to improvements in graphics acceleration hardware. Nevertheless, there is still high demand for generating varied crowd appearances and animation for games, movie production, and mixed-reality applications. Current approaches remain limited in terms of both the behavioral and appearance aspects of virtual characters due to (i) high memory and computational demands; and (ii) the person-hours of skilled artists required within short production cycles. A promising earlier approach to generating varied crowds was the use of pre-computed impostor representations for crowd characters, which replace the animation of a 3D mesh with a simplified 2D impostor for every frame of an animation sequence, e.g., Geopostors [1]. However, their high memory demands, at a time when improvements in consumer graphics accelerators were outpacing memory availability, limited the practicality of such methods. Inspired by this early work and by recent advances in the field of Neural Rendering, we present a new character representation: Neuropostors. We train a Convolutional Neural Network to compress both the geometric properties and the animation key-frames of a 3D character, thereby allowing for constant-time rendering of animated characters from arbitrary camera views. Our method also allows for explicit illumination and material control by utilizing a flexible rendering equation that is connected to the outputs of the neural network.
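
The following is a hedged sketch of the general idea of connecting network outputs to an explicit rendering equation; the network architecture, the view/frame code, and the simple Lambertian shading are assumptions for illustration, not the paper's model.

```python
# Sketch: a small network predicts per-pixel normals and albedo for a character from a
# view/frame code; shading is then applied explicitly, keeping lights and materials
# controllable at render time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Neuropostor(nn.Module):
    def __init__(self, code_dim=16, res=64):
        super().__init__()
        self.res = res
        self.fc = nn.Linear(code_dim, 32 * res * res)
        self.head = nn.Conv2d(32, 6, 1)                 # 3 normal + 3 albedo channels

    def forward(self, code):
        x = self.fc(code).view(-1, 32, self.res, self.res)
        out = self.head(x)
        normals = F.normalize(out[:, :3], dim=1)
        albedo = torch.sigmoid(out[:, 3:])
        return normals, albedo

net = Neuropostor()
code = torch.randn(1, 16)                               # encodes camera view + animation frame
normals, albedo = net(code)
light = F.normalize(torch.tensor([0.0, 1.0, 1.0]), dim=0)
shading = (normals * light.view(1, 3, 1, 1)).sum(1, keepdim=True).clamp(min=0)
image = albedo * shading                                # simple Lambertian rendering equation
```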

[BibTex]


TECA: Text-Guided Generation and Editing of Compositional 3D Avatars

Zhang, H., Feng, Y., Kulits, P., Wen, Y., Thies, J., Black, M. J.

In International Conference on 3D Vision (3DV 2024), 3DV 2024, March 2024 (inproceedings) To be published

Abstract
Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.
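
To make the compositional representation concrete, here is a hedged sketch of one way the two layers could be combined, assuming a mesh render for the face/body and an accumulated-opacity NeRF render for hair and accessories; this is an illustration, not TECA's actual pipeline.

```python
# Alpha-compositing a NeRF layer (hair, scarf, accessories) over a mesh render (face, body).
import numpy as np

def composite_avatar(mesh_rgb, nerf_rgb, nerf_alpha):
    """mesh_rgb, nerf_rgb: HxWx3; nerf_alpha: HxW accumulated NeRF opacity in [0, 1]."""
    a = nerf_alpha[..., None]
    return a * nerf_rgb + (1.0 - a) * mesh_rgb

H = W = 128
face_layer = np.full((H, W, 3), 0.8, np.float32)    # placeholder mesh render
hair_layer = np.full((H, W, 3), 0.2, np.float32)    # placeholder NeRF render
alpha = np.zeros((H, W), np.float32)
alpha[:48] = 1.0                                    # hair region covers the top of the image
image = composite_avatar(face_layer, hair_layer, alpha)
```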

arXiv project link (url) [BibTex]


GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar

Kabadayi, B., Zielonka, W., Bhatnagar, B. L., Pons-Moll, G., Thies, J.

In International Conference on 3D Vision (3DV), March 2024 (inproceedings)

Abstract
Digital humans and, especially, 3D facial avatars have attracted a lot of attention in the past years, as they are the backbone of several applications like immersive telepresence in AR or VR. Despite the progress, facial avatars reconstructed from commodity hardware are incomplete and miss parts of the sides and back of the head, severely limiting the usability of the avatar. This limitation in prior work stems from their requirement of face tracking, which fails for profile and back views. To address this issue, we propose to learn person-specific animatable avatars from images without assuming access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data. To train this appearance model, we only assume a collection of 2D images with the corresponding camera parameters. For controlling the model, we learn a mapping from 3DMM facial expression parameters to the latent space of the generative model. This mapping can be learned by sampling the latent space of the appearance model and reconstructing the facial parameters from a normalized frontal view, where facial expression estimation performs well. With this scheme, we decouple 3D appearance reconstruction and animation control to achieve high fidelity in image synthesis. In a series of experiments, we compare our proposed technique to state-of-the-art monocular methods and show superior quality while not requiring expression tracking of the training data.
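
A minimal sketch of the control-mapping idea, under stated assumptions: the generator and expression regressor below are placeholders for the trained appearance model and an off-the-shelf expression estimator applied to a normalized frontal render.

```python
# Learning a mapping from 3DMM expression parameters to the latent space of a pre-trained
# 3D-aware generator, using samples from the latent space as self-supervision.
import torch
import torch.nn as nn

LATENT, EXPR = 128, 50
generator = lambda w: w                                   # stand-in: latent -> frontal render
estimate_expression = nn.Linear(LATENT, EXPR)             # stand-in expression regressor

mapper = nn.Sequential(nn.Linear(EXPR, 256), nn.ReLU(), nn.Linear(256, LATENT))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)

for _ in range(100):
    w = torch.randn(32, LATENT)                           # sample the appearance latent space
    with torch.no_grad():
        expr = estimate_expression(generator(w))          # pseudo-labels from frontal renders
    loss = ((mapper(expr) - w) ** 2).mean()               # regress the latent from the expression
    opt.zero_grad(); loss.backward(); opt.step()
```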

Video Webpage Code Arxiv [BibTex]

2023


Instant Volumetric Head Avatars

Zielonka, W., Bolkart, T., Thies, J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2023, June 2023 (inproceedings)

Abstract
We present Instant Volumetric Head Avatars (INSTA), a novel approach for reconstructing photo-realistic digital avatars instantaneously. INSTA models a dynamic neural radiance field based on neural graphics primitives embedded around a parametric face model. Our pipeline is trained on a single monocular RGB portrait video that observes the subject under different expressions and views. While state-of-the-art methods take up to several days to train an avatar, our method can reconstruct a digital avatar in less than 10 minutes on modern GPU hardware, which is orders of magnitude faster than previous solutions. In addition, it allows for the interactive rendering of novel poses and expressions. By leveraging the geometry prior of the underlying parametric face model, we demonstrate that INSTA extrapolates to unseen poses. In quantitative and qualitative studies on various subjects, INSTA outperforms state-of-the-art methods regarding rendering quality and training time.
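
The sketch below is a heavily simplified illustration of embedding a radiance field around a tracked face mesh: ray samples are mapped into canonical space with a nearest-vertex translation, and a small MLP stands in for the neural-graphics-primitives hash grid; none of this is the released INSTA code.

```python
# Deformed-to-canonical warping guided by a parametric face mesh, then a canonical field query.
import torch
import torch.nn as nn

canonical_verts = torch.rand(500, 3)                 # parametric face mesh, canonical pose
deformed_verts = canonical_verts + 0.05              # the same mesh under the current expression

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))  # (rgb, density)

def query(points_deformed):
    d = torch.cdist(points_deformed, deformed_verts)        # nearest deformed vertex
    idx = d.argmin(dim=1)
    offset = canonical_verts[idx] - deformed_verts[idx]      # per-vertex deformation
    return field(points_deformed + offset)                   # query in canonical space

samples = torch.rand(1024, 3)                                # ray samples in deformed space
rgb_sigma = query(samples)                                   # (1024, 4)
```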

pdf project video code face tracker code dataset [BibTex]


MIME: Human-Aware 3D Scene Generation

Yi, H., Huang, C. P., Tripathi, S., Hering, L., Thies, J., Black, M. J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 12965-12976, CVPR 2023, June 2023 (inproceedings) Accepted

Abstract
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a “scanner” of the 3D world. Intuitively, human movement indicates the free space in a room, and human contact indicates surfaces or objects that support activities such as sitting, lying, or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not account for human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.
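
A sketch of one auto-regressive prediction step, with toy dimensions and placeholder heads (not MIME's architecture): already-placed object tokens and embedded human-motion cues are encoded together, and the next object's class and box parameters are predicted.

```python
# One step of auto-regressive scene generation conditioned on human motion.
import torch
import torch.nn as nn

D, N_CLASSES = 64, 20
enc_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
class_head = nn.Linear(D, N_CLASSES + 1)       # +1 for an end-of-scene token
box_head = nn.Linear(D, 7)                     # position (3), size (3), yaw (1)

object_tokens = torch.randn(1, 5, D)           # objects generated so far
motion_tokens = torch.randn(1, 30, D)          # embedded contact / free-space motion cues
h = encoder(torch.cat([motion_tokens, object_tokens], dim=1))
query = h[:, -1]                               # summary token for the next prediction
next_class = class_head(query).softmax(-1)
next_box = box_head(query)
```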

project arXiv paper [BibTex]


DINER: Depth-aware Image-based Neural Radiance Fields

Prinzler, M., Hilliges, O., Thies, J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2023, 2023 (inproceedings) Accepted

Abstract
We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison to the previous state of the art, DINER achieves higher synthesis quality and can process input views with greater disparity. This allows us to capture scenes more completely without changing capturing hardware requirements and ultimately enables larger viewpoint changes during novel view synthesis. We evaluate our method by synthesizing novel views, both for human heads and for general objects, and observe significantly improved qualitative results and increased perceptual metrics compared to the previous state of the art. The code is publicly available for research purposes.
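
A minimal sketch of depth-guided ray sampling, as an illustration of how predicted depth can steer scene sampling; the band width and clamping bounds are assumptions, not DINER's exact scheme.

```python
# Concentrate ray samples in a narrow band around a per-ray depth prediction, rather than
# spreading them uniformly between the near and far planes.
import torch

def depth_guided_samples(pred_depth, n_samples=32, band=0.1, near=0.2, far=2.0):
    """pred_depth: (R,) predicted depth per ray -> (R, n_samples) sample depths."""
    offsets = torch.linspace(-band, band, n_samples)     # narrow band around the prediction
    z = pred_depth[:, None] + offsets[None, :]
    return z.clamp(near, far)

rays_depth = torch.full((1024,), 1.3)                    # placeholder depth predictions
z_vals = depth_guided_samples(rays_depth)                # (1024, 32)
```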

Video Code Arxiv link (url) [BibTex]

2022


Towards Metrical Reconstruction of Human Faces

Zielonka, W., Bolkart, T., Thies, J.

In Computer Vision – ECCV 2022, 13, pages: 250-269, Lecture Notes in Computer Science, 13673, (Editors: Avidan, Shai and Brostow, Gabriel and Cissé, Moustapha and Farinella, Giovanni Maria and Hassner, Tal), Springer, Cham, 17th European Conference on Computer Vision (ECCV 2022), October 2022 (inproceedings)

Abstract
Face reconstruction and tracking is a building block of numerous applications in AR/VR, human-machine interaction, as well as medical applications. Most of these applications rely on a metrically correct prediction of the shape, especially, when the reconstructed subject is put into a metrical context (i.e., when there is a reference object of known size). A metrical reconstruction is also needed for any application that measures distances and dimensions of the subject (e.g., to virtually fit a glasses frame). State-of-the-art methods for face reconstruction from a single image are trained on large 2D image datasets in a self-supervised fashion. However, due to the nature of a perspective projection they are not able to reconstruct the actual face dimensions, and even predicting the average human face outperforms some of these methods in a metrical sense. To learn the actual shape of a face, we argue for a supervised training scheme. Since there exists no large-scale 3D dataset for this task, we annotated and unified small- and medium-scale databases. The resulting unified dataset is still a medium-scale dataset with more than 2k identities and training purely on it would lead to overfitting. To this end, we take advantage of a face recognition network pretrained on a large-scale 2D image dataset, which provides distinct features for different faces and is robust to expression, illumination, and camera changes. Using these features, we train our face shape estimator in a supervised fashion, inheriting the robustness and generalization of the face recognition network. Our method, which we call MICA (MetrIC fAce), outperforms the state-of-the-art reconstruction methods by a large margin, both on current non-metric benchmarks as well as on our metric benchmarks (15% and 24% lower average error on NoW, respectively). Project website: https://zielon.github.io/mica/.
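
A sketch of the supervised training scheme described above, with placeholder modules: a frozen face-recognition backbone provides identity features, and a small head regresses metrical shape parameters from them. The backbone, input format, and loss here are illustrative assumptions, not the released MICA code.

```python
# Supervised shape regression on top of frozen identity features.
import torch
import torch.nn as nn

FEAT, SHAPE = 512, 300
backbone = nn.Linear(3 * 112 * 112, FEAT)                 # stand-in for a frozen recognition net
for p in backbone.parameters():
    p.requires_grad = False

shape_head = nn.Sequential(nn.Linear(FEAT, 300), nn.ReLU(), nn.Linear(300, SHAPE))
opt = torch.optim.Adam(shape_head.parameters(), lr=1e-4)

images = torch.randn(8, 3 * 112 * 112)                    # flattened face crops (placeholder)
gt_shape = torch.randn(8, SHAPE)                          # registered 3D supervision
feats = backbone(images).detach()                         # robust, identity-specific features
loss = (shape_head(feats) - gt_shape).abs().mean()        # supervised L1 in shape space
opt.zero_grad(); loss.backward(); opt.step()
```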

pdf project video code DOI [BibTex]


Human-Aware Object Placement for Visual Environment Reconstruction

Yi, H., Huang, C. P., Tzionas, D., Kocabas, M., Hassan, M., Tang, S., Thies, J., Black, M. J.

In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), pages: 3949-3960, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), June 2022 (inproceedings)

Abstract
Humans are in constant contact with the world as they move through it and interact with it. This contact is a vital source of information for understanding 3D humans, 3D scenes, and the interactions between them. In fact, we demonstrate that these human-scene interactions (HSIs) can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video. Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images, and optimize the 3D scene to reconstruct a consistent, physically plausible and functional 3D scene layout. Our optimization-based approach exploits three types of HSI constraints: (1) humans that move in a scene are occluded or occlude objects, thus, defining the depth ordering of the objects, (2) humans move through free space and do not interpenetrate objects, (3) when humans and objects are in contact, the contact surfaces occupy the same place in space. Using these constraints in an optimization formulation across all observations, we significantly improve the 3D scene layout reconstruction. Furthermore, we show that our scene reconstruction can be used to refine the initial 3D human pose and shape (HPS) estimation. We evaluate the 3D scene layout reconstruction and HPS estimation qualitatively and quantitatively using the PROX and PiGraphs datasets. The code and data are available for research purposes at https://mover.is.tue.mpg.de/.
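
The sketch below only illustrates the structure of the optimization: object poses are updated under three human-scene interaction terms (occlusion/depth ordering, free-space non-interpenetration, contact). The individual loss functions are toy placeholders, not the paper's formulation.

```python
# Optimizing object placements under three HSI-style constraint terms.
import torch

object_translations = torch.zeros(5, 3, requires_grad=True)   # one 3D offset per object
opt = torch.optim.Adam([object_translations], lr=1e-2)

def occlusion_term(t):   return t[:, 2].clamp(min=0).sum()      # toy depth-ordering prior
def free_space_term(t):  return (t[:, :2] ** 2).sum()           # toy "stay out of free space"
def contact_term(t):     return ((t[:, 2] + 1.0) ** 2).sum()    # toy "meet the contact surface"

for _ in range(200):
    loss = (occlusion_term(object_translations)
            + free_space_term(object_translations)
            + contact_term(object_translations))
    opt.zero_grad(); loss.backward(); opt.step()
```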

project arXiv DOI Project Page [BibTex]


Neural Head Avatars from Monocular RGB Videos

Grassal, P., Prinzler, M., Leistner, T., Rother, C., Nießner, M., Thies, J.

In 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 18632-18643, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (inproceedings)

Abstract
We present Neural Head Avatars, a novel neural representation that explicitly models the surface geometry and appearance of an animatable human avatar that can be used for teleconferencing in AR/VR or other applications in the movie or games industry that rely on a digital human. Our representation can be learned from a monocular RGB portrait video that features a range of different expressions and views. Specifically, we propose a hybrid representation consisting of a morphable model for the coarse shape and expressions of the face, and two feed-forward networks predicting vertex offsets of the underlying mesh as well as a view- and expression-dependent texture. We demonstrate that this representation is able to accurately extrapolate to unseen poses and viewpoints, and generates natural expressions while providing sharp texture details. Compared to previous works on head avatars, our method provides a disentangled shape and appearance model of the complete human head (including hair) that is compatible with the standard graphics pipeline. Moreover, it quantitatively and qualitatively outperforms the current state of the art in terms of reconstruction quality and novel-view synthesis.
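
A minimal sketch of the hybrid representation, with a placeholder morphable-model interface: one MLP refines the coarse mesh with per-vertex offsets, and a second MLP predicts a view- and expression-dependent texture. Vertex counts and layer sizes are assumptions.

```python
# Coarse morphable-model mesh + offset MLP + expression/view-dependent texture MLP.
import torch
import torch.nn as nn

N_VERTS, EXPR = 5023, 100
coarse_verts = torch.rand(N_VERTS, 3)                       # coarse morphable-model geometry

offset_mlp = nn.Sequential(nn.Linear(3 + EXPR, 128), nn.ReLU(), nn.Linear(128, 3))
texture_mlp = nn.Sequential(nn.Linear(2 + EXPR + 3, 128), nn.ReLU(), nn.Linear(128, 3))

expr = torch.randn(EXPR)
verts_in = torch.cat([coarse_verts, expr.expand(N_VERTS, EXPR)], dim=1)
refined_verts = coarse_verts + offset_mlp(verts_in)         # per-vertex geometry refinement

uv = torch.rand(N_VERTS, 2)
view_dir = torch.tensor([0.0, 0.0, 1.0]).expand(N_VERTS, 3)
tex_in = torch.cat([uv, expr.expand(N_VERTS, EXPR), view_dir], dim=1)
vertex_colors = torch.sigmoid(texture_mlp(tex_in))          # expression/view-dependent texture
```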

Code Video link (url) DOI [BibTex]

2021


Dynamic Surface Function Networks for Clothed Human Bodies

Burov, A., Nießner, M., Thies, J.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages: 10734-10744, IEEE/CVF International Conference on Computer Vision (ICCV 2021), October 2021 (inproceedings)

Abstract
We present a novel method for temporal coherent reconstruction and tracking of clothed humans. Given a monocular RGB-D sequence, we learn a person-specific body model which is based on a dynamic surface function network. To this end, we explicitly model the surface of the person using a multi-layer perceptron (MLP) which is embedded into the canonical space of the SMPL body model. With classical forward rendering, the represented surface can be rasterized using the topology of a template mesh. For each surface point of the template mesh, the MLP is evaluated to predict the actual surface location. To handle pose-dependent deformations, the MLP is conditioned on the SMPL pose parameters. We show that this surface representation as well as the pose parameters can be learned in a self-supervised fashion using the principle of analysis-by-synthesis and differentiable rasterization. As a result, we are able to reconstruct a temporally coherent mesh sequence from the input data. The underlying surface representation can be used to synthesize new animations of the reconstructed person including pose-dependent deformations.
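
A minimal sketch of the surface function described above, with assumed shapes (not the paper's code): an MLP takes a canonical template point plus the SMPL pose parameters and predicts where that point lies on the clothed surface.

```python
# Pose-conditioned surface function evaluated at canonical template points.
import torch
import torch.nn as nn

POSE = 72                                            # SMPL pose parameters
surface_fn = nn.Sequential(nn.Linear(3 + POSE, 256), nn.ReLU(),
                           nn.Linear(256, 256), nn.ReLU(),
                           nn.Linear(256, 3))

template_points = torch.rand(6890, 3)                # canonical template surface points
pose = torch.randn(POSE)
inp = torch.cat([template_points, pose.expand(6890, POSE)], dim=1)
clothed_points = surface_fn(inp)                     # pose-dependent clothed surface locations
```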

link (url) DOI [BibTex]


Neural Parametric Models for 3D Deformable Shapes

Palafox, P., Bozic, A., Thies, J., Nießner, M., Dai, A.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), pages: 12675-12685, IEEE, IEEE/CVF International Conference on Computer Vision (ICCV 2021), October 2021 (inproceedings)

Abstract
Parametric 3D models have enabled a wide variety of tasks in computer graphics and vision, such as modeling human bodies, faces, and hands. However, the construction of these parametric models is often tedious, as it requires heavy manual tweaking, and they struggle to represent additional complexity and details such as wrinkles or clothing. To this end, we propose Neural Parametric Models (NPMs), a novel, learned alternative to traditional, parametric 3D models, which does not require hand-crafted, object-specific constraints. In particular, we learn to disentangle 4D dynamics into latent-space representations of shape and pose, leveraging the flexibility of recent developments in learned implicit functions. Crucially, once learned, our neural parametric models of shape and pose enable optimization over the learned spaces to fit to new observations, similar to the fitting of a traditional parametric model, e.g., SMPL. This enables NPMs to achieve a significantly more accurate and detailed representation of observed deformable sequences. We show that NPMs improve notably over both parametric and non-parametric state of the art in reconstruction and tracking of monocular depth sequences of clothed humans and hands. Latent-space interpolation as well as shape/pose transfer experiments further demonstrate the usefulness of NPMs.
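
A hedged sketch of the fitting stage, with placeholder decoders: learned shape and pose latent codes are optimized, with the decoders frozen, so that the decoded deformation explains a new observation, analogous to fitting a classical parametric model such as SMPL.

```python
# Fitting learned shape/pose latent codes of a neural parametric model to observed points.
import torch
import torch.nn as nn

S_DIM, P_DIM = 256, 128
shape_dec = nn.Sequential(nn.Linear(3 + S_DIM, 128), nn.ReLU(), nn.Linear(128, 1))          # SDF
pose_dec = nn.Sequential(nn.Linear(3 + S_DIM + P_DIM, 128), nn.ReLU(), nn.Linear(128, 3))   # flow
for p in list(shape_dec.parameters()) + list(pose_dec.parameters()):
    p.requires_grad = False                                    # decoders stay fixed at test time

s_code = torch.zeros(S_DIM, requires_grad=True)
p_code = torch.zeros(P_DIM, requires_grad=True)
opt = torch.optim.Adam([s_code, p_code], lr=1e-2)

obs_points = torch.rand(2048, 3)                               # points from a depth frame
for _ in range(100):
    inp = torch.cat([obs_points, s_code.expand(2048, S_DIM), p_code.expand(2048, P_DIM)], 1)
    flow = pose_dec(inp)                                       # observed -> canonical
    canon = torch.cat([obs_points + flow, s_code.expand(2048, S_DIM)], 1)
    loss = shape_dec(canon).abs().mean()                       # observed surface should lie at SDF = 0
    opt.zero_grad(); loss.backward(); opt.step()
```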

DOI [BibTex]


ID-Reveal: Identity-aware DeepFake Video Detection

Cozzolino, D., Rössler, A., Thies, J., Nießner, M., Verdoliva, L.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages: 15088-15097, IEEE/CVF International Conference on Computer Vision (ICCV 2021), October 2021 (inproceedings)

Abstract
State-of-the-art DeepFake forgery detectors are trained in a supervised fashion to answer the question ‘is this video real or fake?’. Given that their training is typically method-specific, these approaches show poor generalization across different types of facial manipulations, e.g., face swapping or facial reenactment. In this work, we look at the problem from a different perspective by focusing on the facial characteristics of a specific identity; i.e., we want to answer the question ‘Is this the person who is claimed to be?’. To this end, we introduce ID-Reveal, a new approach that learns temporal facial features, specific to how each person moves while talking, by means of metric learning coupled with an adversarial training strategy. Our method is independent of the specific type of manipulation since it is trained only on real videos. Moreover, relying on high-level semantic features, it is robust to widespread and disruptive forms of post-processing. We performed a thorough experimental analysis on several publicly available benchmarks, such as FaceForensics++, Google’s DFD, and Celeb-DF. Compared to the state of the art, our method improves generalization and is more robust to low-quality videos that are commonly spread over social networks. In particular, we obtain an average improvement of more than 15% in terms of accuracy for facial reenactment on highly compressed videos.
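
An illustrative sketch of the verification idea, with a placeholder temporal network and threshold (not ID-Reveal itself): temporal facial features from a test video are embedded and compared against embeddings of reference videos of the claimed identity, and a large distance flags a possible fake.

```python
# Identity-behaviour embedding of temporal facial features and distance-based verification.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, EMB = 64, 128
temporal_net = nn.GRU(input_size=FEAT, hidden_size=EMB, batch_first=True)

def embed(seq):                            # seq: (1, frames, FEAT) per-frame facial features
    _, h = temporal_net(seq)
    return F.normalize(h[-1], dim=-1)      # unit-length identity-behaviour embedding

reference = embed(torch.randn(1, 96, FEAT))        # real footage of the claimed person
test = embed(torch.randn(1, 96, FEAT))             # video under investigation
distance = 1.0 - (reference * test).sum()          # cosine distance
is_suspicious = bool(distance > 0.35)              # threshold would be set on validation data
```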

Paper Video Code link (url) DOI [BibTex]


RetrievalFuse: Neural 3D Scene Reconstruction with a Database

Siddiqui, Y., Thies, J., Ma, F., Shan, Q., Nießner, M., Dai, A.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages: 12548-12557, IEEE/CVF International Conference on Computer Vision (ICCV 2021), October 2021 (inproceedings)

Abstract
3D reconstruction of large scenes is a challenging problem due to the high-complexity nature of the solution space, in particular for generative neural networks. In contrast to traditional generative learned models which encode the full generative process into a neural network and can struggle with maintaining local details at the scene level, we introduce a new method that directly leverages scene geometry from the training database. First, we learn to synthesize an initial estimate for a 3D scene, constructed by retrieving a top-k set of volumetric chunks from the scene database. These candidates are then refined to a final scene generation with an attention-based refinement that can effectively select the most consistent set of geometry from the candidates and combine them together to create an output scene, facilitating transfer of coherent structures and local detail from train scene geometry. We demonstrate our neural scene reconstruction with a database for the tasks of 3D super-resolution and surface reconstruction from sparse point clouds, showing that our approach enables generation of more coherent, accurate 3D scenes, improving on average by over 8% in IoU over state-of-the-art scene reconstruction.
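
A minimal sketch of the retrieval step, with assumed feature and chunk shapes: each coarse scene chunk is matched against a database of training-scene chunks by feature similarity, and the top-k candidates are handed to the attention-based refinement described above.

```python
# Top-k retrieval of database chunks by cosine similarity of chunk features.
import torch
import torch.nn.functional as F

D = 128
database_feats = torch.randn(10000, D)           # features of training-scene chunks
database_geometry = torch.randn(10000, 8, 8, 8)  # their volumetric geometry

query_feats = torch.randn(64, D)                 # features of the coarse input chunks
qn = F.normalize(query_feats, dim=1)
dn = F.normalize(database_feats, dim=1)
sim = qn @ dn.T                                  # cosine similarity between chunk features
topk = sim.topk(k=4, dim=1).indices              # top-k database chunks per query chunk
candidates = database_geometry[topk]             # (64, 4, 8, 8, 8) for attention-based refinement
```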

DOI [BibTex]


TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

Bozic, A., Palafox, P., Thies, J., Dai, A., Nießner, M.

Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2, pages: 1403-1414, 35th Conference on Neural Information Processing Systems, 2021 (conference)

Abstract
We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture that enables the network to learn to attend to the most relevant image frames for each 3D location in the scene, supervised only by the scene reconstruction task. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed, requiring lower memory storage and enabling fusion at interactive rates. The feature grid is then decoded to a higher-resolution scene reconstruction, using an MLP-based surface occupancy prediction from interpolated coarse-to-fine 3D features. Our approach results in an accurate surface reconstruction, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.
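
A sketch with toy dimensions (not the paper's network): for each 3D grid location, features observed in different video frames are fused by attention with a learned query, and an MLP decodes the fused feature into surface occupancy.

```python
# Attention-based fusion of multi-frame features per 3D grid location, decoded to occupancy.
import torch
import torch.nn as nn

D, VIEWS, VOXELS = 64, 8, 512
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
fuse_query = nn.Parameter(torch.randn(1, 1, D))
occupancy_mlp = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

view_feats = torch.randn(VOXELS, VIEWS, D)              # per-voxel features from the video frames
query = fuse_query.expand(VOXELS, 1, D)
fused, _ = attn(query, view_feats, view_feats)          # learn which frames to attend to
occupancy = occupancy_mlp(fused.squeeze(1))             # (VOXELS, 1) in [0, 1]
```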

[BibTex]


Neural RGB-D Surface Reconstruction

Azinovic, D., Martin-Brualla, R., Goldman, D. B., Nießner, M., Thies, J.

ArXiv, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages: 6280-6291, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2021 (conference)

link (url) DOI [BibTex]