
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

Shenhan Qian¹         Tobias Kirschstein¹         Liam Schoneveld²         Davide Davoli³†

Simon Giebenhain¹         Matthias Nießner¹

¹Technical University of Munich         ²Woven by Toyota         ³Toyota Motor Europe NV/SA

†Associated partner by contracted services.
Abstract

We introduce GaussianAvatars (project page: https://shenhanqian.github.io/gaussian-avatars), a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying parametric model, e.g., through expression transfer from a driving sequence or by manually changing the morphable model parameters. We parameterize each splat by a local coordinate frame of a triangle and optimize for explicit displacement offset to obtain a more accurate geometric representation. During avatar reconstruction, we jointly optimize for the morphable model parameters and Gaussian splat parameters in an end-to-end fashion. We demonstrate the animation capabilities of our photorealistic avatar in several challenging scenarios. For instance, we show reenactments from a driving video, where our method outperforms existing works by a significant margin.

Figure 1: We introduce GaussianAvatars, a new method to create photorealistic head avatars from multi-view videos. Our avatar is represented by 3D Gaussian splats rigged to a parametric face model. We can fully control and animate our avatars in terms of pose, expression, and viewpoint, as shown in the example renderings above.

1 Introduction

Creating animatable avatars of human heads has been a longstanding problem in computer vision and graphics.

In particular, the ability to render photorealistic dynamic avatars from arbitrary viewpoints enables numerous applications in gaming, movie production, immersive telepresence, and augmented or virtual reality. For such applications, it is also crucial to be able to control the avatar, and for it to generalize well to novel poses and expressions.

Reconstructing a 3D representation able to jointly capture the appearance, geometry and dynamics of human heads represents a major challenge for high-fidelity avatar generation. The under-constrained nature of this reconstruction problem significantly complicates the task of achieving a representation that combines novel-view rendering photorealism with expression controllability. Moreover, extreme expressions and facial details, like wrinkles, the mouth interior, and hair, are difficult to capture, and can produce visual artifacts easily noticed by humans.

Neural Radiance Fields (NeRF [29]) and its variants [2, 6, 30] have shown impressive results in reconstructing static scenes from multi-view observations. Follow-up works have extended NeRF to model dynamic scenes for both arbitrary [31, 32, 22] and human-tailored scenarios [17, 12]. These works achieve impressive results for novel view rendering; however, they lack controllability and as such do not generalize well to novel poses and expressions.

The recent 3D Gaussian Splatting method [14] achieves even higher rendering quality than NeRF for novel-view synthesis with real-time performance, optimizing for discrete geometric primitives (3D Gaussians) throughout 3D space. This method has been extended to capture dynamic scenes by building explicit correspondences across time steps [28]; however, they do not allow for animations of the reconstructed outputs.

To this end, we propose GaussianAvatars, a method for dynamic 3D representation of human head based on 3D Gaussian splats that are rigged to a parametric morphable face model. Given a FLAME [21] mesh, we initialize a 3D Gaussian at the center of each triangle. When the FLAME mesh is animated, each Gaussian then translates, rotates, and scales according to its parent triangle. The 3D Gaussians then form a radiance field on top of the mesh, compensating for regions where the mesh is either not accurately aligned, or is incapable of reproducing certain visual elements. To achieve high fidelity of the reconstructed avatar, we introduce a binding inheritance strategy to support Gaussian splats without losing controllability. We also explore balancing fidelity and robustness to animate the avatars with novel expressions and poses. Our method outperforms existing works by a significant margin on both novel-view rendering and reenactment from a driving video.

Our contributions are as follows:

  • We propose GaussianAvatars, a method to create animatable head avatars by rigging 3D Gaussians to parametric mesh models.
  • We design a binding inheritance strategy to support adding and removing 3D Gaussians without losing controllability.

2 Related Work

2.1 Static Scene Representation

NeRF [29] stores the radiance field of a scene in a neural network and provides photorealistic renderings of novel views with volumetric rendering. Later work [49, 39] represents the scene as voxel grids, achieving comparable rendering quality in a shorter time. Efficiency can be further improved by employing compression methods such as voxel hashing [30] or tensor decomposition [6]. PointNeRF [45] uses point clouds as a scene representation, whereas 3D Gaussian Splatting [14] uses anisotropic 3D Gaussians that are rendered by sorting and rasterization, achieving superior visual quality with real-time rendering. Mixture of Volumetric Primitives [26] uses surface-aligned volumes to achieve fast rendering with high visual fidelity. Our method follows 3D Gaussian Splatting [14], benefiting from the expressiveness of anisotropic Gaussians.

2.2 Dynamic Scene Representation

A common paradigm to model a dynamic scene is directly storing it in an MLP with 4D coordinates as the input [9, 44] or a 4D tensor with the time dimension or spatial dimensions compressed [38, 7, 4, 1]. These methods can faithfully replay a dynamic scene and produce realistic novel-view rendering but lack an explicit handle to manipulate the content. Another paradigm learns a static canonical space and the maps time steps back to the canonical space via separate MLPs [35, 31, 32]. Instead of using a deformation MLP, a proxy geometry provides more direct controllability. Liu et al. [25] warp points from an observed space to the canonical space based on the movement of the nearest triangle on an SMPL [27] mesh. Peng et al. [34] deform points with the skeleton of SMPL and neural blending weights. Concurrent works create human body avatars with forward deformation [23, 20, 48, 13, 18] and cage-based deformation [54]. Unlike these methods, we directly attach 3D Gaussians to triangles and explicitly move them, obviating the need for a canonical space and enabling effective mesh finetuning.

2.3 Human Head Reconstruction and Animation

Head avatar creation advanced through the uptake of differentiable rendering and scene representations. Thies et al. [41] instigated a shift toward digital avatars with real-time face tracking and authentic face reenactment. Advancements in image synthesis with neural networks [42, 5] boosted the controllability of head avatars from lip-syncing to expression transfer and head motions [40, 15, 47]. Gafni et al. [8] learn a NeRF conditioned on an expression vector from monocular videos. Grassal et al. [10] subdivide and add offsets to FLAME [21] to enhance its geometry and enable a dynamic texture via an expression-dependent texture field. IMavatar [51] learns a 3D morphable head avatar with neural implicit functions, solving for a map from observed to canonical space via iterative root-finding. HeadNeRF [11] learns a NeRF-based parametric head model with 2D neural rendering for efficiency.

INSTA [55] deforms query points to a canonical space by finding the nearest triangle on a FLAME [21] mesh, and combines this with InstantNGP [30] to achieve fast rendering. Like INSTA, we also use triangles to warp the scene, but we build a consistent correspondence between a 3D Gaussian and a triangle instead of querying the nearest one for each timestep. Zheng et al. [52] explore point-based representations with differential point splatting. They define a point set in canonical space and learn a deformation field conditioned on FLAME’s expression vectors to animate the head avatar. While the scale of a point has to be manually set for their method, it is an optimizable parameter for 3D Gaussians. AvatarMAV [46] defines a canonical radiance field and a motion field, both with voxel grids. To model deformations, a set of voxel grid motion bases are blended according to an input 3DMM expression vector [3, 33].

3 Method

Figure 2: Overview. Our method binds 3D Gaussian splats to a FLAME [21] mesh locally. We take the tracked mesh for each frame and transform the splats from local to global space before rendering them with 3D Gaussian Splatting [14]. We optimize the splats in the local space by minimizing color loss on the rendering. We add and remove splats adaptively with their binding relation to triangles inherited so that all splats remain rigged throughout the optimization procedure. Further, we regularize the position and scaling of 3D Gaussian splats to suppress artifacts during animation.

As shown in Fig. 2, the input to our method is a multi-view video recording of a human head. For each time step, we use a photometric head tracker based on [41] to fit FLAME [21] parameters with multi-view observations and known camera parameters. The FLAME meshes have vertices at varied positions but share the same topology. Therefore, we can build a consistent connection between triangles of the mesh and 3D Gaussian splats (Sec. 3.2). The splats are rendered into images via a differentiable tile rasterizer [14]. These images are then supervised by the ground-truth images towards learning a photorealistic human head avatar. As per static scenes, it is also necessary to densify and prune Gaussian splats for optimal quality, with a set of adaptive density control operations [14]. To achieve this without breaking the connection between triangles and splats, we design a binding inheritance strategy (Sec. 3.3) so that new Gaussian points remain rigged to the FLAME mesh. In addition to the color loss, we also find it crucial to regularize the local position and scaling of Gaussian splats to avoid quality degradation under novel expressions and poses (Sec. 3.4).
如图 2 所示,我们方法的输入是人头的多视图视频记录。对于每个时间步长,我们使用基于[41]的光度头部跟踪器来将FLAME [21]参数与多视图观察和已知相机参数相匹配。 FLAME 网格的顶点位于不同的位置,但共享相同的拓扑。因此,我们可以在网格三角形和 3D 高斯图之间建立一致的连接(第 3.2 节)。通过可微分的图块光栅器将图块渲染成图像[14]。然后,这些图像由真实图像监督,以学习逼真的人体头部头像。根据静态场景,还需要通过一组自适应密度控制操作来致密和修剪高斯图以获得最佳质量[14]。为了在不破坏三角形和图块之间的连接的情况下实现这一点,我们设计了一种绑定继承策略(第 3.3 节),以便新的高斯点保持与 FLAME 网格绑定。除了颜色损失之外,我们还发现规范高斯图块的局部位置和缩放以避免新表情和姿势下的质量下降至关重要(第 3.4 节)。

3.1 Preliminary

3D Gaussian Splatting [14] provides a solution to reconstruct a static scene with anisotropic 3D Gaussians, given images and camera parameters. A scene is represented by a set of Gaussian splats, each defined by a covariance matrix $\Sigma$ centered at a point (mean) $\bm{\mu}$:

    G(\bm{x}) = e^{-\frac{1}{2}(\bm{x}-\bm{\mu})^{T}\Sigma^{-1}(\bm{x}-\bm{\mu})}    (1)

Note that covariance matrices have physical meaning only when they are positive semi-definite, which cannot be guaranteed during gradient-descent optimization. Therefore, Kerbl et al. [14] first define a parametric ellipsoid with a scaling matrix $S$ and a rotation matrix $R$, then construct the covariance matrix by:

    \Sigma = R S S^{T} R^{T}.    (2)

In practice, an ellipsoid is stored as a position vector $\bm{\mu} \in \mathbb{R}^{3}$, a scaling vector $\bm{s} \in \mathbb{R}^{3}$, and a quaternion $\bm{q} \in \mathbb{R}^{4}$. In this paper, we use $\bm{r} \in \mathbb{R}^{3\times 3}$ to denote the rotation matrix corresponding to $\bm{q}$.

For rendering, the color $\bm{C}$ of a pixel is computed by blending all 3D Gaussians overlapping the pixel:

    \bm{C} = \sum_{i=1}^{N} \bm{c}_{i}\,\alpha_{i}^{\prime} \prod_{j=1}^{i-1} (1-\alpha_{j}^{\prime}),    (3)

where $\bm{c}_{i}$ is the color of each point, modeled by third-degree spherical harmonics. The blending weight $\alpha_{i}^{\prime}$ is given by evaluating the 2D projection of the 3D Gaussian, multiplied by a per-point opacity $\alpha_{i}$. The Gaussian splats are sorted by depth before blending to respect visibility order.
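As a concrete illustration of Eq. (2), the covariance of a splat can be assembled from its stored scaling vector and quaternion roughly as follows. This is a minimal PyTorch sketch of our own, not the authors' released code:

```python
import torch

def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """Convert unit quaternions (N, 4), ordered (w, x, y, z), to rotation matrices (N, 3, 3)."""
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)

def build_covariance(scale: torch.Tensor, quat: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T (Eq. 2); positive semi-definite by construction."""
    R = quaternion_to_rotation_matrix(torch.nn.functional.normalize(quat, dim=-1))
    S = torch.diag_embed(scale)          # (N, 3, 3) diagonal scaling matrices
    M = R @ S
    return M @ M.transpose(-1, -2)       # R S S^T R^T
```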

3.2 3D Gaussian Rigging

The key component of our method is how we build connections between the FLAME [21] mesh and 3D Gaussian splats. Initially, we pair each triangle of the mesh with a 3D Gaussian and let the 3D Gaussian move with the triangle across time steps. In other words, the 3D Gaussian is static in the local space of its parent triangle but dynamic in the global (metric) space as the triangle moves. Given the three vertices of a triangle, we take their mean position $\bm{T}$ as the origin of the local space. Then, we concatenate the direction vector of one edge, the normal vector of the triangle, and their cross product as column vectors to form a rotation matrix $\bm{R}$, which describes the orientation of the triangle in global space. We also compute a scalar $k$ as the mean of the length of one edge and of its perpendicular within the triangle, to describe the triangle's scale.
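The per-triangle frames could be computed along the following lines. This is our own illustration; the exact edge choice and the definition of k are one plausible reading of the description above:

```python
import torch
import torch.nn.functional as F

def triangle_frames(verts: torch.Tensor, faces: torch.Tensor):
    """Per-triangle local frames: origin T (F, 3), rotation R (F, 3, 3), scale k (F,)."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    T = (v0 + v1 + v2) / 3.0                                     # triangle center = local origin

    e = F.normalize(v1 - v0, dim=-1)                             # direction of one edge
    n = F.normalize(torch.cross(v1 - v0, v2 - v0, dim=-1), dim=-1)  # triangle normal
    R = torch.stack([e, n, torch.cross(e, n, dim=-1)], dim=-1)   # columns: edge, normal, cross product

    # Triangle scale: mean of the edge length and the perpendicular extent of the
    # third vertex with respect to that edge.
    edge_len = (v1 - v0).norm(dim=-1)
    perp = (v2 - v0) - ((v2 - v0) * e).sum(dim=-1, keepdim=True) * e
    k = 0.5 * (edge_len + perp.norm(dim=-1))
    return T, R, k
```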

For the paired 3D Gaussian of a triangle, we define its location $\bm{\mu}$, rotation $\bm{r}$, and anisotropic scaling $\bm{s}$ all in the local space. We initialize the location $\bm{\mu}$ at the local origin, the rotation $\bm{r}$ as an identity rotation matrix, and the scaling $\bm{s}$ as a unit vector. At rendering time, we convert these properties into the global space by:

    \bm{r}^{\prime} = \bm{R}\bm{r},    (4)
    \bm{\mu}^{\prime} = k\bm{R}\bm{\mu} + \bm{T},    (5)
    \bm{s}^{\prime} = k\bm{s}.    (6)

We incorporate the triangle scaling $k$ into this process (Eqs. 5 and 6) so that the local position and scaling of a 3D Gaussian are defined relative to the absolute scale of its parent triangle. This enables an adaptive step size in the metric space with a constant learning rate for parameters defined in the local space. For example, in a single iteration step, a 3D Gaussian paired with a small triangle moves less than one paired with a large triangle. This also makes it easier to interpret the parameters in terms of distance from the triangle center.
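Eqs. (4)-(6) then amount to a simple gather-and-transform per splat. A minimal sketch, assuming the per-triangle frames from the previous snippet and a `parent` index per splat (the parent-triangle index is discussed further in Sec. 3.3; all names are our own):

```python
def local_to_global(mu, r, s, T, R, k, parent):
    """Map local splat parameters to global space (Eqs. 4-6).

    mu: (N, 3) local positions, r: (N, 3, 3) local rotations, s: (N, 3) local scalings.
    T: (F, 3), R: (F, 3, 3), k: (F,) per-triangle frames; parent: (N,) triangle index per splat.
    """
    R_p, T_p, k_p = R[parent], T[parent], k[parent]                      # gather each splat's parent frame
    r_global = R_p @ r                                                   # Eq. (4)
    mu_global = k_p[:, None] * (R_p @ mu[..., None]).squeeze(-1) + T_p   # Eq. (5)
    s_global = k_p[:, None] * s                                          # Eq. (6)
    return mu_global, r_global, s_global
```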

3.3 Binding Inheritance

Having only as many Gaussian splats as triangles is insufficient to capture fine details. For instance, representing a curved hair strand requires multiple splats, while a triangle on the scalp may intersect with several strands. Therefore, we also need the adaptive density control strategy [14], which adds and removes splats based on the view-space positional gradient and the opacity of each Gaussian.

For each 3D Gaussian with a large view-space positional gradient, we split it into two smaller ones if it is large or clone it if it is small. We conduct this in the local space and ensure a newly created Gaussian is close to the old one that triggers this densification operation. Then, it is reasonable to bind a new 3D Gaussian to the same triangle as the old one because it was created to enhance the fidelity of the local region. Therefore, each 3D Gaussian must carry one more parameter, the index of its parent triangle, to enable binding inheritance during densification.

Besides densification, we also use the pruning operation as a part of the adaptive density control strategy [14]. It periodically resets the opacity of all splats close to zero and removes points with opacity below a threshold. This technique is effective in suppressing floating artifacts; however, such pruning can also cause problems in a dynamic scene. For instance, regions of the face that are often occluded (such as eyeball triangles) can be overly sensitive to this pruning strategy, and often end up with few or no attached Gaussians. To prevent this, we keep track of the number of splats attached to each triangle and ensure that every triangle always has at least one splat attached.
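The following sketch illustrates how densification could carry the parent-triangle index along and how the pruning guard might be enforced. It is our own simplified illustration, not the released implementation; the thresholds and the split perturbation are assumptions:

```python
import torch

def densify_with_binding(mu, s, parent, grad_norm, grad_thresh=2e-4, size_thresh=0.01):
    """Clone small / split large high-gradient splats; children inherit the parent triangle index.
    (A full implementation would also shrink or replace the original splat when splitting.)"""
    selected = grad_norm > grad_thresh
    big = selected & (s.max(dim=-1).values > size_thresh)
    small = selected & ~big

    # Clone: duplicate small splats in place (local space), same parent triangle.
    mu_new, s_new, parent_new = [mu[small]], [s[small]], [parent[small]]

    # Split: spawn a perturbed, smaller copy of large splats, same parent triangle.
    mu_new.append(mu[big] + 0.1 * s[big] * torch.randn_like(mu[big]))
    s_new.append(s[big] / 1.6)
    parent_new.append(parent[big])

    return torch.cat([mu] + mu_new), torch.cat([s] + s_new), torch.cat([parent] + parent_new)

def prune_mask(opacity, parent, num_triangles, opacity_thresh=0.005):
    """Remove low-opacity splats, but never the last splat attached to a triangle."""
    keep = opacity.squeeze(-1) >= opacity_thresh
    counts = torch.bincount(parent[keep], minlength=num_triangles)
    for tri in torch.nonzero(counts == 0).flatten():
        candidates = torch.nonzero(parent == tri).flatten()
        if len(candidates) > 0:
            keep[candidates[0]] = True   # re-protect one splat for an otherwise empty triangle
    return keep
```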

3.4 Optimization and Regularization

We supervise the rendered images with a combination of an $\mathcal{L}_{1}$ term and a D-SSIM term, following [14]:

    \mathcal{L}_{\text{rgb}} = (1-\lambda)\,\mathcal{L}_{1} + \lambda\,\mathcal{L}_{\text{D-SSIM}},    (7)

with $\lambda = 0.2$. This already results in good re-rendering quality without additional supervision, such as depth or silhouette supervision, thanks to the powerful tile rasterizer [14]. We found, however, that if we animate these splats via FLAME [21] to novel expressions and poses, large spike- and blob-like artifacts appear throughout the scene. This is due to poor alignment between the Gaussian splats and the triangles.

Position loss with threshold. A basic hypothesis behind 3D Gaussian rigging is that the Gaussian splats should roughly match the underlying mesh. They should also be attached at the right locations; for instance, a Gaussian representing a spot on the nose should not be rigged to a triangle on the cheek. Although our splats are initialized at triangle centers and new splats are added near existing ones, it is not guaranteed that the primitives remain close to their parent triangles after optimization. To address this, we regularize the local position of each Gaussian by:

    \mathcal{L}_{\text{position}} = \|\max(\bm{\mu}, \epsilon_{\text{position}})\|_{2},    (8)

where $\epsilon_{\text{position}} = 1$ is a threshold that tolerates small errors within the scale of its parent triangle.

Scaling loss with threshold. Aside from position, the scaling of 3D Gaussians is even more essential for the visual quality during animation. Specifically, if a 3D Gaussian is large in comparison to its parent triangle, small rotations of the triangle – barely noticeable at the scale of the triangle – will be magnified by the scale of the 3D Gaussian, resulting in unpleasant jittering artifacts. To mitigate this, we also regularize the local scale of each 3D Gaussian by:

    \mathcal{L}_{\text{scaling}} = \|\max(\bm{s}, \epsilon_{\text{scaling}})\|_{2},    (9)

where $\epsilon_{\text{scaling}} = 0.6$ is a threshold that disables this loss term when the scale of a Gaussian is less than $0.6\times$ the scale of its parent triangle. This $\epsilon_{\text{scaling}}$-tolerance is indispensable; without it, the Gaussian splats shrink excessively, causing rendering speed to deteriorate, as camera rays need to hit more splats before zero transmittance is reached.

Our final loss function is thus:

    \mathcal{L} = \mathcal{L}_{\text{rgb}} + \lambda_{\text{position}}\,\mathcal{L}_{\text{position}} + \lambda_{\text{scaling}}\,\mathcal{L}_{\text{scaling}},    (10)

where $\lambda_{\text{position}} = 0.01$ and $\lambda_{\text{scaling}} = 1$. Note that we only apply $\mathcal{L}_{\text{position}}$ and $\mathcal{L}_{\text{scaling}}$ to visible splats. Thereby, we only regularize points when the color loss $\mathcal{L}_{\text{rgb}}$ is present. This helps to maintain the learned structure of often-occluded regions such as teeth and eyeballs.
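A compact sketch of Eqs. (7)-(10) as we read them, purely for illustration: the losses are averaged over visible splats (an assumption, since the aggregation is not spelled out above), the position threshold is applied to the component magnitudes so the tolerance is symmetric (also our assumption), and `dssim` stands in for a D-SSIM implementation.

```python
import torch

LAMBDA_DSSIM, LAMBDA_POS, LAMBDA_SCALE = 0.2, 0.01, 1.0
EPS_POS, EPS_SCALE = 1.0, 0.6

def total_loss(render, gt, mu_local, s_local, visible, dssim):
    """L = L_rgb + lambda_pos * L_position + lambda_scale * L_scaling (Eqs. 7-10)."""
    l1 = (render - gt).abs().mean()
    l_rgb = (1 - LAMBDA_DSSIM) * l1 + LAMBDA_DSSIM * dssim(render, gt)

    # Regularize only splats that were visible in this rendering.
    mu_v, s_v = mu_local[visible], s_local[visible]
    l_pos = torch.clamp(mu_v.abs(), min=EPS_POS).norm(dim=-1).mean()      # Eq. (8)
    l_scale = torch.clamp(s_v, min=EPS_SCALE).norm(dim=-1).mean()         # Eq. (9)

    return l_rgb + LAMBDA_POS * l_pos + LAMBDA_SCALE * l_scale            # Eq. (10)
```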

Implementation details. We use Adam [16] for parameter optimization (the same hyperparameter values are used across all subjects). We set the learning rate to 5e-3 for the position and 1.7e-2 for the scaling of 3D Gaussians, and keep the same learning rates as 3D Gaussian Splatting [14] for the remaining parameters. Alongside the Gaussian splat parameters, we also fine-tune the translation, joint rotation, and expression parameters of FLAME [21] for each timestep, using learning rates of 1e-6, 1e-5, and 1e-3, respectively. We train for 600,000 iterations and exponentially decay the learning rate of the splat positions until the final iteration, where it reaches $0.01\times$ the initial value. We enable adaptive density control with binding inheritance every 2,000 iterations, from iteration 10,000 until the end. Every 60,000 iterations, we reset the Gaussians' opacities. We use a photometric head tracker to obtain the FLAME parameters, including shape $\bm{\beta}$, translation $\bm{t}$, pose $\bm{\theta}$, expression $\bm{\psi}$, and vertex offset $\Delta\bm{v}$ in the canonical space. We also manually add triangles for the upper and lower teeth, which are rigid to the neck and jaw joints, respectively.
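The per-parameter learning rates listed above could be wired up as Adam parameter groups along these lines. This is a sketch with assumed tensor names; the remaining splat parameters (rotation, opacity, spherical harmonics) would keep the 3D Gaussian Splatting defaults:

```python
import torch

def build_optimizer(splats: dict, flame: dict) -> torch.optim.Adam:
    """Adam with per-parameter-group learning rates as described above (sketch)."""
    return torch.optim.Adam([
        {"params": [splats["local_position"]], "lr": 5e-3},    # 3D Gaussian positions (local space)
        {"params": [splats["local_scaling"]],  "lr": 1.7e-2},  # 3D Gaussian scalings (local space)
        {"params": [flame["translation"]],     "lr": 1e-6},    # FLAME per-frame translation
        {"params": [flame["joint_rotation"]],  "lr": 1e-5},    # FLAME per-frame joint rotations
        {"params": [flame["expression"]],      "lr": 1e-3},    # FLAME per-frame expressions
    ])
```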

4 Experiments

Figure 3: Qualitative comparison on novel-view synthesis and self-reenactment of head avatars. Our method outperforms state-of-the-art methods by producing significantly sharper rendering outputs. We obtain precise reconstruction of details such as light reflections on the eyes, hair strands, teeth, etc. Our results for self-reenactment show more accurate expressions compared to the baselines.
Figure 4: Cross-identity reenactment of head avatars. We use the tracked FLAME expression and pose parameters of source actors to drive the reconstructed avatars. Our method produces high-quality rendering and transfers expressions vividly, while baseline methods suffer from artifacts and generalize poorly to novel expressions.

4.1 Setup

Dataset. We conduct experiments on video recordings of 9 subjects from the NeRSemble [17] dataset. All recordings contain 16 views covering the front and sides of the subject. We take 11 video sequences for each subject, each containing around 200 time steps. We downsample the images to a resolution of $802\times 550$. Participants were instructed to perform specific expressions in 10 of the sequences and to perform freely in the remaining one.

We train our method, plus baselines, using 9 out of 10 expression sequences and 15 out of 16 available cameras. This allows us to quantitatively evaluate both novel-expression and novel-view synthesis. We use the 11th (free-performance) sequence to visually assess cross-identity reenactment capabilities.

Evaluation. We evaluate the quality of head avatars in three settings: 1) novel-view synthesis: driving an avatar with head poses and expressions from training sequences and rendering from a held-out viewpoint; 2) self-reenactment: driving an avatar with unseen poses and expressions from a held-out sequence of the same subject and rendering all 16 camera views; and 3) cross-identity reenactment: animating an avatar with unseen poses and expressions from another identity.

Baselines. We compare our method with three state-of-the-art methods for head avatar creation. INSTA [55] directly warps points according to the nearest FLAME [21] mesh triangle. It also adds triangles to the mouth and conditions radiance-field queries in the mouth region on the expression code of FLAME to improve the quality of the mouth interior. PointAvatar [52] uses a point-based representation, which is closely related to 3D Gaussians. It does not directly rely on the FLAME surface but uses its pose and expression parameters to condition a deformation field. During optimization, it applies a coarse-to-fine strategy to progressively increase the size of the point cloud and decrease the radius of each point. AvatarMAV [46] uses voxel grids for both a canonical radiance field and a set of bases of a motion field. It models deformation by blending the motion bases according to the tracked expression vectors of a 3D morphable model [33].

4.2 Head Avatar Reconstruction and Animation

We evaluate the reconstruction quality of avatars by novel-view synthesis and their animation fidelity by self-reenactment. Fig. 3 shows qualitative comparisons. For novel-view synthesis, all methods produce plausible rendering results. A close inspection of PointAvatar's [52] results shows dotted artifacts, owing to its fixed point size. In our case, the anisotropic scaling of 3D Gaussians alleviates this issue.

INSTA [55] shows clean results for the face; however, regions around the neck and shoulders can be noisy, as the tracked FLAME meshes are often misaligned in those regions. This causes issues for INSTA, as its warping process is based on the nearest triangle. In our method, each Gaussian splat is rigged to a consistent triangle, regardless of the pose or expression. When the tracked mesh is inaccurate, the positional gradient of a Gaussian splat consistently back-propagates to the same triangle. This allows misalignments due to incorrect FLAME tracking to be corrected while the 3D Gaussians are being optimized.

AvatarMAV [46] shows comparable quality to other methods in novel-view synthesis but struggles with novel-expression synthesis. This is because it only uses the expression vector of a 3DMM as conditioning. Since the mapping from expression vectors to deformation bases must be learned, it struggles to reproduce expressions that are far from the training distribution. Similar conclusions can be drawn from the quantitative comparison in Tab. 1. Our approach outperforms the others by a large margin on the novel-view synthesis metrics. Our method also stands out in self-reenactment, with significantly lower perceptual differences in terms of LPIPS [50]. Note that self-reenactment is based on tracked FLAME meshes that may not perfectly align with the target images, which puts our results, with their richer visual detail, at a disadvantage on pixel-wise metrics such as PSNR.

For a real-world test of avatar animation, we conduct experiments on cross-identity reenactment in Fig. 4. Our avatars accurately reproduce eye blinks and mouth movements from source actors, showing lively, complex dynamics such as wrinkles. INSTA [55] suffers from aliasing artifacts when the avatars move beyond the occupancy grid of I-NGP [30] optimized for the training sequences. The motion in PointAvatar's [52] results is imprecise because its deformation space is not guaranteed to be consistent with FLAME. AvatarMAV [46] exhibits large degradations in reenactment due to a lack of deformation priors.

                    Novel-View Synthesis        Self-Reenactment
                    PSNR↑   SSIM↑   LPIPS↓      PSNR↑   SSIM↑   LPIPS↓
AvatarMAV [46]      29.5    0.913   0.152       24.3    0.887   0.168
PointAvatar [52]    25.8    0.893   0.097       23.4    0.884   0.102
INSTA [55]          26.7    0.899   0.122       26.3    0.906   0.110
Ours                31.6    0.938   0.065       26.0    0.910   0.076
Table 1: Quantitative comparison with state-of-the-art methods. Green indicates the best and yellow indicates the second.

4.3 Ablation Study

To validate the effectiveness of our method components, we deactivate each of them and report results in Tab. 2.

Adaptive density control with binding inheritance. Without binding inheritance, we lose the ability to add and remove Gaussian splats beyond the initial ones. With a limited number of splats, the scale of each splat increases to occupy the same space, causing blurry renderings. This leads to a substantial loss of fidelity, as shown in the second row of Tab. 2.

Regularization on the scaling of Gaussian splats. Without the scaling loss, the image quality drops slightly (the third row in Tab. 2). But when we use the scaling loss without a small error tolerance, all metrics deteriorate drastically (the fourth row in Tab. 2). This is because all the splats are regularized to be infinitely small, making it hard to form opaque surfaces that handle occlusions in a scene.

Regularization on the position of Gaussian splats. When turning off the threshold for the position loss, we get slightly worse metrics (the sixth row in Tab. 2). But when turning off the position loss entirely, the metrics on novel-view synthesis become the best in the table (the fifth row in Tab. 2). This is reasonable because Gaussian splats can then move freely in space to minimize the re-rendering error. However, the resulting distribution of Gaussian splats is only optimal for the training frames. The avatar shows artifacts such as cracks and fly-around blobs (Fig. 5) when animated with novel expressions and poses. Therefore, the position loss is necessary to enforce a conservative deviation of splats from the mesh surface so that we can animate the reconstructed avatar with unseen facial motion.

FLAME fine-tuning. Thanks to our consistent binding between Gaussian splats and triangles, we can effectively fine-tune FLAME parameters while optimizing Gaussian splats. As shown in the last row in Tab. 2, turning off the fine-tuning negatively affects image quality for novel-view synthesis. For self-reenactment, fine-tuning FLAME parameters also leads to lower perceptual error.

                    Novel-View Synthesis        Self-Reenactment
                    PSNR↑   SSIM↑   LPIPS↓      PSNR↑   SSIM↑   LPIPS↓
Ours                28.8    0.883   0.098       25.1    0.853   0.101
w/o ADC             26.8    0.854   0.206       25.1    0.860   0.183
w/o L_scaling       28.0    0.877   0.114       24.9    0.852   0.109
w/o ε_scaling       25.0    0.833   0.195       24.1    0.843   0.176
w/o L_position      29.7    0.894   0.091       24.9    0.851   0.096
w/o ε_position      28.7    0.882   0.105       25.0    0.855   0.106
w/o FLAME ft.       26.1    0.855   0.131       25.5    0.862   0.124
Table 2: Ablation study on subject #304. Green indicates the best and yellow indicates the second. “ADC” refers to Adaptive Density Control with binding inheritance. ‘FLAME ft.’ refers to FLAME parameter fine-tuning.
Figure 5: The position loss helps prevent artifacts during animation with novel expressions and poses.

5 Limitations

Our approach, based on 3D Gaussian splats, directly captures the radiance field without decoupling material and lighting. Hence, relighting the avatar is not feasible within our current approach. However, this would be a requirement for use in production pipelines where an avatar would be embedded in a larger scene environment. Additionally, we lack control over areas that are not modeled by FLAME, such as hair and other accessories. Here, we believe that modeling these parts directly would provide an interesting future research avenue; for instance, recent methods show promising results in reconstructing hair [36, 43].

6 Conclusion

GaussianAvatars is a novel approach that creates photorealistic avatars from video sequences. It features a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This enables flexible control and precise attribute transfer of a dynamic, photorealistic avatar. The set of Gaussian splats can deviate from the mesh surface to compensate for the absence or inaccuracy of the morphable model, exhibiting a remarkable ability to model fine details on human heads. Our approach outperforms state-of-the-art methods on image quality and expression accuracy by a large margin, indicating strong potential for further applications in related domains.

Acknowledgements

This work was supported by Toyota Motor Europe and Woven by Toyota. This work was also supported by the ERC Starting Grant Scan2CAD (804724). We thank Justus Thies, Andrei Burov, and Lei Li for constructive discussions and Angela Dai for the video voice-over.

References

  • Attal et al. [2023] Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O’Toole, and Changil Kim. Hyperreel: High-fidelity 6-dof video with ray-conditioned sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16610–16620, 2023.
  • Barron et al. [2021] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
  • Blanz and Vetter [2003] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3d morphable model. IEEE Transactions on pattern analysis and machine intelligence, 25(9):1063–1074, 2003.
  • Cao and Johnson [2023] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
  • Chan et al. [2019] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5933–5942, 2019.
  • Chen et al. [2022] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.
  • Fridovich-Keil et al. [2023] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
  • Gafni et al. [2021] Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8649–8658, 2021.
  • Gao et al. [2021] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5712–5721, 2021.
  • Grassal et al. [2022] Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. Neural head avatars from monocular rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18653–18664, 2022.
  • Hong et al. [2022] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374–20384, 2022.
  • Işık et al. [2023] Mustafa Işık, Martin Rünz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, and Matthias Nießner. Humanrf: High-fidelity neural radiance fields for humans in motion. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.
  • Jena et al. [2023] Rohit Jena, Ganesh Subramanian Iyer, Siddharth Choudhary, Brandon Smith, Pratik Chaudhari, and James Gee. Splatarmor: Articulated gaussian splatting for animatable humans from monocular rgb videos. arXiv preprint arXiv:2311.10812, 2023.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
  • Kim et al. [2018] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. ACM transactions on graphics (TOG), 37(4):1–14, 2018.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirschstein et al. [2023] Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM Trans. Graph., 42(4), 2023.
  • Kocabas et al. [2023] Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. Hugs: Human gaussian splats. arXiv preprint arXiv:2311.17910, 2023.
  • Laine et al. [2020] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 2020.
  • Lei et al. [2023] Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. Gart: Gaussian articulated template models. arXiv preprint arXiv:2311.16099, 2023.
  • Li et al. [2017] Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017.
  • Li et al. [2022] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
  • Li et al. [2023] Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. arXiv, 2023.
  • Lin et al. [2021] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8762–8771, 2021.
  • Liu et al. [2021] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. ACM transactions on graphics (TOG), 40(6):1–16, 2021.
  • Lombardi et al. [2021] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics (ToG), 40(4):1–13, 2021.
  • Loper et al. [2023] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.
  • Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  • Park et al. [2021a] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021a.
  • Park et al. [2021b] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021b.
  • Paysan et al. [2009] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3d face model for pose and illumination invariant face recognition. In 2009 sixth IEEE international conference on advanced video and signal based surveillance, pages 296–301. Ieee, 2009.
  • Peng et al. [2021] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14314–14323, 2021.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
  • Rosu et al. [2022] Radu Alexandru Rosu, Shunsuke Saito, Ziyan Wang, Chenglei Wu, Sven Behnke, and Giljoo Nam. Neural strands: Learning hair geometry and appearance from multi-view images. In European Conference on Computer Vision, pages 73–89. Springer, 2022.
  • Sagonas et al. [2016] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and vision computing, 47:3–18, 2016.
  • Song et al. [2023] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 29(5):2732–2742, 2023.
  • Sun et al. [2022] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
  • Suwajanakorn et al. [2017] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
  • Thies et al. [2016] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2387–2395, 2016.
  • Thies et al. [2019] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. Acm Transactions on Graphics (TOG), 38(4):1–12, 2019.
  • Wang et al. [2023] Ziyan Wang, Giljoo Nam, Tuur Stuyck, Stephen Lombardi, Chen Cao, Jason Saragih, Michael Zollhöfer, Jessica Hodgins, and Christoph Lassner. Neuwigs: A neural dynamic model for volumetric hair capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8641–8651, 2023.
  • Xian et al. [2021] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9421–9431, 2021.
  • Xu et al. [2022] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022.
  • Xu et al. [2023a] Yuelang Xu, Lizhen Wang, Xiaochen Zhao, Hongwen Zhang, and Yebin Liu. Avatarmav: Fast 3d head avatar reconstruction using motion-aware neural voxels. In ACM SIGGRAPH 2023 Conference Proceedings, 2023a.
  • Xu et al. [2023b] Yuelang Xu, Hongwen Zhang, Lizhen Wang, Xiaochen Zhao, Huang Han, Qi Guojun, and Yebin Liu. Latentavatar: Learning latent expression code for expressive neural head avatar. In ACM SIGGRAPH 2023 Conference Proceedings, 2023b.
  • Ye et al. [2023] Keyang Ye, Tianjia Shao, and Kun Zhou. Animatable 3d gaussians for high-fidelity synthesis of human motions. arXiv preprint arXiv:2311.13404, 2023.
  • Yu et al. [2021] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4578–4587, 2021.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zheng et al. [2022] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, Xu Chen, Michael J Black, and Otmar Hilliges. Im avatar: Implicit morphable head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13545–13555, 2022.
  • Zheng et al. [2023] Yufeng Zheng, Wang Yifan, Gordon Wetzstein, Michael J Black, and Otmar Hilliges. Pointavatar: Deformable point-based head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21057–21067, 2023.
  • Zhou et al. [2023] Zhenglin Zhou, Huaxia Li, Hong Liu, Nanyang Wang, Gang Yu, and Rongrong Ji. Star loss: Reducing semantic ambiguity in facial landmark detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15475–15484, 2023.
  • Zielonka et al. [2023a] Wojciech Zielonka, Timur Bagautdinov, Shunsuke Saito, Michael Zollhöfer, Justus Thies, and Javier Romero. Drivable 3d gaussian avatars. 2023a.
  • Zielonka et al. [2023b] Wojciech Zielonka, Timo Bolkart, and Justus Thies. Instant volumetric head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4574–4584, 2023b.

Appendix A FLAME Tracking

For FLAME [21] tracking, we optimize for both per-frame parameters (translation $\bm{t}_{i}$, joint poses $\bm{\theta}_{i}$, expression $\bm{\psi}_{i}$) and shared parameters (shape $\bm{\beta}$, vertex offset $\Delta\bm{v}$, and an albedo map $A$). Our optimization combines a landmark loss, a color loss, and regularization terms.

We use a state-of-the-art facial landmark detector [53] to obtain 68 facial landmarks in the 300-W [37] format. Among them, we exclude the 17 facial-contour landmarks to avoid inconsistency caused by occlusion. We use NVDiffRast [19] to render FLAME meshes and obtain gradients of vertex positions with respect to the color loss via texel interpolation in the interior and anti-aliasing on the boundary. For regularization, we apply a Laplacian smoothness term on the vertex offsets and temporal smoothness terms on the per-frame parameters.

We optimize all the parameters on the first time step of the video sequence until convergence, then optimize per-frame parameters for 50 iterations for each following time step with the previous one as initialization. Afterward, we conduct global optimization for 30 epochs by randomly sampling time steps to fine-tune all parameters.
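This tracking schedule can be summarized as pseudocode. The snippet is purely illustrative; `tracker` and its methods are hypothetical stand-ins for the photometric optimization described above:

```python
import random

def track_sequence(frames, tracker, epochs_global=30, iters_per_frame=50):
    """Tracking schedule described above (sketch with hypothetical helpers)."""
    # 1) Fit all parameters (shared + per-frame) on the first time step until convergence.
    tracker.optimize_until_convergence(frames[0], optimize_shared=True)

    # 2) For each following time step, initialize from the previous frame's parameters
    #    and refine only the per-frame parameters for a fixed number of iterations.
    for t in range(1, len(frames)):
        tracker.init_frame_from(t, t - 1)
        for _ in range(iters_per_frame):
            tracker.step(frames[t], optimize_shared=False)

    # 3) Global fine-tuning: randomly sample time steps and update all parameters.
    for _ in range(epochs_global):
        for t in random.sample(range(len(frames)), len(frames)):
            tracker.step(frames[t], optimize_shared=True)
```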

We use the 2023 version of FLAME [21] for the revised eye regions. Furthermore, we manually add 168 triangles for teeth to the template mesh of FLAME and make the upper and lower teeth triangles rigid to the neck and jaw joints, respectively. This improves the fidelity of our avatar as shown in Fig. 6.

Figure 6: Adding triangles that move rigidly with the head and the jaw helps Gaussian splats to capture teeth details.

Appendix B Data Pre-processing

For our experiments we use ten sequences per subject, as shown in Tab. 3. We use nine of them for training, while one is reserved for self-reenactment evaluation (Tab. 4).

Sequence Type    Emotions    Expressions
Sequence ID      1 2 3 4     2 3 4 5 8 9
Table 3: The type and ID of sequences for novel-view synthesis and self-reenactment.

Subject ID       074    104    218    253    264    302    304    306    460
Test Sequence    EMO-4  EXP-2  EXP-9  EMO-4  EXP-9  EMO-2  EXP-2  EXP-2  EMO-3
Table 4: The held-out sequence of each subject for self-reenactment.

To simplify the pipeline for Gaussian splat optimization, we remove the background of raw images with Background Matting V2 [24]. Additionally, we fit a line across the bottom vertices of each tracked FLAME mesh and project the line to each viewpoint to remove the pixels below. We show an example of pre-processing results in Fig. 7.
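A sketch of the shoulder-removal step, in a slight variant of the description above: we project the bottom vertices into each view and fit the line in image space. The camera convention (a 3x3 intrinsics matrix `K` and a 4x4 world-to-camera transform `w2c`) is an assumption of ours:

```python
import numpy as np

def mask_below_shoulder(bottom_verts_world, K, w2c, image_hw):
    """Return a boolean mask that is True above the line fitted through the
    projected bottom vertices of the tracked FLAME mesh (False below it)."""
    h, w = image_hw
    v_h = np.concatenate([bottom_verts_world, np.ones((len(bottom_verts_world), 1))], axis=1)
    cam = (w2c @ v_h.T).T[:, :3]                     # world -> camera coordinates
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide -> pixel coordinates

    a, b = np.polyfit(uv[:, 0], uv[:, 1], deg=1)     # fit y = a*x + b in image space
    xs = np.arange(w)
    ys = np.arange(h)[:, None]
    return ys <= (a * xs + b)                        # image y grows downward, so "below" means larger y
```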

Figure 7: We remove the background and pixels below the shoulder to focus on the head region.
(a) The number of 3D Gaussians throughout the optimization process.
(b) The run-time of an optimization iteration.
Figure 8: The number of 3D Gaussians increases by a factor of around 10 from its starting point for all subjects. After this, the number of 3D Gaussians stops growing. Despite this growth in Gaussians, the run time of each training iteration at most only doubles. Each curve corresponds to a different subject.

Appendix C Computation Efficiency

Our method binds 3D Gaussians to triangles in an efficient way, maintaining high rendering and optimization speed. Given that 3D Gaussians are actively added and pruned during optimization, the running speed of the program also changes.

We show the evolution of the number of Gaussians and the run-time of an iteration in Fig. 8. According to Fig. 8(a), the number of Gaussians grows from 10,144 (that is, the number of triangles in our modified FLAME mesh) to around 100,000 (on average). After this point is reached, however, the number of Gaussians no longer increases. Thanks to this, longer training times do not mean ever-increasing memory requirements. In fact, our model can fit and be trained on an NVIDIA RTX 2080 Ti graphics card with 12 gigabytes of VRAM. Moreover, while the number of Gaussians grows by as much as 1000% during training, the run-time of each optimization iteration increases by less than 100% at this peak (Fig. 8(b)). This validates the efficiency of the differentiable tile rasterizer [14], which sorts splats before blending and terminates per-pixel blending once zero transmittance is reached. The threshold of our scaling loss (see Section 3 of the main paper) is crucial to this efficiency. Without it, rendering time would increase substantially, as the rasterizer would need to blend many more Gaussians before reaching zero transmittance.