1Electronics and Telecommunication Research Institute (ETRI), Daejeon, South Korea
2Computer Vision and Image Processing (CVIP) Lab., Chungnam National University, Daejeon, South Korea
Recently, implicit function (IF)-based methods for clothed human reconstruction from a single image have received a lot of attention. Most existing methods rely on a 3D embedding branch using a volume, such as one derived from the skinned multi-person linear (SMPL) model, to compensate for the lack of information in a single image. Going beyond SMPL, which provides skinned parametric human 3D information, in this paper we propose a new IF-based method, DIFu, that utilizes a projected depth prior containing textured and non-parametric human 3D information. In particular, DIFu consists of a generator, an occupancy prediction network, and a texture prediction network. The generator takes an RGB image of the human front side as input and hallucinates the human back-side image. After that, depth maps for the front/back images are estimated and projected into 3D volume space. Finally, the occupancy prediction network extracts a pixel-aligned feature and a voxel-aligned feature through a 2D encoder and a 3D encoder, respectively, and estimates occupancy using these features. Note that the voxel-aligned features are obtained from the projected depth maps, so they can contain detailed 3D information such as hair and clothing. The color of each query point is also estimated with the texture prediction network. The effectiveness of DIFu is demonstrated by comparing it with recent IF-based models both quantitatively and qualitatively.
(1) Back-side image generation (IB) with the hallucinator (mirrored form, PIFuHD setting).
(2) Using front-/back-side images and the parametric mesh, the depth estimator infers front-/back-side depth maps (DF, DB).
(3) DF and DB are projected into the volume V (see the projection sketch after this list).
(4) If texture estimation is required, IF and IB can also be projected.
(5) IF, IB, DF, DB, and V are encoded.
(6) 2D and 3D features are aligned and concatenated channel-wise (see the feature-sampling sketch after this list).
(7) The MLPs estimate an occupancy vector.
(8) The occupancy vector is converted into a mesh by the marching cubes algorithm.
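Below is a minimal sketch of step (3), assuming an orthographic camera, depth maps normalized to [0, 1] (smaller = closer), and a cubic volume resolution; these specifics are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def depths_to_volume(d_front: torch.Tensor, d_back: torch.Tensor, res: int = 128) -> torch.Tensor:
    """Fill the voxels lying between the front and back depth surfaces.

    d_front, d_back: (H, W) depth maps in [0, 1], smaller values closer to the camera.
    Returns a (res, res, res) binary occupancy volume indexed as (z, y, x).
    """
    # Resample both maps to the lateral resolution of the volume.
    df = F.interpolate(d_front[None, None], size=(res, res), mode="bilinear", align_corners=False)[0, 0]
    db = F.interpolate(d_back[None, None], size=(res, res), mode="bilinear", align_corners=False)[0, 0]
    # Depth coordinate of each slice along the camera axis.
    z = torch.linspace(0.0, 1.0, res).view(res, 1, 1)
    # A voxel is occupied if it sits between the front and back surfaces.
    return ((z >= df) & (z <= db)).float()
```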
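And a sketch of steps (5)–(7): sampling a pixel-aligned feature from the 2D feature map and a voxel-aligned feature from the 3D feature volume at each query point, then feeding the channel-wise concatenation to an MLP. Feature dimensions, the [-1, 1] coordinate convention, and the `mlp` callable are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def predict_occupancy(feat_2d, feat_3d, points, mlp):
    """feat_2d: (B, C2, H, W) image features; feat_3d: (B, C3, D, H, W) volume
    features; points: (B, N, 3) query points with xyz normalized to [-1, 1];
    mlp: any callable mapping (B, N, C2 + C3) -> (B, N, 1) logits."""
    # Pixel-aligned feature: sample the 2D map at the projected (x, y).
    xy = points[:, :, :2].unsqueeze(2)                    # (B, N, 1, 2)
    f2 = F.grid_sample(feat_2d, xy, align_corners=False)  # (B, C2, N, 1)
    f2 = f2[..., 0]                                       # (B, C2, N)
    # Voxel-aligned feature: sample the 3D volume at (x, y, z).
    xyz = points.view(points.shape[0], -1, 1, 1, 3)       # (B, N, 1, 1, 3)
    f3 = F.grid_sample(feat_3d, xyz, align_corners=False) # (B, C3, N, 1, 1)
    f3 = f3[..., 0, 0]                                    # (B, C3, N)
    # Channel-wise concatenation, then per-point occupancy from the MLP.
    feats = torch.cat([f2, f3], dim=1).permute(0, 2, 1)   # (B, N, C2 + C3)
    return torch.sigmoid(mlp(feats))                      # (B, N, 1)
```

Step (8) then amounts to evaluating the occupancy on a dense grid and running a marching cubes implementation such as `skimage.measure.marching_cubes` at level 0.5.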
More discussions can be added if needed.
Although we were inspired by PaMIR, which demonstrates strong performance with a simple implementation, we found that existing implicit function-based digital human reconstruction methods struggle to benefit from spatial assumptions within the occupancy vector estimation mechanism. We focused on addressing oversmoothing, particularly the overreliance on learned human patterns in unseen regions, which arises because the loss function compares 1D tensors using MSE. By placing modules with the inductive bias of convolution at the front of the pipeline, we devised a method that lets the implicit function convert an explicit 3D-shaped input into a mesh output without excessive reliance on human patterns. However, it does not simply serve as a converter: since the 3D prior can be somewhat incorrect, the implicit function can compensate by falling back on those patterns. To strengthen this ability, we introduced an augmentation offset during training (a plausible form is sketched below).
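The text does not spell out the exact form of the augmentation offset, so the following is only a plausible sketch: randomly translating the projected depth volume by a few voxels during training so the implicit function learns to correct slightly misplaced 3D guidance. The function name and offset range are hypothetical.

```python
import torch

def offset_augment(volume: torch.Tensor, max_offset: int = 3) -> torch.Tensor:
    """Randomly translate a (D, H, W) volume by up to max_offset voxels per axis.

    Hypothetical augmentation: the actual offset scheme may differ.
    """
    shifts = torch.randint(-max_offset, max_offset + 1, (3,)).tolist()
    return torch.roll(volume, shifts=shifts, dims=(0, 1, 2))
```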
Due to the limited availability of the dataset, we reimplemented and retrained the compared algorithms under the same conditions.
The dataset we used had limited diversity in clothing, poses, and races, making it challenging to handle web images that deviate significantly from the dataset distribution.
DIFu is sensitive to the performance of its two front-end modules (the hallucinator and the depth estimator).
Particularly when training data are scarce, the hallucinator's performance can change dramatically depending on the training method.
In the ablation study and Table 2 of the main paper, we investigated the hallucinator.
The model trained with an adversarial loss is more robust on unseen datasets than the model trained without it.
However, when training the implicit function, the predicted back-side image can differ from the actual back view in the training dataset, which can undermine its confidence in the explicit guidance.
Ironically, preventing mode collapse in the GAN can result in the implicit function losing confidence in the generated inputs, leading to oversmoothing on the back side.
For more questions, please contact eadyoung@naver.com or eadgaudiyoung@gmail.com.
PIFu (ICCV 2019, Saito et al.): Paper | Code | Video
PIFuHD (CVPR 2020, Saito et al.): Paper | Code | Video
PaMIR (IEEE TPAMI 2021, Zheng et al.): Paper | Code | Project Page
ICON (CVPR 2022, Xiu et al.): Paper | Code | Video
THuman2.0 (CVPR 2021, Yu et al.): Paper
BUFF (CVPR 2017, Zhang et al.): Paper
@InProceedings{Song2022difu,
author={Song, Dae-Young and Lee, HeeKyung and Seo, Jeongil and Cho, Donghyeon},
title={DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2023},
}