1Electronics and Telecommunication Research Institute (ETRI), Daejeon, South Korea
2Computer Vision and Image Processing (CVIP) Lab., Chungnam National University, Daejeon, South Korea
Recently, implicit function (IF)-based methods for clothed human reconstruction from a single image have received a lot of attention. Most existing methods rely on a 3D embedding branch using a volume, such as one derived from the skinned multi-person linear (SMPL) model, to compensate for the lack of information in a single image. Going beyond SMPL, which provides skinned parametric human 3D information, in this paper we propose a new IF-based method, DIFu, that utilizes a projected depth prior containing textured and non-parametric human 3D information. In particular, DIFu consists of a generator, an occupancy prediction network, and a texture prediction network. The generator takes an RGB image of the human front side as input and hallucinates the human back-side image. After that, depth maps for the front/back images are estimated and projected into 3D volume space. Finally, the occupancy prediction network extracts a pixel-aligned feature and a voxel-aligned feature through a 2D encoder and a 3D encoder, respectively, and estimates occupancy using these features. Note that the voxel-aligned features are obtained from the projected depth maps, so they can contain detailed 3D information such as hair and clothing. The color of each query point is also estimated with the texture prediction network. The effectiveness of DIFu is demonstrated by comparing it with recent IF-based models both quantitatively and qualitatively.
(1) Back-side image generation (IB) with the hallucinator (mirrored form, PIFuHD setting).
(2) Using front-/back-side images and the parametric mesh, the depth estimator infers front-/back-side depth maps (DF, DB).
(3) DF and DB are projected into the volume V (see the projection sketch after this list).
(4) If texture estimation is required, IF and IB can also be projected.
(5) IF, IB, DF, DB, and V are encoded.
(6) 2D and 3D features are aligned and concatenated channel-wise (see the feature-sampling sketch after this list).
(7) The MLPs estimate an occupancy vector.
(8) The occupancy vector is converted into a mesh by the marching cubes algorithm.
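Below is a minimal sketch of step (3), assuming an orthographic camera, depth maps normalized to [0, 1] (smaller = closer), and a cubic volume resolution; these specifics are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def depths_to_volume(d_front: torch.Tensor, d_back: torch.Tensor, res: int = 128) -> torch.Tensor:
    """Fill the voxels lying between the front and back depth surfaces.

    d_front, d_back: (H, W) depth maps in [0, 1], smaller values closer to the camera.
    Returns a (res, res, res) binary occupancy volume indexed as (z, y, x).
    """
    # Resample both maps to the lateral resolution of the volume.
    df = F.interpolate(d_front[None, None], size=(res, res), mode="bilinear", align_corners=False)[0, 0]
    db = F.interpolate(d_back[None, None], size=(res, res), mode="bilinear", align_corners=False)[0, 0]
    # Depth coordinate of each slice along the camera axis.
    z = torch.linspace(0.0, 1.0, res).view(res, 1, 1)
    # A voxel is occupied if it sits between the front and back surfaces.
    return ((z >= df) & (z <= db)).float()
```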
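And a sketch of steps (5)–(7): sampling a pixel-aligned feature from the 2D feature map and a voxel-aligned feature from the 3D feature volume at each query point, then feeding the channel-wise concatenation to an MLP. Feature dimensions, the [-1, 1] coordinate convention, and the `mlp` callable are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def predict_occupancy(feat_2d, feat_3d, points, mlp):
    """feat_2d: (B, C2, H, W) image features; feat_3d: (B, C3, D, H, W) volume
    features; points: (B, N, 3) query points with xyz normalized to [-1, 1];
    mlp: any callable mapping (B, N, C2 + C3) -> (B, N, 1) logits."""
    # Pixel-aligned feature: sample the 2D map at the projected (x, y).
    xy = points[:, :, :2].unsqueeze(2)                    # (B, N, 1, 2)
    f2 = F.grid_sample(feat_2d, xy, align_corners=False)  # (B, C2, N, 1)
    f2 = f2[..., 0]                                       # (B, C2, N)
    # Voxel-aligned feature: sample the 3D volume at (x, y, z).
    xyz = points.view(points.shape[0], -1, 1, 1, 3)       # (B, N, 1, 1, 3)
    f3 = F.grid_sample(feat_3d, xyz, align_corners=False) # (B, C3, N, 1, 1)
    f3 = f3[..., 0, 0]                                    # (B, C3, N)
    # Channel-wise concatenation, then per-point occupancy from the MLP.
    feats = torch.cat([f2, f3], dim=1).permute(0, 2, 1)   # (B, N, C2 + C3)
    return torch.sigmoid(mlp(feats))                      # (B, N, 1)
```

Step (8) then amounts to evaluating the occupancy on a dense grid and running a marching cubes implementation such as `skimage.measure.marching_cubes` at level 0.5.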
More discussions can be added if needed.
Although we were inspired by PaMIR, which demonstrates strong performance with a simple implementation, we found that existing implicit function-based digital human reconstruction methods struggle to benefit from spatial assumptions within the occupancy vector estimation mechanism. We focused on addressing oversmoothing, particularly the overreliance on learned human patterns in unseen regions, which arises because the loss function compares 1D tensors using MSE. By placing modules with the inductive bias of convolution at the front of the pipeline, we devised a method that lets the implicit function convert an explicit 3D-shaped input into a mesh output without excessive reliance on human patterns. However, it does not simply serve as a converter: since the 3D prior can be somewhat incorrect, the implicit function can compensate by falling back on those patterns. To strengthen this ability, we introduced an augmentation offset during training (a plausible form is sketched below).
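The text does not spell out the exact form of the augmentation offset, so the following is only a plausible sketch: randomly translating the projected depth volume by a few voxels during training so the implicit function learns to correct slightly misplaced 3D guidance. The function name and offset range are hypothetical.

```python
import torch

def offset_augment(volume: torch.Tensor, max_offset: int = 3) -> torch.Tensor:
    """Randomly translate a (D, H, W) volume by up to max_offset voxels per axis.

    Hypothetical augmentation: the actual offset scheme may differ.
    """
    shifts = torch.randint(-max_offset, max_offset + 1, (3,)).tolist()
    return torch.roll(volume, shifts=shifts, dims=(0, 1, 2))
```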
Due to the limited availability of the dataset, we reimplemented and retrained the compared algorithms under the same conditions.
The dataset we used had limited diversity in clothing, poses, and races, making it challenging to handle web images that deviate significantly from the dataset distribution.
DIFu is sensitive to the performance of its two front-end modules (the hallucinator and the depth estimator).
Particularly when training data are scarce, the hallucinator's performance can change dramatically depending on the training method.
In the ablation study and Table 2 of the main paper, we investigated the hallucinator.
The model trained with an adversarial loss is more robust on unseen datasets than the model trained without it.
However, when training the implicit function, the predicted back-side image can differ from the actual back view in the training dataset, which can undermine its confidence in the explicit guidance.
Ironically, preventing mode collapse in the GAN can result in the implicit function losing confidence in the generated inputs, leading to oversmoothing on the back side.
For more questions, please contact eadyoung@naver.com or eadgaudiyoung@gmail.com.
PIFu (ICCV 2019, Saito et al.): Paper | Code | Video
PIFuHD (CVPR 2020, Saito et al.): Paper | Code | Video
PaMIR (IEEE TPAMI 2021, Zheng et al.): Paper | Code | Project Page
ICON (CVPR 2022, Xiu et al.): Paper | Code | Video
THuman2.0 (CVPR 2021, Yu et al.): Paper
BUFF (CVPR 2017, Zhang et al.): Paper
@InProceedings{Song2022difu,
author={Song, Dae-Young and Lee, HeeKyung and Seo, Jeongil and Cho, Donghyeon},
title={DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2023},
}