We present a method to generate full-body selfies -- photos that you take of yourself, but that capture your whole body as if someone else had taken the photo from a few feet away. Our approach takes as input a pre-captured video of your body, a target pose photo, and a selfie + background pair for each location. We introduce a novel diffusion-based approach that combines all of this information into high-quality, well-composed photos of you with the desired pose and background.
Pipeline of Total Selfie. Given selfie video frames of different body parts (blue box), Region-Aware Generation (green box) trains a multi-concept DreamBooth to generate an initial full-body image \(I_g\) in the background \(I_b\) with the target pose \(I_t\). Appearance Refinement (orange box) refines the face region of \(I_g\) by incorporating the expression from the on-site selfie \(I_s\) with perspective undistortion; other body parts (e.g., clothing) are refined using a similar idea with slight modifications. The refined image is denoted \(I_r\). Image Harmonization (purple box) harmonizes the refined image to fix unnatural regions using a diffusion prior with appropriate guidance, generating the final output \(I_h\).
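To make the data flow between the three stages concrete, the following is a minimal sketch of the pipeline in Python. All helper names (train_multi_concept_dreambooth, generate_full_body, undistort_perspective, refine_region, harmonize_with_diffusion_prior) are hypothetical placeholders for the stages described in the caption, not the authors' actual API.

```python
# Minimal sketch of the Total Selfie pipeline. Every helper called here is
# a hypothetical placeholder for one stage in the caption above, not an
# official implementation.

def total_selfie(selfie_frames, I_t, I_s, I_b):
    """selfie_frames: video frames of body parts; I_t: target pose photo;
    I_s: on-site selfie; I_b: background photo."""
    # 1. Region-Aware Generation: fine-tune a multi-concept DreamBooth on
    #    the per-region selfie frames, then synthesize an initial
    #    full-body image I_g in background I_b with pose I_t.
    model = train_multi_concept_dreambooth(selfie_frames)
    I_g = generate_full_body(model, pose=I_t, background=I_b)

    # 2. Appearance Refinement: undistort the close-range on-site selfie
    #    (selfies suffer strong perspective distortion) and transfer its
    #    expression into the face region of I_g; other regions (e.g.,
    #    clothing) are refined analogously from the video frames.
    face = undistort_perspective(I_s)
    I_r = refine_region(I_g, region="face", reference=face)
    I_r = refine_region(I_r, region="clothing", reference=selfie_frames)

    # 3. Image Harmonization: run a guided diffusion prior over I_r to fix
    #    boundary seams and inconsistent shading, yielding the final I_h.
    I_h = harmonize_with_diffusion_prior(I_r)
    return I_h
```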
Sample images of different body parts of different users, extracted from their selfie videos. The appearance of the same outfit can vary across selfies, depending largely on factors such as spatially varying lighting conditions and differing camera settings. For instance, comparing the black top in row 1 (a) to row 1 (b), the top appears somewhat lighter in the latter image.
Results of Total Selfie. Total Selfie generates authentic, realistic full-body images for diverse individuals, capturing a broad range of expressions against a variety of backgrounds while preserving the clothing and producing plausible shading. Note that the output in the second row shows a stripe on the left side of the pants; this is not an artifact but a ground pattern in the background.
Qualitative comparison with baselines. All results are zoomed in for clear visualization. Column (d) shows the results of DreamBooth+ControlNet. For all methods, we use the ground truth as the target pose to constrain the pose. Our pipeline clearly outperforms all baselines in terms of photorealism and faithfulness. Note that, despite being captured at nearly the same time, the color tones of the on-site selfie, background image, and ground truth may not match due to differences in lighting conditions, auto exposure, white balance, etc.
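The DreamBooth+ControlNet baseline in (d) can be reproduced roughly with the Hugging Face diffusers library, as sketched below; the DreamBooth checkpoint path, prompt, and file names are placeholders, since the paper does not specify them.

```python
# Rough reconstruction of the DreamBooth+ControlNet baseline using the
# Hugging Face diffusers API. The DreamBooth checkpoint path and the
# prompt are placeholders, not values taken from the paper.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# OpenPose-conditioned ControlNet; the pose map would be extracted from
# the ground-truth photo, matching the comparison protocol above.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/dreambooth-finetuned-model",  # placeholder: subject-tuned SD
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose_map = load_image("target_pose_openpose.png")  # pose from ground truth
result = pipe(
    "a photo of sks person standing, full body",  # DreamBooth-style prompt
    image=pose_map,
    num_inference_steps=30,
).images[0]
result.save("baseline_output.png")
```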
Results for different modules of our pipeline. (a) shows the on-site selfie. (b) shows the result of the trained DreamBooth conditioned only on a text prompt at inference. (c)-(e) show zoomed-in crops of the outputs. With only region-aware generation, output (c) has incorrect identity and clothing. With region-aware generation plus appearance refinement, output (d) has better identity but contains boundary artifacts (purple arrow), incorrect shading (blue arrow), and poor image details (green arrow). In contrast, the full pipeline (e) produces a realistic and faithful full-body photo.
Total Selfie generates full-body images featuring a variety of poses while maintaining accurate patterns on the individual's clothes. The target pose can differ significantly from the poses adopted when pre-capturing the selfie video, providing considerable flexibility and diversity in the generated full-body photographs, as the usage sketch below illustrates.
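For instance, under the hypothetical total_selfie sketch above, only the target pose image needs to change between runs; all pre-captured inputs are reused (load_pose_image is another placeholder):

```python
# Hypothetical usage of the total_selfie sketch above: the pre-captured
# selfie_frames, on-site selfie I_s, and background I_b stay fixed while
# only the target pose varies. load_pose_image is a placeholder loader.
poses = ["pose_walking.png", "pose_arms_crossed.png", "pose_hands_on_hips.png"]
outputs = [total_selfie(selfie_frames, load_pose_image(p), I_s, I_b)
           for p in poses]
```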
Total Selfie has several limitations: (1) The shading in the generated full-body image may not align accurately with the actual photo. This happens when the shading in the initial full-body image (generated by DreamBooth) differs greatly from the shading in the on-site selfie. A potential avenue for future work is to harness the on-site selfie to guide the region-aware generation; one speculative version of this idea is sketched below.
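The following sketch shows one way such selfie-guided generation might look: a guidance loss that matches the low-frequency luminance (a crude proxy for shading) of the generated face region to the undistorted on-site selfie. Everything here is an assumption about a possible future direction, not part of the published method.

```python
# Speculative sketch of the future direction above: steer the diffusion
# sampler so the generated face region's low-frequency luminance matches
# the on-site selfie's. All names and choices here are assumptions.
import torch
import torch.nn.functional as F

def shading_guidance_loss(x0_pred, selfie_face, face_mask, kernel=31):
    """x0_pred: current denoised estimate, shape (B, 3, H, W), in [0, 1];
    selfie_face: undistorted on-site selfie aligned to the face region;
    face_mask: binary mask of the face region, shape (B, 1, H, W)."""
    def luminance(img):
        # Standard Rec. 601 luma weights.
        r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
        return 0.299 * r + 0.587 * g + 0.114 * b

    def low_freq(img, k):
        # Heavy blur keeps only coarse shading, discarding identity detail.
        return F.avg_pool2d(img, k, stride=1, padding=k // 2)

    gen_shading = low_freq(luminance(x0_pred) * face_mask, kernel)
    ref_shading = low_freq(luminance(selfie_face) * face_mask, kernel)
    return F.mse_loss(gen_shading, ref_shading)

# During sampling, the gradient of this loss w.r.t. the latent would be
# subtracted from the model's prediction, as in classifier-style guidance.
```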