Overall method. (a) Given the annotated multi-view human dataset, we train a text-conditioned 3D avatar generative model. (b) The model is built upon a structured 3D human representation. Model training proceeds in two stages: (c) first, a decoder is obtained by distilling a pretrained 3D human reconstruction model; (d) second, a structured latent diffusion model (LDM) is trained to generate structured latent maps from noise.
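To make stage (d) concrete, below is a minimal toy sketch of diffusion-style sampling: starting from Gaussian noise, a denoiser is applied iteratively to produce a latent map. The latent shape, the identity denoiser, and the Euler-style update are illustrative assumptions only; they do not reflect TeRA's actual architecture, schedule, or text conditioning.

```python
import numpy as np

def toy_denoiser(z, t):
    # Placeholder for the structured LDM's noise predictor. A real model
    # would be a trained network conditioned on text embeddings; here we
    # simply treat the current latent itself as the predicted noise.
    return z

def sample_latent(shape=(8, 8, 4), steps=10, seed=0):
    """Toy deterministic sampling: refine pure noise into a latent map."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)      # start from Gaussian noise
    for t in range(steps):
        eps = toy_denoiser(z, t)        # predicted noise at this step
        z = z - (1.0 / steps) * eps     # simple Euler-style denoising update
    return z

latent = sample_latent()
print(latent.shape)  # the sampled "structured latent map" (toy)
```

In the full pipeline, the resulting latent map would then be passed through the distilled decoder from stage (c) to produce the 3D avatar.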
@inproceedings{wang2025tera,
title={TeRA: Rethinking Text-guided Realistic 3D Avatar Generation},
author={Wang, Yanwen and Zhuang, Yiyu and Zhang, Jiawei and Wang, Li and Zeng, Yifei and Cao, Xun and Zuo, Xinxin and Zhu, Hao},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={10686--10697},
year={2025}
}