Overall method. (a) Given the annotated multi-view human dataset, we train a text-conditioned 3D avatar generative model. (b) The model is built upon a structured 3D human representation. Training proceeds in two stages: (c) first, a decoder is acquired by distilling a pretrained 3D human reconstruction model; (d) second, a structured latent diffusion model (LDM) is trained to generate structured latent maps from noise.
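The two-stage recipe in the caption (distill a decoder from a pretrained reconstruction model, then train a generative model over structured latents) can be sketched with a toy NumPy stand-in. Everything here is illustrative: the linear teacher, decoder, and one-step denoiser, plus the latent/output dimensions, are assumptions for demonstration and do not reflect TeRA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: "structured latent maps" are 8-dim vectors,
# decoded avatars are 16-dim vectors.
LATENT_DIM, OUT_DIM, N = 8, 16, 256

# Pretrained "teacher" reconstruction model, stood in for by a fixed linear map.
W_teacher = rng.normal(size=(LATENT_DIM, OUT_DIM))

def teacher(z):
    return z @ W_teacher

# --- Stage 1: distill a student decoder from the teacher -----------------
latents = rng.normal(size=(N, LATENT_DIM))
targets = teacher(latents)
# Least-squares fit: the student decoder learns to reproduce teacher outputs.
W_student, *_ = np.linalg.lstsq(latents, targets, rcond=None)
distill_err = np.abs(latents @ W_student - targets).max()

# --- Stage 2: train a toy denoiser on the structured latents -------------
# One-step denoising objective: predict the clean latent from a noised one
# (a drastic simplification of a latent diffusion model).
sigma = 0.5
noised = latents + sigma * rng.normal(size=latents.shape)
W_denoise, *_ = np.linalg.lstsq(noised, latents, rcond=None)

# Generation: start from pure noise, denoise, then decode to an "avatar".
z0 = rng.normal(size=(1, LATENT_DIM))
avatar = (z0 @ W_denoise) @ W_student
print(distill_err, avatar.shape)
```

The key design point mirrored here is the decoupling: the decoder is frozen after distillation, so stage 2 only has to model the latent distribution, not the mapping from latents to 3D avatars.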
@inproceedings{wang2025TeRA,
  author    = {Wang, Yanwen and Zhuang, Yiyu and Zhang, Jiawei and Wang, Li and Zeng, Yifei and Cao, Xun and Zuo, Xinxin and Zhu, Hao},
  title     = {TeRA: Text-to-Avatar Generation via 3D Human Representation},
  booktitle = {ICCV},
  year      = {2025},
}