Atlas Gaussians Diffusion for 3D Generation

1 The University of Texas at Austin, 2 Alibaba Group
ICLR 2025 (Spotlight)

*Indicates Equal Contribution

TL;DR:
(1) A new representation that can generate a theoretically infinite number of 3D Gaussians (3DGS).
(2) We apply the VAE + LDM (Latent Diffusion Model) paradigm to 3DGS generation.

Abstract

Latent diffusion models have proven effective for developing novel 3D generation techniques. To harness them, a key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space. In this paper, we introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation. Atlas Gaussians represent a shape as the union of local patches, and each patch can decode 3D Gaussians. We parameterize a patch as a sequence of feature vectors and design a learnable function to decode 3D Gaussians from the feature vectors. In this process, we incorporate UV-based sampling, enabling the generation of a sufficiently large, and theoretically infinite, number of 3D Gaussian points. The large number of 3D Gaussians enables the generation of high-quality details. Moreover, due to the local awareness of the representation, the transformer-based decoding procedure operates at the patch level, ensuring efficiency. We train a variational autoencoder to learn the Atlas Gaussians representation, and then apply a latent diffusion model on its latent space for learning 3D generation. Experiments show that our approach outperforms prior feed-forward native 3D generation methods.

Atlas Gaussians representation

(Left) Atlas Gaussians \( \mathcal{A} \) model the shape as a union of patches, where each patch can decode 3D Gaussians. (Right) Each patch \( a_i \) is parameterized by a patch center \( x_i \) and patch features \( f_i \) and \( h_i \). The 3D Gaussians are decoded via UV-based sampling.
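
The idea of UV-based sampling can be illustrated with a small sketch. It is not the authors' decoder: it assumes, purely for illustration, that a patch carries four feature vectors attached to the corners of the unit UV square, that features are bilinearly interpolated at a sampled \( (u, v) \), and that a fixed linear map `proj` (a hypothetical stand-in for the learnable decoding function) turns the interpolated feature into an offset from the patch center \( x_i \). All function and variable names are made up for this example.

```python
import random


def bilinear_weights(u, v):
    """Weights of the four UV-square corners (0,0), (1,0), (0,1), (1,1)."""
    return [(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v]


def decode_position(center, corner_feats, proj, u, v):
    """Map one UV sample to a 3D Gaussian center for a single patch.

    corner_feats: four feature vectors of length C (assumed layout).
    proj: C x 3 matrix, a hypothetical linear stand-in for the learned decoder.
    """
    weights = bilinear_weights(u, v)
    channels = len(corner_feats[0])
    # Interpolate the patch features at (u, v).
    feat = [
        sum(w * f[c] for w, f in zip(weights, corner_feats))
        for c in range(channels)
    ]
    # Decode the interpolated feature into an offset from the patch center.
    offset = [sum(feat[c] * proj[c][k] for c in range(channels)) for k in range(3)]
    return [center[k] + offset[k] for k in range(3)]


def sample_patch(center, corner_feats, proj, num_points, rng):
    """Draw num_points UV samples uniformly and decode each one."""
    return [
        decode_position(center, corner_feats, proj, rng.random(), rng.random())
        for _ in range(num_points)
    ]


rng = random.Random(0)
center = [0.1, 0.2, 0.3]
corner_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
proj = [[0.05 * rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]

# The same patch parameters yield as many points as requested.
coarse = sample_patch(center, corner_feats, proj, 16, rng)
dense = sample_patch(center, corner_feats, proj, 4096, rng)
```

The point of the sketch is that the number of decoded Gaussians is set by how many UV samples are drawn, not by the size of the patch parameters, which is what makes the count unbounded in principle.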

VAE architecture

The proposed VAE architecture. CA denotes the cross-attention layer. For simplicity, the variational component of the VAE is omitted. The latent \( z_0 \) is used for latent diffusion.
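
The role of \( z_0 \) in the diffusion stage can be written down for one common formulation. This page does not state which noise schedule or training objective is used, so the DDPM-style equations below are an illustrative assumption, and the symbols \( \bar{\alpha}_t \), \( \epsilon_\theta \), and the text condition \( c \) are notation introduced here rather than taken from the paper.

```latex
% Forward process: noise the VAE latent z_0 at timestep t (assumed DDPM-style schedule)
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Training objective: a denoiser predicts the added noise, conditioned on c
\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}
\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right]
```

Under this reading, sampling starts from Gaussian noise, iteratively denoises to a latent \( z_0 \), and the VAE decoder then maps \( z_0 \) to Atlas Gaussians.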

Analysis (diversity & controllability & originality)

(Left) Our generated results demonstrate significant diversity. (Right) Our generated results align closely with the text prompts, allowing for strong controllability. In the second row of each group, we present the nearest neighbors (NN) from the training dataset.

Text-conditioned 3D generation

Comparison with baseline approaches on text-conditioned 3D generation on Objaverse. From left to right: GVGEN (He et al., 2024), LN3Diff (Lan et al., 2024), LGM (Tang et al., 2024), Shap-E (Jun & Nichol, 2023), and our method.

BibTeX

@inproceedings{
      yang2025atlas,
      title={Atlas Gaussians Diffusion for 3D Generation},
      author={Haitao Yang and Yuan Dong and Hanwen Jiang and Dejia Xu and Georgios Pavlakos and Qixing Huang},
      booktitle={The Thirteenth International Conference on Learning Representations},
      year={2025},
      url={https://openreview.net/forum?id=H2Gxil855b}
    }