LlamaGen is a family of autoregressive image generation models that apply the next-token prediction paradigm of large language models to the visual domain. By reexamining the design space of image tokenizers and scaling models appropriately, it achieves state-of-the-art image generation performance without relying on inductive biases tailored for vision.
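Conceptually, an image is quantized into a sequence of discrete codebook indices, and generation samples those indices one position at a time, just as an LLM samples words. Below is a minimal sketch of that loop, assuming a hypothetical `model` that maps a token sequence to logits over the visual codebook and a conditioning prefix `cond`; the names are illustrative, not LlamaGen's actual API:

```python
import torch

@torch.no_grad()
def sample_image_tokens(model, cond, num_tokens=256, temperature=1.0):
    """Autoregressively sample image token IDs, one position at a time."""
    tokens = cond  # (batch, prefix_len) conditioning tokens, e.g. a class-label index
    for _ in range(num_tokens):
        logits = model(tokens)[:, -1, :] / temperature  # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token per batch item
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, cond.shape[1]:]  # drop the prefix; decode the rest with the image tokenizer
```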
Key Features
Class-conditional image generation models ranging from 100M to 3.1B parameters.
Text-conditional image generation model with 775M parameters.
Image tokenizers with downsample ratios of 16 and 8 (see the token-count sketch after this list).
Achieves an FID of 2.18 on the ImageNet 256×256 benchmark.
The text-conditional model demonstrates competitive visual quality and text alignment.
Uses the vLLM serving framework to achieve a 326%–414% inference speedup.
Releases pre-trained model weights and PyTorch training and sampling code.
Provides online demos and Gradio-based interfaces for interactive use (a minimal wrapper is sketched after this list).
Supports both class-conditional and text-conditional image generation tasks.
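The downsample ratio determines how many tokens represent an image: a ratio-16 tokenizer turns a 256×256 image into a 16×16 grid of 256 tokens, while a ratio-8 tokenizer yields a 32×32 grid of 1,024 tokens. A quick arithmetic check:

```python
def num_image_tokens(image_size: int, downsample_ratio: int) -> int:
    """Number of discrete tokens a VQ tokenizer produces for a square image."""
    side = image_size // downsample_ratio  # tokens per spatial dimension
    return side * side

assert num_image_tokens(256, 16) == 256    # ratio 16: 16x16 token grid
assert num_image_tokens(256, 8) == 1024    # ratio 8: 32x32 token grid
```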
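For the interactive demos, a Gradio app only needs a function from prompt to image. Below is a minimal wrapper around a hypothetical `generate` function; the released demos wrap the actual text-conditional model, while this stub returns a blank placeholder image so the sketch stays self-contained:

```python
import gradio as gr
from PIL import Image

def generate(prompt: str) -> Image.Image:
    # Placeholder: a real demo would encode the prompt, autoregressively
    # sample image tokens, and decode them with the image tokenizer.
    return Image.new("RGB", (256, 256))

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Image(label="Generated image"),
)
demo.launch()
```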