CogView4: Bilingual Diffusion Transformer for High-Fidelity Text-to-Image Generation

CogView4 is an open-source text-to-image model developed by THUDM. It uses a Diffusion Transformer (DiT) architecture and generates high-quality images from text prompts written in Chinese or English. The model uses GLM-4 as its text encoder and accepts variable-length prompts of up to 1024 tokens, which helps it handle long, detailed descriptions. CogView4 produces images at resolutions up to 2048×2048 pixels and performs strongly at generating Chinese characters, making it suitable for both creative and practical visual tasks.

Key Features

  • Bilingual Generation: Accepts text prompts in both Chinese and English, using the GLM-4 text encoder.
  • High-Resolution Output: Generates images with resolutions ranging from 512×512 up to 2048×2048 pixels.
  • Dynamic Text Processing: Handles variable-length prompts of up to 1024 tokens, which reduces redundant computation and improves training efficiency.
  • Efficient Training: Uses pre-computation and caching of latents and embeddings, sequence packing, and memory-efficient strategies to improve training throughput.
  • Prompt Optimization: Includes tools for refining prompts using large language models to enhance generation quality.
  • Open-Source Accessibility: Available under the Apache-2.0 license, facilitating community contributions and integrations.
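
To make the variable-length prompt handling and sequence packing mentioned above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not CogView4's actual training code: the function names `pack_prompts` and `padded_size`, the offsets list `cu_seqlens`, and the toy token IDs are all hypothetical. The only value taken from the text is the 1024-token cap. The sketch concatenates prompts of different lengths into one flat buffer and records where each prompt starts and ends, then compares the token count with what fixed-length padding would require.

```python
from typing import List, Tuple

MAX_TOKENS = 1024  # prompt length cap stated in the feature list


def pack_prompts(
    prompts: List[List[int]], max_tokens: int = MAX_TOKENS
) -> Tuple[List[int], List[int]]:
    """Concatenate variable-length token sequences into one flat buffer.

    Returns (packed, cu_seqlens), where the slice
    packed[cu_seqlens[i]:cu_seqlens[i + 1]] holds the tokens of prompt i.
    Hypothetical helper for illustration only.
    """
    packed: List[int] = []
    cu_seqlens: List[int] = [0]
    for tokens in prompts:
        tokens = tokens[:max_tokens]  # truncate anything over the cap
        packed.extend(tokens)
        cu_seqlens.append(len(packed))  # running end offset of this prompt
    return packed, cu_seqlens


def padded_size(prompts: List[List[int]], max_tokens: int = MAX_TOKENS) -> int:
    """Token slots needed if every prompt were padded to max_tokens."""
    return len(prompts) * max_tokens


if __name__ == "__main__":
    # Toy token IDs standing in for three prompts of different lengths.
    prompts = [[1] * 12, [2] * 300, [3] * 45]
    packed, cu_seqlens = pack_prompts(prompts)
    print("packed tokens:", len(packed))
    print("padded tokens:", padded_size(prompts))
    print("offsets:", cu_seqlens)
```

In this toy batch the packed buffer holds only the tokens that actually exist, while fixed-length padding would reserve a full 1024 slots per prompt. A real training loop would also need an attention mask or offset-aware attention kernel so that tokens from different prompts do not attend to each other; that part is omitted here.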