CogView4: Bilingual Diffusion Transformer for High-Fidelity Text-to-Image Generation

Category: Deep Learning

License: Apache-2.0

Model Type: Image Generation

CogView4 is an open-source text-to-image generation model developed by THUDM. It uses a Diffusion Transformer (DiT) architecture and generates images from text prompts in both Chinese and English. The model uses GLM-4 as its text encoder and accepts variable-length prompts of up to 1024 tokens, which helps it handle long, detailed prompts. It produces images at resolutions up to 2048×2048 pixels and performs well at rendering Chinese characters in generated images, which makes it suitable for both creative and practical visual tasks.
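
The limits stated in this document (prompts of up to 1024 tokens, image sides between 512 and 2048 pixels) can be checked before a generation request is submitted. The following is a minimal sketch and not part of CogView4 itself: GenerationRequest and check_request are hypothetical names, and the token count is assumed to come from whichever tokenizer is in use.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from this document; the released model may impose
# additional constraints that are not modelled here.
MIN_SIDE = 512
MAX_SIDE = 2048
MAX_PROMPT_TOKENS = 1024


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical container for one text-to-image request."""
    prompt_tokens: int  # number of tokens in the encoded prompt
    width: int          # requested image width in pixels
    height: int         # requested image height in pixels


def check_request(req: GenerationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits."""
    problems = []
    if req.prompt_tokens > MAX_PROMPT_TOKENS:
        problems.append(
            f"prompt has {req.prompt_tokens} tokens, limit is {MAX_PROMPT_TOKENS}"
        )
    for name, side in (("width", req.width), ("height", req.height)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            problems.append(
                f"{name}={side} is outside the range {MIN_SIDE}..{MAX_SIDE}"
            )
    return problems
```

A request such as GenerationRequest(prompt_tokens=77, width=1024, height=1024) passes with no problems reported, while an oversized prompt or an out-of-range side is reported as a separate message each.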

Key Features

Bilingual Generation: Accepts text prompts in both Chinese and English, encoded with the GLM-4 text encoder.
High-Resolution Output: Generates images with resolutions ranging from 512×512 up to 2048×2048 pixels.
Dynamic Text Processing: Handles variable-length prompts of up to 1024 tokens rather than padding every prompt to a fixed length, which reduces redundant computation and improves training efficiency.
Efficient Training: Pre-computes and caches latents and embeddings, packs sequences, and applies memory-saving strategies to raise training throughput.
Prompt Optimization: Includes tools for refining prompts using large language models to enhance generation quality.
Open-Source Accessibility: Available under the Apache-2.0 license, facilitating community contributions and integrations.
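
To illustrate why variable-length prompts and sequence packing save computation, the sketch below packs prompts of different token lengths into fixed 1024-token slots and compares the wasted padding with the one-prompt-per-slot approach. This is a generic first-fit-decreasing heuristic chosen for illustration; the packing scheme CogView4 actually uses is not described here, and pack_sequences, padding_waste, and packed_waste are hypothetical names.

```python
from typing import List


def pack_sequences(lengths: List[int], capacity: int = 1024) -> List[List[int]]:
    """Group sample indices into bins whose total length fits in `capacity`.

    Uses first-fit decreasing: longest samples are placed first, each into
    the first bin that still has room.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"sample {i} has {n} tokens, capacity is {capacity}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            free.append(capacity - n)
    return bins


def padding_waste(lengths: List[int], capacity: int = 1024) -> int:
    """Padded slots when each sample is padded to `capacity` on its own."""
    return sum(capacity - n for n in lengths)


def packed_waste(bins: List[List[int]], lengths: List[int], capacity: int = 1024) -> int:
    """Padded slots left over after packing samples into shared bins."""
    return sum(capacity - sum(lengths[i] for i in b) for b in bins)
```

For the prompt lengths [900, 100, 600, 400, 24, 1000], padding each prompt separately leaves 3120 unused token slots across six sequences, while packing fits the same prompts into three sequences with 48 unused slots. In a real training loop the packed samples would also need an attention mask that keeps them from attending to each other, which this sketch omits.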
