NÜWA-PyTorch is a PyTorch implementation of NÜWA, an attention-based transformer for multimodal generation tasks. Its primary focus is text-to-video synthesis, and it extends to audio generation through a dual-decoder mechanism. The project includes tools for training and for sampling video and audio outputs, combining VQGAN-based autoencoders, which compress media into discrete tokens, with hierarchical transformer architectures that model those tokens.
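To make the generation step concrete, the sketch below shows the shape of an autoregressive sampling loop: conditioned on prompt tokens, a model predicts one discrete media token at a time. This is an illustrative toy, not the library's API; `next_token_logits` is a hypothetical stand-in for a real transformer.

```python
def next_token_logits(context, vocab_size=4):
    # Hypothetical stand-in for a trained transformer: deterministically
    # favours (last token + 1) mod vocab_size, so the loop is testable.
    last = context[-1]
    return [1.0 if t == (last + 1) % vocab_size else 0.0
            for t in range(vocab_size)]

def sample(prompt_tokens, num_new, vocab_size=4):
    # Greedy autoregressive decoding: append the argmax token each step.
    tokens = list(prompt_tokens)
    for _ in range(num_new):
        logits = next_token_logits(tokens, vocab_size)
        tokens.append(max(range(vocab_size), key=logits.__getitem__))
    return tokens

print(sample([0], 3))  # → [0, 1, 2, 3]
```

In the real system the sampled token ids are decoded back into pixels (or audio) by the VQGAN decoder; the loop structure itself is the same.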
Key Features
- Supports both text-to-video and text-to-audio generation
- Includes a VQGAN-style autoencoder for compressing media into discrete latent tokens
- Uses a unified transformer with separate decoders for video and audio
- Provides modular components for training, inference, and data preprocessing
- Allows conditioning on additional modalities such as segmentation maps and sketches
- Follows modern PyTorch practices for flexibility and performance
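The second feature above, compressing media into discrete latent tokens, rests on vector quantization: each continuous encoder feature is replaced by the id of its nearest codebook entry. A minimal sketch, with toy numbers and illustrative names only (not the library's actual code):

```python
def quantize(features, codebook):
    """Map each feature vector to the id of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
            for f in features]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy "learned" code vectors
features = [[0.1, -0.2], [0.9, 0.1], [0.2, 0.8]]  # toy encoder outputs per patch
print(quantize(features, codebook))  # → [0, 1, 2]
```

The resulting token ids are what the transformer is trained to predict; the codebook itself is learned jointly with the VQGAN encoder and decoder.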