NÜWA-PyTorch: Multimodal Text-to-Video & Audio Transformer

Category: Deep Learning
License: MIT
Model Type: Text-to-Video / Text-to-Audio Generation
NÜWA-PyTorch is a PyTorch reimplementation of NÜWA, an attention-based transformer for multimodal generation. Its primary focus is text-to-video synthesis, and it extends to joint video and audio generation through a dual-decoder mechanism. The project includes tools for training VQGAN-based autoencoders that compress media into discrete tokens, hierarchical transformers that model those token sequences, and utilities for sampling video and audio outputs.
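
As a minimal sketch of the first training stage, here is how the VQGAN-style autoencoder might be trained to compress frames into discrete latent tokens. The `VQGanVAE` class name follows the repository, but the specific keyword arguments shown are illustrative assumptions and should be checked against the actual signature:

```python
import torch
from nuwa_pytorch import VQGanVAE  # assumes the package layout of the reference repo

# Illustrative hyperparameters; argument names are assumptions
vae = VQGanVAE(
    dim = 512,
    image_size = 256,          # height/width of training frames
    num_layers = 4,            # downsampling layers in the encoder
    vq_codebook_size = 8192    # number of discrete latent codes
)

# Train the autoencoder on raw frames so that media can later be
# represented as discrete token sequences for the transformer stage
images = torch.randn(8, 3, 256, 256)
loss = vae(images, return_loss = True)
loss.backward()
```

Once trained, the frozen autoencoder supplies the token vocabulary that the transformer stage learns to model autoregressively.
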

Key Features

  • Supports both text-to-video and text-to-audio generation
  • Includes a VQGAN-style autoencoder for compressing media into discrete latent tokens
  • Uses a unified transformer with separate decoders for video and audio (see the training sketch after this list)
  • Modular components for training, inference, and content preprocessing
  • Allows conditioning on different modalities such as segmentation maps and sketches
  • Follows modern PyTorch practices for flexibility and performance
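
A hedged sketch of the second stage, training the text-to-video transformer on top of a pretrained autoencoder and then sampling from it. The `NUWA` class name matches the repository, but the exact constructor arguments and the `generate` signature here are assumptions made for illustration:

```python
import torch
from nuwa_pytorch import NUWA, VQGanVAE

# Assumes a pretrained autoencoder as in the earlier sketch
vae = VQGanVAE(dim = 512, image_size = 256, num_layers = 4, vq_codebook_size = 8192)

# Illustrative transformer configuration; names are assumptions
nuwa = NUWA(
    vae = vae,
    dim = 512,
    text_num_tokens = 20000,   # size of the text vocabulary
    max_video_frames = 5,      # frames modeled per clip
    dec_depth = 12,
    dec_heads = 8
)

# Training step: text conditions the decoder, which is trained with a
# next-token loss over the video token sequence produced by the VAE
text = torch.randint(0, 20000, (1, 256))
video = torch.randn(1, 5, 3, 256, 256)  # (batch, frames, channels, height, width)

loss = nuwa(text = text, video = video, return_loss = True)
loss.backward()

# Sampling: autoregressively generate video tokens, then decode them
# back to pixels with the autoencoder
sampled_video = nuwa.generate(text = text, num_frames = 5)
```

The joint video-and-audio variant described above follows the same pattern, with a second decoder producing audio tokens alongside the video tokens.
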