NÜWA-PyTorch: Multimodal Text-to-Video & Audio Transformer

Category: Deep Learning
License: MIT
Model Type: Text-to-Video / Text-to-Audio Generation
NÜWA-PyTorch is a PyTorch reimplementation of NÜWA, an attention-based transformer for multimodal generation. Its primary focus is text-to-video synthesis, and it extends to joint video and audio generation through a dual-decoder mechanism. The project includes tools for training VQGAN-based autoencoders that compress media into discrete tokens, hierarchical transformers that model those token sequences, and utilities for sampling video and audio outputs.
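
As a minimal sketch of the first training stage, here is how the VQGAN-style autoencoder might be trained to compress frames into discrete latent tokens. The `VQGanVAE` class name follows the repository, but the specific keyword arguments shown are illustrative assumptions and should be checked against the actual signature:

```python
import torch
from nuwa_pytorch import VQGanVAE  # assumes the package layout of the reference repo

# Illustrative hyperparameters; argument names are assumptions
vae = VQGanVAE(
    dim = 512,
    image_size = 256,          # height/width of training frames
    num_layers = 4,            # downsampling layers in the encoder
    vq_codebook_size = 8192    # number of discrete latent codes
)

# Train the autoencoder on raw frames so that media can later be
# represented as discrete token sequences for the transformer stage
images = torch.randn(8, 3, 256, 256)
loss = vae(images, return_loss = True)
loss.backward()
```

Once trained, the frozen autoencoder supplies the token vocabulary that the transformer stage learns to model autoregressively.
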

Key Features

  • Supports both text-to-video and text-to-audio generation
  • Includes a VQGAN-style autoencoder for compressing media into discrete latent tokens
  • Uses a unified transformer with separate decoders for video and audio (see the training sketch after this list)
  • Modular components for training, inference, and content preprocessing
  • Allows conditioning on different modalities such as segmentation maps and sketches
  • Follows modern PyTorch practices for flexibility and performance
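
A hedged sketch of the second stage, training the text-to-video transformer on top of a pretrained autoencoder and then sampling from it. The `NUWA` class name matches the repository, but the exact constructor arguments and the `generate` signature here are assumptions made for illustration:

```python
import torch
from nuwa_pytorch import NUWA, VQGanVAE

# Assumes a pretrained autoencoder as in the earlier sketch
vae = VQGanVAE(dim = 512, image_size = 256, num_layers = 4, vq_codebook_size = 8192)

# Illustrative transformer configuration; names are assumptions
nuwa = NUWA(
    vae = vae,
    dim = 512,
    text_num_tokens = 20000,   # size of the text vocabulary
    max_video_frames = 5,      # frames modeled per clip
    dec_depth = 12,
    dec_heads = 8
)

# Training step: text conditions the decoder, which is trained with a
# next-token loss over the video token sequence produced by the VAE
text = torch.randint(0, 20000, (1, 256))
video = torch.randn(1, 5, 3, 256, 256)  # (batch, frames, channels, height, width)

loss = nuwa(text = text, video = video, return_loss = True)
loss.backward()

# Sampling: autoregressively generate video tokens, then decode them
# back to pixels with the autoencoder
sampled_video = nuwa.generate(text = text, num_frames = 5)
```

The joint video-and-audio variant described above follows the same pattern, with a second decoder producing audio tokens alongside the video tokens.
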