Android DevHub

Tango: Latent Diffusion Models for Text‑to‑Audio Generation

Category: Deep Learning

License: Other

Model Type: Speech Synthesis

Tango is a family of latent diffusion models that generate high-quality audio from text prompts. It uses an instruction-tuned LLM (Flan-T5) as the prompt encoder and a UNet-based latent diffusion model as the generator. Despite training on a dataset roughly 63x smaller than those used by many prior works, Tango matches or surpasses state-of-the-art performance on benchmarks such as AudioCaps. Tango 2 further improves results by aligning generations with direct preference optimization (DPO) on the Audio-Alpaca dataset.
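To make the alignment step more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not Tango 2's training code: Tango 2 applies a diffusion-adapted form of DPO, whereas this sketch uses plain scalar log-likelihoods. The function name `dpo_loss`, its parameters, and the default `beta` are hypothetical choices made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are scalar log-likelihoods; this is an illustrative
    sketch, not the diffusion-specific loss used to train Tango 2.
    """
    # How much more the trained model favours each sample than the frozen reference.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), evaluated in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # The model prefers the chosen audio more than the reference does: loss < log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    # The model prefers the rejected audio: loss > log(2).
    print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

The loss equals log(2) when the model and the reference agree, and falls as the model assigns relatively more likelihood to the preferred sample. In a text-to-audio setting, the chosen and rejected samples would be two audio clips generated for the same prompt.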

Key Features

Text-to-audio generation supporting speech, sound effects, and music
Instruction-tuned Flan-T5 prompt encoder (frozen during training)
Latent diffusion model with audio VAE and vocoder architecture
Competitive performance with much less training data
Tango 2 includes alignment via Direct Preference Optimization (DPO)
Pretrained models and inference scripts included
Support for batch and single-sample generation
Fully open-source and research-oriented

GitHub Tango2-web

Similar Projects

Mustango: Controllable Text-to-Music Generation via Diffusion

Amphion: Real-Time Audio Generation Toolkit by OpenMMLab

StreamSpeech: All‑in‑One Streaming Speech Recognition, Translation & Synthesis

Auffusion: Leveraging Diffusion and Large Language Models for Text-to-Audio Generation

Awesome LLMs Meet Multimodal Generation

MagicDrive: Street View Generation with Diverse 3D Geometry Control
