Tango: Latent Diffusion Models for Text‑to‑Audio Generation

Category: Deep Learning
License: Other
Model Type: Speech Synthesis
Tango is a family of latent diffusion models for generating high-quality audio conditioned on text prompts. It uses an instruction‑tuned LLM (Flan‑T5) as the prompt encoder and a UNet‑based latent diffusion model as the generator. Despite training on a dataset roughly 63 times smaller than those used in prior work, Tango matches or surpasses state‑of‑the‑art performance on benchmarks such as AudioCaps. Tango 2 further improves results by aligning generations with direct preference optimization (DPO) on the Audio‑Alpaca dataset.

Key Features

  • Text-to-audio generation supporting speech, sound effects, and music
  • Instruction-tuned Flan-T5 prompt encoder (frozen during training)
  • Latent diffusion model with audio VAE and vocoder architecture
  • Competitive performance with much less training data
  • Tango 2 includes alignment via Direct Preference Optimization (DPO)
  • Pretrained models and inference scripts included
  • Support for batch and single-sample generation
  • Fully open-source and research-oriented
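
The preference-alignment idea behind Tango 2 can be illustrated with the standard DPO objective for a single (preferred, rejected) pair. This is a minimal sketch, not Tango 2's training code: the function name `dpo_loss`, the scalar log-likelihood inputs, and the default `beta` are assumptions made for illustration, and the objective actually used for a diffusion model may differ in how these quantities are computed.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_win / logp_lose: log-likelihoods of the preferred and rejected
    outputs under the model being trained (hypothetical scalar inputs).
    ref_logp_win / ref_logp_lose: the same quantities under the frozen
    reference model.
    beta: temperature controlling how far the model may drift from the
    reference (the value 0.1 is an assumed default).
    """
    # How much more the trained model favours the preferred output,
    # relative to the reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable softplus form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the model matches the reference, the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Favouring the preferred output more than the reference lowers the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below log(2) only when the trained model ranks the preferred audio above the rejected one by a wider margin than the reference model does, which is the behaviour preference alignment aims for.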
