Consistency‑TTA: Accelerated Text‑to‑Audio via Consistency Distillation

Consistency-TTA is a research framework for efficient text-to-audio (TTA) generation built on consistency models. It distills a diffusion-based generator into a model that synthesizes audio from text in a single inference step rather than through an iterative denoising loop, which substantially reduces computational cost. The project focuses on accelerating diffusion-based audio generation while preserving the diversity and fidelity of the output.
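To illustrate why single-step inference is cheaper, here is a minimal sketch that counts network evaluations for an iterative sampler and for a one-shot consistency-style sampler. Everything in it is hypothetical: CountingModel is a toy stand-in for a trained network, and diffusion_sample and consistency_sample are illustrative names, not this project's API. The step count of 200 is an arbitrary example value.

```python
import numpy as np


class CountingModel:
    """Toy stand-in for a trained network (assumption); counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, z, t, text_emb):
        self.calls += 1
        # Toy update: move the latent halfway toward the text embedding.
        return text_emb + 0.5 * (z - text_emb)


def diffusion_sample(model, text_emb, num_steps, rng):
    """Iterative sampling: one network evaluation per denoising step."""
    z = rng.standard_normal(text_emb.shape)
    for t in np.linspace(1.0, 0.0, num_steps, endpoint=False):
        z = model(z, t, text_emb)
    return z


def consistency_sample(model, text_emb, rng):
    """Single-step sampling: map noise to a latent with one evaluation."""
    z = rng.standard_normal(text_emb.shape)
    return model(z, 1.0, text_emb)


text_emb = np.zeros(8)  # placeholder for a text embedding

multi_step = CountingModel()
diffusion_sample(multi_step, text_emb, num_steps=200, rng=np.random.default_rng(0))

single_step = CountingModel()
consistency_sample(single_step, text_emb, rng=np.random.default_rng(0))

print(multi_step.calls, single_step.calls)  # 200 1
```

The sketch only compares the number of forward passes; real speedups also depend on how guidance is applied and on the cost of decoding the latent to a waveform.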

Key Features

  • Single-step text-to-audio generation using consistency models
  • Reduces inference time by over 400x compared to traditional diffusion models
  • Utilizes classifier-free guidance in latent space
  • Fine-tuned for audio-text alignment using contrastive audio-language models
  • Achieves strong performance on Fréchet Audio Distance (FAD), KL divergence, and human evaluation
  • Demonstration-ready interface with evaluation scripts and samples
  • Lightweight, scalable, and suitable for real-time applications
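
The classifier-free guidance feature listed above can be sketched as follows. This is the commonly used guidance formula applied to latent-space predictions, not code from this project; cfg_combine and guidance_scale are hypothetical names, and some implementations use a differently parameterized scale.

```python
import numpy as np


def cfg_combine(cond_pred, uncond_pred, guidance_scale):
    """Blend text-conditioned and unconditioned predictions.

    guidance_scale = 0 returns the unconditioned prediction,
    guidance_scale = 1 returns the conditioned prediction, and larger
    values extrapolate further toward the text condition.
    """
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)


# Toy latent predictions (assumed values, for illustration only).
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 0.0])

guided = cfg_combine(cond, uncond, guidance_scale=3.0)
print(guided)  # [3. 6.]
```

In a multi-step sampler this blend usually costs two network evaluations per step, one conditioned and one unconditioned, so how guidance is handled directly affects inference cost.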