Consistency-TTA is a research-driven framework for efficient text-to-audio generation using consistency models. It enables single-step inference, allowing high-quality audio synthesis from text with significantly reduced computational cost. The project focuses on accelerating diffusion-based audio generation while preserving diversity and fidelity.
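The core efficiency gain comes from replacing the iterative denoising loop of a diffusion sampler with a single consistency-model call. The toy sketch below illustrates the difference in network calls only; the stand-in "networks" and step counts are illustrative assumptions, not the project's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_sample(denoise_step, z, num_steps=50):
    """Iterative diffusion sampling: one network call per denoising step."""
    calls = 0
    for t in reversed(range(1, num_steps + 1)):
        z = denoise_step(z, t)
        calls += 1
    return z, calls

def consistency_sample(consistency_fn, z, t_max=50):
    """Consistency sampling: a single call maps noise directly to a sample."""
    return consistency_fn(z, t_max), 1

# Hypothetical stand-ins for the trained networks (assumptions):
denoise_step = lambda z, t: z * (1.0 - 1.0 / t)  # shrink noise gradually
consistency_fn = lambda z, t: z * 0.0            # jump straight to a sample

z0 = rng.standard_normal(4)
_, diff_calls = diffusion_sample(denoise_step, z0.copy())
_, cons_calls = consistency_sample(consistency_fn, z0.copy())
print(diff_calls, cons_calls)  # 50 network calls vs. 1
```

Because inference cost is dominated by network evaluations, collapsing the loop to one call is what enables the large speedup reported above.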
## Key Features

- Single-step text-to-audio generation via consistency models
- Over 400x faster inference than traditional iterative diffusion models
- Classifier-free guidance applied in latent space
- Fine-tuned for audio-text alignment with contrastive audio-language models
- Strong results on Fréchet Audio Distance (FAD), KL divergence, and human evaluation
- Demonstration-ready interface with evaluation scripts and audio samples
- Lightweight, scalable, and suitable for real-time applications
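Classifier-free guidance combines a conditional and an unconditional prediction by extrapolating between them with a guidance weight. The snippet below is a minimal sketch of that standard formula operating on latent-space vectors; the array values and weight are illustrative assumptions.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance weight w."""
    return uncond_pred + w * (cond_pred - uncond_pred)

# Hypothetical latent-space predictions (assumptions for illustration):
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 0.0])
out = cfg_combine(cond, uncond, w=3.0)
print(out)  # [3. 6.]
```

A weight of `w = 1` recovers the purely conditional prediction, while larger weights trade diversity for stronger text adherence; in a distilled single-step model this guidance can be baked into the student during training rather than applied at every sampling step.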