Consistency-TTA is a research-driven framework for efficient text-to-audio generation using consistency models. It enables single-step inference, allowing high-quality audio synthesis from text with significantly reduced computational cost. The project focuses on accelerating diffusion-based audio generation while preserving diversity and fidelity.
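The core efficiency gain comes from replacing the iterative denoising loop of a diffusion sampler with a single consistency-model call. The toy sketch below illustrates the difference in network calls only; the stand-in "networks" and step counts are illustrative assumptions, not the project's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_sample(denoise_step, z, num_steps=50):
    """Iterative diffusion sampling: one network call per denoising step."""
    calls = 0
    for t in reversed(range(1, num_steps + 1)):
        z = denoise_step(z, t)
        calls += 1
    return z, calls

def consistency_sample(consistency_fn, z, t_max=50):
    """Consistency sampling: a single call maps noise directly to a sample."""
    return consistency_fn(z, t_max), 1

# Hypothetical stand-ins for the trained networks (assumptions):
denoise_step = lambda z, t: z * (1.0 - 1.0 / t)  # shrink noise gradually
consistency_fn = lambda z, t: z * 0.0            # jump straight to a sample

z0 = rng.standard_normal(4)
_, diff_calls = diffusion_sample(denoise_step, z0.copy())
_, cons_calls = consistency_sample(consistency_fn, z0.copy())
print(diff_calls, cons_calls)  # 50 network calls vs. 1
```

Because inference cost is dominated by network evaluations, collapsing the loop to one call is what enables the large speedup reported above.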
## Key Features

- Single-step text-to-audio generation via consistency models
- Over 400x faster inference than traditional iterative diffusion models
- Classifier-free guidance applied in latent space
- Fine-tuned for audio-text alignment with contrastive audio-language models
- Strong results on Fréchet Audio Distance (FAD), KL divergence, and human evaluation
- Demonstration-ready interface with evaluation scripts and audio samples
- Lightweight, scalable, and suitable for real-time applications
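Classifier-free guidance combines a conditional and an unconditional prediction by extrapolating between them with a guidance weight. The snippet below is a minimal sketch of that standard formula operating on latent-space vectors; the array values and weight are illustrative assumptions.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance weight w."""
    return uncond_pred + w * (cond_pred - uncond_pred)

# Hypothetical latent-space predictions (assumptions for illustration):
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 0.0])
out = cfg_combine(cond, uncond, w=3.0)
print(out)  # [3. 6.]
```

A weight of `w = 1` recovers the purely conditional prediction, while larger weights trade diversity for stronger text adherence; in a distilled single-step model this guidance can be baked into the student during training rather than applied at every sampling step.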