SoundCTM-DiT is a PyTorch implementation of a full-band text-to-sound generation framework presented at ICLR 2025. It unites score-based diffusion and consistency models in a single efficient pipeline, so one trained network supports both fast one-step generation and multi-step generation of diverse, high-fidelity audio at a fraction of the usual diffusion inference cost.
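To make the one-step/multi-step trade-off concrete, below is a minimal sketch of consistency-trajectory-style sampling. The network `G`, its `(x, t, s, cond)` call signature, and the noise schedule are illustrative assumptions based on the general CTM sampling scheme, not this repository's actual API.

```python
import torch

@torch.no_grad()
def ctm_sample(G, text_emb, shape, sigmas, gamma=0.0, device="cuda"):
    """Sketch of CTM-style sampling with an 'anytime-to-anytime' network
    G(x, t, s, cond) that jumps from noise level t to level s along the
    probability-flow ODE trajectory (G and its signature are assumptions).

    sigmas: decreasing noise levels [sigma_max, ...]; a single entry
    gives one-step generation.
    """
    x = sigmas[0] * torch.randn(shape, device=device)  # start from pure noise
    for i in range(len(sigmas)):
        t = sigmas[i]
        t_next = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        # gamma interpolates between deterministic jumps (gamma = 0) and
        # consistency-model-style "jump to 0, then re-noise" (gamma = 1).
        s = (1.0 - gamma**2) ** 0.5 * t_next
        x = G(x, t, s, text_emb)                       # trajectory jump t -> s
        if t_next > 0:                                 # stochastic re-noising
            x = x + (t_next**2 - s**2) ** 0.5 * torch.randn_like(x)
    return x
```

With `sigmas` holding only the maximum noise level, the loop reduces to a single network call; adding more levels trades inference speed for sample quality.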
## Key Features

- Offers both one-step and multi-step audio generation from text prompts
- Unifies score-based diffusion and consistency models for high-quality results
- Supports two model variants conditioned on CLAP text embeddings (see the embedding sketch below)
- Includes scripts and a Docker setup for training, inference, evaluation, and sample generation
- Supports the AudioCaps dataset and includes standard evaluation metrics (FAD, FD, KL divergence, etc.; see the FAD sketch below)
- Demonstrates inference through both command-line and Jupyter notebook interfaces
- Integrates Weights & Biases logging for experiment tracking (see the logging sketch below)
- Provides pretrained checkpoints and reproducible pipelines for research
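Since conditioning runs on CLAP text embeddings, here is a minimal sketch of producing them with the `laion_clap` package; how the embedding is wired into the two model variants is an assumption about the pipeline, not this repository's exact code.

```python
import laion_clap

# Load a pretrained CLAP model (downloads the default checkpoint on first use).
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()

captions = ["a dog barking in the distance", "rain falling on a tin roof"]
# Shape (batch, 512): per-caption embeddings used as the conditioning signal.
text_emb = clap.get_text_embedding(captions, use_tensor=True)
```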
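Among the evaluation metrics, FAD is the Fréchet distance between Gaussian fits of embedding statistics for reference and generated audio. A small sketch of the distance itself, assuming the embeddings (e.g. from a VGGish model) have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):   # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)

# mu/cov are the mean and covariance of embeddings over each audio set:
# fad = frechet_distance(mu_ref, cov_ref, mu_gen, cov_gen)
```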
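Weights & Biases tracking typically reduces to one `wandb.init` plus periodic `wandb.log` calls; the project name, config keys, and metric names below are placeholders rather than this repository's settings.

```python
import wandb

run = wandb.init(project="soundctm-dit", config={"lr": 1e-4, "sampling_steps": 1})
for step in range(100):
    loss = 1.0 / (step + 1)        # stand-in for the real training loss
    wandb.log({"train/loss": loss}, step=step)
run.finish()
```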