SoundCTM-DiT: Unified Score-Based & Consistency Models for Full-Band Text-to-Sound

Category: Other
License: MIT
Model Type: Text-to-Sound Generation
SoundCTM-DiT is a PyTorch implementation of a full-band text-to-sound generation framework introduced at ICLR 2025. It unifies score-based diffusion models and consistency models in a single pipeline, supporting both one-step generation for fast inference and multi-step generation for higher-fidelity, more diverse audio outputs.
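To illustrate how a single trained network can serve both regimes, here is a minimal, framework-agnostic sketch of consistency-trajectory-style sampling. The `g_theta` placeholder stands in for the trained "anytime-to-anytime" network (in SoundCTM, a DiT conditioned on CLAP text embeddings); the noise schedule values (`sigma_max=80.0`, `sigma_min=0.002`) and all function names here are illustrative assumptions, not the repository's actual API.

```python
import math
import numpy as np

def g_theta(x_t, t, s):
    """Toy placeholder for the trained network G(x_t, t, s), which jumps a
    noisy sample from noise level t directly to level s. The real model is
    a text-conditioned DiT; this rescaling only makes the loop runnable."""
    return x_t * (s / t)

def ctm_sample(shape, steps, sigma_max=80.0, sigma_min=0.002, seed=0):
    """Sampling sketch: chain jumps down a decreasing noise schedule.
    steps=1 is one-step generation; more steps trades speed for quality."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape) * sigma_max  # start from pure noise
    # Geometric noise schedule from sigma_max down to sigma_min.
    sigmas = np.logspace(math.log10(sigma_max), math.log10(sigma_min), steps + 1)
    for t, s in zip(sigmas[:-1], sigmas[1:]):
        x = g_theta(x, t, s)  # jump from noise level t down to level s
    return x

one_step = ctm_sample((1, 16), steps=1)   # fastest: a single network call
four_step = ctm_sample((1, 16), steps=4)  # more calls, typically higher quality
```

The key design point is that one-step and multi-step generation reuse the same weights; only the number of jumps along the noise schedule changes.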

Key Features

  • Offers one-step or multi-step audio generation from text
  • Unifies score-based diffusion with consistency models for high-quality results
  • Supports two model variants for conditioning on CLAP text embeddings
  • Includes scripts and Docker setup for training, inference, evaluation, and sample generation
  • Compatible with the AudioCaps dataset and includes evaluation metrics (FAD, FD, KL divergence, etc.)
  • Demonstrates inference through command-line and Jupyter notebook interfaces
  • Integrates Weights & Biases logging for experiment tracking
  • Provides pretrained checkpoints and reproducible pipelines for research
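The FAD and FD metrics listed above both compute a Fréchet distance between Gaussian statistics (mean and covariance) of embeddings from generated and reference audio. A minimal numpy sketch of that computation, for illustration only and not the repository's evaluation code:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 * (cov1 @ cov2)^(1/2))."""
    # Symmetric PSD square root of cov1 via eigendecomposition.
    vals, vecs = np.linalg.eigh(cov1)
    sqrt_cov1 = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    # Tr((cov1 @ cov2)^(1/2)) equals Tr((cov1^(1/2) cov2 cov1^(1/2))^(1/2)),
    # and the inner matrix is symmetric PSD, so eigvalsh applies.
    inner = sqrt_cov1 @ cov2 @ sqrt_cov1
    tr_sqrt = np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Identical statistics give distance zero; in FAD the statistics come from
# audio embeddings (e.g. of generated vs. AudioCaps reference clips).
rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 8))
mu, cov = emb.mean(axis=0), np.cov(emb, rowvar=False)
zero_dist = frechet_distance(mu, cov, mu, cov)
```

In practice the embeddings come from a pretrained audio classifier (e.g. VGGish for FAD), which this sketch leaves out.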