SoundCTM-DiT is a PyTorch implementation of a full-band text-to-sound generation framework presented at ICLR 2025. It unites score-based diffusion and consistency models in a single efficient pipeline, so one trained network supports both fast one-step generation and multi-step generation of diverse, high-fidelity audio at a fraction of the usual diffusion inference cost.
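To make the one-step/multi-step trade-off concrete, below is a minimal sketch of consistency-trajectory-style sampling. The network `G`, its `(x, t, s, cond)` call signature, and the noise schedule are illustrative assumptions based on the general CTM sampling scheme, not this repository's actual API.

```python
import torch

@torch.no_grad()
def ctm_sample(G, text_emb, shape, sigmas, gamma=0.0, device="cuda"):
    """Sketch of CTM-style sampling with an 'anytime-to-anytime' network
    G(x, t, s, cond) that jumps from noise level t to level s along the
    probability-flow ODE trajectory (G and its signature are assumptions).

    sigmas: decreasing noise levels [sigma_max, ...]; a single entry
    gives one-step generation.
    """
    x = sigmas[0] * torch.randn(shape, device=device)  # start from pure noise
    for i in range(len(sigmas)):
        t = sigmas[i]
        t_next = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        # gamma interpolates between deterministic jumps (gamma = 0) and
        # consistency-model-style "jump to 0, then re-noise" (gamma = 1).
        s = (1.0 - gamma**2) ** 0.5 * t_next
        x = G(x, t, s, text_emb)                       # trajectory jump t -> s
        if t_next > 0:                                 # stochastic re-noising
            x = x + (t_next**2 - s**2) ** 0.5 * torch.randn_like(x)
    return x
```

With `sigmas` holding only the maximum noise level, the loop reduces to a single network call; adding more levels trades inference speed for sample quality.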
## Key Features

- Offers both one-step and multi-step audio generation from text prompts
- Unifies score-based diffusion and consistency models for high-quality results
- Supports two model variants conditioned on CLAP text embeddings (see the embedding sketch below)
- Includes scripts and a Docker setup for training, inference, evaluation, and sample generation
- Supports the AudioCaps dataset and includes standard evaluation metrics (FAD, FD, KL divergence, etc.; see the FAD sketch below)
- Demonstrates inference through both command-line and Jupyter notebook interfaces
- Integrates Weights & Biases logging for experiment tracking (see the logging sketch below)
- Provides pretrained checkpoints and reproducible pipelines for research
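Since conditioning runs on CLAP text embeddings, here is a minimal sketch of producing them with the `laion_clap` package; how the embedding is wired into the two model variants is an assumption about the pipeline, not this repository's exact code.

```python
import laion_clap

# Load a pretrained CLAP model (downloads the default checkpoint on first use).
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()

captions = ["a dog barking in the distance", "rain falling on a tin roof"]
# Shape (batch, 512): per-caption embeddings used as the conditioning signal.
text_emb = clap.get_text_embedding(captions, use_tensor=True)
```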
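Among the evaluation metrics, FAD is the Fréchet distance between Gaussian fits of embedding statistics for reference and generated audio. A small sketch of the distance itself, assuming the embeddings (e.g. from a VGGish model) have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):   # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)

# mu/cov are the mean and covariance of embeddings over each audio set:
# fad = frechet_distance(mu_ref, cov_ref, mu_gen, cov_gen)
```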
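Weights & Biases tracking typically reduces to one `wandb.init` plus periodic `wandb.log` calls; the project name, config keys, and metric names below are placeholders rather than this repository's settings.

```python
import wandb

run = wandb.init(project="soundctm-dit", config={"lr": 1e-4, "sampling_steps": 1})
for step in range(100):
    loss = 1.0 / (step + 1)        # stand-in for the real training loss
    wandb.log({"train/loss": loss}, step=step)
run.finish()
```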