SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

Category: Deep Learning

License: MIT

Model Type: Text-to-Sound Generation

SoundCTM is a text-to-sound model that generates high-quality, full-band audio from textual descriptions. It addresses two problems of earlier models, slow inference and semantic misalignment, with a flexible framework that lets creators switch between fast, high-quality one-step generation and higher-quality multi-step generation. Creators can therefore iterate quickly through trial and error and then refine a sound until it matches their artistic intent.

Key Features

Flexible Generation: Switch between fast, high-quality one-step generation and higher-quality multi-step generation.
Real-Time Performance: Achieve real-time generation on a single NVIDIA RTX A6000 GPU.
Training-Free Control: Utilize a training-free controllable framework for sound generation.
High-Quality Output: Generate full-band (44.1 kHz) audio with high fidelity.
Open-Source Implementation: Available on GitHub with pretrained models and inference tools.

Similar Projects

Awesome LLMs Meet Multimodal Generation

SubToAudio: Subtitle-to-Audio Conversion with Coqui TTS

MagicDrive: Street View Generation with Diverse 3D Geometry Control

Tango: Latent Diffusion Models for Text‑to‑Audio Generation

OpenMusic: Quality-Aware Diffusion Transformer for Text-to-Music Generation

Mustango: Controllable Text-to-Music Generation via Diffusion