SoundCTM is a model that generates high-quality audio from text descriptions. It addresses two problems in earlier models, slow inference and semantic misalignment, with a flexible framework that lets creators switch between high-quality one-step generation and higher-quality multi-step generation. This flexibility supports efficient trial-and-error refinement, so creators can iterate on a sound until it matches their artistic intent.
Key Features
Flexible Generation: Switch between high-quality one-step generation and higher-quality multi-step generation.
Real-Time Performance: Achieve real-time generation on a single NVIDIA RTX A6000 GPU.
Training-Free Control: Control sound generation through a framework that requires no additional training.
High-Quality Output: Generate full-band (44.1 kHz) audio with high fidelity.
Open-Source Implementation: Available on GitHub with pretrained models and inference tools.
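
To make the real-time and full-band claims concrete, the sketch below shows how one might compute the sample count of a 44.1 kHz clip and a real-time factor (generation time divided by audio duration, where a value at or below 1.0 means generation keeps up with playback). The function names are hypothetical helpers written for this illustration and are not part of the SoundCTM codebase; the timings in the usage note are made-up inputs, not measurements.

```python
SAMPLE_RATE_HZ = 44_100  # full-band rate quoted in the feature list


def num_samples(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in a mono clip of the given duration."""
    return round(duration_s * sample_rate_hz)


def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Wall-clock generation time divided by the duration of the audio produced."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return generation_s / audio_s


def is_real_time(generation_s: float, audio_s: float) -> bool:
    """True when audio is generated at least as fast as it plays back."""
    return real_time_factor(generation_s, audio_s) <= 1.0
```

For example, with illustrative inputs only, a 10-second clip contains `num_samples(10.0)` = 441000 samples, and generating it in 2.5 seconds would give `real_time_factor(2.5, 10.0)` = 0.25, which counts as real-time under this definition.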