SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

Category: Deep Learning
License: MIT
Model Type: Text-to-Sound Generation
SoundCTM is a text-to-sound model that generates high-quality audio from textual descriptions. It addresses two problems in previous models, slow inference and semantic misalignment, by introducing a flexible framework that lets creators switch between high-quality one-step generation and higher-quality multi-step generation. This flexibility supports efficient trial and error when refining sounds to match an artistic intent.

Key Features

  • Flexible Generation: Switch between high-quality one-step generation and higher-quality multi-step generation.
  • Real-Time Performance: Achieve real-time generation on a single NVIDIA RTX A6000 GPU.
  • Training-Free Control: Utilize a training-free controllable framework for sound generation.
  • High-Quality Output: Generate full-band (44.1 kHz) audio with high fidelity.
  • Open-Source Implementation: Available on GitHub with pretrained models and inference tools.
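
The switch between one-step and multi-step generation can be illustrated with a toy sketch. This is not SoundCTM's code or API. It assumes a one-dimensional "dataset" made of a single point MU, for which the jump from noise level t to a lower level s has a closed form. The names jump, one_step, multi_step and the gamma re-noising parameter are hypothetical and chosen only for illustration. In a real model, jump would be a trained network conditioned on the text prompt.

```python
import math
import random

MU = 0.5  # toy "data": a single point, so the exact jump has a closed form


def jump(x_t, t, s):
    """Hypothetical stand-in for a trained network.

    Moves a sample from noise level t > 0 to a lower noise level s >= 0
    along the trajectory x_t = MU + t * eps (exact for this toy data only).
    """
    return MU + (s / t) * (x_t - MU)


def one_step(x_T, T):
    # One-step generation: a single jump from the highest noise level to 0.
    return jump(x_T, T, 0.0)


def multi_step(x_T, times, gamma=0.0, rng=random):
    # Multi-step generation over decreasing noise levels ending at 0.
    # gamma = 0 is fully deterministic; gamma > 0 jumps slightly below the
    # next level and adds fresh noise to land back on it (an assumption of
    # this sketch, not a documented SoundCTM setting).
    x = x_T
    for t, t_next in zip(times[:-1], times[1:]):
        s = math.sqrt(1.0 - gamma ** 2) * t_next
        x = jump(x, t, s)
        x += gamma * t_next * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
T = 80.0
x_T = MU + T * rng.gauss(0.0, 1.0)  # start from pure noise at level T

fast = one_step(x_T, T)
refined = multi_step(x_T, [T, 10.0, 1.0, 0.0], gamma=0.3, rng=rng)
print(fast, refined)
```

Because the toy jump is exact, both calls land on the same point. With a learned network the jump is only approximate, which is why spending more steps can improve quality at the cost of extra network evaluations.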