Creative Text-to-Audio Generation (CTAG)

License: MIT

Model Type: Speech Synthesis

CTAG is a research-oriented tool that generates audio from text prompts using a modular synthesizer engine. It is built on SynthAX, a JAX-based synthesizer framework, is configured through Hydra, and uses LAION-CLAP models to align the audio output with natural language descriptions. The system was introduced at ICML 2024.
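To make the audio-text alignment idea concrete, here is a minimal sketch of how a similarity score between a text embedding and several candidate audio embeddings might be computed and used to pick the best match. It is not CTAG's code: the function names `cosine_similarity` and `best_candidate` are hypothetical, and the short vectors are placeholders standing in for embeddings that a CLAP-style model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def best_candidate(text_embedding, audio_embeddings):
    """Return (index, score) of the audio embedding closest to the text embedding."""
    scores = [cosine_similarity(text_embedding, e) for e in audio_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Placeholder vectors (assumption): real embeddings are much higher-dimensional
# and would come from the text and audio encoders of a CLAP-style model.
text_emb = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

index, score = best_candidate(text_emb, candidates)
print(index, round(score, 3))
```

In a text-to-audio setting, a score of this kind could serve as the objective that a search over synthesizer parameters tries to maximize. The search procedure itself is not described here and is left out of the sketch.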

Key Features

Converts descriptive text prompts into synthetic audio
Utilizes SynthAX, a JAX-based modular synthesizer
Supports LAION-CLAP audio-text similarity models
Fully customizable architecture via Hydra configuration
Enables hyperparameter tuning with Ax
Provides example scripts for quick experimentation
Compatible with CPU and GPU environments
Includes reproducible generation pipelines and model checkpoints

Similar Projects

AV Benchmark: Audio‑Text and Audio‑Visual Generation Metrics

AI Text-to-Audio with Latent Diffusion

RaveFussion

Consistency‑TTA: Accelerated Text‑to‑Audio via Consistency Distillation

AudioLDM Google Colab Interface

Soundstorm – AI‑Powered Audio Toolkit