Make-An-Audio is a PyTorch implementation of the ICML 2023 diffusion-based model that generates high-fidelity audio from text prompts. It pairs a prompt-enhanced diffusion model with a spectrogram autoencoder and uses contrastive language–audio pretraining (CLAP) to produce diverse, controllable, and realistic audio.
Key Features
Diffusion probabilistic model conditioned on text prompts
Spectrogram autoencoder for efficient audio representation
CLAP-based pretraining for strong text-audio alignment
High-quality audio synthesis demonstrated across benchmarks
Support for multi-modal audio inpainting and personalized audio generation
Includes pretrained checkpoints, inference scripts, and training pipelines
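
The features above describe a diffusion model that works on a compressed spectrogram representation. As a rough illustration of the forward (noising) half of such a model, here is a minimal sketch in plain Python. It is not code from this repository: the function names, the linear schedule, and its start and end values are assumptions taken from common DDPM defaults, and the four-element list only stands in for a flattened spectrogram latent.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1..beta_T (assumed defaults, not the repo's)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]
    return noisy, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
latent = [0.5, -1.0, 0.25, 2.0]  # hypothetical stand-in for a spectrogram latent
noisy, noise = add_noise(latent, t=500, alpha_bars=alpha_bars, rng=rng)
```

In a text-conditioned setup, a denoising network would typically be trained to predict `noise` from `noisy`, the step index, and a text embedding; that network is outside the scope of this sketch.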