Make‑An‑Audio: Prompt‑Enhanced Diffusion Model for Text‑to‑Audio Generation

Category: Deep Learning
License: MIT
Model Type: Text-to-Audio Generation

Make‑An‑Audio is a PyTorch implementation of the ICML 2023 diffusion-based model that generates high-fidelity audio from text prompts. It pairs a prompt‑enhanced diffusion model with a spectrogram autoencoder, so that diffusion operates on a compact spectrogram representation rather than on raw waveforms, and it relies on contrastive language–audio pretraining (CLAP) to tie the text prompt to the audio. The result is diverse, controllable, and realistic audio output.
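
To make the diffusion part of this description concrete, the sketch below shows the forward (noising) step of a generic DDPM-style model applied to a small latent vector, such as one produced by a spectrogram autoencoder. It is an illustration under stated assumptions, not the repository's code: the linear schedule, the 1000 steps, the beta range, and the function names `linear_beta_schedule`, `cumulative_alphas`, and `q_sample` are all hypothetical choices, and plain Python lists stand in for PyTorch tensors.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (range chosen for illustration only)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng):
    """Draw z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I).

    Returns the noised latent and the noise that was added; a denoising
    network would be trained to predict that noise from z_t, t, and the
    text condition.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [signal * x + sigma * n for x, n in zip(z0, noise)]
    return z_t, noise


# Hypothetical latent standing in for an encoded spectrogram patch.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
latent = [1.0, -2.0, 0.5]
noised_early, _ = q_sample(latent, 10, alpha_bars, rng)    # mostly signal
noised_late, _ = q_sample(latent, 999, alpha_bars, rng)    # almost pure noise
```

Early timesteps keep most of the latent intact, while the final timestep is nearly indistinguishable from Gaussian noise. Generation runs this process in reverse, starting from noise and denoising step by step under the text condition.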

Key Features

  • Diffusion probabilistic model conditioned on text prompts
  • Spectrogram autoencoder for efficient audio representation
  • CLAP-based pretraining for strong text-audio alignment
  • High-quality audio synthesis, supported by objective and subjective benchmark evaluations
  • Support for multi-modal audio inpainting and personalized audio scenarios
  • Includes pretrained checkpoints, inference scripts, and training pipelines
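
The text-audio alignment mentioned in the feature list comes from contrastive language–audio pretraining. As a rough illustration of that idea, the sketch below computes a symmetric contrastive loss over a batch of paired text and audio embeddings, in the style popularized by CLIP-like models. It is not taken from this repository or from any CLAP implementation: the function names, the temperature of 0.07, and the toy embeddings are assumptions, and plain Python lists are used in place of PyTorch tensors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(text_embs, audio_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    Row i of text_embs and row i of audio_embs are assumed to describe the
    same clip (the positive pair); every other pairing in the batch is
    treated as a negative.
    """
    text = [l2_normalize(v) for v in text_embs]
    audio = [l2_normalize(v) for v in audio_embs]
    n = len(text)
    # logits[i][j] = similarity between text i and audio j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(t, u)) / temperature for u in audio]
              for t in text]
    logits_transposed = [list(col) for col in zip(*logits)]
    text_to_audio = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    audio_to_text = -sum(log_softmax(logits_transposed[i])[i] for i in range(n)) / n
    return 0.5 * (text_to_audio + audio_to_text)


# Toy 2-D embeddings (hypothetical values): matched pairs give a low loss,
# mismatched pairs give a high one.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each caption toward its own audio clip and away from the others in the batch, which is what lets a text encoder trained this way serve as a conditioning signal for audio generation.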