EzAudio – Efficient Diffusion Transformer for Text-to-Audio

Category: Deep Learning
License: MIT
Model Type: Text-to-Audio Generation
EzAudio is a high-quality, open-source text-to-audio model that generates realistic audio from textual prompts. It operates in the latent space of a 1D waveform representation, using a diffusion transformer architecture tailored for efficient and robust audio synthesis, which eliminates the need for a separate vocoder.
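
The snippet below sketches this flow in minimal PyTorch: a text condition steers a denoiser operating on a 1D latent, and a 1D VAE decoder maps the denoised latent straight to waveform samples. The module names, shapes, and layer choices are placeholders for illustration, not EzAudio's actual components.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the diffusion transformer: predicts a denoised
    1D latent from a noisy latent plus a text condition."""
    def __init__(self, latent_dim=64, cond_dim=512):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.net = nn.Conv1d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, cond):
        # (batch, channels, time); broadcast the condition over time
        return self.net(noisy_latent + self.cond_proj(cond).unsqueeze(-1))

# A 1D waveform VAE decoder maps latents straight to audio samples,
# which is what removes the need for a separate vocoder stage.
vae_decoder = nn.ConvTranspose1d(64, 1, kernel_size=320, stride=160)

cond = torch.randn(2, 512)              # pooled text embedding (placeholder)
noisy_latent = torch.randn(2, 64, 250)  # (batch, latent_channels, latent_time)

denoised = TinyDenoiser()(noisy_latent, cond)
waveform = vae_decoder(denoised)        # (batch, 1, audio_samples)
print(waveform.shape)
```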

Key Features

  • Direct generation of waveforms in latent space using a 1D VAE backbone
  • Efficient architecture combining adaptive layer normalization (AdaLN), long-skip connections, and rotary positional encoding for improved stability and speed (a simplified block sketch follows this list)
  • Multi-stage training combining unsupervised learning, automatic audio-caption alignment, and human-annotated fine-tuning
  • Classifier-free guidance (CFG) rescaling for stronger prompt adherence without sacrificing audio quality (see the rescaling sketch below)
  • High-quality outputs outperforming many open-source alternatives
  • Support for audio editing and inpainting
  • Pretrained checkpoints and inference pipelines provided for ease of use (a hedged usage example follows below)
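
The adaptive layer normalization and long-skip connections mentioned above can be illustrated with a small PyTorch sketch. The block below follows the common DiT-style AdaLN pattern (condition-dependent scale, shift, and gate around attention and MLP) and adds U-Net-style long skips between shallow and deep blocks; rotary positional encoding is omitted for brevity. Class and attribute names are illustrative and do not mirror EzAudio's code.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Transformer block whose LayerNorm scale/shift/gate are predicted
    from a conditioning vector (e.g. timestep + text embedding)."""
    def __init__(self, dim, n_heads, cond_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ada = nn.Linear(cond_dim, 6 * dim)  # shift/scale/gate for attn and mlp

    def forward(self, x, cond):
        s1, sc1, g1, s2, sc2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x

class SkipDiT1D(nn.Module):
    """Stack of AdaLN blocks with long-skip connections between the
    first and second halves of the network (U-Net style)."""
    def __init__(self, dim=256, depth=8, n_heads=4, cond_dim=256):
        super().__init__()
        assert depth % 2 == 0
        self.blocks = nn.ModuleList(AdaLNBlock(dim, n_heads, cond_dim) for _ in range(depth))
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(depth // 2))

    def forward(self, x, cond):
        half = len(self.blocks) // 2
        skips = []
        for blk in self.blocks[:half]:
            x = blk(x, cond)
            skips.append(x)
        for proj, blk in zip(self.skip_proj, self.blocks[half:]):
            x = proj(torch.cat([x, skips.pop()], dim=-1))  # merge long skip
            x = blk(x, cond)
        return x

# latent audio tokens: (batch, time, channels); condition: (batch, cond_dim)
model = SkipDiT1D()
out = model(torch.randn(2, 128, 256), torch.randn(2, 256))
```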
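
Classifier-free guidance rescaling, referenced in the feature list, blends the standard CFG output with a version whose per-sample standard deviation is matched to the conditional prediction, which counteracts the over-saturation that large guidance scales tend to cause. The function below is a generic sketch of that trick (the `rescale` weight and default values are illustrative); it is not taken from EzAudio's sampler.

```python
import torch

def cfg_with_rescale(cond_pred, uncond_pred, guidance_scale=3.0, rescale=0.7, eps=1e-8):
    """Classifier-free guidance with output rescaling.

    cond_pred / uncond_pred: model outputs for the conditional and
    unconditional (empty-prompt) branches, shape (batch, time, channels).
    rescale: 0.0 = plain CFG, 1.0 = fully rescaled to cond_pred's statistics.
    """
    # Standard classifier-free guidance.
    guided = uncond_pred + guidance_scale * (cond_pred - uncond_pred)

    # Match the per-sample standard deviation of the guided output
    # to that of the conditional prediction.
    std_cond = cond_pred.std(dim=list(range(1, cond_pred.ndim)), keepdim=True)
    std_guided = guided.std(dim=list(range(1, guided.ndim)), keepdim=True)
    rescaled = guided * (std_cond / (std_guided + eps))

    # Interpolate between plain and rescaled guidance.
    return rescale * rescaled + (1.0 - rescale) * guided
```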
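
Finally, a rough idea of what inference with the released checkpoints could look like. The package name, `EzAudio` class, `generate_audio` method, and all argument names below are assumptions for illustration; the repository's README documents the actual entry points and checkpoint names.

```python
# Hypothetical usage sketch; API names and checkpoint ids are assumptions,
# not the confirmed EzAudio interface.
import soundfile as sf

from ezaudio import EzAudio  # assumed package/module layout

model = EzAudio(model_name="ezaudio-xl", device="cuda")  # loads a pretrained checkpoint

prompt = "rain falling on a tin roof with distant thunder"
sr, waveform = model.generate_audio(
    prompt,
    guidance_scale=3.0,    # classifier-free guidance strength
    guidance_rescale=0.7,  # CFG rescaling factor (see sketch above)
    ddim_steps=50,         # number of sampling steps
)

sf.write("rain.wav", waveform, sr)
```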