EzAudio is a high-quality, open-source model for generating realistic audio from text prompts. It operates in the latent space of a 1D waveform VAE and uses a diffusion transformer architecture tailored for efficient, robust audio synthesis, which eliminates the need for a separate vocoder.
Key Features
Direct generation of waveforms in latent space using a 1D VAE backbone
Efficient architecture with adaptive layer normalization, long-skip connections, and rotary positional encoding for better stability and speed
Multi-stage training combining unsupervised learning, automatic audio-caption alignment, and human-annotated fine-tuning
Classifier-free guidance rescaling for better prompt adherence without compromising audio quality
High-quality outputs that outperform many open-source alternatives
Support for audio editing and inpainting
Pretrained checkpoints and inference pipelines for ease of use
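
To make the classifier-free guidance rescaling feature listed above more concrete, here is a minimal sketch of the commonly used rescaling recipe: apply standard guidance, scale the result so its standard deviation matches that of the text-conditioned prediction, then blend the two. This is an illustration under assumptions, not EzAudio's actual code: the function name `rescaled_cfg`, its parameters and default values, and the use of flat Python lists in place of tensors are all hypothetical, and the exact variant EzAudio applies may differ.

```python
from statistics import pstdev


def rescaled_cfg(cond, uncond, guidance_scale=5.0, rescale=0.75):
    """Combine conditional and unconditional predictions with rescaled guidance.

    cond / uncond: flat lists of floats standing in for the model's
    predictions with and without the text prompt (hypothetical stand-ins
    for latent tensors).
    """
    # Standard classifier-free guidance: push the prediction away from the
    # unconditional output, in the direction of the conditional one.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

    # Large guidance scales inflate the magnitude of the guided prediction.
    # Rescale it so its standard deviation matches the conditional one.
    std_cond = pstdev(cond)
    std_guided = pstdev(guided)
    if std_guided == 0.0:
        return guided  # nothing to rescale
    factor = std_cond / std_guided
    rescaled = [g * factor for g in guided]

    # Blend the rescaled and plain guided predictions; rescale=0 gives
    # ordinary guidance, rescale=1 gives the fully rescaled prediction.
    return [rescale * r + (1.0 - rescale) * g for r, g in zip(rescaled, guided)]
```

In this sketch, a higher `guidance_scale` strengthens prompt adherence, while `rescale` limits the growth in output magnitude that strong guidance would otherwise cause, which is the usual motivation for pairing the two.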