Auffusion: Leveraging Diffusion and Large Language Models for Text-to-Audio Generation

Category: Deep Learning
License: MIT
Model Type: Text-to-Audio Generation
Auffusion is a latent diffusion model (LDM) tailored for generating realistic audio from textual prompts. It effectively adapts state-of-the-art text-to-image (T2I) frameworks to the text-to-audio (TTA) domain, enabling the synthesis of human speech, animal sounds, environmental noises, and sound effects with high fidelity and precise cross-modal alignment.
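
To make the latent diffusion flow concrete, the following is a minimal, self-contained sketch of DDPM-style ancestral sampling in latent space. The module names, tensor shapes, and noise schedule here are illustrative assumptions standing in for the real text encoder, UNet, VAE decoder, and vocoder; they are not Auffusion's actual implementation.

```python
# Illustrative sketch of text-conditioned latent diffusion sampling.
# All modules, shapes, and schedules are placeholders, not Auffusion's real API.
import torch
import torch.nn as nn

class DummyTextEncoder(nn.Module):
    """Stands in for the LLM-based text encoder used for conditioning."""
    def forward(self, token_ids):
        return torch.randn(token_ids.shape[0], 77, 768)  # (batch, seq, dim)

class DummyUNet(nn.Module):
    """Stands in for the conditional UNet that predicts noise in the latent."""
    def forward(self, latents, t, text_emb):
        return torch.randn_like(latents)  # predicted noise (random here)

@torch.no_grad()
def sample(unet, text_emb, steps=50, shape=(1, 4, 32, 128)):
    """Plain DDPM ancestral sampling over audio latents (simplified)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = unet(x, t, text_emb)                      # predict the noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise         # one denoising step
    return x  # in the real pipeline, a VAE decoder + vocoder yield audio

text_emb = DummyTextEncoder()(torch.zeros(1, 77, dtype=torch.long))
latent = sample(DummyUNet(), text_emb)
print(latent.shape)
```

In the actual model, the denoised latent is decoded into a mel spectrogram and then converted to a waveform by a neural vocoder; the dummy modules above only trace the data flow.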

Key Features

  • Latent Diffusion Architecture: Generates audio in a compressed latent space rather than directly on raw waveforms or spectrograms, improving both efficiency and sample quality.
  • Cross-Modal Alignment: Incorporates large language models to ensure accurate alignment between text inputs and audio outputs.
  • Versatile Audio Generation: Capable of producing a wide range of audio types, including speech, animal sounds, and environmental noises.
  • Advanced Manipulation Capabilities: Supports tasks such as audio style transfer, inpainting, and prompt-guided audio editing.
  • Pretrained Models and Inference Tools: Provides pretrained checkpoints and inference scripts to facilitate easy deployment and experimentation (see the quick-start sketch after this list).
  • Comprehensive Documentation: Includes detailed notebooks and examples to guide users through setup and usage.
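
As a quick-start illustration, here is a hedged sketch of loading the pretrained model and generating a clip. The import path, pipeline class, checkpoint id, output field, and sample rate are assumptions for illustration; consult the repository's own inference scripts and notebooks for the authoritative entry point.

```python
# Hypothetical quick-start. The import path, class name, checkpoint id,
# and output attribute below are assumptions, not confirmed Auffusion API.
import numpy as np
import scipy.io.wavfile as wavfile

try:
    # Assumed import path and checkpoint id -- verify against the repository.
    from auffusion_pipeline import AuffusionPipeline

    pipe = AuffusionPipeline.from_pretrained("auffusion/auffusion")
    output = pipe(prompt="birds singing sweetly in a blooming garden")
    audio = np.asarray(output.audios[0], dtype=np.float32)
except ImportError:
    # Fallback so this script runs without the package installed:
    # one second of silence at an assumed 16 kHz sample rate.
    audio = np.zeros(16000, dtype=np.float32)

wavfile.write("output.wav", 16000, audio)
```

The bundled notebooks walk through the manipulation tasks listed above (style transfer, inpainting, and prompt-guided editing) in more detail.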
