RaveFussion is a text-to-audio generative pipeline that combines a fine-tuned Riffusion model (based on Stable Diffusion) with RAVE, an audio-to-audio autoencoder. The system translates text input into audio features via Riffusion and then refines these into high-quality waveforms using RAVE.
Key Features
Text-to-audio synthesis using a fine-tuned diffusion model
Integration of the RAVE autoencoder for audio enhancement
Two-stage generation: visual embeddings to audio embeddings to final waveform
Python-based configuration and scripting
Support for experimentation and research in generative audio
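The two-stage flow described above can be sketched structurally as follows. This is only an illustrative outline, not the project's actual code: `TwoStagePipeline`, `stub_diffusion_stage`, and `stub_rave_stage` are hypothetical names, and both stages are placeholders standing in for the fine-tuned Riffusion model and the RAVE autoencoder.

```python
from dataclasses import dataclass
from typing import Callable, List

Features = List[float]  # stand-in for the intermediate audio features
Waveform = List[float]  # stand-in for the output audio samples


@dataclass
class TwoStagePipeline:
    """Chains a text-to-features stage with a features-to-waveform stage."""

    text_to_features: Callable[[str], Features]
    features_to_waveform: Callable[[Features], Waveform]

    def generate(self, prompt: str) -> Waveform:
        # Stage 1: text prompt -> intermediate features
        features = self.text_to_features(prompt)
        # Stage 2: intermediate features -> waveform
        return self.features_to_waveform(features)


def stub_diffusion_stage(prompt: str) -> Features:
    # Placeholder (assumption): a real stage would run the fine-tuned
    # Riffusion model. Here each word is mapped to one dummy feature value.
    return [len(word) / 10.0 for word in prompt.split()]


def stub_rave_stage(features: Features) -> Waveform:
    # Placeholder (assumption): a real stage would decode with RAVE.
    # Here each feature is repeated four times to mimic upsampling.
    return [value for value in features for _ in range(4)]


pipeline = TwoStagePipeline(stub_diffusion_stage, stub_rave_stage)
waveform = pipeline.generate("warm analog synth pad")
print(len(waveform))
```

Keeping the two stages as interchangeable callables makes it straightforward to swap either placeholder for a real model wrapper when experimenting.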