Mustango: Controllable Text-to-Music Generation via Diffusion

Category: Deep Learning
License: MIT
Model Type: Text-to-Music Generation
Mustango is a state-of-the-art text-to-music model that uses diffusion guided by music-domain knowledge. It builds on Tango’s text-to-audio foundation and lets users specify musical attributes—such as chords, tempo, beats, and key—through enriched text prompts. The project also releases MusicBench, a new dataset of musically annotated audio examples.
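The enriched prompts described above can be pictured as a free-text description with musical control attributes appended. The helper below is purely illustrative—the function name, attribute keys, and prompt layout are assumptions for the sketch, not part of Mustango's actual API or prompt format.

```python
# Illustrative sketch only: how a free-text description might be combined
# with musical control attributes into one enriched prompt string.
# `build_enriched_prompt` and its formatting are hypothetical.

def build_enriched_prompt(description: str, **attributes: str) -> str:
    """Append musical control attributes to a free-text description."""
    controls = ", ".join(f"{key}: {value}" for key, value in attributes.items())
    return f"{description} [{controls}]" if controls else description

prompt = build_enriched_prompt(
    "A mellow evening piece with soft piano",
    chord="Gmaj7",
    tempo="90 BPM",
    style="jazz",
)
print(prompt)
# A mellow evening piece with soft piano [chord: Gmaj7, tempo: 90 BPM, style: jazz]
```

In the real model, such attribute cues are parsed from the prompt and routed to the MuNet module to condition the diffusion process.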

Key Features

  • Music-Aware Prompting: Accepts enriched text prompts with musical elements such as "chord: Gmaj7", "tempo: 90 BPM", or "style: jazz".
  • Domain-Aware Architecture: Uses a dedicated module called MuNet to process musical features and guide the diffusion model accordingly.
  • Data Augmentation: Enlarges the training dataset via pitch, tempo, and volume variations to improve generalization and robustness.
  • High-Quality Output: Produces higher-fidelity, more prompt-faithful results than comparable models such as MusicGen and AudioLDM.
  • MusicBench Dataset: Offers a novel dataset with 11x more samples than the original corpus it is built from, each annotated with musically meaningful metadata.
  • Open and Reproducible: Provides pretrained weights, training code, and inference scripts for reproducibility and experimentation.
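The pitch, tempo, and volume augmentations mentioned above can be sketched in miniature on a plain list of samples. This is a toy illustration under simplifying assumptions—a real pipeline would use a proper audio library with phase-vocoder-based pitch shifting—and none of these functions come from the Mustango codebase.

```python
# Toy sketches of the three augmentation axes (volume, tempo, pitch)
# on a list of float samples. Illustrative only; not Mustango's code.
import math

def change_volume(samples, gain):
    """Scale amplitude by a constant gain factor."""
    return [s * gain for s in samples]

def change_tempo(samples, rate):
    """Naive time-stretch by index resampling (rate > 1 speeds up)."""
    n = int(len(samples) / rate)
    return [samples[min(int(i * rate), len(samples) - 1)] for i in range(n)]

def change_pitch(samples, semitones):
    """Crude pitch shift by resampling at 2**(semitones/12).
    Unlike a phase vocoder, this also changes the duration."""
    return change_tempo(samples, 2 ** (semitones / 12))

# Example: a short 440 Hz sine at a 16 kHz sample rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
louder = change_volume(tone, 1.5)   # same length, 1.5x amplitude
faster = change_tempo(tone, 2.0)    # half the length, double speed
```

Applying several such variants per training clip multiplies the effective dataset size, which is the generalization benefit the feature list refers to.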