Mustango: Controllable Text-to-Music Generation via Diffusion

Category: Deep Learning
License: MIT
Model Type: Text-to-Music Generation
Mustango is a state-of-the-art text-to-music model that uses diffusion guided by music-domain knowledge. It builds on Tango’s text-to-audio foundation and lets users specify musical attributes—such as chords, tempo, beats, and key—through enriched text prompts. The project also releases MusicBench, a new dataset of musically annotated audio examples.
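The enriched prompts described above can be pictured as a free-text description with musical control attributes appended. The helper below is purely illustrative—the function name, attribute keys, and prompt layout are assumptions for the sketch, not part of Mustango's actual API or prompt format.

```python
# Illustrative sketch only: how a free-text description might be combined
# with musical control attributes into one enriched prompt string.
# `build_enriched_prompt` and its formatting are hypothetical.

def build_enriched_prompt(description: str, **attributes: str) -> str:
    """Append musical control attributes to a free-text description."""
    controls = ", ".join(f"{key}: {value}" for key, value in attributes.items())
    return f"{description} [{controls}]" if controls else description

prompt = build_enriched_prompt(
    "A mellow evening piece with soft piano",
    chord="Gmaj7",
    tempo="90 BPM",
    style="jazz",
)
print(prompt)
# A mellow evening piece with soft piano [chord: Gmaj7, tempo: 90 BPM, style: jazz]
```

In the real model, such attribute cues are parsed from the prompt and routed to the MuNet module to condition the diffusion process.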

Key Features

  • Music-Aware Prompting: Accepts enriched text prompts with musical elements such as "chord: Gmaj7", "tempo: 90 BPM", or "style: jazz".
  • Domain-Aware Architecture: Uses a dedicated module called MuNet to process musical features and guide the diffusion model accordingly.
  • Data Augmentation: Enlarges the training dataset via pitch, tempo, and volume variations to improve generalization and robustness.
  • High-Quality Output: Produces higher-fidelity, more prompt-faithful results than comparable models such as MusicGen and AudioLDM.
  • MusicBench Dataset: Offers a novel dataset with 11x more samples than the original corpus it is built from, each annotated with musically meaningful metadata.
  • Open and Reproducible: Provides pretrained weights, training code, and inference scripts for reproducibility and experimentation.
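The pitch, tempo, and volume augmentations mentioned above can be sketched in miniature on a plain list of samples. This is a toy illustration under simplifying assumptions—a real pipeline would use a proper audio library with phase-vocoder-based pitch shifting—and none of these functions come from the Mustango codebase.

```python
# Toy sketches of the three augmentation axes (volume, tempo, pitch)
# on a list of float samples. Illustrative only; not Mustango's code.
import math

def change_volume(samples, gain):
    """Scale amplitude by a constant gain factor."""
    return [s * gain for s in samples]

def change_tempo(samples, rate):
    """Naive time-stretch by index resampling (rate > 1 speeds up)."""
    n = int(len(samples) / rate)
    return [samples[min(int(i * rate), len(samples) - 1)] for i in range(n)]

def change_pitch(samples, semitones):
    """Crude pitch shift by resampling at 2**(semitones/12).
    Unlike a phase vocoder, this also changes the duration."""
    return change_tempo(samples, 2 ** (semitones / 12))

# Example: a short 440 Hz sine at a 16 kHz sample rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
louder = change_volume(tone, 1.5)   # same length, 1.5x amplitude
faster = change_tempo(tone, 2.0)    # half the length, double speed
```

Applying several such variants per training clip multiplies the effective dataset size, which is the generalization benefit the feature list refers to.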