Mustango is a controllable text-to-music model that guides diffusion with music-domain knowledge. It builds on Tango's text-to-audio foundation and improves usability by letting users specify musical attributes, such as chords, tempo, beats, and key, through enriched text prompts. The release also includes MusicBench, a new dataset of musically annotated audio examples.
Key Features
Music-Aware Prompting: Accepts enriched text prompts with musical elements such as "chord: Gmaj7", "tempo: 90 BPM", or "style: jazz".
Domain-Aware Architecture: Uses a dedicated module called MuNet to process musical features and guide the diffusion model accordingly.
Data Augmentation: Enlarges the training dataset via pitch, tempo, and volume variations to improve generalization and robustness.
High-Quality Output: Produces higher-fidelity, more prompt-faithful results than comparable models such as MusicGen and AudioLDM.
MusicBench Dataset: Offers a novel dataset with roughly 11x as many samples as its source dataset, annotated with musically meaningful metadata such as chords, beats, tempo, and key.
Open and Reproducible: Provides pretrained weights, training code, and inference scripts for reproducibility and experimentation.
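To make the music-aware prompting idea above concrete, here is a minimal sketch of how a plain caption could be combined with control sentences for musical attributes. The function name and sentence templates are illustrative assumptions, not Mustango's exact control-sentence format.

```python
# Sketch: assembling an enriched text prompt from musical attributes.
# The templates below are assumptions for illustration; Mustango's
# released prompts define the exact phrasing it was trained on.

def enrich_prompt(caption, chords=None, tempo_bpm=None, key=None, beats=None):
    """Append a control sentence for each musical attribute provided."""
    parts = [caption]
    if chords:
        parts.append(f"The chord progression is {', '.join(chords)}.")
    if tempo_bpm:
        parts.append(f"The tempo is {tempo_bpm} BPM.")
    if key:
        parts.append(f"The key is {key}.")
    if beats:
        parts.append(f"The beat counts to {beats}.")
    return " ".join(parts)

prompt = enrich_prompt(
    "A mellow jazz piece with soft piano.",
    chords=["Gmaj7", "Em7", "Am7", "D7"],
    tempo_bpm=90,
    key="G major",
)
print(prompt)
```

Attributes the user omits simply produce no control sentence, so the same caption can be reused with varying degrees of musical control.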
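The data-augmentation feature above (pitch, tempo, and volume variations) can be sketched in plain NumPy. This is a hedged illustration of the general idea, not Mustango's actual pipeline: it covers a decibel gain change and a naive speed change via linear resampling, while proper pitch shifting would need a DSP library (e.g. librosa's phase-vocoder-based effects).

```python
import numpy as np

# Illustrative augmentation sketch (assumed, not Mustango's code):
# create volume- and tempo-varied copies of a waveform.

def change_volume(wave, gain_db):
    """Scale amplitude by a gain expressed in decibels."""
    return wave * (10.0 ** (gain_db / 20.0))

def change_tempo(wave, rate):
    """Naively resample so the clip plays `rate` times faster.
    (This also shifts pitch; a phase vocoder would preserve it.)"""
    n_out = int(len(wave) / rate)
    old_idx = np.linspace(0, len(wave) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(wave)), wave)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)          # 1 s of noise at 16 kHz
louder = change_volume(wave, gain_db=6.0)  # ~2x amplitude
faster = change_tempo(wave, rate=1.25)     # 25% faster, shorter clip
print(len(wave), len(faster))
```

Each augmented copy keeps the semantic content of the original clip while varying low-level properties, which is what lets augmentation improve generalization without new recordings.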