OpenMusic: Quality-Aware Diffusion Transformer for Text-to-Music Generation

Category: Deep Learning
License: MIT
Model Type: Text-to-Music Generation
OpenMusic is an open-source implementation of QA-MDT (Quality-Aware Masked Diffusion Transformer), a model for generating music from text descriptions. It targets state-of-the-art results on standard benchmarks, supports variable-length generation, and improves musical fidelity by conditioning a masked diffusion transformer on quality tokens.
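The quality conditioning can be pictured as a learned embedding injected into the model's conditioning stream. The PyTorch sketch below is a toy illustration of that idea, not code from this repository: it embeds a discrete quality level and prepends it to the text-encoder output, so the denoiser can attend to the requested quality alongside the prompt. All class and variable names here are hypothetical.

```python
# Toy illustration of quality-token conditioning (names are hypothetical,
# not the repository's actual modules).
import torch
import torch.nn as nn

class QualityConditioner(nn.Module):
    def __init__(self, num_quality_bins: int = 5, dim: int = 512):
        super().__init__()
        # One learned embedding per discrete quality bin
        # (e.g. bins derived from quality scores over the training data).
        self.quality_emb = nn.Embedding(num_quality_bins, dim)

    def forward(self, text_emb: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, seq_len, dim) output of a text encoder such as Flan-T5
        # quality:  (batch,) integer quality bin per example
        q = self.quality_emb(quality).unsqueeze(1)  # (batch, 1, dim)
        # Prepend the quality token so the diffusion transformer's
        # cross-attention sees it alongside the text tokens.
        return torch.cat([q, text_emb], dim=1)

cond = QualityConditioner()
text_emb = torch.randn(2, 16, 512)        # dummy text embeddings
quality = torch.full((2,), 4)             # request the highest quality bin
print(cond(text_emb, quality).shape)      # torch.Size([2, 17, 512])
```

The point of this scheme is that a model trained on mixed-quality data can be asked for the top quality bin at inference time, steering generation above the average quality of its training set.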

Key Features

  • Uses the QA-MDT architecture for controllable and coherent music generation
  • Quality tokens guide audio output toward better musical fidelity
  • Supports variable-length generation, including arbitrarily long outputs
  • Benchmarked on datasets like MusicCaps and Song Describer
  • Includes pre-trained checkpoints and inference scripts
  • Gradio-based user interface for quick testing and demos (see the demo sketch after this list)
  • Supports model fine-tuning and custom dataset integration
  • Compatible with CLAP, AudioLDM, and Flan-T5 embeddings (a text-embedding example follows below)
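
For quick interactive testing, a Gradio wrapper along these lines can expose the model. This is a hypothetical sketch: `generate_music` is a stand-in that just synthesizes a sine tone, and a real demo would instead call the repository's inference script with a pretrained checkpoint.

```python
# Hypothetical Gradio demo skeleton; `generate_music` is a placeholder,
# not the repository's actual inference entry point.
import gradio as gr
import numpy as np

def generate_music(prompt: str, duration: float, quality: int):
    # Placeholder: a real implementation would run QA-MDT diffusion
    # sampling conditioned on `prompt` and the requested quality token.
    sr = 16000
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    return sr, (0.1 * np.sin(2 * np.pi * 220.0 * t)).astype(np.float32)

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Text prompt"),
        gr.Slider(5, 60, value=10, label="Duration (s)"),
        gr.Slider(0, 4, value=4, step=1, label="Quality token"),
    ],
    outputs=gr.Audio(label="Generated audio"),
)

if __name__ == "__main__":
    demo.launch()
```

As a concrete example of the text-embedding side, the snippet below obtains Flan-T5 encoder embeddings with the Hugging Face transformers library. The checkpoint name is one common choice, not necessarily the one this project ships with, and wiring these embeddings into the diffusion model is handled by the project's own pipeline.

```python
# Obtaining Flan-T5 text embeddings for conditioning
# (the checkpoint choice is illustrative).
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")

prompts = ["upbeat jazz piano with brushed drums"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = encoder(**inputs).last_hidden_state  # (batch, seq_len, d_model)
print(text_emb.shape)
```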