A thoughtfully curated repository that compiles academic papers and resources on integrating large language models (LLMs) with multimodal generation, covering image, video, 3D, audio, speech, and music synthesis. It serves as a reference for researchers interested in how LLMs can drive text-guided content creation across modalities.
Key Features
Organized by modality: separate sections for image, video, 3D, and audio generation and editing
LLM-based and non-LLM methods: distinguishes between LLM-centric approaches and those built on CLIP or T5
Citations & code links: includes links to research papers and corresponding open-source codebases
Dataset references: highlights key multimodal datasets used in generation tasks
Tool-augmented agents: covers models that combine LLM reasoning with external tools for richer multimodal outputs
Safety & future directions: includes discussion of AI safety, ethical considerations, and open directions for future work
Frequently updated: actively curated, with the most recent updates in late 2024