A thoughtfully curated repository that compiles academic papers and resources on integrating large language models (LLMs) with multimodal generation, covering image, video, 3D, audio, speech, and music synthesis. It serves as a reference for researchers interested in how LLMs can drive text-guided content creation across modalities.
Key Features
Organized by modality: separate sections for image, video, 3D, and audio generation and editing
LLM-based and non-LLM methods: distinguishes between LLM-centric approaches and those built on CLIP or T5
Citations & code links: includes links to research papers and corresponding open-source codebases
Dataset references: highlights key multimodal datasets used in generation tasks
Tool-augmented agents: covers models that combine LLM reasoning with external tools for richer multimodal outputs
Safety & future directions: includes discussion of AI safety, ethical considerations, and open directions for future work
Frequently updated: actively curated, with the most recent updates in late 2024