AlbumGen: An exploration into Multimodality with LLMs

Nov 10, 2024·
Jethro Au
Jethro Au
,
Lai Chun Yu (HKUST)
· 1 min read
AlbumGen
Abstract
This project explores the intersection of computer vision and music generation by utilizing image captioning models on album cover art and employing text-to-music generation models, such as MusicGen, to create music based on these captions. By training a model to generate descriptive captions from album covers, we aim to develop a system that automatically produces music aligned with the visual themes of album artwork. This innovative approach opens new avenues for creative AI-assisted music production.
Type
Publication
AlbumGen - Image-to-Music Generation with Textual Intermediaries

Full Text of paper here. Presentation of AlbumGen with Demo here.

AlbumGen explores generating music from album cover art using a multi-step pipeline that incorporates image captioning, text-to-music generation, and explainable intermediaries. The aim is to connect the visual themes of album art with music creation, leveraging advances in computer vision and large language models. This project aims to explore the development of an image-to-music pipeline with LLMs. Rather than delving into philosophical interpretations of how music should be conditioned on image inputs, we strive to ground our pipeline on explainable intermediaries in the generation process through text.

image

Demo

Some demos can be found here