Music Reactive Video Generation


ICASSP 2025 Submission 396

Schematic overview of our music reactive video generation framework.

We use our highlighting method to enhance the climax information present in Chroma-CENS. We then combine the Highlighted CENS (H-CENS) with twelve seed latent images of flowers to generate one latent per frame. An audio-reactive video is generated by encoding the combined input of H-CENS and latent images. Because H-CENS carries the emphasized features of climaxes, the generated videos react strongly to the music.
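The per-frame combination described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the submission's implementation: the page does not specify how the highlighting works or how H-CENS and the seed latents are combined, so `sharpen_frame` is a hypothetical stand-in (power-law sharpening followed by renormalization) and `mix_latents` assumes a simple weighted sum with one seed latent per pitch class. Latents are flattened to plain lists of floats to keep the example dependency-free.

```python
from typing import List

Vector = List[float]


def sharpen_frame(chroma: Vector, gamma: float = 2.0) -> Vector:
    """Hypothetical stand-in for highlighting (NOT the paper's method):
    raise each pitch-class energy to the power gamma, then renormalize
    so the weights sum to 1. gamma > 1 makes dominant bins stand out."""
    powered = [max(c, 0.0) ** gamma for c in chroma]
    total = sum(powered)
    if total == 0.0:
        # Silent frame: fall back to uniform weights over all bins.
        return [1.0 / len(chroma)] * len(chroma)
    return [p / total for p in powered]


def mix_latents(weights: Vector, seeds: List[Vector]) -> Vector:
    """Assumed combination rule: weighted sum of the seed latents,
    using one seed latent per chroma bin."""
    if len(weights) != len(seeds):
        raise ValueError("need exactly one seed latent per chroma bin")
    dim = len(seeds[0])
    return [sum(w * s[i] for w, s in zip(weights, seeds)) for i in range(dim)]


def latents_per_frame(
    chroma_frames: List[Vector], seeds: List[Vector], gamma: float = 2.0
) -> List[Vector]:
    """Produce one mixed latent for every chroma frame."""
    return [mix_latents(sharpen_frame(f, gamma), seeds) for f in chroma_frames]
```

In this sketch a frame dominated by one pitch class yields a latent close to that class's seed image, while a flat frame yields an even blend. In practice the chroma frames would come from an audio feature extractor and the mixed latents would be passed to a generator network, neither of which is shown here.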

Abstract

While generating music-reactive videos is an important task in real-world applications, it remains a challenging generation problem. We argue that this challenge arises from three factors: 1) the emotional content of music occupies too small a region of the chromagram to be translated into video, 2) evaluating music-reactive generation is fundamentally subjective and varies between individuals, and thus, 3) applying the quantitative metrics used for existing generative models is difficult. For these reasons, this paper introduces an efficient music-reactive video generation model and a metric for accurately evaluating generated music-reactive videos. For our music-reactive generation model, we present the highlighted chromagram approach to effectively represent musical structure information (e.g., melody, beat, rhythm) in videos. We also introduce a quantitative metric, Beat Co-occurrence, which is strongly correlated with human survey results. Our experiments demonstrate that our music-reactive generation model performs favorably both in human surveys and under the proposed Beat Co-occurrence distance metric.

Music-Reactive Video Generation Comparison

We generate music-reactive videos across three different genres (i.e., Rock, K-POP, Traditional Instrumental Ensemble) and compare Chroma-CENS with our H-CENS method. In our comparison study, we observed that while videos generated using Chroma-CENS play smoothly, peak signals, such as musical climaxes, are diluted, resulting in less distinct responsiveness to the music. Conversely, videos generated using H-CENS better preserve and emphasize these climaxes, leading to more accurate and natural visual responses. As a result, we found that H-CENS more effectively captures subtle musical changes and emotional nuances, facilitating a dynamic interaction between the music and visual elements, and conveying the emotional characteristics of the music more clearly to the viewers.

*Please adjust the volume to suit your computer's sound settings.*


Genre 1: K-POP

Genre 2: Traditional Instrumental Ensemble

Genre 3: Rock

Music-Reactive Video Generation with H-CENS

Genre 1: K-POP

Heya

Supernatural

Small girl

Supernova

Bamyanggang

Genre 2: Traditional Instrumental Ensemble

The moon reflected in a well

The tiger is here

Iced ripe persimmon

Dance in hangawi

Full of happy things

Genre 3: Rock

Takin me down hardline

Cherry pie warrant

Youth gone wild skid row

Since you been gone rainbow

Rising force yngwie malmsteen