
Editing Music with Melody and Text:
Using ControlNet for Diffusion Transformer

Siyuan Hou (1,2), Shansong Liu (2), Ruibin Yuan (3), Wei Xue (3), Ying Shan (2), Mangsuo Zhao (1), Chao Zhang (1)
(1) Tsinghua University
(2) ARC Lab, Tencent PCG
(3) Hong Kong University of Science and Technology

Supporting webpage for our ICASSP 2025 paper.
[Paper on arXiv]

Abstract

Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-k constant-Q transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments have been performed on text-to-music generation and music-style transfer tasks using open-source instrumental recording data. The results demonstrate that by extending StableAudio, a pre-trained text-controlled DiT model, our approach enables superior melody-controlled editing while retaining good text-to-music generation performance, outperforming a strong MusicGen baseline in terms of both text-based generation quality and melody preservation for editing.
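The abstract only names the curriculum strategy; as an illustration, a schedule that progressively masks the melody prompt during training could look like the sketch below. The function name, the linear ramp, and the masking ceiling are assumptions made for illustration, not the paper's exact settings.

import torch

def mask_melody_prompt(melody: torch.Tensor, step: int, total_steps: int,
                       max_drop_prob: float = 0.9) -> torch.Tensor:
    """Sketch of progressive melody-prompt masking (assumed schedule).

    Early steps keep the melody prompt intact so the ControlNet branch
    learns to follow it; the drop probability then ramps up linearly so
    the model also learns to generate from the text prompt alone.  The
    linear ramp and the 0.9 ceiling are illustrative assumptions.
    """
    p = max_drop_prob * min(step / max(total_steps, 1), 1.0)
    if torch.rand(()).item() < p:
        return torch.zeros_like(melody)  # drop (fully mask) the melody prompt
    return melody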

Music Editing

The examples for the music editing task are all taken from the Song Describer dataset [1]. Our model is conditioned on both a text prompt and a music prompt. The text prompt comes from the dataset, while the music prompt is the top-4 constant-Q transform (CQT) representation extracted from the target audio (see the sketch below). In the table, the music prompt column shows the top-4 CQT representation of the left channel over seconds 0 to 6. For the baseline model MusicGen [2], the same text prompt and a chroma-based melody representation are used as conditional inputs.
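As a concrete illustration, a top-k CQT melody prompt like the one described above could be extracted with librosa roughly as follows. The hop length, bin count, and sample rate are illustrative assumptions, since the exact settings are not given on this page.

import numpy as np
import librosa

def top_k_cqt(path: str, k: int = 4, start: float = 0.0, duration: float = 6.0,
              sr: int = 44100, n_bins: int = 84,
              bins_per_octave: int = 12) -> np.ndarray:
    """Return a CQT magnitude matrix where only the k largest bins per
    frame are kept and all other bins are zeroed out (assumed settings)."""
    y, sr = librosa.load(path, sr=sr, mono=False, offset=start, duration=duration)
    left = y[0] if y.ndim > 1 else y          # left channel, as in the demo table
    C = np.abs(librosa.cqt(left, sr=sr, n_bins=n_bins,
                           bins_per_octave=bins_per_octave))
    top = np.argsort(C, axis=0)[-k:, :]       # indices of the k largest bins per frame
    mask = np.zeros_like(C, dtype=bool)
    np.put_along_axis(mask, top, True, axis=0)
    return np.where(mask, C, 0.0)             # keep top-k bins, zero the rest

melody_prompt = top_k_cqt("target.wav", k=4)  # top-4 CQT of seconds 0 to 6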

Scroll to see all the results if necessary.

[Table columns: text prompt | music prompt (top-4 CQT image) | Target audio | MusicGen-melody | MusicGen-melody-large | Ours. Audio and images are available on the webpage; the text prompts are listed below.]
A twisty nice melody song by a slide electric guitar on top of acoustic chords later accompanied with a ukelele.
8-bit melody brings one back to the arcade saloons while keeping the desire to dance.
Instrumental piano piece with a slightly classical touch and a nostalgic, bittersweet or blue mood.
Positive instrumental pop song with a strong rhythm and brass section.
A blues piano track that would be very well suited in a 90s sitcom. The piano occupies the whole track that has a prominent bass line as well, with a general jolly and happy feeling throughout the song.
An upbeat pop instrumental track starting with synthesized piano sound, later with guitar added in, and then a saxophone-like melody line.
Pop song with a classical chord progression in which all instruments join progressively, building up a richer and richer music.
An instrumental world fusion track with prominent reggae elements.

Text-to-Music

The examples for the text-to-music task also come from the Song Describer dataset [1]. For both our model and the baseline MusicGen [2], only the text prompt from the dataset is used as the control condition for music generation; the music prompt for our model is left empty (see the sketch below).
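What "left empty" means concretely is not specified on this page; one plausible realization, shown as an assumption below, is an all-zero matrix with the shape the top-4 CQT melody prompt would otherwise have, so only the text prompt steers generation.

import numpy as np

# Assumed realization of an empty music prompt: an all-zero matrix with
# the same shape as the top-4 CQT melody prompt (shape is illustrative).
N_BINS, N_FRAMES = 84, 512
empty_music_prompt = np.zeros((N_BINS, N_FRAMES), dtype=np.float32)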

Scroll to see all the results if necessary.

[Table columns: text prompt | MusicGen-melody | MusicGen-melody-large | Ours. Audio is available on the webpage; the text prompts are listed below.]
An energetic rock and roll song, accompanied by a nervous electric guitar.
A deep house track with a very clear build up, very well balanced and smooth kick-snare timbre. The glockenspiel samples seem to be the best option to aid for the smoothness of such a track, which helps 2 minutes to pass like it was nothing. A very clear and effective contrastive counterpoint structure between the bass and treble registers of keyboards and then the bass drum/snare structure is what makes this song a very good representative of house music.
A string ensemble starts of the track with legato melancholic playing. After two bars, a more instruments from the ensemble come in. Alti and violins seem to be playing melody while celli, alti and basses underpin the moving melody lines with harmonies and chords. The track feels ominous and melanchonic. Halfway through, alti switch to pizzicato, and then fade out to let the celli and basses come through with somber melodies, leaving the chords to the violins.
medium tempo ambient sounds to begin with and slow guitar plucking layering followed by an ambient rhythmic beat and then remove the layering in the opposite direction.
An instrumental surf rock track with a twist. Open charleston beat with strummed guitar and a mellow synth lead. The song is a happy cyberpunk soundtrack.
Starts like an experimental hip hop beat, transitions into an epic happy and relaxing vibe with the melody and guitar. It is an instrumental track with mostly acoustic instruments.

References

[1] I. Manco, B. Weck, S. Doh, M. Won, Y. Zhang, D. Bogdanov, Y. Wu, K. Chen, P. Tovstogan, E. Benetos, E. Quinton, G. Fazekas, and J. Nam, “The Song Describer dataset: A corpus of audio captions for music-and-language evaluation,” in Proc. NeurIPS, New Orleans, 2023.

[2] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in Proc. NeurIPS, New Orleans, 2023.