Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Abstract. Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and processing pipeline.
In this section, we present samples generated by SongGen in the Mixed Pro mode.
Lyrics | Description | Reference Voice (optional) | Generated |
---|---|---|---|
I only wanted to be strong to be brave But it's driven everyone away | A female vocalist sings this pop rock song. The tempo is medium with electric guitar and enthusiastic drumming. | ||
I've got a paper and pen, i go to write a goodbye, and that's when i've got a world of chances for you, i've got a world of chances for you | A female vocalist performs this rhythmic pop rock song, featuring rousing and sparkling guitar riffs. | ||
Hey, it's a beautiful day. Ain't no clouds in the sky. The sun is shining bright. Let's get up and fly. Come on and take my hand together. We'll explore the beauty of this world forevermore. We can go anywhere to the mountains or the sea. Let's start an adventure you and me. No matter what life throws at us. We’ll be by each other's side. | A male vocalist performs this rap. The tempo is slow with enthusiastic drumming, digital beats, keyboard arrangements, and a catchy vocal riff. The rap is catchy, youthful, insightful, and enthusiastic. | N/A | |
Hey, it's a beautiful day. No clouds in the sky. The sun is shining bright. | The song is performed by an ethereal and pure female voice, filled with hope and strength. The overall atmosphere is cheerful and uplifting. | N/A | |
Sing at the moon and drive through the night. | female, guitar, mellow | N/A | |
And I just can't seem to forget The way your face glows And the way you make me cry. | male, rock, emotional | N/A | |
In this section, we present samples generated by SongGen in the dual-track mode using the interleaving (A-V) pattern. We display both the original vocal and accompaniment tracks, along with their summed version, which is presented as the first audio.
Lyrics | Description | Reference Voice (optional) | Generated |
---|---|---|---|
And even if they take my life away, I'ma stay with you, you are my shepherd. | A female vocalist sings this pop song, accompanied by punchy kicks and a buzzing synth bass. The track sounds aggressive, bright, punchy, energetic, and somewhat manic. | N/A | Vocal track: Acc. track: |
I've seen the beauty of diamonds and pearls, but they're nothing to me. | The romantic music features a male voice singing a melody. | Vocal track: Acc. track: | |
So if I let down my guard, if I rip up my scars, and I show you my heart, am I beautiful? If I tell you my secrets, show my dark and my demons, tell me, Am I beautiful? | A male vocalist with a soft, youthful voice sings this love song, accompanied by acoustic guitar. The rhythm adds smooth depth to the romantic, melancholic mood. | N/A | Vocal track: Acc. track: |
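The summed version mentioned above can be illustrated with a minimal sketch, assuming both tracks are mono waveforms of float samples in [-1, 1] at the same sample rate. The helper name `mix_tracks`, the zero-padding, and the hard-clipping step are illustrative assumptions, not part of the SongGen release.

```python
from typing import List


def mix_tracks(vocal: List[float], accompaniment: List[float]) -> List[float]:
    """Sum two mono waveforms sample by sample (hypothetical helper).

    Assumes float samples in [-1.0, 1.0] at the same sample rate.
    The shorter track is zero-padded, and the sum is hard-clipped
    back into [-1.0, 1.0] so the result stays a valid waveform.
    """
    length = max(len(vocal), len(accompaniment))
    mixed = []
    for i in range(length):
        v = vocal[i] if i < len(vocal) else 0.0
        a = accompaniment[i] if i < len(accompaniment) else 0.0
        mixed.append(max(-1.0, min(1.0, v + a)))
    return mixed
```

Keeping the two tracks separate until this last step is what allows downstream uses such as re-balancing the vocal level or replacing the accompaniment.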
In this section, we present a subset of samples generated from the MusicCaps test set. For each sample, we provide the lyrics (annotated through our data preprocessing pipeline), the text description, and the vocal reference (the first three seconds of the ground-truth vocal) as input to our model. We showcase the generated audio under two different generation modes: Mixed Pro and Interleaving (A-V). For Interleaving (A-V), we display both the original vocal and accompaniment tracks, along with their summed version (presented as the first audio).
Lyrics | Description | Reference Voice | Ground Truth (GT) | Mixed Pro (ours) | Interleaving (A-V) (ours) |
---|---|---|---|---|---|
For the truth and the being as beaming as the moon queen. You bless my future to be with bismillah. For the soul's anguished love and the moment my brothers programmed these drums. | A male vocalist sings this Rap. The tempo is slow with enthusiastic drumming, syncopated piano harmony, digital beats, keyboard arrangements with vocal backup and a catchy vocal riff. The rap is catchy, youthful, insightful, enthusiastic, intense, passionate, emotional and persuasive. This song is contemporary Rap/Hip-Hop. | Vocal track: Acc. track: | |||
The work to be done. Let the music play on. I'm feeling the vibe, you know I'm feeling good. | This is a remix of an R&B soul piece. There is a male vocal singing in a laid-back manner joined by an auto-tuned male vocal. The keyboard provides the melody with a gentle bass guitar playing in the background. The rhythmic structure is composed of the acoustic drums and the percussion playing a medium tempo beat. The atmosphere of the piece is groovy and there is a feelgood aura to it. This piece could be used in the soundtrack of a sitcom. | Vocal track: Acc. track: | |||
Singing Radiohead at the top of our lungs With the boombox blaring as we're falling in | This pop rock song features a female voice singing the main melody. The song starts off with the sound of folding paper. Then the voice starts singing. This is accompanied by rock drums playing the kick drum. The snare is struck at every alternate count. The reverb effect is added to the percussion. There are no other instruments in this song. This song has a happy mood. This song can be played in an intro scene of a coming-of-age movie. | Vocal track: Acc. track: | |||
And I just can't seem to forget The way your face glows And the way you make me cry | A male vocalist sings this smooth Soul melody. The tempo is medium with a mellow piano accompaniment, steady drum machine beats, atmospheric synthesiser and subtle bass with backup vocals. The song is soft, ambient, passionate, emotional, mellifluous, sentimental and warm. This song is a contemporary R&B/Soul. | Vocal track: Acc. track: | |||
The sun is shining, the day is brightening up, happy times are happy times are happy times are here. The sun is shining. | The music features a group of female voices singing a melody in unison. The instrumental consists of only percussion drums, African percussion drums to be precise. A shaker can also be heard sounding on every beat. In the background one can hear water sounds. The overall atmosphere is cheerful and uplifting. | Vocal track: Acc. track: | |||
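The vocal reference used for these samples (the first three seconds of the ground-truth vocal) amounts to a simple slice of the waveform. The following is a minimal sketch, assuming a mono waveform stored as a sequence of samples. The helper name `first_seconds` is hypothetical, and the actual preprocessing code may differ.

```python
from typing import List, Sequence


def first_seconds(samples: Sequence[float], sample_rate: int,
                  seconds: float = 3.0) -> List[float]:
    """Return the leading `seconds` of a mono waveform (hypothetical helper).

    If the waveform is shorter than the requested duration, the whole
    waveform is returned unchanged.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    n_samples = int(round(sample_rate * seconds))
    return list(samples[:n_samples])
```

For example, with an assumed sample rate of 16 kHz, a three-second reference corresponds to the first 48,000 samples of the vocal track.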