The official code repository for LeVo: High-Quality Song Generation with Multi-Preference Alignment

🚀 We introduce LeVo 2 (SongGeneration 2), an open-source music foundation model designed to shatter the ceiling of open-source AI music by achieving true commercial-grade generation.
Through a large-scale, rigorous expert evaluation (20 industry professionals, 6 core dimensions, 100 songs per model), LeVo 2 (SongGeneration 2) has proven its superiority:
📊 For detailed experimental setups and comprehensive metrics, please refer to the Evaluation Performance section below or our upcoming technical report.
📢 All the experimental results above are based on the latest checkpoint released on March 9th. If you downloaded the weights before March 9th, please re-download the latest checkpoint.
| Model | Max Length | Language | GPU Memory | RTF(H20) | Download Link |
|---|---|---|---|---|---|
| SongGeneration-base | 2m30s | zh | 10G/16G | 0.67 | Huggingface |
| SongGeneration-base-new | 2m30s | zh, en | 10G/16G | 0.67 | Huggingface |
| SongGeneration-base-full | 4m30s | zh, en | 12G/18G | 0.69 | Huggingface |
| SongGeneration-large | 4m30s | zh, en | 22G/28G | 0.82 | Huggingface |
| SongGeneration-v2-large | 4m30s | zh, en, es, ja, etc. | 22G/28G | 0.82 | Huggingface |
| SongGeneration-v2-medium | 4m30s | zh, en, es, ja, etc. | 12G/18G | 0.69 | Coming soon |
| SongGeneration-v2-fast | 4m30s | zh, en, es, ja, etc. | - | - | Coming soon |
💡 Notes:
To shatter the ceiling of open-source AI music and achieve commercial-grade generation, SongGeneration 2 introduces a paradigm shift in both its underlying architecture and training strategy.
Model Architecture: Hybrid LLM-Diffusion Architecture & Hierarchical Language Model
SongGeneration 2 adopts a hybrid LLM-Diffusion architecture to balance musicality and sound quality:
Training Strategy: Automated Aesthetic Evaluation & Multi-stage Progressive Post-Training
To resolve lyrical hallucinations and stiff musicality, we utilize a highly structured training pipeline:
Automated Aesthetic Evaluation Framework: We built a fine-grained evaluation framework trained on a massive expert-annotated dataset to provide the model with musicality priors.
Multi-stage Progressive Post-training: We implemented a 3-stage alignment process:
Stage 1 - SFT: Narrows the data distribution using high-quality songs to build a solid generation baseline.
Stage 2 - Large-scale Offline DPO: Utilizes ~200k strict positive/negative pairs to completely eliminate lyrical hallucinations and stabilize controllability.
Stage 3 - Semi-online DPO: Periodically updates the model based strictly on aesthetic scores to maximize musicality limits.
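Stages 2 and 3 both build on DPO. As a rough illustration, the generic DPO objective for a single preference pair can be sketched as follows. This is the textbook formulation, not necessarily the exact loss used to train SongGeneration 2; the function name, the `beta` default, and the log-probability inputs are all illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full sequence under
    the trainable policy or the frozen reference model. `beta` is a
    hypothetical default, not a value taken from this repository.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for both signs of margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy already prefers the chosen sample more than the reference does.
    print(dpo_pair_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss shrinks as the policy assigns relatively more probability to the preferred sample than the reference model does, which is the behaviour both DPO stages rely on.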
You can install the necessary dependencies using the requirements.txt file with Python>=3.8.12 and CUDA>=11.8:
pip install -r requirements.txt
pip install -r requirements_nodeps.txt --no-deps
(Optional) Then install Flash Attention from a prebuilt wheel on GitHub. For example, if you are using Python 3.10, CUDA 12, and PyTorch 2.6:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
docker pull juhayna/song-generation-levo:hf0613
docker run -it --gpus all --network=host juhayna/song-generation-levo:hf0613 /bin/bash
To ensure the model runs correctly, please download all the required folders from the original source at Hugging Face.
Download the ckpt and third_party folders from Hugging Face 1 or Hugging Face 2, and move them into the root directory of the project. You can also download them using huggingface-cli.
huggingface-cli download lglg666/SongGeneration-Runtime --local-dir ./runtime
mv runtime/ckpt ckpt
mv runtime/third_party third_party
Download the specific model checkpoint and save it to your checkpoint directory, ckpt_path. We provide multiple versions of the model checkpoint; select the version that best fits your needs and download the corresponding files. Make sure the folder name matches the model version name. You can also download models using huggingface-cli.
# download SongGeneration-base
huggingface-cli download lglg666/SongGeneration-base --local-dir ./songgeneration_base
# download SongGeneration-base-new
huggingface-cli download lglg666/SongGeneration-base-new --local-dir ./songgeneration_base_new
# download SongGeneration-base-full
huggingface-cli download lglg666/SongGeneration-base-full --local-dir ./songgeneration_base_full
# download SongGeneration-large
huggingface-cli download lglg666/SongGeneration-large --local-dir ./songgeneration_large
# download SongGeneration-v2-large
huggingface-cli download lglg666/SongGeneration-v2-large --local-dir ./songgeneration_v2_large
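The commands above follow one naming pattern: the local folder is the model name in lowercase with hyphens replaced by underscores. A minimal sketch of that convention is shown below. The helper name `download_command` is hypothetical, and the model list is simply copied from the commands above.

```python
# Model names copied from the download commands above.
KNOWN_MODELS = (
    "SongGeneration-base",
    "SongGeneration-base-new",
    "SongGeneration-base-full",
    "SongGeneration-large",
    "SongGeneration-v2-large",
)


def download_command(model_name, org="lglg666"):
    """Return the huggingface-cli argument list for one model checkpoint.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if model_name not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model_name!r}")
    # Folder name convention: lowercase, hyphens replaced by underscores.
    local_dir = "./" + model_name.lower().replace("-", "_")
    return [
        "huggingface-cli", "download",
        f"{org}/{model_name}",
        "--local-dir", local_dir,
    ]


if __name__ == "__main__":
    for name in KNOWN_MODELS:
        print(" ".join(download_command(name)))
```

Running the script prints one download command per model, which can be compared against the list above before executing anything.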
Once everything is set up, you can run the inference script using the following command:
sh generate.sh ckpt_path lyrics.jsonl output_path
You may provide sample inputs in JSON Lines (.jsonl) format. Each line represents an individual song generation request. The model expects each input to contain the following fields:
idx: A unique identifier for the output song. It will be used as the name of the generated audio file.
gt_lyric: The lyrics to be used in generation. It must follow the format of [Structure] Text, where Structure defines the musical section (e.g., [Verse], [Chorus]). See Input Guide.
descriptions: (Optional) You may customize the text prompt to guide the model's generation. This can include attributes like gender, genre, emotion, instrument. See Input Guide.
prompt_audio_path: (Optional) Path to a 10-second reference audio file. If provided, the model will generate a new song in a similar style to the given reference.
auto_prompt_audio_type: (Optional) Used only if prompt_audio_path is not provided. This allows the model to automatically select a reference audio from a predefined library based on a given style. Supported values include:
'Pop', 'Latin', 'Rock', 'Electronic', 'Metal', 'Country', 'R&B/Soul', 'Ballad', 'Jazz', 'World', 'Hip-Hop', 'Funk', 'Soundtrack', 'Auto'.
Note: If certain optional fields are not required, they can be omitted.
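The fields described above can be assembled programmatically. The sketch below writes a small .jsonl file. The helper names `make_request` and `write_jsonl` are hypothetical and not part of this repository; only the field names and the style list come from the description above.

```python
import json

# Supported values for auto_prompt_audio_type, as listed above.
AUTO_PROMPT_TYPES = {
    "Pop", "Latin", "Rock", "Electronic", "Metal", "Country", "R&B/Soul",
    "Ballad", "Jazz", "World", "Hip-Hop", "Funk", "Soundtrack", "Auto",
}


def make_request(idx, gt_lyric, descriptions=None,
                 prompt_audio_path=None, auto_prompt_audio_type=None):
    """Build one generation request; optional fields are omitted when unset."""
    request = {"idx": idx, "gt_lyric": gt_lyric}
    if descriptions is not None:
        request["descriptions"] = descriptions
    if prompt_audio_path is not None:
        request["prompt_audio_path"] = prompt_audio_path
    elif auto_prompt_audio_type is not None:
        # auto_prompt_audio_type is only used when no reference audio is given.
        if auto_prompt_audio_type not in AUTO_PROMPT_TYPES:
            raise ValueError(f"unsupported style: {auto_prompt_audio_type!r}")
        request["auto_prompt_audio_type"] = auto_prompt_audio_type
    return request


def write_jsonl(requests, path):
    """Write one JSON object per line, keeping non-ASCII lyrics readable."""
    with open(path, "w", encoding="utf-8") as f:
        for request in requests:
            f.write(json.dumps(request, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    demo = make_request(
        idx="demo_song",
        gt_lyric="[intro-short] ; [verse] First line. Second line ; [outro-short]",
        descriptions="female, pop, romantic, piano.",
    )
    write_jsonl([demo], "my_lyrics.jsonl")
```

The resulting file can then be passed to generate.sh in place of lyrics.jsonl.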
Outputs are written to output_path:
audio: generated audio files
jsonl: output JSONL files
An example command may look like:
sh generate.sh songgeneration_base sample/lyrics.jsonl sample/output
If you encounter out-of-memory (OOM) issues, you can manually enable low-memory inference mode using the --low_mem flag. For example:
sh generate.sh ckpt_path lyrics.jsonl output_path --low_mem
If your GPU device does not support Flash Attention or your environment does not have Flash Attention installed, you can disable it by adding the --not_use_flash_attn flag. For example:
sh generate.sh ckpt_path lyrics.jsonl output_path --not_use_flash_attn
By default, the model generates songs with both vocals and accompaniment. If you want to generate pure music, pure vocals, or separated vocal and accompaniment tracks, please use the following flags:
--bgm Generate pure music
--vocal Generate vocal-only (a cappella)
--separate Generate separated vocal and accompaniment tracks
For example:
sh generate.sh ckpt_path lyrics.jsonl output_path --separate
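When generation is driven from a script, the command line can be assembled in one place. The sketch below is a hypothetical wrapper around generate.sh: the function names are illustrative, and only the flags documented above are accepted. Whether several flags can be combined in one call is not stated here, so the demo passes a single flag.

```python
import subprocess

# Optional flags documented above.
OPTIONAL_FLAGS = {
    "--low_mem", "--not_use_flash_attn", "--bgm", "--vocal", "--separate",
}


def build_generate_command(ckpt_path, input_jsonl, output_path, flags=()):
    """Return the argument list for generate.sh, rejecting undocumented flags."""
    unknown = [flag for flag in flags if flag not in OPTIONAL_FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    return ["sh", "generate.sh", ckpt_path, input_jsonl, output_path, *flags]


def run_generate(ckpt_path, input_jsonl, output_path, flags=()):
    """Run generate.sh from the repository root; raises if the script fails."""
    cmd = build_generate_command(ckpt_path, input_jsonl, output_path, flags)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_generate(...) to actually launch it.
    print(" ".join(build_generate_command(
        "songgeneration_base", "sample/lyrics.jsonl", "sample/output",
        flags=["--low_mem"],
    )))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.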
An example input file can be found in sample/lyrics.jsonl and sample/test100_v2_sg_des.jsonl
The gt_lyric field defines the lyrics and structure of the song. It consists of multiple musical sections, each starting with a structure label. The model uses these labels to guide the musical and lyrical progression of the generated song.
The following segments should not contain lyrics (they are purely instrumental):
[intro-short], [intro-medium], [inst-short], [inst-medium], [outro-short], [outro-medium]
short indicates a segment of approximately 0–10 seconds
medium indicates a segment of approximately 10–20 seconds
The following segments require lyrics:
[verse], [chorus], [bridge]
To ensure optimal generation quality, please strictly adhere to the following punctuation and formatting rules:
Separate sections with a semicolon (;).
Do not use full-width (Chinese) punctuation (e.g., 。, ，, ！). All punctuation must be in English half-width format (e.g., ., ,).
Within lyrical sections ([verse], [chorus], [bridge]), use a period (.) to separate sentences or phrases.
No period (.) before the section separator (;): do not add a period (.) at the end of the final phrase in a lyrical block. Simply end the phrase, add a space, and use the section separator (;).
💡 A complete lyric string may look like:
🇺🇸 English Example:
[intro-medium] ; [verse] Trails wind through the forest. Trees stand tall and honest. Moss covers the logs. Sunlight starts to fondest. Birds sing in the branches. Days feel like a promise. ; [chorus] Forest is the sanctuary where the promise does fondest. Is the tree that stands through the storm's honest. Is the moss that covers the log's modest. Is the peace that makes the restless heart honest. ; [inst-medium] ; [verse] Squirrels scamper by. Nuts hide in the sky. Mushrooms grow below. Fungi start to fly. Streams trickle through. Days feel like a sigh. ; [chorus] Forest is the sanctuary where the promise does fondest. Is the tree that stands through the storm's honest. Is the moss that covers the log's modest. Is the peace that makes the restless heart honest. ; [bridge] Hiking through the forest where the trees do sigh. Feeling the peace that the woods supply. Forest days with you are the sweetest high. ; [chorus] Forest is the sanctuary where the promise does fondest. Is the tree that stands through the storm's honest. Is the moss that covers the log's modest. Is the peace that makes the restless heart honest. ; [outro-medium]
🇨🇳 Chinese Example:
[intro-medium]; [verse] 凌晨三点的便利店.冰柜发出持续的嗡鸣.穿西装的男人在挑饭团.领带松垮像投降的白旗.热食区的关东煮.在汤汁里慢慢膨胀 ; [chorus] 这里是城市的守夜人.收容所有流浪的灵魂.荧光灯照亮的面孔.都写着未完待续的故事 ; [inst-medium]; [verse] 收银员打着哈欠.扫描仪发出嘀嗒声响.找零的硬币落入掌心.带着金属的冰冷温度 ; [chorus] 这里是临时的避风港.用食物交换片刻温暖.即使最孤独的夜晚.也有泡面陪伴到天明 ; [bridge] 自动门开合之间.涌进带着酒气的风.一个女孩蹲在门口.喂食流浪的玳瑁猫 ; [chorus] 这里是不打烊的剧场.上演着无声的悲喜剧.而我们都是临时演员.在黎明前悄然退场 ; [outro-medium]
More examples can be found in sample/test100_v2_sg_des.jsonl.
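Writing these strings by hand is error-prone, so a small helper can assemble them. The sketch below is hypothetical and not part of this repository. It follows the Chinese example above: periods between phrases, no period after the last phrase of a block, and " ; " between sections. The full-width punctuation set it checks is illustrative, not exhaustive.

```python
INSTRUMENTAL = {
    "intro-short", "intro-medium", "inst-short",
    "inst-medium", "outro-short", "outro-medium",
}
LYRICAL = {"verse", "chorus", "bridge"}
# A few common full-width marks; illustrative only, not a complete list.
FULL_WIDTH = "。，！？；：、"


def build_gt_lyric(sections, phrase_sep=". "):
    """Assemble a gt_lyric string from (label, phrases) pairs.

    Use phrase_sep="." for Chinese lyrics, as in the Chinese example.
    """
    parts = []
    for label, phrases in sections:
        if label in INSTRUMENTAL:
            if phrases:
                raise ValueError(f"[{label}] must not contain lyrics")
            parts.append(f"[{label}]")
        elif label in LYRICAL:
            if not phrases:
                raise ValueError(f"[{label}] requires lyrics")
            # Drop trailing periods so the block never ends with '.'.
            cleaned = [p.strip().rstrip(".") for p in phrases]
            for phrase in cleaned:
                if ";" in phrase or any(ch in FULL_WIDTH for ch in phrase):
                    raise ValueError(f"unsupported punctuation in: {phrase!r}")
            parts.append(f"[{label}] " + phrase_sep.join(cleaned))
        else:
            raise ValueError(f"unknown structure label: {label!r}")
    return " ; ".join(parts)


if __name__ == "__main__":
    print(build_gt_lyric([
        ("intro-medium", []),
        ("verse", ["Trails wind through the forest", "Trees stand tall and honest"]),
        ("outro-medium", []),
    ]))
```

Rejecting bad input early is cheaper than discovering a formatting problem after a full generation run.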
The descriptions field allows you to control various musical attributes of the generated song. It can describe up to four musical dimensions:
Gender (e.g., male, female)
Genre (e.g., pop, jazz, rock)
Emotion (e.g., sad, energetic, romantic)
Instrument (e.g., piano, drums, guitar)
⚠️ CRITICAL FORMATTING RULE: Use Comma-Separated Tags, NOT Sentences. Combine specific keywords or tags using commas (,). Do not write full descriptive sentences or natural language paragraphs.
See the sample/description/ folder.
✅ Valid Inputs (Comma-separated keywords):
female, synth-pop, sweet, synthesizer, drum machine, bass, backing vocals.
rock, loving, electric guitar, bass guitar, drum kit.
❌ Invalid Inputs (Full sentences - DO NOT USE):
Please generate a sad pop song sung by a female artist using piano and drums.
A dark jazz song with a male singer.
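The difference between the valid and invalid inputs above can be checked before a request is submitted. The sketch below is a hypothetical helper, not part of this repository: it joins tags into a comma-separated string and uses a simple word-count heuristic to flag text that reads like a sentence. The three-word threshold is an assumption chosen to fit the examples above.

```python
def build_descriptions(tags):
    """Join keyword tags into a comma-separated descriptions string."""
    cleaned = []
    for tag in tags:
        tag = tag.strip().strip(".")
        if not tag:
            continue
        if "," in tag:
            raise ValueError(f"pass one tag per item, got: {tag!r}")
        cleaned.append(tag)
    if not cleaned:
        raise ValueError("at least one tag is required")
    # The valid examples above end with a period, so this does too.
    return ", ".join(cleaned) + "."


def looks_like_sentence(descriptions, max_words_per_tag=3):
    """Heuristic: a tag with many words is probably a sentence, not a keyword."""
    tags = [t.strip() for t in descriptions.rstrip(".").split(",")]
    return any(len(t.split()) > max_words_per_tag for t in tags)


if __name__ == "__main__":
    good = build_descriptions(["female", "synth-pop", "sweet", "drum machine"])
    print(good, looks_like_sentence(good))
    bad = "A dark jazz song with a male singer."
    print(bad, looks_like_sentence(bad))
```

The heuristic only catches obvious sentences; it does not verify that a tag is one the model was trained on.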
Avoid providing both prompt_audio_path and descriptions at the same time.
If prompt_audio_path is not provided, you can instead use auto_prompt_audio_type for automatic reference selection.
You can start up the UI with the following command:
sh tools/gradio/run.sh ckpt_path
To rigorously assess the generation capabilities of LeVo 2 (SongGeneration 2), we conducted a large-scale subjective evaluation involving 20 music professionals. The models were evaluated across six core dimensions: Overall Quality, Melody, Arrangement, Sound Quality-Instrument, Sound Quality-Vocal, and Structure.
As shown in the benchmarking results above, LeVo 2 (SongGeneration 2) comprehensively outperforms all existing open-source baselines and achieves generation quality that directly rivals top-tier closed-source commercial models.
@article{lei2025levo,
title={LeVo: High-Quality Song Generation with Multi-Preference Alignment},
author={Lei, Shun and Xu, Yaoxun and Lin, Zhiwei and Zhang, Huaicheng and Tan, Wei and Chen, Hangting and Yu, Jianwei and Zhang, Yixuan and Yang, Chenyu and Zhu, Haina and Wang, Shuai and Wu, Zhiyong and Yu, Dong},
journal={arXiv preprint arXiv:2506.07520},
year={2025}
}
The code and weights in this repository are released under the terms of the LICENSE file.
Scan the QR code below with WeChat or QQ.