[ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M is a high-quality text-to-video dataset designed for research institutions to enhance video quality, featuring high aesthetics, clarity, and resolution. It can be used for direct training or as a quality-tuning complement to other video datasets, and it also supports other video generation tasks (video super-resolution, frame interpolation, etc.).
We carefully curate 1 million high-quality video clips with expressive captions to advance text-to-video research; 0.4 million of these videos are in 1080p resolution (termed OpenVidHD-0.4M).
OpenVid-1M is cited, discussed, or used in several recent works, including video diffusion models (Goku, MarDini, Allegro, T2V-Turbo-V2, Pyramid Flow, SnapGen-V); the AR-based long video generation model ARLON; the visual understanding and generation model VILA-U; 3D/4D generation models (GenXD, DimensionX); the video VAE model IV-VAE; the frame interpolation model Framer; and the large multimodal model InternVL 2.5.
conda create -n openvid python=3.10
conda activate openvid
pip install torch torchvision
pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
# Downloading the full dataset takes a long time.
python download_scripts/download_OpenVid.py
The downloaded data is organized under the `./dataset` folder as follows:

```
dataset
└── OpenVid-1M
    ├── data
    │   └── train
    │       ├── OpenVid-1M.csv
    │       └── OpenVidHD.csv
    └── video
        ├── ---_iRTHryQ_13_0to241.mp4
        ├── ---agFLYkbY_7_0to303.mp4
        ├── --0ETtekpw0_2_18to486.mp4
        └── ...
```
| Model | Data | Pretrained Weight | Steps | Batch Size | URL |
|---|---|---|---|---|---|
| STDiT-16×1024×1024 | OpenVidHQ | STDiT-16×512×512 | 16k | 32×4 | :link: |
| STDiT-16×512×512 | OpenVid-1M | STDiT-16×256×256 | 20k | 32×8 | :link: |
| MVDiT-16×512×512 | OpenVid-1M | MVDiT-16×256×256 | 20k | 32×4 | :link: |
Our model's weights are partially initialized from PixArt-α.
# MVDiT, 16x512x512
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/mvdit/inference/16x512x512.py --ckpt-path MVDiT-16x512x512.pt
# STDiT, 16x512x512
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/stdit/inference/16x512x512.py --ckpt-path STDiT-16x512x512.pt
# STDiT, 16x1024x1024
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/stdit/inference/16x1024x1024.py --ckpt-path STDiT-16x1024x1024.pt
# MVDiT, 16x256x256, 72k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/mvdit/train/16x256x256.py
# MVDiT, 16x512x512, 20k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/mvdit/train/16x512x512.py
# STDiT, 16x256x256, 72k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x256x256.py
# STDiT, 16x512x512, 20k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x512x512.py
# STDiT, 16x1024x1024, 16k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x1024x1024.py
Training order: 16×256×256 $\rightarrow$ 16×512×512 $\rightarrow$ 16×1024×1024.
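In a progressive schedule like this, each stage starts from the previous stage's checkpoint; weights whose shapes depend on resolution (e.g. positional embeddings) cannot be copied directly. The sketch below shows a generic way to do such partial initialization in PyTorch. This is an illustration under those assumptions, not the repo's actual loading code.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Toy stand-in for one training stage: a resolution-dependent
    positional parameter plus a resolution-agnostic layer."""
    def __init__(self, n_pos):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(n_pos, 4))
        self.fc = nn.Linear(4, 4)

def init_from_lower_res(model, ckpt_path):
    """Initialize a higher-resolution stage from a lower-resolution checkpoint.

    Shape-compatible tensors are copied; mismatched ones (e.g. positional
    embeddings) keep their fresh initialization. Returns the skipped keys.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    compatible = {k: v for k, v in state.items()
                  if k in own and own[k].shape == v.shape}
    model.load_state_dict(compatible, strict=False)
    return [k for k in own if k not in compatible]
```

`strict=False` is what allows the shape-mismatched parameters to be left out; inspecting the returned key list confirms exactly which tensors were re-initialized for the new resolution.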
Part of the code is based upon Open-Sora. Thanks for their great work!
@article{nan2024openvid,
title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
journal={arXiv preprint arXiv:2407.02371},
year={2024}
}