OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

OpenVid-1M

OpenVid-1M is a high-quality text-to-video dataset designed for research institutions, featuring videos with high aesthetics, clarity, and resolution. It can be used for direct training or as a quality-tuning complement to other video datasets, and it also supports other video generation tasks such as video super-resolution and frame interpolation.

We carefully curate 1 million high-quality video clips with expressive captions to advance text-to-video research, of which 0.4 million videos are in 1080p resolution (termed OpenVidHD-0.4M).

OpenVid-1M is cited, discussed, or used in several recent works, including the video diffusion models Goku, MarDini, Allegro, T2V-Turbo-V2, Pyramid Flow, and SnapGen-V; the AR-based long video generation model ARLON; the visual understanding and generation model VILA-U; the 3D/4D generation models GenXD and DimentionX; the video VAE model IV-VAE; the frame interpolation model Framer; and the large multimodal model InternVL 2.5.

News 🚀🚀🚀

  • [2025.05.30] 🤗 We have uploaded a separate OpenVidHD-0.4M for convenient download. This is helpful if you only want to use OpenVidHD-0.4M, which requires about 4.5TB of storage space. You can open OpenVidHD.json to view the list of video names included in each ZIP file (see the lookup sketch after this list).
  • [2025.02.28] 🙏 Thanks to @Binglei, OpenVid-1M-mapping was developed to correlate the video names in the CSV files with their file paths in the unzipped files. It is particularly useful if you only need a portion of OpenVid-1M and prefer not to download the entire collection.
  • [2025.01.23] 🏆 OpenVid-1M is accepted by ICLR 2025!
  • [2024.12.01] 🚀 The OpenVid-1M dataset was downloaded over 79,000 times on Hugging Face last month, placing it in the top 1% of all video datasets (as of Nov. 2024)!
  • [2024.07.01] 🔥 Our paper, code, model, and OpenVid-1M dataset are released!
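
If you only need a few clips from OpenVidHD-0.4M, a small lookup over OpenVidHD.json can tell you which ZIP to fetch. The sketch below assumes the JSON maps each ZIP file name to the list of video names it contains; the actual schema may differ, so treat it as a starting point.

```python
import json

# Hypothetical helper: find which OpenVidHD-0.4M ZIP file contains a given video.
# Assumption: OpenVidHD.json maps each ZIP file name to a list of video names.
def find_zip_for_video(json_path: str, video_name: str) -> str | None:
    with open(json_path, "r") as f:
        zip_to_videos = json.load(f)
    for zip_name, video_names in zip_to_videos.items():
        if video_name in video_names:
            return zip_name
    return None

if __name__ == "__main__":
    print(find_zip_for_video("OpenVidHD.json", "---_iRTHryQ_13_0to241.mp4"))
```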

Preparation

Environment

conda create -n openvid python=3.10
conda activate openvid
pip install torch torchvision
pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
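
If the installation succeeded, the CUDA-dependent packages should all import cleanly. The snippet below is a minimal sanity check (not part of the repo):

```python
# Minimal environment sanity check: verify that the packages installed above
# import correctly and that CUDA is visible to PyTorch.
import torch
import flash_attn
import xformers
import apex  # noqa: F401  (apex exposes no reliable __version__; a clean import is enough)

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
print("xformers:", xformers.__version__)
```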

Dataset

  1. Download the OpenVid-1M dataset.
# This takes a long time to complete.
python download_scripts/download_OpenVid.py
  2. Put the OpenVid-1M dataset in the ./dataset folder so it matches the layout below (a sanity-check sketch follows the directory tree).
dataset
└─ OpenVid-1M
    └─ data
        └─ train
            └─ OpenVid-1M.csv
            └─ OpenVidHD.csv
    └─ video
        └─ ---_iRTHryQ_13_0to241.mp4
        └─ ---agFLYkbY_7_0to303.mp4
        └─ --0ETtekpw0_2_18to486.mp4
        └─ ...
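
A quick consistency check (a minimal sketch, not part of the repo) can confirm that the clips referenced in the metadata CSV were actually downloaded and unzipped. It assumes the CSV has a `video` column holding clip file names and that all clips sit flat under `dataset/OpenVid-1M/video`; adjust the column name and paths if the real layout differs.

```python
import os
import pandas as pd

# Sanity-check sketch (assumptions: the metadata CSV has a "video" column with
# clip file names, and all clips were unzipped flat into dataset/OpenVid-1M/video).
csv_path = "dataset/OpenVid-1M/data/train/OpenVid-1M.csv"
video_dir = "dataset/OpenVid-1M/video"

df = pd.read_csv(csv_path)
missing = [name for name in df["video"] if not os.path.exists(os.path.join(video_dir, name))]
print(f"{len(df)} clips listed, {len(missing)} missing from {video_dir}")
```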

Model Weight

| Model | Data | Pretrained Weight | Steps | Batch Size | URL |
|---|---|---|---|---|---|
| STDiT-16×1024×1024 | OpenVidHQ | STDiT-16×512×512 | 16k | 32×4 | :link: |
| STDiT-16×512×512 | OpenVid-1M | STDiT-16×256×256 | 20k | 32×8 | :link: |
| MVDiT-16×512×512 | OpenVid-1M | MVDiT-16×256×256 | 20k | 32×4 | :link: |

Our model weights are partially initialized from PixArt-α.
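
The training configs handle this initialization, but as a rough illustration (not the repo's actual code), a partial initialization of this kind typically copies only the parameters whose names and shapes match the pretrained checkpoint:

```python
import torch

def load_partial_weights(model, ckpt_path):
    """Illustrative partial initialization: copy only the parameters whose
    names and shapes match a pretrained checkpoint (e.g. PixArt-alpha)."""
    # Assumption: the checkpoint file is a plain state_dict.
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own_state = model.state_dict()
    matched = {
        k: v for k, v in pretrained.items()
        if k in own_state and own_state[k].shape == v.shape
    }
    own_state.update(matched)
    model.load_state_dict(own_state)
    print(f"initialized {len(matched)}/{len(own_state)} tensors from {ckpt_path}")
```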

Inference

# MVDiT, 16x512x512
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/mvdit/inference/16x512x512.py --ckpt-path MVDiT-16x512x512.pt
# STDiT, 16x512x512
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/stdit/inference/16x512x512.py --ckpt-path STDiT-16x512x512.pt
# STDiT, 16x1024x1024
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/stdit/inference/16x1024x1024.py --ckpt-path STDiT-16x1024x1024.pt

Training

# MVDiT, 16x256x256, 72k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/mvdit/train/16x256x256.py
# MVDiT, 16x512x512, 20k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/mvdit/train/16x512x512.py

# STDiT, 16x256x256, 72k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x256x256.py
# STDiT, 16x512x512, 20k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x512x512.py
# STDiT, 16x1024x1024, 16k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x1024x1024.py

Training order: 16×256×256 $\rightarrow$ 16×512×512 $\rightarrow$ 16×1024×1024.

References

Part of the code is based upon Open-Sora. Thanks for their great work!

Citation

@article{nan2024openvid,
  title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
  author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2407.02371},
  year={2024}
}