Wen Wang1*, Kangyang Xie1*, Zide Liu1*, Hao Chen1, Yue Cao2, Xinlong Wang2, Chunhua Shen1
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models and requires no training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we exploit the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
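To make the bi-directional temporal modeling concrete, below is a minimal sketch of cross-frame attention, using our own illustrative naming rather than the repository's API: the projections of a pretrained image self-attention layer are reused unchanged, but each frame's queries attend to the keys and values of all frames, so temporal information flows both forward and backward at test time.

```python
import torch

def cross_frame_attention(q, k, v):
    """Minimal sketch of bi-directional cross-frame attention.

    q, k, v: (frames, tokens, dim) projections taken from a pretrained
    image self-attention layer, applied per frame without any training.
    """
    f, t, d = k.shape
    # Flatten keys/values across the temporal axis so every query token
    # can attend to every spatial token of every frame (both directions).
    k_all = k.reshape(1, f * t, d).expand(f, -1, -1)
    v_all = v.reshape(1, f * t, d).expand(f, -1, -1)
    attn = torch.softmax(q @ k_all.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_all  # (frames, tokens, dim), temporally fused
```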
Video editing with off-the-shelf image diffusion models.
No training on any video.
Promising results in editing attributes, subjects, places, etc., in real-world videos.
pip install -r requirements.txt
Installing xformers is highly recommended for improved efficiency and speed on GPUs.
[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. The pre-trained Stable Diffusion models can be downloaded from 🤗 Hugging Face (e.g., Stable Diffusion v1-4, v2-1). We use Stable Diffusion v1-4 by default.
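As a reference for fetching the default weights, here is a minimal diffusers sketch (the repo's own scripts may load the checkpoint differently); the model id and the fp16/xformers options are standard diffusers usage, not specific to vid2vid-zero:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download Stable Diffusion v1-4 from the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe.enable_xformers_memory_efficient_attention()  # optional, needs xformers
pipe = pipe.to("cuda")
```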
Simply run:
accelerate launch test_vid2vid_zero.py --config path/to/config
For example:
accelerate launch test_vid2vid_zero.py --config configs/car-moving.yaml
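For orientation, a config might look like the hypothetical sketch below; the field names are illustrative assumptions on our part, so consult configs/car-moving.yaml for the actual schema:

```yaml
# Hypothetical config sketch -- field names are illustrative, not the
# repository's actual schema; see configs/car-moving.yaml for reference.
pretrained_model_path: checkpoints/stable-diffusion-v1-4
input_video: data/car-moving.mp4
source_prompt: "A car is moving on the road"
edit_prompts:
  - "A jeep car is moving on the desert"
guidance_scale: 7.5
use_null_inversion: true  # slower, but better text-to-video alignment
```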
Launch the local demo built with gradio:
python app.py
Or you can use our online gradio demo here.
Note that we disable Null-text Inversion and enable fp16 for faster demo response.
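For context, Null-text Inversion (the step disabled in the demo) optimizes the unconditional "null" embedding so that classifier-free-guided denoising stays on the DDIM-inverted trajectory of the source video. Below is a minimal sketch of one optimization step, with hypothetical helper names (step_fn stands in for a scheduler's DDIM update, e.g. scheduler.step(...).prev_sample):

```python
import torch
import torch.nn.functional as F

def nulltext_step(unet, z_t, z_star_prev, t, cond_emb, null_emb,
                  guidance_scale, step_fn, lr=1e-2):
    """One hypothetical null-text optimization step.

    z_t: current latent; z_star_prev: DDIM-inverted target for step t-1;
    step_fn: assumed DDIM update taking (noise_pred, t, z_t) -> z_{t-1}.
    """
    null_emb = null_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([null_emb], lr=lr)
    opt.zero_grad()
    eps_uncond = unet(z_t, t, encoder_hidden_states=null_emb).sample
    eps_cond = unet(z_t, t, encoder_hidden_states=cond_emb).sample
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    # Penalize drift of the guided trajectory from the inversion trajectory.
    loss = F.mse_loss(step_fn(eps, t, z_t), z_star_prev)
    loss.backward()
    opt.step()
    return null_emb.detach()
```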
| Input Video | Output Video |
| :-- | :-- |
| "A car is moving on the road" | "A Porsche car is moving on the desert" |
| "A car is moving on the road" | "A jeep car is moving on the snow" |
| "A man is running" | "Stephen Curry is running in Time Square" |
| "A man is running" | "A man is running in New York City" |
| "A child is riding a bike on the road" | "A child is riding a bike on the flooded road" |
| "A child is riding a bike on the road" | "A lego child is riding a bike on the road" |
| "A car is moving on the road" | "A car is moving on the snow" |
| "A car is moving on the road" | "A jeep car is moving on the desert" |
@article{vid2vid-zero,
title={Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models},
author={Wang, Wen and Xie, Kangyang and Liu, Zide and Chen, Hao and Cao, Yue and Wang, Xinlong and Shen, Chunhua},
journal={arXiv preprint arXiv:2303.17599},
year={2023}
}
Acknowledgements: this project builds on Tune-A-Video, diffusers, and prompt-to-prompt.
We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns.
If you are interested in working with us on foundation models, visual perception, and multimodal learning, please contact Xinlong Wang (wangxinlong@baai.ac.cn) and Yue Cao (caoyue@baai.ac.cn).