Enhancing Text-to-Video Editing with Motion Map Injection

Seong-Hun Jeong*, In-Hwan Jin*, Haesoo Choo*, Hyeonjun Na*, and Kyeongbo Kong
Pukyong National University, Pusan National University
Creative Video Editing and Understanding (CVEU), 2023 Oral Presentation

*Indicates Equal Contribution

Red words indicate prompt tokens to which Prompt-to-Prompt editing ('replace' or 'refine') is applied;
blue words indicate prompt tokens to which Attention Flow with the Motion Map Injection module is applied.





Our Contribution

Our proposed Motion Map Injection (MMI) module provides an effective way to inject motion extracted from a video into the attention map of the prompt.
To the best of our knowledge, our study is the first attempt to use optical flow in the context of text-to-video editing.
We found that editing a video with the motion extracted from that video improves overall editing performance.

Comparison of video editing outputs and attention maps against existing methods. Both Image-P2P and Video-P2P fail to accurately estimate the attention map, resulting in a discrepancy with the prompt. Our framework performs realistic video editing by obtaining accurate attention maps guided by optical flow. The following figure describes our proposed method.
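The comparison above relies on a motion map derived from dense optical flow. As a rough illustration of that step, the sketch below converts two consecutive frames into a normalized per-pixel motion-magnitude map; the Farneback flow estimator and the function name motion_map_from_frames are stand-ins chosen for illustration, not necessarily the components used in our implementation.

import cv2
import numpy as np

def motion_map_from_frames(frame_prev: np.ndarray, frame_next: np.ndarray) -> np.ndarray:
    """Return a [0, 1] per-pixel motion-magnitude map for one frame pair."""
    gray_prev = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    gray_next = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)

    # Dense optical flow: an (H, W, 2) array of per-pixel (dx, dy) displacements.
    flow = cv2.calcOpticalFlowFarneback(
        gray_prev, gray_next, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    # Keep only the magnitude and scale it to [0, 1] so it is comparable with attention maps.
    magnitude, _angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return cv2.normalize(magnitude, None, 0.0, 1.0, cv2.NORM_MINMAX)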



Abstract

Text-guided video editing studies have recently increased due to the impressive performance of text-to-image diffusion models. Existing research on video editing has introduced an implicit method for estimating inter-frame attention from cross-frame attention, which produces temporally consistent videos. However, because these techniques rely on generative models trained on text-image pair data, they are unable to handle motion, which is a distinctive feature of video. When a video is edited by manipulating prompts, the attention map of a prompt that implies the motion of the video (such as “running” or “moving”) is likely to be inaccurately estimated, which results in errors in video editing. In this article, we propose the “Motion Map Injection” (MMI) module to carry out precise video editing by explicitly taking motion into account. The MMI module offers a simple but effective way to provide a text-to-video (T2V) model with the motion information of the video.





Framework

Overall framework of this study. First, the T2V model generates an attention map from the input video and prompt. At the same time, the Motion Map Injection module receives the video frames, generates a motion map, and injects it into the attention map of the T2V model. Text-to-video editing is then performed using the attention map that includes the video motion information.
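As a minimal sketch of the injection step, the snippet below blends a token's cross-attention map with the motion map after resizing; the convex-combination rule and the alpha weight are illustrative assumptions rather than the exact formulation of the MMI module.

import cv2
import numpy as np

def inject_motion(attn_map: np.ndarray, motion_map: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend a token's cross-attention map with the video motion map."""
    # Match the (usually low-resolution) attention-map resolution.
    h, w = attn_map.shape
    motion_resized = cv2.resize(motion_map, (w, h), interpolation=cv2.INTER_LINEAR)

    # Convex combination: keep part of the original attention, add the motion cue.
    injected = (1.0 - alpha) * attn_map + alpha * motion_resized

    # Renormalize to [0, 1] so downstream editing code receives a valid map.
    return cv2.normalize(injected, None, 0.0, 1.0, cv2.NORM_MINMAX)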



Qualitative Results

We conducted extensive experiments on existing T2V models (vid2vid-zero, FateZero) in addition to Video-P2P. The following are the experimental results.























Application

Since optical flow carries the direction of pixel movement as well as its magnitude, our module can be applied to let the user edit content along a specific direction: the optical flow is rotated according to the user-provided direction before being injected, as sketched below.
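A minimal sketch of this direction control, assuming the flow field is an (H, W, 2) array of (dx, dy) vectors; the angle convention and the function name rotate_flow are hypothetical.

import numpy as np

def rotate_flow(flow: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate every (dx, dy) flow vector by angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]], dtype=flow.dtype)
    # (H, W, 2) @ (2, 2): rotates each 2-vector in a single matrix multiply.
    return flow @ rot.T

The rotated flow is then converted into a motion map and injected exactly as in the unrotated case, so the edit follows the user-specified direction.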





Ablation Study


From the left: results from applying cv2.TM_SQDIFF, cv2.TM_CCORR, and cv2.TM_CCOEFF as the correlation calculation method.

From the left: results from applying cv2.TM_SQDIFF_NORMED, cv2.TM_CCORR_NORMED, and cv2.TM_CCOEFF_NORMED as the correlation calculation method.

Ablation study on which method for calculating the correlation between the attention maps and the motion map yields the best editing quality. We used cv2.TM_CCOEFF_NORMED to calculate the correlation for realistic video editing.
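For reference, a sketch of how such a correlation score can be measured with OpenCV: since the attention map and the resized motion map share the same shape, cv2.matchTemplate returns a single normalized correlation-coefficient value. The helper name and the resizing step are assumptions for illustration.

import cv2
import numpy as np

def attention_motion_correlation(attn_map: np.ndarray, motion_map: np.ndarray) -> float:
    """Normalized correlation coefficient (cv2.TM_CCOEFF_NORMED) between two equally sized maps."""
    attn = attn_map.astype(np.float32)
    motion = cv2.resize(motion_map.astype(np.float32),
                        (attn.shape[1], attn.shape[0]))
    # With image and template of equal size, the result is a 1x1 array.
    score = cv2.matchTemplate(attn, motion, cv2.TM_CCOEFF_NORMED)
    return float(score[0, 0])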





Followup Works

We used the Video-P2P: Video Editing with Cross-attention Control model, which is built on text-to-image diffusion, as our base model. We then expanded our experiments to show that our module is generally suitable for other T2V models, vid2vid-zero and FateZero.