Motion-to-Attention: Attention Motion Composer using Optical Flow for Text-to-Video Editing

Seong-Hun Jeong*, Inhwan Jin*, Haesoo Choo*, Hyeonjun Na*, Kyeongbo Kong
Pusan National University, Pukyong National University
*Indicates Equal Contribution

Corresponding Author

Red words indicate prompts to which Prompt-to-Prompt editing ('replace' or 'refine') is applied.
Blue words indicate prompts to which the Attention Flow with Motion Map Injection module is applied.





Abstract

Recent text-guided video editing research attempts to extend text-guided image editing models from images to videos. To this end, most research focuses on achieving temporal consistency between frames as the primary challenge in text-guided video editing. However, despite these efforts, editability remains limited when a prompt indicates motion, such as "floating". In our experiments, we found that this phenomenon is due to the inaccurate attention map of the motion prompt. In this paper, we propose the Motion-to-Attention (M2A) module to perform precise video editing by explicitly taking motion into account. First, we convert the optical flow extracted from the video into a motion map. During conversion, users can selectively apply direction information to extract the motion map. The proposed M2A module uses two methods: "Attention-Motion Swap", which directly replaces the attention map with the motion map, and "Attention-Motion Fusion", which uses the association between the motion map and the attention map, measured by a fusion metric, as a weight to enhance the attention map with the motion map. The text-to-video editing model with the proposed M2A module shows better quantitative and qualitative results than existing models.
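As a rough illustration of the conversion step described in the abstract, the sketch below turns dense optical flow into a normalized motion map using only its magnitude. The function name, array shape, and min-max normalization are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch: optical flow -> motion map using magnitude only.
# Assumes `flow` is an (H, W, 2) array of per-pixel displacements.
import numpy as np

def flow_to_motion_map(flow: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Convert optical flow to a [0, 1] motion map from its magnitude."""
    magnitude = np.linalg.norm(flow, axis=-1)  # (H, W) per-pixel motion strength
    return (magnitude - magnitude.min()) / (magnitude.max() - magnitude.min() + eps)
```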

Our Contribution

Our proposed Motion-to-Attention (M2A) module provides a way to effectively inject the motion extracted from a video into the attention map of the prompt. We found that editing a video with its extracted motion improves general editing performance and enables selective editing according to the direction in which an object moves.

The existing T2V model fails to estimate an accurate attention map for the motion prompt {Floating}, which reduces editability, as shown in the top row of (b). (a) compares, on the input video, the editability of Video-P2P with and without the proposed module; the proposed module improves the editability of existing video editing models through accurate estimation of attention maps. (b) briefly illustrates how the proposed Motion-to-Attention module enhances the attention map to address the limitation that the existing T2V model cannot generate it accurately.



Framework

The left side of the figure shows the overall framework of video editing with an enhanced attention map. First, the Text-to-Video (T2V) model generates an attention map from the input video and prompts. Simultaneously, the optical flow estimation model estimates the optical flow from the input video frames. By default, the estimated optical flow is converted to a motion map using only its magnitude. Optionally, when direction information is provided by the user, Direction Control converts the optical flow into a motion map that responds only to movement in the user-specified direction. If the user marks directional words with square brackets [ ], the model captures the direction information and performs Direction Control. The motion map is then injected into the attention map of the T2V model by the M2A module in two ways: Attention-Motion Swap and Attention-Motion Fusion. After that, text-to-video editing is performed using the attention map enhanced by the motion map. The right side of the figure shows how the Attention-Motion Swap and Attention-Motion Fusion of the M2A module enhance the attention map with the motion map.
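Under simplifying assumptions, the two injection modes can be sketched as follows. Here `attn` and `motion` are assumed to be resized to the same resolution and normalized to [0, 1], and a plain normalized cross-correlation stands in for the fusion metric; the paper's exact metric and weighting may differ.

```python
# Illustrative sketch of Attention-Motion Swap and Attention-Motion Fusion.
import numpy as np

def attention_motion_swap(attn: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Directly replace the attention map with the motion map."""
    return motion

def attention_motion_fusion(attn: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Blend the motion map into the attention map, weighted by their similarity."""
    a = (attn - attn.mean()) / (attn.std() + 1e-8)
    m = (motion - motion.mean()) / (motion.std() + 1e-8)
    weight = np.clip((a * m).mean(), 0.0, 1.0)  # fusion-metric weight in [0, 1]
    return (1.0 - weight) * attn + weight * motion
```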



Qualitative Results

First, we provide results for 8 frames and 24 frames. Our method works not only with 8 frames, the length commonly used by most video editing models, but also with up to 24 frames.

We conducted extensive experiments on existing T2V models (vid2vid-zero, FateZero) in addition to Video-P2P. The following are the experimental results.





Direction Control

Since optical flow contains the direction of pixel movement as well as its magnitude, our model allows the user to edit content moving in a specific direction by rotating the optical flow according to the user-provided direction before injecting it.
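A hedged sketch of this idea is shown below: instead of explicitly rotating the flow field, it keeps only the motion whose direction lies within a tolerance of the user-specified angle. The function name, the angular-threshold formulation, and the default tolerance are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch of direction control: keep motion aligned with a user-specified direction.
import numpy as np

def directional_motion_map(flow: np.ndarray, direction_deg: float,
                           tolerance: float = 45.0) -> np.ndarray:
    """Motion map that responds only to movement in the given direction."""
    magnitude = np.linalg.norm(flow, axis=-1)
    angle = np.degrees(np.arctan2(flow[..., 1], flow[..., 0]))      # per-pixel flow direction
    diff = np.abs((angle - direction_deg + 180.0) % 360.0 - 180.0)  # wrapped angular distance
    mask = (diff <= tolerance).astype(np.float32)
    motion_map = magnitude * mask
    return motion_map / (motion_map.max() + 1e-8)
```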





Ablation Study



In the ablation study, correlation-based normalized metrics (Normalized Cross Correlation, Normalized Cross Coefficient) outperform other metrics, including the Spectral Angle Mapper. In some cases, both the Structural Similarity Index and Mutual Information show novel results that were not observed with other template matching metrics. These results show that it is important to choose the template matching metric carefully.
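For reference, a minimal sketch of one such metric, normalized cross-correlation between the attention map and the motion map, is given below; the other metrics in the ablation would be substituted at the same point. This is an assumed formulation for illustration, not the authors' exact code.

```python
# Normalized cross-correlation as one candidate fusion/template-matching metric.
import numpy as np

def normalized_cross_correlation(attn: np.ndarray, motion: np.ndarray) -> float:
    """NCC in [-1, 1]; higher values mean the two maps agree more strongly."""
    a = attn - attn.mean()
    m = motion - motion.mean()
    denom = np.sqrt((a ** 2).sum() * (m ** 2).sum()) + 1e-8
    return float((a * m).sum() / denom)
```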





Followup Works

We used the Video-P2P (Video Editing with Cross-Attention Control) model, which is based on text-to-image diffusion, as our base model. We then expanded our experiments to show that our module is generally suitable for other T2V models, vid2vid-zero and FateZero.



BibTeX

@article{2024Motiontoattention,
    title={Motion-to-Attention: Enhancing Attention Maps to Improve Performance of Text-Guided Video Editing Models},
    year={2024}
}