We evaluate video editing performance on complex videos containing multiple objects and diverse motion patterns. Existing text-guided video editing methods often fail to accurately localize and preserve object motions under such challenging conditions, whereas our method consistently produces coherent and accurate editing results.
Furthermore, we conduct additional experiments on videos that combine camera motion with multiple objects and diverse motion patterns. Despite the added complexity introduced by camera motion, our method continues to deliver high-quality and stable editing results.
To reflect realistic scenarios in which no user-provided prompt is available for the input video, we also evaluate our method using prompts generated automatically by a video captioning model. When applying M2A, the editing results obtained with user-written prompts and with automatically generated captions are comparably accurate and reliable, demonstrating the robustness of our approach.
* Zhang, Boqiang, et al. "VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding." arXiv preprint arXiv:2501.13106 (2025).
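For clarity, the prompt-free evaluation protocol can be summarized with the short sketch below. This is a minimal illustration only: `caption_video` and `edit_video` are hypothetical wrappers standing in for the interface of the video captioning model (e.g., VideoLLaMA 3) and for our editing pipeline, respectively, and neither name refers to released code.

```python
# Minimal sketch of the prompt-free evaluation setting.
# Assumptions (hypothetical, not released code):
#   - caption_video(frames) wraps a video captioning model (e.g., VideoLLaMA 3)
#     and returns a single text description of the clip.
#   - edit_video(frames, source_prompt, target_prompt) wraps the editing pipeline.
from typing import Callable, List
import numpy as np

def prompt_free_edit(
    frames: List[np.ndarray],
    target_prompt: str,
    caption_video: Callable[[List[np.ndarray]], str],
    edit_video: Callable[[List[np.ndarray], str, str], List[np.ndarray]],
) -> List[np.ndarray]:
    # The source prompt is produced automatically by the captioning model
    # and then used exactly as a user-written prompt would be.
    source_prompt = caption_video(frames)
    return edit_video(frames, source_prompt, target_prompt)
```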
The proposed M2A module enhances attention distributions using optical-flow-based motion cues, so its performance may degrade in scenarios where reliable motion estimation is difficult, such as videos with severe motion blur, low-texture regions, heavy occlusions, or complex crowd scenes. In these cases, inaccurate or ambiguous optical flow produces noisy motion maps, which limits the effectiveness of motion-guided attention enhancement and leads to suboptimal editing quality.
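To make the dependence on flow quality concrete, the sketch below shows one simple form of flow-guided attention modulation. It is not the exact M2A formulation: it assumes a precomputed optical flow field, a square spatial token grid, and an additive bias on the attention logits controlled by a hypothetical strength parameter `lam`.

```python
import torch
import torch.nn.functional as F

def motion_guided_attention(q, k, v, flow, lam=1.0):
    """Toy flow-guided attention bias (illustrative, not the exact M2A module).

    q, k, v: (B, N, C) token features, with N = H * W spatial tokens.
    flow:    (B, 2, H_f, W_f) precomputed optical flow for the frame.
    lam:     assumed hyperparameter controlling the motion-bias strength.
    """
    B, N, C = q.shape
    H = W = int(N ** 0.5)                        # assumes a square token grid

    # Motion map: per-token flow magnitude, resized to the token grid
    # and normalized to [0, 1].
    mag = flow.norm(dim=1, keepdim=True)         # (B, 1, H_f, W_f)
    mag = F.interpolate(mag, size=(H, W), mode="bilinear", align_corners=False)
    mag = mag.flatten(1)                         # (B, N)
    mag = (mag - mag.amin(1, keepdim=True)) / (
        mag.amax(1, keepdim=True) - mag.amin(1, keepdim=True) + 1e-6)

    # Scaled dot-product attention with an additive motion bias: keys in
    # high-motion regions receive larger logits, so attention concentrates
    # on moving content. Noisy flow therefore yields a noisy bias.
    logits = q @ k.transpose(1, 2) / C ** 0.5    # (B, N, N)
    logits = logits + lam * mag[:, None, :]      # bias over key positions
    attn = logits.softmax(dim=-1)
    return attn @ v
```

When flow estimation is unreliable (e.g., under motion blur, occlusion, or low texture), the normalized magnitude map above becomes noisy, and this noise propagates directly into the attention bias, which is the failure mode discussed above.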