¹University of Science and Technology of China ²Tencent Hunyuan
Important: If you notice temporal inconsistency between the original and edited videos, it is usually caused by webpage loading. Please refresh the page and try again. On Windows systems, playback problems may occur for reasons we have not yet identified.
Existing instruction-based video editing datasets commonly focus on single-task appearance editing and fail to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset featuring 2 million high-quality, instruction-aligned video editing pairs. It is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations (e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems, and we introduce a progressive filtering system that safeguards data reliability throughout the whole process. Furthermore, we explore the optimal network structures on Goku and propose Goku-Edit. To deeply comprehend complex editing instructions, Goku-Edit leverages an MLLM as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. We also propose a comprehensive video editing benchmark, Goku-Bench, with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit achieves up to a +8% improvement in instruction following over other open-source models.
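The abstract does not spell out how the progressive filtering system works, so the following is only a minimal sketch of the general idea of stage-wise filtering, in which each candidate pair must pass every check in order and per-stage drop counts are recorded. The `EditPair` record, the score names, and the thresholds are hypothetical and chosen purely for illustration; they are not taken from the Goku pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class EditPair:
    """Hypothetical record for one synthesized video editing pair."""
    instruction: str
    scores: dict[str, float] = field(default_factory=dict)


# A stage is a (name, predicate) pair; the predicate returns True to keep a pair.
Stage = tuple[str, Callable[[EditPair], bool]]


def progressive_filter(
    pairs: Iterable[EditPair], stages: list[Stage]
) -> tuple[list[EditPair], dict[str, int]]:
    """Apply the stages in order, passing only survivors to the next stage.

    Returns the surviving pairs and the number of pairs dropped at each stage.
    """
    kept = list(pairs)
    dropped = {name: 0 for name, _ in stages}
    for name, keep in stages:
        survivors = [p for p in kept if keep(p)]
        dropped[name] = len(kept) - len(survivors)
        kept = survivors
    return kept, dropped


if __name__ == "__main__":
    # Illustrative scores and thresholds only (assumptions, not reported values).
    candidates = [
        EditPair("replace the car with a bus", {"quality": 0.9, "alignment": 0.8}),
        EditPair("make the dog jump", {"quality": 0.2, "alignment": 0.9}),
        EditPair("turn day into night", {"quality": 0.9, "alignment": 0.1}),
    ]
    stages = [
        ("quality", lambda p: p.scores["quality"] >= 0.5),
        ("alignment", lambda p: p.scores["alignment"] >= 0.5),
    ]
    kept, dropped = progressive_filter(candidates, stages)
    print([p.instruction for p in kept], dropped)
```

Ordering the stages from cheapest to most expensive check is a common design choice in such pipelines, since later stages then only run on pairs that earlier stages have already accepted.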
Video Editing • Dataset • Benchmark • Instruction Following • Multi-Task Editing
Examples from the Goku dataset across all editing task categories.
More qualitative results from our Goku-Edit model across diverse editing scenarios.
Side-by-side comparison of our method against state-of-the-art video editing approaches.