ECCV 2026

Goku: A Million-Scale Universal Dataset and
Benchmark for Instruction-Based Video Editing

Sen Liang1,2⋆, Cong Wang2⋆, Zhentao Yu2, Fengbin Guan1, Zhengguang Zhou2, Teng Hu2, Youliang Zhang2, Yuan Zhou2, Xin Li1, Qinglin Lu2, Zhibo Chen1†

1University of Science and Technology of China 2Tencent Hunyuan

Important: If you notice temporal inconsistency between the original and edited videos, it is usually caused by webpage loading; please refresh the page and try again. On Windows systems, playback problems may also occur for reasons that have not yet been identified.

Abstract

Existing instruction-based video editing datasets mostly focus on single-task appearance editing and fail to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset of 2 million high-quality, instruction-aligned video editing pairs. It is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations (e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems, and we introduce a progressive filtering system that enforces data reliability throughout the process. Furthermore, we explore optimal network structures on Goku and propose Goku-Edit. To comprehend complex editing instructions, Goku-Edit uses a multimodal large language model (MLLM) as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. We also propose Goku-Bench, a comprehensive video editing benchmark with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit achieves up to an 8% improvement over other open-source models in instruction following.

Video Editing • Dataset • Benchmark • Instruction Following • Multi-Task Editing
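As a rough illustration of what a progressive filtering system can look like, the sketch below applies a sequence of filter stages so that each stage only sees the samples that survived the previous one. It is not the pipeline used to build Goku: the `EditPair` fields, the stage names, the score keys, and every threshold are hypothetical and chosen only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class EditPair:
    """One candidate editing pair. All fields are illustrative assumptions."""
    instruction: str
    scores: Dict[str, float] = field(default_factory=dict)


# A stage is a (name, predicate) pair; the predicate returns True to keep a sample.
Stage = Tuple[str, Callable[[EditPair], bool]]


def progressive_filter(
    pairs: List[EditPair], stages: List[Stage]
) -> Tuple[List[EditPair], List[Tuple[str, int]]]:
    """Apply stages in order and record how many samples survive each one."""
    survivors = list(pairs)
    report = []
    for name, keep in stages:
        survivors = [p for p in survivors if keep(p)]
        report.append((name, len(survivors)))
    return survivors, report


# Hypothetical stages and thresholds, not taken from the paper.
STAGES: List[Stage] = [
    ("source_quality", lambda p: p.scores.get("source_quality", 0.0) >= 0.5),
    ("instruction_alignment", lambda p: p.scores.get("alignment", 0.0) >= 0.7),
    ("temporal_consistency", lambda p: p.scores.get("consistency", 0.0) >= 0.6),
]


def demo() -> Tuple[List[EditPair], List[Tuple[str, int]]]:
    """Run the filter on three made-up samples."""
    pairs = [
        EditPair("make the dog run left",
                 {"source_quality": 0.9, "alignment": 0.8, "consistency": 0.9}),
        EditPair("turn the car red",
                 {"source_quality": 0.9, "alignment": 0.4, "consistency": 0.9}),
        EditPair("remove the person",
                 {"source_quality": 0.2, "alignment": 0.9, "consistency": 0.9}),
    ]
    return progressive_filter(pairs, STAGES)
```

Ordering the stages from cheapest to most expensive is a common design choice in this kind of pipeline, since later checks then run on fewer samples; whether Goku's filtering follows that ordering is not stated here.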

1. Dataset Samples

Examples from the Goku dataset across all editing task categories.

2. Additional Model Results

More qualitative results from our Goku-Edit model across diverse editing scenarios.

3. Comparison with Existing Methods

Side-by-side comparison of our method against state-of-the-art video editing approaches.