Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

Zhengyang Liang1,♣, Yan Shu2,♣, Xiangrui Liu3, Minghao Qin3, Kaixin Liang4, Paolo Rota2, Nicu Sebe2, Zheng Liu3, Lizi Liao1
1Singapore Management University, 2University of Trento,
3Beijing Academy of Artificial Intelligence, 4Beijing University of Posts and Telecommunications

Abstract

The evolution of autonomous agents is redefining information seeking, shifting it from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have progressed rapidly, a significant gap remains for the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They therefore fail to evaluate agentic video research, which requires actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web.

To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived from text search alone but instead require navigating video timelines to verify externally retrieved claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even the best-performing search-augmented model, Gemini-2.5-Pro (w/ Search), achieves only 23.81% accuracy. Our analysis shows that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.

Leaderboard

We evaluate state-of-the-art models on Video-BrowseComp. We report accuracy (%) overall and at three difficulty levels (Level 1, Level 2, Level 3), along with calibration error (%). A minimal, unofficial sketch of how these metrics can be computed follows the table.

Model | Overall Acc (%) | Level 1 (%) | Level 2 (%) | Level 3 (%) | Calibration Error (%)

Tool-Free Models
Qwen3-VL-8B-Thinking | 7.14 | 12.00 | 0.00 | 0.00 | 52.49
Qwen3-VL-235B-A22B-Instruct | 13.33 | 22.40 | 0.00 | 0.00 | 77.64
GLM-4.6V | 10.95 | 16.80 | 3.23 | 0.00 | 44.40
gpt-4o-2024-11-20 | 17.62 | 28.00 | 3.23 | 0.00 | 58.81
gpt-4o-mini-2024-07-18 | 9.52 | 16.00 | 0.00 | 0.00 | 63.55
gpt-5-mini-2025-08-07 | 15.71 | 26.40 | 0.00 | 0.00 | 37.47
gemini-2.5-flash-2025-06 | 16.67 | 27.20 | 1.61 | 0.00 | 77.79
gemini-2.5-pro-2025-06 | 19.52 | 31.20 | 3.23 | 0.00 | 79.18

Search Models
gemini-2.5-flash-2025-06 (w/ Search) | 20.95 | 32.80 | 4.84 | 0.00 | 35.98
gemini-2.5-pro-2025-06 (w/ Search) | 23.81 | 37.60 | 4.84 | 0.00 | 31.45
gpt-5.1-2025-11-13 (w/ Search) | 15.24 | 21.60 | 6.45 | 4.35 | 30.20
o4-mini-deep-research-2025-06-26 | 22.86 | 30.40 | 12.90 | 8.70 | 42.55
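
The sketch below is an illustrative, unofficial scoring routine rather than the benchmark's released grader. It assumes each graded record carries a difficulty level, a correctness flag, and the model's self-reported confidence, and it treats the calibration-error column as an ECE-style gap between confidence and empirical accuracy; the field names ("level", "correct", "confidence") and the binning choice are assumptions and may differ from the authors' exact protocol.

# Minimal sketch (not the official scorer): overall accuracy, per-level accuracy,
# and an ECE-style calibration error from graded results.
# Assumed record schema: {"level": int in {1,2,3}, "correct": bool, "confidence": float in [0, 1]}.
from collections import defaultdict

def summarize(results, n_bins=10):
    """Return overall accuracy, per-level accuracy, and expected calibration error."""
    by_level = defaultdict(list)
    for r in results:
        by_level[r["level"]].append(r)

    overall_acc = sum(r["correct"] for r in results) / len(results)
    level_acc = {lvl: sum(r["correct"] for r in rs) / len(rs)
                 for lvl, rs in sorted(by_level.items())}

    # Expected calibration error: bin records by confidence, then take the
    # size-weighted average gap between mean confidence and accuracy per bin.
    bins = defaultdict(list)
    for r in results:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = sum(len(rs) / len(results)
              * abs(sum(x["correct"] for x in rs) / len(rs)
                    - sum(x["confidence"] for x in rs) / len(rs))
              for rs in bins.values())

    return {"overall_acc": overall_acc, "level_acc": level_acc, "ece": ece}

if __name__ == "__main__":
    # Tiny synthetic example to show the expected input/output shape.
    demo = [{"level": 1, "correct": True, "confidence": 0.9},
            {"level": 2, "correct": False, "confidence": 0.8}]
    print(summarize(demo))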

Citation

@article{liang2025video,
  title={Video-BrowseComp: Benchmarking Agentic Video Research on Open Web},
  author={Liang, Zhengyang and Shu, Yan and Liu, Xiangrui and Qin, Minghao and Liang, Kaixin and Rota, Paolo and Sebe, Nicu and Liu, Zheng and Liao, Lizi},
  journal={arXiv preprint},
  year={2025}
}