
CLC number: TP391
On-line Access: 2026-01-08
Received: 2025-06-14
Revision Accepted: 2025-10-20
Crosschecked: 2026-01-08
Yangliu HU, Zikai SONG, Junqing YU, Yiping Phoebe CHEN, Wei YANG. TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2500412
TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
1 Huazhong University of Science and Technology, Wuhan 430074, China
2 La Trobe University, Melbourne 3086, Australia
Abstract: Video large language models (video-LLMs) have demonstrated remarkable multimodal understanding, yet their potential as zero-shot evaluators of temporal consistency in video captions remains largely untapped. Existing methods perform poorly at detecting key temporal errors such as missing actions, hallucinated actions, and disordered action sequences. This paper makes two core contributions: (1) TimeJudge, a novel zero-shot framework that reframes temporal error detection as a set of calibrated binary question-answering tasks, introducing a modality-sensitive confidence calibration mechanism and a consistency-weighted voting strategy for robust aggregation of results; (2) TEDBench, a carefully constructed benchmark of videos spanning four levels of action complexity with fine-grained temporal error annotations, enabling systematic evaluation of video-LLMs on this task. Experiments show that TimeJudge substantially improves the recall and F1 score of temporal error detection across multiple state-of-the-art video-LLMs, without any task-specific fine-tuning. The method offers a general, scalable, and training-free solution for strengthening the temporal scrutiny capabilities of video-LLMs.
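The full article text is not reproduced on this page, so no implementation details are available here. Purely as an illustrative sketch of what aggregating calibrated binary answers by consistency-weighted voting could look like, the short Python fragment below combines per-question verdicts into a single judgment. All names (ProbeAnswer, consistency_weighted_vote) and the specific weighting rule (calibrated confidence multiplied by the fraction of agreeing probes) are hypothetical assumptions for illustration, not the authors' method.

from dataclasses import dataclass
from typing import List

@dataclass
class ProbeAnswer:
    flags_error: bool   # True if this binary probe reports a temporal error
    confidence: float   # calibrated probability in [0, 1] for the verdict

def consistency_weighted_vote(answers: List[ProbeAnswer]) -> bool:
    """Aggregate binary probe verdicts into a single judgment.

    Each vote is weighted by its calibrated confidence multiplied by a
    consistency term: the fraction of probes that reached the same verdict.
    Returns True when the weighted evidence favors "temporal error present".
    (Hypothetical weighting rule; the paper's actual aggregation may differ.)
    """
    if not answers:
        return False
    # Fraction of probes flagging an error, used as the consistency term.
    pos_rate = sum(a.flags_error for a in answers) / len(answers)
    score = 0.0
    for a in answers:
        consistency = pos_rate if a.flags_error else 1.0 - pos_rate
        weight = a.confidence * consistency
        score += weight if a.flags_error else -weight
    return score > 0.0

# Example: three probes flag a temporal error with high confidence, one disagrees.
votes = [ProbeAnswer(True, 0.9), ProbeAnswer(True, 0.8),
         ProbeAnswer(True, 0.7), ProbeAnswer(False, 0.6)]
print(consistency_weighted_vote(votes))  # True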

