
CLC number: TP391
On-line Access: 2026-01-08
Received: 2025-06-14
Revision Accepted: 2025-10-20
Crosschecked: 2026-01-08
Yangliu HU, Zikai SONG, Junqing YU, Yiping Phoebe CHEN, Wei YANG. TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2500412
TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
1 Huazhong University of Science and Technology, Wuhan 430074, China
2 La Trobe University, Melbourne 3086, Australia
Abstract: Video large language models (video-LLMs) have demonstrated remarkable multimodal understanding, yet their potential as zero-shot evaluators of temporal consistency in video captions remains largely untapped. Existing methods perform poorly at detecting key temporal errors such as missing actions, hallucinated actions, and disordered action sequences. This paper makes two core contributions: (1) TimeJudge, a novel zero-shot framework that reframes temporal error detection as a set of calibrated binary question-answering tasks, introducing a modality-sensitive confidence calibration mechanism and a consistency-weighted voting strategy for robust aggregation of results; (2) TEDBench, a carefully constructed benchmark of videos spanning four levels of action complexity with fine-grained temporal error annotations, enabling systematic evaluation of video-LLMs on this task. Experiments show that TimeJudge substantially improves the recall and F1 score of temporal error detection across multiple state-of-the-art video-LLMs, without any task-specific fine-tuning. The method offers a general, scalable, and training-free solution for strengthening the temporal scrutiny capabilities of video-LLMs.
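The full article text is not reproduced on this page, so no implementation details are available here. Purely as an illustrative sketch of what aggregating calibrated binary answers by consistency-weighted voting could look like, the short Python fragment below combines per-question verdicts into a single judgment. All names (ProbeAnswer, consistency_weighted_vote) and the specific weighting rule (calibrated confidence multiplied by the fraction of agreeing probes) are hypothetical assumptions for illustration, not the authors' method.

from dataclasses import dataclass
from typing import List

@dataclass
class ProbeAnswer:
    flags_error: bool   # True if this binary probe reports a temporal error
    confidence: float   # calibrated probability in [0, 1] for the verdict

def consistency_weighted_vote(answers: List[ProbeAnswer]) -> bool:
    """Aggregate binary probe verdicts into a single judgment.

    Each vote is weighted by its calibrated confidence multiplied by a
    consistency term: the fraction of probes that reached the same verdict.
    Returns True when the weighted evidence favors "temporal error present".
    (Hypothetical weighting rule; the paper's actual aggregation may differ.)
    """
    if not answers:
        return False
    # Fraction of probes flagging an error, used as the consistency term.
    pos_rate = sum(a.flags_error for a in answers) / len(answers)
    score = 0.0
    for a in answers:
        consistency = pos_rate if a.flags_error else 1.0 - pos_rate
        weight = a.confidence * consistency
        score += weight if a.flags_error else -weight
    return score > 0.0

# Example: three probes flag a temporal error with high confidence, one disagrees.
votes = [ProbeAnswer(True, 0.9), ProbeAnswer(True, 0.8),
         ProbeAnswer(True, 0.7), ProbeAnswer(False, 0.6)]
print(consistency_weighted_vote(votes))  # True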

