CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-04-22
Yahong Han, Aming Wu, Linchao Zhu, Yi Yang. Visual commonsense reasoning with directional visual connections[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000722
Visual commonsense reasoning with directional visual connections

1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2 Tianjin Key Laboratory of Machine Learning, Tianjin University, Tianjin 300350, China
3 School of Computer Science, University of Technology Sydney, Sydney 2007, Australia

Abstract: To promote research on cognition-level understanding of visual content, i.e., making accurate inferences based on a thorough understanding of visual details, the task of visual commonsense reasoning (VCR) has been proposed. In contrast to traditional visual question answering, which requires a model only to answer a question correctly, VCR requires the model both to answer the question correctly and to provide a rationale for that answer. Recent studies of human cognition suggest that brain cognition can be viewed as a dynamic global integration of local neuron connectivity, which helps solve specific cognitive tasks. Inspired by this view, we propose a directional connective network that performs VCR effectively by contextualizing visual neurons with the semantics of the question and answers, thereby dynamically reorganizing neuron connectivity, and by exploiting directional information to strengthen reasoning. Specifically, we first devise a GraphVLAD module to capture visual neuron connectivity that fully expresses the correlations within the visual content. We then propose a contextualization model to fuse the visual and textual representations. Finally, based on the output of the contextualized connectivity, we design directional connections, including a ReasonVLAD module, to infer the answer and the corresponding rationale. Experimental results and visualizations demonstrate the effectiveness of the proposed method.
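The abstract names GraphVLAD without describing its internals. A plausible reading, given the name, is a NetVLAD-style soft-assignment aggregation (Arandjelović et al., 2018) whose cluster nodes are then related by a graph-convolution step (Kipf and Welling, 2016). The minimal PyTorch sketch below illustrates only that combination; every class name, parameter, and shape choice (GraphVLADSketch, feat_dim, num_clusters, the similarity-based adjacency) is an assumption made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphVLADSketch(nn.Module):
    """Illustrative sketch: NetVLAD-style aggregation plus one GCN step."""
    def __init__(self, feat_dim: int = 512, num_clusters: int = 32):
        super().__init__()
        # Soft-assignment of local features to cluster centers (NetVLAD-style).
        self.assign = nn.Conv1d(feat_dim, num_clusters, kernel_size=1)
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        # Weights for a single graph-convolution step over the cluster nodes.
        self.gcn = nn.Linear(feat_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, N) local region features from a CNN backbone.
        soft = F.softmax(self.assign(x), dim=1)                   # (B, K, N)
        # Residuals between each local feature and each cluster center.
        resid = x.unsqueeze(1) - self.centers[None, :, :, None]   # (B, K, D, N)
        nodes = (resid * soft.unsqueeze(2)).sum(dim=-1)           # (B, K, D)
        nodes = F.normalize(nodes, dim=-1)                        # per-node L2 norm
        # Treat the K cluster nodes as "visual neurons": connect them by
        # feature similarity and propagate once, GCN-style.
        adj = F.softmax(nodes @ nodes.transpose(1, 2), dim=-1)    # (B, K, K)
        return F.relu(adj @ self.gcn(nodes))                      # (B, K, D)

# Example: 36 region features of dimension 512 for a batch of 2 images.
feats = torch.randn(2, 512, 36)
neurons = GraphVLADSketch()(feats)  # (2, 32, 512) connected visual neurons

Under this reading, the returned cluster nodes would be the "visual neurons" whose connectivity is later contextualized by the question and answer semantics; the contextualization and ReasonVLAD stages are not sketched here.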