CLC number: TP181
On-line Access: 2024-02-19
Received: 2023-05-31
Revision Accepted: 2024-02-19
Crosschecked: 2023-10-17
Yiming LEI, Jingqi LI, Zilong LI, Yuan CAO, Hongming SHAN. Prompt learning in computer vision: a survey[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(1): 42-63.
@article{Lei2024,
title="Prompt learning in computer vision: a survey",
author="Yiming LEI and Jingqi LI and Zilong LI and Yuan CAO and Hongming SHAN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="25",
number="1",
pages="42-63",
year="2024",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2300389"
}
%0 Journal Article
%T Prompt learning in computer vision: a survey
%A Yiming LEI
%A Jingqi LI
%A Zilong LI
%A Yuan CAO
%A Hongming SHAN
%J Frontiers of Information Technology & Electronic Engineering
%V 25
%N 1
%P 42-63
%@ 2095-9184
%D 2024
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2300389
TY  - JOUR
T1  - Prompt learning in computer vision: a survey
A1  - Yiming LEI
A1  - Jingqi LI
A1  - Zilong LI
A1  - Yuan CAO
A1  - Hongming SHAN
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 25
IS  - 1
SP  - 42
EP  - 63
SN  - 2095-9184
Y1  - 2024
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2300389
ER  -
Abstract: Prompt learning has attracted broad attention in computer vision since the emergence of large pre-trained vision-language models (VLMs). Building on the close relationship between visual and language information that VLMs establish, prompt learning has become a crucial technique in many important applications, such as artificial intelligence generated content (AIGC). In this survey, we provide a progressive and comprehensive review of visual prompt learning as it relates to AIGC. We begin by introducing VLMs, the foundation of visual prompt learning. We then review visual prompt learning methods and prompt-guided generative models, and discuss how to improve the efficiency of adapting AIGC models to specific downstream tasks. Finally, we outline some promising research directions for prompt learning.