CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-03-14
Revision Accepted: 2024-10-25
Crosschecked: 2025-09-04
Shiyuan YANG, Zheng GU, Wenyue HAO, Yi WANG, Huaiyu CAI, Xiaodong CHEN. Few-shot exemplar-driven inpainting with parameter-efficient diffusion fine-tuning[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400395
Few-shot exemplar-driven inpainting with parameter-efficient diffusion fine-tuning
1 Key Laboratory of Optoelectronic Information Technology of the Ministry of Education, School of Precision Instruments and Optoelectronics Engineering, Tianjin University, Tianjin 300072, China
2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China
Abstract: Text-to-image diffusion models have demonstrated remarkable image-generation capability and have been widely applied to image inpainting tasks. Although text prompts can provide intuitive guidance for conditional inpainting, users often wish to complete a specific object with a personalized appearance by supplying a reference image. However, existing exemplar-driven inpainting methods struggle to produce high-fidelity results. To address this problem, we propose a plug-and-play low-rank adaptation (LoRA) module built on a pre-trained text-driven inpainting model. Through few-shot fine-tuning, the module learns the specific features of the reference image, substantially improving the fit to customized exemplars without requiring extensive training on large-scale datasets. In addition, GPT-4V prompting and prior noise initialization are introduced to further improve the fidelity of the inpainting results. In brief, the denoising diffusion process starts from an initial noise derived from a composite reference-background image, and the subsequent generation is guided by rich prompts produced by GPT-4V from the reference image. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both qualitative and quantitative metrics, providing users with an exemplar-driven inpainting tool with stronger customization capability.
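To make the first component of the abstract concrete, the following is a minimal sketch of few-shot LoRA fine-tuning on a pre-trained text-driven inpainting model, written against the `diffusers` and `peft` libraries. The base model identifier, LoRA rank, learning rate, and target modules are illustrative assumptions, not the authors' released configuration.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionInpaintPipeline
from peft import LoraConfig

device = "cuda"

# Pre-trained text-driven inpainting backbone (an assumed stand-in for the
# paper's base model).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
).to(device)

# Freeze the backbone and inject low-rank adapters into the UNet's attention
# projections; only the LoRA weights are updated during few-shot tuning.
pipe.unet.requires_grad_(False)
lora_cfg = LoraConfig(r=8, lora_alpha=8,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
pipe.unet.add_adapter(lora_cfg)
lora_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

# Training-time noise schedule, separate from the inference scheduler.
train_sched = DDPMScheduler.from_config(pipe.scheduler.config)

def few_shot_step(image, mask, prompt):
    """One tuning step on a single exemplar.

    image: (1, 3, H, W) in [-1, 1]; mask: (1, 1, H, W) in {0, 1} (1 = hole).
    """
    vae, scale = pipe.vae, pipe.vae.config.scaling_factor
    with torch.no_grad():
        latents = vae.encode(image).latent_dist.sample() * scale
        masked = vae.encode(image * (1 - mask)).latent_dist.sample() * scale
        ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                             max_length=pipe.tokenizer.model_max_length,
                             return_tensors="pt").input_ids.to(device)
        text_emb = pipe.text_encoder(ids)[0]
    # Standard denoising objective: predict the noise added at a random step.
    noise = torch.randn_like(latents)
    t = torch.randint(0, train_sched.config.num_train_timesteps,
                      (1,), device=device)
    noisy = train_sched.add_noise(latents, noise, t)
    mask_lat = F.interpolate(mask, size=latents.shape[-2:])
    # The inpainting UNet takes 9 input channels: noisy latents (4) +
    # downsampled mask (1) + masked-image latents (4).
    unet_in = torch.cat([noisy, mask_lat, masked], dim=1)
    pred = pipe.unet(unet_in, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the low-rank adapter weights receive gradients, a handful of exemplar crops and a few hundred steps suffice, which is what makes the module plug-and-play: the tuned LoRA can be attached to or detached from the frozen backbone without touching its weights.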
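The abstract's second component, prior noise initialization with a GPT-4V prompt, can be sketched at inference time as below. This reuses `pipe` from the previous sketch; `background`, `exemplar_resized`, and `mask` are assumed preprocessed tensors, `background_pil` and `mask_pil` their PIL counterparts, and the prompt string is a placeholder standing in for a GPT-4V description of the exemplar.

```python
import torch

# Paste the exemplar into the masked region of the background, encode the
# composite, and jump to an intermediate noise level so that denoising
# starts from an exemplar-aware latent rather than pure Gaussian noise
# (an SDEdit-style warm start).
@torch.no_grad()
def exemplar_init_latents(composite, strength=0.8, num_steps=50):
    """composite: (1, 3, H, W) in [-1, 1]; returns warm-start latents."""
    vae, scale = pipe.vae, pipe.vae.config.scaling_factor
    latents = vae.encode(composite).latent_dist.sample() * scale
    pipe.scheduler.set_timesteps(num_steps)
    # Pick the timestep corresponding to the requested noising strength.
    t = pipe.scheduler.timesteps[int((1 - strength) * num_steps)]
    return pipe.scheduler.add_noise(latents, torch.randn_like(latents),
                                    t.unsqueeze(0))

# Placeholder for a rich, GPT-4V-generated description of the exemplar.
rich_prompt = ("a small ceramic teapot with a glossy blue glaze and a "
               "curved bamboo handle, studio lighting")

# mask uses 1 for the hole, so the exemplar fills the masked region.
composite = background * (1 - mask) + exemplar_resized * mask
latents = exemplar_init_latents(composite, strength=0.8)
result = pipe(prompt=rich_prompt, image=background_pil, mask_image=mask_pil,
              latents=latents).images[0]
```

One caveat of this simplified warm start: the pipeline still runs its full timestep schedule even though the supplied latents correspond to an intermediate noise level; a faithful implementation would also truncate the schedule to match the chosen strength, as image-to-image pipelines do.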