CLC number: TP183
On-line Access: 2025-06-04
Received: 2024-03-14
Revision Accepted: 2024-10-25
Crosschecked: 2025-09-04
Shiyuan YANG, Zheng GU, Wenyue HAO, Yi WANG, Huaiyu CAI, Xiaodong CHEN. Few-shot exemplar-driven inpainting with parameter-efficient diffusion fine-tuning[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400395
Few-shot exemplar-driven inpainting with parameter-efficient diffusion fine-tuning
1 Key Laboratory of Optoelectronic Information Technology of the Ministry of Education, School of Precision Instruments and Optoelectronics Engineering, Tianjin University, Tianjin 300072, China
2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China
Abstract: Text-to-image diffusion models have demonstrated remarkable image-generation capability and have been widely applied to image inpainting tasks. Although text prompts can provide intuitive guidance for conditional inpainting, users often wish to complete a specific object with a personalized appearance by supplying a reference image. However, existing exemplar-driven inpainting methods struggle to produce high-fidelity results. To address this problem, we propose a plug-and-play low-rank adaptation (LoRA) module built on a pre-trained text-driven inpainting model. Through few-shot fine-tuning, the module learns the specific features of the reference image, substantially improving the fit to customized exemplars without requiring extensive training on large-scale datasets. In addition, GPT-4V prompting and prior noise initialization are introduced to further improve the fidelity of the inpainting results. In brief, the denoising diffusion process starts from an initial noise derived from a composite reference-background image, and the subsequent generation is guided by rich prompts produced by GPT-4V from the reference image. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both qualitative and quantitative metrics, providing users with an exemplar-driven inpainting tool with stronger customization capability.
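To make the first component of the abstract concrete, the following is a minimal sketch of few-shot LoRA fine-tuning on a pre-trained text-driven inpainting model, written against the `diffusers` and `peft` libraries. The base model identifier, LoRA rank, learning rate, and target modules are illustrative assumptions, not the authors' released configuration.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionInpaintPipeline
from peft import LoraConfig

device = "cuda"

# Pre-trained text-driven inpainting backbone (an assumed stand-in for the
# paper's base model).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
).to(device)

# Freeze the backbone and inject low-rank adapters into the UNet's attention
# projections; only the LoRA weights are updated during few-shot tuning.
pipe.unet.requires_grad_(False)
lora_cfg = LoraConfig(r=8, lora_alpha=8,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
pipe.unet.add_adapter(lora_cfg)
lora_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

# Training-time noise schedule, separate from the inference scheduler.
train_sched = DDPMScheduler.from_config(pipe.scheduler.config)

def few_shot_step(image, mask, prompt):
    """One tuning step on a single exemplar.

    image: (1, 3, H, W) in [-1, 1]; mask: (1, 1, H, W) in {0, 1} (1 = hole).
    """
    vae, scale = pipe.vae, pipe.vae.config.scaling_factor
    with torch.no_grad():
        latents = vae.encode(image).latent_dist.sample() * scale
        masked = vae.encode(image * (1 - mask)).latent_dist.sample() * scale
        ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                             max_length=pipe.tokenizer.model_max_length,
                             return_tensors="pt").input_ids.to(device)
        text_emb = pipe.text_encoder(ids)[0]
    # Standard denoising objective: predict the noise added at a random step.
    noise = torch.randn_like(latents)
    t = torch.randint(0, train_sched.config.num_train_timesteps,
                      (1,), device=device)
    noisy = train_sched.add_noise(latents, noise, t)
    mask_lat = F.interpolate(mask, size=latents.shape[-2:])
    # The inpainting UNet takes 9 input channels: noisy latents (4) +
    # downsampled mask (1) + masked-image latents (4).
    unet_in = torch.cat([noisy, mask_lat, masked], dim=1)
    pred = pipe.unet(unet_in, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the low-rank adapter weights receive gradients, a handful of exemplar crops and a few hundred steps suffice, which is what makes the module plug-and-play: the tuned LoRA can be attached to or detached from the frozen backbone without touching its weights.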
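The abstract's second component, prior noise initialization with a GPT-4V prompt, can be sketched at inference time as below. This reuses `pipe` from the previous sketch; `background`, `exemplar_resized`, and `mask` are assumed preprocessed tensors, `background_pil` and `mask_pil` their PIL counterparts, and the prompt string is a placeholder standing in for a GPT-4V description of the exemplar.

```python
import torch

# Paste the exemplar into the masked region of the background, encode the
# composite, and jump to an intermediate noise level so that denoising
# starts from an exemplar-aware latent rather than pure Gaussian noise
# (an SDEdit-style warm start).
@torch.no_grad()
def exemplar_init_latents(composite, strength=0.8, num_steps=50):
    """composite: (1, 3, H, W) in [-1, 1]; returns warm-start latents."""
    vae, scale = pipe.vae, pipe.vae.config.scaling_factor
    latents = vae.encode(composite).latent_dist.sample() * scale
    pipe.scheduler.set_timesteps(num_steps)
    # Pick the timestep corresponding to the requested noising strength.
    t = pipe.scheduler.timesteps[int((1 - strength) * num_steps)]
    return pipe.scheduler.add_noise(latents, torch.randn_like(latents),
                                    t.unsqueeze(0))

# Placeholder for a rich, GPT-4V-generated description of the exemplar.
rich_prompt = ("a small ceramic teapot with a glossy blue glaze and a "
               "curved bamboo handle, studio lighting")

# mask uses 1 for the hole, so the exemplar fills the masked region.
composite = background * (1 - mask) + exemplar_resized * mask
latents = exemplar_init_latents(composite, strength=0.8)
result = pipe(prompt=rich_prompt, image=background_pil, mask_image=mask_pil,
              latents=latents).images[0]
```

One caveat of this simplified warm start: the pipeline still runs its full timestep schedule even though the supplied latents correspond to an intermediate noise level; a faithful implementation would also truncate the schedule to match the chosen strength, as image-to-image pipelines do.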