
On-line Access: 2024-02-19

Received: 2023-04-30

Revision Accepted: 2024-02-19

Crosschecked: 2023-10-13


ORCID: Wang QI, https://orcid.org/0000-0003-3279-7125


Frontiers of Information Technology & Electronic Engineering  2024 Vol.25 No.1 P.170-178

http://doi.org/10.1631/FITEE.2300313


Multistage guidance on the diffusion model inspired by human artists’ creative thinking


Author(s):  Wang QI, Huanghuang DENG, Taihao LI

Affiliation(s):  AI Research Institute, Zhejiang Lab, Hangzhou 311121, China; College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Corresponding email(s):   qiwang@zhejianglab.com, dhh2012@zju.edu.cn, lith@zhejianglab.com

Key Words:  Text-to-image generation; diffusion model; multilevel semantics; multistage guidance



Wang QI, Huanghuang DENG, Taihao LI. Multistage guidance on the diffusion model inspired by human artists’ creative thinking[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(1): 170-178.



Abstract: 
Current research on text-conditional image generation shows performance comparable to that of ordinary painters, but still leaves much room for improvement when compared with artist-level paintings, which usually represent multilevel semantics by gathering the features of multiple objects into one object. In a preliminary experiment, we confirm this and then seek the opinions of three groups of individuals with varying levels of art-appreciation ability to determine the distinctions between painters and artists. We then use these opinions to improve an artificial intelligence (AI) painting system from painter-level image generation toward artist-level image generation. Specifically, we propose a multistage text-conditioned approach, requiring no further pretraining, that helps the diffusion model (DM) move toward multilevel semantic representation in a generated image. Both machine and manual evaluations in the main experiment verify the effectiveness of our approach. In addition, unlike previous one-stage guidance, our method can control the extent to which the features of an object are represented in a painting by controlling the guiding steps between the different stages.
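The abstract gives no algorithm listing, so the following is only a minimal, hedged PyTorch sketch of what switching text conditions between denoising stages could look like; it is not the authors' implementation. TinyEpsModel, multistage_sample, and stage_bounds are hypothetical stand-ins, and the noise schedule is a generic DDPM one.

# Minimal illustrative sketch (not the authors' code): multistage
# text-conditioned guidance in a DDPM-style sampling loop. The model,
# embeddings, and stage boundaries below are hypothetical stand-ins.
import torch
import torch.nn as nn

class TinyEpsModel(nn.Module):
    """Dummy noise predictor standing in for a text-conditioned U-Net."""
    def __init__(self, dim=64, cond_dim=16):
        super().__init__()
        self.net = nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, x, t, cond):
        t_feat = t.float().view(-1, 1) / 1000.0  # crude timestep feature
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

def multistage_sample(model, stage_conds, stage_bounds, T=1000, dim=64):
    """Run T reverse diffusion steps, switching the text condition
    at the given (descending) timestep boundaries.

    stage_conds:  one condition embedding per prompt/stage.
    stage_bounds: timesteps at which the next stage takes over; moving
                  a boundary changes how many steps a prompt guides,
                  i.e., how strongly its object's features appear.
    """
    betas = torch.linspace(1e-4, 0.02, T)          # standard DDPM schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, dim)                        # start from pure noise
    stage = 0
    for t in reversed(range(T)):
        if stage + 1 < len(stage_conds) and t < stage_bounds[stage]:
            stage += 1                             # hand over to next prompt
        eps = model(x, torch.tensor([t]), stage_conds[stage])
        # DDPM posterior mean update, then add noise except at t = 0.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

model = TinyEpsModel()
conds = [torch.randn(1, 16) for _ in range(3)]     # e.g., three prompt embeddings
sample = multistage_sample(model, conds, stage_bounds=[700, 300])

In this sketch, shrinking the interval a prompt occupies (e.g., stage_bounds=[900, 300] instead of [700, 300]) gives its object fewer guiding steps and a fainter presence in the result, which mirrors the control over feature representation that the abstract describes.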

Multistage guidance on the diffusion model inspired by human artists' creative thinking

Wang QI1, Huanghuang DENG2, Taihao LI1
1Research Center for Cross-Media Intelligence, Zhejiang Lab, Hangzhou 311500, China
2College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract: Current research on text-to-image generation has reached a level similar to that of ordinary painters, but there is still much room for improvement compared with the painting level of artists; artist-level paintings usually fuse the features of multiple objects into one object to represent multilevel semantic information. In a preliminary experiment, we confirmed this and consulted three groups with different levels of art-appreciation ability to determine the distinctions in painting level between painters and artists. These opinions were then used to help an artificial intelligence painting system improve from painter-level image generation to artist-level image generation. Specifically, we propose a text-based multistage guidance method that requires no further pretraining and helps the diffusion model move toward multilevel semantic representation in the generated image. Both machine and manual evaluations in the experiments verify the effectiveness of the proposed method. In addition, unlike previous one-stage guidance methods, the proposed method can control the degree to which the features of each object are expressed in the painting by controlling the number of guiding steps between different stages.

Key words: Text-to-image generation; diffusion model; multilevel semantics; multistage guidance


