Volume: 12
Issue: 4
Year: 2024
DOI: 10.55524/ijircst.2024.12.4.9 | DOI URL: https://doi.org/10.55524/ijircst.2024.12.4.9
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)
Anmol Chauhan, Sana Rabbani, Devendra Agarwal, Nikhat Akhtar, Yusuf Perwej
This research article presents an in-depth analysis of using stable diffusion models to generate images from text. The main focus of the study is improving generative models' capacity to produce high-quality, contextually appropriate images from textual descriptions. Leveraging recent advances in deep learning, particularly diffusion models, we have created a new system that combines visual and linguistic data to generate aesthetically pleasing and coherent images from a given text. To arrive at a clean representation that matches the provided textual input, our method employs a stable diffusion process that iteratively removes noise from an initially noisy image. This approach differs from conventional generative adversarial networks (GANs) in that it produces more accurate images and has a more stable training procedure. We use a dual encoder mechanism to capture both the semantic richness of the text and the structural information needed for image synthesis. Results from extensive experiments on benchmark datasets show that our model substantially outperforms current state-of-the-art methods in diversity, text-image alignment, and image quality. To verify the model's efficacy, the article details the architectural innovations, the training schedule, and the evaluation criteria used. In addition, we explore further uses for our text-to-image generation system, such as digital art creation, content development, and assistive devices for the visually impaired. The research lays the groundwork for future work in this dynamic area by highlighting the technical obstacles encountered and the solutions developed. Finally, our text-to-image generation model, based on stable diffusion, is a significant step forward for generative models in a field that combines computer vision with natural language processing.
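As a hedged illustration of the iterative, text-conditioned denoising process the abstract describes, the sketch below runs a comparable open-source Stable Diffusion pipeline via Hugging Face's diffusers library. This is not the authors' system: the checkpoint name, prompt, and sampler settings are assumptions chosen only to show the workflow.

```python
# Minimal text-to-image sketch using the open-source Stable Diffusion
# implementation (Hugging Face `diffusers`). NOT the authors' model:
# checkpoint, prompt, and sampler settings here are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed public checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Internally the pipeline performs the steps the abstract describes:
#   1. A text encoder maps the prompt to conditioning embeddings.
#   2. Latents are initialized from Gaussian noise.
#   3. A U-Net iteratively predicts and removes noise over T steps,
#      steered toward the prompt via classifier-free guidance.
#   4. A decoder maps the final latent back to pixel space.
image = pipe(
    "a watercolor painting of a lighthouse at dawn",  # example prompt
    num_inference_steps=50,  # number of denoising iterations T
    guidance_scale=7.5,      # strength of the text conditioning
).images[0]
image.save("lighthouse.png")
```

In such pipelines, raising guidance_scale typically tightens text-image alignment at some cost to sample diversity, the same trade-off reflected in the evaluation criteria the abstract reports.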
Professor, Computer Science & Engineering, Goel Institute of Technology & Management, Lucknow, India