Smart Document Analysis Using AI-ML

Sindhu Rashmi. H. R; Prof. Anisha. B. S; Dr. Ramakanth Kumar. P

doi:10.21276/ijircst.2019.7.3.6

Abstract

In this era of digitalization, documents are prepared, presented, and shared as soft copies, and classifying those documents has gained importance in recent times. Document classification has drawn attention across the digital world for its impact in fields such as spam filtering, email routing, language identification, genre classification, sentiment analysis, and readability assessment. Classifying documents that are available online using smart techniques helps different businesses. An easy and efficient way of doing this is through machine learning, which makes human work much easier. To classify documents on a statistical basis, they should be given to the machine learning classifier in a format it can readily process. In this report, we discuss the types of features by which a document can be classified and later represented. Document classification assigns the documents in a collection to categories based on the information they consist of and the features they contain. It is a major learning problem at the center of many information management and retrieval tasks, and it plays an important role in applications that help with organizing, indexing, searching, and concisely representing large amounts of data. We discuss the uses of document classification and the important steps for classifying a document or text by considering a small use case that shows how document classification is done: the basic steps of classification, and the processing and analysis of the collected documents. We have considered two different categories of data sets for classification and analysis.
The problem statement is to distinguish two kinds of document. The first is the Rhyme document, where each rhyme is taken as a single file. The second is the Non-Rhyme document, which contains ordinary Wikipedia text, where a few Wikipedia statements are taken as a single file. The precise objective of the project is to develop a scalable and efficient document classification system that classifies documents more precisely depending on the features they contain, and to learn the basic techniques used for document classification, such as data collection, data cleaning, pre-processing, constructing an ML model, and applying the ML algorithm. Another objective is to work with machine learning concepts and to gain insight into different classification algorithms with the help of this case study.
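
The cleaning, feature extraction, and classification steps named above can be illustrated with a minimal sketch. The abstract does not state which features the authors used, so everything here is an assumption made for illustration: the spelling-based `rhyme_score` feature, the two-letter suffix length, the one-statement-per-line layout, and the fixed `threshold` rule, which stands in for a trained Decision Tree or Random Forest model rather than reproducing one.

```python
import re
from collections import Counter


def clean(text):
    """Data cleaning: lowercase, then drop everything except letters, spaces, newlines."""
    return re.sub(r"[^a-z\n ]", "", text.lower())


def rhyme_score(text, suffix_len=2):
    """Hypothetical feature: fraction of lines whose last word shares its
    final `suffix_len` letters with the last word of at least one other line."""
    endings = []
    for line in clean(text).splitlines():
        words = line.split()
        if words:  # skip empty lines
            endings.append(words[-1][-suffix_len:])
    if not endings:
        return 0.0
    counts = Counter(endings)
    shared = sum(1 for ending in endings if counts[ending] > 1)
    return shared / len(endings)


def classify(text, threshold=0.3):
    """Single-threshold rule (an assumed stand-in for a learned model)."""
    return "Rhyme" if rhyme_score(text) >= threshold else "Non-Rhyme"


rhyme_doc = (
    "Humpty Dumpty sat on a wall\n"
    "Humpty Dumpty had a great fall\n"
    "All the king's horses and all the king's men\n"
    "Couldn't put Humpty together again"
)
wiki_doc = (
    "Bengaluru is the capital of Karnataka.\n"
    "It is a major centre for software companies.\n"
    "The city has many public parks."
)

label_rhyme = classify(rhyme_doc)  # "Rhyme" (score 0.5: "wall" and "fall" share "ll")
label_wiki = classify(wiki_doc)    # "Non-Rhyme" (score 0.0: no shared endings)
```

A spelling-based suffix misses rhymes that are spelled differently ("star" and "are"), which is one reason a practical system would learn from several features rather than rely on one hand-set threshold.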

Keywords

ML (Machine Learning), Document classification, Rhyme, Non-Rhyme, Decision Tree Algorithm, Digitalization, Machine Learning Model, Random Forest Algorithm

Cite this article as

S. R. H. R, P. A. B. S, D. R. K. P, "Smart Document Analysis Using AI-ML", International Journal of Innovative Research in Engineering & Management (ijircst), Vol-7, Issue-3, Page No-54-70, 2019. Available from: https://doi.org/10.21276/ijircst.2019.7.3.6

Corresponding Author

Sindhu Rashmi. H. R

Department of Software Engineering, RV College of Engineering, Bengaluru, India, 9035383054(sindhu55putani@gamil.com)
