Data Mining and Text Mining with Big Data: Review of Differences

Filiz Ersoz

2019, International Journal of Recent Advances in Multidisciplinary Research

In recent years, organizations have encountered increasingly complex databases as a result of technological development, the growth of databases, and the widespread use of information technologies. If data that match an institution's needs are managed successfully and effectively, they offer institutions and organizations considerable economic advantages and opportunities. Every action performed in a digital environment leaves a data record behind it. This study therefore reviews the concepts, methods and application areas of big data, data science, data mining and text mining, and identifies the similarities and differences between data mining and text mining. Distinguishing the two sets of methods is expected to accelerate the strategic decision-making of institutions and organizations and to provide the support they require.

www.ijramr.com
International Journal of Recent Advances in Multidisciplinary Research, Vol. 06, Issue 01, pp. 4391-4396, January, 2019

RESEARCH ARTICLE

*Filiz Ersoz
Industrial Engineering, Karabük University, Karabük, Turkey

ARTICLE INFO
Article History: Received 10th October, 2018; Received in revised form 18th November, 2018; Accepted 14th December, 2018; Published online 30th January, 2019
Keywords: Big Data, Data Mining, Text Mining, Knowledge Discovery.

INTRODUCTION

The volume of quantitative and qualitative data collected in databases and data warehouses has grown, and the use of information technologies has increased in parallel with the development of technology.
As a result, there is a need to reveal meaningful relationships, patterns and trends in large masses of data, and data processing has become more important for making accurate, strategic decisions. As computer and internet technology developed, the amount of data produced, stored and processed grew exponentially, and the concept of big data and the new field of data science began to emerge (Gursakal, 2014). As management information systems evolved from the 1960s onward, and information systems emerged as a software category and field of application in the 1980s and 1990s, the analysis of numerical data stored in relational databases gave rise to the discipline of data mining. In the same way, the processing of text in unstructured documents gave rise to the discipline of text mining. Advances in technology and the growing volume of data and text kept as records have increased not only the number of studies on mining large data sets but also the volume of work on large textual data. Today, data mining and text mining are considered important disciplines for advancing digitalization and discovering valuable, meaningful information. Whether the data are numerical or textual, these disciplines are used to gather information from such masses of data and to build automated systems for information discovery and management. If data that match an institution's needs are managed successfully and effectively, they offer institutions and organizations considerable economic advantages and opportunities.

*Corresponding author: Filiz Ersoz, Industrial Engineering, Karabük University, Karabük, Turkey.

According to the IDC (International Data Corporation) forum on big data and business analytics, the volume of recorded data reached 16 ZB in 2016 (1 ZB ≈ 1.09 trillion gigabytes) and is estimated to reach 163 ZB (1,024 ZB = 1 yottabyte (YB)) (IDC, 2018).
Such a large amount of information has led to the emergence and widespread adoption of data science and has made it necessary to employ data scientists. The fields related to big data science (analytics), which are already part of the business intelligence that supports all kinds of business decision-making, are data mining, text mining, machine learning and artificial intelligence. With them, public and private institutions and organizations can make correct strategic decisions from both numerical and text data. In Turkey, the rapid development of projects such as national data centers and city hospitals, the transition to 4.5G, the continued growth of cloud computing, and the formation of future data centers have further increased the importance of information technologies, big data and data mining (business analytics). Especially in recent years, investment projects in transformational information technologies have begun to increase rapidly in the financial, energy, health and telecom sectors and in public institutions. Institutions need these technologies to reach their corporate goals, to increase customer satisfaction, and to configure data centers that meet their business needs.

Data mining is the process of discovering interrelated rules and patterns in large heaps of data. It is not a single technique but an approach to data that encompasses many techniques. It converts the information in data heaps into a simple, understandable structure; it is the extraction of valuable information from data. Briefly, data mining can be defined as the way to convert data into valuable information.
Today, data mining is also referred to as “business intelligence”, “business analytics”, “knowledge mining in databases”, “knowledge extraction”, “data and pattern analysis” and “data archeology” (Ersoz, 2016). Performing data mining requires access to data, a clear definition of the subject or research question, effective access methods and algorithms, and a high-performance application server. Data mining is closely related to the concept of business intelligence. The well-known common functions of business intelligence technologies are analytical solutions, reporting and, most importantly, data mining. Business intelligence, also called corporate intelligence, refers to systems that enable employees and managers to devote their time to more productive work in today's competitive world. In Turkey, this concept is only beginning to take shape and its importance has only recently been recognized.

In data mining, which aims to obtain specific results from large, unprocessed heaps of data, the data pass through many stages before modeling. In the first stage, the data are cleaned (cleaning): detecting outliers and extreme values yields clean, high-quality data, and combining the data ensures that they are expressed consistently, so that they "speak the same language". Next, the variables that are relevant and important to the research subject are selected and the dimensionality is reduced. The available data are then transformed into a format suitable for mining (transformation), and a model suited to the research is established. The stages that precede model building are regarded as data preparation. The data are then analyzed by applying the techniques most appropriate to the problem and the research.
The data mining cycle is completed when the information is drawn from the databases and the results of the analysis are interpreted by the decision-maker. The data mining process can be regarded as one step of knowledge discovery in databases (KDD), which is also described as a decision support system (Han, 2001; Ersoz, 2016). The general process of information discovery in data mining is shown in Fig. 1.

In the data mining process, it is important to understand both the business problem and the data before modeling. The modeling approach varies with the problem and the type of data. Data mining methods are generally defined in three basic groups: predictive methods (classifiers), clustering, and association rules and sequential patterns. The commonly known stages of data mining are given below (Han, 2012; Ersoz, 2016):

• Identification of the problem or project: determining the purpose of the research and of the data mining, and planning the process by evaluating the current situation.
• Understanding and preparation of data: data selection, connecting to data sources, getting to know the data, and assessing data quality. This stage covers collection of data, integration of data (gathering data from different databases into a single database), cleaning contradictory and extreme values from the data, transformation of data (into a form suitable for data mining), and reduction of data (removing unnecessary data that are likely to have been included).
• Establishing the model.
• Evaluation of the model.
• Use of the model results by the decision-maker.

Data mining is an interdisciplinary study in which machine learning, statistics, database technology, artificial intelligence and visualization are used together. The most important of these fields is statistics. For these reasons, the value of machine learning techniques has increased in parallel with data analysis and modeling studies.
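As an illustration of the data preparation stage described above, the following minimal Python sketch cleans a numeric variable by removing extreme values and then transforms it to a common scale. The interquartile-range rule, the min-max scaling, the function names and the sample spending figures are all assumptions chosen for illustration; the source does not prescribe specific techniques.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Cleaning: keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_scale(values):
    """Transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monthly card spending with one extreme value.
spending = [120, 135, 150, 160, 175, 190, 5000]

cleaned = remove_outliers_iqr(spending)  # extreme value is dropped
scaled = min_max_scale(cleaned)          # values now lie in [0, 1]
print(cleaned)
print(scaled)
```

In practice the cleaning rule and the transformation would be chosen according to the problem and the data type, and the prepared values would then be passed to the modeling stage.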
Text mining, in turn, is related to statistics, data mining, machine learning, management science, artificial intelligence, computer science and other disciplines. Statistics and machine learning can be considered basic elements of both data mining and text mining. Although data analysis has been carried out successfully with statistical methods in data mining, statistical methods alone are insufficient for analyzing unstructured big data in text mining. The data obtained may also contain text, images, sound and so on; unstructured data in text form are analyzed with text mining. In the literature, the terms “text data mining”, “data discovery in databases” and “text analysis” are used in place of “text mining” (Yıldız et al., 2018).

Fig. 1. Information discovery with data mining

A text mining study is a data mining study that takes text as its data source. By another definition, text mining aims to obtain structured data from text (Seker, 2015). The discovery of meaningful and important information within textual data can be defined as text mining. Text mining is used to extract facts and relationships in a structured form, taking into account the relevance of texts to private databases, in order to support operational and strategic decision-making in business intelligence. In short, text mining is an approach for identifying and extracting information in unstructured text. Text mining can also be defined as an information-intensive process in which users interact, through special tools, with data collected over time (Feldman, 2007). Miner et al. (2012) defined text mining as technologies for transforming texts into numbers. The importance of text mining is also growing with the increasing use of social media tools such as Facebook, Instagram and Twitter.
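The idea of transforming texts into numbers can be illustrated with a simple document-term matrix, in which each text becomes a row of word counts. This is a minimal Python sketch under stated assumptions: the tokenizer, the function names and the two sample sentences are hypothetical, and counting words is only one of many possible numerical representations.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample texts.
docs = [
    "Text mining turns text into numbers.",
    "Data mining finds patterns in data.",
]

vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Once texts are in this structured, numerical form, the classification, clustering and association techniques of data mining can be applied to them.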
Oguzlar (2011) defined text mining as the process of extracting hidden information from the data in a text and giving structure to irregular data that have no specific format. Hearst described text mining as a subfield of data mining (Hearst, 1999). Yıldız (2018) showed that the use of text mining methods in accounting accelerated and facilitated processes.

With text mining technology, all texts can be scanned, similar studies can be evaluated together, and comparisons can be made easily. Text mining is integrated into the much broader analytics market and into business intelligence solutions, and is ready to enable semantic search. Google and Amazon are among the most important examples of text mining in use. Text mining can also be called text data mining. It is the process required to obtain meaningful structured information from unstructured sources. It requires advanced linguistic and statistical methods that can analyze unstructured text formats, and techniques that associate each document with actionable metadata. Once the content has been extracted, it can be summarized, visualized and classified directly through link mapping, which makes it easier to search. Text mining can thus be defined as bringing texts into a single form, pre-processing them, and transforming the data into a structured form.

Text mining usually consists of four stages (Faro et al., 2011; Zhu et al., 2013), including:

• Information retrieval (IR): a set of text materials on a specific topic is collected and the collected texts are represented by their textual features.
• Information extraction (IE): meaningful relationships between texts with similar features are uncovered.
• Knowledge discovery (KD): particular patterns and trends are identified from the resulting meaningful relationships.

Application Areas: Data mining has a wide range of uses. Because it makes raw data meaningful by processing them into information, it is used in many fields.
In addition to banking, biology, finance, marketing, insurance and medicine, data mining is also used in text and web mining. Text mining is often applied in text analysis studies, in text classification and in word analysis problems. Web text mining, on the other hand, examines the content of the websites that make up the basic structure of the Internet. Content such as the headings on sites, the words on pages, menus, subject structure and images is examined, and the relationships between sites are determined. Based on these findings, websites can be divided into classes and categories (Dolgun et al., 2010).

Among studies conducted with data mining and text mining methods, many text mining studies concern the health sector. The literature also shows that text mining is widely used in the banking and finance (Yıldız, 2018) and electronic commerce (Altan, 2018) sectors. Karami et al. (2018) and Solloum et al. (2017) applied text mining methods to the general views and feedback of society; Altan (2018) studied the impact of news stories on corporate image using hospitals as a sample; and Ye et al. (2016), McTaggart et al. (2018) and Delespierre et al. (2017) analyzed the big data recorded in existing systems. In addition, Lammey (2015) described the current status of the launch of CrossRef's text and data mining service among its member publishers. In all of these studies, data were classified using various text mining techniques and the relationships among them were determined. In addition, either a new system was developed in line with the analysis results, or the existing system was extended and improved. The data mining literature shows that data mining is most widely used in the finance and banking sector (Bhambri, 2011; Bhasin, 2006).
In addition, there are studies in health (Lopez-Pineda et al., 2018; Shameer et al., 2018; Yadav et al., 2018) and in management and informatics, the latter aiming to reveal customer satisfaction and customer profiles (Minghetti, 2003). Parman (2003) applied data mining to credit card expenditures, carrying out customer segmentation and risk assessment analyses on the data of 1,704 credit card customers of a single branch of a medium-sized private bank. The study sought to determine how credit card holders differ from one another in their spending characteristics (Parman, 2003). Hsia et al. (2008) used data mining techniques to analyze course preferences and course completion rates at a university in Taiwan. The aim of the study was to determine the course preferences of students who are continuing their education. Decision trees were used to find the students' course preferences, link analysis was used to determine the correlation between course category and participant profession, and a decision forest was used to estimate the probability that participants would complete their preferred course. CHAID was used as the decision tree algorithm. As a result, high prediction accuracy was achieved.

Farquad, Ravi and Raju proposed a hybrid method that combines Support Vector Machine and Naive Bayes methods to predict churn among bank credit card customers. The data set used in the study contained 6.89% churned customers. A correct classification rate of 68.52% was obtained with the extracted rules (Farquad et al., 2009).

Comparison of Data Mining and Text Mining: The main difference between data mining and text mining is the homogeneity or heterogeneity of the data structures used. Data mining analyzes information in a homogeneous, structured format; data are extracted, transformed and loaded into a data warehouse.
Business analysts use data mining software to analyze data and present them in easily understandable forms such as tables or charts. Text mining generates valuable information from data in heterogeneous formats (text documents, e-mails, social media posts, verbatim texts, etc.), including multilingual texts and abbreviations. Data mining analyzes stacks of digital data. The information is homogeneous and easily accessed, and the algorithms used yield rapid solutions. In text mining, the complexity of the processed data is quite high and reaching a solution takes longer. Text mining needs several intermediate stages of linguistic analysis before it can enrich the content. After linguistic analysis, the metadata association steps address the structuring of unstructured content for domain-specific applications. Unlike text mining, data mining uses only structured data; once text data have been structured, meaningful relationships and trends can be found in them with data mining techniques and algorithms. With text mining, unstructured data can be transformed into structured data, and data mining can then be carried out. In text mining, as in data mining, the investigator explores genuinely usable information in data sources by examining and describing the area of interest. However, text mining works with unstructured textual data rather than with formalized database records. Besides, text mining has much in common with data mining. For example, most systems of both kinds are based on preprocessing routines, pattern discovery algorithms, and presentation-layer elements such as visualization tools (Feldman et al., 2007). A summary comparison of data mining and text mining is given in Table 1.

Table 1. Comparison of data mining and text mining

Data Mining | Text Mining
Targets the process of obtaining high-quality information from large data heaps. | Targets the process of obtaining high-quality information from a large mass of textual data.
Illustrates the meaningful relationships in data with modeling approaches. | The text processed may vary with language, religion, race and ethnic factors, and with the method and linguistic approach applied.
A necessary process for converting structured data into meaningful information. | A necessary process for converting an unstructured text document into valuable structured information.
Related to statistics, machine learning, optimization, data warehouses, expert systems, pattern recognition, artificial intelligence and algorithm concepts in computer science. | Associated with statistics, machine learning, management science, artificial intelligence, computer science, natural language processing (NLP) and other disciplines.
Business models are generated from numerical data using data mining models. | Text methods are used to discover lexical, syntactic and semantic features in the text.
Discovers information from structured data that are homogeneous and easy to access. | Discovers information from heterogeneous, unstructured data.
Enables the discovery of information from large data with modern machine learning algorithms and data mining models (classifiers, clusters and association rules). | Valuable information is discovered through classification, clustering, association, information extraction and summarization, using machine learning, artificial intelligence and algorithms.
Enables the appropriate method for collecting and using the data to be determined and analyzed. | Multiple steps and methods may need to be applied to reach meaningful data.
Works better with large amounts of data. | The size of the text data can vary with the area studied. As the area of interest grows, the results may become complicated. The best results come from large bodies of text in which the relevant fields are small.

Conclusion

With the development of technology, collecting and evaluating data has become more important, and this has led to the accumulation of big data. It has therefore become increasingly important to distinguish the data stacks stored in databases and data warehouses according to the nature of the data and to establish meaningful relationships among the data. Big data and data science emerged in response to the growing use of, and need for, information technologies. As data science and big data systems developed, the discipline of data mining emerged to analyze large amounts of numerical data. The discipline of text mining likewise emerged to process text in unstructured documents.

It is well known that managing data mining and text mining efficiently brings financial and managerial benefits to institutions and organizations. In recent years, investment projects in information technologies have begun to increase rapidly, especially in the finance, energy, health and telecom sectors and in public institutions. With these techniques, which can be used in every field, institutions can classify their data according to set criteria, analyze selected elements together, establish infrastructures that facilitate the operation of their systems, and cluster their data. In this way, institutions and organizations can make their strategic decisions effectively and quickly. There are also organizations, such as CrossRef, that have been established to provide these benefits.
These organizations aim to bring together work done in the disciplines of data mining and text mining, to facilitate access to the necessary data, and to identify the relevant methods (Lammey, 2015).

In this respect, it is important to understand data mining, which converts data into useful information, and to determine where it can be used. The use of data mining techniques depends on the availability of data, a clear definition of the subject, the definition of access methods and algorithms, and the provision of the necessary technological capabilities. If these conditions are met, the appropriate modeling method is chosen according to the problem or data type. These methods are generally grouped as predictive methods, clustering, association rules, and sequential patterns.

Text mining, a subbranch of data mining, can reach the desired knowledge in high-volume texts. Because data are mostly textual, most data mining applications are carried out with text mining. According to Miner et al., text mining is very general in its applications and very diverse in its objectives. Compared with other well-established methods, it is considered a relatively new and non-standardized analytical method for discovering information (Miner et al., 2012). The main differences between data mining and text mining are as follows: the structure of the data differs; data mining produces solutions faster than text mining because its data are less complex; and data mining techniques are also used in text mining to reveal significant relationships and trends.

Data mining is based on knowledge discovery in databases (KDD) and is the process of finding valuable information in the information contained in databases. The term data mining is often used interchangeably with KDD.

In conclusion, text mining, like data mining, is becoming increasingly widespread.
Text mining methods in particular fill a gap in data mining, namely the processing of linguistic expressions, and so make it easier to establish effective, fast systems and to analyze data that are more complex than those handled by data mining.

Data science comprises many disciplines, among them data mining and text data mining. Others include big data analytics, predictive modeling, data visualization, natural language processing (NLP), statistics, mathematics and artificial intelligence. Data science discovers knowledge through machine learning, using data mining and modeling techniques together with text data mining.

Data are a valuable raw material for information. Data mining and text mining, combined with big data analytics, help to solve problems and make better decisions and contribute to reducing time and effort. As the use of such methods in institutions and organizations increases, rapid and effective improvements in many sectors, accurate analysis of systems, and meaningful data can be achieved.

REFERENCES

Altan S. 2018. “Analysis of the Impact of News Stories on the Organizational Image and a Text Mining Analysis on the News Stories about Hospitals in Turkey”, Journal of Communication Theory and Research, vol 46: p. 222-240.
Bhambri V. 2011. “Application of Data Mining in Banking Sector”, IJCST 2(2), 199-202.
Bhasin M.L. 2006. “Data Mining: A Competitive Tool in the Banking and Retail Industries”, The Chartered Accountant, 588-594.
Delespierre T., Denormandie P., Barhen A., Josseran L. 2017. “Empirical Advances with Text Mining of Electronic Health Records”, BMC Medical Informatics and Decision Making, 17 (1), 127.
Dolgun O.M., Ozdemir T.G. and Oguz D. 2009. “Analysis of Non-Structural Data in Data Mining: Text and Web Mining”, The Journalists' Journal 2 (2), pp. 48-58.
Ersoz F. 2016. Data Mining Techniques and Applications, 72 Digital Printing, Ankara.
Faro A., Giordano D., Spampinato C. 2011. “Combining Literature Text Mining with Microarray Data: Advances for System Biology Modeling”, Brief Bioinform., 13(1):61–82.
Farquad M. A. H., Ravi V., Raju S.B. 2009. “Data Mining Using Rules Extracted from SVM: An Application to Churn Prediction in Bank Credit Cards”, Springer-Verlag, Berlin Heidelberg, 390–397.
Feldman R., Sanger J. 2007. “The Text Mining Handbook – Advanced Approaches in Analyzing Unstructured Data”, Cambridge University Press.
Gursakal N. 2014. Big Data, Dora Publications, Bursa.
Hearst M.A. 1999. “Untangling Text Data Mining”, In: Proceedings of 37th Annual Meeting of the Association for Computational Linguistics; 3–10.
Han J. and Kamber M. 2001. “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers.
Hsia T. C., Shie A. J., Chen L. C. 2008. “Course Planning of Extension Education to Meet Market Demand by Using Data Mining Techniques – An Example of Chinkuo Technology University in Taiwan”, Expert Systems with Applications, 34: 596–602.
Karami A., Dahl A., Turner-McGrievry G., Kharrazi H., Shaw G. 2018. “Characterizing Diabetes, Diet, Exercise, and Obesity Comments on Twitter”, International Journal of Information Management, Vol 38:1, p. 1-6.
Lammey R. 2015. “CrossRef Text and Data Mining Services”, Insights, 28(2), July.
López-Pineda A., Rodríguez-Moran M. F., Álvarez-Aguilar C., Fuentes Valle S. M., Acosta-Rosales R., Bhatt A.S., Sheth S.N. and Bustamante C. D. 2018. “Data Mining of Digitized Health Records in a Resource-Constrained Setting Reveals That Timely Immunophenotyping is Associated with Improved Breast Cancer Outcomes”, BMC Cancer, Vol 18:933.
McTaggart S., Nangle C., Caldwell J., Alvarez-Madrazo S., Colhoun H., Bennie M. 2018. “Use of Text Mining Methods to Improve Efficiency in the Calculation of Drug Exposure to Support Pharmacoepidemiology Studies”, International Journal of Epidemiology 47 (2), 617-624.
Miner G., Delen D., Fast A., Hill T., Elder J. and Nisbet B. 2012. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Elsevier, USA.
Minghetti V. 2003. “Building Customer Value in the Hospitality Industry: Towards the Definition of Customer-Centric Information System”, Information Technology & Tourism 6 (2), 141-152.
Oguzlar A. 2011. Basic Text Mining, Dora Publications, Bursa.
Parman D. 2003. Data Mining Method for Customer Relationship Management in Banking Sector: Implementation in a Private Bank. PhD Thesis, Marmara University, Institute of Banking and Insurance, Istanbul.
Shameer K., Perez-Rodriguez M. M., Bachar R., Li L., Johnson A., Johnson K. W., Glucksberg B. S., Smith M.R., Redhead B., Scarpa Jhf, Kebakaran J., Kovatch P., Lim S., Goodman W., Reich D.L., Kasarskis A., Tatonetti N.P. and Dudley J.T. 2018. “Pharmacological Risk Factors Associated with Hospital Readmission Rates in a Psychiatric Cohort Identified Using Prescriptome Data Mining”, BMC Medical Informatics and Decision Making 18(Suppl 3):79.
Solloum S.A., Al-Emran M., Monem A. and Shaalan K. 2017. “A Survey of Text Mining in Social Media: Facebook and Twitter Perspectives”, Adv. Sci. Technol. Eng. Syst. J 2(1), 127-133.
Yadav P., Steinbach M., Kumar V. and Simon G. 2018. “Mining Electronic Health Records (EHRs): A Survey”, ACM Computing Surveys, Vol. 50, No. 6, Article 85. Publication date: January.
Ye Z., Tafti A.P., He K. T., Wang K., He M. M. 2016. “SparkText: Biomedical Text Mining on Big Data Framework”, PLoS ONE, 11(9).
Yıldız D., Agdeniz S. 2018. “Text Mining as an Analyzing Method in Accounting”, Journal of The World of Accounting Science, 20(2); 286-315.
Zhu F., Patumcharoenpol P., Zhang C., Yang Y., Chan J., Meechai A., Vongsangnak W., Shen B. 2013. “Biomedical Text Mining and Its Applications in Cancer Research”, J Biomed Inform., 46(2):200–11.
IDC, https://www.idc.com
