01 Unit1
Unit 1
Structure
1.1 Introduction
    Objectives
1.2 Meaning and Working of Data Mining
1.3 Data, Information and Knowledge
1.4 Data Warehousing and Data Mining Relation
1.5 Data Mining and Knowledge Discovery
1.6 Data Mining and OLAP
1.7 Data Mining and Statistics
1.8 Data Mining Technologies
1.9 Data Mining Software
1.10 Summary
1.11 Terminal Questions
1.12 Answers
1.1 Introduction
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.
Objectives
At the end of this unit, you should be able to:
o explain the basics of data mining
o describe the relationship between data mining and business intelligence tools such as data warehousing, OLAP and statistics.
Data Mining
We will illustrate the purpose of data mining with a point of sale (POS) system. Supermarkets usually employ a POS system that collects data on each item purchased: the brand name, category, size, time and date of the purchase, and the price paid. In addition, the supermarket usually has a customer rewards program, which is also an input to the POS system. This information can directly link the products purchased to an individual. The supermarket stores this data, for every purchase made over many years, in a database.

Now you have a database with millions of records. What will you do with this huge volume of data? How do you use it to forecast or control your business activities? The answer is data mining. Using data mining techniques or algorithms, you can uncover trends, statistical correlations, relationships and patterns that can help your business become more efficient, effective and streamlined. The supermarket can now figure out which brands sell the most, which time of the day, week, month or year is the busiest, and which products consumers buy along with certain items. For instance, if a person buys white bread, what other items would they be inclined to buy? Typically we might find that they are peanut butter and jelly. A supermarket can obtain a great deal of useful information just by mining the data it has already collected.

Several technical bodies have given definitions of data mining. Some of them are listed below.

Data Mining Definitions
o Data mining is the efficient discovery of valuable, non-obvious information from a large collection of data.
o Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.
o It is the automatic discovery of new facts and relationships in data that are like valuable nuggets of business data.
o It is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions.
Sikkim Manipal University Page No. 2
It is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, visualization, and neural networks. Data mining streamlines the transformation of masses of information into meaningful knowledge, which is the bottom line of business intelligence. Typical techniques for data mining include decision trees, neural networks, nearest neighbor clustering, fuzzy logic, and genetic algorithms.

How does data mining work
Although data mining is still in its infancy, companies in a wide range of industries, including finance, health care, manufacturing and transportation, are already using data mining tools and techniques to take advantage of historical data.

The whole logic of data mining is based on modeling. Modeling is simply the act of building a model (a set of examples or a mathematical relationship) based on data from situations where the answer is known, and then applying the model to other situations where the answer is not known. Modeling techniques have been around for centuries, of course, but only recently have the data storage and communication capabilities required to collect and store huge amounts of data, and the computational power to automate modeling techniques to work directly on the data, become available.

As a simple example of building a model, consider the director of marketing for a telecommunications company. He would like to focus his marketing and sales efforts on the segments of the population most likely to become big users of long-distance services. He knows a lot about his customers, but it is impossible to discern the common characteristics of his best customers because there are so many variables. From the existing database of customers, which contains information such as age, sex, credit history, income, zip code and occupation, he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make many long-distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 21 and 35 who earn in excess of $60,000 per year. This, then, is his model for high-value customers, and he would budget his marketing efforts accordingly.
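The idea of building a model from situations where the answer is known and applying it where the answer is not known can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the field layout and the function names segment, build_model and predict are hypothetical, and a simple per-segment frequency table stands in for the neural network mentioned in the text.

```python
from collections import defaultdict

# Hypothetical historical records where the answer (heavy long-distance use) is known.
# Fields: (marital_status, age, income, heavy_user)
known_customers = [
    ("single", 25, 72000, True),
    ("single", 31, 65000, True),
    ("married", 45, 58000, False),
    ("single", 29, 40000, False),
    ("married", 33, 80000, False),
    ("single", 34, 90000, True),
]

def segment(marital_status, age, income):
    """Map a customer to a coarse segment (an illustrative choice of variables)."""
    age_band = "21-35" if 21 <= age <= 35 else "other"
    income_band = "over_60k" if income > 60000 else "60k_or_less"
    return (marital_status, age_band, income_band)

def build_model(records):
    """Learn the share of heavy users observed in each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [heavy users, total customers]
    for marital_status, age, income, heavy in records:
        key = segment(marital_status, age, income)
        counts[key][0] += int(heavy)
        counts[key][1] += 1
    return {key: heavy / total for key, (heavy, total) in counts.items()}

def predict(model, marital_status, age, income, default=0.0):
    """Apply the model to a customer whose answer is not known.

    Segments never seen in the historical data fall back to `default`.
    """
    return model.get(segment(marital_status, age, income), default)

model = build_model(known_customers)
prospect_score = predict(model, "single", 27, 68000)
```

With this toy data, every known customer in the prospect's segment was a heavy user, so the prospect receives the highest score. A real model would be built from far more records and variables, and would be checked on data held back from model building.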
Remember, data mining is the task of discovering interesting patterns from large amounts of data where the data can be stored in databases, data warehouses or other information repositories.
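As a small illustration of discovering a pattern in stored data, the supermarket question from earlier in this unit (what do customers buy along with white bread?) can be answered by counting co-occurrences. The sketch below assumes a handful of hypothetical baskets and a hypothetical helper named bought_with; it is a plain frequency count, not a full association rule algorithm.

```python
from collections import Counter

# Hypothetical point-of-sale baskets; each set is one purchase.
baskets = [
    {"white bread", "peanut butter", "jelly"},
    {"white bread", "peanut butter"},
    {"white bread", "peanut butter", "milk"},
    {"milk", "eggs"},
    {"white bread", "jelly", "milk"},
]

def bought_with(item, transactions):
    """For each other item, return the share of baskets containing `item`
    that also contain that other item."""
    with_item = [basket for basket in transactions if item in basket]
    if not with_item:
        return {}
    companions = Counter(
        other for basket in with_item for other in basket if other != item
    )
    return {other: count / len(with_item) for other, count in companions.items()}

companions = bought_with("white bread", baskets)
```

In this toy data, peanut butter appears in three of the four baskets that contain white bread, which is the kind of relationship a supermarket could act on.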
Any organization, or any system in general, is faced with a wealth of data that is maintained and stored, but the inability to discover valuable, often previously unknown information hidden in the data prevents it from transforming this data into knowledge or wisdom. To overcome this, the following steps are to be followed.
1. Capture and integrate both the internal and external data into a comprehensive view.
2. Mine the integrated data for information.
3. Organize and present the information and knowledge in ways that expedite complex decision making.
5. Selection of appropriate data-mining task (Data Mining Task)
   o Summarization, classification, regression, clustering, etc.
6. Selection of data-mining algorithm(s) (Data Mining Task)
   o Methods to search for patterns
   o Decision of which models and parameters may be appropriate
   o Match method to goal of KDD process
7. Data mining
   o Searching for patterns of interest in one or more representational forms
8. Interpretation and visualization
   o Interpretation of mined patterns
   o Visualization of extracted patterns and models
   o Visualization of the data given the extracted models
9. Consolidating discovered knowledge
   o Incorporating the discovered knowledge into another system
   o Documenting and reporting knowledge to interested parties
   o Checking for inconsistencies with previously extracted or believed knowledge
these relations using statistical nomenclature. Without statistics, there would be no data mining, as statistics is the foundation of most of the technologies on which data mining is built. Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the building blocks on which more advanced statistical analyses rest. Classical statistical analysis certainly plays a significant role at the heart of today's data mining tools and techniques.

Note: Data mining has its roots in statistics, artificial intelligence and machine learning. Statistics, AI and machine learning are outside the scope of this unit, so we do not explore them further here. Data mining techniques are explored in detail in the forthcoming units.
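A few of the classical concepts named above (standard deviation, variance and regression analysis) can be computed directly with Python's standard library. The sketch below is illustrative only: the advertising and sales figures are hypothetical, and simple_regression is a hypothetical helper implementing ordinary least squares for a single predictor.

```python
import statistics

# Hypothetical paired observations: weekly advertising spend and units sold.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
units_sold = [25.0, 45.0, 65.0, 85.0, 105.0]

mean_units = statistics.mean(units_sold)
stdev_units = statistics.stdev(units_sold)        # sample standard deviation
variance_units = statistics.variance(units_sold)  # sample variance

def simple_regression(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = simple_regression(ad_spend, units_sold)
```

Because the toy data lies exactly on a straight line, the fitted relationship is units sold = 2 × spend + 5. Real data would scatter around the line, and the standard deviation gives a first measure of that spread.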
Nearest neighbor: A classification technique that classifies each record based on the records most similar to it in a historical database.

Data mining has different applications in industry. Some of them are given below:
o Identifying new customers
o Predicting customer buying habits
o Confirming suitable loan applicants
o Revealing fraud
o Relationship marketing
o Managing equity portfolios
o Diagnosing medical problems
o Inventory management
o Conducting certain aspects of marketing
o Customer segmentation
o Web site design and promotion

Data Mining Industries
o Banking
o Insurance
o Credit marketing
o Telecommunications
o Pharmaceuticals
o Bioinformatics, etc.
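The nearest neighbor technique defined above can be sketched as follows, using the loan applicant application as the setting. The historical records, the two features (age and income in thousands), the class labels and the function name classify are all hypothetical, and the choice of three neighbors with Euclidean distance is an assumption made for illustration.

```python
import math
from collections import Counter

# Hypothetical historical records: (age, income in thousands) -> known class label.
history = [
    ((25, 40), "declined"),
    ((30, 45), "declined"),
    ((45, 90), "approved"),
    ((50, 95), "approved"),
    ((40, 85), "approved"),
]

def classify(record, labelled, k=3):
    """Label `record` by majority vote among its k closest historical records."""
    nearest = sorted(labelled, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

label = classify((42, 88), history)
```

Here the new applicant sits closest to previously approved records, so the vote returns "approved". In practice the features would normally be rescaled first so that no single variable dominates the distance.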
integrated with the customer's existing systems to provide scalable, high-performing predictive analysis without moving data into proprietary data mining platforms.

Enterprise Miner (SAS Institute Inc.): It offers a complete data mining solution with a range of model development and deployment alternatives and extensive integration opportunities. Delivered as a distributed client-server system, it is especially well suited to data mining in large organizations.

Clementine (SPSS Inc - Integral Solutions): Clementine is an enterprise data mining workbench that enables you to develop predictive models quickly using business expertise and deploy them into business operations to improve decision making.

DBMiner (DBMiner Technology Inc.): DBMiner Insight solutions are the world's first server applications providing powerful and highly scalable association, sequence and differential mining capabilities for the Microsoft SQL Server Analysis Services platform. They also provide market basket, sequence discovery and profit optimization for Microsoft Accelerator for Business Intelligence.

Weka 3: It is a collection of machine learning algorithms for solving data mining problems. It is written in Java, so it is portable across platforms. For details, visit http://www.cs.waikato.ac.nz/weka/

Oracle 10g: Oracle 10g provides software called Darwin, which is a data mining tool. It incorporates cluster analysis, classification, prediction and association rules.

In addition to the above list, the following are popular: GhostMiner, Mantas, CART and MARS.
1.10 Summary
Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use.
Data: Data are any facts, numbers, or text that can be processed by a computer.
Metadata: Data about the data itself, such as logical database design or data dictionary definitions.
Information: The patterns, associations, or relationships among data, or processed data in general, are called information.
Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization.
Data mining consists of many up-to-date techniques such as
o Classification
o Clustering
o Association
Data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing (understandability, summary, presentation), a good understanding of the problem domain, and domain expertise.
Data mining is also referred to as knowledge discovery in databases (KDD).
OLAP and data mining can complement each other. OLAP stands for Online Analytical Processing.
Data mining is a step in the KDD (Knowledge Discovery in Databases) process.
1.12 Answers
Self Assessment Questions
1. Discovering
2. Historical, future
3. Meta data
4. Data
5. Data warehouse and data mining
6. Knowledge discovery
7. Decision support
8. Complement
9. Multiple

Terminal Questions
1. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Refer to sections 1.2 and 1.8.
2. Online Analytical Processing (OLAP) is a technology that is used to create decision support software. Refer to section 1.6.
3. Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization. Refer to section 1.4.
4. Artificial neural networks, decision trees, rule induction, etc. Refer to section 1.8.
5. Data mining is also referred to as knowledge discovery in databases (KDD). Refer to section 1.6.
6. i. Classification ii. Clustering iii. Association
dimensionality reduction, cleaning,