PPT 1
Datawarehousing
CS-303
1.1 Why Data Mining?
• Where does the data come from? Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
• E.g. Most customers with an income of $60k – $80k and food expenses of $600 – $800 a month live in that area
– Determine customer purchasing patterns over time
• E.g. Customers between 20 and 29 years old with an income of $20k – $29k usually buy this type of CD player
Ex.: Market Analysis and Management (2)
• Fraud detection
• Find outliers of unusual transactions
• Financial planning
• Summarize and compare the resources and spending
Knowledge Discovery (KDD) Process
[Figure: pyramid of increasing potential to support business decisions. Recoverable layers: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting).]
[Figure: Data Mining shown alongside related fields: Information Science, Machine Learning, Visualization, Other Disciplines.]
• Data are organized around major subjects, e.g. customer, item, supplier and
activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
– Classification
• The process of finding a model that describes and distinguishes the data classes or
concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data (data objects whose
class label is known).
• The model can be represented in classification (IF-THEN) rules, decision trees,
neural networks, etc.
– Prediction
• Predict missing or unavailable numerical data values
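As an illustration of the classification bullets above, here is a minimal sketch of learning a single IF-THEN rule from labeled training data and then using it to predict the class of new objects. The function names (`train_stump`, `predict`), the one-attribute threshold rule, and the training data are all assumptions made for this example; the slides do not prescribe a particular algorithm.

```python
from collections import Counter

def train_stump(samples):
    """Learn one rule: IF value <= threshold THEN left_label ELSE right_label.

    samples: non-empty list of (value, label) pairs whose class label is known.
    Returns the (threshold, left_label, right_label) with the best training accuracy.
    """
    best = None
    for threshold in sorted({v for v, _ in samples}):
        left = [lab for v, lab in samples if v <= threshold]
        right = [lab for v, lab in samples if v > threshold]
        # Each side of the split predicts its majority class.
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        correct = sum(
            1 for v, lab in samples
            if lab == (left_label if v <= threshold else right_label)
        )
        if best is None or correct > best[0]:
            best = (correct, threshold, left_label, right_label)
    return best[1:]

def predict(rule, value):
    """Apply the learned IF-THEN rule to an object whose class label is unknown."""
    threshold, left_label, right_label = rule
    return left_label if value <= threshold else right_label

# Hypothetical training set: (annual income in $k, buys the premium product?)
training = [(22, "no"), (25, "no"), (31, "no"), (48, "yes"), (55, "yes"), (70, "yes")]
rule = train_stump(training)
print(rule)                                  # learned (threshold, label, label)
print(predict(rule, 28), predict(rule, 60))  # classes predicted for unseen incomes
```

A decision tree can be viewed as a nested set of such threshold rules; this sketch stops at a single split to keep the idea visible.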
• Cluster Analysis
• Class label is unknown: group data to form new classes
• Clusters of objects are formed based on the principle of
maximizing intra-class similarity and minimizing inter-class
similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.
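The grouping idea above can be sketched with k-means, one common clustering method (the slides do not name a specific algorithm, so this choice is an assumption). The function `kmeans_1d`, the starting centers, and the spending figures are hypothetical and serve only to show how unlabeled values end up in groups whose members are similar to each other.

```python
def kmeans_1d(values, centers, iterations=20):
    """Group numeric values around the given starting centers (k-means, 1-D).

    Returns (final_centers, clusters). No class labels are used at any point.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters

# Hypothetical monthly spending of seven customers, in dollars.
spending = [610, 640, 700, 780, 1500, 1550, 1620]
centers, clusters = kmeans_1d(spending, centers=[600, 1600])
print(clusters)  # two customer groups that could be targeted separately
```

In practice the result depends on the starting centers and on how similarity is measured, so the groups found here should be read as one possible segmentation rather than the only one.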
• Outlier Analysis
• Data that do not comply with the general behavior or model.
• Outliers are usually discarded as noise or exceptions.
• Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
• Evolution Analysis
• Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
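The outlier-analysis example above (detecting purchases of extremely large amounts) can be sketched as follows. The rule used here, flagging amounts that lie more than `k` median absolute deviations from the median, is one simple choice among many and is an assumption of this example, as are the function name `flag_outliers`, the default `k`, and the purchase data.

```python
from statistics import median

def flag_outliers(amounts, k=5.0):
    """Return the amounts that deviate strongly from typical behavior.

    An amount is flagged when its distance from the median exceeds
    k times the median absolute deviation (MAD).
    """
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [a for a in amounts if a != med]
    return [a for a in amounts if abs(a - med) > k * mad]

# Hypothetical purchase amounts on one credit card, in dollars.
purchases = [42.0, 55.5, 38.0, 61.0, 47.25, 52.0, 4999.0]
print(flag_outliers(purchases))  # candidates for a fraud review
```

The median and MAD are used instead of the mean and standard deviation because a single extreme purchase barely shifts them, so the outlier does not hide itself by inflating the baseline.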
• Database
• Relational, data warehouse, transactional, stream, object-oriented/relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
• Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
1.9 Major Issues in Data Mining