
MODULE 1

1.1 DATA MINING

● Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW),
and various data storage devices every day from business, society, science and engineering,
medicine, and almost every other aspect of daily life.
● This explosive growth of available data volume is a result of the computerization of our society
and the fast development of powerful data collection and storage tools.
● This explosively growing, widely available, and gigantic body of data makes our time truly the
data age. Powerful and versatile tools are badly needed to automatically uncover valuable
information from the tremendous amounts of data and to transform such data into organized
knowledge.
● In summary, the abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data-rich but information-poor situation.
● Data mining is a powerful new technology with great potential to help companies focus on the
most important information in their data warehouses.
● It has been defined as
“The automated analysis of large or complex data sets in order to discover significant patterns or
trends that would otherwise go unrecognised.”
● In addition, many other terms have a similar meaning to data mining—for example, knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.
● Data mining uses mathematical analysis to derive patterns and trends that exist in data.
Typically, these patterns cannot be discovered by traditional data exploration because the
relationships are too complex or because there is too much data.
● KINDS OF DATA THAT CAN BE MINED
o Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level. Flat files are simple data files in text or
binary format with a structure known by the data mining algorithm to be applied. The
data in these files can be transactions, time-series data, scientific measurements, etc.
o Relational Databases: A relational database consists of a set of tables containing
either values of entity attributes, or values of attributes from entity relationships.
Tables have columns and rows, where columns represent attributes and rows represent
tuples. A tuple in a relational table corresponds to either an object or a relationship
between objects and is identified by a set of attribute values representing a unique key.
o Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected
from multiple data sources (often heterogeneous) and is intended to be used as a whole
under the same unified schema. A data warehouse gives the option to analyze data from
different sources under the same roof.
o Transaction Databases: A transaction database is a set of records representing
transactions, each with a time stamp, an identifier and a set of items. Since relational
databases do not allow nested tables (i.e. a set as attribute value), transactions are usually
stored in flat files or stored in two normalized transaction tables, one for the transactions
and one for the transaction items.
o Multimedia Databases: Multimedia databases include video, images, audio and text
media. They can be stored on extended object-relational or object-oriented databases, or
simply on a file system.
o Spatial Databases: Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional positioning.
o Time-Series Databases: Time-series databases contain time-related data such as stock
market data or logged activities. These databases usually have a continuous flow of new
data coming in, which sometimes creates the need for challenging real-time analysis.
o World Wide Web: The World Wide Web is the most heterogeneous and dynamic
repository available. A very large number of authors and publishers are continuously
contributing to its growth and metamorphosis, and a massive number of users are
accessing its resources daily. Data in the World Wide Web is organized in inter-connected
documents. These documents can be text, audio, video, raw data, and even applications.
● KINDS OF PATTERNS THAT CAN BE DISCOVERED
▪ Characterization: Data characterization is a summarization of general features
of objects in a target class, and produces what is called characteristic rules. The
data relevant to a user-specified class are normally retrieved by a database query
and run through a summarization module to extract the essence of the data at
different levels of abstraction.
▪ Discrimination: Data discrimination produces what are called discriminant rules
and is basically the comparison of the general features of objects between two
classes, referred to as the target class and the contrasting class. For example, one
may want to compare the general characteristics of the customers who rented
more than 30 movies in the last year with those who rented fewer than 5. The
techniques used for data discrimination are very similar to the techniques used
for data characterization, with the exception that data discrimination results
include comparative measures.
▪ Association analysis: Association analysis is the discovery of what are
commonly called association rules. It studies the frequency of items occurring
together in transactional databases, and based on a threshold called support,
identifies the frequent item sets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction when another item
appears, is used to pinpoint association rules. Association analysis is commonly
used for market basket analysis. For example, it could be useful for the
VideoStore manager to know what movies are often rented together, or whether
there is a relationship between renting a certain type of movie and buying popcorn.
▪ Classification: Classification analysis is the organization of data in given
classes. Also known as supervised classification, the classification uses given
class labels to order the objects in the data collection. Classification approaches
normally use a training set where all objects are already associated with known
class labels. The classification algorithm learns from the training set and builds a
model. The model is used to classify new objects. For example: email spam
classification.
▪ Prediction: Prediction has attracted considerable attention given the potential
implications of successful forecasting in a business context. There are two major
types of predictions: one can either try to predict some unavailable data values or
pending trends, or predict a class label for some data. The latter is tied to
classification. Once a classification model is built based on a training set, the
class label of an object can be foreseen based on the attribute values of the object
and the attribute values of the classes. Prediction, however, more often refers
to the forecast of missing numerical values, or of increase/decrease trends in
time-related data. The major idea is to use a large number of past values to consider
probable future values.
▪ Clustering: Similar to classification, clustering is the organization of data in
classes. However, unlike classification, in clustering, class labels are unknown
and it is up to the clustering algorithm to discover acceptable classes. Clustering
is also called unsupervised classification, because the classification is not
dictated by given class labels.
▪ Outlier analysis: Outliers are data elements that cannot be grouped in a given
class or cluster. Also known as exceptions or surprises, they are often very
important to identify. While outliers can be considered noise and discarded in
some applications, they can reveal important knowledge in other domains, and
thus can be very significant and their analysis valuable.
● Data Mining Applications
o Financial Data Analysis
Financial data in the banking and financial industry are generally reliable and of high quality,
which facilitates systematic data analysis and data mining. Some of the typical cases are
as follows –
● Design and construction of data warehouses for multidimensional data analysis
and data mining.
● Loan payment prediction and customer credit policy analysis.
● Classification and clustering of customers for targeted marketing.
● Detection of money laundering and other financial crimes.
o Retail Industry
Data mining has great application in the retail industry because it collects large amounts
of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer
retention and satisfaction.
o Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding. This is why
data mining has become very important in helping to understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving
quality of service.
o Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the
availability of network resources. In this world of connectivity, security has become the
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration. Here is the list of areas in which data mining
technology may be applied for intrusion detection:
▪ Development of data mining algorithms for intrusion detection.
▪ Analysis of stream data.
▪ Distributed data mining.
▪ Visualization and query tools.
o Medicine

▪ Data mining makes it possible to characterize patient activities and to forecast
incoming office visits.
▪ Data mining helps identify the patterns of successful medical therapies for
different illnesses.

1.2 DATA MINING STAGES (KNOWLEDGE DISCOVERY IN DATABASES (KDD))

The KDD process is iterative and consists of the following steps:


o Data cleaning: also known as data cleansing, it is a phase in which noisy data and
irrelevant data are removed from the collection.
o Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
o Data selection: at this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
o Data transformation: also known as data consolidation, it is a phase in which the
selected data is transformed into forms appropriate for the mining procedure.
o Data mining: it is the crucial step in which intelligent techniques are applied to extract
potentially useful patterns.
o Pattern evaluation: in this step, truly interesting patterns representing knowledge are
identified based on given measures.
o Knowledge representation: is the final phase in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to help
users understand and interpret the data mining results.
o It is common to combine some of these steps together. For instance, data cleaning and data
integration can be performed together as a pre-processing phase to generate a data warehouse.
o Data selection and data transformation can also be combined where the consolidation of the data
is the result of the selection, or, as for the case of data warehouses, the selection is done on
transformed data.
o The KDD is an iterative process. Once the discovered knowledge is presented to the user, the
evaluation measures can be enhanced, the mining can be further refined, new data can be selected
or further transformed, or new data sources can be integrated, in order to get different, more
appropriate results.
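A minimal pandas sketch of the pre-mining steps, assuming two hypothetical CSV sources (all file and column names here are invented for illustration):

```python
import pandas as pd

# Data integration: combine two (hypothetical) heterogeneous sources.
sales = pd.read_csv("sales.csv")          # e.g., a transactional system
customers = pd.read_csv("customers.csv")  # e.g., a CRM export
data = sales.merge(customers, on="customer_id")

# Data cleaning: remove duplicates and rows missing a key attribute.
data = data.drop_duplicates().dropna(subset=["amount"])

# Data selection: keep only the attributes relevant to the analysis.
data = data[["customer_id", "amount", "region"]]

# Data transformation/consolidation: aggregate to the mining granularity.
summary = data.groupby("region")["amount"].agg(["count", "sum", "mean"])

# Data mining, pattern evaluation, and knowledge presentation would
# follow, e.g., clustering the regional summaries and visualizing them.
print(summary)
```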
1.3 DATA MINING MODEL

o A mining model is created by applying an algorithm to data, but it is more than an algorithm or a
metadata container: it is a set of data, statistics, and patterns that can be applied to new data to
generate predictions and make inferences about relationships.
o Data Mining Model Architecture
o A data mining model gets data from a mining structure and then analyzes that data by
using a data mining algorithm. The mining structure and mining model are separate
objects.
o The mining structure stores information that defines the data source. A mining model
stores information derived from statistical processing of the data, such as the patterns
found as a result of analysis.
o A mining model is empty until the data provided by the mining structure has been
processed and analyzed. After a mining model has been processed, it contains metadata,
results, and bindings back to the mining structure.

o The metadata specifies the name of the model and the server where it is stored, as well
as a definition of the model, including the columns from the mining structure that were
used in building the model, the definitions of any filters that were applied when
processing the model, and the algorithm that was used to analyze the data.
o For example, the same data can be used to create multiple models, using perhaps a clustering
algorithm, a decision tree algorithm, and a Naïve Bayes algorithm. Each model type creates
a different set of patterns, item sets, rules, or formulas, which you can use for making
predictions. Generally, each algorithm analyzes the data in a different way, so
the content of the resulting model is also organized in a different structure, such as
clusters, trees, branches, etc.
o A model is also affected by the data that you train it on: even models trained on the same
mining structure can yield different results if you filter the data differently.
o The model does contain a set of bindings, which point back to the data cached in the
mining structure. If the data has been cached in the structure and has not been cleared
after processing, these bindings enable you to drill through from the results to the cases
that support the results. However, the actual data is stored in the structure cache, not in
the model.

o Defining Data Mining Models

A data mining model can be defined by following these general steps:


o Create the underlying mining structure and include the columns of data that might be
needed.
o Select the algorithm that is best suited to the analytical task.
o Choose the columns from the structure to use in the model, and specify how they should
be used: which column contains the outcome you want to predict, which columns are for
input only, and so forth.
o Optionally, set parameters to fine-tune the processing by the algorithm.
o Populate the model with data by processing the structure and model.
o Each mining model contains two special properties:
o Algorithm property: Specifies the algorithm that is used to create the model. The
algorithms that are available depend on the provider that you are using.
o Usage property: Defines how each column is used by the model. You can define
the column usage as Input, Predict, Predict Only, or Key. The Usage property
applies to individual mining model columns and must be set individually for
every column that is included in a model. If the structure contains a column that
you do not use in the model, the usage is set to Ignore.
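The steps and properties above are product-neutral; as a rough analogy only, the sketch below maps the Algorithm and Usage ideas onto plain Python objects with scikit-learn (the column names and roles are invented):

```python
from sklearn.tree import DecisionTreeClassifier

# Mining-structure analogue: the columns that might be needed.
columns = ["age", "income", "region", "buys_computer"]

# Usage-property analogue: the role of each column in this model.
usage = {"age": "Input", "income": "Input",
         "region": "Ignore", "buys_computer": "Predict"}

inputs = [c for c, u in usage.items() if u == "Input"]
target = next(c for c, u in usage.items() if u == "Predict")

# Algorithm-property analogue, with a parameter set to fine-tune it.
model = DecisionTreeClassifier(max_depth=4)

# Populating the model corresponds to fitting it on training data:
# model.fit(train[inputs], train[target])
```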

o Types of Data Mining Models


o There are two main types of data mining models: predictive and
descriptive.
o The descriptive model recognizes the patterns or relationships in data and
discovers the properties of the data studied; examples include clustering,
summarization, association rules, and sequence discovery. Clustering is like
classification, except that the groups are not predefined but are defined by
the data alone. It is also referred to as unsupervised learning or
segmentation: the partitioning or splitting up of the data into groups or
clusters. The clusters are defined by studying the behavior of the data with
help from domain experts. The term segmentation is used in a very specific
context here; it is the process of partitioning a database into disjoint
groupings of similar tuples.
o Predictive modeling has data modeling as a prerequisite when making
authoritative predictions about the future using business forecasting and
simulation. It addresses the questions "what will happen?" and "why will it
happen?". Predictive analytics "uses statistical techniques, machine
learning, and data mining to discover facts in order to make predictions about
unknown future events." The predictive model makes forecasts about
unknown data values by using the known values; examples include
classification, regression, time-series analysis, and prediction.

1.4 DATA WAREHOUSING (DWH) AND ON-LINE ANALYTICAL PROCESSING (OLAP)

● Data warehouses generalize and consolidate data in multidimensional space. The construction
of data warehouses involves data cleaning, data integration, and data transformation, and can be
viewed as an important preprocessing step for data mining.
● Data warehouses provide online analytical processing (OLAP) tools for the interactive analysis
of multidimensional data of varied granularities, which facilitates effective data generalization
and data mining.

● Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
● Data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases. Data warehouse systems allow for integration of a variety of
application systems. They support information processing by providing a solid platform of
consolidated historic data for analysis.
● According to William H. Inmon “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s decision making process”.
● The four keywords—subject-oriented, integrated, time-variant, and nonvolatile—distinguish data
warehouses from other data repository systems, such as relational database systems, transaction
processing systems, and file systems.
o Subject-oriented: A data warehouse is organized around major subjects such as
customer, supplier, product, and sales. Rather than concentrating on the day-to-day
operations and transaction processing of an organization, a data warehouse focuses on the
modeling and analysis of data for decision makers. Hence, data warehouses typically
provide a simple and concise view of particular subject issues by excluding data that are
not useful in the decision support process.
o Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and online transaction
records. Data cleaning and data integration techniques are applied to ensure consistency
in naming conventions, encoding structures, attribute measures, and so on.
o Time-variant: Data are stored to provide information from an historic perspective (e.g.,
the past 5–10 years). Every key structure in the data warehouse contains, either implicitly
or explicitly, a time element.
o Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms. It usually requires only two operations in data accessing: initial
loading of data and access of data.
● DWH stores the information an enterprise needs to make strategic decisions. A data warehouse is
also often viewed as an architecture, constructed by integrating data from multiple heterogeneous
sources to support structured and/or ad hoc queries, analytical reporting, and decision making.
● Data warehousing as the process of constructing and using data warehouses. The construction of
a data warehouse requires data cleaning, data integration, and data consolidation. The utilization
of a data warehouse often necessitates a collection of decision support technologies. This allows
“knowledge workers” (e.g., managers, analysts, and executives) to use the warehouse to quickly
and conveniently obtain an overview of the data, and to make sound decisions based on
information in the warehouse.

● Data warehousing is also very useful from the point of view of heterogeneous database
integration. Organizations typically collect diverse kinds of data and maintain large databases
from multiple, heterogeneous, autonomous, and distributed information sources.
● The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed
to a client site, a metadata dictionary is used to translate the query into queries appropriate for the
individual heterogeneous sites involved. These queries are then mapped and sent to local query
processors. The results returned from the different sites are integrated into a global answer set.
This query-driven approach requires complex information filtering and integration processes,
and competes with local sites for processing resources. It is inefficient and potentially expensive
for frequent queries, especially queries requiring aggregations.
● An alternative to this traditional approach is the update-driven approach, in which information
from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct
querying and analysis.

DIFFERENCE BETWEEN OPERATIONAL DATABASE SYSTEMS AND DATA WAREHOUSES

● The major task of online operational database systems is to perform online transaction and query
processing. These systems are called online transaction processing (OLTP) systems. They
cover most of the day-to-day operations of an organization such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting.
● Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data
analysis and decision making. Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users. These systems are known as online
analytical processing (OLAP) systems.
● The major distinguishing features of OLTP and OLAP are summarized as follows:
o Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
o Data contents: An OLTP system manages current data that, typically, are too detailed to
be easily used for decision making. An OLAP system manages large amounts of historic
data, provides facilities for summarization and aggregation, and stores and manages
information at different levels of granularity. These features make the data easier to use
for informed decision making.
o Database design: An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design. An OLAP system typically adopts
either a star or a snowflake model and a subject-oriented database design.
o View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema, due to the
evolutionary process of an organization. OLAP systems also deal with information that
originates from different organizations, integrating information from many data stores.
Because of their huge volume, OLAP data are stored on multiple storage media.
o Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only operations (because most data
warehouses store historic rather than up-to-date information), although many could be
complex queries.

Need for Data Warehousing
1. The data warehouse market supports such diverse industries as manufacturing, retail,
telecommunications, and health care. Think of a personnel database for a company that is
continually modified as personnel are added and deleted. Suppose management wishes to
determine whether there is a problem with too many employees quitting. To analyze this
problem, they would need to know which employees have left, when they left, why they
left, and other information about their employment. For management to make these types
of high-level business analyses, more historical data, not just the current snapshot, are
required.
A data warehouse is a data repository used to support decision support systems.
2. The basic motivation is to increase business profitability. Traditional data processing
applications support the day-to-day clerical and administrative decisions, while data
warehousing supports long-term strategic decisions.
3. For increasing customer focus, which includes the analysis of customer buying patterns
(such as buying preference, buying time, budget cycles, and appetites for spending)
4. For repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic regions in order to fine tune
production strategies; analyzing operations and looking for sources of profit.
5. For managing the customer relationships, making environmental corrections, and
managing the cost of corporate assets.
6. A simple view of a data warehousing system has three basic components: data migration,
the warehouse, and access tools. The data are extracted from operational systems, but
must be reformatted, cleansed, integrated, and summarized before being placed in the
warehouse.



Challenges for Data Warehousing
1. Unwanted data must be removed.
2. Converting heterogeneous sources into one common schema. This problem is the same as that
found when accessing data from multiple heterogeneous sources. Each operational database may
contain the same data with different attribute names. For example, one system may use "Employee
ID," while another uses "EID" for the same attribute. In addition, there may be multiple data types
for the same attribute (a small reconciliation sketch appears after this list).
3. As the operational data is probably a snapshot of the data, multiple snapshots may need to be
merged to create the historical view.
4. Summarizing data is performed to provide a higher level view of the data. This summarization
may be done at multiple granularities and for different dimensions.
5. New derived data (e.g., using age rather than birth date) may be added to better facilitate decision
support functions.
6. Handling missing and erroneous data must be performed. This could entail replacing them with
predicted or default values or simply removing these entries. The portion of the transformation that
deals with ensuring valid and consistent data is sometimes referred to as data scrubbing or data
staging.
7. Data warehouse queries are often complex. They involve the computation of large groups of data
at summarized levels, and may require the use of special data organization, access, and
implementation methods based on multidimensional views.
8. Data Quality – In a data warehouse, data is coming from many disparate sources from all facets
of an organization. When a data warehouse tries to combine inconsistent data from disparate
sources, it encounters errors. Inconsistent data, duplicates, logic conflicts, and missing data all
result in data quality challenges. Poor data quality undermines the reporting and analytics
necessary for optimal decision making.
9. Understanding Analytics – When building a data warehouse, analytics and reporting will have to
be taken into account in the design. In order to do this, the business user will need to know
exactly what analysis will be performed.
10. Quality Assurance – The end user of a data warehouse is using Big Data reporting and analytics
to make the best decisions possible. Consequently, the data must be 100 percent accurate or a
credit union leader could make ill-advised decisions that are detrimental to the future success of
their business. This high reliance on data quality makes testing a high priority issue that will
require a lot of resources to ensure the information provided is accurate.
11. Performance – Building a data warehouse is similar to building a car. A car must be carefully
designed from the beginning to meet the purposes for which it is intended. Yet, there are options
each buyer must consider to make the vehicle truly meet individual performance needs. A data
warehouse must also be carefully designed to meet overall performance requirements. While the
final product can be customized to fit the performance needs of the organization, the initial overall
design must be carefully thought out to provide a stable foundation from which to start.
12. Designing the Data Warehouse – People generally don’t want to “waste” their time defining the
requirements necessary for proper data warehouse design. Usually, there is a high level perception
of what they want out of a data warehouse. However, they don’t fully understand all the
implications of these perceptions and, therefore, have a difficult time adequately defining them.
This results in miscommunication between the business users and the technicians building the data
warehouse. The typical end result is a data warehouse which does not deliver the results expected
by the user. Since the data warehouse is inadequate for the end user, there is a need for fixes and
improvements immediately after initial delivery.



13. User Acceptance – People are not keen on changing their daily routine, especially if the new
process is not intuitive. There are many challenges to overcome to make a data warehouse that is
quickly adopted by an organization.
14. Cost – A frequent misconception among credit unions is that they can build a data warehouse in-
house to save money. The harsh reality is that an effective do-it-yourself effort is very costly.
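As a small illustration of challenge 2 above, the following pandas sketch reconciles two toy sources into one common schema (the table and column names are invented):

```python
import pandas as pd

# Two hypothetical operational sources naming the same attribute differently.
hr = pd.DataFrame({"Employee ID": ["1", "2"], "name": ["Ann", "Bo"]})
payroll = pd.DataFrame({"EID": [1, 2], "salary": [50000, 62000]})

# Map both sources onto one common schema.
hr = hr.rename(columns={"Employee ID": "emp_id"})
payroll = payroll.rename(columns={"EID": "emp_id"})
hr["emp_id"] = hr["emp_id"].astype(int)  # reconcile the data types too

warehouse = hr.merge(payroll, on="emp_id")
print(warehouse)
```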



Data mining
We live in a world where vast amounts of data are collected daily, and analyzing such data is
an important need. "We are living in the information age" is a popular saying; however, we are
actually living in the data age, with terabytes or petabytes of data pouring into our computer
networks, the World Wide Web (WWW), and various data storage devices every day from business,
society, science and engineering, medicine, and almost every other aspect of daily life.
What Is Data Mining?
Data mining refers to extracting or mining knowledge from large amounts of data. Mining is a
vivid term characterizing the process that finds a small set of precious nuggets from a great deal
of raw material.
STEPS IN KNOWLEDGE DISCOVERY FROM DATA (KDD)
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in the
process of knowledge discovery. The terms knowledge discovery in databases (KDD) and data
mining are often used interchangeably.



Over the last few years KDD has been used to refer to a process consisting of many steps, while
data mining is only one of these steps.
Knowledge discovery in databases (KDD) is the process of finding useful information and
patterns in data. Data mining is the use of algorithms to extract the information and patterns
derived by the KDD process
Data mining stages:

1. Data cleaning (to remove noise and inconsistent data)


2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base. The
preceding view shows data mining as one step in the knowledge discovery process, albeit an
essential one because it uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting patterns and knowledge from large amounts
of data. The data sources can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
Data mining applications:
1. Classification: Eg: In a loan database, to classify an applicant as a prospective customer
or a defaulter, given his various personal and demographic features along with previous
purchase characteristics.
2. Estimation: Predict the attribute of a data instance. Eg: estimate the percentage of marks
of a student, whose previous marks are already known.
3. Prediction: Predictive model predicts a future outcome rather than the current behaviour.
Eg: Predict next week’s closing price for the Google share price per unit.
4. Market basket analysis (association rule mining)
Analyses hidden rules, called association rules, in a large transactional database.
{pen, pencil} -> {book} – whenever a pen and pencil are purchased together, a book is
also purchased.
5. Clustering: Classification into different classes based on some similarities, where the target
classes are unknown.
6. Business intelligence
7. Business data analytics
8. Bioinformatics
9. Web mining
10. Text mining
11. Social network data analysis

Data mining Functionalities
1. Class/Concept Description: Characterization and Discrimination: Data characterization
is a summarization of the general characteristics or features of a target class of data. The data
corresponding to the user-specified class are typically collected by a query. For example, to study
the characteristics of software products with sales that increased by 10% in the previous year, the
data related to such products can be collected by executing an SQL query on the sales database.
Data discrimination is a comparison of the general features of the target class data objects
against the general features of objects from one or multiple contrasting classes. The target and
contrasting classes can be specified by a user, and the corresponding data objects can be retrieved
through database queries. For example, a user may want to compare the general features of
software products with sales that increased by 10% last year against those with sales that
decreased by at least 30% during the same period.
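A small pandas sketch of characterization versus discrimination on an invented sales table:

```python
import pandas as pd

# Hypothetical software-sales table; all values are invented.
df = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "growth_pct": [12, -35, 15, -40],
    "price": [400, 90, 250, 60],
})

# Characterization: summarize the target class (sales up by >= 10%).
target = df[df["growth_pct"] >= 10]
print(target.describe())

# Discrimination: compare against the contrasting class (sales down >= 30%).
contrast = df[df["growth_pct"] <= -30]
print(target["price"].mean(), contrast["price"].mean())
```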

2. Mining Frequent Patterns, Associations, and Correlations: Frequent patterns, as the name
suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns,
including frequent itemsets, frequent subsequences (also known as sequential patterns), and
frequent substructures. A frequent itemset typically refers to a set of items that often appear
together in a transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers. A frequently occurring subsequence, such as the
pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a
memory card, is a (frequent) sequential pattern. A substructure can refer to different structural
forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent
patterns leads to the discovery of interesting associations and correlations within data.
Association analysis:

A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50%
chance that she will buy software as well. A 1% support means that 1% of all the transactions
under analysis show that computer and software are purchased together. This association rule
involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a
single predicate are referred to as single-dimensional association rules. Dropping the predicate
notation, the rule can be written simply as "computer => software [1%, 50%]".
Adopting the terminology used in multidimensional databases, where each attribute is referred to
as a dimension, a rule that involves more than one attribute or predicate (e.g., age, income, and
buys) is referred to as a multidimensional association rule.
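A minimal sketch of computing support and confidence from a toy transaction set (the items are invented and no particular mining library is assumed):

```python
from collections import Counter
from itertools import combinations

# Toy transactional data set.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]
n = len(transactions)

# Count every single item and every item pair.
counts = Counter()
for t in transactions:
    for r in (1, 2):
        for itemset in combinations(sorted(t), r):
            counts[itemset] += 1

# support(X => Y) = P(X and Y); confidence(X => Y) = P(Y | X).
support = counts[("computer", "software")] / n
confidence = counts[("computer", "software")] / counts[("computer",)]
print(f"computer => software [support={support:.0%}, confidence={confidence:.0%}]")
```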

3. Classification and Regression for Predictive Analysis:


Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts. The model is derived based on the analysis of a set of training data
(i.e., data objects for which the class labels are known). The model is used to predict the class
label of objects for which the class label is unknown.
A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions.



A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision
tree, or (c) a neural network. A neural network, when used for classification, is typically a
collection of neuron-like processing units with weighted connections between the units. There are
many other methods for constructing classification models, such as naïve Bayesian
classification, support vector machines, and k-nearest-neighbor classification. Whereas
classification predicts categorical (discrete, unordered) labels, regression models continuous-
valued functions. That is, regression is used to predict missing or unavailable numerical data
values rather than (discrete) class labels. The term prediction refers to both numeric prediction
and class label prediction.
Regression analysis is a statistical methodology that is most often used for numeric prediction,
although other methods exist as well. Regression also encompasses the identification of
distribution trends based on the available data.
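An illustrative decision-tree sketch, assuming scikit-learn is available (the toy attributes and labels are invented):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: [age, income] -> buys a computer (1) or not (0).
X = [[25, 30], [40, 80], [35, 60], [22, 20], [50, 90], [30, 25]]
y = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned flowchart-like structure: nodes test attribute values,
# branches are test outcomes, and leaves are class labels.
print(export_text(model, feature_names=["age", "income"]))

# Predict the class label of an object whose label is unknown.
print(model.predict([[28, 70]]))
```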
4. Clustering: Unlike classification and regression, which analyze class-labeled (training) data
sets, clustering analyzes data objects without consulting class labels. In many cases, class-labeled
data may simply not exist at the beginning. Clustering can be used to generate class labels for a
group of data. The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are
formed so that objects within a cluster have high similarity in comparison to one another, but are
rather dissimilar to objects in other clusters. Each cluster so formed can be viewed as a class of
objects, from which rules can be derived. Clustering can also facilitate taxonomy formation,
that is, the organization of observations into a hierarchy of classes that group similar events
together.
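A minimal clustering sketch using k-means from scikit-learn on invented, unlabeled points:

```python
from sklearn.cluster import KMeans

# Unlabeled 2-D points; the algorithm must discover the groups itself.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# k-means maximizes intra-cluster similarity by minimizing each point's
# distance to its cluster centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # discovered cluster assignments
print(km.cluster_centers_)  # one representative center per cluster
```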

5. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions. However, in some applications (e.g., fraud detection) the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier analysis or anomaly mining.
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
unusually large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the locations and types
of purchase, or the purchase frequency.
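A toy sketch of the credit-card idea: flag charges far from an account's usual range (the amounts are invented, and a real system would use far more robust statistics):

```python
import statistics

# Hypothetical charge amounts for one account; the last one is unusual.
charges = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 980.0]

mu = statistics.mean(charges)
sigma = statistics.stdev(charges)

# Flag charges more than two standard deviations from the account mean.
outliers = [c for c in charges if abs(c - mu) > 2 * sigma]
print(outliers)  # -> [980.0]
```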
Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even millions of patterns, or
rules.
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data
with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if
it validates a hypothesis that the user sought to confirm. An interesting pattern represents
knowledge.
Several objective measures of pattern interestingness exist. These are based on the structure of
discovered patterns and the statistics underlying them. An objective measure for association
rules of the form X => Y is rule support, the percentage of transactions that contain both X and
Y; another is confidence, the conditional probability that a transaction containing X also
contains Y.
Technologies for data mining



Statistics:
Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes.
In other words, such statistical models can be the outcome of a data mining task. Alternatively,
data mining tasks can be built on top of statistical models. For example, we can use statistics to
model noise and missing data values. Then, when mining patterns in a large data set, the data
mining process can use the model to help identify and handle noisy or missing values in the data.
Statistics research develops tools for prediction and forecasting using data and statistical models.
Statistical methods can be used to summarize or describe a collection of data.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical hypothesis
testing. A statistical hypothesis test (sometimes called confirmatory data analysis) makes
statistical decisions using experimental data. A result is called statistically significant if it is
unlikely to have occurred by chance. If the classification or prediction model holds true, then the
descriptive statistics of the model increase the soundness of the model.
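A minimal sketch of such a test, assuming SciPy is available (the cross-validation scores are invented):

```python
from scipy import stats

# Hypothetical accuracy scores of two classifiers over six folds.
model_a = [0.81, 0.83, 0.79, 0.82, 0.80, 0.84]
model_b = [0.74, 0.76, 0.73, 0.77, 0.75, 0.72]

# Two-sample t-test: is the difference unlikely to be due to chance?
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(t_stat, p_value)  # a small p-value suggests statistical significance
```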
Machine learning: investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. For example, a typical machine
learning problem is to program a computer so that it can automatically recognize handwritten
postal codes on mail after learning from a set of examples. Machine learning is a fast-growing
discipline. Here, we illustrate classic problems in machine learning that are highly related to data
mining.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which supervise the learning of the
classification model.
Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. Typically, we may use clustering to
discover classes within the data. For example, an unsupervised learning method can take, as
input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These
clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the
training data are not labeled, the learned model cannot tell us the semantic meaning of the
clusters found.
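A minimal sketch of the handwritten-digit example, using scikit-learn's bundled digits dataset and clustering the images without their labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()  # 8x8 images of handwritten digits

# Unsupervised learning: find 10 clusters without using the labels.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# The model finds 10 groups but cannot name them "0".."9" on its own;
# only the withheld labels reveal each cluster's semantic meaning.
print(km.labels_[:10])
print(digits.target[:10])
```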
Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples are
used to learn class models and unlabeled examples are used to refine the boundaries between
classes. For a two-class problem, we can think of the set of examples belonging to one class as
the positive examples and those belonging to the other class as the negative examples. In the
accompanying figure (omitted here), if we do not consider the unlabeled examples, the dashed
line is the decision boundary that best partitions the positive examples from the negative
examples. Using the unlabeled examples, we can refine the decision boundary to the solid line.
Moreover, we can detect that the two positive examples at the top right corner, though labeled,
are likely noise or outliers.
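A toy sketch of one semi-supervised approach, scikit-learn's self-training wrapper (the data are invented; by convention, unlabeled examples are marked with -1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Two-class toy problem: only 4 of 8 examples carry labels.
X = np.array([[0.0], [0.2], [1.0], [0.9], [0.1], [0.3], [0.8], [1.1]])
y = np.array([0, 0, 1, 1, -1, -1, -1, -1])  # -1 marks unlabeled examples

# Labeled examples learn an initial boundary; the unlabeled ones are
# gradually pseudo-labeled and used to refine it.
clf = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[0.15], [0.95]]))
```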



Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabeled examples or synthesized by the learning program. The goal
is to optimize the model quality by actively acquiring knowledge from human users, given a
constraint on how many examples they can be asked to label.
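A minimal uncertainty-sampling sketch, one common active-learning strategy (the data are invented): the program asks the expert to label the unlabeled example it is least confident about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A tiny labeled pool and an unlabeled pool to query from.
X_lab = np.array([[0.0], [1.0]])
y_lab = np.array([0, 1])
X_unl = np.array([[0.1], [0.5], [0.9]])

clf = LogisticRegression().fit(X_lab, y_lab)

# Query the example whose predicted probability is closest to 0.5,
# i.e., the one the current model is least sure about.
proba = clf.predict_proba(X_unl)[:, 1]
query = int(np.argmin(np.abs(proba - 0.5)))
print("ask the expert to label:", X_unl[query])
```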

Major issues in data mining


1. Mining methodology and user interaction issues
● Mining different kinds of knowledge in databases:
Because different users can be interested in different kinds of knowledge, data mining should
cover a wide spectrum of data analysis and knowledge discovery task. These tasks may use the
same database in different ways and require the development of numerous data mining
techniques.
● Interactive mining of knowledge at multiple levels of abstraction:
Interactive mining allows users to focus the search for patterns, providing and refining data
mining requests based on returned results. The user can interact with the data mining system to
view data and discovered patterns at multiple granularities and from different angles.
● Incorporation of background knowledge:
Background knowledge, or information regarding the domain under study, may be used to
guide the discovery process and allow discovered patterns to be expressed in concise terms and
at different levels of abstraction.

● Pattern evaluation—the interestingness problem:
A data mining system can uncover thousands of patterns. Several challenges remain
regarding the development of techniques to assess the interestingness of discovered patterns,
particularly with regard to subjective measures that estimate the value of patterns with respect to
a given user class, based on user beliefs or expectations. The use of interestingness measures or
user-specified constraints to guide the discovery process and reduce the search space is another
active area of research.

2. Performance issues
● Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable. The running time of a data mining algorithm must be
predictable and acceptable in large databases.
● Parallel, distributed, and incremental mining algorithms:
The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development of parallel and
distributed data mining algorithms. Such algorithms divide the data into partitions, which are
processed in parallel. The results from the partitions are then merged.
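A toy partition-and-merge sketch using Python's multiprocessing; the "mining" step here is just item counting, standing in for a real algorithm:

```python
from collections import Counter
from multiprocessing import Pool

def count_items(partition):
    # Mine one partition independently (here: simple item counting).
    c = Counter()
    for transaction in partition:
        c.update(transaction)
    return c

if __name__ == "__main__":
    data = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]] * 1000
    partitions = [data[i::4] for i in range(4)]      # divide into partitions
    with Pool(4) as pool:
        partial = pool.map(count_items, partitions)  # process in parallel
    merged = sum(partial, Counter())                 # merge the results
    print(merged.most_common(3))
```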
3. Issues relating to the diversity of database types:
● Handling of relational and complex types of data:
Because relational databases and data warehouses are widely used, the development of
efficient and effective data mining systems for such data is important. However, other databases
may contain complex data objects, hypertext and multimedia data, spatial data, temporal data, or
transaction data. Specific data mining systems should be constructed for mining specific kinds of
data.
● Mining information from heterogeneous databases and global information systems:
Local- and wide-area computer networks (such as the Internet) connect many sources of
data, forming huge, distributed, and heterogeneous databases. The discovery of
knowledge from different sources of structured, semistructured, or unstructured data with
diverse data semantics poses great challenges to data mining.

Data Warehouse: Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions. Data
warehouse refers to a database that is maintained separately from an organization’s operational
databases. A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making process”

1. Subject-oriented: A data warehouse is organized around major subjects, such as


customer, supplier, product, and sales.
A data warehouse focuses on the modelling and analysis of data for decision makers(not on day
to day transaction).
Provide a simple and concise view around particular subject issues by excluding data that are not
useful in the decision support process.
2. Integrated: data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and on-line transaction records.
3. Time-variant: Data are stored to provide information from a historical perspective
Every key structure in the data warehouse contains, either implicitly or explicitly, an
element of time.
4. Non-volatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
10

For more visit www.ktunotes.in


data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms. It usually requires only two operations in data accessing: initial
loading of data and access of data.

Data warehousing is the process of constructing and using data warehouses.


 The construction of a data warehouse requires data cleaning, data integration, and
data consolidation.
 The utilization of a data warehouse often necessitates a collection of decision
support technologies. This allows “knowledge workers” (e.g., managers, analysts,
and executives) to use the warehouse to quickly and conveniently obtain an
overview of the data, and to make sound decisions based on information in the
warehouse.
Data warehousing is very useful from the point of view of heterogeneous database integration.
The traditional database approach to heterogeneous database integration was a ‘query- driven’
approach data warehousing employs an update-driven approach in which information from
multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct
querying and analysis.

Difference between Operational Database systems and Data Warehouse

 Operational Database systems


 Main task is to perform on-line transaction and query processing. These systems
are called on-line transaction processing (OLTP) systems.
 They cover most of the day-to-day operations of an organization, such as
purchasing, inventory, manufacturing, banking, payroll, registration, and
accounting.
 Data Warehouse
 serve users or knowledge workers in the role of data analysis and decision
making.
 Such systems can organize and present data in various formats in order to
accommodate the diverse needs of the different users. These systems are known
as on-line analytical processing (OLAP) systems.
Difference between OLTP and OLAP
 Users and system orientation:
 OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology professionals.
 OLAP system is market-oriented and is used for data analysis by knowledge
workers, including managers, executives, and analysts.
 Data contents:
 OLTP system manages current data
 OLAP system manages large amounts of historical data, provides facilities for
summarization and aggregation, and stores and manages information at different
levels of granularity.
 Database design:

11

For more visit www.ktunotes.in


 An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design.
 An OLAP system typically adopts either a star or snowflake model and a
subjectoriented database design.
 View:

 An OLTP system focuses mainly on the current data within an enterprise or


department, without referring to historical data or data in different organizations.
 An OLAP system often spans multiple versions of a database schema, due to the
evolutionary process of an organization.
 OLAP systems also deal with information that originates from different
organizations.
 OLAP data are stored on multiple storage media.
 Access patterns:
 The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery
mechanisms.
Accesses to OLAP systems are mostly read-only operations although many could be complex
queries.

12

For more visit www.ktunotes.in


Need for Data Warehousing
1. The data ware house market supports such diverse industries as manufacturing, retail,
telecommunications, and health care. Think of a personnel database for a company that is
continually modified as personnel are added and deleted... If management wishes
determine if there is a problem with too many employees quitting. To analyze this
problem, they would need to know which employees have left, when they left, why they
left, and other information about their employment. For management to make these types
of high-level business analyses, more historical data not just the current snapshot are
required.
A data warehouse is a data repository used to support decision support systems

13

For more visit www.ktunotes.in


2. The basic motivation is to increase business profitability. Traditional data processing
   applications support day-to-day clerical and administrative decisions, while data
   warehousing supports long-term strategic decisions.
3. For increasing customer focus, which includes the analysis of customer buying patterns
   (such as buying preference, buying time, budget cycles, and appetites for spending).
4. For repositioning products and managing product portfolios by comparing the
   performance of sales by quarter, by year, and by geographic region in order to fine-tune
   production strategies, analyzing operations, and looking for sources of profit.
5. For managing customer relationships, making environmental corrections, and managing
   the cost of corporate assets.
6. The basic components of a data warehousing system include data migration, the
   warehouse, and access tools. The data are extracted from operational systems, but must
   be reformatted, cleansed, integrated, and summarized before being placed in the
   warehouse.

   [Figure: a simple view of a data warehouse, showing data migration, the warehouse, and
   access tools]
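
The following is a minimal sketch, in Python with pandas, of the historical-view idea from
point 1: periodic snapshots of a hypothetical personnel table are retained in the warehouse
so that questions about who left, and when, remain answerable. All names and values below
are invented.

    import pandas as pd

    # Two hypothetical monthly snapshots of the operational personnel table.
    jan = pd.DataFrame({"emp_id": [1, 2, 3], "status": ["active"] * 3})
    feb = pd.DataFrame({"emp_id": [1, 3], "status": ["active", "active"]})

    jan["snapshot"] = "2024-01"
    feb["snapshot"] = "2024-02"

    # The warehouse keeps every snapshot, so analysts can see who left and when,
    # which a single current-state operational table cannot answer.
    history = pd.concat([jan, feb], ignore_index=True)
    left = set(jan["emp_id"]) - set(feb["emp_id"])
    print(history)
    print("Left between snapshots:", left)   # {2}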

Challenges for Data Warehousing


1. Unwanted data must be removed.
2. Converting heterogeneous sources into one common schema. This problem is the same as that
   found when accessing data from multiple heterogeneous sources. Each operational database may
   contain the same data under different attribute names. For example, one system may use "Employee
   ID" while another uses "EID" for the same attribute. In addition, there may be multiple data types
   for the same attribute (a minimal sketch of such schema unification follows this list).
3. As the operational data is probably a snapshot of the data, multiple snapshots may need to be
merged to create the historical view.
4. Summarizing data is performed to provide a higher level view of the data. This summarization
may be done at multiple granularities and for different dimensions.



5. New derived data (e.g., using age rather than birth date) may be added to better facilitate decision
support functions.
6. Missing and erroneous data must be handled. This could entail replacing them with
   predicted or default values or simply removing these entries. The portion of the transformation that
   deals with ensuring valid and consistent data is sometimes referred to as data scrubbing or data
   staging.
7. Data warehouse queries are often complex. They involve the computation of large groups of data
at summarized levels, and may require the use of special data organization, access, and
implementation methods based on multidimensional views.
8. Data Quality – In a data warehouse, data is coming from many disparate sources from all facets
of an organization. When a data warehouse tries to combine inconsistent data from disparate
sources, it encounters errors. Inconsistent data, duplicates, logic conflicts, and missing data all
result in data quality challenges. Poor data quality results in faulty reporting and analytics
necessary for optimal decision making.
9. Understanding Analytics – When building a data warehouse, analytics and reporting will have to
be taken into design considerations. In order to do this, the business user will need to know
exactly what analysis will be performed.
10. Quality Assurance – The end user of a data warehouse is using Big Data reporting and analytics
to make the best decisions possible. Consequently, the data must be 100 percent accurate or a
credit union leader could make ill-advised decisions that are detrimental to the future success of
their business. This high reliance on data quality makes testing a high priority issue that will
require a lot of resources to ensure the information provided is accurate.
11. Performance – Building a data warehouse is similar to building a car. A car must be carefully
designed from the beginning to meet the purposes for which it is intended. Yet, there are options
each buyer must consider to make the vehicle truly meet individual performance needs. A data
warehouse must also be carefully designed to meet overall performance requirements. While the
final product can be customized to fit the performance needs of the organization, the initial overall
design must be carefully thought out to provide a stable foundation from which to start.
12. Designing the Data Warehouse – People generally don’t want to “waste” their time defining the
requirements necessary for proper data warehouse design. Usually, there is a high level perception
of what they want out of a data warehouse. However, they don’t fully understand all the
implications of these perceptions and, therefore, have a difficult time adequately defining them.
This results in miscommunication between the business users and the technicians building the data
warehouse. The typical end result is a data warehouse which does not deliver the results expected
by the user. Since the data warehouse is inadequate for the end user, there is a need for fixes and
improvements immediately after initial delivery.
13. User Acceptance – People are not keen on changing their daily routine, especially if the new
    process is not intuitive. There are many challenges to overcome in making a data warehouse that
    is quickly adopted by an organization.
14. Cost – A frequent misconception among credit unions is that they can build a data warehouse
    in-house to save money. The harsh reality is that an effective do-it-yourself effort is very costly.
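
As a concrete illustration of challenges 2 and 6 above, the following minimal Python/pandas
sketch maps two hypothetical sources that name and type the same attribute differently
("Employee ID" versus "EID") onto one common schema, and then scrubs missing values with a
default. All names and values are invented for illustration.

    import pandas as pd

    source_a = pd.DataFrame({"Employee ID": [1, 2], "salary": [50000.0, None]})
    source_b = pd.DataFrame({"EID": ["3"], "salary": [62000.0]})   # note: string IDs

    # Reformat: map heterogeneous attribute names onto one common schema.
    common = [
        source_a.rename(columns={"Employee ID": "emp_id"}),
        source_b.rename(columns={"EID": "emp_id"}),
    ]

    # Integrate, then unify data types and remove duplicates.
    staged = pd.concat(common, ignore_index=True)
    staged["emp_id"] = staged["emp_id"].astype(int)
    staged = staged.drop_duplicates(subset="emp_id")

    # Scrubbing: replace missing salaries with a default (here, the mean).
    staged["salary"] = staged["salary"].fillna(staged["salary"].mean())
    print(staged)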

Applications of DWH

There are three kinds of data warehouse applications: information processing, analytical processing, and
data mining.



1) Information processing supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing
is to construct low-cost web-based accessing tools that are then integrated with web browsers.

2) Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down,
   roll-up, and pivoting. It generally operates on historical data in both summarized and detailed
   forms. The major strength of on-line analytical processing over information processing is the
   multidimensional analysis of data warehouse data (a minimal sketch follows this list).

3) Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the
mining results using visualization tools.
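
The following is a minimal sketch of the basic analytical-processing operations, using pandas
on a hypothetical sales cube; a pivot, a roll-up, and a slice are shown. This illustrates the
operations themselves, not the behaviour of any particular OLAP server.

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "amount":  [100, 120, 90, 130],
    })

    # Pivot: regions as rows, quarters as columns.
    cube = sales.pivot_table(values="amount", index="region",
                             columns="quarter", aggfunc="sum")

    rollup = cube.sum(axis=1)      # roll-up: aggregate away the quarter dimension
    slice_q1 = cube["Q1"]          # slice: fix quarter = Q1
    print(cube, rollup, slice_q1, sep="\n\n")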

Different areas are:

● Banking Industry
  o In the banking industry, concentration is given to risk management and policy reversal, as
    well as analyzing consumer data, market trends, government regulations and reports, and,
    more importantly, financial decision making.
  o Certain banking sectors utilize warehouses for market research, performance analysis of
    each product, interchange and exchange rates, and to develop marketing programs.
  o Banks also analyze cardholders' transactions, spending patterns, and merchant
    classification, all of which provide the bank with an opportunity to introduce special
    offers and lucrative deals based on cardholder activity.

● Finance Industry
  o Applications revolve around the evaluation and trends of customer expenses, which aids in
    maximizing the profits earned by their clients.

● Consumer Goods Industry
  o Warehouses are used for prediction of consumer trends, inventory management, and
    market and advertising research.
  o In-depth analysis of sales and production is also carried out.

● Government and Education
  o The federal government utilizes warehouses for research in compliance, whereas state
    governments use them for services related to human resources, such as recruitment, and
    to accounting, such as payroll management.
  o The government uses data warehouses to maintain and analyze tax records and health
    policy records, along with their respective providers.
  o Criminal law databases are connected to the state's data warehouse. Criminal activity is
    predicted from the patterns and trends that result from the analysis of historical data
    associated with past criminals.
  o Universities use warehouses for extracting information used in proposals for research
    grants, understanding their student demographics, and human resource management.

● Healthcare
  o All financial, clinical, and employee records are fed to warehouses, as this helps
    healthcare providers strategize and predict outcomes, track and analyze service feedback,
    generate patient reports, and share data with tie-in insurance companies, medical aid
    services, etc.

● Hospitality Industry
  o A major proportion of this industry is dominated by hotel and restaurant services, car
    rental services, and holiday home services.
  o They utilize warehouse services to design and evaluate their advertising and promotion
    campaigns, targeting customers based on their feedback and travel patterns.

● Insurance
  o Warehouses are primarily used to analyze data patterns and customer trends, apart from
    maintaining records of existing participants.
  o The design of tailor-made customer offers and promotions is also possible through
    warehouses.

● Manufacturing and Distribution Industry
  o A manufacturing organization has to take several make-or-buy decisions that can
    influence the future of the sector, which is why such organizations utilize high-end OLAP
    tools as part of data warehouses to predict market changes, analyze current business
    trends, detect warning conditions, view marketing developments, and ultimately take
    better decisions.
  o They also use warehouses for product shipment records and records of product
    portfolios, to identify profitable product lines, and to analyze previous data and customer
    feedback in order to evaluate and eliminate the weaker product lines.
  o For distribution, the supply chain management of products operates through data
    warehouses.

● The Retailers
  o Retailers serve as middlemen between producers and consumers.
  o They use warehouses to track items, their advertising promotions, and consumer buying
    trends.
  o They also analyze sales to determine fast-selling and slow-selling product lines and
    determine their shelf space through a process of elimination.

● Services Sector
  o Data warehouses find use in the service sector for maintenance of financial records,
    revenue patterns, customer profiling, resource management, and human resources.

● Telephone Industry
  o The telephone industry operates over both offline and online data, burdening it with a
    lot of historical data that has to be consolidated and integrated.
  o Analysis of fixed assets, analysis of customers' calling patterns for sales representatives
    to push advertising campaigns, and tracking of customer queries all require the facilities
    of a data warehouse.

● Transportation Industry
  o In the transportation industry, data warehouses record customer data, enabling traders
    to experiment with target marketing, where marketing campaigns are designed with
    customer requirements in mind.
  o They are also used to analyze customer feedback and performance, manage crews on
    board, and analyze customer financial reports for pricing strategies.



DATA MINING AND WAREHOUSING
CONCEPTS

● Data mining: extracting or mining knowledge from large amounts of data.
● The amount of data kept in computer files and databases is growing at a phenomenal
  rate, and the users of these data expect more sophisticated information from them.
● Sometimes summarized as "knowledge from mining"; also known as knowledge
  discovery from data (KDD).
DATA MINING APPLICATIONS

1. Classification
   E.g.: in a loan database, classify an applicant as prospective or a defaulter, given
   various personal and demographic features along with previous purchase characteristics.
2. Estimation
   Predict an attribute of a data instance. E.g.: estimate the percentage of marks of a
   student whose previous marks are already known.
3. Prediction
   A predictive model predicts a future outcome rather than current behaviour. E.g.:
   predict next week's closing price for the Google share.
4. Market basket analysis (association rule mining)
   Analyses hidden rules, called association rules, in a large transactional database.
   {pen, pencil -> book}: whenever a pen and a pencil are purchased together, a book is
   also purchased.
5. Clustering
   Classification into different classes based on similarities, where the target classes
   are unknown.

Areas where data mining is applied: business intelligence, business data analytics, web
mining, text mining, social network data analysis, and bioinformatics.
DATA MINING AS A PROCESS IN KNOWLEDGE DISCOVERY

[Figure: the knowledge discovery from data (KDD) process, whose stages are listed below]
DATA MINING STAGES

1. Data cleaning: to remove noise and inconsistent data.
2. Data integration: where multiple data sources may be combined.
3. Data selection: where data relevant to the analysis task are retrieved from the database.
4. Data transformation: where data are transformed or consolidated into forms appropriate
   for mining by performing summary or aggregation operations.
5. Data mining: an essential process where intelligent methods are applied in order to
   extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge,
   based on some interestingness measures.
7. Knowledge presentation: where visualization and knowledge representation techniques
   are used to present the mined knowledge to the user (a minimal sketch of the whole
   pipeline follows).
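
The following is a minimal, hypothetical sketch of these stages as composable Python
functions; each function body is only a stand-in for the real stage, and the records are
invented for illustration.

    def clean(records):        # data cleaning: drop noisy/incomplete records
        return [r for r in records if r.get("amount") is not None]

    def select(records):       # data selection: keep task-relevant attributes
        return [{"item": r["item"], "amount": r["amount"]} for r in records]

    def transform(records):    # data transformation: consolidate by aggregation
        totals = {}
        for r in records:
            totals[r["item"]] = totals.get(r["item"], 0) + r["amount"]
        return totals

    def mine(summary):         # data mining: stand-in for a real algorithm
        return max(summary, key=summary.get)

    raw = [{"item": "pen", "amount": 3}, {"item": "book", "amount": None},
           {"item": "pen", "amount": 2}]
    pattern = mine(transform(select(clean(raw))))
    print("Most significant item:", pattern)   # evaluation/presentation step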
DATA MINING MODELS

[Figure: taxonomy of data mining models, split into predictive models (classification,
regression, time series analysis, prediction) and descriptive models (clustering,
summarization, association rules, sequence discovery)]
1. Predictive model
  o Makes a prediction about values of data using known results found from different data.
  o Predictions may be made based on the use of other historical data.
2. Descriptive model
  o Identifies patterns or relationships in data.
  o Serves as a way to explore the properties of the data examined, not to predict new
    properties.
  o Clustering, summarization, association rules, and sequence discovery are usually
    viewed as descriptive in nature.
1.1 Classification
  o Maps data into predefined groups or classes.
  o Often referred to as supervised learning because the classes are determined before
    examining the data.
  o Classification algorithms require that the classes be defined based on data attribute
    values. They often describe these classes by looking at the characteristics of data
    already known to belong to the classes.
  o E.g.: the Naive Bayes classifier (a minimal sketch follows).
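
A minimal classification sketch, assuming scikit-learn is available; the loan-applicant
features and labels below are invented for illustration.

    from sklearn.naive_bayes import GaussianNB

    # Features: [income in lakhs, years employed]; labels: 1 = prospective, 0 = defaulter.
    X = [[8, 5], [9, 7], [2, 1], [1, 0], [7, 4], [2, 2]]
    y = [1, 1, 0, 0, 1, 0]

    model = GaussianNB().fit(X, y)      # classes are fixed before learning
    print(model.predict([[6, 3]]))      # classify a new, unseen applicant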
1.2 Regression
  o Regression is used to map a data item to a real-valued prediction variable.
  o Regression involves learning the function that does this mapping.
  o Regression assumes that the target data fit some known type of function (e.g., linear,
    logistic) and then determines the best function of this type that models the given data.
  o Some type of error analysis is used to determine which function is "best" (a minimal
    sketch follows).
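
A minimal regression sketch, assuming numpy is available: a linear function is fitted to
hypothetical known marks and the learned mapping yields a real-valued prediction. All
values are invented.

    import numpy as np

    prev_marks = np.array([40, 55, 60, 75, 85])
    final_marks = np.array([45, 58, 66, 78, 88])

    # Determine the best function of the assumed (linear) type.
    slope, intercept = np.polyfit(prev_marks, final_marks, 1)
    predicted = slope * 70 + intercept
    print(round(predicted, 1))

    # Error analysis: the residual sum of squares measures how well the fit is.
    rss = np.sum((final_marks - (slope * prev_marks + intercept)) ** 2)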
1.3 Time Series Analysis
  o The value of an attribute is examined as it varies over time.
  o The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.).
  o A time series plot is used to visualize the time series.
  o There are three basic functions performed in time series analysis (a minimal sketch
    follows):
    - Distance measures are used to determine the similarity between different time series.
    - The structure of the line is examined to determine its behavior.
    - The historical time series plot is used to predict future values.

[Figure: a time series plot of three attributes X, Y, and Z]
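
A minimal sketch of the three basic functions on a hypothetical, evenly spaced series of
daily prices; the values and the naive forecasting rule are illustrative only.

    import math

    prices = [101.0, 103.5, 102.8, 105.1, 106.4, 107.0]
    other = [100.5, 103.0, 103.2, 104.8, 106.9, 107.4]

    # (1) Distance: Euclidean similarity between two series of equal length.
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(prices, other)))

    # (2) Structure: a 3-point moving average smooths the line for inspection.
    window = 3
    smoothed = [sum(prices[i - window:i]) / window
                for i in range(window, len(prices) + 1)]

    # (3) Prediction: naively use the most recent average as the next value.
    forecast = smoothed[-1]
    print(round(distance, 2), smoothed, forecast)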
1.4 Prediction
  o Many real-world data mining applications can be seen as predicting future data states
    based on past and current data.
  o Prediction can be viewed as a type of classification; the difference is that prediction
    concerns a future state rather than a current state.
  o Prediction applications include flooding, speech recognition, machine learning, and
    pattern recognition.
2.1 Clustering
  o Similar to classification, except that the groups are not predefined but rather defined
    by the data alone.
  o Also called unsupervised learning or segmentation.
  o It can be thought of as partitioning or segmenting the data into groups that might or
    might not be disjoint.
  o Clustering is usually accomplished by determining the similarity among the data on
    predefined attributes; the most similar data are grouped into clusters (a minimal
    sketch follows).
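
A minimal clustering sketch, assuming scikit-learn is available: no classes are given in
advance; two segments emerge from similarity on the attributes alone. The points are
invented for illustration.

    from sklearn.cluster import KMeans

    X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # e.g. [1 1 1 0 0 0]: two segments discovered from the data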
2.2 Summarization
  o Summarization maps data into subsets with associated simple descriptions.
  o Also called characterization or generalization.
  o It extracts or derives representative information about the database.
  o This may be accomplished by actually retrieving portions of the data, or summary-type
    information (such as the mean of some numeric attribute) can be derived from the data.
  o The summarization characterizes the contents of the database (a minimal sketch follows).
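
A minimal summarization sketch, using only the standard library: a few representative
statistics stand in for the full data. The attribute values are invented.

    import statistics

    ages = [23, 25, 31, 35, 42, 47]
    summary = {
        "count": len(ages),
        "mean":  statistics.mean(ages),
        "stdev": statistics.stdev(ages),
    }
    print(summary)   # a compact characterization of the attribute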
2.3 Association Rules
  o Link analysis, alternatively referred to as affinity analysis or association, refers to
    the data mining task of uncovering relationships among data.
  o An association rule is a model that identifies specific types of data associations.
  o These associations are often used in the retail sales community to identify items that
    are frequently purchased together.
  o Associations are also used in many other applications, such as predicting the failure
    of telecommunication switches (a minimal sketch follows).
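
A minimal sketch computing the support and confidence of the earlier example rule
{pen, pencil} -> {book} over a hypothetical transaction database; the transactions are
invented for illustration.

    transactions = [
        {"pen", "pencil", "book"},
        {"pen", "pencil", "book"},
        {"pen", "pencil"},
        {"book", "eraser"},
    ]

    antecedent, consequent = {"pen", "pencil"}, {"book"}
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)

    support = both / n           # how often the full itemset appears (0.5)
    confidence = both / ante     # how often the rule holds given the antecedent (0.67)
    print(support, confidence)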
2.4 Sequence Discovery
  o Sequential analysis, or sequence discovery, is used to determine sequential patterns
    in data.
  o These patterns are based on a time sequence of actions.
  o They are similar to associations in that data (or events) are found to be related, but
    here the relationship is based on time.
  o In sequence discovery, the items are purchased over time in some order. For example,
    most people who purchase CD players may be found to purchase CDs within one week
    (a minimal sketch follows).
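
A minimal sequence-discovery sketch: count how many customers bought item B after item A,
using hypothetical timestamped purchases. The tuples are (customer, time, item) and all
values are invented.

    purchases = [
        ("c1", 1, "cd_player"), ("c1", 3, "cd"),
        ("c2", 2, "cd_player"), ("c2", 5, "cd"),
        ("c3", 1, "cd"),
    ]

    def follows(a, b, events):
        """Count customers whose earliest purchase of `a` precedes a purchase of `b`."""
        hits = 0
        for c in {cust for cust, _, _ in events}:
            times_a = [t for cust, t, item in events if cust == c and item == a]
            times_b = [t for cust, t, item in events if cust == c and item == b]
            if times_a and times_b and min(times_a) < max(times_b):
                hits += 1
        return hits

    print(follows("cd_player", "cd", purchases))   # 2 customers match the pattern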
DATA WAREHOUSING

● Data warehousing provides architectures and tools for business executives to
  systematically organize, understand, and use their data to make strategic decisions.
● A data warehouse is a database that is maintained separately from an organization's
  operational databases.
DATA WAREHOUSE

"A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management's decision making process."

● Subject-oriented:
  o A data warehouse is organized around major subjects, such as customer, supplier,
    product, and sales.
  o It focuses on the modelling and analysis of data for decision makers, not on
    day-to-day transactions.
  o It provides a simple and concise view around particular subject issues by excluding
    data that are not useful in the decision support process.
● Integrated:
  o A data warehouse is usually constructed by integrating multiple heterogeneous
    sources, such as relational databases, flat files, and on-line transaction records.
● Time-variant:
  o Data are stored to provide information from a historical perspective.
  o Every key structure in the data warehouse contains, either implicitly or explicitly,
    an element of time.
● Non-volatile:
  o A data warehouse is always a physically separate store of data, transformed from the
    application data found in the operational environment. Due to this separation, a data
    warehouse does not require transaction processing, recovery, or concurrency control
    mechanisms. It usually requires only two operations in data accessing: initial
    loading of data and access of data.
● Data warehousing is the process of constructing and using data warehouses.
● The construction of a data warehouse requires data cleaning, data integration, and data
  consolidation.
● The utilization of a data warehouse often necessitates a collection of decision support
  technologies. This allows "knowledge workers" (e.g., managers, analysts, and executives)
  to use the warehouse to quickly and conveniently obtain an overview of the data and to
  make sound decisions based on information in the warehouse.
NEED FOR DATA WAREHOUSING

● To promote the high performance of both on-line transaction processing and on-line
  analytical processing.
● Data warehouse queries are often complex. They involve the computation of large groups
  of data at summarized levels, and may require special data organization, access, and
  implementation methods based on multidimensional views. Processing OLAP queries in
  operational databases would substantially degrade the performance of operational tasks.
● An operational database supports the concurrent processing of multiple transactions, so
  concurrency control techniques are required in OLTP; such measures would degrade the
  performance of OLAP.
● The structures, contents, and uses of the data in the two systems are different.
DATA MINING ISSUES

Mining methodology and user interaction issues:

● Mining different kinds of knowledge in databases:
  Because different users can be interested in different kinds of knowledge, data mining
  should cover a wide spectrum of data analysis and knowledge discovery tasks. These
  tasks may use the same database in different ways and require the development of
  numerous data mining techniques.
● Interactive mining of knowledge at multiple levels of abstraction:
  Interactive mining allows users to focus the search for patterns, providing and refining
  data mining requests based on returned results. The user can interact with the data
  mining system to view data and discovered patterns at multiple granularities and from
  different angles.
● Incorporation of background knowledge:
  Background knowledge, or information regarding the domain under study, may be used
  to guide the discovery process and allow discovered patterns to be expressed in concise
  terms and at different levels of abstraction.
● Data mining query languages and ad hoc data mining:
  Relational query languages (such as SQL) allow users to pose ad hoc queries for data
  retrieval. Analogously, a data mining query language should be integrated with a
  database or data warehouse query language and optimized for efficient and flexible
  data mining.
● Presentation and visualization of data mining results:
  Discovered knowledge should be expressed in high-level languages, visual
  representations, or other expressive forms so that the knowledge can be easily
  understood and directly usable by humans. This requires the system to adopt expressive
  knowledge representation techniques, such as trees, tables, rules, graphs, charts,
  crosstabs, matrices, or curves.
● Handling noisy or incomplete data:
  The data stored in a database may reflect noise, exceptional cases, or incomplete data
  objects. When mining data regularities, these objects may confuse the process, causing
  the knowledge model constructed to overfit the data. As a result, the accuracy of the
  discovered patterns can be poor.
● Pattern evaluation (the interestingness problem):
  A data mining system can uncover thousands of patterns. Several challenges remain
  regarding the development of techniques to assess the interestingness of discovered
  patterns, particularly with regard to subjective measures that estimate the value of
  patterns with respect to a given user class, based on user beliefs or expectations. The
  use of interestingness measures or user-specified constraints to guide the discovery
  process and reduce the search space is another active area of research.
Performance issues:

● Efficiency and scalability of data mining algorithms:
  To effectively extract information from a huge amount of data in databases, data mining
  algorithms must be efficient and scalable. The running time of a data mining algorithm
  must be predictable and acceptable in large databases.
● Parallel, distributed, and incremental mining algorithms:
  The huge size of many databases, the wide distribution of data, and the computational
  complexity of some data mining methods are factors motivating the development of
  parallel and distributed data mining algorithms. Such algorithms divide the data into
  partitions, which are processed in parallel; the results from the partitions are then
  merged.
Issues relating to the diversity of database types:

● Handling of relational and complex types of data:
  Because relational databases and data warehouses are widely used, the development of
  efficient and effective data mining systems for such data is important. However, other
  databases may contain complex data objects, hypertext and multimedia data, spatial
  data, temporal data, or transaction data. Specific data mining systems should be
  constructed for mining specific kinds of data.
● Mining information from heterogeneous databases and global information systems:
  Local- and wide-area computer networks (such as the Internet) connect many sources of
  data, forming huge, distributed, and heterogeneous databases. The discovery of
  knowledge from different sources of structured, semistructured, or unstructured data
  with diverse data semantics poses great challenges to data mining.
DATA WAREHOUSING CHALLENGES

● Data Quality
  When a data warehouse tries to combine inconsistent data from disparate sources, it
  encounters errors. Inconsistent data, duplicates, logic conflicts, and missing data all
  result in data quality challenges. Poor data quality results in faulty reporting and
  analytics necessary for optimal decision making.
● Understanding Analytics
  The powerful analytics tools and reports available through integrated data will provide
  credit union leaders with the ability to make precise decisions that impact the future
  success of their organizations. When building a data warehouse, analytics and reporting
  will have to be taken into design considerations. In order to do this, the business user
  will need to know exactly what analysis will be performed. Envisioning these reports
  will be difficult for someone who hasn't yet utilized a BI strategy and is unaware of its
  capabilities and limitations.
● Quality Assurance
  The end user of a data warehouse is using Big Data reporting and analytics to make the
  best decisions possible. Consequently, the data must be 100 percent accurate, or a
  credit union leader could make ill-advised decisions that are detrimental to the future
  success of the business. This high reliance on data quality makes testing a high-priority
  issue that will require a lot of resources to ensure the information provided is accurate.
  The credit union will have to develop all of the steps required to complete a successful
  Software Testing Life Cycle (STLC), which will be a costly and time-intensive process.
● Data Structuring and Systems Optimization
  The correct processing of data requires structuring it in a way that makes sense for
  future operations. As more and more information is added to the warehouse, structuring
  the data becomes increasingly difficult and can slow down the process significantly. In
  addition, it becomes difficult for the system manager to qualify the data for analytics.
  In terms of systems optimization, it is important to carefully design and configure data
  analysis tools; this will provide better results, making development decisions easier.
● Choosing the Right Type of Warehouse
  Which type of warehouse to choose depends on the business model and its specific goals.
● Balancing Resources
  To receive the most benefit from data warehouse deployment, most businesses choose to
  allow multiple departments to access the system. This can add stress to the warehouse
  and decrease efficiency. However, implementing access control and security measures
  can help balance the usefulness and performance of warehouse systems.
● Data Governance and Master Data
  One mistake that some businesses make is a lack of investment in data governance and
  master data. Because information is one of an organization's most important assets, it
  should be closely monitored. Implementing data governance allows ownership to be
  clearly defined and ensures that shared data is both consistent and accurate.
● Performance
  A data warehouse must be carefully designed to meet overall performance requirements.
  While the final product can be customized to fit the performance needs of the
  organization, the initial overall design must be carefully thought out to provide a
  stable foundation from which to start.
● Designing the Data Warehouse
  People generally don't want to "waste" their time defining the requirements necessary
  for proper data warehouse design. Usually, there is a high-level perception of what they
  want out of a data warehouse, but they don't fully understand all the implications of
  these perceptions and therefore have a difficult time adequately defining them. This
  results in miscommunication between the business users and the technicians building the
  data warehouse. The typical end result is a data warehouse that does not deliver the
  results expected by the user. Since the data warehouse is then inadequate for the end
  user, fixes and improvements are needed immediately after initial delivery, and the
  unfortunate outcome is greatly increased development fees.
● User Acceptance
  People are not keen on changing their daily routine, especially if the new process is
  not intuitive. There are many challenges to overcome in making a data warehouse that is
  quickly adopted by an organization. Having a comprehensive user training program can
  ease this hesitation, but it will require planning and additional resources.
● Cost
  A frequent misconception among credit unions is that they can build a data warehouse
  in-house to save money. As the foregoing points emphasize, there are a multitude of
  hidden problems in building data warehouses. Even if a credit union adds a data
  warehouse "expert" to its staff, the depth and breadth of skills needed to deliver an
  effective result is simply not feasible with one or a few experienced professionals
  leading a team of non-BI-trained technicians. The harsh reality is that an effective
  do-it-yourself effort is very costly.
OLTP VS DATA WAREHOUSE

● OLTP systems
  o Designed to maximize transaction processing capacity.
  o Commonly used in clerical data processing tasks and structured, repetitive tasks that
    read or update a few records.
  o Isolation, recovery, and integrity are critical.
● Data warehouse
  o Holds data that is historical, detailed, and summarized to various levels, and rarely
    subject to change.
  o Designed to support relatively low numbers of transactions that are unpredictable in
    nature and require answers to queries that are ad hoc, unstructured, and heuristic.
