DMW-M1-Ktunotes.in
● Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW),
and various data storage devices every day from business, society, science and engineering,
medicine, and almost every other aspect of daily life.
● This explosive growth of available data volume is a result of the computerization of our society
and the fast development of powerful data collection and storage tools.
● This explosively growing, widely available, and gigantic body of data makes our time truly the
data age. Powerful and versatile tools are badly needed to automatically uncover valuable
information from the tremendous amounts of data and to transform such data into organized
knowledge.
● In summary, the abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data rich but information poor situation.
● Data mining is a powerful new technology with great potential to help companies focus on the
most important information in their data warehouses.
● It has been defined as
“The automated analysis of large or complex data sets in order to discover significant patterns or
trends that would otherwise go unrecognised.”
● In addition, many other terms have a similar meaning to data mining—for example, knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.
● Data mining uses mathematical analysis to derive patterns and trends that exist in data.
Typically, these patterns cannot be discovered by traditional data exploration because the
relationships are too complex or because there is too much data.
● KINDS OF DATA THAT CAN BE MINED
o Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level. Flat files are simple data files in text or
binary format with a structure known by the data mining algorithm to be applied. The
data in these files can be transactions, time-series data, scientific measurements, etc.
o Relational Databases: A relational database consists of a set of tables containing
either values of entity attributes, or values of attributes from entity relationships.
Tables have columns and rows, where columns represent attributes and rows represent
tuples. A tuple in a relational table corresponds to either an object or a relationship
between objects and is identified by a set of attribute values representing a unique key.
o Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected
from multiple data sources (often heterogeneous) and is intended to be used as a whole
under the same unified schema. A data warehouse gives the option to analyze data from
different sources under the same roof.
o Transaction Databases: A transaction database is a set of records representing
transactions, each with a time stamp, an identifier and a set of items. Since relational
databases do not allow nested tables (i.e. a set as attribute value), transactions are usually
stored in flat files or stored in two normalized transaction tables, one for the transactions
and one for the transaction items.
o Multimedia Databases: Multimedia databases include video, images, audio and text
media. They can be stored on extended object-relational or object-oriented databases, or
simply on a file system.
o Spatial Databases: Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional positioning.
o Time-Series Databases: Time-series databases contain time-related data such as stock
market data or logged activities. These databases usually have a continuous flow of new
data coming in, which sometimes calls for challenging real-time analysis.
o World Wide Web: The World Wide Web is the most heterogeneous and dynamic
repository available. A very large number of authors and publishers are continuously
contributing to its growth and metamorphosis, and a massive number of users are
accessing its resources daily. Data in the World Wide Web is organized in inter-connected
documents. These documents can be text, audio, video, raw data, and even applications.
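The two-table layout mentioned for transaction databases can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (transactions, transaction_items), the column names, and the sample rows are assumptions, not part of these notes.

```python
import sqlite3

def build_transaction_db():
    """Create an in-memory database with two normalized transaction tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE transactions (
            tx_id     INTEGER PRIMARY KEY,
            timestamp TEXT NOT NULL
        );
        CREATE TABLE transaction_items (
            tx_id INTEGER NOT NULL REFERENCES transactions(tx_id),
            item  TEXT NOT NULL
        );
        """
    )
    # Hypothetical sample data: one row per transaction ...
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [(1, "2024-01-05T10:00"), (2, "2024-01-05T10:07")],
    )
    # ... and one row per (transaction, item) pair.
    conn.executemany(
        "INSERT INTO transaction_items VALUES (?, ?)",
        [(1, "milk"), (1, "bread"), (2, "milk")],
    )
    return conn

def items_by_transaction(conn):
    """Rebuild the nested 'set of items per transaction' view with a join."""
    result = {}
    rows = conn.execute(
        "SELECT t.tx_id, i.item FROM transactions t "
        "JOIN transaction_items i ON i.tx_id = t.tx_id"
    )
    for tx_id, item in rows:
        result.setdefault(tx_id, set()).add(item)
    return result
```

Calling items_by_transaction(build_transaction_db()) gives back one item set per transaction, which is the nested form that a relational table cannot store directly.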
● KINDS OF PATTERNS THAT CAN BE DISCOVERED
▪ Characterization: Data characterization is a summarization of general features
of objects in a target class, and produces what is called characteristic rules. The
data relevant to a user-specified class are normally retrieved by a database query
and run through a summarization module to extract the essence of the data at
different levels of abstractions.
▪ Discrimination: Data discrimination produces what are called discriminant rules
and is basically the comparison of the general features of objects between two
classes referred to as the target class and the contrasting class. For example, one
may want to compare the general characteristics of the customers who rented
more than 30 movies in the last year with those whose rental account is lower
than 5. The techniques used for data discrimination are very similar to the
techniques used for data characterization with the exception that data
discrimination results include comparative measures.
▪ Association analysis: Association analysis is the discovery of what are
commonly called association rules. It studies the frequency of items occurring
together in transactional databases, and based on a threshold called support,
identifies the frequent item sets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction when another item
appears, is used to pinpoint association rules. Association analysis is commonly
used for market basket analysis. For example, it could be useful for the
VideoStore manager to know what movies are often rented together or if there is
a relationship between renting a certain type of movies and buying popcorn.
▪ Classification: Classification analysis is the organization of data in given
classes. Also known as supervised classification, the classification uses given
class labels to order the objects in the data collection. Classification approaches
normally use a training set where all objects are already associated with known
class labels. The classification algorithm learns from the training set and builds a
model. The model is used to classify new objects. An example is email spam
classification.
▪ Prediction: Prediction has attracted considerable attention given the potential
implications of successful forecasting in a business context. There are two major
types of predictions: one can either try to predict some unavailable data values or
pending trends, or predict a class label for some data. The latter is tied to
classification. Once a classification model is built based on a training set, the
class label of an object can be foreseen based on the attribute values of the object
and the attribute values of the classes. However, prediction more often refers
to the forecasting of missing numerical values, or of increase/decrease trends in
time-related data. The main idea is to use a large number of past values to estimate
probable future values.
▪ Clustering: Similar to classification, clustering is the organization of data in
classes. However, unlike classification, in clustering, class labels are unknown
and it is up to the clustering algorithm to discover acceptable classes. Clustering
is also called unsupervised classification, because the classification is not
dictated by given class labels.
▪ Outlier analysis: Outliers are data elements that cannot be grouped in a given
class or cluster. Also known as exceptions or surprises, they are often very
important to identify. While outliers can be considered noise and discarded in
some applications, they can reveal important knowledge in other domains, and
thus can be very significant and their analysis valuable.
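As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean. The z-score rule and the threshold of 2 are one common heuristic chosen here for illustration, not a method prescribed by these notes, and the rental counts are made up.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical yearly rental counts for ten customers.
rentals_per_year = [4, 5, 6, 5, 4, 6, 5, 5, 4, 60]
outliers = zscore_outliers(rentals_per_year)
```

Whether the flagged value is noise to discard or a customer worth a closer look depends on the application, as the bullet above points out.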
● Data Mining Applications
o Financial Data Analysis
Financial data in the banking and financial industry are generally reliable and of high quality,
which facilitates systematic data analysis and data mining. Some typical cases are
as follows:
● Design and construction of data warehouses for multidimensional data analysis
and data mining.
● Loan payment prediction and customer credit policy analysis.
● Classification and clustering of customers for targeted marketing.
● Detection of money laundering and other financial crimes.
o Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts
of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer
retention and satisfaction.
o Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding. This is why
data mining has become very important for understanding the business.
Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving
quality of service.
o Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. Increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration. Areas in which data mining
technology may be applied for intrusion detection include:
▪ Development of data mining algorithms for intrusion detection.
▪ Analysis of stream data.
▪ Distributed data mining.
▪ Visualization and query tools.
o Medicine
▪ Data mining makes it possible to characterize patient activities in order to anticipate
incoming office visits.
▪ Data mining helps identify the patterns of successful medical therapies for
different illnesses.
o A mining model is created by applying an algorithm to data, but it is more than an algorithm or a
metadata container: it is a set of data, statistics, and patterns that can be applied to new data to
generate predictions and make inferences about relationships.
o Data Mining Model Architecture
o A data mining model gets data from a mining structure and then analyzes that data by
using a data mining algorithm. The mining structure and mining model are separate
objects.
o The mining structure stores information that defines the data source. A mining model
stores information derived from statistical processing of the data, such as the patterns
found as a result of analysis.
o A mining model is empty until the data provided by the mining structure has been
processed and analyzed. After a mining model has been processed, it contains metadata,
results, and bindings back to the mining structure.
o The metadata specifies the name of the model and the server where it is stored, as well
as a definition of the model, including the columns from the mining structure that were
used in building the model, the definitions of any filters that were applied when
processing the model, and the algorithm that was used to analyze the data.
o For example, the same data can be used to create multiple models, using perhaps a clustering
algorithm, a decision tree algorithm, and a Naïve Bayes algorithm. Each model type creates
a different set of patterns, item sets, rules, or formulas, which you can use for making
predictions. Generally, each algorithm analyzes the data in a different way, so
the content of the resulting model is also organized in different structures, such as
clusters, trees, branches, etc.
o A model is also affected by the data that you train it on: even models trained on the same
mining structure can yield different results if you filter the data differently.
o The model does contain a set of bindings, which point back to the data cached in the
mining structure. If the data has been cached in the structure and has not been cleared
after processing, these bindings enable you to drill through from the results to the cases
that support the results. However, the actual data is stored in the structure cache, not in
the model.
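The separation between a mining structure and a mining model can be sketched with two small Python classes. Everything here is hypothetical and only mirrors the description above: the class names, the fields, and the stand-in "algorithm" (a simple frequency count) are assumptions, not the API of any real product.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MiningStructure:
    """Defines the data source and caches the cases (rows)."""
    name: str
    columns: list
    cases: list = field(default_factory=list)

@dataclass
class MiningModel:
    """Holds metadata, results, and a binding back to its structure."""
    name: str
    algorithm: str
    structure: MiningStructure      # binding back to the mining structure
    used_columns: list
    row_filter: object = None       # optional predicate applied when processing
    results: dict = field(default_factory=dict)  # empty until processed

    def process(self):
        """Fill the empty model by analyzing the structure's cached cases."""
        rows = [c for c in self.structure.cases
                if self.row_filter is None or self.row_filter(c)]
        # Stand-in "algorithm": count value frequencies per used column.
        self.results = {
            col: Counter(row[col] for row in rows) for col in self.used_columns
        }
        return self

structure = MiningStructure(
    name="customers",
    columns=["age_group", "buys"],
    cases=[
        {"age_group": "young", "buys": "yes"},
        {"age_group": "young", "buys": "no"},
        {"age_group": "senior", "buys": "yes"},
    ],
)
all_rows = MiningModel("m_all", "frequency-count", structure, ["buys"]).process()
young_only = MiningModel(
    "m_young", "frequency-count", structure, ["buys"],
    row_filter=lambda c: c["age_group"] == "young",
).process()
```

Both models share one structure, yet the filter makes their results differ, and the raw cases stay in the structure rather than in either model.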
● Data warehouses generalize and consolidate data in multidimensional space. The construction
of data warehouses involves data cleaning, data integration, and data transformation, and can be
viewed as an important preprocessing step for data mining.
● Data warehouses provide online analytical processing (OLAP) tools for the interactive analysis
of multidimensional data of varied granularities, which facilitates effective data generalization
and data mining.
● Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
● Data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases. Data warehouse systems allow for integration of a variety of
application systems. They support information processing by providing a solid platform of
consolidated historic data for analysis.
● According to William H. Inmon, “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s decision making process.”
● The four keywords—subject-oriented, integrated, time-variant, and nonvolatile—distinguish data
warehouses from other data repository systems, such as relational database systems, transaction
processing systems, and file systems.
o Subject-oriented: A data warehouse is organized around major subjects such as
customer, supplier, product, and sales. Rather than concentrating on the day-to-day
operations and transaction processing of an organization, a data warehouse focuses on the
modeling and analysis of data for decision makers. Hence, data warehouses typically
provide a simple and concise view of particular subject issues by excluding data that are
not useful in the decision support process.
o Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and online transaction
records. Data cleaning and data integration techniques are applied to ensure consistency
in naming conventions, encoding structures, attribute measures, and so on.
o Time-variant: Data are stored to provide information from an historic perspective (e.g.,
the past 5–10 years). Every key structure in the data warehouse contains, either implicitly
or explicitly, a time element.
o Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms. It usually requires only two operations in data accessing: initial
loading of data and access of data.
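The "integrated" property can be illustrated with a small sketch that maps records from two sources onto one naming and encoding convention. The source systems, field names, encodings, and sample records below are all made up for illustration.

```python
def from_crm(record):
    """Source A (hypothetical): 'cust_name', sex as 'M'/'F', revenue in dollars as text."""
    return {
        "customer": record["cust_name"].strip().title(),
        "gender": {"M": "male", "F": "female"}[record["sex"]],
        "revenue_usd": float(record["revenue"]),
    }

def from_billing(record):
    """Source B (hypothetical): 'name', gender as 0/1, revenue in cents."""
    return {
        "customer": record["name"].strip().title(),
        "gender": {0: "male", 1: "female"}[record["gender"]],
        "revenue_usd": record["revenue_cents"] / 100,
    }

crm_rows = [{"cust_name": "alice smith", "sex": "F", "revenue": "120.50"}]
billing_rows = [{"name": " Bob Jones ", "gender": 0, "revenue_cents": 9900}]

# After integration, every row follows the same names, encodings, and units.
warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each source gets its own small adapter, so the warehouse side only ever sees the unified schema.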
● A data warehouse (DWH) stores the information an enterprise needs to make strategic decisions. A data warehouse is
also often viewed as an architecture, constructed by integrating data from multiple heterogeneous
sources to support structured and/or ad hoc queries, analytical reporting, and decision making.
● Data warehousing is the process of constructing and using data warehouses. The construction of
a data warehouse requires data cleaning, data integration, and data consolidation. The utilization
of a data warehouse often necessitates a collection of decision support technologies. This allows
“knowledge workers” (e.g., managers, analysts, and executives) to use the warehouse to quickly
and conveniently obtain an overview of the data, and to make sound decisions based on
information in the warehouse.
● Data warehousing is also very useful from the point of view of heterogeneous database
integration. Organizations typically collect diverse kinds of data and maintain large databases
from multiple, heterogeneous, autonomous, and distributed information sources.
● The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed
to a client site, a metadata dictionary is used to translate the query into queries appropriate for the
individual heterogeneous sites involved. These queries are then mapped and sent to local query
processors. The results returned from the different sites are integrated into a global answer set.
This query-driven approach requires complex information filtering and integration processes,
and competes with local sites for processing resources. It is inefficient and potentially expensive
for frequent queries, especially queries requiring aggregations.
● An alternative to this traditional approach is the update-driven approach, in which information from
multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct
querying and analysis.
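The contrast between the query-driven and update-driven approaches can be sketched as follows. The two "sites", their local schemas, and the numbers are hypothetical; a real mediator would translate queries through a metadata dictionary rather than hard-coded field names.

```python
# Two hypothetical autonomous sources with different local schemas.
SITE_A = [{"region": "north", "amount": 100}, {"region": "south", "amount": 50}]
SITE_B = [{"area": "north", "total": 70}]

def query_driven_total(region):
    """Translate the query for each site at query time, then merge the answers."""
    part_a = sum(r["amount"] for r in SITE_A if r["region"] == region)
    part_b = sum(r["total"] for r in SITE_B if r["area"] == region)
    return part_a + part_b

def build_warehouse():
    """Integrate all sources once, in advance, under a unified schema."""
    unified = [{"region": r["region"], "amount": r["amount"]} for r in SITE_A]
    unified += [{"region": r["area"], "amount": r["total"]} for r in SITE_B]
    return unified

WAREHOUSE = build_warehouse()

def update_driven_total(region):
    """Answer the query directly from the pre-integrated warehouse."""
    return sum(r["amount"] for r in WAREHOUSE if r["region"] == region)
```

Both functions return the same totals; the difference is that the query-driven version touches every source on every query, while the update-driven version pays the integration cost once.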
● The major task of online operational database systems is to perform online transaction and query
processing. These systems are called online transaction processing (OLTP) systems. They
cover most of the day-to-day operations of an organization such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting.
● Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data
analysis and decision making. Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users. These systems are known as online
analytical processing (OLAP) systems.
● The major distinguishing features of OLTP and OLAP are summarized as follows:
o Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
o Data contents: An OLTP system manages current data that, typically, are too detailed to
be easily used for decision making. An OLAP system manages large amounts of historic
data, provides facilities for summarization and aggregation, and stores and manages
information at different levels of granularity. These features make the data easier to use
for informed decision making.
o Database design: An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design. An OLAP system typically adopts
either a star or a snowflake model and a subject-oriented database design.
o View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema, due to the
evolutionary process of an organization. OLAP systems also deal with information that
originates from different organizations, integrating information from many data stores.
Because of their huge volume, OLAP data are stored on multiple storage media.
o Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only operations (because most data
warehouses store historic rather than up-to-date information), although many could be
complex queries.
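The difference in access patterns can be sketched with Python's sqlite3 module. This is a toy, single-table illustration under assumed names (a sales table with year, region, and amount columns); a real warehouse would keep the analytical data in a separate store with a star or snowflake schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, region TEXT, amount REAL)"
)

def record_sale(year, region, amount):
    """OLTP-style access: one short, atomic write transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO sales (year, region, amount) VALUES (?, ?, ?)",
            (year, region, amount),
        )

def sales_by_year_and_region():
    """OLAP-style access: a read-only query that aggregates historic data."""
    return conn.execute(
        "SELECT year, region, SUM(amount) FROM sales "
        "GROUP BY year, region ORDER BY year, region"
    ).fetchall()

# Hypothetical sample transactions.
for sale in [(2022, "north", 100.0), (2022, "north", 50.0), (2023, "south", 75.0)]:
    record_sale(*sale)

summary = sales_by_year_and_region()
```

The write path handles one detailed record at a time, while the read path scans many records and returns a summary at a coarser level of granularity.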
Need for Data Warehousing
1. The data warehouse market supports such diverse industries as manufacturing, retail,
telecommunications, and health care. Think of a personnel database for a company that is
continually modified as personnel are added and deleted... Suppose management wishes to
determine whether there is a problem with too many employees quitting. To analyze this
problem, they would need to know which employees have left, when they left, why they
left, and other information about their employment. For management to make these types
of high-level business analyses, historical data, not just the current snapshot, are
required.
A data warehouse is a data repository used to support decision support systems.
2. The basic motivation is to increase business profitability. Traditional data processing
applications support the day-to-day clerical and administrative decisions, while data
warehousing supports long-term strategic decisions.
3. For increasing customer focus, which includes the analysis of customer buying patterns
(such as buying preference, buying time, budget cycles, and appetites for spending)
4. For repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic regions in order to fine tune
production strategies; analyzing operations and looking for sources of profit.
5. For managing the customer relationships, making environmental corrections, and
managing the cost of corporate assets.
6. The figure below shows a simple view of a data warehouse. The basic components of a
data warehousing system include data migration, the warehouse, and access tools. The
data are extracted from operational systems, but must be reformatted, cleansed,
integrated, and summarized before being placed in the warehouse.
2. Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are
many kinds of frequent patterns,
including frequent itemsets, frequent subsequences (also known as sequential patterns), and
frequent substructures. A frequent itemset typically refers to a set of items that often appear
together in a transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers. A frequently occurring subsequence, such as the
pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a
memory card, is a (frequent) sequential pattern. A substructure can refer to different structural
forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent
patterns leads to the discovery of interesting associations and correlations within data.
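A frequent itemset search can be sketched by brute force: enumerate small item combinations in each transaction and keep those that appear often enough. This is a naive enumeration for illustration only, not an efficient algorithm such as Apriori, and the baskets and the 50% minimum support are made-up values.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to `max_size` items whose
    support (fraction of transactions containing them) is >= min_support."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical grocery baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "butter"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
```

Here milk and bread appear together in two of the four baskets, so the pair counts as frequent at a 50% threshold, while milk and butter (one basket) does not.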
Association analysis:
Consider the rule buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%],
where X is a variable representing a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50%
chance that she will buy software as well. A 1% support means that 1% of all the transactions
under analysis show that computer and software are purchased together. This association rule
involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a
single predicate are referred to as single-dimensional association rules. Dropping the predicate
notation, the rule can be written simply as “computer ⇒ software [1%, 50%].”
Adopting the terminology used in multidimensional databases, where each attribute is referred to
as a dimension, a rule that involves more than one attribute or predicate (for example, age, income,
and buys) is referred to as a multidimensional association rule.
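The 1% support and 50% confidence figures can be reproduced on a small made-up data set. The sketch below assumes 100 transactions, of which 2 contain a computer and 1 of those also contains software; the data and function names are illustrative only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# 100 hypothetical transactions: 2 with a computer, 1 of those also with software.
transactions = (
    [{"computer", "software"}]
    + [{"computer"}]
    + [{"printer"}] * 98
)

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

Support is measured against all transactions, whereas confidence is measured only against the transactions that contain the antecedent, which is why the two values differ so much here.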
6. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions. However, in some applications (e.g., fraud detection) the rare events can be more
3. Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable. The running time of a data mining algorithm must be
predictable and acceptable in large databases.
Parallel, distributed, and incremental mining algorithms:
The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development of parallel and
distributed data mining algorithms. Such algorithms divide the data into partitions, which are
processed in parallel. The results from the partitions are then merged.
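The partition, process, and merge pattern can be sketched with Python's concurrent.futures module. The example counts item occurrences per partition and merges the partial counts. A thread pool is used only to keep the sketch short; for CPU-bound mining in CPython a process pool or a distributed framework would be the more realistic choice, and the data are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Local step: count item occurrences within one partition."""
    counts = Counter()
    for basket in partition:
        counts.update(set(basket))
    return counts

def parallel_item_counts(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)  # adds the partial counts together
    return merged

# Hypothetical transactions.
transactions = [["milk", "bread"], ["milk"], ["bread", "butter"], ["milk", "butter"]]
merged_counts = parallel_item_counts(transactions)
```

Counting works well here because partial counts can simply be added; mining tasks whose partial results do not combine so directly need a more careful merge step.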
4. Issues relating to the diversity of database types:
Handling of relational and complex types of data:
Because relational databases and data warehouses are widely used, the development of
efficient and effective data mining systems for such data is important. However, other databases
may contain complex data objects, hypertext and multimedia data, spatial data, temporal data, or
transaction data. Specific data mining systems should be constructed for mining specific kinds of
data.
Mining information from heterogeneous databases and global information
systems:
Local- and wide-area computer networks (such as the Internet) connect many sources of
data, forming huge, distributed, and heterogeneous databases. The discovery of
knowledge from different sources of structured, semi structured, or unstructured data with
diverse data semantics poses great challenges to data mining.
Data Warehouse: Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions. Data
warehouse refers to a database that is maintained separately from an organization’s operational
databases. A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making process.
Applications of DWH
There are three kinds of data warehouse applications: information processing, analytical processing, and
data mining.
3) Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the
mining results using visualization tools.
Banking Industry
In the banking industry, concentration is given to risk management and policy reversal as
well as analyzing consumer data, market trends, government regulations and reports, and,
more importantly, financial decision making.
Certain banking sectors utilize data warehouses for market research, performance analysis of each
product, interchange and exchange rates, and to develop marketing programs.
Analysis of cardholders’ transactions, spending patterns, and merchant classification
provides the bank with an opportunity to introduce special offers and lucrative
deals based on cardholder activity.
Finance Industry
Applications revolve around the evaluation and trends of customer expenses, which aids in maximizing
the profits earned by their clients.
They are used for prediction of consumer trends, inventory management, market and
advertising research.
Government
The federal government utilizes the warehouses for research in compliance, whereas the
state government uses it for services related to human resources like recruitment, and
accounting like payroll management.
The government uses data warehouses to maintain and analyze tax records, health policy
records and their respective providers.
Criminal law database is connected to the state’s data warehouse. Criminal activity is
predicted from the patterns and trends, results of the analysis of historical data associated
with past criminals.
Healthcare
Healthcare providers feed all of their financial, clinical, and employee records to warehouses, which helps
them to strategize and predict outcomes, track and analyze their service feedback,
generate patient reports, and share data with tie-in insurance companies, medical aid services,
etc.
Hospitality Industry
A major proportion of this industry is dominated by hotel and restaurant services, car
rental services, and holiday home services.
They utilize warehouse services to design and evaluate their advertising and promotion
campaigns where they target customers based on their feedback and travel patterns.
Insurance
The warehouses are primarily used to analyze data patterns and customer trends, apart
from maintaining records of already existing participants.
The design of tailor-made customer offers and promotions is also possible through
warehouses.
They also use them to keep product shipment records and records of product portfolios, to identify
profitable product lines, and to analyze previous data and customer feedback to evaluate the
weaker product lines and eliminate them.
For distribution, the supply chain management of products operates through data
warehouses.
The Retailers
They use warehouses to track items, their advertising promotions, and consumers’
buying trends.
They also analyze sales to determine fast selling and slow selling product lines and
determine their shelf space through a process of elimination.
Services Sector
Telephone Industry
The telephone industry operates over both offline and online data, burdening it with a
lot of historical data which has to be consolidated and integrated.
Analysis of fixed assets, analysis of customers’ calling patterns for sales representatives
to push advertising campaigns, and tracking of customer queries all require the facilities
of a data warehouse.
Transportation Industry
In the transportation industry, data warehouses record customer data enabling traders to
experiment with target marketing where the marketing campaigns are designed by
keeping customer requirements in mind.