DWH Unit 3
The process of extracting information from huge sets of data to identify patterns, trends, and useful
insights that allow a business to make data-driven decisions is called Data Mining.
Data mining is the process of automatically searching large stores of information to find trends and patterns that
go beyond simple analysis procedures. Data mining uses complex mathematical algorithms to segment data
and evaluate the probability of future events. Data Mining is also called Knowledge Discovery in Databases (KDD).
Example
• A mobile network operator consults a data miner to dig into the operator's call records.
• As the data miner starts digging into the data, he finds a pattern: there are fewer international calls on
certain days than on others.
• This information is shared with the management, and they come up with a plan to reduce international call
rates on those days to encourage more calls.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and
other documents. A huge amount of historical data is needed for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may comprise
one or more databases, text files, spreadsheets, or other repositories of data. Sometimes even plain text
files or spreadsheets may contain useful information. Another primary source of data is the World Wide
Web, or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated,
and selected. Because the information comes from various sources and in different formats, it cannot be
used directly for the data mining procedure: the data may be incomplete or inaccurate. So the
data first needs to be cleaned and unified. More information than needed will be collected from
various data sources, and only the data of interest has to be selected and passed to the server.
These procedures are not as simple as they sound; several methods may be applied to the data as part
of selection, integration, and cleaning.
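As a sketch, the cleaning, integration, and selection steps described above might look like this in Python. The record layout and field names are illustrative assumptions, not part of any specific tool:

```python
# Sketch of the cleaning / integration / selection pipeline.
# Records and field names are made-up example data.

raw_records = [
    {"id": 1, "name": "Alice", "age": 34, "country": "IN"},
    {"id": 2, "name": "Bob",   "age": None, "country": "US"},   # incomplete
    {"id": 1, "name": "Alice", "age": 34, "country": "IN"},     # duplicate from a second source
    {"id": 3, "name": "Cara",  "age": 29, "country": "IN"},
]

def clean(records):
    """Cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(records):
    """Integration: remove duplicates coming from multiple sources."""
    seen, unique = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)
    return unique

def select(records, fields):
    """Selection: keep only the attributes of interest."""
    return [{f: r[f] for f in fields} for r in records]

prepared = select(integrate(clean(raw_records)), ["id", "age"])
print(prepared)  # [{'id': 1, 'age': 34}, {'id': 3, 'age': 29}]
```

Only the clean, unified, and selected records reach the mining step; the incomplete and duplicate rows are filtered out on the way.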
Data Mining Engine:
The data mining engine is the core of the data mining architecture. It comprises the
instruments and software used to obtain insights and knowledge from data collected from various data
sources and stored within the data warehouse.
Pattern Evaluation Module:
This segment commonly employs interestingness measures that cooperate with the data mining modules to focus
the search towards interesting patterns. It might use an interestingness threshold to filter out discovered
patterns. Alternatively, the pattern evaluation module might be integrated with the mining
module, depending on the implementation of the data mining technique used. For efficient data
mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the
mining procedure so as to confine the search to only interesting patterns.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It may be used to guide the
search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views
and data from user experiences that can be helpful in the data mining process. The data mining
engine may receive inputs from the knowledge base to make the results more accurate and reliable. The
pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update
it.
Different Data Mining Methods
There are many methods used for data mining, but the crucial step is to select the one appropriate
to the business or the problem statement. These methods help in predicting
the future and then making decisions accordingly. They also help in analyzing market trends.
• Association
• Classification
• Clustering Analysis
• Prediction
• Decision Trees
• Neural Network
Types of Data Sources in Data Mining
1. Flat Files
2. Relational Databases
3. Data Warehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web (WWW)
Major Issues in Data Mining
Data Mining is not very simple to understand and implement. It is already evident that data mining
is a process which is very crucial for various researchers and businesses. But in data mining, the
algorithms are very complex, and on top of that, the data is not readily available in one place. Every
technology has flaws or issues, and one always needs to know the various flaws or issues a given
technology has.
Data Mining Techniques
Classification:
The technique used for obtaining important and relevant information about data and metadata is
called classification. As the literal meaning suggests, classification categorizes a given set of
information or data according to some criteria.
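As an illustration, classification can be sketched with a minimal 1-nearest-neighbour classifier: labelled examples define the categories, and a new point is assigned the label of the closest known example. The points and labels below are made-up example data:

```python
# 1-nearest-neighbour classification sketch (made-up data).

def classify(point, labelled):
    """Return the label of the training example closest to `point`."""
    def dist(a, b):
        # squared Euclidean distance is enough for choosing the minimum
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labelled, key=lambda item: dist(point, item[0]))[1]

training = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]

print(classify((1.1, 0.9), training))  # low
print(classify((8.5, 9.2), training))  # high
```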
Clustering:
Clustering can be defined as an analytics technique that relies heavily on visual approaches to
understand the data. Clustering mechanisms use graphics to show where the distribution of the data
lies; colors are typically used to show the distribution. A graphical approach is ideal for cluster
analytics because, with the help of graphs and clustering, users can easily see how the data is
distributed and identify the trends that are relevant to the business objectives.
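Setting the visual side aside, the grouping itself can be sketched with a bare-bones k-means loop; the one-dimensional data and the two starting centroids below are illustrative assumptions:

```python
# Bare-bones k-means clustering sketch (made-up 1-D data, k = 2).

def kmeans(points, centroids, rounds=10):
    for _ in range(rounds):
        # assign each point to its nearest centroid
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its members
        centroids = [sum(m) / len(m) for m in clusters.values() if m]
    return sorted(centroids)

data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(kmeans(data, [1.0, 9.0]))  # [1.5, 8.5]
```

The two final centroids summarize the two visible groups in the data; in practice the result would then be plotted so users can see the distribution.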
Regression:
Regression can be defined as a data mining technique that is most often used to predict a range of
numeric (or continuous) values from a given set of data. Regression is a widely used concept: it is
used across multiple industries, mainly for planning marketing strategies, forecasting financial
matters, and analyzing trends.
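A minimal sketch of regression, assuming simple least-squares fitting of a straight line to made-up spend-versus-sales figures:

```python
# Least-squares linear regression sketch (made-up data).

def fit_line(xs, ys):
    """Return slope a and intercept b of the best-fit line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    return a, mean_y - a * mean_x

# e.g. advertising spend vs. resulting sales (illustrative numbers)
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)       # 2.0 1.0
print(a * 5 + b)  # predicted continuous value for x = 5 -> 11.0
```

Once the line is fitted, the same formula predicts the continuous value for any new input, which is exactly the forecasting use the text describes.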
Outlier Detection:
Outlier detection is also called outlier analysis or outlier mining. It is a data mining
technique in which particular data objects in a data set that do not match the expected pattern or
trend are examined in full. This technique is very helpful, as it can be used in a wide variety
of domains, such as intrusion detection, fraud detection, fault detection, and so on.
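One common way to flag such objects, sketched here with made-up sensor readings, is a z-score test: values more than a chosen number of standard deviations from the mean are treated as outliers. The threshold of 2.0 is an illustrative assumption:

```python
# Z-score outlier detection sketch (made-up readings).
import statistics

def outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

readings = [10, 11, 10, 12, 11, 10, 50]  # 50 does not match the expected pattern
print(outliers(readings))  # [50]
```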
Sequential Patterns:
This type of data mining technique is used for discovering a series of events that have taken place in
sequence. It is notably useful for mining transactional data. For example, this technique can reveal
which hand accessories a customer is most likely to buy after an initial purchase of, say, a clothing
item. Sequential patterns help us understand the general trend of a customer's needs. In this way, an
organization can recommend additional items to its customers and thereby increase its sales.
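A minimal sketch of this idea, counting which item most often directly follows a given purchase; the transaction sequences are invented for illustration:

```python
# Sequential-pattern sketch: which item most often follows a purchase?
from collections import Counter

def next_after(sequences, item):
    """Count the items appearing immediately after `item` in each sequence."""
    follows = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):  # consecutive pairs in purchase order
            if a == item:
                follows[b] += 1
    return follows

purchases = [
    ["shirt", "belt", "shoes"],
    ["shirt", "belt"],
    ["jeans", "shirt", "watch"],
]
print(next_after(purchases, "shirt").most_common(1))  # [('belt', 2)]
```

Here the data suggests recommending a belt to a customer who has just bought a shirt.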
Prediction:
Prediction is an extremely powerful and beneficial feature of data mining; it represents one of the
four branches of analytics (the other branches are descriptive, diagnostic, and prescriptive). For a
better understanding, take an example: a manager of a well-respected organization has to predict how
much the customers of their products or services are likely to spend during a sale the organization
has offered.
Association Rules:
Association rules is a data mining technique that is very beneficial because it lets us search for
associations between two or more items or pieces of data. It helps in discovering hidden patterns or
trends in a given data set. Put simply, this technique helps us identify interesting associations or
relations within large sets of data.
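The two standard measures behind association rules, support and confidence, can be sketched for a candidate rule A → B as follows; the shopping baskets are made-up example data:

```python
# Support and confidence for a candidate association rule A -> B (made-up baskets).

def rule_stats(transactions, a, b):
    """Return (support, confidence) of the rule a -> b."""
    n = len(transactions)
    with_a = [t for t in transactions if a in t]
    with_both = [t for t in with_a if b in t]
    support = len(with_both) / n          # fraction of all baskets with both items
    confidence = len(with_both) / len(with_a)  # fraction of a-baskets also holding b
    return support, confidence

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]
print(rule_stats(baskets, "bread", "butter"))  # support 0.5, confidence ~0.67
```

High support means the pair occurs often; high confidence means buyers of the first item usually buy the second, i.e. the two items are strongly connected.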
Tracking Patterns:
Tracking patterns is an essential data mining technique. In this technique, the user identifies and
monitors the patterns or trends usually seen in the data being examined. It helps in making intelligent
inferences about business outcomes. If an organization is able to determine that one of the products
it sells is more popular than its other products, it can use that information to create products or
services similar to the best sellers.
Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work
on databases that contain transactions. With the help of these association rules, it determines how
strongly or how weakly two objects are connected. The algorithm uses a breadth-first search and a hash
tree to count candidate itemsets efficiently. It is an iterative process for finding the frequent
itemsets in a large dataset.
The algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket
analysis and helps to find products that are likely to be bought together. It can also be used in the
healthcare field to find drug reactions in patients.
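A compact sketch of the algorithm's core breadth-first loop, assuming made-up transaction data: frequent single items are counted first, and only the surviving itemsets are combined into candidates of the next size:

```python
# Apriori core-loop sketch (made-up baskets; no hash-tree optimisation).
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset occurring in at least `min_support` transactions."""
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    while current:
        # count support for each candidate itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # build (k+1)-candidates only from surviving k-itemsets
        current = list({a | b for a, b in combinations(list(survivors), 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

baskets = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "milk"}),
]
result = apriori(baskets, min_support=2)
print(result[frozenset({"bread", "butter"})])  # 2
```

Because {butter, milk} appears only once, it is pruned at size 2, so no size-3 candidate containing it is ever counted, which is the key saving Apriori offers over brute-force enumeration.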