big data analytics notes

Data mining is the computational process of discovering patterns in large datasets, aiming to extract actionable information through methods from artificial intelligence, machine learning, and statistics. The data mining process involves several tasks such as anomaly detection, association rule learning, clustering, classification, regression, and summarization, and is supported by components like a knowledge base, data mining engine, and user interface. Major issues in data mining include handling noisy data, pattern evaluation, and the efficiency of algorithms, while the overall knowledge discovery process encompasses data cleaning, integration, selection, transformation, mining, evaluation, and presentation.

Uploaded by

fatmaeram49


Big Data Analytics

What Is Data Mining?

Data mining refers to extracting, or mining, knowledge from large amounts of data. It would
have been more appropriately named knowledge mining, a term that emphasizes mining knowledge
from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are:

Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

Tasks of Data Mining

Data mining involves six common classes of tasks:

Anomaly detection (outlier/change/deviation detection) – The identification of unusual data
records that might be interesting, or of data errors that require further investigation.

Association rule learning (dependency modelling) – Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently bought
together and use this information for marketing purposes. This is sometimes referred to as
market basket analysis.

Clustering – The task of discovering groups and structures in the data that are in some way
"similar", without using known structures in the data.

Classification – The task of generalizing known structure to apply to new data. For example,
an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – Attempts to find a function that models the data with the least error.

Summarization – Providing a more compact representation of the data set, including
visualization and report generation.
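
The market basket example above can be made concrete with a small sketch. The transactions below are made-up toy data, and the helper names (`pair_counts`, `support`, `confidence`) are illustrative choices rather than anything defined in these notes. Support and confidence are the usual measures for association rules, used here on the assumption that they are what "frequently bought together" would be measured by.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
print(support({"bread", "butter"}, transactions))
print(confidence({"bread"}, {"butter"}, transactions))
```

In this toy data, bread and butter appear together more often than any other pair, which is the kind of finding a supermarket could act on.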

Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base:

This is the domain knowledge that is used to guide the search or evaluate the
interestingness of the resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's novelty based
on its unexpectedness, may also be included. Other examples of domain knowledge
are additional interestingness constraints or thresholds, and metadata (e.g., describing data
from multiple heterogeneous sources).

2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data mining
modules so as to focus the search toward interesting patterns. It may use interestingness
thresholds to filter out discovered patterns. Alternatively, the pattern evaluation
module may be integrated with the mining module, depending on the implementation
of the data mining method used. For efficient data mining, it is highly recommended
to push the evaluation of pattern interestingness as deep as possible into the mining process,
so as to confine the search to only the interesting patterns.
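
One way to picture "pushing the evaluation deep into the mining process" is a level-wise itemset search that applies the threshold at every level, so candidates that fail are never extended. The sketch below follows the Apriori pruning idea, which these notes do not name. It also assumes support is the interestingness measure. The function name `frequent_itemsets` and the toy baskets are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def frequent_itemsets(baskets, min_support):
    """Level-wise search; itemsets below min_support are dropped at each
    level, so larger candidates are only built from itemsets that passed."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Evaluate the interestingness measure for this level's candidates.
        passed = {}
        for candidate in current:
            s = sum(candidate <= basket for basket in baskets) / n
            if s >= min_support:
                passed[candidate] = s
        result.update(passed)
        # Build the next level only from survivors of this level.
        next_level = set()
        for a, b in combinations(list(passed), 2):
            union = a | b
            if len(union) != len(a) + 1:
                continue
            # Every subset one item smaller must itself have passed.
            if all(frozenset(sub) in passed
                   for sub in combinations(union, len(a))):
                next_level.add(union)
        current = list(next_level)
    return result

found = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Filtering only after mining would give the same final answer here, but the search would have to enumerate and count many more candidates first.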

4. User interface:

This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based
on the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.

Data Mining Process:

Data mining is a process of discovering various models, summaries, and derived values from a
given collection of data.

The general experimental procedure adapted to data-mining problems involves the following
steps:

State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to come
up with a meaningful problem statement. Unfortunately, many application studies tend to
focus on the data-mining technique at the expense of a clear problem statement. In this
step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain and a data-mining model. In practice, it
usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the initial
phase; it continues during the entire data-mining process.

Collect the data

This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and implicitly
given in the data-collection procedure. It is very important, however, to understand how
data collection affects its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later
for testing and applying a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully used in a final application
of the results.
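
A common way to respect the "same sampling distribution" requirement is to draw the estimation set and the testing set at random from one collected data set. The sketch below assumes that a single shuffle followed by a cut is acceptable for the data at hand. The function name `split_train_test`, the 25% test fraction, and the integer records are all illustrative.

```python
import random

def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle once, then cut, so both parts are random draws
    from the same collected data set."""
    rng = random.Random(seed)          # fixed seed for a repeatable split
    shuffled = list(records)           # copy; the input is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))             # stand-in for 100 collected samples
train, test = split_train_test(records)
print(len(train), len(test))
```

A split like this does not help if the data met later in the application are generated differently from the collected data, which is the failure case the paragraph above warns about.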

Preprocessing the data

In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or

b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps
such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and the other with the range [−100, 1000] will not have the same weights
in the applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight
for further analysis. Also, application-specific encoding methods usually achieve
dimensionality reduction by providing a smaller number of informative features for
subsequent data modeling.

These two classes of preprocessing tasks are only illustrative examples of a large
spectrum of preprocessing activities in a data-mining process.
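
Strategy (a) for outliers, detect and then remove, might look like the sketch below. The notes do not prescribe a detection rule. The 1.5 × IQR fence used here is a widely used convention chosen as an assumption, and the measurements are made-up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Split values into (kept, outliers) using fences at
    Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# Hypothetical measurements; 95 looks like a recording error.
measurements = [10, 12, 11, 13, 12, 11, 10, 95]
kept, outliers = iqr_outliers(measurements)
print(kept)
print(outliers)
```

Whether a flagged value is an error or a natural but abnormal value still has to be judged with domain knowledge before it is removed.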
Data-preprocessing steps should not be considered completely independent from other
data-mining phases. In every iteration of the data-mining process, all activities, together,
could define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.
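
The scaling example with ranges [0, 1] and [−100, 1000] can be sketched with min-max scaling, which maps each feature onto [0, 1]. Min-max is only one possible choice, picked here as an assumption, and the feature values are made-up.

```python
def min_max_scale(values):
    """Linearly map values onto [0, 1]; a constant feature maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

feature_a = [0.0, 0.25, 1.0]           # already within [0, 1]
feature_b = [-100.0, 450.0, 1000.0]    # within [-100, 1000]

scaled_a = min_max_scale(feature_a)
scaled_b = min_max_scale(feature_b)
print(scaled_a)
print(scaled_b)
```

After scaling, both features span the same range, so a distance-based technique no longer lets the second feature dominate purely because of its units.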

Estimate the model

The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.
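
The idea of fitting several models and then selecting the best one can be sketched as follows. Both candidate models (a constant mean and a least-squares line), the toy data, and the choice of held-out mean squared error as the selection criterion are assumptions made for illustration.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of model predictions against ys."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_best(candidates, train, valid):
    """Fit every candidate on train, score on valid, return the winner."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical data that roughly follows y = 2x + 1.
train = ([0, 1, 2, 3, 4], [1.0, 3.1, 4.9, 7.2, 9.0])
valid = ([5, 6], [11.1, 12.9])

best, scores = select_best({"mean": fit_mean, "line": fit_line}, train, valid)
print(best, scores)
```

Scoring on data that were not used for fitting is what makes the comparison between candidates meaningful.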

Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific
techniques to validate the results. A user does not want hundreds of pages of numeric
results. Users cannot understand them, nor can they summarize, interpret, and use them for
successful decision making.

Figure: The Data Mining Process

Classification of Data mining Systems:

The data mining system can be classified according to the following criteria:

Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines

Some Other Classification Criteria:

Classification according to kind of databases mined
Classification according to kind of knowledge mined
Classification according to kinds of techniques utilized
Classification according to applications adapted

Classification according to kind of databases mined

We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria, such as data models or types of data, and the
data mining system can be classified accordingly. For example, if we classify the database
according to data model, we may have a relational, transactional, object-relational, or data
warehouse mining system.

Classification according to kind of knowledge mined

We can classify a data mining system according to the kind of knowledge mined. This means
data mining systems are classified on the basis of functionalities such as:

Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis

Classification according to kinds of techniques utilized

We can classify a data mining system according to the kinds of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.

Classification according to applications adapted

We can classify a data mining system according to the applications it is adapted to. These
applications are as follows:

Finance
Telecommunications
DNA
Stock Markets
E-mail

Major Issues In Data Mining:

Mining different kinds of knowledge in databases. - Different users have different needs and
may be interested in different kinds of knowledge. Therefore, data mining needs to cover a
broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive so that users can focus the search for patterns, providing and
refining data mining requests based on returned results.

Incorporation of background knowledge. - Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but also
at multiple levels of abstraction.

Data mining query languages and ad hoc data mining. - A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable by users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noise and incomplete objects while mining data regularities. Without such methods, the
accuracy of the discovered patterns will be poor.

Pattern evaluation. - This refers to assessing the interestingness of the discovered patterns.
Discovered patterns may be uninteresting because they either represent common knowledge or
lack novelty.

Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be
efficient and scalable.

Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate
the development of parallel and distributed data mining algorithms. These algorithms divide
the data into partitions, which are processed in parallel. The results from the partitions are
then merged. Incremental algorithms incorporate database updates without having to mine the
data again from scratch.
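
The partition, process, and merge pattern, along with an incremental update, can be sketched on a simple item-counting task. The task, the function names, and the use of a thread pool are all illustrative assumptions. Threads are used only to show the structure, and a real CPU-bound workload would more likely use separate processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Divide the records into n_parts roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]

def count_items(part):
    """Mine one partition: count the baskets that contain each item."""
    counts = Counter()
    for basket in part:
        counts.update(set(basket))
    return counts

def parallel_counts(records, n_parts=3):
    """Process the partitions concurrently, then merge the partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def incremental_update(total, new_records):
    """Fold newly arrived records into existing counts without a re-scan."""
    total.update(count_items(new_records))
    return total

# Hypothetical baskets.
records = [["a", "b"], ["a"], ["b", "c"], ["a", "c"], ["c"]]
total = parallel_counts(records)
print(dict(total))

total = incremental_update(total, [["a", "d"]])
print(dict(total))
```

Counting merges cleanly because partial counts simply add up. Mining methods whose partial results do not combine this easily need a more careful merge step.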

Knowledge Discovery in Databases(KDD)

The following diagram shows the knowledge discovery process:

Figure: Architecture of KDD

Some people treat data mining as synonymous with knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process:

Data Cleaning - In this step, noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, the mined knowledge is presented to the user.
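
The seven steps can be read as a pipeline in which each stage feeds the next. The sketch below uses one small function per step on made-up sales records. The record layout, function names, and thresholds are hypothetical, and the "mining" stage is a simple ranking that stands in for a real method.

```python
from collections import Counter

# Hypothetical raw data from two sources.
store_a = [
    {"customer": "c1", "item": "bread", "qty": 2},
    {"customer": "c2", "item": "milk", "qty": None},   # incomplete record
    {"customer": "c1", "item": "milk", "qty": 1},
]
store_b = [
    {"customer": "c3", "item": "bread", "qty": 1},
    {"customer": "c3", "item": "bread", "qty": -5},    # inconsistent record
    {"customer": "c4", "item": "eggs", "qty": 12},
]

def clean(rows):
    """Data cleaning: drop rows with a missing or non-positive quantity."""
    return [r for r in rows if isinstance(r["qty"], int) and r["qty"] > 0]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [row for source in sources for row in source]

def select(rows, items):
    """Data selection: keep only rows relevant to the analysis task."""
    return [r for r in rows if r["item"] in items]

def transform(rows):
    """Data transformation: aggregate quantity sold per item."""
    totals = Counter()
    for r in rows:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Data mining (stand-in): rank items by total quantity sold."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep patterns that meet a threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{item}: {qty} units sold" for item, qty in patterns]

rows = integrate(clean(store_a), clean(store_b))
selected = select(rows, {"bread", "milk"})
report = present(evaluate(mine(transform(selected)), min_qty=2))
print(report)
```

In practice the steps are revisited iteratively rather than run once in a straight line.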