Data Management: Design data architecture and manage the data for analysis; understand various sources of data such as sensors, signals, GPS, etc.; Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Processing & Pre-processing.
• Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to describe the relationships between entities and their attributes.
• Logical model:
It is a model where the problem is represented in a logical form such as rows and columns of data, classes, XML tags, and other DBMS techniques.
• Physical model:
The physical model holds the database design, such as which type of database technology is suitable for the architecture.
Data Architect:
• A data architect is responsible for the design, creation, management, and deployment of the data architecture and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies, business requirements, the technology used, economics, and data processing needs.
• Business requirements:
These include factors such as the expansion of the business, the performance of system access, data management, transaction management, making use of raw data by converting it into image files and records, and then storing it in data warehouses. Data warehouses are the main means of storing business transactions.
• Business policies:
Policies are rules that describe the way data is processed. These policies are made by internal organizational bodies and other government agencies.
• Technology in use:
This includes drawing on previously completed data architecture designs as well as existing licensed software purchases and database technology.
• Business economics:
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design of the architecture.
• Data processing needs:
These include factors such as data mining, large continuous transactions, database management, and other data pre-processing needs.
Data Management:
• Data management is the process of managing tasks like extracting, storing, transferring, processing, and then securing data with low cost consumption. The main motive of data management is to manage and safeguard people's and organizations' data in an optimal way so that they can easily create, access, delete, and update the data.
• Data management is an essential process for the growth of every enterprise; without it, policies and decisions cannot be made for business advancement. The better the data management, the better the productivity of the business.
• Large volumes of data, such as big data, are hard to manage with traditional approaches, so optimal technologies and tools such as Hadoop, Scala, Tableau, AWS, etc. must be used for data management; these can further be used for big data analysis to discover patterns and improvements.
• Data management is achieved by training employees appropriately and through maintenance by DBAs, data analysts, and data architects.
Data Collection:
• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or other image files, for use in later stages of data analysis.
• In the process of data analysis, data collection is the initial step before starting to analyze the patterns or useful information in the data.
• The data to be analyzed must be collected from different valid sources.
• The data collected is known as raw data, which is not useful by itself; cleaning the impurities and utilizing that data for further analysis produces information, and the information obtained is known as "knowledge".
• The main goal of data collection is to collect information-rich data.
• Data collection starts with asking questions such as what type of data is to be collected and what the source of collection is.
Various sources of Data:
The data sources are divided mainly into two types known as:
1. Primary data
2. Secondary data
1. Primary data:
Data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise, it becomes a burden in data processing.
A few methods of collecting primary data:
1. Interview method:
The data in this method is collected by interviewing the target audience; the person conducting the interview is called the interviewer and the person who answers is the interviewee. Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, such as personal interviews or formal interviews conducted by telephone, face to face, email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be conducted in both online and offline modes, for example through website forms and email. The survey answers are then stored for data analysis. Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using a data collection tool and stores the observed data in the form of text, audio, video, or other raw formats. In this method, the data is collected directly by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
• CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
• RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA); a small ANOVA sketch follows this list. RBD originated in the agriculture sector.
• LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but contains rows and columns. It is an arrangement of N×N squares with an equal number of rows and columns, in which each letter occurs only once in each row.
• FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with several possible values, and by performing trials the other factor combinations are derived.
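As a rough illustration of how data from a Completely Randomized Design might be analysed, the sketch below runs a one-way ANOVA with SciPy; the treatment names and measurements are made up for illustration.

```python
# Minimal sketch: one-way ANOVA for a Completely Randomized Design (CRD).
# The three treatment groups and their measurements are hypothetical.
from scipy import stats

treatment_a = [20.1, 21.3, 19.8, 20.7]   # yields under treatment A
treatment_b = [22.4, 23.0, 21.9, 22.8]   # yields under treatment B
treatment_c = [19.5, 18.9, 20.2, 19.1]   # yields under treatment C

# f_oneway tests whether the group means differ by more than random variation explains
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the treatments differ; for an RBD, a two-way ANOVA that also accounts for the blocking factor would be used instead.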
2. Secondary data:
Secondary data is data which has already been collected and is reused again for some valid purpose. This type of data is previously recorded from primary data, and it comes from two types of sources: internal and external.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption are lower when obtaining data from internal sources.
External source:
Data which cannot be found within the organization and is gained through external third-party resources is external source data. The cost and time consumption are higher because this involves a huge amount of data. Examples of external sources are government publications, news publications, and the Registrar General of India.
Other sources:
• Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
• Satellite data: Satellites collect large amounts of images and data, in terabytes per day, through surveillance cameras, which can be used to extract useful information.
• Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data through the most frequently searched keywords and queries.
Data Quality (noise, outliers, missing values, duplicate data):
• Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability, and whether it is up to date.
• Measuring data quality levels can help organizations identify data errors that need to be resolved.
Measurement Error:
• It refers to any problem resulting from the measurement process. Ex: hardware failure.
• For continuous attributes, the numerical difference between the measured and true value is called the error.
Data Collection Error:
• It refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
Noise:
• Noise is the random component of a measurement error. Its elimination is very difficult, hence we focus on implementing robust algorithms that produce acceptable results even when noise is present. Ex: speech processing, image processing.
Artifacts:
• Deterministic distortions of the data, such as a streak in a photograph, are called artifacts.
Precision:
• The closeness of repeated measurements (of the same quantity) to one another. Precision is often measured by the standard deviation of a set of values.
Bias:
• A systematic variation of the measurements from the quantity being measured.
Accuracy:
• The closeness of measurements to the true value of the quantity being measured.
• Accuracy depends on precision and bias.
• Loosely, accuracy reflects the degree of measurement error in the data: Accuracy = 1 − error.
Outliers:
• Outliers are data objects that have characteristics considerably different from most of the other data objects, or attribute values that are unusual with respect to the typical values for that attribute.
Missing Values:
• Values may be missing because the information was not collected or because some attributes are not applicable to all objects. Strategies include eliminating the affected objects or attributes, estimating the missing values, or ignoring the missing values during analysis.
Inconsistent Values:
• Data can contain inconsistent values. Consider an address field where both a zip code and city are listed, but the specified zip code area is not contained in that city.
• The correction of an inconsistency requires additional or redundant information.
Duplicate Data:
• A data set may include data objects that are duplicates, or almost duplicates. To detect and eliminate such duplicates, two main issues must be addressed.
• First, if two objects actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
• Second, care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates.
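The following pandas sketch shows how some of these quality issues might be surfaced in practice; the table, column names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical customer data containing a missing value, a duplicate row, and a suspect age
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [25.0, 31.0, 31.0, None, 220.0],
    "city":        ["Pune", "Delhi", "Delhi", "Mumbai", "Pune"],
})

print(df.isna().sum())          # missing values per column
print(df.duplicated().sum())    # number of exact duplicate rows

# Flag outliers in 'age' with a simple 1.5*IQR fence (one common rule of thumb)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)                 # rows whose age falls outside the fences

clean = df.drop_duplicates().dropna()   # one possible clean-up: drop duplicates, then rows with missing values
```

Real cleaning decisions depend on the application; estimating missing values or resolving near-duplicates often matters more than simply dropping rows.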
Data Quality Issues related to Applications:
• Issues such as the timeliness of the data, its relevance to the task at hand, and knowledge about the data (documentation of its meaning and quality) determine how usable the data is for a specific application.
Data Pre-processing:
• The steps that are applied to make data suitable for analysis and mining are called data pre-processing.
Techniques:
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature Subset Selection
• Feature Generation / Feature Creation
• Discretization and Binarization
• Variable Transformation
Aggregation:
• The process of combining records based on some criterion, such as sales by year or sales by location, is called data aggregation.
• Quantitative attributes such as price are typically aggregated by taking a sum or average.
• Qualitative attributes such as item can either be omitted or summarized as the set of all the items that were sold at that location.
• Aggregation can act as a change of scope or scale by providing a higher-level view instead of a lower-level view.
Ex: Daily sales report to annual sales report.
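A small pandas sketch of aggregating daily sales up to a yearly view; the table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({
    "date":     pd.to_datetime(["2023-01-05", "2023-03-17", "2024-02-02", "2024-06-30"]),
    "location": ["Pune", "Pune", "Delhi", "Pune"],
    "item":     ["pen", "book", "pen", "pencil"],
    "price":    [10.0, 250.0, 12.0, 5.0],
})

# Quantitative 'price' is summed; qualitative 'item' is summarized as the set of items sold
yearly = sales.groupby([sales["date"].dt.year, "location"]).agg(
    total_sales=("price", "sum"),
    items_sold=("item", lambda s: set(s)),
)
print(yearly)
```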
Disadvantages of Aggregation:
• The potential disadvantage of aggregation is that interesting details may be lost; for example, aggregating daily sales into yearly totals hides which day of the week had the highest sales.
Sampling:
• Sampling is a commonly used approach for selecting a subset of the data objects to be analysed.
Key Principle for Effective Sampling:
• Using a sample will work almost as well as using the entire data set if the sample is representative, i.e., if it has approximately the same property of interest as the original data. Common techniques are simple random sampling (with or without replacement), stratified sampling, and progressive (adaptive) sampling.
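A minimal pandas sketch of simple random sampling, with and without replacement; the data set and sample size are made up for illustration.

```python
import pandas as pd

# Hypothetical data set of 10,000 records
data = pd.DataFrame({"value": range(10_000)})

without_replacement = data.sample(n=500, random_state=42)                # simple random sampling
with_replacement    = data.sample(n=500, replace=True, random_state=42)  # sampling with replacement

# If the sample is representative, its mean should be close to the population mean
print(data["value"].mean(), without_replacement["value"].mean())
```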
Dimensionality Reduction:
• As the number of attributes increases, the performance of the model decreases. This scenario is called the curse of dimensionality.
• Data mining algorithms work better if the dimensionality is lower.
• Dimensionality reduction helps eliminate irrelevant features and reduce noise.
• Two common techniques for dimensionality reduction are:
o Principal Component Analysis (PCA)
o Singular Value Decomposition (SVD)
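A minimal scikit-learn sketch of reducing dimensionality with PCA; the data here is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 objects described by 10 attributes

pca = PCA(n_components=2)             # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```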
Feature Subset Selection:
• The reduction of dimensionality by selecting attributes that are a subset of the original attributes is called Feature Subset Selection.
• Redundant features duplicate much or all of the information contained in one or more other attributes. Ex: purchase price of a product and the amount of sales tax paid.
• Irrelevant features contain almost no useful information for the data mining task at hand. Ex: a student's ID number and their grade.
• Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.
Three standard approaches to feature selection:
1. Embedded approaches: Feature selection occurs as part of the data mining algorithm. Ex: Decision Tree classifier.
2. Filter approaches: Features are selected before the data mining algorithm is run, using an approach that is independent of the data mining task. Ex: selecting sets of attributes whose pairwise correlation is low (see the sketch after this list).
3. Wrapper approaches: These methods use the target data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets.
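A rough sketch of a filter-style selection step: drop one attribute from every highly correlated pair before any mining algorithm runs. The columns, correlation threshold, and data are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price":     price,
    "sales_tax": price * 0.08,               # redundant: perfectly correlated with price
    "weight":    rng.uniform(0.1, 5, size=200),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle, excluding the diagonal
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)                  # ['sales_tax']
reduced = df.drop(columns=to_drop)
```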
• Feature Weighting: It is an alternative to keeping or eliminating features.
• More important features are assigned a higher weight, while less important features are given a lower weight.
Ex: Support Vector Machines.
Feature Creation:
• Creating a new set of attributes that captures the important information in a data set from the original attributes is called feature creation.
• Three methods:
1. Feature extraction. Ex: image processing.
2. Mapping data to a new space.
3. Feature construction. Ex: density from mass and volume.
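A tiny sketch of feature construction, deriving a density attribute from mass and volume; the objects and numbers are made up for illustration.

```python
import pandas as pd

objects = pd.DataFrame({
    "mass_g":     [19.3, 10.5, 38.6],
    "volume_cm3": [1.0, 1.0, 2.0],
})

# Constructed feature: density can separate materials even when mass and volume alone do not
objects["density_g_cm3"] = objects["mass_g"] / objects["volume_cm3"]
print(objects)
```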
Discretization and Binarization:
• The process of transforming a continuous attribute into a categorical attribute is called discretization. Ex: age into kids, youth, middle-aged, senior citizens.
• The equal width approach divides the range of the attribute into a user-specified number of intervals, each having the same width (see the sketch below).
• The equal depth (equal frequency) approach places the same number of objects into each interval.
• The process of transforming both continuous and discrete attributes into one or more binary attributes is called binarization.
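A short pandas sketch contrasting equal width and equal depth discretization, followed by binarization of the resulting categories; the ages and bin edges are invented for illustration.

```python
import pandas as pd

ages = pd.Series([3, 8, 15, 22, 35, 41, 58, 63, 70, 80])

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal width
equal_depth = pd.qcut(ages, q=4)      # 4 intervals holding (roughly) equal numbers of objects

# Explicit, named bins matching the age example in the notes
labeled = pd.cut(ages, bins=[0, 12, 30, 60, 120],
                 labels=["Kids", "Youth", "Middle Aged", "Senior Citizens"])
print(labeled.value_counts())

# Binarization: turn the categorical attribute into one binary attribute per category
print(pd.get_dummies(labeled))
```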
Variable Transformation:
• A variable transformation is a transformation applied to all the values of a variable. Examples are simple functions such as x^k, log x, e^x, and |x|, as well as standardization or normalization, which rescale a variable, e.g., to zero mean and unit standard deviation.
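A brief NumPy sketch of two common variable transformations, a log transform for a skewed variable and standardization; the income values are made up for illustration.

```python
import numpy as np

income = np.array([20_000, 35_000, 50_000, 120_000, 900_000], dtype=float)

log_income = np.log(income)                              # compress a heavily skewed variable
standardized = (income - income.mean()) / income.std()   # rescale to zero mean, unit standard deviation

print(log_income.round(2))
print(standardized.round(2))
```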