Data Management: Design data architecture and manage the data for analysis; understand various sources of data such as sensors, signals, GPS, etc.; Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Processing & Pre-processing.
• Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to describe the relationships between entities and their attributes.
• Logical model:
It is a model where the problem is represented in a logical form such as rows and columns of data, classes, XML tags, and other DBMS techniques.
• Physical model:
The physical model holds the database design, such as which type of database technology is suitable for the architecture.
Data Architect:
• A data architect is responsible for the design, creation, management, and deployment of the data architecture and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies, business requirements, the technology used, economics, and data processing needs.
• Business requirements:
These include factors such as the expansion of the business, the performance of system access, data management, transaction management, making use of raw data by converting it into image files and records, and then storing it in data warehouses. Data warehouses are the main means of storing business transactions.
• Business policies:
Policies are rules that describe the way data is processed. These policies are made by internal organizational bodies and other government agencies.
• Technology in use:
This includes drawing on previously completed data architecture designs as well as existing licensed software purchases and database technology.
• Business economics:
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design of the architecture.
• Data processing needs:
These include factors such as data mining, large continuous transactions, database management, and other data pre-processing needs.
Data Management:
• Data management is the process of managing tasks like extracting, storing, transferring, processing, and then securing data with low cost consumption. The main motive of data management is to manage and safeguard people's and organizations' data in an optimal way so that they can easily create, access, delete, and update the data.
• Data management is an essential process for the growth of every enterprise; without it, policies and decisions cannot be made for business advancement. The better the data management, the better the productivity of the business.
• Large volumes of data, such as big data, are hard to manage with traditional approaches, so optimal technologies and tools such as Hadoop, Scala, Tableau, AWS, etc. must be used for data management; these can further be used for big data analysis to discover patterns and improvements.
• Data management is achieved by training employees appropriately and through maintenance by DBAs, data analysts, and data architects.
Data Collection:
• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or other image files, for use in later stages of data analysis.
• In the process of data analysis, data collection is the initial step before starting to analyze the patterns or useful information in the data.
• The data to be analyzed must be collected from different valid sources.
• The data collected is known as raw data, which is not useful by itself; cleaning the impurities and utilizing that data for further analysis produces information, and the information obtained is known as "knowledge".
• The main goal of data collection is to collect information-rich data.
• Data collection starts with asking questions such as what type of data is to be collected and what the source of collection is.
Various sources of Data:
The data sources are divided mainly into two types known as:
1. Primary data
2. Secondary data
1. Primary data:
Data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise, it becomes a burden in data processing.
A few methods of collecting primary data:
1. Interview method:
The data in this method is collected by interviewing the target audience; the person conducting the interview is called the interviewer and the person who answers is the interviewee. Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, such as personal interviews or formal interviews conducted by telephone, face to face, email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be conducted in both online and offline modes, for example through website forms and email. The survey answers are then stored for data analysis. Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using a data collection tool and stores the observed data in the form of text, audio, video, or other raw formats. In this method, the data is collected directly by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
• CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
• RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA); a small ANOVA sketch follows this list. RBD originated in the agriculture sector.
• LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but contains rows and columns. It is an arrangement of N×N squares with an equal number of rows and columns, in which each letter occurs only once in each row.
• FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with several possible values, and by performing trials the other factor combinations are derived.
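As a rough illustration of how data from a Completely Randomized Design might be analysed, the sketch below runs a one-way ANOVA with SciPy; the treatment names and measurements are made up for illustration.

```python
# Minimal sketch: one-way ANOVA for a Completely Randomized Design (CRD).
# The three treatment groups and their measurements are hypothetical.
from scipy import stats

treatment_a = [20.1, 21.3, 19.8, 20.7]   # yields under treatment A
treatment_b = [22.4, 23.0, 21.9, 22.8]   # yields under treatment B
treatment_c = [19.5, 18.9, 20.2, 19.1]   # yields under treatment C

# f_oneway tests whether the group means differ by more than random variation explains
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the treatments differ; for an RBD, a two-way ANOVA that also accounts for the blocking factor would be used instead.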
2. Secondary data:
Secondary data is data which has already been collected and is reused again for some valid purpose. This type of data is previously recorded from primary data, and it comes from two types of sources: internal and external.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption are lower when obtaining data from internal sources.
External source:
Data which cannot be found within the organization and is gained through external third-party resources is external source data. The cost and time consumption are higher because this involves a huge amount of data. Examples of external sources are government publications, news publications, and the Registrar General of India.
Other sources:
• Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
• Satellite data: Satellites collect large amounts of images and data, in terabytes per day, through surveillance cameras, which can be used to extract useful information.
• Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data through the most frequently searched keywords and queries.
Data Quality (noise, outliers, missing values, duplicate data):
• Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability, and whether it is up to date.
• Measuring data quality levels can help organizations identify data errors that need to be resolved.
Measurement Error:
• It refers to any problem resulting from the measurement process. Ex: hardware failure.
• For continuous attributes, the numerical difference between the measured and true value is called the error.
Data Collection Error:
• It refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
Noise:
• Noise is the random component of a measurement error. Its elimination is very difficult, hence we focus on implementing robust algorithms that produce acceptable results even when noise is present. Ex: speech processing, image processing.
Artifacts:
• Deterministic distortions of the data, such as a streak in a photograph, are called artifacts.
Precision:
• The closeness of repeated measurements (of the same quantity) to one another. Precision is often measured by the standard deviation of a set of values.
Bias:
• A systematic variation of the measurements from the quantity being measured.
Accuracy:
• The closeness of measurements to the true value of the quantity being measured.
• Accuracy depends on precision and bias.
• Loosely, accuracy reflects the degree of measurement error in the data: Accuracy = 1 − error.
Outliers:
• Outliers are data objects that have characteristics considerably different from most of the other data objects, or attribute values that are unusual with respect to the typical values for that attribute.
Missing Values:
• Values may be missing because the information was not collected or because some attributes are not applicable to all objects. Strategies include eliminating the affected objects or attributes, estimating the missing values, or ignoring the missing values during analysis.
Inconsistent Values:
• Data can contain inconsistent values. Consider an address field where both a zip code and city are listed, but the specified zip code area is not contained in that city.
• The correction of an inconsistency requires additional or redundant information.
Duplicate Data:
• A data set may include data objects that are duplicates, or almost duplicates. To detect and eliminate such duplicates, two main issues must be addressed.
• First, if two objects actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
• Second, care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates.
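The following pandas sketch shows how some of these quality issues might be surfaced in practice; the table, column names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical customer data containing a missing value, a duplicate row, and a suspect age
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [25.0, 31.0, 31.0, None, 220.0],
    "city":        ["Pune", "Delhi", "Delhi", "Mumbai", "Pune"],
})

print(df.isna().sum())          # missing values per column
print(df.duplicated().sum())    # number of exact duplicate rows

# Flag outliers in 'age' with a simple 1.5*IQR fence (one common rule of thumb)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)                 # rows whose age falls outside the fences

clean = df.drop_duplicates().dropna()   # one possible clean-up: drop duplicates, then rows with missing values
```

Real cleaning decisions depend on the application; estimating missing values or resolving near-duplicates often matters more than simply dropping rows.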
Data Quality Issues related to Applications:
• Issues such as the timeliness of the data, its relevance to the task at hand, and knowledge about the data (documentation of its meaning and quality) determine how usable the data is for a specific application.
Data Pre-processing:
• The steps that are applied to make data suitable for analysis and mining are called data pre-processing.
Techniques:
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature Subset Selection
• Feature Generation / Feature Creation
• Discretization and Binarization
• Variable Transformation
Aggregation:
• The process of combining records based on some criterion, such as sales by year or sales by location, is called data aggregation.
• Quantitative attributes such as price are typically aggregated by taking a sum or average.
• Qualitative attributes such as item can either be omitted or summarized as the set of all the items that were sold at that location.
• Aggregation can act as a change of scope or scale by providing a higher-level view instead of a lower-level view.
Ex: Daily sales report to annual sales report.
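A small pandas sketch of aggregating daily sales up to a yearly view; the table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({
    "date":     pd.to_datetime(["2023-01-05", "2023-03-17", "2024-02-02", "2024-06-30"]),
    "location": ["Pune", "Pune", "Delhi", "Pune"],
    "item":     ["pen", "book", "pen", "pencil"],
    "price":    [10.0, 250.0, 12.0, 5.0],
})

# Quantitative 'price' is summed; qualitative 'item' is summarized as the set of items sold
yearly = sales.groupby([sales["date"].dt.year, "location"]).agg(
    total_sales=("price", "sum"),
    items_sold=("item", lambda s: set(s)),
)
print(yearly)
```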
Disadvantages of Aggregation:
• The potential disadvantage of aggregation is that interesting details may be lost; for example, aggregating daily sales into yearly totals hides which day of the week had the highest sales.
Sampling:
• Sampling is a commonly used approach for selecting a subset of the data objects to be analysed.
Key Principle for Effective Sampling:
• Using a sample will work almost as well as using the entire data set if the sample is representative, i.e., if it has approximately the same property of interest as the original data. Common techniques are simple random sampling (with or without replacement), stratified sampling, and progressive (adaptive) sampling.
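A minimal pandas sketch of simple random sampling, with and without replacement; the data set and sample size are made up for illustration.

```python
import pandas as pd

# Hypothetical data set of 10,000 records
data = pd.DataFrame({"value": range(10_000)})

without_replacement = data.sample(n=500, random_state=42)                # simple random sampling
with_replacement    = data.sample(n=500, replace=True, random_state=42)  # sampling with replacement

# If the sample is representative, its mean should be close to the population mean
print(data["value"].mean(), without_replacement["value"].mean())
```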
Dimensionality Reduction:
• As the number of attributes increases, the performance of the model decreases. This scenario is called the curse of dimensionality.
• Data mining algorithms work better if the dimensionality is lower.
• Dimensionality reduction helps eliminate irrelevant features and reduce noise.
• Two common techniques for dimensionality reduction are:
o Principal Component Analysis (PCA)
o Singular Value Decomposition (SVD)
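A minimal scikit-learn sketch of reducing dimensionality with PCA; the data here is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 objects described by 10 attributes

pca = PCA(n_components=2)             # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```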
Feature Subset Selection:
• The reduction of dimensionality by selecting attributes that are a subset of the original attributes is called Feature Subset Selection.
• Redundant features duplicate much or all of the information contained in one or more other attributes. Ex: purchase price of a product and the amount of sales tax paid.
• Irrelevant features contain almost no useful information for the data mining task at hand. Ex: a student's ID number and their grade.
• Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.
Three standard approaches to feature selection:
1. Embedded approaches: Feature selection occurs as part of the data mining algorithm. Ex: Decision Tree classifier.
2. Filter approaches: Features are selected before the data mining algorithm is run, using an approach that is independent of the data mining task. Ex: selecting sets of attributes whose pairwise correlation is low (see the sketch after this list).
3. Wrapper approaches: These methods use the target data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets.
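A rough sketch of a filter-style selection step: drop one attribute from every highly correlated pair before any mining algorithm runs. The columns, correlation threshold, and data are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price":     price,
    "sales_tax": price * 0.08,               # redundant: perfectly correlated with price
    "weight":    rng.uniform(0.1, 5, size=200),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle, excluding the diagonal
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)                  # ['sales_tax']
reduced = df.drop(columns=to_drop)
```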
• Feature Weighting: It is an alternative to keeping or eliminating features.
• More important features are assigned a higher weight, while less important features are given a lower weight.
Ex: Support Vector Machines.
Feature Creation:
• Creating a new set of attributes that captures the important information in a data set from the original attributes is called feature creation.
• Three methods:
1. Feature extraction. Ex: image processing.
2. Mapping data to a new space.
3. Feature construction. Ex: density from mass and volume.
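A tiny sketch of feature construction, deriving a density attribute from mass and volume; the objects and numbers are made up for illustration.

```python
import pandas as pd

objects = pd.DataFrame({
    "mass_g":     [19.3, 10.5, 38.6],
    "volume_cm3": [1.0, 1.0, 2.0],
})

# Constructed feature: density can separate materials even when mass and volume alone do not
objects["density_g_cm3"] = objects["mass_g"] / objects["volume_cm3"]
print(objects)
```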
Discretization and Binarization:
• The process of transforming a continuous attribute into a categorical attribute is called discretization. Ex: age into kids, youth, middle-aged, senior citizens.
• The equal width approach divides the range of the attribute into a user-specified number of intervals, each having the same width (see the sketch below).
• The equal depth (equal frequency) approach places the same number of objects into each interval.
• The process of transforming both continuous and discrete attributes into one or more binary attributes is called binarization.
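A short pandas sketch contrasting equal width and equal depth discretization, followed by binarization of the resulting categories; the ages and bin edges are invented for illustration.

```python
import pandas as pd

ages = pd.Series([3, 8, 15, 22, 35, 41, 58, 63, 70, 80])

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal width
equal_depth = pd.qcut(ages, q=4)      # 4 intervals holding (roughly) equal numbers of objects

# Explicit, named bins matching the age example in the notes
labeled = pd.cut(ages, bins=[0, 12, 30, 60, 120],
                 labels=["Kids", "Youth", "Middle Aged", "Senior Citizens"])
print(labeled.value_counts())

# Binarization: turn the categorical attribute into one binary attribute per category
print(pd.get_dummies(labeled))
```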
Variable Transformation:
• A variable transformation is a transformation applied to all the values of a variable. Examples are simple functions such as x^k, log x, e^x, and |x|, as well as standardization or normalization, which rescale a variable, e.g., to zero mean and unit standard deviation.
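A brief NumPy sketch of two common variable transformations, a log transform for a skewed variable and standardization; the income values are made up for illustration.

```python
import numpy as np

income = np.array([20_000, 35_000, 50_000, 120_000, 900_000], dtype=float)

log_income = np.log(income)                              # compress a heavily skewed variable
standardized = (income - income.mean()) / income.std()   # rescale to zero mean, unit standard deviation

print(log_income.round(2))
print(standardized.round(2))
```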