Summary - Data Quality

Chapter 1

Data quality (DQ) is a multidimensional concept. It is the state of qualitative or quantitative pieces of information; data is of high quality if it is fit for its intended uses in operation, decision making and planning.

Metadata is "data that provides information about other data". It describes:
1. what data is intended to represent (definition of terms and research/business rules)
2. how data effects this representation: conventions, including the physical data definition (format, field sizes, data types, etc.)
3. the system design and system processing
4. the limits of that representation (what the data does not represent)
5. how data is used, and how it can be used

Chapter 2

Measuring data quality dimensions serves to understand and analyze data quality problems and to resolve or minimize them.

The 6 most common dimensions:
1. Completeness -> the capability of representing all and only the relevant aspects of the reality of interest; data are present or absent
2. (In)Accuracy -> the quality or state of being free from error, in content and form
3. Consistency -> the ability to belong together without contradiction
4. Validity -> the quality of being well-grounded, sound, or correct
5. Timeliness -> the state of being appropriate or adapted to the time or the occasion
6. Uniqueness -> the state of being the only one
Each data quality dimension captures one measurable aspect of data quality. Measurement is objective when it is based on quantitative metrics:
A) Task-independent metrics reflect states of the data without contextual knowledge and can be applied to any data set, regardless of the tasks at hand.
B) Task-dependent metrics are developed in specific application contexts; included are an organization's business rules, company and government regulations, etc.

Completeness is usually measured as a percentage: the proportion of the stored data versus the potential of 100% complete. Questions: Is all the necessary information available? Are critical data values missing in the data set? Are all the data sets recorded? Are all mandatory data items recorded?

Exercise -> The data set consists of 6 variables; the sample size is n = 99. In total there are 4 missings: 3 for Final and 1 for TakeHome.

Accuracy is usually measured as a percentage: the proportion of the stored data versus the potential of 100% accuracy.

Exercise -> Without missing values the data set contains 28 values in total; one name value is not accurate, so just 27 values in the data set are accurate.

Accuracy % = (number of accurate values of a data element in the data set, excluding manifestations of missing values (27)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 27 × 100 / 28 = 96.43%

Inaccuracy % = (number of inaccurate values of a data element in the data set, excluding manifestations of missing values (1)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 1 × 100 / 28 = 3.57%
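A minimal R sketch of this accuracy calculation (the vectors stored and truth are hypothetical, not from the exercise):

    stored <- c("Anna", "Bob", "Clara", NA, "Dana")  # hypothetical stored values, one missing
    truth  <- c("Anna", "Bob", "Klara", NA, "Dana")  # hypothetical reference values
    ok <- !is.na(stored)                             # exclude manifestations of missing values
    accuracy   <- sum(stored[ok] == truth[ok]) / sum(ok) * 100  # accurate values / all values
    inaccuracy <- 100 - accuracy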
Chapter 3

EDA: Exploratory data analysis

What exactly is this thing called data wrangling? It's the ability to take a messy, unrefined source of data and wrangle it into something useful.

Key figures: Mean (arithmetic mean): very sensitive to outliers. Median: the middle value of a distribution; not sensitive to outliers.

Four levels of measurement: Nominal scale: each category is assigned a number (the code is arbitrary). Ordinal scale: each category is associated with an ordered number. Metric scales: the actual measured value is assigned. *A scale from a higher level can be transformed into a scale on a lower level, but not vice versa.

Graph types: Box plot: a method for graphically depicting groups of numerical data through their quartiles. Histogram: the distribution of numerical data. Scatter plot: two variables for a set of data. Mosaic plot: two or more qualitative variables.

Cumulative relative frequency shows that 55% of all data points belong to category 0 or 1: in category 0 are 24% of the data points, in category 1 31% of the data points.

Exercise -> Boxplot 1: Both distributions (men and women) seem to be right skewed. The median body weight of men is larger. That the distributions are right skewed is confirmed by the histograms for men and for women.

Boxplot 2: The distributions of random numbers are characterized by more or less strong right/left skewness, small or large spread, and outliers. 50% of the data points are larger than the median (0-1).
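A minimal R sketch of these basic graphs, using simulated (hypothetical) right-skewed data:

    set.seed(1)
    x <- rexp(100)   # hypothetical right-skewed random numbers
    boxplot(x)       # quartiles and potential outliers
    hist(x)          # distribution of the numerical data
    summary(x)       # mean vs. median: the mean is pulled upwards by outliers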

Chapter 4

Classification of missings: types

MCAR (Missing Completely at Random): Missing values are completely randomly distributed across all cases (persons, etc.). Cases with missing values do not differ from cases without missing values. Whether a value is missing from the data set is not related to any of the variables collected; there is no correlation of the occurrence of missing values with other variables.

MAR (Missing at Random): The occurrence of a missing value occurs conditionally at random and can be explained by the values of other variables. Persons with complete data differ from those with incomplete data.

MNAR (Missing Not at Random): Values are systematically missing, but no information is available to model their absence. There is no adequate statistical procedure to avoid bias.

Exercise -> MCAR missing pattern: The last row shows the total number of missing values for each variable and in total: 37 for Ozone and 7 for Solar.R, 44 in total. In 2 cases the pattern is such that there is a missing in both variables Solar.R and Ozone. The sample size is N = 153. The graph shows the same information, but additionally made visible graphically.
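A sketch of how such a missing pattern can be produced with the mice package, assuming the exercise uses R's built-in airquality data (which matches the reported counts):

    library(mice)
    md.pattern(airquality)  # pattern matrix; the last row shows the missing totals per variable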

Analysis and treatment of missing values

Deletion: 1) Listwise deletion (complete-case analysis): delete all rows that have a missing value. 2) Pairwise deletion (available-case analysis): considers all available data of a person; leads to different sample sizes for different variables.

Imputation: 1) Single imputation / unit imputation: missing values are replaced by the mean or median of the variable. MCAR setting: no bias; MAR and MNAR settings: bias possible (underestimation / overestimation). 2) Missing values are replaced by values derived from a regression analysis.

Imputation of type "mean": missings will be replaced by the mean value. Example "TakeHome": the missing value is replaced by the mean of the variable in the original data set.

Imputation of type "sample": "sample" means a random sample drawn from the observed values; each time you run an imputation of type "sample", you will get different results. Example "TakeHome": the missing value is replaced by a randomly drawn value, here 16.91.

Here a chi-squared test is conducted. The test is significant with a p-value of 0.00142, which is less than 0.05.

Exercise -> Missing analysis: the mice package is used to impute MAR values. It is one of the fastest packages and probably a gold standard for imputing values.
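A minimal sketch of these imputation types in R (the vector x is a hypothetical stand-in for the TakeHome variable; the last lines assume the mice package with its default predictive mean matching):

    x <- c(12, 18, NA, 21, 16)                           # hypothetical scores, one missing
    miss <- is.na(x)
    x_mean   <- replace(x, miss, mean(x, na.rm = TRUE))  # type "mean"
    x_sample <- replace(x, miss, sample(x[!miss], sum(miss), replace = TRUE))  # type "sample": differs per run

    library(mice)                             # model-based (multiple) imputation
    imp <- mice(airquality, m = 5, seed = 1)  # e.g. on the airquality data
    head(complete(imp, 1))                    # first completed data set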
Chapter 5

Anomaly detection

Point vs. contextual vs. collective outliers: Point outliers (global outliers) deviate extremely from well-defined norms or given concepts of expected behavior. Contextual outliers: a data object is extremely different in a specific context (but not in every context); each data object can be described by two kinds of attributes: 1) contextual attributes (date and location in the temperature example) and 2) behavioral attributes (temperature, humidity and pressure in the temperature example). Collective outliers: a group of data objects falls extremely far from the well-defined norms of a data set or from given concepts of expected behavior; such a collection is known as a collective outlier. Example: 100 delayed orders form a collective outlier.

Exploratory data analysis (EDA) for anomaly detection: 1) data set, 2) key figures and graphs, 3) box plot, 4) percentile method.

Calculation with the percentiles method: observations that lie outside the interval formed by the 2.5 and 97.5 percentiles will be considered as potential outliers.

Choosing a 5% proportion of all values to define the outliers gives the limits: the lower limit for outliers is 50 kg, the upper limit is 113.2 kg.

Exercise -> Visualization techniques: The data set contains n = 13908 data points, 1389 of these NA. Mean and median of the wtval variable are comparable; this indicates that the variable is approximately normally distributed. After removing the NA values and generating the histogram, potential outliers show up. In fact, it is a bimodal/normal distribution; the large values on the right side indicate that there could be outliers. There are 199 outliers according to the definition of the box plot.
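A sketch of the percentile method in R on simulated (hypothetical) body weights:

    set.seed(1)
    wt <- rnorm(1000, mean = 80, sd = 15)    # hypothetical body weights [kg]
    lims <- quantile(wt, c(0.025, 0.975))    # 2.5 and 97.5 percentiles
    out <- wt[wt < lims[1] | wt > lims[2]]   # potential outliers by the percentile method
    hist(wt)                                 # inspect the distribution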


Parametric models assume a specific family of distributions to describe the normal data: 1) univariate methods deal with one random variable at a time, so each variable has to be modelled independently using its own distribution function; 2) multivariate methods allow vectors of random variables to be modelled using the same distribution function.

Exercise -> Statistical tests

Outlier test according to Grubbs: Grubbs's test allows detecting whether the highest or the lowest value in a data set is an outlier. It assumes normally distributed data. The test has too little power for sample sizes n ≤ 6 and should not be executed then. The larger the sample, the less the p-value can be used as a measure of validity. Here the null hypothesis is rejected at the 5% significance level -> the highest value 184.3 is an outlier. The null hypothesis is kept at the 5% significance level -> the lowest value 35.6 is not an outlier.

Rosner's test can detect several outliers at once and solves the problem of masking (an outlier that is close in value to another outlier might go undetected). It assumes normally distributed data and is most appropriate for large sample sizes, n ≥ 20. The input k = 3 corresponds to the decision that there are 3 suspected outliers; simulations were not run for k > 10 or k > floor(n/2). "Outlier" shows that at least 10 potential outliers are detected in Rosner's test; the values range from 152.5 to 184.3.
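A sketch of both tests in R, assuming the outliers and EnvStats packages and simulated (hypothetical) data:

    library(outliers)  # provides grubbs.test
    library(EnvStats)  # provides rosnerTest
    set.seed(1)
    x <- c(rnorm(50, mean = 100, sd = 10), 184.3)  # hypothetical values with one extreme point
    grubbs.test(x)                   # tests whether the highest value is an outlier
    grubbs.test(x, opposite = TRUE)  # tests the lowest value instead
    rosnerTest(x, k = 3)             # tests for up to k = 3 outliers at once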

Chapter 6

Outliers can be identified as: 1) data points that do not fit well into the clustering of the normal class, or 2) clusters that are far apart from the clusters of the normal class. Clustering-based methods can be categorized into methods that 1) define a single point as outlying if it does not fit the clustering well (typically measured by the distance from a cluster center), or 2) consider small clusters as outliers.

Cluster-based detection: initially run the k-means clustering algorithm to find k clusters; calculate the accuracy and the silhouette index of the k-means clustering.

Exercise -> Cluster-based outlier detection: The data set contains n = 120 data points. It shows the relationship between learning effort in self-study [hours per week] and success on the final exam [index 0 to 10] in a master's program.

Preparation: To perform k-means clustering, the variables must be standardized, due to the different units of the two variables, and the distances between the data points are calculated.

Running k-means clustering -> the number of clusters has to be set in advance: centers = 4. The choice is based on the fact that the scatter plot shows 4 small clusters. Four clusters are created that show approximately equal group sizes.

Displaying the clusters: The four clusters correspond to the expected pattern. In this solution, one element assigned to cluster 3 is close to cluster 1.
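A sketch of this preparation and clustering in R (the data frame d with columns effort and exam is a simulated stand-in for the course data set):

    set.seed(1)
    d <- data.frame(effort = c(rnorm(60, 5), rnorm(60, 15)),  # hypothetical study hours
                    exam   = c(rnorm(60, 3), rnorm(60, 8)))   # hypothetical exam index
    z  <- scale(d)                 # standardize: the two variables have different units
    km <- kmeans(z, centers = 4)   # the number of clusters must be set in advance
    plot(z, col = km$cluster)      # display the clusters
    table(km$cluster)              # group sizes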
Scree plot: The position of the elbow is the indicator of the number of clusters.

Silhouette plot: It shows a rather "balanced" picture, with s_i between 0.56 and 0.79. The silhouette of cluster 2, the cluster at the bottom right, has the largest value, indicating that this cluster is denser, has a greater distance from the other clusters and is more distinct from the other clusters.

Dendrogram / tree diagram: It shows a relatively even structure. There is no large subtree that stands out from the others.

Result: In terms of content, the solution could be interpreted in this way: as a result of the cluster analysis, these elements could clearly be labeled as outliers. But because the cluster analysis has formed a larger cluster in the middle field, which can be interpreted well in terms of content, and because of the interpretation that makes sense as a whole together with the elements on the bottom right side, they are kept as part of the whole.
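Continuing the previous sketch (z and km as above), the scree and silhouette diagnostics can be produced like this, assuming the cluster package:

    library(cluster)  # provides silhouette()
    wss <- sapply(1:8, function(k) kmeans(z, centers = k, nstart = 10)$tot.withinss)
    plot(1:8, wss, type = "b")               # scree plot: the elbow suggests the cluster count
    sil <- silhouette(km$cluster, dist(z))   # silhouette values s_i per data point
    plot(sil)                                # silhouette plot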

Chapter 7

Linear regression model: Test the regression model by evaluating every data point against the model.

RANSAC algorithm: Random Sample Consensus (RANSAC) is an iterative procedure for estimating the parameters of a mathematical model from a set of observed data containing outliers. The goal is to ensure that the outliers do not affect the estimation of the model. Inliers: data whose distribution can be explained by some set of model parameters. Outliers: data that do not fit the model.

Input into the function:
data: a set of observed data points
n: minimum number of data points required to fit the model
k: maximum number of iterations allowed in the algorithm
t: threshold value to determine when a data point fits a model
d: number of close data points required to assert that a model fits well to the data
Return: bestfit, the model parameters which best fit the data (or null if no good model is found).

Exercise -> RANSAC: The data set contains n = 120 data points. Run the RANSAC function. *Note: n and d are equal or similar numbers.
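A compact, hypothetical line-fitting illustration of the RANSAC procedure in R (not the course's implementation; the parameter names follow the input list above):

    ransac_line <- function(data, n = 2, k = 100, t = 1, d = 30) {
      best <- NULL
      best_err <- Inf
      for (i in 1:k) {
        s   <- data[sample(nrow(data), n), ]   # minimal random sample
        fit <- lm(y ~ x, data = s)             # candidate model from the sample
        res <- abs(data$y - predict(fit, data))
        inl <- data[res < t, ]                 # points that fit the candidate model
        if (nrow(inl) >= d) {                  # enough close points -> refit on inliers
          fit2 <- lm(y ~ x, data = inl)
          err  <- mean(residuals(fit2)^2)
          if (err < best_err) { best <- fit2; best_err <- err }
        }
      }
      best                                     # bestfit (or NULL if no good model is found)
    }

    set.seed(1)
    df <- data.frame(x = 1:100)
    df$y <- 2 * df$x + rnorm(100)              # hypothetical linear data ...
    df$y[1:10] <- 100                          # ... contaminated with outliers
    ransac_line(df, t = 3, d = 50)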
Outlier detection with DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The k-means algorithm implicitly assumes a spherical shape for the inliers. In DBSCAN, inliers form areas with high density, and these form the "building blocks" for constructing arbitrarily shaped areas.
Exercise -> DBSCAN

Calculate and plot kNNdist (k-nearest-neighbor distances). General rule for the parameter k -> k = 4 for all databases (for 2-dimensional data). A sudden increase of the kNN distance (a knee) indicates that the points to the right of it are most likely outliers. Choose Eps where the knee is -> 1.3. Here MinPts is chosen as 4, according to: "... eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data)".

Result: 0 -> 1 noise point; 1 -> cluster 1 consisting of 28 data points; 2 -> cluster 2 consisting of 31 data points.

Show hull plot: The hull plot shows that the groups that appear in the original scatterplot are identified as separate clusters by the DBSCAN algorithm, which identifies the group at the bottom right as one of these clusters. The noise point is found at the bottom left.
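A sketch of this workflow in R, assuming the dbscan package and simulated (hypothetical) 2-dimensional data:

    library(dbscan)
    set.seed(1)
    z <- cbind(x = c(rnorm(28, 0), rnorm(31, 5)),   # two hypothetical groups
               y = c(rnorm(28, 0), rnorm(31, 5)))
    kNNdistplot(z, k = 4)                    # look for the knee in the 4-NN distances
    db <- dbscan(z, eps = 1.3, minPts = 4)   # Eps at the knee, MinPts = 4
    table(db$cluster)                        # 0 = noise, 1..m = clusters
    hullplot(z, db)                          # convex hulls of the clusters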

Chapter 8

Data -> Information -> Knowledge -> Wisdom

Information Quality (InfoQ) is the potential of a data set to achieve a specific goal using a given data analysis method. It comprises the quality of the goal definition g, the data X, the analysis f and the utility measure U.

Quality of analysis f: refers to the adequacy of the empirical analysis considering the data and the goal at hand. It reflects the adequacy of the modelling with respect to the data and for answering the question of interest.
"... eliminate the parameter MinPts by setting it to Chapter 9 Data quality aspects in large data sets
4 for all databases (for 2-dimensional data) («BigData»)
Detecting inherent quality problems in
research data

Hierarchy of the terms: Reproducibility < Replicability


< Repeatability

Reproducibility (Different team, different


experimental setup); Replicability (Different team,
same experimental setup); Repeatability (Same
team, same experimental setup)
0 -> 1 noise point ; 1 -> cluster 1 consisting of 28 data
points; 2 -> cluster 2 consisting of 31 data points
If the two variables originate from popula- tions with
the same distributions, the points lie approximately on
an angle bisector. The greater the deviation from the
bisector, the more likely it can be assumed that the
two variables originate from populations with different
distributions.

Right skewed distribution

Outliers
