Summary - Data Quality

Chapter 1

Data quality (DQ) is a multidimensional concept. It is the state of qualitative or quantitative pieces of information; data is of high quality if it is fit for its intended uses in operation, decision making and planning.

Metadata is "data that provides information about other data". It describes:
1. what data is intended to represent (definition of terms and research/business rules)
2. how data effects this representation: conventions, including the physical data definition (format, field sizes, data types, etc.)
3. the system design and system processing
4. the limits of that representation (what the data does not represent)
5. how data is used, and how it can be used

Chapter 2

Measuring data quality dimensions serves to understand and analyze data quality problems and to resolve or minimize them.

The 6 most common dimensions:
1. Completeness -> the capability of representing all and only the relevant aspects of the reality of interest; data are present or absent
2. (In)Accuracy -> the quality or state of being free from error, in content and form
3. Consistency -> the ability to belong together without contradiction
4. Validity -> the quality of being well-grounded, sound, or correct
5. Timeliness -> the state of being appropriate or adapted to the time or the occasion
6. Uniqueness -> the state of being the only one
Each data quality dimension captures one measurable aspect of data quality. Measurement is objective when it is based on quantitative metrics:
A) Task-independent metrics reflect states of the data without contextual knowledge and can be applied to any data set, regardless of the tasks at hand.
B) Task-dependent metrics are developed in specific application contexts; included are an organization's business rules, company and government regulations, etc.

Completeness is usually measured as a percentage: the proportion of the stored data versus the potential of 100% complete. Questions: Is all the necessary information available? Are critical data values missing in the data set? Are all the data sets recorded? Are all mandatory data items recorded?

Exercise -> The data set consists of 6 variables; the sample size is n = 99. In total there are 4 missings: 3 for Final and 1 for TakeHome.

Accuracy is usually measured as a percentage: the proportion of the stored data versus the potential of 100% accuracy.

Exercise -> Without missing values the data set contains 28 values in total; one name value is not accurate, so just 27 values in the data set are accurate.

Accuracy % = (number of accurate values of a data element in the data set, excluding manifestations of missing values (27)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 27 × 100 / 28 = 96.43%

Inaccuracy % = (number of inaccurate values of a data element in the data set, excluding manifestations of missing values (1)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 1 × 100 / 28 = 3.57%
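A minimal R sketch of this accuracy calculation (the vectors stored and truth are hypothetical, not from the exercise):

    stored <- c("Anna", "Bob", "Clara", NA, "Dana")  # hypothetical stored values, one missing
    truth  <- c("Anna", "Bob", "Klara", NA, "Dana")  # hypothetical reference values
    ok <- !is.na(stored)                             # exclude manifestations of missing values
    accuracy   <- sum(stored[ok] == truth[ok]) / sum(ok) * 100  # accurate values / all values
    inaccuracy <- 100 - accuracy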
Chapter 3

EDA: Exploratory data analysis

What exactly is this thing called data wrangling? It's the ability to take a messy, unrefined source of data and wrangle it into something useful.

Key figures: Mean (arithmetic mean): very sensitive to outliers. Median: the middle value of a distribution; not sensitive to outliers.

Four levels of measurement: Nominal scale: each category is assigned a number (the code is arbitrary). Ordinal scale: each category is associated with an ordered number. Metric scales: the actual measured value is assigned. *A scale from a higher level can be transformed into a scale on a lower level, but not vice versa.

Graph types: Box plot: a method for graphically depicting groups of numerical data through their quartiles. Histogram: the distribution of numerical data. Scatter plot: two variables for a set of data. Mosaic plot: two or more qualitative variables.

Cumulative relative frequency shows that 55% of all data points belong to category 0 or 1: in category 0 are 24% of the data points, in category 1 31% of the data points.

Exercise -> Boxplot 1: Both distributions (men and women) seem to be right skewed. The median body weight of men is larger. That the distributions are right skewed is confirmed by the histograms for men and for women.

Boxplot 2: The distributions of random numbers are characterized by more or less strong right/left skewness, small or large spread, and outliers. 50% of the data points are larger than the median (0-1).
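A minimal R sketch of these basic graphs, using simulated (hypothetical) right-skewed data:

    set.seed(1)
    x <- rexp(100)   # hypothetical right-skewed random numbers
    boxplot(x)       # quartiles and potential outliers
    hist(x)          # distribution of the numerical data
    summary(x)       # mean vs. median: the mean is pulled upwards by outliers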

Chapter 4

Classification of missings: types

MCAR (Missing Completely at Random): Missing values are completely randomly distributed across all cases (persons, etc.). Cases with missing values do not differ from cases without missing values. Whether a value is missing from the data set is not related to any of the variables collected; there is no correlation of the occurrence of missing values with other variables.

MAR (Missing at Random): The occurrence of a missing value occurs conditionally at random and can be explained by the values of other variables. Persons with complete data differ from those with incomplete data.

MNAR (Missing Not at Random): Values are systematically missing, but no information is available to model their absence. There is no adequate statistical procedure to avoid bias.

Exercise -> MCAR missing pattern: The last row shows the total number of missing values for each variable and in total: 37 for Ozone and 7 for Solar.R, 44 in total. In 2 cases the pattern is such that there is a missing in both variables Solar.R and Ozone. The sample size is N = 153. The graph shows the same information, but additionally made visible graphically.
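A sketch of how such a missing pattern can be produced with the mice package, assuming the exercise uses R's built-in airquality data (which matches the reported counts):

    library(mice)
    md.pattern(airquality)  # pattern matrix; the last row shows the missing totals per variable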

Analysis and treatment of missing values

Deletion: 1) Listwise deletion (complete-case analysis): delete all rows that have a missing value. 2) Pairwise deletion (available-case analysis): considers all available data of a person; leads to different sample sizes for different variables.

Imputation: 1) Single imputation / unit imputation: missing values are replaced by the mean or median of the variable. MCAR setting: no bias; MAR and MNAR settings: bias possible (underestimation / overestimation). 2) Missing values are replaced by values derived from a regression analysis.

Imputation of type "mean": missings will be replaced by the mean value. Example "TakeHome": the missing value is replaced by the mean of the variable in the original data set.

Imputation of type "sample": "sample" means a random sample drawn from the observed values; each time you run an imputation of type "sample", you will get different results. Example "TakeHome": the missing value is replaced by a randomly drawn value, here 16.91.

Here a chi-squared test is conducted. The test is significant with a p-value of 0.00142, which is less than 0.05.

Exercise -> Missing analysis: the mice package is used to impute MAR values. It is one of the fastest packages and probably a gold standard for imputing values.
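A minimal sketch of these imputation types in R (the vector x is a hypothetical stand-in for the TakeHome variable; the last lines assume the mice package with its default predictive mean matching):

    x <- c(12, 18, NA, 21, 16)                           # hypothetical scores, one missing
    miss <- is.na(x)
    x_mean   <- replace(x, miss, mean(x, na.rm = TRUE))  # type "mean"
    x_sample <- replace(x, miss, sample(x[!miss], sum(miss), replace = TRUE))  # type "sample": differs per run

    library(mice)                             # model-based (multiple) imputation
    imp <- mice(airquality, m = 5, seed = 1)  # e.g. on the airquality data
    head(complete(imp, 1))                    # first completed data set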
Chapter 5

Anomaly detection

Point vs. contextual vs. collective outliers: Point outliers (global outliers) deviate extremely from well-defined norms or given concepts of expected behavior. Contextual outliers: a data object is extremely different in a specific context (but not in every context); each data object can be described by two kinds of attributes: 1) contextual attributes (date and location in the temperature example) and 2) behavioral attributes (temperature, humidity and pressure in the temperature example). Collective outliers: a group of data objects falls extremely far from the well-defined norms of a data set or from given concepts of expected behavior; such a collection is known as a collective outlier. Example: 100 delayed orders form a collective outlier.

Exploratory data analysis (EDA) for anomaly detection: 1) data set, 2) key figures and graphs, 3) box plot, 4) percentile method.

Calculation with the percentiles method: observations that lie outside the interval formed by the 2.5 and 97.5 percentiles will be considered as potential outliers.

Choosing a 5% proportion of all values to define the outliers gives the limits: the lower limit for outliers is 50 kg, the upper limit is 113.2 kg.

Exercise -> Visualization techniques: The data set contains n = 13908 data points, 1389 of these NA. Mean and median of the wtval variable are comparable; this indicates that the variable is approximately normally distributed. After removing the NA values and generating the histogram, potential outliers show up. In fact, it is a bimodal/normal distribution; the large values on the right side indicate that there could be outliers. There are 199 outliers according to the definition of the box plot.
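A sketch of the percentile method in R on simulated (hypothetical) body weights:

    set.seed(1)
    wt <- rnorm(1000, mean = 80, sd = 15)    # hypothetical body weights [kg]
    lims <- quantile(wt, c(0.025, 0.975))    # 2.5 and 97.5 percentiles
    out <- wt[wt < lims[1] | wt > lims[2]]   # potential outliers by the percentile method
    hist(wt)                                 # inspect the distribution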


Parametric models assume a specific family of distributions to describe the normal data: 1) univariate methods deal with one random variable at a time, so each variable has to be modelled independently using its own distribution function; 2) multivariate methods allow vectors of random variables to be modelled using the same distribution function.

Exercise -> Statistical tests

Outlier test according to Grubbs: Grubbs's test allows detecting whether the highest or the lowest value in a data set is an outlier. It assumes normally distributed data. The test has too little power for sample sizes n ≤ 6 and should not be executed then. The larger the sample, the less the p-value can be used as a measure of validity. Here the null hypothesis is rejected at the 5% significance level -> the highest value 184.3 is an outlier. The null hypothesis is kept at the 5% significance level -> the lowest value 35.6 is not an outlier.

Rosner's test can detect several outliers at once and solves the problem of masking (an outlier that is close in value to another outlier might go undetected). It assumes normally distributed data and is most appropriate for large sample sizes, n ≥ 20. The input k = 3 corresponds to the decision that there are 3 suspected outliers; simulations were not run for k > 10 or k > floor(n/2). "Outlier" shows that at least 10 potential outliers are detected in Rosner's test; the values range from 152.5 to 184.3.
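A sketch of both tests in R, assuming the outliers and EnvStats packages and simulated (hypothetical) data:

    library(outliers)  # provides grubbs.test
    library(EnvStats)  # provides rosnerTest
    set.seed(1)
    x <- c(rnorm(50, mean = 100, sd = 10), 184.3)  # hypothetical values with one extreme point
    grubbs.test(x)                   # tests whether the highest value is an outlier
    grubbs.test(x, opposite = TRUE)  # tests the lowest value instead
    rosnerTest(x, k = 3)             # tests for up to k = 3 outliers at once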

Chapter 6

Outliers can be identified as: 1) data points that do not fit well into the clustering of the normal class, or 2) clusters that are far apart from the clusters of the normal class. Clustering-based methods can be categorized into methods that 1) define a single point as outlying if it does not fit the clustering well (typically measured by the distance from a cluster center), or 2) consider small clusters as outliers.

Cluster-based detection: initially run the k-means clustering algorithm to find k clusters; calculate the accuracy and the silhouette index of the k-means clustering.

Exercise -> Cluster-based outlier detection: The data set contains n = 120 data points. It shows the relationship between learning effort in self-study [hours per week] and success on the final exam [index 0 to 10] in a master's program.

Preparation: To perform k-means clustering, the variables must be standardized, due to the different units of the two variables, and the distances between the data points are calculated.

Running k-means clustering -> the number of clusters has to be set in advance: centers = 4. The choice is based on the fact that the scatter plot shows 4 small clusters. Four clusters are created that show approximately equal group sizes.

Displaying the clusters: The four clusters correspond to the expected pattern. In this solution, one element assigned to cluster 3 is close to cluster 1.
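A sketch of this preparation and clustering in R (the data frame d with columns effort and exam is a simulated stand-in for the course data set):

    set.seed(1)
    d <- data.frame(effort = c(rnorm(60, 5), rnorm(60, 15)),  # hypothetical study hours
                    exam   = c(rnorm(60, 3), rnorm(60, 8)))   # hypothetical exam index
    z  <- scale(d)                 # standardize: the two variables have different units
    km <- kmeans(z, centers = 4)   # the number of clusters must be set in advance
    plot(z, col = km$cluster)      # display the clusters
    table(km$cluster)              # group sizes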
Scree plot: The position of the elbow is the indicator of the number of clusters.

Silhouette plot: It shows a rather "balanced" picture, with s_i between 0.56 and 0.79. The silhouette of cluster 2, the cluster at the bottom right, has the largest value, indicating that this cluster is denser, has a greater distance from the other clusters and is more distinct from the other clusters.

Dendrogram / tree diagram: It shows a relatively even structure. There is no large subtree that stands out from the others.

Result: In terms of content, the solution could be interpreted in this way: as a result of the cluster analysis, these elements could clearly be labeled as outliers. But because the cluster analysis has formed a larger cluster in the middle field, which can be interpreted well in terms of content, and because of the interpretation that makes sense as a whole together with the elements on the bottom right side, they are kept as part of the whole.
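Continuing the previous sketch (z and km as above), the scree and silhouette diagnostics can be produced like this, assuming the cluster package:

    library(cluster)  # provides silhouette()
    wss <- sapply(1:8, function(k) kmeans(z, centers = k, nstart = 10)$tot.withinss)
    plot(1:8, wss, type = "b")               # scree plot: the elbow suggests the cluster count
    sil <- silhouette(km$cluster, dist(z))   # silhouette values s_i per data point
    plot(sil)                                # silhouette plot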

Chapter 7

Linear regression model: Test the regression model by evaluating every data point against the model.

RANSAC algorithm: Random Sample Consensus (RANSAC) is an iterative procedure for estimating the parameters of a mathematical model from a set of observed data containing outliers. The goal is to ensure that the outliers do not affect the estimation of the model. Inliers: data whose distribution can be explained by some set of model parameters. Outliers: data that do not fit the model.

Input into the function:
data: a set of observed data points
n: minimum number of data points required to fit the model
k: maximum number of iterations allowed in the algorithm
t: threshold value to determine when a data point fits a model
d: number of close data points required to assert that a model fits well to the data
Return: bestfit, the model parameters which best fit the data (or null if no good model is found).

Exercise -> RANSAC: The data set contains n = 120 data points. Run the RANSAC function. *Note: n and d are equal or similar numbers.
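A compact, hypothetical line-fitting illustration of the RANSAC procedure in R (not the course's implementation; the parameter names follow the input list above):

    ransac_line <- function(data, n = 2, k = 100, t = 1, d = 30) {
      best <- NULL
      best_err <- Inf
      for (i in 1:k) {
        s   <- data[sample(nrow(data), n), ]   # minimal random sample
        fit <- lm(y ~ x, data = s)             # candidate model from the sample
        res <- abs(data$y - predict(fit, data))
        inl <- data[res < t, ]                 # points that fit the candidate model
        if (nrow(inl) >= d) {                  # enough close points -> refit on inliers
          fit2 <- lm(y ~ x, data = inl)
          err  <- mean(residuals(fit2)^2)
          if (err < best_err) { best <- fit2; best_err <- err }
        }
      }
      best                                     # bestfit (or NULL if no good model is found)
    }

    set.seed(1)
    df <- data.frame(x = 1:100)
    df$y <- 2 * df$x + rnorm(100)              # hypothetical linear data ...
    df$y[1:10] <- 100                          # ... contaminated with outliers
    ransac_line(df, t = 3, d = 50)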
Outlier detection with DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The k-means algorithm implicitly assumes a spherical shape for the inliers. In DBSCAN, inliers form areas with high density, and these form the "building blocks" for constructing arbitrarily shaped areas.
Exercise -> DBSCAN

Calculate and plot kNNdist (k-nearest-neighbor distances). General rule for the parameter k -> k = 4 for all databases (for 2-dimensional data). A sudden increase of the kNN distance (a knee) indicates that the points to the right of it are most likely outliers. Choose Eps where the knee is -> 1.3. Here MinPts is chosen as 4, according to: "... eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data)".

Result: 0 -> 1 noise point; 1 -> cluster 1 consisting of 28 data points; 2 -> cluster 2 consisting of 31 data points.

Show hull plot: The hull plot shows that the groups that appear in the original scatterplot are identified as separate clusters by the DBSCAN algorithm, which identifies the group at the bottom right as one of these clusters. The noise point is found at the bottom left.
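A sketch of this workflow in R, assuming the dbscan package and simulated (hypothetical) 2-dimensional data:

    library(dbscan)
    set.seed(1)
    z <- cbind(x = c(rnorm(28, 0), rnorm(31, 5)),   # two hypothetical groups
               y = c(rnorm(28, 0), rnorm(31, 5)))
    kNNdistplot(z, k = 4)                    # look for the knee in the 4-NN distances
    db <- dbscan(z, eps = 1.3, minPts = 4)   # Eps at the knee, MinPts = 4
    table(db$cluster)                        # 0 = noise, 1..m = clusters
    hullplot(z, db)                          # convex hulls of the clusters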

Chapter 8

Data -> Information -> Knowledge -> Wisdom

Information Quality (InfoQ) is the potential of a data set to achieve a specific goal using a given data analysis method. It comprises the quality of the goal definition g, the data X, the analysis f and the utility measure U.

Quality of analysis f: refers to the adequacy of the empirical analysis considering the data and the goal at hand. It reflects the adequacy of the modelling with respect to the data and for answering the question of interest.
"... eliminate the parameter MinPts by setting it to Chapter 9 Data quality aspects in large data sets
4 for all databases (for 2-dimensional data) («BigData»)
Detecting inherent quality problems in
research data

Hierarchy of the terms: Reproducibility < Replicability


< Repeatability

Reproducibility (Different team, different


experimental setup); Replicability (Different team,
same experimental setup); Repeatability (Same
team, same experimental setup)
0 -> 1 noise point ; 1 -> cluster 1 consisting of 28 data
points; 2 -> cluster 2 consisting of 31 data points
If the two variables originate from popula- tions with
the same distributions, the points lie approximately on
an angle bisector. The greater the deviation from the
bisector, the more likely it can be assumed that the
two variables originate from populations with different
distributions.

Right skewed distribution

Outliers
