Summary- Data Quality
DQ is a multidimensional concept. It is the state of qualitative or quantitative pieces of information; data are of high quality if they are fit for their intended uses in operations, decision making and planning. Metadata is "data that provides information about other data."

Data definitions can be understood on five levels: 1. what data is intended to represent (definition of terms and research/business rules); 2. how data effects this representation: conventions, including physical data definition (format, field sizes, data types, etc.); 3. the system design and system processing; 4. the limits of that representation (what the data does not represent); 5. how data is used, and how it can be used.

Chapter 2

Measuring data quality dimensions serves to understand and analyze data quality problems, and to resolve or minimize them.

The 6 most common dimensions:
1. Completeness -> the capability of representing all and only the relevant aspects of the reality of interest; data are present or absent.
2. (In)Accuracy -> the quality or state of being free from error, in content and form.
3. Consistency -> the ability to belong together without contradiction.
4. Validity -> the quality of being well-grounded, sound, or correct.
5. Timeliness -> the state of being appropriate or adapted to the time or the occasion.
6. Uniqueness -> the state of being the only one.

Each data quality dimension captures one measurable aspect of data quality.

Four levels of measurement: Nominal scale: each category is assigned a number (the code is arbitrary). Ordinal scale: each category is associated with an ordered number. Metric scales: the actual measured value is assigned. *A scale from a higher level can be transformed into a scale on a lower level, but not vice versa.

Median: the middle value of a distribution; not sensitive to outliers.

(In)Accuracy is usually measured as a percentage: the proportion of the stored data versus the potential of 100% accuracy.

Exercise -> Without missing values the total data set is 28; one name value is not accurate, so just 27 values in the data set are accurate.

Accuracy % = (number of accurate values of a data element in the data set, excluding manifestations of missing values (27)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 27 × 100 / 28 = 96.43%
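The accuracy/inaccuracy percentages can be sketched in pure Python. The name values and the "known correct" reference set below are invented for illustration (the exercise's 27/28 data set is not reproduced here); `None` marks a missing value, which the formula excludes:

```python
# Accuracy of a data element, excluding manifestations of missing values.
# Data and reference set are made up for illustration.
values = ["Anna", "Ben", None, "Carla", "Drek"]   # None = missing, "Drek" is a typo
correct = {"Anna", "Ben", "Carla", "Derek"}        # hypothetical known-correct values

present = [v for v in values if v is not None]          # exclude missing values
accurate = sum(1 for v in present if v in correct)      # count accurate values

accuracy_pct = accurate * 100 / len(present)
inaccuracy_pct = (len(present) - accurate) * 100 / len(present)
print(round(accuracy_pct, 2), round(inaccuracy_pct, 2))  # 75.0 25.0
```

With the exercise's numbers the same formula gives 27 × 100 / 28 ≈ 96.43%.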
Measurement is objective when it is based on quantitative metrics. A) Task-independent metrics reflect states of the data without contextual knowledge and can be applied to any data set, regardless of the tasks at hand. B) Task-dependent metrics are developed in specific application contexts; they include an organization's business rules, company and government regulations, etc.

Inaccuracy % = (number of inaccurate values of a data element in the data set, excluding manifestations of missing values (1)) × 100 / (number of values for the data element in the data set, excluding manifestations of missing values (28)) = 1 × 100 / 28 = 3.57%

Completeness is usually measured as a percentage: the proportion of the stored data versus the potential of 100% complete. Questions: Is all the necessary information available? Are critical data values missing in the data set? Are all the data sets recorded? Are all mandatory data items recorded?

Chapter 3

EDA: Exploratory data analysis. What exactly is this thing called data wrangling? It's the ability to take a messy, unrefined source of data and wrangle it into something useful.

Box plot: a method for graphically depicting groups of numerical data through their quartiles. Histogram: shows the distribution of numerical data. Scatter plot: shows two variables for a set of data. Mosaic plot: shows two or more qualitative variables.

Exercise -> Boxplot 1: Both distributions (men and women) seem to be right-skewed. The median body weight of men is larger. That the distributions are right-skewed is confirmed by the histograms for men and for women.

Boxplot 2: Distributions of random numbers characterized by more or less strong right/left skewness, small or large spread, and outliers. 50% of the data points are larger than the median (0-1).
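The five numbers behind a box plot, and the median's insensitivity to outliers, can be illustrated with the standard library alone (the right-skewed sample below is made up):

```python
import statistics

# Made-up right-skewed sample (illustration only)
data = [1, 2, 2, 3, 3, 3, 4, 5, 7, 12, 20]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles; q2 is the median
print(min(data), q1, q2, q3, max(data))       # the five numbers behind a box plot

# Right skew: the long upper tail pulls the mean above the median
print(statistics.mean(data) > statistics.median(data))  # True

# The median is robust: one extreme outlier barely moves it, unlike the mean
print(statistics.median(data + [1000]))  # 3.5
print(statistics.mean(data + [1000]))    # 88.5
```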
The data set consists of 6 variables; the sample size is n = 99. In total there are 4 missing values → 3 for Final and 1 for TakeHome.
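Under the same percentage logic, per-variable completeness for such a table can be sketched as follows (pure Python; the column names echo the exercise but the values are made up):

```python
# Completeness = stored (non-missing) values vs. the potential of 100% complete.
# Toy table: each column lists its values; None marks a missing entry.
table = {
    "Final":    [15, None, 18, None, 12, None],
    "TakeHome": [9, 10, None, 8, 7, 6],
}

for name, column in table.items():
    stored = sum(1 for v in column if v is not None)
    completeness = stored * 100 / len(column)
    print(f"{name}: {completeness:.1f}% complete")
```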
The cumulative relative frequency shows that 55% of all data points belong to category 0 or 1: category 0 contains 24% of the data points and category 1 contains 31%.
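Cumulative relative frequencies are just running sums of the relative frequencies; a sketch with made-up category counts (chosen so that categories 0 and 1 again cover 24% and 31%):

```python
# Relative and cumulative relative frequencies for a categorical variable.
counts = {0: 24, 1: 31, 2: 20, 3: 25}  # made-up counts, n = 100
n = sum(counts.values())

cumulative = {}
running = 0
for category in sorted(counts):
    running += counts[category]
    cumulative[category] = running / n
    print(category, f"rel={counts[category] / n:.0%}", f"cum={cumulative[category]:.0%}")
# categories 0 and 1 together: 24% + 31% = 55%
```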
Removing the NAs (missing values).
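Removing NAs row-wise is complete-case analysis: drop every record with at least one missing value (in R, `na.omit()` does this). A pure-Python sketch with made-up rows:

```python
# Complete-case analysis: keep only rows without any missing (None) values.
rows = [
    {"Final": 15, "TakeHome": 9},
    {"Final": None, "TakeHome": 10},   # dropped: Final is missing
    {"Final": 18, "TakeHome": None},   # dropped: TakeHome is missing
    {"Final": 12, "TakeHome": 8},
]

complete = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete))  # 2 complete cases remain
```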
Chapter 7

General rule for the parameter k → k = 4 for all databases (for 2-dimensional data).

The hull plot shows that the groups that appear in the original scatterplot are identified as separate clusters by the DBSCAN algorithm. The DBSCAN algorithm identifies the group at the bottom right as one of these clusters. The noise point is found at the bottom left.
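The k-distance idea behind choosing Eps can be sketched without a clustering library: compute each point's distance to its k-th nearest neighbour (k = MinPts = 4 for 2-dimensional data), sort the distances, and look for a sudden jump; in practice one would plot this curve (e.g. with the R dbscan package or sklearn.neighbors) and read Eps off the knee. The points and the noise point below are made up:

```python
import math

def kth_nn_distance(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)  # ascending, as in a k-distance plot

# Two tight made-up groups plus one far-away noise point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
          (30, 30)]  # noise

kdist = kth_nn_distance(points, k=4)
print(kdist)  # the jump at the end of the curve marks the noise point
```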
A sudden increase of the kNN distance (a "knee") indicates that the points to the right of it are most likely outliers. Choose Eps where the knee is → 1.3. Here MinPts is chosen as 4, according to: "... eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data)".

Chapter 8

Data -> Information -> Knowledge -> Wisdom

Information Quality (InfoQ) is the potential of a dataset to achieve a specific goal using a given data analysis method. It is determined by the quality of the goal definition g, the data X, the analysis f, and the utility measure U.

Quality of analysis f: refers to the adequacy of the empirical analysis considering the data and goal at hand; it reflects the adequacy of the modelling with respect to the data and for answering the question of interest.

Chapter 9

Detecting inherent quality problems in research data

In most cases the replication estimate r_r is smaller than the corresponding original estimate r_o. Furthermore, a substantial number of the replication estimates do not achieve statistical significance at the one-sided 2.5% level, while almost all original estimates did. It turns out that only in 21 of 73 replications (≙ 29%) is the result of the original study regarding a significant correlation reached. This indicates either that the original studies are not valid or that replication is difficult, also because it is not possible to include or compare all information of the original study in the replication.

Chapter 10

Data quality aspects in large data sets («BigData»)

Outliers
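One common convention for flagging outliers in a single variable is the 1.5·IQR rule, the same rule box-plot whiskers use; a minimal sketch with made-up data and one planted outlier:

```python
import statistics

data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]  # 95 is a planted outlier

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [95]
```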