
Summary Foundations of Data Analytics

2IAB0 – 2023/2024

Biermans, Cas
Biermanscas@gmail.com
Contents
Lecture 1: (Monday 13-11-2023)
Lecture 2: (Monday 20-11-2023)
Lecture 3: (Monday 27-11-2023)
Lecture 4: (Monday 4-12-2023)
Lecture 5: (Monday 11-12-2023)
Lecture 6: (Monday 18-12-2023)

Lecture 1: (Monday 13-11-2023)
Types of data analytics:
- Descriptive: insight into the past
- Predictive: looking into the future
- Prescriptive: data-driven advice on how to take action to influence or change the future

Data analytics life cycle:

Data: raw numbers, facts, etc.


Information: structured, meaningful, and useful numbers and facts

Data types:
- Categorical data: data that has no intrinsic numerical value:
o Nominal: two or more outcomes that have no natural order
e.g. movie genre or hair color
o Ordinal: two or more outcomes that have a natural order
e.g. movie ratings (bad, neutral or good), level of education
- Numerical (quantitative) data: data that has an intrinsic numerical value:
o Continuous: data that can attain any value on a given measurement scale
 Interval data: equal intervals represent equal differences, there is no fixed
“zero point”
e.g. clock time, birth year
 Ratio data: both differences and ratios make sense, there is a fixed "zero
point"
e.g. movie budget, distance, time duration
o Discrete: data that can only attain certain values (typically integers)
e.g. the number of days with sunshine in a certain year, the number of traffic
incidents

Reference table: to store “all” data in a table so that it can be looked up easily.
Demonstration table: to illustrate a point (with just enough data, or with a specific summary)

EDA: exploratory data analysis


Key features of EDA:

• getting to know the data before doing further analysis
• extensively using plots
• generating questions
• detecting errors in data

Plots are a useful tool to discover unexpected relations and insights. Plots help us to explore and give
clues. Numerical summaries like averages help us to document essential features of data sets.
One should use both plots and numerical summaries. They complement each other. Numerical
summaries are often called statistics or summary statistics (note the double meaning of the word:
both a scientific field and computed numbers).

Summary statistics:
- Level: location summary statistics (typical values)
- Spread: scale summary statistics (how much do values vary?)
- Relation: association summary statistics (how do values of different quantities vary
simultaneously)

Location summary statistics:


- Mean/average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Median: the value separating the higher half from the lower half of a data set
o Median computation:
 Order series of observations from small to large.
 If the number of observations is odd, take the middle value.
 If the number of observations is even, take the average of the two middle
values.
- Mode: most frequently occurring value, may be non-unique
The mean is sensitive to “outliers”, the median is not.
The mean can be misleading and difficult to interpret for non-symmetric data sets.
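As a small illustration (a minimal sketch with hypothetical values, using only the Python standard library), the snippet below computes the three location statistics and shows how a single outlier shifts the mean but barely moves the median:

from statistics import mean, median, mode

data = [2, 3, 3, 4, 5, 6, 7]                     # hypothetical observations
with_outlier = data + [100]                      # add one extreme value

print(mean(data), median(data), mode(data))      # 4.29..., 4, 3
print(mean(with_outlier), median(with_outlier))  # mean jumps to 16.25, median only to 4.5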

Pth percentile: a cut-off point for P% of the data.
The 0th percentile = the smallest element of the dataset.
The 100th percentile = the largest element of the dataset.
For a dataset with $n$ observations, the interval between neighbouring percentiles is $\frac{100}{n-1}$%.
For a percentile P we compute its location in a dataset of $n$ observations: $L_P = 1 + \frac{P}{100}(n-1)$

Let $l$ and $h$ be the observations at positions $\lfloor L_P \rfloor$ and $\lceil L_P \rceil$ in the ordered dataset.
Apply linear interpolation:

Pth percentile value $= l + (L_P - \lfloor L_P \rfloor)(h - l)$
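A minimal sketch of this interpolation rule in Python (assuming a plain list of numbers; numpy.percentile with its default linear interpolation follows the same idea):

import math

def percentile(values, p):
    # Pth percentile via the linear-interpolation rule above
    xs = sorted(values)
    n = len(xs)
    loc = 1 + (p / 100) * (n - 1)        # location L_P (1-based)
    lo, hi = math.floor(loc), math.ceil(loc)
    l, h = xs[lo - 1], xs[hi - 1]        # observations at the floor/ceiling positions
    return l + (loc - lo) * (h - l)      # linear interpolation between them

print(percentile([1, 3, 5, 7, 9], 25))   # 3.0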

Location and scale statistics:

Range = max – min


Interquartile range (IQR) = 3rd quartile – 1st quartile
Sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, often denoted by $\sigma^2$ or $S^2$

Sample standard deviation $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Median absolute deviation (MAD): the median of the absolute deviations $|x_i - M|$ from the median $M$.

The higher these statistics, the more spread/variability in the data.

The standard deviation has the right (physical) unit.


The variance is more convenient mathematically.
The range, variance and standard deviation are sensitive to “outliers”, IQR and MAD are not.
The standard deviation can be used as a general unit to describe variability.

Standardization (z-score normalization):
The z-score (normalized value) of a value $x$ shows how many standard deviations the value is from the mean:

$z = \dfrac{x - \bar{x}}{s}$

Negative z-score: the value is below the mean.
Positive z-score: the value is above the mean.

The mean $\bar{z}$ of the z-scores of a data set is 0 and their standard deviation $s_z$ is 1.

Rule of thumb: observations with a z-score larger than 2.5 are considered to be extreme (“outliers”).
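A minimal sketch (hypothetical measurements) that standardizes a dataset and flags observations with |z| > 2.5 as potential outliers:

from statistics import mean, stdev

data = [14, 15, 16] * 6 + [15, 40]       # hypothetical measurements, one extreme value
m, s = mean(data), stdev(data)           # sample mean and sample standard deviation (n - 1)

z_scores = [(x - m) / s for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2.5]
print(outliers)                          # [40]: only the extreme value exceeds the threshold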

Association statistics: Association statistics try to quantify the strength of the relation between two
variables (attributes).
The sign of an association statistic indicates whether it is:
- a positive association (e.g., between budget and profit)
- a negative association (e.g., between ice cream consumption and weight)

Sample covariance $s_{xy} = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Sample correlation $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$, where $s_x$ is the standard deviation of $x$, and $-1 \le r_{xy} \le 1$.
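A small sketch (hypothetical paired data) computing the sample covariance and correlation directly from these definitions:

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)   # sample covariance
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return cov / (sx * sy)                                             # sample correlation

budget = [10, 20, 30, 40, 50]          # hypothetical production budgets
profit = [5, 12, 18, 30, 41]           # hypothetical profits
print(correlation(budget, profit))     # about 0.99: strong positive association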

Jitter plot:
the different x-coordinates have no meaning; the jitter only spreads the points to indicate the amount of data.
- One dimensional numerical data
Tasks: finding clusters and outliers
Not suitable for large data sets

Scatter plot
- Two-dimensional data with 2 numerical attributes (e.g. production budget and profit)
Tasks: investigate relations and see whether there are outliers

Histogram: distribution of numerical data


- One-dimensional numerical data
- The range of data values is split in bins (intervals of values)
- Choose either:
o Bin size
o number of bins
- The histogram shows the count of observations or percentages per bin
Tasks: understanding the distribution, comparing the counts/percentages of bins

Histograms are sensitive to bin width:


- Bin width too small: too wiggly
- Bin width too large: too few details
Rule of thumb for choosing a sensible number of bins: √n, where n is the number of observations.
A "wrong" choice of bin size can still lead to useful insights.

Cumulative histogram:
- One-dimensional numerical data
- The range of data values is split in bins
- The height of a bin [a,b) shows the count or percentage of the number of observations not exceeding b.
Tasks: understanding the distribution, exploring unexpected "jumps", looking up percentiles or thresholds.

Bar chart:

- Two-dimensional data with:
o 1 categorical attribute
e.g. release month, genre
o 1 numerical attribute
e.g. the number of released movies
- Also used for one-dimensional categorical data with the derived numerical attribute being
the count or percentage of the number of observations per category
Tasks: looking up and comparing values

Bar charts vs Histograms

Summary elementary statistical plots:


- Strip plots (=dot plots) are good for showing actual values, but cannot be used for large data
sets.
- Scatter plots are useful to show relations between quantities.
- Histograms are for numerical data, they require a good choice of bin width.
- Cumulative histograms are useful to illustrate thresholds.
- Bar charts are for categorical + numerical data, histograms are for numerical data.

Box and whisker plot: convenient way to display summary statistics
- Box: 1st and 3rd quartile and the median inside it
- Outliers: dots/crosses/diamond… for all values above (Q3+1.5*IQR) and below (Q1-1.5*IQR)
- Endpoints of whiskers: the maximal and minimal non-outlier values
- Optional: the indication of the mean

Kernel density plots: improved histograms


- The kernel density function shows the likelihood of finding a data point at a specific value on
the x-axis.
- The total area under the curve always equals 1.
- The area under the curve between x1 and x2 shows an estimate of the probability of getting a
value between x1 and x2.
- It is impossible to determine the minimal and maximal values or find the mode of a dataset
based on its kernel density plot.

How to generate kernel density plots:


- Choose a kernel function and a bandwidth to be taken around each data point.
- Generate a kernel with the chosen bandwidth for every data point in the dataset.
- For a data set with $n$ points, the area under the curve of the kernel of a point is $\frac{1}{n}$.
- The kernel density plot is the sum of the kernels of the data points.

- Bandwidth choice is important!
Influence of bandwidth!
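A minimal sketch of this construction (assuming Gaussian kernels and a hand-picked bandwidth), evaluating the density estimate on a grid of x-values:

import math

def kde(data, xs, bandwidth):
    # kernel density estimate: sum of one kernel per data point, each with area 1/n
    n = len(data)
    def kernel(u):                       # standard Gaussian kernel
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return [sum(kernel((x - d) / bandwidth) for d in data) / (n * bandwidth) for x in xs]

data = [1.2, 1.9, 2.1, 2.3, 4.8, 5.1]    # hypothetical observations
grid = [i / 10 for i in range(70)]       # evaluation points 0.0 .. 6.9
density = kde(data, grid, bandwidth=0.5) # try other bandwidths to see their influence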

Typical distribution shapes:

Violin plot:
Combination of box-and-whisker plot and kernel density plot to have the best of two worlds:

- Global shape of the box-and-whisker plot


- Local details of kernel density plot

Empirical cumulative distribution function (ECDF)


Problem: the cumulative histogram suffers from the choice of fixed bin width
Solution: ECDF – empirical cumulative distribution function

ECDF is a function that, for a given value x, returns the fraction of observations that are smaller than or equal to x.
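A small sketch (hypothetical data) of the ECDF evaluated at a single value x:

def ecdf(data, x):
    # fraction of observations smaller than or equal to x
    return sum(1 for v in data if v <= x) / len(data)

data = [3, 7, 7, 8, 12, 15]              # hypothetical observations
print(ecdf(data, 8))                     # 4/6: two thirds of the observations are <= 8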

Summary advanced statistical plots:
- Kernel density plots:
o overcome the drawbacks of histograms because they do not have fixed bins, but
beware of bandwidth!
o good tools to explore distribution shapes (symmetry, skewness, unimodality,
bimodality,...)
- Box and whisker plots are simple but effective ways to compare shapes of groups.
- Violin plots combine the advantages of kernel density plot and box-and-whisker plots.
- The ECDF (empirical distribution function) overcomes the fixed bin problems of the
cumulative histogram.

Lecture 2: (Monday 20-11-2023)

Infographics:
- Made for telling or explaining a story.
- Focusing on narrative, not on data.
- Manually drawn and illustrated.
- Using a larger range of creative elements and styles.
- Using a presentation method that cannot simply be reused for a different dataset.

Sphygmograph:
- First attempt to record signals (the heartbeat).

Visual definition:
- Visualization is the process that transforms abstract data into interactive graphical
representations for the purpose of exploration, confirmation, and communication.

Why visualize?
- Communication to inform humans.
- Exploration when questions are not well defined.

Communication:
- Specific aspects of a larger dataset are extracted and visualized.
- Annotations highlight important aspects and allow the reader to better connect the
presented information to their existing knowledge.
- The juxtaposition of upper and lower graphs, with comparable axes and scales, facilitates comparison.
- The data are selected and often summarized to support specific (simple) analysis and
insights. Annotations highlight important aspects and facilitate comparison.

Purpose of visualization:
- Open exploration
- Confirmation
- Communication

When NOT to visualize:


- When a table is better, for:
o Individual precise values matter.
o Both summary and detail values.
o The scale is broken: a lot of small numbers and a few large numbers.
- Decision needed in minimal time:
o High frequency stock market.
o Manufacturing: is a bottle broken, do we need to stop the production line.

High level actions:
- Analyze
- Search
- Query

Analyze:
- Visualization for consuming data:
o For end-users, often not technical.
o Needs good balance of details and message.
o Design matters more than for other tasks.
o Dataset will not be changed.
- Visualization for producing data:
o Extends the dataset.
o Usually interactive, more technical users.
o Additional meta-data (semantics)
o Additional data (volume)
o Additional dimensions (derived)
Goals:
When consuming data:
- Discover new information.
o Generate a hypothesis.
o Find support for a hypothesis.
- Present – communicate something you know.
- Enjoy – have fun with data visualization.

When producing data:


- Annotate
- Record
- Derive

Annotate:
- Attach text or graphical elements to visualization elements.

Derive:

Record:
- Save an artifact, create graphical history!

Search:
- Location:
o A “key” identifying an object/artefact.
- Target:
o An aspect of interest: trends, outliers, features.

We look at all data:


- Trends (patterns) define the “mainstream”.
- Outliers stand out from the mainstream.
- Features are task-dependent structures of interest.
We look at attributes only:
- Single attribute: analyze the distribution or extremes.
- Many attributes: analyze dependency, correlation, or similarity.

Visual Processing: proximity


Interpretation by our brains; this is part of a psychological theory called "Gestalt theory".
- Objects close to each other are perceived as a group.
- Objects that are similar (color, shape, …) are perceived as a group.

Our eyes unconsciously draw a line through the points and "see" a maximum value.

Gestalt principles can help us make effective graphs:
- Position and the arrangement of visual elements is the most important channel for
visualizations.

Perception: colors
- Perception of color begins with 3 specialized retinal cells containing pigments with different spectral sensitivities, known as cone cells.
- The output of these cells in the retina is used in the visual cortex and associative areas of the brain.
- The human eye is sensitive to light, but not equally sensitive at all frequencies.
- Perceptual processing before the optic nerve:
o One achromatic channel L
o Two chromatic channels: the R-G and Y-B axes
- "Color blind" if one axis has degraded acuity:
o 8% of men are red/green color deficient.
o Blue/yellow deficiency is very rare => blue/orange is safe to use.

Attribute types: recap


- Nominal - can only be ordered with auxiliary information (e.g. alphabetically)
o Example: fruit, fabric types, chemical elements
- Ordinal: only order can be determined (< > = ≠), no computations possible
o Example: shirt size M is between S and L; the difference L – S has no meaning
- Numerical (quantitative): support arithmetic comparison
o Example: length – the difference 46cm – 23cm is meaningful
- Ordering direction
o Sequential: Shirt size XS < S < M < L < XL
o Diverging: colour scheme red-blue
o Cyclic: time representation, e.g. day of week (Mon, Tue, Wed, Thu, Fri, Sat, Sun, Mon, ...)

Key and value attributes:


- A key attribute acts as an index that is used to look up value attributes.
o Distinction between key and value attributes is important for tables.
o Helpful to pick a suitable visualization type (“idiom”)
o Keys may be nominal or ordinal attributes, usually not quantitative attributes.
 An id can be encoded as a number, but it is not a quantitative attribute!
o Keys will come back in lecture 4 about databases.
o Movie data set example: “movie title” can work as a key for a small dataset, but
hardly for a large one, e.g. 21 movies with the title Anna Karenina on imdb.

Marks and channels
Data visualization makes use of:
- Marks (geometric primitives)
o Points
o Lines
o Areas
o Complex shapes
- Channels (appearance of marks)
o Position (horizontal, vertical, both)
o Color (HSL = hue, saturation, luminance)
o Size
o Shape (e.g. circles, diamonds)
o Angle, …

Encoding attributes:

Redundant encoding:
Using two or more channels to encode one attribute. E.g. hue and horizontal position

Arrangement:
How data values determine position and alignment of visual representatives.
- Express: attribute is mapped to spatial position along an axis (e.g. MWh per capita)
- Separate: emphasize similarity and distinction (using a categorical attribute, e.g. country)
- Order: emphasize order (using an ordinal attribute, e.g. year, if you show progression in time)
- Align: emphasize quantitative comparison (for a quantitative attribute, e.g. MWh per capita of a country)
- Use: using an existing structure for arrangement (e.g. a map)

Mapping color:
- Usually, 3 channels
o Hue, Saturation, Luminance (HSL)
o RGB or CMYK are less useful.
- Ordering/how-much for ordinal and numerical attributes.
o Map to Luminance
o Map to Saturation
- What/where for categorical attributes.
o Map to Hue
- Transparency is possible.
o Useful for creating visual layers.
o But: is not combinable with luminance or saturation.

Designing for color deficiency: encoding

Map other channels:


- Size
o Length - accurate
o 2d area - ok
o 3d volume - poor
- Angle
o Non-linear accuracy (only horizontal, vertical and exact diagonal angles are judged with similar accuracy)
- Shape
o Complex combination of lower-level primitives
o Many bins
- Motion
o Highly separable against static
 Binary: great for highlighting (“blinking”)
o Use with care to avoid irritation.

Properties and best uses of visual encodings:

Reference model of visualization


Munzner’s reference model uses a stepwise approach to analyze and construct visualizations of data.
- Why?
o What are the actions and target of the visualization?
- What?
o What is the data and how is it structured?
- How?
o What is the mapping between the data items and visual elements or channels?

How to use the reference model


To construct a visualization of data, consider different aspects:
1. What is the data and how is it structured?
2. Who is the user or recipient?
3. What should the user be able to do?
o Exploration, confirmation or communication?
4. Why? What are the actions and targets?
5. How? The mapping between the data items and visual elements or channels

(Make sure that you have understood points 1-4 very well before going into the visualization (point 5). Always think about what the purpose is.)

Design strategies: defaults and best practices


- Use position for the most important aspects to visualize, color for categories.
- Position, length, thickness, area, brightness, saturation have natural order.
- Shape, line style, hue do not have natural order.
- Redundant encoding (same data in different channels)
o Easier and faster to read.
o Often more accurate.
o Good for unused channels.
- Limit the amount and detail of data in a visualization.
- Consider default formats and mappings:
o Most things have been tried.
o Default formats are easier to read.
o Innovative formats require thinking.

Design strategies: user-focus
- Consider reader’s context:
o Text: title, tags, and labels
 Familiar to the reader or strange jargon?
o Colors
 Will the reader have the same and right color associations?
o Color deficiencies
o Representation format (different devices, print, interactive)
o Directional orientation (reading direction)
- Compatibility with reality:
o Make the encoding of things and relationships well-aligned with the reality (the
reader’s reality)
o Patterns are easily recognized by humans, powerful tool to analyze visual and
auditory information.
o Violation of patterns is hard for humans, consistency is key!

Elements of a chart:

Next to visual elements expressing data, we need:


- Coordinate systems and scaling of the data
- Axes and grid (help reading the chart)
- Legend (provides mapping information)
- Annotations (highlight or emphasize)
- Layout and layering (support tasks such as comparison)

Idiom = chart type (bar charts, scatter plots, etc.)


Idioms do not serve all tasks equally, so choose them carefully.

How to understand idioms (chart types)
- DATA: what is the data behind the chart?
o Number of categorical attributes
o Number of quantitative attributes
o Semantics of keys and values
- MARK: which marks (visual elements) are used?
o Points, lines, glyph/complex shape, …
- CHANNELS: how will the data be encoded in visual channels?
o Arrangement: horizontal and vertical position
o Mapping: colors, length, size, …
- TASKS: what are the supported tasks?
o Finding trends, outliers, distribution, correlation, clusters, …

Keys and values: recap


Key:
- An attribute used as unique index to look up items.
- Simple tables: 1 key attribute (e.g. student_id in the student table)
- Complex tables: multiple key attributes (student_id + test_id in the grades table)
Value:
- A non-key attribute

Idioms:

Best practices for idioms:
- Include a descriptive title.
- Put labels on the axes and include units if applicable.
- Make the plot readable (use appropriate font sizes, colors, contrast)
- Choose appropriate ranges for the axes.
- Include a legend when you use multiple colors, marks, …

Effectiveness:
Effectiveness specifies how well the visualisation helps a person with their tasks. Any depiction of
data requires the designer to make choices about how that data is visually represented.
- “It’s not just about pretty pictures”
- Lots of possibilities to visualise, only few are effective
- Effective data visualisations conform to user actions and targets!

Checklist for an effective visualization


- Clearly indicates how the values relate to one another, which in this case is a part-to-whole
relationship - the number of deaths per cause, when summed, equal all deaths during the
year.
- Represents the quantities accurately.
- Makes it easy to compare the quantities.
- Makes it easy to see the ranked order of values, such as from the leading cause of death to
the least.
- Makes obvious how people should use the information - what they should use it to
accomplish – and encourages them to do this.

Mistakes in idiom selection:


Idiom – task mismatch

Careful with pies, donuts, and 3D


- Insufficient task support
o Difficult to compare without additional labels (Pie, Donut)
o Difficult to read and to look up values (3D)

Ethical considerations:
- Scientific integrity
o Plagiarism
o Data fabrication and “engineering”
- Data protection and privacy
o Was an ethical committee involved? (Usually mandatory for medical research and
anything involving patients, participants, and animals.)
o GDPR and other privacy-relevant data: do we have informed consent and
permissions?
 Storing personally-identifiable data needs clear information and always explicit consent.
Note: Who owns the data, who presents it to whom for which purpose?

Ethical considerations: misleading visuals


- Omitting data to suggest a different trend.
- Truncating the y-axis to suggest larger changes.
- Selecting favorable but misleading dimensions
- Playing with 2D vs 3D perspectives and optical illusions

Ethical considerations: spurious correlations


- A statistical phenomenon where two variables appear to be related or correlated but their
relationship is coincidental, due to chance or the influence of an unaccounted-for variable
(confounder).
- Amount of money spent in food stores (“food and beverage stores”) and the amount spent in
restaurants (“food services and drinking places”) based on data by state in the United States.

Lecture 3: (Monday 27-11-2023)

What is data mining?


- Data mining is extracting information (knowledge) from data (observations)
Categories:
- Do we have a predefined target? (output variable)
o Yes -> Supervised method (predict grade of a student)
o No -> Unsupervised method (group students)
- Do we seek information applicable to all or only some of the data?
o All -> Global method (mean and variance for adult human length)
o Some -> Local method (predicting adult length with French-German parentage)
Warning
- Data mining usually finds correlations, not causations!

Four examples of data mining methods:


- Linear regression (supervised, global)
o Create a linear model to relate an output variable to one or more input variables.
- Clustering (unsupervised, global)
o Partition all observations in meaningful groups.
- Decision tree mining (supervised, global)
o Learn a tree model to separate ‘positive’ and ‘negative’ cases; the tree represents a
few easy to interpret input attribute decisions.
- Association rule learning (unsupervised, local)
o Find high-confidence associations between frequently occurring subsets of items.

Linear model quality: low SSD


- For all observations $(x_i, y_i)$, consider the deviation or residual $y_i - \hat{y}_i$ between:
o the real output value $y_i$ found in the data, and
o the predicted output value $\hat{y}_i = \beta_0 + \beta_1 x_i$ according to the model.
- Gather them in the Sum of Squared Deviations (SSD):

  $SSD = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$

- The lower the SSD, the better the linear model.

Linear model quality: high R2


- SSD expresses variation relative to the model, similar to what the sample variance does relative to the mean:

  $SSD = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad\qquad \sigma_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$

- Use this to define the relative quality measure $R^2$: $R^2 = 1 - \dfrac{SSD}{(n-1)\,\sigma_y^2}$
- $R^2$ is the fraction of variation in the output values that is explained through the model by variation in the input values.
- The higher $R^2$, the better the linear model ($R^2 = 1$ means all $y_i = \hat{y}_i$).
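A minimal sketch (hypothetical data) that fits a least-squares line and reports SSD and R² as defined above:

def fit_line(xs, ys):
    # least-squares estimates of beta0 and beta1 for y = beta0 + beta1 * x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

xs = [1, 2, 3, 4, 5]                     # hypothetical input (e.g. production budget)
ys = [2.1, 3.9, 6.2, 8.1, 9.8]           # hypothetical output (e.g. profit)

b0, b1 = fit_line(xs, ys)
pred = [b0 + b1 * x for x in xs]
ssd = sum((y - p) ** 2 for y, p in zip(ys, pred))
my = sum(ys) / len(ys)
var_y = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
r2 = 1 - ssd / ((len(ys) - 1) * var_y)
print(b0, b1, ssd, r2)                   # R2 close to 1 for this nearly linear data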

There can be multiple input variables:
- Any number of input variables can be used.
- E.g. extend the profit regression model:
o Only production budget as input
o Extended with release date as input
- R² always increases when adding input variables; beware of overfitting.
o If the model is trained with too many input variables, it won't generalize well to new inputs!
- Difficult to visualize for more than one input variable.

Interactions can be modelled:


- Modeling interactions (synergies) between input variables can improve the prediction of the
output.
o Modeled by adding a function of multiple inputs to the model, e.g. a product (cross
term)
- For example, extend the profit regression model:
o Original, only production budget as input.
o Extended, add IMDB score and cross term as input.

The line does not have to be straight!

Clustering:
k-means clustering algorithm:
1. Pick k random points as initial centroids.
2. Assign each point to the nearest centroid.
3. Recompute centroids: mean of the points in each cluster.
4. Repeat steps 2 and 3 until clusters are stable.
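A minimal sketch of the k-means loop for 2D points (hypothetical data, random initial centroids, a fixed number of iterations instead of a stability check); in practice a library implementation such as scikit-learn's KMeans would typically be used:

import random

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2                # squared Euclidean distance

def mean_point(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)                          # step 1: random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                          # step 2: assign to nearest centroid
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        centroids = [mean_point(c) if c else centroids[i]         # step 3: recompute centroids
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]   # two obvious groups
centroids, clusters = kmeans(points, k=2)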

How do we assess clustering quality?
- There is no known “output” to compare to (clustering is unsupervised)
- A clustering model essentially consists of a set of centroids.
- How well does a centroid represent an observation in a cluster?
o Good, if there is a small distance between them.
o Bad, if there is a large distance between them.
- So, strive for low within-cluster distance: the sum of the distances between cluster’s centroid
and each of its observations.
- A clustering’s within-cluster-distance total W(C) is the sum of the within-cluster-distance over
all clusters (smaller is better, larger is worse)

How do we pick the number of clusters k?


- Interaction between W(C) and k:
o Generally, clusters become more compact if there are more (if k increases)
o So, then the within-cluster-distance total W(C) decreases.
o W(C) is minimal (zero) when there is a centroid at every observation.
- When is a clustering informative?
o K=1: a single cluster: no
o K=n: as many clusters as observations: no
o K somewhere in between: yes
- No single rule for picking k, but certainly not minimizing W(C)

Equal treatment of attributes is important:


- Clustering depends on the distance between points.
- The scale on which attributes are measured therefore matters, so:
o Use the same units for similar attributes (e.g. S-N and W-E distances)
o Ensure units used lead to relevant distance for problem (e.g. ° vs. km)
o Standardize units for dissimilar attributes (e.g. length vs. thickness vs. mass)
 Use z-score.

Distances:
- Distances are fundamental in data mining!

Linear regression:
- Residuals / deviations: distance between points and linear models.
- Determines SSD, and so determine the optimal regression model.
- At the basis of quality measure R2
Clustering:
- Distances determine cluster assignment.
- Different attribute scales can be chosen, influencing the distance.
- At the basis of quality measure W(C)

What is distance?
- A measure for how close or far away things are from each other.
(origin = spatial distance; e.g. between two towns)
- A measure for how related things are.
(e.g. comparing apples and oranges in terms of attributes such as mass, volume, color, sugar
content, etc.)
- A single positive number: distances can be easily compared. (Close vs. far becomes good vs.
bad or in-group vs. out group)
(yes, we can fully compare apples and oranges if we define a distance!)
- There is no single appropriate distance…

Euclidean distance:
- Unrestricted movement, think of straight lines.
- Euclidean distance is also typical default (not always good)

Network distance:
- Movement between “hotspots”, think of commuting from home to school by car.
- Good when:
o We know the network of possible movements.

Manhattan distance:
- Movement in only horizontal and vertical directions.
- Appropriate when movement is restricted to a fixed grid.
- A special case of network distance.
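A small sketch contrasting the two measures for the same pair of points (Euclidean as the straight-line default, Manhattan for grid-restricted movement):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))   # 5.0 (straight line)
print(manhattan(p, q))   # 7   (horizontal + vertical steps only)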

Decision trees:

How good is the tree?
Confusion matrix:
              Tree: Yes   Tree: No
Reality: Yes     TP          FN
Reality: No      FP          TN

Impact of incorrect results: spam filter


Build a spam filter that classifies incoming messages:
- Positive: spam
- Negative: non-spam
False negative: spam ends up in my inbox.
- I must delete it. Not fun, but no big deal.
False positive: real message does not reach me.
- I miss birthday party invitations. This is a big deal!
Here, False Positives are much more harmful than False Negatives.

Impact of incorrect results: medical test


Perform preliminary screening for a deadly disease to classify people to be at-risk or not:
- Positive: at-risk, further investigation is required
- Negative: not at-risk
False negative: deadly disease goes undetected.
- A person might die because of this.
False positive: a person undergoes unnecessary expensive medical test.
- Unpleasant, but nobody dies.
Here, False Negatives are much more harmful than False Positives.

Use the confusion matrix to report quality:

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$

$\text{Precision} = \dfrac{TP}{TP + FP}$

$\text{Recall} = \dfrac{TP}{TP + FN}$

$\text{Specificity} = \dfrac{TN}{TN + FP}$
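A minimal sketch computing these quality measures from the four confusion-matrix counts (hypothetical counts):

def quality(tp, tn, fp, fn):
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

print(quality(tp=40, tn=50, fp=5, fn=5))   # hypothetical spam-filter results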

How to create the decision tree?

Use entropy! Entropy measures the purity of nodes.
In the lecture example, the "Resit" node is pure: 100% "yes".
How to decide?
- We compute the entropy for the whole dataset.
- For each possible split (calc, phys, …) we compute the average entropy over all outcomes of the split (pass, resit, fail), weighted by how often each outcome occurs.
- The entropy of the whole dataset, minus this weighted average entropy, constitutes the information gain of that split.
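A small sketch (hypothetical class labels) of entropy and of the information gain of a split, following the computation described above:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # entropy of the whole dataset minus the weighted average entropy after the split
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

labels = ["yes"] * 6 + ["no"] * 4                        # hypothetical target values
split = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]     # the two outcome groups of one split
print(entropy(labels), information_gain(labels, split))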

Association rule learning:

The support of an itemset is its frequency:


- The total number of observations: $n$
- Number of observations in the itemset $X$: $|X|$
- Support of itemset $X$: $\text{supp}(X) = \frac{|X|}{n}$
- Support of itemset $X \cap Y$: $\text{supp}(X \cap Y) = \frac{|X \cap Y|}{n}$

Association rules:
From frequent itemsets to association rules:
- Find all frequent itemsets.
o Length-1: X
o Length-2: X ∩Y
o Length-3: X ∩Y ∩ Z
o … until nothing is frequent anymore, then stop.
- Create association rules by replacing ∩by ⇒
o X ⇒Y ,Y ⇒ X
o X ⇒(Y ∩Z ),(Y ∩ Z )⇒ X , …

The confidence of a rule is its frequency:
- Confidence of a rule $X \Rightarrow Y$: $\text{conf}(X \Rightarrow Y) = \frac{|X \cap Y|}{|X|}$
- Strong association rules have high support and high confidence:
o $\text{supp}(X \cap Y) \ge \alpha$
o $\text{conf}(X \Rightarrow Y) \ge \beta$
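A minimal sketch (hypothetical transactions) computing support and confidence for a candidate rule; the itemset X ∩ Y in the notation above corresponds to the observations containing both X and Y:

transactions = [                                  # hypothetical market-basket data
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))               # 3/5 = 0.6
print(confidence({"bread"}, {"butter"}))          # 0.6 / 0.8 = 0.75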

Things to watch out for:


- Be careful when setting minimal support (α ) and confidence ( β )
o Too low: result explosion
o Too high: find nothing interesting (or nothing at all)
- An itemset with high support cannot always be transformed into an association rule with
high confidence.
- Having a strong association rule in one direction does not imply having one in the other
direction.

Things to watch out for:


Support:
- Gives an idea about how frequent an itemset is in all the transactions.
- It can help identify rules that are worth considering for further analysis.
o E.g. one might want to consider only the itemsets which occur at least a pre-defined number of times.
Confidence:
- Defines the likelihood of occurrence of the consequent given that the antecedent happens.
o It is the conditional probability of occurrence of consequent given the antecedent.
- A rule with a very frequent consequent will always have a high confidence.

Lecture 4: (Monday 4-12-2023)

Data organization and queries:


Data needs to be stored and retrieved in many types of applications:
- Scientific applications: biology, chemistry, physics, social network analytics, …
- Technical / engineering applications: automotive controls, embedded systems, air traffic
control, climate control, power stations and grids, …
- Administrative applications: banking, student administration, retail, manufacturing, logistics,
human resources, …
- Document-oriented applications: news sites, (digital) libraries, websites, search engines, …

Object system vs. information system:


- Object system:
The “real world” of a company, organization, or experiment, with people, machines,
products, warehouses, chemical reactions, social relationships, …
- Information system:
a representation of the real world in a computer system, using data (e.g., numbers) to
represent objects such as people, machines, products, …
- example: students are people in the real world, but they are represented by an identifying
student number, name, address, list of enrolled courses, grades, etc. in the student
administration database.
- the representation is always an approximation:
o e.g., your knowledge of a university course is represented by an integer number
between 0 and 10.

Why do we use database systems?


What is wrong with storing all my data in one table?
Problems:
- Duplication of information.
- Difficulty in keeping information consistent.
- Difficulty in accessing and sharing data.
- Hard or impossible to keep the data safe and secure.
- Hard or impossible to express (and efficiently execute) interesting (i.e., high-value) analytics
over the data.

Data redundancy
- Imagine we store account information in one single table in a straightforward way.
- John Doe has several accounts -> several rows in the table
- His address is the same in all rows -> redundancy!
- What about joint accounts?

Data inconsistency
- John moves to Boschdijk and the record is corrected at the Eindhoven centrum branch.
o Inconsistency!
- The address needs to be updated in all records of John!

Structuring data to solve problems


Organize your data!
- Several tables:
o a table with records about the customers
o a table with records about accounts
o a table with records of account ownership

Problems solved
Redundancy and inconsistencies are avoided:
- One record with John’s address
- Changing the address means changing one record

Primary key
- A minimal set of attributes of a table that uniquely defines each row of this table
o custName defines the address → primary key
o acctNumber defines the branch and the balance → primary key
o custName does not define the account number, the account number does not define
the custName → combination (custName, acctNumber) is a primary key
- Another way to look at the “key” of a table:
o If we remove all "non-key columns" from the table, all rows are unique!

- A primary key is a minimal set of attributes (columns) that uniquely identifies each record
(row) in the table:
o One single column or several columns
o No subset of this set is a key
o Sometimes it is better to introduce an id

Database models:
- A database model is a collection of tools for describing:
o Data
o Data relationships
o Data semantics (i.e., the meaning of the data)
o Data constraints
- Historically, there have been many proposals:
o Network and Hierarchical database models (1960-1970’s)
o Relational database model (1970-1980’s)
o Object-based database models (1980-1990’s)
o XML data model (1990-2000’s)
o RDF (graph) data model (2000-2010’s)
o …
- We study the relational database model, as it is the dominant practical model, and industry
standard.

Instances and schemas:


- Schemas are similar to variables in programming languages and instances are similar to
variable values.
- Logical Schema (data model) – logical structure of the database:
o Analogous to name and type of a variable in a program.
o A relational database schema consists of a collection of table schemas.

Schemas:
Example: suppose we have a bank database schema consisting of three tables:
- customer(custName, custStreet, custCity) table keeps track of each customer of the bank.
- account(acctNumber, bName, balance) table keeps track of each account of each branch of
the bank.
- depositor(custName, acctNumber) table keeps track of which customer is associated to
which account.

Instances:
- Instance – the actual content of the database at a particular point in time
o Analogous to the value of a variable
- Example: (John Doe, Kruisstraat, Eindhoven) is an instance of the schema customer
(custName, custStreet, custCity)
o Different terms used in the literature: an “entry”, a “tuple”, a “row”

What is database design?
- A database represents the information of a particular domain (e.g, organization, experiment)
o First and foremost: determine the information needs and the users
o Design a conceptual model for this information
o Determine functional requirements for the system: which operations should be
performed on the data?
- “Goodness” of Conceptual design:
o Accurately reflect the semantics of use in the modeled domain

Modelling entities in the ER model


- A database can be modeled as a collection of entities and relationships between entities.
- An entity is an object that exists and is distinguishable from other objects.
o Example: a concrete person, e.g. Mary Johnson, a building, e.g. Auditorium at TU/e
- Entities have attributes
o Example: people have names and addresses, buildings have addresses, height, etc.
- An entity set is a set of entities of the same type that share the same properties.
o Example: set of all TU/e students, set of all TU/e buildings

Modeling relationships in the ER model


- A relationship is an association among several entities
o Example:
o Crick: instructor entity
o Advisor: relationship set
o Tanaka: student entity
- A relationship set is a collection of relationships among entity sets.
o Example: (Crick, Tanaka) belongs to the relationship set advisor

The entity-relationship model:


- Models a real-world system as a collection of entities and relationships
o Entity: a “thing” or “object” in the system that is distinguishable from other objects
 Described by a set of attributes
o Relationship: an association among several entities
- Represented diagrammatically by an entity-relationship diagram

Relationship sets (cont)
- An attribute can also be a property of a relationship set.
- For instance, the advisor relationship set between entity sets instructor and student may
have an attribute date

Attributes
- An entity is represented by a set of attributes, that is, descriptive properties possessed by all
members of an entity set.
o Example: entity sets
 instructor = (ID, name, street, city, salary )
 course = (course_id, title, credits)
- Domain – the set of permitted values for each attribute
o Example:
 Instructor names are strings (e.g., “Jane Smith”)
 Instructor salaries are numbers (e.g., 50000)

E-R diagrams: syntax


- Rectangles represent entity sets.
- Diamonds represent relationship sets.
- Lines link entity sets to relationship sets.
- Attributes listed inside entity rectangles.
- Underline indicates “primary key” attributes
o instructors have unique IDs and
o students have unique IDs

Reduction to relation schemas
- Both entity sets and relationship sets are expressed as relation schemas (tables) that
represent the contents of the database.
- A database which conforms to an E-R diagram can be represented by a collection of relation
schemas.
- For each entity set and relationship set there is a unique schema that gets the name of the
corresponding entity set or relationship set.
- Each schema has columns corresponding to attributes, which have unique names.
- An entity set reduces to a schema with the same attributes
o primary key → primary key
- A relationship set reduces to a schema whose attributes are the primary keys of the
participating entity sets and attributes of the relationship set
o The primary key is the combination of the primary keys of the participating entities
(it does not include the attributes of the relationship set!)

A relational database schema consisting of three tables:


instructor(instructor_id, name, salary)
student(student_id, name, tot_cred)
advisor (student_id, instructor_id, date)

Basic SQL
- Industry standard query language for RDBMS (relational database management systems)
- Query language: a language in which to express data analytics

What is SQL?
- Structured Query Language (SQL = “Sequel”)
- Industry standard query language for RDBMS (relational database management systems)
- designed by IBM, now an ISO standard

o available in most database systems (with local variations)
o has procedural flavor but is declarative in principle
o “reads like English”
- uses a relational database model where:
o tuples in a relation instance are unordered
o query results may contain duplicate tuples

Why SQL?
- Lingua franca of data intensive systems
o Relational databases
o MapReduce systems, such as Hadoop
o Streaming systems, such as Apache Spark and Apache Flink
o ...
- Stable mature language with 40+ years of international standardization and study
o SQL is essentially the industry standard for first order logic, i.e., predicate logic, i.e.,
the relational DB model (see 2ID50 in Q2)
- Via information systems, you will continue to interact directly or indirectly with SQL or SQL-
like languages for the coming decades …

Conventions in SQL
- Keywords are case-insensitive. They are often capitalized for better readability
- Table and column names can be case-sensitive (often configurable)
- The semicolon (;) at the end of an SQL statement is not mandatory for most database
systems

Data definition language


- SQL DDL: Allows the specification of a set of relations (i.e., the relational database schema)
and information about each relation, including:
o the schema for each relation
o The domain of values associated with each attribute
o Integrity constraints
- An important constraint for attributes: is it allowed to have missing values? Missing attribute
values get a special value NULL.
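As a small illustration (a hedged sketch using Python's built-in sqlite3 module and hypothetical table and column names), the DDL statement below specifies a relation schema, attribute domains, a primary key, and a NOT NULL constraint:

import sqlite3

con = sqlite3.connect(":memory:")           # throwaway in-memory database
con.execute("""
    CREATE TABLE account (
        acctNumber INTEGER PRIMARY KEY,     -- key attribute: uniquely identifies each row
        bName      TEXT    NOT NULL,        -- missing values (NULL) are not allowed here
        balance    REAL                     -- may be NULL if unknown
    )
""")
con.execute("INSERT INTO account VALUES (101, 'Eindhoven centrum', 250.0)")
print(con.execute("SELECT * FROM account").fetchall())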

Basic query structure
- The SQL data-manipulation language (DML) provides the ability to query information
- A typical SQL query has the form:
o SELECT instructor.instructor_id, instructor.name
o FROM instructor, advisor, student
o WHERE instructor.instructor_id = advisor.instructor_id and
 student.student_id = advisor.student_id and
 student.tot_cred < 100;
- SELECT lists attributes to retrieve
- FROM lists tables from which we query
- WHERE defines a predicate (i.e., a filter) over the values of attributes

The SELECT clause:


- The SELECT clause lists the attributes desired in the result of a query, it performs a “vertical”
restriction of the table
- Example: find the names of all branches in the loan relation
o SELECT branch_name
o FROM loan;

The SELECT clause: duplicates


- SQL allows duplicates in relations as well as in query results
- To eliminate duplicates, insert the keyword DISTINCT after SELECT
- Example: Find the names of all branches in the loan relation, and remove duplicates
o SELECT DISTINCT branch_name
o FROM loan;
- The keyword ALL specifies that duplicates should not be removed (default option)
o SELECT ALL branch_name
o FROM loan;

The SELECT clause: *


- An asterisk in the select clause denotes “all attributes”
o SELECT *
o FROM loan;
- This will just result in the same table as loan

The SELECT clause: calculations


- The SELECT clause can contain arithmetic expressions involving the operations +, –, ∗, and /, and operating on constants or attributes of tuples.
- The query:
o SELECT loan_number, branch_name, amount ∗ 100
o FROM loan;
would return a relation that is the same as the loan relation, except that the value of the attribute amount is multiplied by 100 (euro to cents)

The WHERE clause:


- The WHERE clause performs a “horizontal” restriction of the table, it specifies conditions on
the rows (records)
- Example: find all loan number for loans made at the Perryridge branch with loan amounts
greater than $1200.
o SELECT loan_number
o FROM loan
o WHERE branch_name = ‘Perryridge’ AND amount > 1200

The WHERE clause: operators


- Conditions can use the logical operators AND, OR, and NOT, and parentheses ( ) for grouping
- Comparisons =, <>, <, <=, >, >=, can be applied to results of arithmetic expressions.
- String comparison: operator LIKE and
o % the percent sign represents zero, one, or multiple characters
o Write strings between ‘ ‘
(amount <=1000 OR amount >1000000) AND (branch_name LIKE ‘%Eindhoven%’)
loan amounts not exceeding 1000 or greater than 1000000 and branch names containing
“Eindhoven” (e.g. Eindhoven-Woensel, or BigEindhoven1)

The FROM clause:


- The FROM clause lists the tables involved in the query
- Example: find all pairs of instructor and teaches entries
o SELECT ∗
o FROM instructor, teaches
- this generates every possible pair (instructor, teaches), with all attributes from both relations

Multiple tables in the FROM clause:


- Generating all pairs of entries is not very useful directly, but it is useful in combination with WHERE-clause conditions.

The rename operation AS:


- SQL allows renaming relations and attributes using the AS clause:
o old-name AS new-name
- Example: find the name, loan number and loan amount of all customers; rename the column
name loan_number to loan_id.

SELECT customer_name, borrower.loan_number AS loan_id, amount
FROM borrower, loan
WHERE borrower.loan_number = loan.loan_number

Tuple variables
- Tuple variables are defined in the FROM clause using the AS clause
- Keyword AS is optional and may be omitted, i.e. you can write borrower b instead of
borrower AS b
Find the name, loan number and loan amount of all customers having a loan at the Perryridge
branch:

SELECT customer_name, b.loan_number, amount
FROM borrower AS b, loan AS l
WHERE b.loan_number = l.loan_number
AND branch_name = 'Perryridge'

Tuple variables
- Find the names of all branches that have greater assets than some branch located in
Brooklyn.
SELECT DISTINCT b2.branch_name
FROM branch b1, branch b2
WHERE b2.assets > b1.assets AND b1.branch_city = ‘Brooklyn’

Set operations:
- The set operations UNION, INTERSECT, and EXCEPT operate on relations.
o A “set” is a collection of objects without repetition of elements and without any
particular order.
o E.g., each of us has a set of hobbies, which might be empty.

Set operations: union


- Given two sets A and B, the union of A and B is the set containing all
elements of both A and B.

Set operations: intersections


- Given two sets A and B, the intersection of A and B is the set
containing all elements appearing in both A and B.

Set operations: set difference – except


- Given two sets A and B, the difference of A and B is the set
containing all elements appearing in A but not in B. This is
denoted by EXCEPT in SQL.

Set operations:
- The set operations UNION, INTERSECT, and EXCEPT operate on relations (tables).
- Each of these operations automatically eliminates duplicates; to retain all duplicates, use the
corresponding multiset versions UNION ALL, INTERSECT ALL and EXCEPT ALL.

Aggregate functions:
- These functions operate on all values of a column (including duplicate values, by default),
and return a value
COUNT: number of values
MIN: minimum value
MAX: maximum value
AVG: average value (on numbers)
SUM: sum of values (on numbers)

Aggregate functions: group by


- Find the total assets in each city where the bank has a branch.
SELECT bCity, SUM(assets) AS totalAssets
FROM branch
GROUP BY bCity

Aggregate functions: HAVING clause


- Find the total assets in each city where the bank has a branch with the total assets of at least
100k.
SELECT bCity, SUM(assets) AS totalAssets
FROM branch
GROUP BY bCity
HAVING SUM(assets) >= 100000
- Note: predicates in the HAVING clause are applied after grouping, whereas predicates in the
WHERE clause are applied before grouping.

Aggregate functions: examples


- Find the names of customers and the number of their accounts for customers who have
more than one account and the total balance of their accounts is higher than €1200
SELECT customer.custName, COUNT (account.acctNumber)
FROM customer, account, depositor
WHERE customer.custName = depositor.custName AND depositor.acctNumber =
account.acctNumber
GROUP BY customer.custName
HAVING COUNT (account.acctNumber) > 1 AND SUM (account.balance) > 1200
- Find the names of branches for which the total of loan amounts is higher than the assets of
that branch
SELECT branch.bName
FROM branch, loan
WHERE branch.bName = loan.bName
GROUP BY branch.bName
HAVING SUM (loan.amount) > branch.assets

Lecture 5: (Monday 11-12-2023)

Data Aggregation and Sampling


Target size matters when reaction time is important:

Consequences for computer interfaces:


- Choosing the button size:
o Make important buttons large.
o Use corners as they are equivalent to large buttons.
- Reduce the distance:
o Put important menus under your mouse (right-click menu)
- Most people move the mouse from top left to bottom right -> let the user use that
movement.

Original (1D) Fitts’ law experiment


- Classic paper (from 1954) in the field of movement science; Fitts proposed to use a standard task to measure (human) motoric skill.
- Interaction task: select a target with width W positioned at a distance D from the current position.
- Fitts' law states that the average time needed to select the target is linearly related to the index-of-difficulty $ID = \log\left(1 + \frac{D}{W}\right)$
- Supported by empirical evidence
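A tiny worked sketch of the index-of-difficulty for a few hypothetical distance/width combinations (log base 2 is the common convention for Fitts' law; this is an assumption, adjust the base if the course defines it differently):

import math

def index_of_difficulty(distance, width):
    return math.log2(1 + distance / width)

for d, w in [(100, 18), (290, 18), (290, 6)]:     # hypothetical distances and target widths in pixels
    print(d, w, round(index_of_difficulty(d, w), 2))
# the far-away, small target (290, 6) has the highest ID, i.e. it is the hardest to hit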

A mouse experiment
Based on a standardized task in Human-Computer interaction
- Using a mouse or a trackpad, move onto a square shown in the center of the window
- After a random delay of 2 to 4 seconds a target disc appears
o in one of the eight possible directions (−135°, −90°, …, 135°, 180° relative to the x-axis)
o at a distance chosen at random from 100 to 290 pixels
o the target radius is 3, 6 or 9 pixels (chosen randomly)
- Move towards the blue disc and click on it

Data collected
- User characteristics (right-handed?, the major study program,…)
- Characteristics of the mouse, operating system, etc.
- Trial characteristics, like
o the trial number for this user (there might be a learning curve)
o delay
o target radius
o target position
o …
- Paths
o timestamp with the corresponding position for each trial

Research question examples


Air quality and traffic data
- What is the relationship between air quality and traffic?
Mouse experiment
- How does the task difficulty influence the reaction time of the user?

What data can we collect / do we get to answer the research question?


- Primary data: collected by you / your team (like in the mouse experiment)
- Secondary data: collected by others (like in the air quality – traffic data)
- Database schema for the mouse experiment:
o paths(user, trial, t, x, y), where t is a time stamp
o trials(user, trial, delay, input_method, mouse_acceleration, target_radius, target_x,
target_y, …)
o users(user, major, gender, right_handed, use_tue_laptop, …)

Formulate a hypothesis
First attempt:
- The reaction time is larger if the task is more difficult.
Questions about it:
1. What exactly is meant by the reaction time?
2. How is the task difficulty defined / measured?
Reaction time and task difficulty are examples of features.

Feature
- An object can be represented in data as a vector of features
- Feature is a measurable property or characteristic
- Some features are directly gathered during an experiment,
e.g. target width, dominant hand
- Some features are computed based on the collected data,
e.g. reaction time, length of a user’s path from a source to a target

Hypotheses refinement process:
1. Ask questions
- paths(user, trial, t, x, y)
- trials(user, trial, delay, input_method, mouse_acceleration, …, target_radius, target_x,
target_y,…)
- users(user, major, gender, right_handed, use_tue_laptop, …)

Ask questions! (data driven and / or based on domain knowledge)


- Could the task difficulty be related to the moment of target appearance?
- Could the task difficulty depend on the target radius? If so, how can we describe this
dependency?
- Could the position of the target contribute to task difficulty? If so, how?
- Could the task difficulty depend on the dominant hand? If so, how?
- ...

2. Answer questions (use data and domain knowledge)


Reaction time definition?
- Definition 1: the reaction time for a trial is the time between target appearance and the first
nonzero time stamp in the paths table.
o Can reaction times be so close to 0?
o Look at some concrete trials with very small reaction times. What could be an
explanation?

Reaction time: definition attempt 2


- The time from the moment when the target appears (t = 0) until the moment when the first point outside the circle with radius 10 pixels around the origin (x = 0, y = 0) is reached.
o No reaction times close to 0
o Reaction time according to definition 2 is slightly larger than the one based on definition 1. Possible consequences?
- Choose your definitions wisely!
- Check them on the data!

Generation features through data aggregation


Air quality and weather data (secondary data)
- You got data that was generated from raw data using data aggregation:
o Hourly precipitation amount (in mm)
o Mean wind speed (in m/s) during the 10-minute period preceding the time of
observation
o …
- Possible further aggregations:
o Daily precipitation amount
o …
Mouse experiment data (primary data)
- You get raw data that can be aggregated. Compute e.g.:
o The length of the path in a trial
o The length factor: the path length divided by the straight-line distance from origin to target
o The average reaction time per user
o The median reaction time per user
o …
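A minimal sketch (hypothetical path samples stored as (t, x, y) tuples, hypothetical target position) computing some of these derived features for one trial:

import math

path = [(0.00, 0, 0), (0.15, 2, 1), (0.30, 40, 25), (0.45, 118, 80), (0.60, 150, 101)]
target = (150, 100)                                   # hypothetical target position

# path length: sum of the distances between consecutive sample points
path_length = sum(math.dist(path[i][1:], path[i + 1][1:]) for i in range(len(path) - 1))

# length factor: path length relative to the straight origin-to-target distance
length_factor = path_length / math.dist((0, 0), target)

# reaction time (definition 2 above): first time stamp outside the 10-pixel circle around the origin
reaction_time = next(t for t, x, y in path if math.hypot(x, y) > 10)

print(round(path_length, 1), round(length_factor, 2), reaction_time)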

Scientific method: a way of thinking
- Scientific method is a systematic approach to conducting empirical research to obtain sound
answers to questions
- The scientific (research) method is used to
o to collect valid data for exploratory or confirmatory analysis
o test a hypothesis or theory (collect evidence for a claim)
- Empirical (experimental) research characteristics:
o procedures, methods and techniques that have been tested for validity and reliability
o unbiased and objective (research attitude)

Tools for scientific method


- Deduction
- Induction
- Verification by others
- Occam’s razor / Parsimony principle
- Statistics

Deduction vs induction
Deductive reasoning:
- Premise 1: Rain droplets collide with airborne particles during free fall.
- Premise 2: Today, there has been heavy rainfall in the region.
- Conclusion: today, the concentration of airborne particles in the air has been reduced by the rainfall.
- If the premises are true, the conclusion is valid.
- What if one of the premises is not true? (e.g. there was in fact no heavy rainfall)
Inductive reasoning:
- We make 100 measurements of air quality and traffic intensity and see that higher NO2
concentrations correspond to higher traffic intensity.
- Conclusion: the more cars, the higher the NO2 concentration in the air.
o 1. Do we have enough observations to generalize?
o 2. Were the observations made under a wide variety of conditions?
o 3. Are there observations in conflict with a derived law?

Occam’s Razor / Parsimony principle:
- A simpler explanation of the phenomenon is to be preferred
o A model with fewer variables is preferred, if it fits the data “equally well”.
o Newton’s laws are simpler than Einstein’s laws and they are still used for situations
when they explain the phenomenon equally well (the difference in outcomes would
be negligible).
- If you can explain the concentration of NO2 using just traffic flow intensity and the wind flow
well enough, you go for such a simple model. Otherwise, you have to involve more/other
variables.

Validity, reliability, and reproducibility


We discuss trustworthiness of data, data collection process, data measurement and data analytics
methods using concepts of
- validity
- reliability, and
- reproducibility

Validity and reliability


- Measurements or conclusions are valid when they accurately describe the real world
- Measurements or conclusions are reliable when similar results are obtained under
similar conditions
- Suppose the sensors used at a traffic measurement station malfunction and the number of
cars per hour is always two times lower than the real number
o our measurements are not valid – they do not describe the reality accurately
o they are reliable – they would consistently show the same numbers in repeated
experiments
- Reliability does not imply validity, nor does validity imply reliability!

Internal validity vs external validity


Internal validity: are the conclusions valid within the study?
- Is there a strong justification for the conclusion about a causal relationship or can there be
an alternative explanation?
o i.e. is it possible that weather conditions lead to both a higher traffic intensity and a
larger concentration of some chemical compound in the air?
External validity: can the conclusions of a scientific study be applied beyond the context of the
study?
- Will the reaction time in the mouse experiment depend on the task difficulty in the same
way if you consider people of different ages?

Reproducibility, verification by others
- Refers to the ability of others to replicate the findings
o If others do not get similar measurements/conclusions, analysis results will not be
accepted
Example: traffic
- Imagine you found that the amount of NO2 in the air is proportional to the number of cars
passing nearby, while the others do not see this relationship in their experiments. – so the
results are not reproducible!
- This can happen because of:
o Implicit assumptions, e.g. that the number of cars is an appropriate indicator for the
amount of car emissions, assuming that all cars use fuel (think of electric cars)
o your data or their data was collected in different, very specific conditions (e.g. a
power plant near their measurement station)
o the measurements were incorrect (e.g. malfunctioning measuring instruments)
o you omitted some facts/results when drawing conclusions
o …

Measurements: precision and accuracy


- In a measurement process
o Is the measurement process actually measuring what it is intended to measure?
o It involves both the definition of a measurement and the instruments used
o Example: mouse experiment
 What does reaction time mean? How is it measured?
- Precision of the instrument and accuracy of the measurements
o Precision refers to errors introduced by the measuring instrument, i.e. ±0.5 mm
(random errors)
o Accuracy refers to deviations from real values (systematic errors)

Possible threats to validity in measurements


- Sources of errors:
o Inadequacies of the technology employed
o Measurement process
o The definition of what is being measured
- Random errors (not forming any pattern) vs. systematic errors (consistent errors, like offset
errors or scale errors)
o e.g., every second car is measured whenever the traffic is intensive, so the numbers
are too low compared to reality – scale error with the factor 0.5
o A thermometer that always shows a temperature 2 °C higher than the real
temperature has an offset error

Studying path properties


- Two real paths: the black one and the blue one
- Data collected: for each path, we get coordinates of
orange dots
- Which questions can be answered with this data and
which not?
- How to compute the direction of the movement?

Population and sample
- Population is a complete set of all elements, like traffic and air quality data for every single
moment of time at every location in the Netherlands
o collecting the data for the whole population is too expensive, or even physically
impossible
o it might not be feasible to analyze the data for the whole population (when it is
available) because of the data size and the complexity of algorithms.
- A sample is any part of the population (whether it is representative or not)
o the data from the mouse experiment in GA2 is a sample, containing data collected by
the students following the DAE course
- A sample is biased if some part of the population is overrepresented compared to others
o young people and males are overrepresented in the data collected in the mouse
experiment, since the data is collected by students following the DAE course

Limitations due to sampling in data collection


Traffic and air quality
- Observations for one month only
o Is it enough to build models?
o Is it representative for the whole year?
- The measurements were made only at some locations in the Netherlands – a sample of all
possible locations
Mouse experiment
- The users are 1st year TU/e students: a non-representative sample of all computer users
w.r.t.
o their age
o educational background
o …

Convenience sampling
- Going for data that is easier to collect
o DAE students for the mouse experiment
- Advantages: saving time, effort, money, …
- Disadvantages: possible bias that is a threat to external validity
o Would the reaction time in the mouse experiment depend on the target size in the
same way for older people?

Random sampling
- Each individual is equally likely to be included into the sample
- Ignoring any knowledge of the population

Stratified random sampling


- Defining strata - disjoint parts forming the whole target population
o e.g. based on age– see the example
- Define your sample size, e.g. N
- Proportionate stratified random sampling: take a random sample from every stratum in the
proportion equal to proportion of this stratum in the population
- Disproportionate stratified random sampling: when you want to over-represent a particular
stratum in the sample
o a study of people with a rare medical condition: a sample should over-represent
these people compared to the whole population
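A sketch of proportionate stratified random sampling with pandas, assuming a population DataFrame pop with a categorical column age_group used as the stratum variable (both names are hypothetical); sampling the same fraction from every stratum keeps the strata in their population proportions:

frac = 0.10  # sample 10% of every stratum
sample = pop.groupby("age_group").sample(frac=frac, random_state=42)

For disproportionate sampling you would instead specify a different fraction (or absolute size) per stratum.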

Voluntary sampling
- individuals select themselves
o e.g. course evaluation questionnaires
- self-selection bias, the effects of which are difficult to measure
o Would more engaged students fill in the questionnaire?
o Or rather more unsatisfied students?

Can we fully trust our data?


Traffic and air quality
- Constant speed of 144 km/h all the time for a traffic measurement station ?
- Strange values for air quality measurements ?
- Missing values for some time periods ?
Mouse experiment
- Reaction time = 5 minutes ?
- The path length is 3 times longer than the distance from the origin to the target ?
- Atypical mouse settings ?

Need for data cleaning and filtering!

Data cleaning
- Is a process of detecting, diagnosing, and editing faulty data
o Incorrect data
o Incomplete (missing) data

Sources of problems with data collection


- Equipment or data transmission errors
o Malfunctioning of sensors
o Lost Internet connection – lost records
- Data collection circumstances
o Weather conditions, resulting in bad visibility and as consequence too low values for
the number of cars per hour
- Manual data entry procedures
o Typos in entries, e.g. 1890 instead of 1980
- Non-optimal data collection protocol / study setup
o No air quality measurement stations in the proximity of a traffic measurement
station
- …

Minimal checks of problems with data


- Incomplete (missing) data
o No air quality measurements for certain time moments at some measurement
stations
- Out-of-range data
o The car speed of 500 km/h
o Negative values for the concentration of NO2
- Inconsistent data
o Very different air quality measurements at two stations in a close proximity to each
other: which one is right?

o A period with high intensity of traffic in combination with a high deviation in the car
speed values

Handling inconsistent, invalid or missing data
- Discard all records (rows) with at least one inconsistent, invalid or missing value, or discard an
attribute (column) with lots of such values
o potentially losing a lot of data
o potentially introducing a bias in the data: suppose no car speed data could be gathered
when it is foggy and we remove all such periods from the data ⇒ conclusions based on
the remaining data set might not hold in general

- Impute values: fill in estimated values in place of inconsistent, invalid or missing values
o replace all missing values with a single constant value based on domain
knowledge or an estimate, like a feature mean/median computed from the available values
 risks: the estimate can be off, e.g. the car speed in a foggy period is likely to
be far below the mean value
o use data mining methods to estimate the likely value for the missing value based on the
values of other features, a feature mean for the given class (e.g. obtained with
clustering), time series analysis, …
 risks: the estimates introduce a bias in the data
- Work in the presence of missing data
In practice: try different options and see what works in your case (see the sketch below)
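A sketch of the first two options with pandas, assuming a DataFrame df with a numeric column speed containing missing values (names are hypothetical):

# Option 1: discard rows with at least one missing value
df_complete = df.dropna()

# Option 2: impute a constant estimate, e.g. the column median
df_imputed = df.copy()
df_imputed["speed"] = df_imputed["speed"].fillna(df_imputed["speed"].median())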

Noise reduction in time series data
- A time series is a sequence of pairs (t_n, x_n), where t_n is the observation time and x_n is the
observed value, such that t_n < t_{n+1}.
- For equispaced time series, the spacing of observation times (t_{n+1} − t_n) is constant; this
spacing is the sampling interval, and its reciprocal is the sampling frequency.
- Noise is an unwanted disturbance in time-series data.
o It can be in the form of small fluctuations present in the data, which you want to
remove to see trends
o or in the form of wrong measurements which you want to suppress

Median filter
- Choose the window size, e.g. 3
- Position the window at the beginning of the time series and compute
the median.
- Move the window by 1 and compute the next median value.
- Proceed till the end of time series.
- The filtered time series consists of the computed window medians.

Mean filter (moving average)


- Choose the window size, e.g. 3
- Position the window at the beginning of the time series and compute the mean.
- Move the window by 1 and compute the next mean value.
- Proceed till the end of time series.
- The filtered time series consists of the computed window means.
- Mean filter is more sensitive to outliers than median filter.
- Within the window, values far away (in time) have the same influence as values nearby
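Both filters are one-liners on a pandas Series; the sketch below assumes the observed values are in a Series ts and uses a window of size 3 (center=True places the window symmetrically around each point; the first and last values become NaN because the full window does not fit there):

median_filtered = ts.rolling(window=3, center=True).median()
mean_filtered = ts.rolling(window=3, center=True).mean()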

Gaussian filter
- Choose the filter width ω
- Consider a pair (t_n, x_n) from the time series
- Use the Gaussian kernel T for the chosen width ω
- Assign a weight w_i = T(t_n − t_i, ω) to each value x_i
- The further away in time, the lower the weight
- Compute the value of the filtered series as x̂_n = Σ_i w_i · x_i
- Note that Σ_i w_i = 1

Convolution filters
- The same idea as the Gaussian filter, but using differently shaped functions to generate the weights
o You can also define your own weights, but make sure that Σ_i w_i = 1
- Which kernel corresponds to the moving average (mean) filter?

Applying a convolution filter
- Let’s consider a convolution filter with weights [0.25, 0.5, 0.25] (i.e. w_{−1} = 0.25, w_0 = 0.5 and
w_1 = 0.25)
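Applying such a filter is a discrete convolution; a sketch with NumPy, assuming the observed values are in a 1-D array x (mode="same" keeps the output the same length as the input; the end points are then based on a partial window):

import numpy as np

weights = np.array([0.25, 0.5, 0.25])  # w_-1, w_0, w_1; they sum to 1
x_filtered = np.convolve(x, weights, mode="same")

Because this kernel is symmetric, the fact that np.convolve flips the kernel makes no difference here.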

Dependent, independent and confounding variables


- Variables are opposite to constants: variables change their value
- Independent variables are used to see how the change of their value will be reflected in a
change of value of a dependent variable
o Independent variables can sometimes be actively controlled by the experimenter
(the size of the target generated) and sometimes not (the traffic intensity)
- Statistical analysis might indicate that an independent variable appears to be correlated with
variation in the dependent variable, but the dependency is not necessarily causal
- Confounding variable is a variable that is not taken into account and that can provide an
explanation to the observed effect of the independent variable on the dependent variable
o Suppose you mined a model predicting the PM2.5 concentration based on weather
conditions. Wind, temperature and rain can lead to a decrease of the concentration
of PM 2.5 in the air, but they could also trigger an increase of traffic intensity, thus
making the effect of weather conditions on air pollution less visible
o We often do not know all potential confounding variables at the beginning of the
study

Feature generation: identifying informative variables
- Goal: reduce the number of data/variables in the dataset by creating new, more informative
variables from existing ones
- mouse trajectory:
o use the coordinates of the origin and the target to compute
 the distance between them
 the direction of the target w.r.t. the origin
 i.e. compute the polar coordinates for the target
- use the trace coordinates to split the path into segments
o can you propose an explanation for the shape of the last part of the path?

Mouse experiment: path features


- Data contain coordinates (X,Y) over time T (the path followed) but for problem statements
related to overshoot and correction movements we need additional measures (“features”)
that take the time aspect into account
- Examples of such features:
o PL “length of the path travelled”
o DT “distance to target” as a function of time T (Euclidean distance)
o ratio of PL and the Euclidean distance to the starting point (the minimal possible
travel distance)

Independent variables for the mouse experiment problem


- The angle to the target
- A ratio between some variables
- …
o paths(user, trial, t, x, y)
o trials(user, trial, delay, input_method, mouse_acceleration, target_radius, target_x,
target_y, …)
o users(user, major, right_handed, use_tue_laptop, …)
Candidates for dependent variables:
- Time to target
- Reaction time
- …

Path length by linear approximation
- The distance between (x_i, y_i) and (x_{i+1}, y_{i+1}) can be approximated by the Euclidean
distance between the two points:
d_{i,i+1} = √((x_{i+1} − x_i)² + (y_{i+1} − y_i)²)
- The length of the path P_{k,l} from point k to point l can then be computed as the sum
p_{k,l} = Σ_{i=k}^{l−1} d_{i,i+1}

- Remember that the precision of the approximation will be influenced by the precision of the
measurements!
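A sketch of this computation with NumPy, assuming arrays x and y holding the recorded coordinates of one trial in time order:

import numpy as np

def path_length(x: np.ndarray, y: np.ndarray) -> float:
    # sum of Euclidean distances between consecutive sample points
    return float(np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2)))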

Numerical differentiation – computing speed
- We constructed a new time series (t_i, p_i), where p_i is the length of the path from the origin at
moment t_i
- Now we want to compute the speed at moment t_i
- Approximate the derivative by a central difference:
p′(t_i) = (p_{i+1} − p_{i−1}) / (t_{i+1} − t_{i−1})
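A sketch of this approximation with NumPy, assuming arrays t and p holding the time stamps and path lengths of one trial; np.gradient uses central differences in the interior and one-sided differences at the two ends:

import numpy as np

speed = np.gradient(p, t)  # approximates dp/dt at every sample point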

Numerical differentiation in the presence of noise
- Smooth the data first using filtering techniques
- Compute the derivative on the smoothed time series
- Or: compute a single kernel that smooths and differentiates at once
p′(t_i) = (p_{i+1} − p_{i−1}) / (t_{i+1} − t_{i−1})
- Consider a smoothing kernel with weights [w_1, w_0, w_1], with w_0 + 2·w_1 = 1
- The direct filter to compute the smoothed p_{i+1} − p_{i−1} is [−w_1, −w_0, 0, w_0, w_1]
Computing the kernel for differentiation
- Consider a kernel with weights [w_1, w_0, w_1] (w_0 + 2·w_1 = 1)
- For the filtered values:
p_{i+1} − p_{i−1} = (w_1·p_i + w_0·p_{i+1} + w_1·p_{i+2}) − (w_1·p_{i−2} + w_0·p_{i−1} + w_1·p_i)
= −w_1·p_{i−2} − w_0·p_{i−1} + 0·p_i + w_0·p_{i+1} + w_1·p_{i+2}
- So the direct filter is [−w_1, −w_0, 0, w_0, w_1]

Lecture 6: (Monday 18-12-2023)
Hypothesis formulation and testing

Discrete uniform distribution


- Discrete uniform distribution: all outcomes have the same probability
P(X = k) = 1/n, k = 1, …, n
o X is a random variable denoting the outcome.
o The number of possible outcomes is finite or countable.
- Example: the number of eyes when throwing a fair dice: P(X = x) = 1/6
- Definition: A probability is a number in the interval [0, 1] that indicates how likely a certain
outcome is (the higher, the more likely the outcome).
- A probability mass function (pmf) P(X = x) describes the probability distribution for a discrete
random variable X, assigning a probability to every possible outcome x.

- Key property of probabilities: The sum of probabilities over all possible outcomes is equal to
1.

Probabilities – fair coin example


- Modelling successes (yes/no data) is like modelling flipping coins
- a coin might be biased, i.e. not fair: heads and tails might not be equally likely
- How likely do we get HHT if we flip a fair coin? (HHT = Head, Head, Tail)
- We indicate H with an orange disc and T with a blue disc.
- This boils down to counting sequences of two possible items (H and T).
- You can do such a calculation for any other sequence in the same way

Possible sequences – equal probabilities
- The number of possible sequences is 2 × 2 × 2 = 8: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT.
- Thus, the probability of HHT = 1/8. Notation: P(HHT) = 1/8.
- What is the probability of getting heads exactly 1 time? Options: HTT, THT, TTH, so probability = 3/8
- What is the probability of 0 or 1 heads? Options: TTT, HTT, THT, TTH, so probability = 4/8

Calculating probabilities
- Exercise 1: What is the probability to get only heads in a sequence of length 10? There are
2 · 2 · … · 2 = 2^10 different sequences.
- So the probability of only heads is P(10 heads) = 1/2^10 = 1/1024
- Note: 10 heads is the same as 0 tails.
- Exercise 2: What is the probability of having at least one tail in a sequence of length 10?
- Use the key property: probabilities of all possible disjoint outcomes add up to 1.
- P(0 tails) + P(at least one tail) = 1 (no other possibilities)
- So, P(at least one tail) = 1 − P(0 tails) = 1 − 1/1024 = 1023/1024
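These counting arguments can be verified by brute force; a sketch that enumerates all head/tail sequences of length 10 with itertools:

from itertools import product

seqs = list(product("HT", repeat=10))  # all 2^10 = 1024 sequences
p_all_heads = sum(s.count("T") == 0 for s in seqs) / len(seqs)  # 1/1024
p_at_least_one_tail = 1 - p_all_heads  # 1023/1024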

Binomial coefficients
- How many options are there to get 2 heads in 5 tries?
- Choose 2 positions for heads.
- 5 possibilities for the first head and 4 for the second one:

- (1,2); (1,3); (1,4); (1,5); (2,1);… ; (3,1);… ; (4,1); … ; (4,5) ─ 5x4=20 options, but double counts:
heads on positions 1,2 = heads on positions 2,1 !
- So there are (5×4)/2 = 10 options

General properties binomial coefficients


- Definition: factorial: n! = n · (n − 1) · … · 2 · 1
- 0! = 1
- Definition: binomial coefficient: C(n, k) = n! / (k! · (n − k)!) is the number of sequences consisting
of k items of one type and (n − k) items of another type
- Note: there is symmetry between the two types (5 coins choose 3 heads = 5 coins choose 2 tails),
so we have C(n, k) = C(n, n − k)
- Side remark: the binomial coefficient also appears in formulas for expressions like (a + b)^5
o Binomial formula of Newton

Application of binomial coefficients
- Probability of 2 heads in 5 coin flips?
- C(5, 2) = 5! / (2! · 3!) = 10 sequences with exactly 2 heads in 5 coin flips.
- The number of possible H/T sequences of length 5: 2^5 = 32
- Probability: P(2H in 5 tries) = C(5, 2) / 2^5 = 10/32

Probabilities – biased coin (unequal probabilities for head and tail)


- Example: biased coin with P(H) = 2/3 and P(T) = 1/3 and 3 coin flips
- P(HHT) = (2/3)*(2/3)*(1/3) = 4/27
- There are C(3, 2) = 3 sequences with 2 heads and 1 tail; the probability of each of them equals
4/27.
- P(2 heads) = 3 · (4/27) = 12/27

General setting: binomial distribution


- Let P(H) = p, so P(T) = 1 − p, and let X be a random variable denoting the number of heads in
n coin flips.
- Then the probability to get k heads in n coin flips is:
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
- The same formula holds in the general setting: n independent observations, each with
success probability p
o i.e. failure probability = 1 − p
o “successes” and “failures” instead of heads and tails, e.g. a train coming on time or
not
- This mathematical object is known as the binomial probability distribution (shortened to
binomial distribution). We write X ∼ Bin(n, p).

Cumulative distribution function
- The probability mass function (pmf) P(X = k) = C(n, k) · p^k · (1 − p)^(n − k) gives the probability of
exactly k successes.
- The cumulative distribution function F (often abbreviated as distribution function) gives
cumulative probabilities, for at most l successes:
F(l) = P(X ≤ l) = Σ_{k=0}^{l} C(n, k) · p^k · (1 − p)^(n − k)
- Note: for the binomial distribution, P(X = k) = P(X ≤ k) − P(X ≤ k − 1)
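Both functions are available in scipy.stats; a sketch for the fair-coin example with n = 5 and p = 0.5 (math.comb gives the binomial coefficient):

from math import comb
from scipy.stats import binom

comb(5, 2)            # binomial coefficient C(5, 2) = 10
binom.pmf(2, 5, 0.5)  # P(X = 2) = 10/32
binom.cdf(2, 5, 0.5)  # P(X <= 2)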


Expectation (mean): averaging probabilities
- The analogue of the sample mean for a probability distribution is called the expectation (mean).
- For a random variable X with a finite or a countable set of outcomes:
E(X) = Σ_k k · P(X = k)
- For a dice:
E(X) = (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5
- For a binomial distribution one can derive an explicit formula:
E(X) = n·p

Variance
- Sample variance: a number that indicates the spread of a data set.
- Variance: the analogue of sample variance for a probability distribution:
Var(X) = Σ_k (k − E(X))² · P(X = k) = E((X − E(X))²)
- For a dice:
Var(X) = (1/6) · ((1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²) = 35/12 ≈ 2.92
- For the binomial distribution:
Var(X) = n·p·(1 − p)
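A quick numerical check of the dice values with NumPy:

import numpy as np

outcomes = np.arange(1, 7)
probs = np.full(6, 1 / 6)
mean = np.sum(outcomes * probs)               # 3.5
var = np.sum((outcomes - mean) ** 2 * probs)  # 35/12 ≈ 2.92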

Continuous data
- Continuous probability distributions are meant for continuous data.

Reminder:
- numerical data - data that has intrinsic numerical value
o continuous data – data that can attain any value on a given measurement scale

 interval data (continuous data for which only differences have meaning,
there is no fixed “zero point”). Examples: temperature in Celsius, pH, clock
time, IQ scores, birth year, longitude.
 ratio data (continuous data for which both differences and ratios make
sense; it has a fixed “zero point”). Examples: movie budget, temperature in
Kelvin, distance, time duration.
o discrete data – data that can only attain certain values (e.g., integers). Examples: the
number of days with sunshine in a certain year, the number of traffic incidents.

Continuous distribution – density function
- Discrete distributions: probability mass function
- Continuous distributions: density function
- The value of the density function f has no direct interpretation! For any individual value:
P(X = x) = 0
- The area under the density curve shows probability:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
- The total area under the curve equals 1:
∫_{−∞}^{+∞} f(x) dx = 1
- E(X) = ∫ x · f(x) dx
- Var(X) = ∫ (x − E(X))² · f(x) dx
- σ = √Var(X)

Kernel density plots

Cumulative distribution function vs Empirical cumulative distribution function


- The (cumulative) distribution function F is the function mapping x to P(X ≤ x):
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du

Continuous vs discrete distributions
- Summation <-> integration
- Taking differences <-> differentiation

- Discrete (integer-valued): F(x) = Σ_{k ≤ x} P(X = k)   ↔   Continuous: F(x) = ∫_{−∞}^{x} f(u) du
- Discrete: P(X = k) = F(k) − F(k − 1)   ↔   Continuous: f(x) = F′(x)
- Discrete: Σ_k P(X = k) = 1   ↔   Continuous: ∫_{−∞}^{+∞} f(x) dx = 1

Normal distribution
- Notation: N(μ, σ²)
- a continuous probability distribution over real numbers (probability 0 for individual values)
- symmetric around the mean μ
- The distribution is fully specified by the values of the mean μ and standard deviation σ.
- Density: f(x) = (1 / (σ·√(2π))) · e^(−½·((x − μ)/σ)²)
- Other names for the normal distribution: Gaussian distribution, bell curve distribution.
- Software or lookup tables are needed to compute probabilities.

Standard normal distribution


- The special case μ = 0 and σ² = 1 is called the “standard normal distribution”
- If X ∼ N(μ, σ²), then (X − μ)/σ ∼ N(0, 1), i.e. z-scores of X have the standard normal distribution
- Z-score: shows how many standard deviations X is away from μ:
Z = (X − μ)/σ

Standard normal distribution: quantiles


- The normal quantile (percentile) z_a is the value such that for Z ∼ N(0, 1), P(Z ≤ z_a) = a
- Frequently used values:
o z_0.5 ≈ 0, i.e. P(Z ≤ 0) = 0.5
o z_0.05 ≈ −1.64, i.e. P(Z ≤ −1.64) = 0.05
o z_0.023 ≈ −2
o z_0.0013 ≈ −3
- Important: A z-score is computed for a given value x of a random variable X, e.g. for X ∼ N(183, 9.7)
and x = 173.3, the z-score = (173.3 − 183)/9.7 = −1
- A normal quantile is looked up for a given probability p, e.g. for p = 0.05, the normal quantile
z_0.05 = −1.64

Normal distribution: usage examples
Let X ∼ N(15, 5), with X being the time left till the start of an exam when a student enters the exam
room.
- What is the probability that a student comes to the exam too late?
o “Too late” means X < 0
o Compute the z-score of 0: (0 − 15)/5 = −3 and use Python or tables to find:
P(X ≤ 0) ≈ 0.0013
- By which time are students likely to arrive, meaning for which value t is the probability
P(X > t) = 0.9?
o P(X > t) = 1 − P(X ≤ t)
o So P(X ≤ t) = 0.1
o z_0.1 = −1.28, so (t − 15)/5 = −1.28
o t ≈ 8.6 minutes before the exam
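The same two questions answered with scipy.stats.norm; note that scipy takes the standard deviation (here 5) as the scale parameter:

from scipy.stats import norm

p_late = norm.cdf(0, loc=15, scale=5)  # P(X <= 0) ≈ 0.0013
t = norm.ppf(0.1, loc=15, scale=5)     # ≈ 8.6, since P(X <= t) = 0.1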

Almost normally distributed variables


- Birth weight
- Measurement errors
- User response time
- Network latency
- …

Estimations

Central limit theorem
- Closely related to the law of large numbers
- Let X̄ = (X_1 + X_2 + … + X_n)/n, where the X_i are independent random variables with
expectation E(X) and standard deviation σ_X (so Var(X) = σ_X²), and n is large.
- Then the distribution of X̄ can be approximated by N(E(X), σ_X²/n)
- Note: although X itself is not normally distributed, the sample mean X̄ is (in the limit n → ∞).

How fair is a coin? – Estimate of p


- How to find an estimate for the success parameter p of the binomial distribution, e.g. the
probability to get a head when flipping a coin?
- Toss a coin many times!
HHHTTHHTTHHTHTHHHHTTH …
- Count the number of H.
- Divide the number X of successes (heads) by the number n of observations (coin flips).
- Notation: p̂ = X/n – the estimate of p

Estimation – general setup


- p̂ = X/n
- E(X) = n·p, so E(p̂) = p
- Var(X) = n·p·(1 − p), so Var(p̂) = p·(1 − p)/n

Binomial distributions – many observations
Also here: central limit theorem
- p̂ is a random variable
- p̂ has (approximately) the normal distribution if your sample is large:
p̂ ∼ N(p, p·(1 − p)/n)
- The standard deviation of p̂ gets smaller for larger samples:
Var(p̂) = p·(1 − p)/n,  σ_p̂ = √(p·(1 − p)/n)

Confidence interval
- Confidence interval is an interval containing the true unknown value that we wish to
estimate (e.g., a mean or a proportion) at a given confidence level (e.g. 95%)
- Probability interpretation:
o When the estimate computation procedure is repeated many times for different
samples of the same size coming from the same population, the confidence interval
at a confidence level of 95% computed based on a sample will contain the true value of
the parameter in 95% of the cases, and it won’t in 5% of the cases.
- In this case: significance level α = 0.05, confidence level 1 − α = 0.95
- Most popular choices of α: 0.01, 0.05, 0.1. The context of the problem defines the choice!
You might need α = 0.0001!
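A sketch of a confidence interval for a proportion based on the normal approximation of p̂ described above; the function name and interface are illustrative, not part of the course material, and the approximation should only be trusted when np and n(1 − p) are large enough:

import numpy as np
from scipy.stats import norm

def proportion_ci(x: int, n: int, alpha: float = 0.05):
    p_hat = x / n
    z = norm.ppf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin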

Hypotheses and court
- Standard legal procedure:
- 1. suspect assumed to be not guilty by court (null hypothesis), prosecution believes the
suspect is guilty (alternative hypothesis)
- 2. prosecution brings evidence for guilt (data)
- 3. in case of insufficient evidence → acquittal (null hypothesis not rejected)
- 4. in case of sufficient evidence → conviction (null hypothesis rejected )

Note the asymmetry in the procedure!
Conclusion validity
- To which extent are conclusions influenced by assumptions made in the data analysis?

Assumptions behind hypothesis tests and confidence intervals


- Observations in your data set are independent from each other
- Observation in your data set come from the same probability distribution

- Tests and confidence intervals for proportions: n·p > 5 and n·(1 − p) > 5
- Tests and confidence intervals for means: normality or a “large sample size”
(these are conditions to ensure approximate normality)


- Software may not check for you whether these assumptions are met.
o If not, you may get wrong answers!
o It is your responsibility to check these assumptions

Outliers
- Distinguish between
o wrong/incorrect data
o data that do not fit a statistical model
- Outliers could distort analysis results
- Simple graphical tool: Box-and-Whisker plot
- Rule of thumb: data points more than 2 or 3 standard deviations away from mean are
suspect
- Outliers should not just be deleted unless there is a good contextual reason!

Normality testing
How?
- Graphical (gives insight why normality may not be appropriate) :
o Kernel density plot (good for global assessment of shape)
o Normal probability plot (good for detecting whether there are problems)
- goodness-of-fit test (gives an objective decision criterion):
o Anderson-Darling test
- Note: caution when n < 20 (a single outlier may distort everything)

Normality testing – kernel density plot


- good for “seeing” the global shape: symmetry, bell shape
- not good enough for the tails

Normality testing – normal probability plot


- similar to the ECDF (an improved cumulative histogram)
- Trick: transform y-axis so that ECDF becomes straight line

Normality testing: Anderson-Darling test


- This is a statistical test to be used together with the graphical methods. A statistical test gives
an objective answer; the graphical methods give insight into why the data are (or are not)
distributed like a normal distribution.
- H0: data comes from a normal distribution
- Ha: data does not come from a normal distribution
- Interpretation of the p-value: if p is small, then reject H0
- This test requires software to perform
- Practical choice: a small threshold (e.g. 0.01 instead of 0.05), since the t-test is robust against
moderate deviations from normality
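A sketch with scipy, assuming the sample is in a 1-D array data; note that scipy.stats.anderson reports the test statistic together with critical values per significance level rather than a single p-value:

from scipy.stats import anderson

res = anderson(data, dist="norm")
print(res.statistic)           # test statistic
print(res.critical_values)     # critical values ...
print(res.significance_level)  # ... for these significance levels (in %)
# reject H0 at a given level if the statistic exceeds the corresponding critical value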

What to do when normality fails?
- Means and proportions are constructed using sums.
- If the sample size is “large”, then such sums may be approximately normally distributed by
the Central Limit Theorem. In such cases confidence intervals and p-values can be trusted.
- There is no general rule what “large” is (it depends on the data; sometimes 50 is mentioned
as a rule of thumb).
