Data Analytics Summary
2IAB0 – 2023/2024
Biermans, Cas
Biermanscas@gmail.com
Contents
Lecture 1: (Monday 13-11-2023)
Lecture 2: (Monday 20-11-2023)
Lecture 3: (Monday 27-11-2023)
Lecture 4: (Monday 4-12-2023)
Lecture 5: (Monday 11-12-2023)
Lecture 6: (Monday 18-12-2023)
Lecture 1: (Monday 13-11-2023)
Types of data analytics:
- Descriptive: insight into the past
- Predictive: looking into the future
- Prescriptive: data-driven advice on how to take action to influence or change the future
Data types:
- Categorical data: data that has no intrinsic numerical value:
o Nominal: two or more outcomes that have no natural order
e.g. movie genre or hair color
o Ordinal: two or more outcomes that have a natural order
e.g. movie ratings (bad, neutral or good), level of education
- Numerical (quantitative) data: data that has an intrinsic numerical value:
o Continuous: data that can attain any value on a given measurement scale
Interval data: equal intervals represent equal differences, there is no fixed
“zero point”
e.g. clock time, birth year
Ratio data: both differences and ratios make sense, there is a fixed "zero
point"
e.g. movie budget, distance, time duration
o Discrete: data that can only attain certain values (typically integers)
e.g. the number of days with sunshine in a certain year, the number of traffic
incidents
Reference table: to store “all” data in a table so that it can be looked up easily.
Demonstration table: to illustrate a point (with just enough data, or with a specific summary)
Exploratory data analysis (EDA):
• getting to know the data before doing further analysis
• extensively using plots
• generating questions
• detecting errors in data
Plots are a useful tool to discover unexpected relations and insights. Plots help us to explore and give
clues. Numerical summaries like averages help us to document essential features of data sets.
One should use both plots and numerical summaries. They complement each other. Numerical
summaries are often called statistics or summary statistics (note the double meaning of the word:
both a scientific field and computed numbers).
Summary statistics:
- Level: location summary statistics (typical values)
- Spread: scale summary statistics (how much do values vary?)
- Relation: association summary statistics (how do values of different quantities vary
simultaneously)
Pth percentile: the cut-off point below which P% of the data lies
The 0th percentile = the smallest element of the dataset
The 100th percentile = the largest element of the dataset
For a dataset with n observations, the interval between neighboring percentiles is 100/(n−1) %.
For a percentile P we compute its location in a dataset of n observations: L_P = 1 + (P/100) · (n−1)
Let l and h be the observations at positions ⌊L_P⌋ and ⌈L_P⌉ in the ordered dataset.
Apply linear interpolation: the Pth percentile is l + (L_P − ⌊L_P⌋) · (h − l).
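To make the rule concrete, here is a minimal Python sketch of this percentile computation (numpy is assumed to be available; np.percentile uses the same linear interpolation by default):

import numpy as np

def percentile(data, p):
    # P-th percentile with linear interpolation, following the rule above
    xs = np.sort(np.asarray(data, dtype=float))
    n = len(xs)
    loc = 1 + (p / 100) * (n - 1)           # location L_P (1-based)
    lo, hi = int(np.floor(loc)), int(np.ceil(loc))
    l, h = xs[lo - 1], xs[hi - 1]           # observations at floor/ceil of L_P
    return l + (loc - lo) * (h - l)         # linear interpolation

data = [15, 20, 35, 40, 50]
print(percentile(data, 40))                  # 29.0
print(np.percentile(data, 40))               # same result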
Sample variance: s² = 1/(n−1) · Σ_{i=1}^{n} (x_i − x̄)², often denoted by σ² or s².
Sample standard deviation: s = √( 1/(n−1) · Σ_{i=1}^{n} (x_i − x̄)² )
Median absolute deviation (MAD): the median of the absolute deviations |x_i − M| from the median M.
The higher these statistics, the more spread/variability in the data.
Standardization (z-score normalization):
The z-score (normalized value) of a value x shows how many standard deviations the value is from the mean:
z = (x − x̄) / s
Negative z-score: the value is below the mean.
Positive z-score: the value is above the mean.
The mean z̄ of the z-scores of a data set is 0 and their standard deviation s_z is 1.
Rule of thumb: observations with a z-score larger than 2.5 are considered to be extreme (“outliers”).
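A short numpy sketch of these spread statistics and standardization, on a small made-up sample:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
s = x.std(ddof=1)                           # sample standard deviation (divide by n-1)
z = (x - x.mean()) / s                      # z-scores
print(z.mean(), z.std(ddof=1))              # ~0 and 1, as stated above
print(np.abs(z) > 2.5)                      # rule-of-thumb outlier flags
mad = np.median(np.abs(x - np.median(x)))   # median absolute deviation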
Association statistics: Association statistics try to quantify the strength of the relation between two
variables (attributes).
The sign of an association statistic indicates whether it is:
- a positive association (e.g., between budget and profit)
- a negative association (e.g., between ice cream consumption and weight)
Sample covariance: s_xy = 1/(n−1) · Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)
Sample correlation: r_xy = s_xy / (s_x · s_y), where s_x is the standard deviation of x; it always holds that −1 ≤ r_xy ≤ 1.
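A quick check of both association statistics in numpy, on hypothetical budget/profit values:

import numpy as np

budget = np.array([10.0, 25.0, 40.0, 60.0, 90.0])
profit = np.array([5.0, 12.0, 30.0, 45.0, 80.0])
s_xy = np.cov(budget, profit)[0, 1]       # sample covariance (n-1 in the denominator)
r_xy = np.corrcoef(budget, profit)[0, 1]  # sample correlation, always in [-1, 1]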
Jitter plot:
- One-dimensional numerical data; the different x-coordinates have no meaning, they only spread the points to show the amount of data.
Tasks: finding clusters and outliers
Not suitable for large data sets
Scatter plot:
- Two-dimensional data with 2 numerical attributes (e.g. production budget and profit)
Tasks: investigate relations and see whether there are outliers
Cumulative histogram:
- One-dimensional numerical data
- The range of data values is split in bins
- The height of a bin [a,b) shows the count or percentage of the number of observations not exceeding b.
Tasks: understanding the distribution, exploring unexpected “jumps”, looking up percentiles or thresholds.
Bar chart:
- Two-dimensional data with:
o 1 categorical attribute
e.g. release month, genre
o 1 numerical attribute
e.g. the number of released movies
- Also used for one-dimensional categorical data with the derived numerical attribute being
the count or percentage of the number of observations per category
Tasks: looking up and comparing values
Box and whisker plot: convenient way to display summary statistics
- Box: 1st and 3rd quartile and the median inside it
- Outliers: dots/crosses/diamonds… for all values above Q3 + 1.5·IQR or below Q1 − 1.5·IQR
- Endpoints of whiskers: the maximal and minimal non-outlier values
- Optional: the indication of the mean
Kernel density plot: a smoothed alternative to the histogram.
- Bandwidth choice is important!
Violin plot:
Combination of a box-and-whisker plot and a kernel density plot, to get the best of both worlds.
ECDF is a function that for a given value x returns the fraction of observations that are less than or equal to x.
Summary advanced statistical plots:
- Kernel density plots:
o overcome the drawbacks of histograms because they do not have fixed bins, but
beware of bandwidth!
o good tools to explore distribution shapes (symmetry, skewness, unimodality,
bimodality,...)
- Box and whisker plots are simple but effective ways to compare shapes of groups.
- Violin plots combine the advantages of kernel density plot and box-and-whisker plots.
- The ECDF (empirical distribution function) overcomes the fixed bin problems of the
cumulative histogram.
Lecture 2: (Monday 20-11-2023)
Infographics:
- Made for telling or explaining a story.
- Focusing on narrative, not on data.
- Manually drawn and illustrated.
- Using a larger range of creative elements and styles.
- Using a presentation method that cannot simply be reused for a different dataset.
Sphygmograph:
- One of the first attempts to record signals (the heartbeat).
Visual definition:
- Visualization is the process that transforms abstract data into interactive graphical
representations for the purpose of exploration, confirmation, and communication.
Why visualize?
- Communication to inform humans.
- Exploration when questions are not well defined.
Communication:
- Specific aspects of a larger dataset are extracted and visualized.
- Annotations highlight important aspects and allow the reader to better connect the
presented information to their existing knowledge.
- The juxtaposition of upper and lower graphs with comparable axes and scales facilitates
comparison.
- The data are selected and often summarized to support specific (simple) analysis and
insights. Annotations highlight important aspects and facilitate comparison.
Purpose of visualization:
- Open exploration
- Confirmation
- Communication
High level actions:
- Analyze
- Search
- Query
Analyze:
- Visualization for consuming data:
o For end-users, often not technical.
o Needs good balance of details and message.
o Design matters more than for other tasks.
o Dataset will not be changed.
- Visualization for producing data:
o Extends the dataset.
o Usually interactive, more technical users.
o Additional meta-data (semantics)
o Additional data (volume)
o Additional dimensions (derived)
Goals:
When consuming data:
- Discover new information.
o Generate a hypothesis.
o Find support for a hypothesis.
- Present – communicate something you know.
- Enjoy – have fun with data visualization.
Annotate:
- Attach text or graphical elements to visualization elements.
Derive:
- Produce new data elements based on existing data.
Record:
- Save an artifact, create graphical history!
Search:
- Location:
o A “key” identifying an object/artefact.
- Target:
o An aspect of interest: trends, outliers, features.
Our eyes unconsciously draw a line through the points and “see” a maximum value.
Gestalt principles can help us make effective graphs:
- Position and the arrangement of visual elements is the most important channel for
visualizations.
Perception: colors
- Perception of color begins with 3 specialized retinal cells containing pigments with different
spectral sensitivities, known as cone cells.
- The output of these cells in the retina is used in the visual cortex and associative areas of the
brain.
- The human eye is sensitive to light, but not the same at all frequencies.
- Perceptual processing before optic nerve:
o One achromatic channel L (luminance)
o Two chromatic channels: the R-G and Y-B axes
- “Color blind” if one axis has degraded acuity
o 8% of men are red/green color deficient.
o Blue/yellow deficiency is very rare => blue/orange is safe to use.
Marks and channels
Data visualization makes use of:
- Marks (geometric primitives)
o Points
o Lines
o Areas
o Complex shapes
- Channels (appearance of marks)
o Position (horizontal, vertical, both)
o Color (HSL = hue, saturation, luminance)
o Size
o Shape (e.g. circles, diamonds)
o Angle, …
Encoding attributes:
Redundant encoding:
Using two or more channels to encode one attribute. E.g. hue and horizontal position
Arrangement:
How data values determine position and alignment of visual representatives.
- Express: attribute is mapped to spatial position along an axis (e.g. MWh per capita)
- Separate: emphasize similarity and distinction (using a categorical attribute, e.g. country)
- Order: emphasize order (using an ordinal attribute, e.g. year, if you show progression in time)
- Align: emphasize quantitative comparison (for a quantitative attribute, e.g. MWh per capita
of a country)
- Use: using an existing structure for arrangement (e.g. a map)
Mapping color:
- Usually, 3 channels
o Hue, Saturation, Luminance (HSL)
o RGB or CMYK are less useful.
- Ordering/how-much for ordinal and numerical attributes.
o Map to Luminance
o Map to Saturation
- What/where for categorical attributes.
o Map to Hue
- Transparency is possible.
o Useful for creating visual layers.
o But: is not combinable with luminance or saturation.
Properties and best uses of visual encodings:
(Make sure that you have understood points 1-4 very well before going into the visualization (point 5). Always think about what the purpose is.)
Design strategies: user-focus
- Consider reader’s context:
o Text: title, tags, and labels
Familiar to the reader or strange jargon?
o Colors
Will the reader have the same and right color associations?
o Color deficiencies
o Representation format (different devices, print, interactive)
o Directional orientation (reading direction)
- Compatibility with reality:
o Make the encoding of things and relationships well-aligned with the reality (the
reader’s reality)
o Patterns are easily recognized by humans, powerful tool to analyze visual and
auditory information.
o Violation of patterns is hard for humans, consistency is key!
Elements of a chart:
How to understand idioms (chart types)
- DATA: what is the data behind the chart?
o Number of categorical attributes
o Number of quantitative attributes
o Semantics of keys and values
- MARK: which marks (visual elements) are used?
o Points, lines, glyph/complex shape, …
- CHANNELS: how will the data be encoded in visual channels?
o Arrangement: horizontal and vertical position
o Mapping: colors, length, size, …
- TASKS: what are the supported tasks?
o Finding trends, outliers, distribution, correlation, clusters, …
Idioms: (the slides show a catalogue of concrete chart types, each described by its data, marks, channels, and tasks)
Best practices for idioms:
- Include a descriptive title.
- Put labels on the axes and include units if applicable.
- Make the plot readable (use appropriate font sizes, colors, contrast)
- Choose appropriate ranges for the axes.
- Include a legend when you use multiple colors, marks, …
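As an illustration of these best practices, a small matplotlib sketch of a labeled bar chart (the movie counts per month are made up):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
action = [12, 9, 15, 11]    # hypothetical counts of released movies
drama = [8, 14, 10, 13]

fig, ax = plt.subplots()
ax.bar([i - 0.2 for i in range(4)], action, width=0.4, label="Action")
ax.bar([i + 0.2 for i in range(4)], drama, width=0.4, label="Drama")
ax.set_xticks(range(4))
ax.set_xticklabels(months)
ax.set_title("Released movies per month (fictitious data)")  # descriptive title
ax.set_xlabel("Release month")
ax.set_ylabel("Number of movies")                            # labeled axes
ax.legend()                                                  # legend for the two colors
plt.show()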
Effectiveness:
Effectiveness specifies how well the visualisation helps a person with their tasks. Any depiction of
data requires the designer to make choices about how that data is visually represented.
- “It’s not just about pretty pictures”
- Lots of possibilities to visualise, only few are effective
- Effective data visualisations conform to user actions and targets!
Ethical considerations:
- Scientific integrity
o Plagiarism
o Data fabrication and “engineering”
- Data protection and privacy
o Was an ethical committee involved? (Usually mandatory for medical research and
anything involving patients, participants, and animals.)
o GDPR and other privacy-relevant data: do we have informed consent and
permissions?
Storing personally identifiable data needs clear information and always explicit consent.
Note: Who owns the data, who presents it to whom for which purpose?
Lecture 3: (Monday 27-11-2023)
Linear regression: fit a line ŷ = β₀ + β₁·x by minimizing the sum of squared deviations (SSD):
SSD = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − (β₀ + β₁·x_i))²
- The sample variance of the output values: σ_y² = 1/(n−1) · Σ_{i=1}^{n} (y_i − ȳ)²
- Use this to define the relative quality measure R²: R² = 1 − SSD / ((n−1) · σ_y²)
- The fraction of variation in output values explained through the model by variation in input
values.
- The higher R², the better the linear model (1 means all y_i = ŷ_i)
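A small numpy sketch of fitting a line and computing R² as defined above (the budget/profit numbers are hypothetical):

import numpy as np

x = np.array([10.0, 25.0, 40.0, 60.0, 90.0])   # input, e.g. production budget
y = np.array([5.0, 12.0, 30.0, 45.0, 80.0])    # output, e.g. profit

beta1, beta0 = np.polyfit(x, y, deg=1)          # least-squares estimates
y_hat = beta0 + beta1 * x
ssd = np.sum((y - y_hat) ** 2)
r2 = 1 - ssd / ((len(y) - 1) * np.var(y, ddof=1))   # the R^2 formula above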
There can be multiple input variables:
- Any number of input variables can be used.
- E.g. extend the profit regression model:
o Only production budget as input
o Extended with release date as input
- R2 always increases when adding input variables, beware of overfitting.
o If a model is trained with too many variables, it won't adapt well to new inputs!
- Difficult to visualize for more than one input variable.
Clustering:
k-means clustering algorithm:
1. Pick k random points as initial centroids.
2. Assign points to the nearest centroid.
3. Recompute centroids: mean of the points in each cluster.
4. Repeat steps 2 and 3 until clusters are stable.
How do we assess clustering quality?
- There is no known “output” to compare to (clustering is unsupervised)
- A clustering model essentially consists of a set of centroids.
- How well does a centroid represent an observation in a cluster?
o Good, if there is a small distance between them.
o Bad, if there is a large distance between them.
- So, strive for low within-cluster distance: the sum of the distances between cluster’s centroid
and each of its observations.
- A clustering’s within-cluster-distance total W(C) is the sum of the within-cluster-distance over
all clusters (smaller is better, larger is worse)
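A compact numpy sketch of the four k-means steps plus the W(C) measure (simplified: no handling of empty clusters, and a fixed number of iterations instead of a stability check):

import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]  # step 1
    for _ in range(n_iter):                                        # step 4 (simplified)
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                              # step 2
        centroids = np.array([points[labels == c].mean(axis=0)     # step 3
                              for c in range(k)])
    wc = sum(np.linalg.norm(points[labels == c] - centroids[c], axis=1).sum()
             for c in range(k))                                    # W(C): smaller is better
    return labels, centroids, wc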
Distances:
- Distances are fundamental in data mining!
Linear regression:
- Residuals / deviations: distance between points and linear models.
- Determines SSD, and so determine the optimal regression model.
- At the basis of quality measure R2
Clustering:
- Distances determine cluster assignment.
- Different attribute scales can be chosen, influencing the distance.
- At the basis of quality measure W(C)
What is distance?
- A measure for how close or far away things are from each other.
(origin = spatial distance; e.g. between two towns)
- A measure for how related things are.
(e.g. comparing apples and oranges in terms of attributes such as mass, volume, color, sugar
content, etc.)
- A single positive number: distances can be easily compared. (Close vs. far becomes good vs.
bad or in-group vs. out group)
(yes, we can fully compare apples and oranges if we define a distance!)
- There is no single appropriate distance…
Euclidean distance:
- Unrestricted movement, think of straight lines.
- Euclidean distance is also the typical default (not always good)
Network distance:
- Movement between “hotspots”, think of commuting from home to school by car.
- Good when:
o We know the network of possible movements.
Manhattan distance:
- Movement in only horizontal and vertical directions.
- Appropriate when movement is restricted to a fixed grid.
- A special case of network distance.
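The two most common distances side by side, in a few lines of numpy:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
euclidean = np.linalg.norm(a - b)   # straight line: sqrt(3^2 + 4^2) = 5.0
manhattan = np.abs(a - b).sum()     # horizontal + vertical steps: 3 + 4 = 7.0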
Decision trees:
How good is the tree?
Confusion matrix:
(rows: reality, columns: tree prediction)
              Tree: Yes   Tree: No
Reality: Yes  TP          FN
Reality: No   FP          TN
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
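The four quality measures in one small Python function (the counts passed in are hypothetical):

def classification_metrics(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

print(classification_metrics(tp=40, fn=10, fp=5, tn=45))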
Use entropy! It measures the purity of nodes.
(In the lecture's example tree, the “Resit” node is pure: 100% yes.)
How to decide?
- We compute the entropy for the whole dataset.
- For each possible split (calc, phys, …) we compute the average entropy over all outcomes of
the split (pass, resit, fail), weighted by how often such an outcome occurs.
- The entropy of the whole dataset, minus this weighted average entropy, constitutes the
information gain of that split.
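A sketch of the entropy and information-gain computation (the pass/fail labels and the binary split below are made up; the lecture's example used course grades):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

parent = ["pass"] * 6 + ["fail"] * 4
left = ["pass"] * 5 + ["fail"] * 1          # one branch of a candidate split
right = ["pass"] * 1 + ["fail"] * 3         # the other branch
n = len(parent)
avg = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
info_gain = entropy(parent) - avg           # pick the split with the largest gain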
Association rules:
From frequent itemsets to association rules:
- Find all frequent itemsets.
o Length-1: X
o Length-2: X ∩Y
o Length-3: X ∩Y ∩ Z
o … until nothing is frequent anymore, then stop.
- Create association rules by replacing ∩ by ⇒
o X ⇒Y ,Y ⇒ X
o X ⇒(Y ∩Z ),(Y ∩ Z )⇒ X , …
The confidence of a rule measures how often it holds:
- Confidence of a rule X ⇒ Y:
conf(X ⇒ Y) = |X ∩ Y| / |X|
- Strong association rules have high support and high confidence:
o supp(X ∩ Y) ≥ α
o conf(X ⇒ Y) ≥ β
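A tiny Python illustration of support and confidence, on made-up market-basket transactions:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))        # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}))   # 2/3: fraction of bread-buyers who buy milk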
Lecture 4: (Monday 4-12-2023)
Data redundancy
- Imagine we store account information in one single table in a straightforward way.
- John Doe has several accounts -> several rows in the table
- His address is the same in all rows -> redundancy!
- What about joint accounts?
Data inconsistency
- John moves to Boschdijk and the record is corrected only at the Eindhoven centrum branch → inconsistency!
- The address needs to be updated in all records of John!
Problems solved
Redundancy and inconsistencies are avoided:
- One record with John’s address
- Changing the address means changing one record
Primary key
- A minimal set of attributes of a table that uniquely defines each row of this table
o custName defines the address → primary key
o acctNumber defines the branch and the balance → primary key
o custName does not define the account number, the account number does not define
the custName → combination (custName, acctNumber) is a primary key
- Another way to look at the “key” of a table:
o If we remove all “non-key columns” from the table, all rows are unique!
- A primary key is a minimal set of attributes (columns) that uniquely identifies each record
(row) in the table:
o One single column or several columns
o No subset of this set is a key
o Sometimes it is better to introduce an id
Database models:
- A database model is a collection of tools for describing:
o Data
o Data relationships
o Data semantics (i.e., the meaning of the data)
o Data constraints
- Historically, there have been many proposals:
o Network and Hierarchical database models (1960-1970’s)
o Relational database model (1970-1980’s)
o Object-based database models (1980-1990’s)
o XML data model (1990-2000’s)
o RDF (graph) data model (2000-2010’s)
o …
- We study the relational database model, as it is the dominant practical model and the industry standard.
Schemas:
Example: suppose we have a bank database schema consisting of three tables:
- customer(custName, custStreet, custCity) table keeps track of each customer of the bank.
- account(acctNumber, bName, balance) table keeps track of each account of each branch of
the bank.
- depositor(custName, acctNumber) table keeps track of which customer is associated to
which account.
Instances:
- Instance – the actual content of the database at a particular point in time
o Analogous to the value of a variable
- Example: (John Doe, Kruisstraat, Eindhoven) is an instance of the schema customer
(custName, custStreet, custCity)
o Different terms used in the literature: an “entry”, a “tuple”, a “row”
What is database design?
- A database represents the information of a particular domain (e.g., organization, experiment)
o First and foremost: determine the information needs and the users
o Design a conceptual model for this information
o Determine functional requirements for the system: which operations should be
performed on the data?
- “Goodness” of Conceptual design:
o Accurately reflect the semantics of use in the modeled domain
Relationship sets (cont)
- An attribute can also be a property of a relationship set.
- For instance, the advisor relationship set between entity sets instructor and student may
have an attribute date
Attributes
- An entity is represented by a set of attributes, that is, descriptive properties possessed by all
members of an entity set.
o Example: entity sets
instructor = (ID, name, street, city, salary )
course = (course_id, title, credits)
- Domain – the set of permitted values for each attribute
o Example:
Instructor names are strings (e.g., “Jane Smith”)
Instructor salaries are numbers (e.g., 50000)
Reduction to relation schemas
- Both entity sets and relationship sets are expressed as relation schemas (tables) that
represent the contents of the database.
- A database which conforms to an E-R diagram can be represented by a collection of relation
schemas.
- For each entity set and relationship set there is a unique schema that gets the name of the
corresponding entity set or relationship set.
- Each schema has columns corresponding to attributes, which have unique names.
- An entity set reduces to a schema with the same attributes
o primary key → primary key
- A relationship set reduces to a schema whose attributes are the primary keys of the
participating entity sets and attributes of the relationship set
o The primary key is the combination of the primary keys of the participating entities
(it does not include the attributes of the relationship set!)
Basic SQL
- Industry standard query language for RDBMS (relational database management systems)
- Query language: a language in which to express data analytics
What is SQL?
- Structured Query Language (SQL = “Sequel”)
- Industry standard query language for RDBMS (relational database management systems)
- designed by IBM, now an ISO standard
o available in most database systems (with local variations)
o has procedural flavor but is declarative in principle
o “reads like English”
- uses a relational database model where:
o tuples in a relation instance are not ordered
o query results may contain duplicate tuples
Why SQL?
- Lingua franca of data intensive systems
o Relational databases
o MapReduce systems, such as Hadoop
o Streaming systems, such as Apache Spark and Apache Flink
o ...
- Stable mature language with 40+ years of international standardization and study
o SQL is essentially the industry standard for first order logic, i.e., predicate logic, i.e.,
the relational DB model (see 2ID50 in Q2)
- Via information systems, you will continue to interact directly or indirectly with SQL or SQL-
like languages for the coming decades …
Conventions in SQL
- Keywords are case-insensitive. They are often capitalized for better readability
- Table and column names can be case-sensitive (often configurable)
- The semicolon (;) at the end of an SQL statement is not mandatory for most database
systems
Basic query structure
- The SQL data-manipulation language (DML) provides the ability to query information
- A typical SQL query has the form:
o SELECT instructor.instructor_id, instructor.name
o FROM instructor, advisor, student
o WHERE instructor.instructor_id = advisor.instructor_id AND
student.student_id = advisor.student_id AND
student.tot_credit < 100;
- SELECT lists attributes to retrieve
- FROM lists tables from which we query
- WHERE defines a predicate (i.e., a filter) over the values of attributes
- For example, the query
o SELECT loan_number, branch_name, amount ∗ 100
o FROM loan;
would return a relation that is the same as the loan relation, except that the value of the
attribute amount is multiplied by 100 (euro to cents)
- SELECT ∗ retrieves all attributes; e.g., to find all pairs of instructor and teaches entries:
o SELECT ∗
o FROM instructor, teaches;
Tuple variables
- Tuple variables are defined in the FROM clause using the AS clause
- Keyword AS is optional and may be omitted, i.e. you can write borrower b instead of
borrower AS b
Find the name, loan number and loan amount of all customers having a loan at the Perryridge
branch:
SELECT customer_name, b.loan_number, amount
FROM borrower AS b, loan AS l
WHERE b.loan_number = l.loan_number
AND branch_name = ‘Perryridge’
Tuple variables
- Find the names of all branches that have greater assets than some branch located in
Brooklyn.
SELECT DISTINCT b2.branch_name
FROM branch b1, branch b2
WHERE b2.assets > b1.assets AND b1.branch_city = ‘Brooklyn’
Set operations:
- The set operations UNION, INTERSECT, and EXCEPT operate on relations (tables).
o A “set” is a collection of objects without repetition of elements and without any
particular order.
o E.g., each of us has a set of hobbies, which might be empty.
- Each of these operations automatically eliminates duplicates; to retain all duplicates, use the
corresponding multiset versions UNION ALL, INTERSECT ALL and EXCEPT ALL.
Aggregate functions:
- These functions operate on all values of a column (including duplicate values, by default),
and return a value
COUNT: number of values
MIN: minimum value
MAX: maximum value
AVG: average value (on numbers)
SUM: sum of values (on numbers)
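These aggregates can be tried out directly from Python with the built-in sqlite3 module; the account rows below are invented but follow the bank schema used earlier:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (acctNumber TEXT, bName TEXT, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?, ?)",
                [("A-101", "Downtown", 500.0),
                 ("A-102", "Perryridge", 400.0),
                 ("A-201", "Perryridge", 900.0)])
row = con.execute("SELECT COUNT(*), MIN(balance), MAX(balance), "
                  "AVG(balance), SUM(balance) FROM account").fetchone()
print(row)   # (3, 400.0, 900.0, 600.0, 1800.0)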
Lecture 5: (Monday 11-12-2023)
A mouse experiment
Based on a standardized task in Human-Computer interaction
- Using a mouse or a trackpad, move onto a square shown in the center of the window
- After a random delay of 2 to 4 seconds a target disc appears
o in one of the eight possible directions (−135°, −90°, …, 135°, 180° relative to the x-axis)
o at a distance chosen at random from 100 to 290 pixels
o the target radius is 3, 6 or 9 pixels (chosen randomly)
- Move towards the blue disc and click on it
Data collected
- User characteristics (right-handed?, the major study program,…)
- Characteristics of the mouse, operating system, etc.
- Trial characteristics, like
o the trial number for this user (there might be a learning curve)
o delay
o target radius
o target position
o …
- Paths
o timestamp with the corresponding position for each trial
Formulate a hypothesis
First attempt:
- The reaction time is larger if the task is more difficult.
Questions about it:
1. What exactly is meant by the reaction time?
2. How is the task difficulty defined / measured?
Reaction time and task difficulty are examples of features.
Feature
- An object can be represented in data as a vector of features
- Feature is a measurable property or characteristic
- Some features are directly gathered during an experiment,
e.g. target width, dominant hand
- Some features are computed based on the collected data,
e.g. reaction time, length of a user’s path from a source to a target
Hypotheses refinement process:
1. Ask questions
- paths(user, trial, t, x, y)
- trials(user, trial, delay, input_method, mouse_acceleration, …, target_radius, target_x,
target_y,…)
- users(user, major, gender, right_handed, use_tue_laptop, …)
Scientific method: a way of thinking
- Scientific method is a systematic approach to conducting empirical research to obtain sound
answers to questions
- The scientific (research) method is used to:
o collect valid data for exploratory or confirmatory analysis
o test a hypothesis or theory (collect evidence for a claim)
- Empirical (experimental) research characteristics:
o procedures, methods and techniques that have been tested for validity and reliability
o unbiased and objective (research attitude)
Deduction vs induction
Deductive reasoning:
- Premise 1: Rain droplets collide with airborne particles during free fall.
- Premise 2: Today, there has been heavy rainfall in the region.
- Conclusion: today, rainfall has removed part of the airborne particles from the air.
- If the premises are true, the conclusion is valid.
- Analogously, from premises about car emissions one can deduce: the more cars, the higher the NO2 concentration in the air (with other factors being equal).
- What if one of the premises is not true? (e.g. cars do not use fuel anymore)
Inductive reasoning:
- We make 100 measurements of air quality and traffic intensity and see that the higher NO2
concentration corresponds to the higher traffic intensity.
- Conclusion: the more cars, the higher NO2 concentration in the air.
o 1. Do we have enough observations to generalize?
o 2. Were the observations made under a wide variety of conditions?
o 3. Are there observations in conflict with a derived law?
Occam’s Razor / Parsimony principle:
- A simpler explanation of the phenomenon is to be preferred
o A model with fewer variables is preferred, if it fits the data “equally well”.
o Newton’s laws are simpler than Einstein’s laws and they are still used for situations
when they explain the phenomenon equally well (the difference in outcomes would
be negligible).
- If you can explain the concentration of NO2 using just traffic flow intensity and the wind flow
well enough, you go for such a simple model. Otherwise, you have to involve more/other
variables.
Reproducibility, verification by others
- Refers to the ability of others to replicate the findings
o If others do not get similar measurements/conclusions, analysis results will not be
accepted
Example: traffic
- Imagine you found that the amount of NO2 in the air is proportional to the number of cars
passing nearby, while others do not see this relationship in their experiments: the results are not reproducible!
- This can happen because of:
o Implicit assumptions, e.g. that the number of cars is appropriate as an indicator for the
amount of car emissions, assuming that all cars use fuel (think of electric cars)
o your data or their data was collected in different, very specific conditions (e.g. a
power plant near their measurement station)
o the measurements were incorrect (e.g. malfunctioning measuring instruments)
o you omitted some facts/results when drawing conclusions
o …
Population and sample
- Population is a complete set of all elements, like traffic and air quality data for every single
moment of time at every location in the Netherlands
o collecting the data for the whole population is too expensive, or even physically
impossible
o it might not be feasible to analyze the data for the whole population (when it is
available) because of the data size and the complexity of algorithms.
- A sample is any part of the population (whether it is representative or not)
o the data from the mouse experiment in GA2 is a sample, containing data collected by
the students following the DAE course
- A sample is biased if some part of the population is overrepresented compared to others
o young people and males are overrepresented in the data collected in the mouse
experiment, since the data is collected by students following the DAE course
Convenience sampling
- Going for data that is easier to collect
o DAE students for the mouse experiment
- Advantages: saving time, effort, money, …
- Disadvantages: possible bias that is a threat to external validity
o Would the reaction time in the mouse experiment depend on the target size in the
same way for older people?
Random sampling
- Each individual is equally likely to be included into the sample
- Ignoring any knowledge of the population
Voluntary sampling
- individuals select themselves
o e.g. course evaluation questionnaires
- self-selection bias, the effects of which are difficult to measure
o Would more engaged students fill in the questionnaire?
o more unsatisfied students?
Data cleaning
- Is a process of detecting, diagnosing, and editing faulty data
o Incorrect data
o Incomplete (missing) data
o Suspicious data, e.g. a period with high intensity of traffic in combination with a high
deviation in the car speed values
Handling inconsistent, invalid or missing data
- Discard all records (rows) with at least one inconsistent, invalid or missing value or discard an
attribute (column) with lots of such values
o potentially losing a lot of data
o potentially introducing a bias in the data: suppose no car speed data could be gathered
when it is foggy and we remove all such periods from the data ⇒ our conclusions will
not cover foggy conditions
Noise reduction in time series data
- A time series is a sequence of pairs (t_n, x_n), where t_n is the observation time and x_n is the observed value.
Median filter
- Choose the window size, e.g. 3
- Position the window at the beginning of the time series and compute
the median.
- Move the window by 1 and compute the next median value.
- Proceed till the end of time series.
- The filtered time series consists of the computed window medians.
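A direct translation of these steps into Python (window size 3; the output is shorter than the input because the window must fit entirely):

import numpy as np

def median_filter(x, w=3):
    return np.array([np.median(x[i:i + w]) for i in range(len(x) - w + 1)])

x = np.array([1.0, 2.0, 9.0, 3.0, 4.0])   # 9.0 is a noise spike
print(median_filter(x))                    # [2.0, 3.0, 4.0]: the spike is gone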
Gaussian filter
- Choose the filter width ω.
- Replace each value by a weighted average of the values in a window around it, with the weights following a Gaussian (bell-shaped) curve.
Convolution filters
- The same idea as Gaussian filter, but using differently shaped functions to generate weights
o You can also define your own weights, but make sure that Σ_i w_i = 1
- Which kernel corresponds to the moving average (mean) filter?
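As a sketch of the answer: the moving-average (mean) filter is the convolution filter whose weights are all equal (and sum to 1):

import numpy as np

x = np.array([1.0, 2.0, 9.0, 3.0, 4.0])
w = np.ones(3) / 3                       # uniform kernel, weights sum to 1
print(np.convolve(x, w, mode="valid"))   # [4.0, 4.67, 5.33]: each value is a window mean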
Feature generation: identifying informative variables
- Goal: reduce the number of data/variables in the dataset by creating new, more informative
variables from existing ones
- mouse trajectory:
o use the coordinates of the origin and the target to compute
the distance between them
the direction of the target w.r.t. the origin
i.e. compute the polar coordinates for the target
- use the trace coordinates to split the path into segments
o can you propose an explanation for the shape of the last part of the path?
Distance between consecutive trace points: d_{i,i+1} = √( (x_{i+1} − x_i)² + (y_{i+1} − y_i)² )
Length of the path between points k and l: p_{k,l} = Σ_{i=k}^{l−1} d_{i,i+1}
- Remember that the precision of the approximation will be influenced by the precision of the
measurements!
Numerical differentiation – computing speed
- We constructed a new time series (t_i, p_i), where p_i is the length of the path from the origin at
moment t_i
- Now we want to compute the speed at moment t_i
- Approximate the derivative by a difference quotient: v_i ≈ (p_{i+1} − p_i) / (t_{i+1} − t_i)
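A numpy sketch of both steps, path length and speed, on an invented trace (timestamps in seconds, coordinates in pixels):

import numpy as np

t = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
x = np.array([0.0, 5.0, 12.0, 20.0, 26.0])
y = np.array([0.0, 2.0, 3.0, 5.0, 6.0])

d = np.hypot(np.diff(x), np.diff(y))        # step lengths d(i, i+1)
p = np.concatenate([[0.0], np.cumsum(d)])   # path length p_i up to time t_i
v = np.diff(p) / np.diff(t)                 # speed as the difference quotient above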
Lecture 6: (Monday 18-12-2023)
Hypothesis formulation and testing
A probability mass function (pmf) P(X = x) describes the probability distribution of a discrete
random variable: it tells for each outcome how likely that outcome is (the higher, the more likely the outcome).
- Key property of probabilities: The sum of probabilities over all possible outcomes is equal to
1.
Example: flip a fair coin 3 times; all 8 outcome sequences are equally likely.
- What is the probability of getting heads exactly 1 time? Options: HTT, THT, TTH, so probability = 3/8
- What is the probability of 0 or 1 heads? Options: TTT, HTT, THT, TTH, so probability = 4/8
- Exercise 1: What is the probability to get only heads in a sequence of length 10? There are 2·2·…·2 = 2^10 = 1024 equally likely sequences, so the probability is 1/1024.
Calculating probabilities
Binomial coefficients
- How many options are there to get 2 heads in 5 tries?
- Choose 2 positions for heads.
- 5 possibilities for the first head and 4 for the second one:
- (1,2); (1,3); (1,4); (1,5); (2,1); …; (3,1); …; (4,1); …; (4,5), i.e. 5×4 = 20 options, but with double counts:
heads on positions 1,2 = heads on positions 2,1!
- So there are (5×4)/2 = 10 options
- Probability: P(2 heads in 5 tries) = (5 choose 2) / 2^5 = 10/32
P(X = k) = (n choose k) · p^k · (1−p)^(n−k)
The same formula holds in the general setting: n independent observations, each with
success probability p.
Cumulative distribution function
- The probability mass function (pmf) P(X = k) = (n choose k) · p^k · (1−p)^(n−k) gives probabilities for
exactly k successes.
- The cumulative distribution function (cdf) gives the probability of at most l successes:
F(l) = P(X ≤ l) = Σ_{k=0}^{l} (n choose k) · p^k · (1−p)^(n−k)
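Both functions are available in scipy (assumed installed), which avoids summing binomial coefficients by hand:

from scipy.stats import binom

n, p = 5, 0.5
print(binom.pmf(2, n, p))   # P(X = 2) = 10/32 = 0.3125
print(binom.cdf(1, n, p))   # P(X <= 1) = 6/32 = 0.1875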
Expectation
- E(X) = Σ_k k · P(X = k)
- For a die:
E(X) = (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5
- For a binomial distribution one can derive an explicit formula:
E ( X )=np
Variance
- Sample variance: a number that indicates the spread of a data set.
- Variance: the analogue of sample variance for a probability distribution:
Var(X) = Σ_k (k − E(X))² · P(X = k)
- For a die:
Var(X) = (1/6)·((1−3.5)² + (2−3.5)² + (3−3.5)² + (4−3.5)² + (5−3.5)² + (6−3.5)²) = 2 11/12 (≈ 2.92)
- For a binomial distribution: Var(X) = np(1 − p)
Continuous data
- Continuous probability distributions are meant for continuous data.
Reminder:
- numerical data - data that has intrinsic numerical value
o continuous data – data that can attain any value on a given measurement scale
interval data (continuous data for which only differences have meaning,
there is no fixed “zero point”). Examples: temperature in Celsius, pH, clock
time, IQ scores, birth year, longitude.
ratio data (continuous data for which both differences and ratios make
sense; it has a fixed “zero point”). Examples: movie budget, temperature in
Kelvin, distance, time duration.
o discrete data – data that can only attain certain values (e.g., integers). Examples: the
number of days with sunshine in a certain year, the number of traffic incidents.
Continuous distribution – density function
- Discrete distributions: probability mass function
- Continuous distributions: density function
- The value of the density function f has no direct interpretation!
P(X = x) = 0 for any single value x
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
∫_{−∞}^{+∞} f(x) dx = 1
E(X) = ∫ x · f(x) dx
σ = √Var(X)
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du
Continuous vs discrete distributions
- Summation <-> integration
- Taking differences <-> differentiation
Normal distribution
- Notation: N(μ, σ²)
- The distribution is fully specified by the values of the mean μ and standard deviation σ.
- Density: f(x) = (1 / (σ√(2π))) · e^(−½·((x−μ)/σ)²)
- Other names for normal distribution: Gaussian distribution, bell curve distribution.
- Software or lookup tables are needed to compute probabilities.
Standardization: Z = (X − μ) / σ
Normal distribution: usage examples
Let X ~ N(15, 5), with X being the time left until the start of an exam when a student enters the exam room (mean 15 minutes, standard deviation 5 minutes).
- What is the probability that a student comes to the exam too late?
o “Too late” means X < 0
o Compute the z-score of 0: (0 − 15) / 5 = −3 and use Python or tables to find:
P( X ≤ 0)≈ 0.0013
- By which time are students likely to arrive, meaning for which value t is the probability
P( X> t)=0.9 ?
o P ( X >t )=1−P ( X ≤ t )
o So P( X ≤ t)=0.1
o z_{0.1} = −1.28, so (t − 15) / 5 = −1.28
o t ≈ 8.6 minutes before the exam
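The same two computations with scipy instead of lookup tables:

from scipy.stats import norm

mu, sigma = 15, 5
print(norm.cdf(0, loc=mu, scale=sigma))     # P(X <= 0), about 0.0013
print(norm.ppf(0.1, loc=mu, scale=sigma))   # t with P(X <= t) = 0.1, about 8.6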
Estimations
Central limit theorem
- Closely related to the law of large numbers.
- Let X̄ = (X₁ + X₂ + … + Xₙ) / n, where the X_i are independent random variables with expectation E(X)
and standard deviation σ_X (so Var(X) = σ_X²), and n is large.
- Then the distribution of X̄ can be approximated by N(E(X), σ_X² / n).
- Note: although X is not normally distributed, the sample mean X̄ is (in the limit n → ∞).
Estimating a proportion p (e.g., of a coin landing heads):
- Flip the coin n times and count the number of H.
- Divide the number X of successes (heads) by the number n of observations (coin flips).
- Notation: p̂ = X / n – estimate of p
- By the central limit theorem: p̂ ≈ N(p, p(1−p)/n)
- The standard deviation of p̂ gets smaller for larger samples!
- Var(p̂) = p(1−p)/n, σ_p̂ = √( p(1−p)/n )
Confidence interval
- Confidence interval is an interval containing the true unknown value that we wish to
estimate (e.g., a mean or a proportion) at a given confidence level (e.g. 95%)
- Probability interpretation:
o When the estimate computation procedure is repeated many times for different
samples of the same size coming from the same population, the confidence interval
at confidence level 95% computed based on a sample will contain the true value in
about 95% of the cases.
- Most popular choices of α: 0.01, 0.05, 0.1. The context of the problem defines the choice!
You might need α = 0.0001!
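A sketch of a 95% confidence interval for a proportion, using the normal approximation from the central limit theorem (the 540-out-of-1000 sample is invented):

import numpy as np
from scipy.stats import norm

x, n = 540, 1000
p_hat = x / n
z = norm.ppf(1 - 0.05 / 2)                   # ~1.96 for confidence level 95%
half = z * np.sqrt(p_hat * (1 - p_hat) / n)  # half-width of the interval
print(p_hat - half, p_hat + half)            # ~(0.509, 0.571)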
Hypotheses and court
- Standard legal procedure:
- 1. suspect assumed to be not guilty by court (null hypothesis), prosecution believes the
suspect is guilty (alternative hypothesis)
- 2. prosecution brings evidence for guilt (data)
- 3. in case of insufficient evidence → acquittal (null hypothesis not rejected)
- 4. in case of sufficient evidence → conviction (null hypothesis rejected )
Note the asymmetry in the procedure!
Conclusion validity
- To which extent are conclusions influenced by assumptions made in the data analysis?
Outliers
- Distinguish between
o wrong/incorrect data
o data that do not fit a statistical model
- Outliers could distort analysis results
- Simple graphical tool: Box-and-Whisker plot
- Rule of thumb: data points more than 2 or 3 standard deviations away from mean are
suspect
- Outliers should not just be deleted unless there is a good contextual reason!
Normality testing
How?
- Graphical (gives insight into why normality may not be appropriate):
o Kernel density plot (good for global assessment of shape)
o Normal probability plot (good for detecting whether there are problems)
- goodness-of-fit test (gives objective decision criterion):
- Practical choice: use a small threshold (e.g. 0.01 instead of 0.05), since the t-test is robust against
moderate deviations from normality
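The slides do not name a specific goodness-of-fit test; the Shapiro-Wilk test in scipy is one common choice and can serve as a sketch:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=100)   # synthetic, truly normal data
stat, p_value = shapiro(sample)
print(p_value > 0.01)   # True here: no evidence against normality at this threshold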
What to do when normality fails?
- Means and proportions are constructed using sums.
- If the sample size is “large”, then such sums may be approximately normally distributed by
the Central Limit Theorem. In such cases confidence intervals and p-values can be trusted.
- There is no general rule what “large” is (it depends on the data; sometimes 50 is mentioned
as a rule of thumb).