Session 4 - Exploratory Data Analysis - 2025

The document provides an overview of Exploratory Data Analysis (EDA), outlining its purpose, typical processes, and common analytical techniques. EDA is described as a method for understanding datasets through statistical techniques, including univariate, bivariate, and multivariate analyses. Key concepts such as detecting outliers, correlation, and regression are also discussed as essential components of EDA.

Uploaded by

My Thao

PROBLEM SOLVING IN BUSINESS MANAGEMENT
Session 4
Exploratory Data Analysis
M.Sc. Thien Nguyen
Email: thien.nguyen@isb.edu.vn
Phone: 0949088908
Agenda
1. Introduction to EDA
2. Common Analyses
Part I
Introduction To Exploratory Data Analysis
1. What is EDA?
2. The Typical Cycle of EDA

I. Introduction
What is EDA?

➤EDA is the method of studying and exploring a dataset to deeply understand it


➤EDA is typically done while the data is being prepared

The data can be:

➤Raw (not yet processed)
➤Uncleaned
➤Incomplete (containing missing data)
➤Redundant or duplicated

Source: Exploratory Data Analysis - An Important Step in Data Science


I. Introduction
What is EDA?

➤EDA can be considered as a set of statistical techniques for:


● Exploring
● Describing
● Summarising the nature of the data

Source: A Practical Introductory Guide to Exploratory Data Analysis

I. Introduction
What is EDA?

Typical Things To Do in EDA:

➤Screen the data to understand each data field (column)


➤Identify possible errors
➤Reveal the presence of outliers
➤Check the relationships between variables (correlation, causality)
➤Perform descriptive analysis (univariate, bivariate and multivariate) to summarize the most significant aspects
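
The first screening steps (missing values, duplicates) can be sketched in a few lines of Python. This is a minimal illustration only: the `records` list, its field names, and the helper functions `missing_per_field` and `count_duplicates` are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical records (not from the slides); None marks a missing value.
records = [
    {"branch": "Ha Noi", "gender": "Female", "amount": 120.5},
    {"branch": "HCM City", "gender": "Male", "amount": None},
    {"branch": "Ha Noi", "gender": "Female", "amount": 120.5},  # repeats row 0
    {"branch": "Da Nang", "gender": None, "amount": 75.0},
]


def missing_per_field(rows):
    """Count missing (None) values for each data field (column)."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            counts[field] += value is None  # True adds 1, False adds 0
    return dict(counts)


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = Counter(tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows)
    return sum(n - 1 for n in seen.values())


missing = missing_per_field(records)
duplicates = count_duplicates(records)
```

With a spreadsheet or a data-frame library the same checks are usually one-liners; the point here is only what is being counted.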

I. Introduction to EDA
Typical Cycle

➤EDA is an iterative process that starts with asking questions

1. Generate questions (or hypotheses) about the data
2. Find answers by analyzing, modeling and visualizing
3. Use the answers to refine the questions and generate new questions

Source: https://r4ds.had.co.nz/exploratory-data-analysis.html
Summary

EDA: Asking Questions & Answering

➤With Data Fields (columns)
● Name, meaning, relationship?
● Data type?
● No. of missing values?
● Common errors? Duplicates?

➤Univariate analysis
● Qualitative: What categories? Frequency of each category? Descriptive measures of each category
● Quantitative: Descriptive measures; Outliers? Anomalies?

➤Multivariate analysis (Qualitative & Quantitative Analysis)
● Covariance? Correlation?

➤Inferential statistics
● Regression? Clustering?
Source: https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/
Part III
Common Analyses
1. Univariate Analysis
2. Detecting Outliers
3. Multivariate Analysis
4. Regression Analysis

III Common Analyses
1. Univariate Analysis

➤Fundamental Measures in Univariate Analysis

● Measures of frequency: Number of Occurrences, Percentage
● Measures of central tendency: Mean, Median, Mode
● Measures of spread (dispersion/variability): Range, Variance & Standard Deviation, Standard Error
● Measures of position: Percentiles & Quantiles, Quartiles, Standard Scores
● Measures of shape: Skewness / Kurtosis (flatness or peakedness), Normal Distribution
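
Most of these measures are available in Python's standard `statistics` module. The sketch below uses a made-up `values` sample (an assumption, not data from the slides) and computes the sample variance and standard deviation (dividing by n - 1); shape measures such as skewness and kurtosis are left out because the standard library does not provide them.

```python
import statistics as st

# Hypothetical sample of a quantitative variable (not from the slides).
values = [12, 15, 15, 18, 21, 24, 30]

summary = {
    # Central tendency
    "mean": st.mean(values),
    "median": st.median(values),
    "mode": st.mode(values),
    # Spread
    "range": max(values) - min(values),
    "variance": st.variance(values),  # sample variance (n - 1)
    "std_dev": st.stdev(values),      # sample standard deviation
    # Position: the three cut points Q1, Q2 (median), Q3
    "quartiles": st.quantiles(values, n=4),
}

# Standard error of the mean = standard deviation / sqrt(n)
summary["std_error"] = summary["std_dev"] / len(values) ** 0.5
```

Note that `statistics.quantiles` defaults to the "exclusive" method, so its Q1 and Q3 can differ slightly from what a spreadsheet reports.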

Univariate Analysis: Most important measures

Qualitative (categorical & discrete) Variables
● Measures of Frequency: Charts / Graphs, Counts, Percentages

Quantitative (discrete & continuous) Variables
● Measures of Central Tendency: Mean, Median, Mode
● Measures of Spread/Dispersion: Max/ Min/ Range, Variance, Standard Deviation, Standard Error
● Measures of Position: Quartile, Quantile, Ranking
● Measures of Shape: Skewness, Kurtosis

III Common Analyses
2. Detecting outliers

➤What counts as an outlier depends on domain knowledge
➤A simple way to spot outliers is a boxplot

In statistics, a point is commonly considered an outlier if its absolute Z-score is greater than 3.0 (it lies more than three standard deviations from the mean)
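
Both checks can be sketched with the standard library. The Z-score function follows the rule stated above; the boxplot function assumes the common 1.5 × IQR whisker convention, which the slides do not spell out. The `amounts` data and both function names are hypothetical.

```python
import statistics as st


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = st.mean(values)
    std = st.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is the usual boxplot convention (an assumption here).
    """
    q1, _, q3 = st.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data: 19 ordinary values and one extreme value.
amounts = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
           49, 52, 48, 50, 51, 49, 50, 52, 48, 200]

z_flagged = zscore_outliers(amounts)
box_flagged = iqr_outliers(amounts)
```

In small samples the Z-score rule can never fire, because a single extreme point also inflates the standard deviation; the boxplot rule is less sensitive to that effect.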

III.3 Basic Multivariate Analyses

➤ Bivariate Analysis (analysis of two variables)
➤ Multivariate Analysis (analysis of many variables)

Bi-/Multi-variate Analysis

● Qualitative (categorical & discrete) variables: Frequency Table, Cross-tabulation (or Contingency Table of Counting)
● Qualitative combined with quantitative variables: Contingency Tables (Pivot Table of SUM, MEAN, MEDIAN…)
● Quantitative (discrete & continuous) variables: Covariance, Correlation, Regression
III.3 Basic Multivariate Analyses
(1) Contingency Table

Contingency Table (also called a two-way table)

➤Used to summarize and analyse relationships between two categorical variables
➤Cross-tabulation (crosstab): a simple way of summarizing frequency (COUNT, PERCENTAGE)
⇒ NOTE: Only for Qualitative (Categorical) columns

                    Gender
Branch        Female     Male   Sub-Total
Da Nang          117       93         210
Ha Noi           143      130         273
HCM City         277      240         517
Total            537      463        1000

Total number of sales in each city, by gender
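
A table of counts like this one can be built from row-level data by counting each (branch, gender) pair. The sketch below is illustrative only: the `sales` rows and the `crosstab` helper are hypothetical, and the tiny data set does not reproduce the figures in the table.

```python
from collections import Counter

# Hypothetical row-level data: one (branch, gender) pair per sale.
sales = [
    ("Ha Noi", "Female"),
    ("Ha Noi", "Male"),
    ("HCM City", "Female"),
    ("HCM City", "Female"),
    ("Da Nang", "Male"),
]


def crosstab(pairs):
    """Build a contingency table of counts with row and column totals."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = {r: {c: cells[(r, c)] for c in cols} for r in rows}
    for r in rows:
        table[r]["Sub-Total"] = sum(cells[(r, c)] for c in cols)
    table["Total"] = {c: sum(cells[(r, c)] for r in rows) for c in cols}
    table["Total"]["Sub-Total"] = len(pairs)
    return table


table = crosstab(sales)
```

Dividing each cell by the grand total (or by its row total) gives the PERCENTAGE version of the same table.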


III.3 Basic Multivariate Analyses
(1) Contingency Table

Contingency Table (also called a two-way table)

➤When a qualitative variable is combined with a quantitative variable, each cell holds an aggregate of the quantitative values instead of a count

● In Excel: PivotTable
● Measures: SUM, AVERAGE, MEDIAN, STD…

                      Gender
Branch          Female         Male    Sub-Total
Da Nang      39,155.36    27,994.39    67,149.75
Ha Noi       47,664.66    37,759.64    85,424.30
HCM City     94,923.93    75,469.45   170,393.38
Total       181,743.95   141,223.48   322,967.43

Total revenue in each city, by gender
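
Outside Excel, the same pivot idea is a group-by followed by an aggregate. In this minimal sketch the `rows` data and the `pivot` helper are hypothetical; swapping the `agg` argument switches between SUM, AVERAGE, MEDIAN and so on.

```python
import statistics as st
from collections import defaultdict

# Hypothetical row-level data: (branch, gender, revenue) per sale.
rows = [
    ("Ha Noi", "Female", 100.0),
    ("Ha Noi", "Female", 50.0),
    ("Ha Noi", "Male", 80.0),
    ("Da Nang", "Male", 30.0),
]


def pivot(records, agg=sum):
    """Group revenue by (branch, gender) and apply an aggregate to each group."""
    groups = defaultdict(list)
    for branch, gender, revenue in records:
        groups[(branch, gender)].append(revenue)
    return {key: agg(values) for key, values in groups.items()}


revenue_sum = pivot(rows)                  # SUM, as in the table above
revenue_mean = pivot(rows, agg=st.mean)    # AVERAGE
revenue_median = pivot(rows, agg=st.median)
```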


III.3 Basic Multivariate Analyses
(2) Covariance & Correlation

Covariance: measures how two random variables change together (how they move relative to each other)

➤ When the X values move away from the mean of X, how do the Y values move away from the mean of Y?
➤ Two types: positive covariance (the variables move in the same direction) vs. negative covariance (they move in opposite directions)
➤ Range of covariance value: -∞ < Cov(x,y) < +∞
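
For reference, the usual sample covariance can be written as follows. The slides do not say whether the sample form (dividing by n - 1) or the population form (dividing by n) is intended, so the sample form here is an assumption; \bar{x} and \bar{y} denote the means of the X and Y values.

```latex
\operatorname{Cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Each term is positive when x_i and y_i lie on the same side of their means and negative otherwise, which is why the sign of the sum gives the direction of the relationship.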

III.3 Basic Multivariate Analyses
(2) Covariance & Correlation

Correlation:

➤ Covariance cannot show if a relationship is "strong" or "weak", because its size depends on the units of the variables
➤ Correlation: a normalized version of covariance
⇒ Measures both the strength and direction of the linear relationship
➤ Correlation coefficient (Pearson):
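
The standard definition of the Pearson coefficient, stated here as a reference using the same Cov(x,y) notation, divides the covariance by the two standard deviations s_x and s_y:

```latex
r = \frac{\operatorname{Cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The units of x and y cancel in the ratio, which is what makes the result comparable across different pairs of variables.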

III.3 Basic Multivariate Analyses
(2) Covariance & Correlation

Correlation:

➤ The correlation coefficient is a ratio and has no unit
➤ Value: -1 ≤ corrcoef ≤ 1 (the values -1 and +1 occur for a perfect linear relationship)

III.3 Basic Multivariate Analyses
(2) Covariance & Correlation

➤Use a scatter plot to inspect the relationship visually
➤Calculate the correlation coefficient
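
The calculation step can be sketched directly from the definitions of covariance and Pearson correlation. The function names and the sample data below are hypothetical, and the sample (n - 1) form of covariance is assumed.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    # covariance(v, v) is the sample variance of v.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: ys rises with xs, ys_down falls with xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
ys_down = [10, 8, 6, 4, 2]

cov_xy = covariance(xs, ys)
r_up = pearson_r(xs, ys)
r_down = pearson_r(xs, ys_down)
```

Multiplying every y value by 10 multiplies the covariance by 10 but leaves the correlation coefficient unchanged, which illustrates why correlation is the better measure of strength.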

III Common Analyses
4. Regression

➤Find the trendline and regression formula

III Common Analyses
4. Regression

➤Find the regression formula

Oops! It seems like customers with higher payment are less happy!
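
A trendline of the form y = slope × x + intercept can be fitted with ordinary least squares. The sketch below assumes simple linear regression with one predictor; the `payment` and `rating` values are invented to mimic a downward trend and are not the data behind the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


# Hypothetical data: payment amount vs. satisfaction rating.
payment = [100, 200, 300, 400, 500]
rating = [9, 8, 7, 6, 5]

slope, intercept = fit_line(payment, rating)
predicted_at_250 = slope * 250 + intercept
```

A negative slope, as in this invented data, is what a finding like "higher payment, less happy" looks like in the regression formula. It describes an association in the data and does not by itself show that paying more causes lower satisfaction.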

THANK YOU
