6) Exploratory Data Analysis

Uploaded by

karlamarsu97

EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a way to analyze data sets, often using data visualization
methods, to summarize the main characteristics of the data.
• EDA helps in better understanding the different features of the data, the relationships
between them, and in determining the statistical techniques appropriate for the data set.
• The matplotlib and seaborn libraries are commonly used for EDA in Python.
Univariate Analysis (1/7)
• A dataset may contain one or more features/variables/columns.
• Univariate Analysis provides summary statistics only on one variable.
• Univariate Analysis only describes the data and helps identify any patterns in the data.

Categorical vs Continuous Data

• There are two types of data:
• Categorical/Discrete, e.g., spam vs. not spam, male vs. female
• Continuous, e.g., age of a population
• EDA is performed differently on the two types of data.
Univariate Analysis (2/7)
• Consider the following dataframe, which has 5 features describing iris plants.
• The last column is categorical, while the rest are continuous.
Univariate Analysis – Continuous Data (3/7)
Scatter Plot
• We can use a number of graphs to perform EDA on continuous data.
• In the given figure, we use the .scatterplot() function of the seaborn library to plot the
‘petal.length’ column of the dataframe.
• The hue parameter assigns a different color to each data point based on its category
in the column passed to hue.
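A minimal sketch of such a scatter plot follows. The measurements are made up for illustration, and the 'row' column is a hypothetical helper added only to give the X-axis something to plot against; only the 'petal.length' and 'variety' column names come from the slides.

```python
import pandas as pd
import seaborn as sns

# Made-up values; only the column names come from the text.
iris = pd.DataFrame({
    "petal.length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "variety": ["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
})
iris["row"] = range(len(iris))  # hypothetical helper column used as the X-axis

# One point per row; hue colors each point by its category.
ax = sns.scatterplot(data=iris, x="row", y="petal.length", hue="variety")
ax.set_title("petal.length by row, colored by variety")
# In a script, call matplotlib.pyplot.show() to display the figure.
```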
Univariate Analysis – Continuous Data (4/7)
Strip Plot
• Strip plots are also a good way to analyze the distribution of the variables for each
category.
• In the given figure, we have plotted ‘petal.length’ column for each category in the
‘variety’ column.
• The X-axis shows the categories, and the Y-axis shows the ‘petal.length’ values that
fall within each category.
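A strip plot along these lines could be sketched as follows, assuming seaborn's stripplot() function and made-up measurements (only the column names come from the slides).

```python
import pandas as pd
import seaborn as sns

# Made-up values; only the column names come from the text.
iris = pd.DataFrame({
    "petal.length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "variety": ["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
})

# Categories on the X-axis, the continuous values on the Y-axis.
ax = sns.stripplot(data=iris, x="variety", y="petal.length")
# In a script, call matplotlib.pyplot.show() to display the figure.
```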
Univariate Analysis – Continuous Data (5/7)
Distribution Plot
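• To plot the distribution of a variable/feature/column, use the .distplot() function of the
seaborn library. Note that .distplot() has been deprecated since seaborn 0.11 in favor of
.histplot() and .displot().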
• In this figure, we plot the distribution of the ‘petal.length’ column.
Univariate Analysis – Continuous Data (6/7)
Histograms
• To analyze the frequency of values, we can plot a histogram using the .hist() function of
the matplotlib library.
• In this figure, we plot the frequency of values in the ‘petal.length’ column.
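A minimal matplotlib sketch, with made-up values and an arbitrarily chosen bin count of 5 (both are assumptions for illustration):

```python
import matplotlib.pyplot as plt

# Made-up petal lengths for illustration.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(petal_length, bins=5)
ax.set_xlabel("petal.length")
ax.set_ylabel("frequency")
# In a script, call plt.show() to display the figure.
```

The returned counts always add up to the number of values plotted, which is a quick sanity check on the binning.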
Univariate Analysis – Continuous Data (7/7)
Box Plot
• A box plot summarizes the data using the five-number summary: minimum value, first
quartile (Q1), second quartile (Q2, the median), third quartile (Q3), and maximum value.
• We can use the .boxplot() function of either matplotlib or seaborn.
[Figure: box plot annotated with the minimum value, Q1, Q2 (median), and maximum value]
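The five-number summary can also be computed directly, which helps when reading the plot. The sketch below uses NumPy percentiles and matplotlib's boxplot() on made-up values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for illustration.
values = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])

# Five-number summary: minimum, Q1, Q2 (median), Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)

fig, ax = plt.subplots()
ax.boxplot(values)
# In a script, call plt.show() to display the figure.
```

One caveat: by default matplotlib draws the whiskers out to at most 1.5 times the interquartile range from the box, so when extreme values are present the whisker ends may not coincide with the minimum and maximum.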
Univariate Analysis – Categorical Data (1/2)
Count Plot
• We can use count plots to perform EDA on categorical data.
• The .countplot() function of the seaborn library plots the total count of each value as a
bar chart.
• In the given figure, we see that we have equal instances of each category in the
dataframe.
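A count plot of this kind could be sketched as below, with made-up rows that mirror the balanced categories described above. value_counts() gives the same totals as numbers.

```python
import pandas as pd
import seaborn as sns

# Made-up rows; only the column name comes from the text.
iris = pd.DataFrame({
    "variety": ["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
})

# Numeric counts per category.
counts = iris["variety"].value_counts()
print(counts)

# The same counts as a bar chart.
ax = sns.countplot(data=iris, x="variety")
# In a script, call matplotlib.pyplot.show() to display the figure.
```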
Univariate Analysis – Categorical Data (2/2)
Pie Chart
• We can also visualize the proportion of each category using a pie chart as shown in the
figure.
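One way to draw such a pie chart is with matplotlib's pie() on the category counts. The data below is made up, and the autopct format string is just one reasonable choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows; only the column name comes from the text.
variety = pd.Series(["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
                    name="variety")
counts = variety.value_counts()

fig, ax = plt.subplots()
# autopct prints each slice's share of the total as a percentage.
wedges, texts, autotexts = ax.pie(
    counts.values, labels=counts.index.tolist(), autopct="%1.1f%%"
)
# In a script, call plt.show() to display the figure.
```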
Bivariate Analysis – Continuous & Continuous (1/4)
• Bivariate Analysis is used to study the relationship between exactly two
variables/features/columns of the data set.
• Consider the following dataframe.
Bivariate Analysis – Continuous & Continuous (2/4)
Scatter Plot
• If the two variables in question are both continuous, we can plot a scatter plot
between them to get an idea of their relationship.
• In the given figure, we plot the ‘Age’ column of the dataframe against the ‘Fare’
column, both of which are continuous.
• The plot shows no obvious relationship between the two features, although a scatter
plot alone cannot establish that they are independent.
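A sketch of this scatter plot, using a handful of made-up passenger rows (only the 'Age' and 'Fare' column names come from the slides):

```python
import pandas as pd
import seaborn as sns

# Made-up passenger rows for illustration.
titanic = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
})

# One continuous feature per axis.
ax = sns.scatterplot(data=titanic, x="Age", y="Fare")
# In a script, call matplotlib.pyplot.show() to display the figure.
```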
Bivariate Analysis – Continuous & Continuous (3/4)
Correlation
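• We can also find the correlation between two continuous features.
• If the correlation coefficient is close to +1 or -1, the two features have a strong linear
relationship; a value near 0 indicates little or no linear relationship.
• Use the .corr() method of a pandas dataframe to compute the correlation between features.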
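For illustration, pandas offers .corr() both on a pair of Series and on a whole dataframe. The rows below are made up; .corr() computes the Pearson coefficient by default.

```python
import pandas as pd

# Made-up passenger rows for illustration.
titanic = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
})

# Correlation between one pair of columns (a single number in [-1, 1]).
pair_corr = titanic["Age"].corr(titanic["Fare"])

# Correlation between every pair of numeric columns (a square dataframe).
corr_matrix = titanic.corr()
print(pair_corr)
print(corr_matrix)
```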
Bivariate Analysis – Continuous & Continuous (4/4)
Heatmap
• We can also use a heatmap to graphically visualize the correlation between features.
• Use the .heatmap() function of the seaborn library to create a heatmap.
• Provide the correlation dataframe that we created in the previous slide as an argument
to the .heatmap() function.
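A minimal sketch of a correlation heatmap follows. The rows are made up, and the annot, vmin, vmax, and cmap settings are illustrative choices rather than anything the slides prescribe.

```python
import pandas as pd
import seaborn as sns

# Made-up passenger rows for illustration.
titanic = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2],
})
corr_matrix = titanic.corr()

# annot=True writes each coefficient into its cell;
# vmin/vmax pin the color scale to the full correlation range.
ax = sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
# In a script, call matplotlib.pyplot.show() to display the figure.
```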
Bivariate Analysis – Categorical & Categorical (1/3)
• Consider the following dataframe from the previous slide.
• We will see how we can use a bar plot to identify any relation between two categorical
variables, namely ‘Pclass’ and ‘Survived’.
Bivariate Analysis – Categorical & Categorical (2/3)
• We first select the two columns that we are interested in, namely ‘Pclass’ and
‘Survived’.
• We then group this new dataframe by the ‘Pclass’ column and apply the .sum()
function to find the total number of survivors in each passenger class.
• The output of this part is shown below.
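The select, group, and sum steps might look like this in pandas. The rows are made up for illustration; 'Survived' is assumed to be coded as 1 for survivors and 0 otherwise.

```python
import pandas as pd

# Made-up passenger rows; Survived is assumed to be 1 (yes) or 0 (no).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
    "Age": [38, 35, 54, 14, 27, 22, 26, 2],
})

# Keep the two columns of interest, group by class, and add up the 1s.
survivors = titanic[["Pclass", "Survived"]].groupby("Pclass").sum()
print(survivors)
```

Because .sum() yields raw counts, a class with more passengers can show more survivors simply by being larger. Replacing .sum() with .mean() gives the survival rate per class instead.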
Bivariate Analysis – Categorical & Categorical (3/3)
• We then plot a bar plot, with ‘Pclass’ on the X-axis.
• As evident, more passengers in ‘Pclass’ 1 were able to survive than those from ‘Pclass’
2 or 3.
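A bar plot of the grouped counts can be sketched with the pandas plotting interface, which draws through matplotlib. The rows are made up for illustration.

```python
import pandas as pd

# Made-up passenger rows; Survived is assumed to be 1 (yes) or 0 (no).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})
survivors = titanic.groupby("Pclass")["Survived"].sum()

# The group labels (Pclass) become the X-axis categories.
ax = survivors.plot(kind="bar")
ax.set_ylabel("number of survivors")
# In a script, call matplotlib.pyplot.show() to display the figure.
```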
Bivariate Analysis – Continuous & Categorical (1/3)
• Sometimes, we want to find out the relationship between a continuous and a
categorical variable.
• Consider the following dataframe from the previous slide.
Bivariate Analysis – Continuous & Categorical (2/3)
Box Plot
• We can use a box plot to perform EDA on continuous and categorical data.
• In the given figure, we plot 'Age' (continuous feature) against 'Survived' (categorical
feature).
• The plot suggests that the younger people had a greater chance of survival.
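A sketch of this box plot with seaborn, on made-up rows, follows. The per-group medians are printed alongside because they correspond to the center line of each box.

```python
import pandas as pd
import seaborn as sns

# Made-up passenger rows; Survived is assumed to be 1 (yes) or 0 (no).
titanic = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 1, 0, 1],
})

# Median age per group, matching the center line of each box.
median_age = titanic.groupby("Survived")["Age"].median()
print(median_age)

# Categorical feature on the X-axis, continuous feature on the Y-axis.
ax = sns.boxplot(data=titanic, x="Survived", y="Age")
# In a script, call matplotlib.pyplot.show() to display the figure.
```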
Bivariate Analysis – Continuous & Categorical (3/3)
Bar Plot
• Bar plots can also be used for bivariate analysis of a continuous and a categorical
feature.
• In the given figure, we plot 'Age' (continuous feature) against 'Sex' (categorical feature).
In the 'Sex' column, 1 represents male and 0 represents female passengers.
• The bar plot suggests that male passengers were older on average.
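For reference, seaborn's barplot() draws the mean of the continuous feature for each category by default, so each bar height is a group average. The sketch below uses made-up rows with the 1 = male, 0 = female coding described above.

```python
import pandas as pd
import seaborn as sns

# Made-up passenger rows; Sex is coded 1 = male, 0 = female as in the text.
titanic = pd.DataFrame({
    "Sex": [1, 0, 0, 0, 1, 1, 1, 0],
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
})

# The same averages that the bars display.
mean_age = titanic.groupby("Sex")["Age"].mean()
print(mean_age)

# Each bar shows the mean Age for one value of Sex.
ax = sns.barplot(data=titanic, x="Sex", y="Age")
# In a script, call matplotlib.pyplot.show() to display the figure.
```

Since the bars show averages only, they say nothing about how many passengers fall into each age range; a box plot or histogram per group is better suited to that question.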
Detecting Outliers
• Outliers/anomalies in the data are the observations that do not fit into the standard
pattern of the data.
• In chapter 4, we discussed major techniques to detect outliers/anomalies, e.g.:
• Median-based anomaly detection
• Mean-based anomaly detection
• Z-score-based anomaly detection
• IQR-based anomaly detection
• In this chapter, we will learn different ways to treat such outliers/anomalies in the data.
Outliers Treatment (1/3)
Trimming Outliers
• One naïve way of treating outliers is to remove them from the data. However, this
approach discards observations and can throw away genuine extreme values.
• Outliers can be removed from the data in several ways.
• Consider the given Series. It is safe to say that the value 150 is an outlier in the data.
Outliers Treatment (2/3)
Trimming Outliers
• We delete the rows of the Series for which the absolute value of the Z-score is greater
than 1.5, that is, the values that lie more than 1.5 standard deviations from the mean.
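A minimal sketch of Z-score trimming in pandas. The Series values are made up, apart from the outlier 150 mentioned earlier, and the Z-score is computed by hand using the sample standard deviation that pandas returns by default.

```python
import pandas as pd

# Made-up Series containing one obvious outlier (150).
s = pd.Series([10, 12, 11, 13, 12, 11, 150, 10, 12, 13], dtype=float)

# Z-score: distance from the mean in units of standard deviation.
z = (s - s.mean()) / s.std()

# Keep only the rows whose absolute Z-score does not exceed the threshold.
threshold = 1.5
trimmed = s[z.abs() <= threshold]
print(trimmed)
```

Here only the value 150 exceeds the threshold, so the trimmed Series keeps the other nine rows.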
Outliers Treatment (3/3)
Mean/Median Imputation
• We can also replace outliers with either the mean or the median of the data.
• In the given figure, we detect outliers in the data using Z-score and replace them with the
median value of the Series.
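Median imputation could be sketched as follows, reusing a made-up Series with the outlier 150 and a Z-score threshold of 1.5 (the threshold is carried over from the trimming example as an assumption).

```python
import pandas as pd

# Made-up Series containing one obvious outlier (150).
s = pd.Series([10, 12, 11, 13, 12, 11, 150, 10, 12, 13], dtype=float)

# Flag outliers by absolute Z-score.
z = (s - s.mean()) / s.std()
is_outlier = z.abs() > 1.5

# Replace flagged values with the median; all other values are left as they are.
imputed = s.mask(is_outlier, s.median())
print(imputed)
```

Unlike trimming, this keeps the Series at its original length. The median is a reasonable replacement because it is barely affected by the outlier itself, whereas the mean of this Series is pulled upward by the value 150.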
Categorical Variable Transformation (1/3)
• We discussed numerical variable transformation in chapter 4 using normalization
and standardization.
• In this chapter we will discuss categorical variable transformation.
• There are several ways to transform categorical variables to make them more
meaningful for machines; we will discuss only some of them.
• Consider the following dataframe. We will transform the 'Sex' variable.
Categorical Variable Transformation (2/3)
Label Encoding
• In label encoding, we replace categorical data with numbers.
• We replace male with 1 and female with 0.
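One simple way to apply this encoding in pandas is .map() with an explicit dictionary. The rows below are made up; the 1/0 coding follows the slide.

```python
import pandas as pd

# Made-up rows; only the column name and the 1/0 coding come from the text.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Replace each label with its numeric code.
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
print(df)
```

Any label missing from the dictionary becomes NaN after .map(), which makes unexpected categories easy to spot.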
Categorical Variable Transformation (3/3)
Frequency Encoding
• In this encoding method, we replace each value in the categorical variable by its
frequency.
• In the given figure, we replaced male and female labels in the 'Sex' column with their
frequencies.
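Frequency encoding can be sketched with value_counts() and .map(). The rows are made up, and the result is written to a hypothetical new column, 'Sex_freq', so the original labels stay visible for comparison.

```python
import pandas as pd

# Made-up rows; only the 'Sex' column name comes from the text.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Count how often each label occurs.
freq = df["Sex"].value_counts()

# Replace each label with its count ('Sex_freq' is a hypothetical column name).
df["Sex_freq"] = df["Sex"].map(freq)
print(df)
```

Passing normalize=True to value_counts() would encode proportions instead of raw counts.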
Resources
• https://www.kaggle.com/residentmario/univariate-plotting-with-pandas
• https://purnasaigudikandula.medium.com/exploratory-data-analysis-beginner-univariate-bivariate-and-multivariate-habberman-dataset-2365264b751
