6) Exploratory Data Analysis

Uploaded by

karlamarsu97

EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a way to analyze data sets, often using data visualization
methods, to summarize their main characteristics.
• EDA helps in understanding the different features of the data and the relationships
between them, and in choosing statistical techniques appropriate for the data set.
• The matplotlib and seaborn libraries are commonly used for EDA.
Univariate Analysis (1/7)
• A dataset may contain one or more features/variables/columns.
• Univariate Analysis summarizes one variable at a time.
• It only describes the data and helps identify patterns in that variable; it does not examine
relationships between variables.

Categorical vs Continuous Data

• There are two types of data:
• Categorical/Discrete, e.g., spam vs no spam, male vs female
• Continuous, e.g., age of a population
• EDA is performed differently on the two types of data.
Univariate Analysis (2/7)
• Consider the following dataframe, which has 5 features describing iris plants.
• The last column is categorical, while the rest are continuous.
Univariate Analysis – Continuous Data (3/7)
Scatter Plot
• We can use a number of graphs to perform EDA on continuous data.
• In the given figure, we use the .scatterplot() function of the seaborn library to plot the
‘petal.length’ column of the dataframe.
• The hue parameter assigns a different color to each data point based on its category
in the column passed to hue.
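A minimal sketch of such a scatter plot, assuming a small hand-made dataframe in place of the full iris data (the values below are illustrative only) and assuming the row index is used for the X-axis, which the slide does not state:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the iris dataframe (illustrative values only).
iris = pd.DataFrame({
    "petal.length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

fig, ax = plt.subplots()
# reset_index() exposes the row number as an "index" column for the X-axis;
# hue colors each point by its value in the 'variety' column.
sns.scatterplot(data=iris.reset_index(), x="index", y="petal.length",
                hue="variety", ax=ax)
```

In an interactive session, plt.show() would display the figure.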
Univariate Analysis – Continuous Data (4/7)
Strip Plot
• Strip plots are also a good way to analyze the distribution of the variables for each
category.
• In the given figure, we have plotted ‘petal.length’ column for each category in the
‘variety’ column.
• The Y-axis shows the ‘petal.length’ values, and the X-axis shows the categories, so each
vertical strip is the distribution of one category.
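A strip plot along these lines could be sketched as follows, again assuming a small hand-made dataframe with illustrative values rather than the real iris data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the iris dataframe (illustrative values only).
iris = pd.DataFrame({
    "petal.length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "variety": ["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
})

fig, ax = plt.subplots()
# One strip of points per category: categories on X, measured values on Y.
sns.stripplot(data=iris, x="variety", y="petal.length", ax=ax)
```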
Univariate Analysis – Continuous Data (5/7)
Distribution Plot
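• To plot the distribution of a variable/feature/column, use the .distplot() function of the
seaborn library. Note that .distplot() has been deprecated since seaborn 0.11 in favor of
.histplot() and .displot().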
• In this figure, we plot the distribution of the ‘petal.length’ column.
Univariate Analysis – Continuous Data (6/7)
Histograms
• To analyze the frequency of values, we can plot a histogram using the .hist() function of
the matplotlib library.
• In this figure, we plot the frequency of values in the ‘petal.length’ column.
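A short sketch of a matplotlib histogram, using made-up values and an arbitrarily chosen bin count of 5. The call also returns the per-bin counts and the bin edges, which can be inspected directly:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical 'petal.length' values (illustrative only).
values = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9, 5.6]

fig, ax = plt.subplots()
# counts[i] is the number of values falling between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges, _ = ax.hist(values, bins=5)
ax.set_xlabel("petal.length")
ax.set_ylabel("frequency")
```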
Univariate Analysis – Continuous Data (7/7)
Box Plot
• A box plot summarizes the data using the five-number summary: minimum value, first
quartile (Q1), second quartile (Q2, the median), third quartile (Q3), and maximum value.
• We can use the .boxplot() function of either matplotlib or seaborn.
[Figure: box plot annotated with minimum value, Q1, Q2 (median), Q3, and maximum value]
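The five numbers behind a box plot can also be computed directly. A minimal sketch with pandas, using a made-up Series:

```python
import pandas as pd

# Hypothetical data (illustrative only).
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quantiles 0, 0.25, 0.5, 0.75 and 1 give min, Q1, Q2 (median), Q3 and max.
summary = s.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)
```

Note that by default the whiskers drawn by matplotlib and seaborn extend only to the most extreme points within 1.5 times the interquartile range of the box, so they coincide with the minimum and maximum only when no points lie beyond that range.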
Univariate Analysis – Categorical Data (1/2)
Count Plot
• We can use count plots to perform EDA on categorical data.
• The .countplot() function of the seaborn library plots the total count of each value as a
bar chart.
• In the given figure, we see that we have equal instances of each category in the
dataframe.
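The numbers that a count plot displays can be checked with the pandas .value_counts() method. A small sketch with a made-up 'variety' column:

```python
import pandas as pd

# Hypothetical categorical column (illustrative only).
variety = pd.Series(["Setosa", "Versicolor", "Virginica"] * 2, name="variety")

# Number of rows per category; these are the bar heights of a count plot.
counts = variety.value_counts()
print(counts)
```

Passing normalize=True to .value_counts() returns proportions instead of counts.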
Univariate Analysis – Categorical Data (2/2)
Pie Chart
• We can also visualize the proportion of each category using a pie chart as shown in the
figure.
Bivariate Analysis – Continuous & Continuous (1/4)
• Bivariate Analysis is used to study the relationship between exactly two
variables/features/columns of the data set.
• Consider the following dataframe.
Bivariate Analysis – Continuous & Continuous (2/4)
Scatter Plot
• If the two variables in question are both continuous, we can plot a scatter plot
between them to get an idea of their distribution.
• In the given figure, we plot the ‘Age’ column of the dataframe against the ‘Fare’
column, both of which are continuous.
• The plot shows no obvious relationship between the two features.
Bivariate Analysis – Continuous & Continuous (3/4)
Correlation
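• We can also compute the correlation between two continuous features.
• If the absolute value of the correlation is close to 1, the two features are strongly
linearly related; a value near 0 indicates little or no linear relationship.
• Use the .corr() method of the dataframe to compute the correlation between features.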
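A minimal sketch of both forms of .corr(), assuming a tiny made-up dataframe with 'Age' and 'Fare' columns (the values are illustrative, not taken from the slides):

```python
import pandas as pd

# Hypothetical passenger data (illustrative values only).
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Pearson correlation between two columns (a single number in [-1, 1]).
r = df["Age"].corr(df["Fare"])

# Pairwise correlations between all numeric columns (a square dataframe).
corr_matrix = df.corr()
print(r)
print(corr_matrix)
```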
Bivariate Analysis – Continuous & Continuous (4/4)
Heatmap
• We can also use a heatmap to visualize the correlation between features.
• Use the .heatmap() function of the seaborn library to create a heatmap.
• Pass the correlation dataframe that we created in the previous slide as an argument
to .heatmap().
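A sketch of a correlation heatmap, assuming a small made-up dataframe. The annot, vmin, vmax, and cmap settings are choices made for this example and are not taken from the slides:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical passenger data (illustrative values only).
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})
corr_matrix = df.corr()

fig, ax = plt.subplots()
# annot=True writes each correlation value into its cell;
# vmin/vmax pin the color scale to the full correlation range.
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
```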
Bivariate Analysis – Categorical & Categorical (1/3)
• Consider the following dataframe from the previous slide.
• We will see how we can use a bar plot to identify any relationship between two categorical
variables, namely ‘Pclass’ and ‘Survived’.
Bivariate Analysis – Categorical & Categorical (2/3)
• We first select the two columns that we are interested in, namely ‘Pclass’ and
‘Survived’.
• We then group this new dataframe by the ‘Pclass’ column and apply the .sum()
method to find the total number of survivors in each ‘Pclass’ category.
• The output of this part is shown below.
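The select-then-group steps above can be sketched as follows, assuming 'Survived' is coded as 1 for survivors and 0 otherwise, and using a tiny made-up dataframe:

```python
import pandas as pd

# Hypothetical passenger data (illustrative values only).
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
    "Age":      [38.0, 35.0, 27.0, 22.0, 26.0, 30.0],
})

# Keep the two columns of interest, then total the survivors per class.
survivors = df[["Pclass", "Survived"]].groupby("Pclass").sum()
print(survivors)
```

Because the classes may differ in size, using .mean() instead of .sum() would give the survival rate per class, which is often easier to compare.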
Bivariate Analysis – Categorical & Categorical (3/3)
• We then plot a bar plot, with ‘Pclass’ on the X-axis.
• As evident, more passengers in ‘Pclass’ 1 were able to survive than those from ‘Pclass’
2 or 3.
Bivariate Analysis – Continuous & Categorical (1/3)
• Sometimes, we want to find out the relationship between a continuous and a
categorical variable.
• Consider the following dataframe from the previous slide.
Bivariate Analysis – Continuous & Categorical (2/3)
Box Plot
• We can use a box plot to perform EDA on continuous and categorical data.
• In the given figure, we plot 'Age' (continuous feature) against 'Survived' (categorical
feature).
• The plot suggests that the younger people had a greater chance of survival.
Bivariate Analysis – Continuous & Categorical (3/3)
Bar Plot
• Bar plots can also be used for bivariate analysis of a continuous and a categorical
feature.
• In the given figure, we plot 'Age' (continuous feature) against 'Sex' (categorical feature).
In the 'Sex' column, 1 represents male and 0 represents female passengers.
• The bar plot suggests that elderly passengers were mostly male.
Detecting Outliers
• Outliers/anomalies in the data are the observations that do not fit into the standard
pattern of the data.
• In chapter 4, we discussed major techniques to detect outliers/anomalies, including:
• Median-based anomaly detection
• Mean-based anomaly detection
• Z-score-based anomaly detection
• IQR-based anomaly detection
• In this chapter, we will learn different ways to treat such outliers/anomalies in the data.
Outliers Treatment (1/3)
Trimming Outliers
• One naïve way of treating outliers is to remove them from the data. However, this
approach discards observations and can distort the analysis if the outliers are genuine values.
• Outliers can be removed from the data in several ways.
• Consider the given Series. It is safe to say that the value 150 is an outlier in the data.
Outliers Treatment (2/3)
Trimming Outliers
• We delete the rows of the Series for which the absolute value of the Z-score is greater
than 1.5, i.e., the values that lie more than 1.5 standard deviations from the mean.
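A minimal sketch of Z-score trimming, assuming a made-up Series that contains the value 150 (the slide's actual Series is not reproduced here):

```python
import pandas as pd

# Hypothetical Series with one obvious outlier, 150 (illustrative values only).
s = pd.Series([10, 12, 11, 13, 12, 150, 11, 10])

# Z-score: distance from the mean in units of standard deviation.
# Note that pandas' .std() uses the sample standard deviation (ddof=1) by default.
z = (s - s.mean()) / s.std()

# Keep only the rows whose absolute Z-score is at most 1.5.
trimmed = s[z.abs() <= 1.5]
print(trimmed)
```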
Outliers Treatment (3/3)
Mean/Median Imputation
• We can also replace outliers with either the mean or the median of the data.
• In the given figure, we detect outliers in the data using Z-score and replace them with the
median value of the Series.
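Median imputation might be sketched as below. The Series is made up, and the Z-score threshold of 1.5 is an assumption carried over from the trimming slide:

```python
import pandas as pd

# Hypothetical Series with one obvious outlier, 150.0 (illustrative values only).
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 150.0, 11.0, 10.0])

# Flag values whose absolute Z-score exceeds the assumed threshold of 1.5.
z = (s - s.mean()) / s.std()
is_outlier = z.abs() > 1.5

# .mask() replaces the flagged values and leaves the others unchanged.
imputed = s.mask(is_outlier, s.median())
print(imputed)
```

The median is preferred over the mean here because the mean is itself pulled toward the outlier.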
Categorical Variable Transformation (1/3)
• We discussed numerical variable transformation in chapter 4 using normalization
and standardization.
• In this chapter we will discuss categorical variable transformation.
• There are several ways to transform categorical variables to make them more
meaningful for machines; we will discuss only some of them.
• Consider the following dataframe. We will transform the 'Sex' variable.
Categorical Variable Transformation (2/3)
Label Encoding
• In label encoding, we replace categorical data with numbers.
• We replace male with 1 and female with 0.
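One way to sketch this label encoding is with the pandas .map() method and an explicit dictionary. The dataframe below is made up, and the new column name 'Sex_encoded' is chosen only for this example:

```python
import pandas as pd

# Hypothetical dataframe with a categorical 'Sex' column (illustrative only).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Explicit mapping from label to number: male -> 1, female -> 0.
df["Sex_encoded"] = df["Sex"].map({"male": 1, "female": 0})
print(df)
```

Any label missing from the dictionary would become NaN, so it is worth checking for unexpected categories first.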
Categorical Variable Transformation (3/3)
Frequency Encoding
• In this encoding method, we replace each value in the categorical variable by its
frequency.
• In the given figure, we replaced male and female labels in the 'Sex' column with their
frequencies.
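A minimal sketch of frequency encoding, assuming "frequency" means the raw count of each label (passing normalize=True to .value_counts() would use proportions instead). The dataframe and the 'Sex_freq' column name are made up for illustration:

```python
import pandas as pd

# Hypothetical dataframe with a categorical 'Sex' column (illustrative only).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Count how often each label occurs, then replace each label by its count.
freq = df["Sex"].value_counts()
df["Sex_freq"] = df["Sex"].map(freq)
print(df)
```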
Resources
• https://www.kaggle.com/residentmario/univariate-plotting-with-pandas
• https://purnasaigudikandula.medium.com/exploratory-data-analysis-beginner-univariate-bivariate-and-multivariate-habberman-dataset-2365264b751
