6) Exploratory Data Analysis
DATA ANALYSIS
Exploratory Data Analysis
• Exploratory Data Analysis, or EDA, is a way to analyze data sets, often using data visualization
methods, to summarize the main characteristics of the data.
• EDA helps us better understand the different features of the data and the relationships
between them, and helps determine which statistical techniques are appropriate for the data set.
• The matplotlib and seaborn libraries are well suited to the visualization side of EDA.
Univariate Analysis (1/7)
• A dataset may contain one or more features/variables/columns.
• Univariate Analysis provides summary statistics for one variable at a time.
• Univariate Analysis only describes the data and helps identify patterns in it; it does not
examine relationships between variables.
[Figure: box plot annotated with the minimum value, Q1, Q2 (median), Q3, and maximum value]
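The quantities labelled on a box plot (minimum, Q1, median, Q3, maximum) can be computed directly with pandas. The following is a minimal sketch; the 'Age' values are made up for illustration and are not taken from the slides.

```python
import pandas as pd

# Illustrative values only; any numeric column would do.
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, 14], name="Age")

# Five-number summary: minimum, Q1, Q2 (median), Q3, maximum.
five_number = ages.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_number)

# describe() reports the same quartiles plus count, mean and standard deviation.
print(ages.describe())
```

Note that pandas interpolates linearly between data points when a quantile falls between two observations.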
Univariate Analysis – Categorical Data (1/2)
Count Plot
• We can use count plots to perform EDA on categorical data.
• The .countplot() function of the seaborn library plots the total count of each value as a
bar chart.
• In the given figure, we see that we have equal instances of each category in the
dataframe.
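A count plot like the one described could be produced roughly as follows. This is a sketch under assumptions: the column name 'category' and its values are hypothetical, chosen so that every category appears equally often.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical column with two instances of each value.
df = pd.DataFrame({"category": ["A", "B", "C", "A", "B", "C"]})

fig, ax = plt.subplots()
sns.countplot(data=df, x="category", ax=ax)  # one bar per distinct value
ax.set_title("Count of each category")

# The same counts as numbers, for checking the plot.
counts = df["category"].value_counts()
print(counts)
```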
Univariate Analysis – Categorical Data (2/2)
Pie Chart
• We can also visualize the proportion of each category using a pie chart as shown in the
figure.
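One possible way to draw such a pie chart is to compute the proportions with pandas and pass them to matplotlib. The column name 'category' and the data below are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"category": ["A", "A", "B", "C"]})

# Share of each category; the values sum to 1.
proportions = df["category"].value_counts(normalize=True)

fig, ax = plt.subplots()
ax.pie(proportions.values, labels=list(proportions.index), autopct="%1.1f%%")
ax.set_title("Proportion of each category")
```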
Bivariate Analysis – Continuous & Continuous (1/4)
• Bivariate Analysis is used to study the relationship between exactly two
variables/features/columns of the data set.
• Consider the following dataframe.
Bivariate Analysis – Continuous & Continuous (2/4)
Scatter Plot
• If the two variables in question are both continuous, we can plot a scatter plot
between them to get an idea of their distribution.
• In the given figure, we plot the ‘Age’ column of the dataframe against the ‘Fare’
column, both of which are continuous.
• The points show no obvious pattern, which suggests that the two features are not strongly related.
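A scatter plot of two continuous columns might be drawn as in the sketch below. The 'Age' and 'Fare' column names come from the slide; the values are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
})

fig, ax = plt.subplots()
ax.scatter(df["Age"], df["Fare"])  # one point per row
ax.set_xlabel("Age")
ax.set_ylabel("Fare")
```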
Bivariate Analysis – Continuous & Continuous (3/4)
Correlation
• We can also find the correlation between two continuous features.
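• If the absolute value of the correlation coefficient is close to 1, the two features have a
strong linear relationship; a value close to 0 indicates a weak linear relationship.
• Use the .corr() method of a pandas dataframe to compute the correlation between features.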
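As a sketch of how .corr() can be used in pandas, the example below computes both the full correlation matrix and the single coefficient between two columns. The data are invented; by default pandas computes the Pearson correlation.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 54],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Correlation matrix between all numeric columns (Pearson by default).
corr_matrix = df.corr()
print(corr_matrix)

# Correlation between exactly two columns.
age_fare = df["Age"].corr(df["Fare"])
print(age_fare)
```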
Bivariate Analysis – Continuous & Continuous (4/4)
Heatmap
• We can also use a heatmap to graphically visualize the correlation between features.
• Use the .heatmap() function of the seaborn library to create a heatmap.
• Provide the correlation dataframe that we created in the previous slide as an argument
to .heatmap().
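A correlation heatmap could be drawn along these lines. The data are invented, and the colour map and value range are illustrative choices rather than anything prescribed by the slides.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative values only.
df = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 54],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})
corr_matrix = df.corr()

fig, ax = plt.subplots()
# annot=True writes each coefficient in its cell; vmin/vmax fix the colour scale to [-1, 1].
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
```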
Bivariate Analysis – Categorical & Categorical (1/3)
• Consider the following dataframe from the previous slide.
• We will see how we can use a bar plot to identify any relation between two categorical
variables, namely ‘Pclass’ and ‘Survived’.
Bivariate Analysis – Categorical & Categorical (2/3)
• We first select the two columns that we are interested in, namely ‘Pclass’ and
‘Survived’.
• We then group this new dataframe by the ‘Pclass’ column and apply the .sum()
function to find the total number of survivors in each ‘Pclass’ category.
• The output of this part is shown below.
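The select, group and sum steps above can be sketched as follows. The rows are invented for illustration; 'Survived' is assumed to be coded as 1 for survivors and 0 otherwise, so that summing it counts survivors.

```python
import pandas as pd

# Illustrative rows only; Survived is assumed to be 1 (survived) or 0 (did not).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
    "Age":      [38, 35, 54, 27, 14, 22, 26, 2],
})

# Keep the two columns of interest, group by class, and count survivors per class.
survivors = df[["Pclass", "Survived"]].groupby("Pclass").sum()
print(survivors)
```

Because classes can differ in size, replacing .sum() with .mean() would give the survival rate per class instead of the raw count.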
Bivariate Analysis – Categorical & Categorical (3/3)
• We then plot a bar plot, with ‘Pclass’ on the X-axis.
• As evident, more passengers in ‘Pclass’ 1 were able to survive than those from ‘Pclass’
2 or 3.
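A bar plot of survivors per class might be produced as in this sketch, which uses invented data and plain matplotlib bars; the slide's own figure may have been drawn differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows only; Survived is assumed to be 1 (survived) or 0 (did not).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})
survivors = df.groupby("Pclass")["Survived"].sum()

fig, ax = plt.subplots()
# Pclass on the X-axis, number of survivors as the bar height.
ax.bar([str(c) for c in survivors.index], survivors.values)
ax.set_xlabel("Pclass")
ax.set_ylabel("Number of survivors")
```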
Bivariate Analysis – Continuous & Categorical (1/3)
• Sometimes, we want to find out the relationship between a continuous and a
categorical variable.
• Consider the following dataframe from the previous slide.
Bivariate Analysis – Continuous & Categorical (2/3)
Box Plot
• We can use a box plot to perform EDA on continuous and categorical data.
• In the given figure, we plot 'Age' (continuous feature) against 'Survived' (categorical
feature).
• The plot suggests that the younger people had a greater chance of survival.
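A box plot of a continuous feature split by a categorical one could be drawn with seaborn as sketched below. The data are invented, so the plot will not reproduce the slide's figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative rows only.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age":      [40, 35, 54, 28, 22, 4, 26, 19],
})

fig, ax = plt.subplots()
# One box of 'Age' for each value of 'Survived'.
sns.boxplot(data=df, x="Survived", y="Age", ax=ax)

# The line inside each box is the group median.
medians = df.groupby("Survived")["Age"].median()
print(medians)
```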
Bivariate Analysis – Continuous & Categorical (3/3)
Bar Plot
• Bar plots can also be used for bivariate analysis of a continuous and a categorical
feature.
• In the given figure, we plot 'Age' (continuous feature) against 'Sex' (categorical feature).
A value of 1 in the 'Sex' column represents male passengers and 0 represents female passengers.
• The bar plot suggests that elderly passengers were mostly male.
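One common reading of such a bar plot is the mean of the continuous feature within each category. The sketch below takes that reading, which is an assumption about how the slide's figure was built, and uses invented data with 'Sex' coded as 1 for male and 0 for female.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows only; Sex: 1 = male, 0 = female.
df = pd.DataFrame({
    "Sex": [1, 0, 0, 1, 1, 0],
    "Age": [40, 22, 30, 60, 50, 26],
})

# Mean age within each category of 'Sex'.
mean_age = df.groupby("Sex")["Age"].mean()

fig, ax = plt.subplots()
ax.bar(["female (0)", "male (1)"], [mean_age.loc[0], mean_age.loc[1]])
ax.set_xlabel("Sex")
ax.set_ylabel("Mean Age")
```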
Detecting Outliers
• Outliers/anomalies in the data are the observations that do not fit into the standard
pattern of the data.
• In chapter 4, we discussed the major techniques for detecting outliers/anomalies, e.g.:
• Median-based anomaly detection
• Mean-based anomaly detection
• Z-score-based anomaly detection
• IQR-based anomaly detection
• In this chapter, we will learn different ways to treat such outliers/anomalies in the data.
Outliers Treatment (1/3)
Trimming Outliers
• One naïve way of treating outliers is to remove them from the data. However, this
approach discards observations and, with them, potentially useful information.
• Outliers can be removed from the data in several ways.
• Consider the given Series. It is safe to say that the value 150 is an outlier in the data.
Outliers Treatment (2/3)
Trimming Outliers
• We delete the rows of the Series for which the absolute value of the Z-score is greater
than 1.5, i.e. the values that lie more than 1.5 standard deviations away from the mean.
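Trimming by Z-score can be sketched as follows. The Series below is invented (only the outlier value 150 is taken from the slides), and the Z-score here uses the sample standard deviation that pandas computes by default, which is an assumption about the original figure.

```python
import pandas as pd

# Illustrative Series; 150 plays the role of the outlier.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 150.0, 11.0, 10.0])

# Z-score: distance from the mean in units of standard deviation.
z_scores = (s - s.mean()) / s.std()

# Keep only the rows whose absolute Z-score does not exceed the threshold.
threshold = 1.5
trimmed = s[z_scores.abs() <= threshold]
print(trimmed)
```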
Outliers Treatment (3/3)
Mean/Median Imputation
• We can also replace outliers with either the mean or the median of the data.
• In the given figure, we detect outliers in the data using Z-score and replace them with the
median value of the Series.
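Median imputation of Z-score outliers might look like the sketch below. The Series is invented, the 1.5 threshold is carried over from the trimming slide as an assumption, and the median is taken over the whole Series, outlier included.

```python
import pandas as pd

# Illustrative Series; 150 plays the role of the outlier.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 150.0, 11.0, 10.0])

z_scores = (s - s.mean()) / s.std()
is_outlier = z_scores.abs() > 1.5

# mask() replaces the values where the condition is True and keeps the rest.
imputed = s.mask(is_outlier, s.median())
print(imputed)
```

Unlike trimming, this keeps the Series at its original length.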
Categorical Variable Transformation (1/3)
• We discussed numerical variable transformation in chapter 4 using normalization
and standardization.
• In this chapter we will discuss categorical variable transformation.
• There are several ways to transform categorical variables to make them more
meaningful for machines. We will discuss only some of them.
• Consider the following dataframe. We will transform the 'Sex' variable.
Categorical Variable Transformation (2/3)
Label Encoding
• In label encoding, we replace categorical data with numbers.
• We replace male with 1 and female with 0.
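Label encoding with the mapping given above can be sketched with the pandas .map() method. The rows are invented, and the new column name 'Sex_encoded' is hypothetical.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Replace each label with its number: male -> 1, female -> 0.
df["Sex_encoded"] = df["Sex"].map({"male": 1, "female": 0})
print(df)
```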
Categorical Variable Transformation (3/3)
Frequency Encoding
• In this encoding method, we replace each value in the categorical variable by its
frequency.
• In the given figure, we replaced male and female labels in the 'Sex' column with their
frequencies.
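Frequency encoding could be implemented as in the sketch below. It uses relative frequencies (shares of the total), which is an assumption; raw counts from value_counts() without normalize=True would work the same way. The rows and the column name 'Sex_freq' are hypothetical.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Relative frequency of each label in the column.
frequencies = df["Sex"].value_counts(normalize=True)

# Replace each label with the frequency of that label.
df["Sex_freq"] = df["Sex"].map(frequencies)
print(df)
```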
Resources
• https://www.kaggle.com/residentmario/univariate-plotting-with-pandas
• https://purnasaigudikandula.medium.com/exploratory-data-analysis-beginner-univariate-bivariate-and-multivariate-habberman-dataset-2365264b751