UNIT 1 Exploratory Data Analysis
1.Fundamentals of EDA:
Exploratory Data Analysis (EDA):
• EDA is a crucial step in the data analysis process.
• It involves the initial examination of data sets to discover patterns, anomalies, and
insights.
• EDA aims to understand the data's structure, identify relationships, and generate
hypotheses.
Principles of EDA:
• Data summarization through descriptive statistics and visualizations.
• Handling missing data and outliers effectively.
• Exploring data distributions, correlations, and trends.
• Iterative process involving data cleaning and feature engineering.
Tools for EDA:
• Popular tools include Python libraries like Pandas, Matplotlib, and Seaborn.
• Data visualization tools such as Tableau and Power BI.
• Statistical software like R for in-depth analysis.
There are several phases of data analysis, including data requirements, data collection, data
processing, data cleaning, exploratory data analysis, modelling and algorithms, and data
product and communication.
• Data collection: Data collected from several sources must be stored in the correct
format and transferred to the right information technology personnel within a
company. As mentioned previously, data can be collected about several objects,
across several events, using different types of sensors and storage tools.
• Data cleaning: Pre-processed data is still not ready for detailed analysis. It must be
checked for incompleteness, duplicates, errors, and missing values. These tasks are
performed in the data cleaning stage, which involves responsibilities such as
matching the correct records, finding inaccuracies in the dataset, understanding the
overall data quality, removing duplicate items, and filling in the missing values
(a short Pandas sketch of these checks follows this list).
• EDA: Exploratory data analysis, as mentioned before, is the stage where we actually
start to understand the message contained in the data. It should be noted that
several types of data transformation techniques might be required during the
process of exploration.
• Data Product: Any computer software that uses data as inputs, produces outputs,
and provides feedback based on the output to control the environment is referred to
as a data product. A data product is generally based on a model developed during
data analysis, for example, a recommendation model that inputs user purchase
history and recommends a related item that the user is highly likely to buy.
• Communication: This stage deals with disseminating the results to end stakeholders
so that they can use them for business intelligence. One of the most notable steps in
this stage is data visualization, which relies on techniques such as tables, charts,
summary diagrams, and bar charts to present the analyzed results.
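A short Pandas sketch of the data cleaning checks described above; the column names and values are only illustrative:
import pandas as pd
# Illustrative raw data containing a duplicate row and a missing value
raw = pd.DataFrame({'customer_id': [101, 102, 102, 103],
                    'purchase_amount': [250.0, 99.5, 99.5, None]})
# Assess overall data quality: duplicate rows and missing values per column
print(raw.duplicated().sum())
print(raw.isnull().sum())
# Remove duplicate items and fill in the missing values
clean = raw.drop_duplicates()
clean = clean.fillna({'purchase_amount': clean['purchase_amount'].mean()})
print(clean)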
EDA as a Foundation:
• Exploratory Data Analysis (EDA) is the cornerstone of data analysis.
• It forms the initial phase in the data analysis pipeline.
• EDA sets the stage for all subsequent data-driven decisions
Uncovering Patterns and Anomalies:
• EDA helps reveal hidden patterns, trends, and anomalies within data.
• These discoveries can provide valuable insights for decision-makers.
• Identifying outliers or unusual data points can highlight data quality issues.
Data Quality Assessment:
• EDA aids in assessing the quality and reliability of data.
• It allows for the identification and handling of missing values, duplications, and
errors.
• Ensuring data quality is essential for meaningful analysis and decision-making.
Hypothesis Generation:
• EDA encourages the formulation of hypotheses about the data.
• These hypotheses can guide further analysis and experimentation.
• Data-driven hypotheses are more likely to lead to actionable insights.
Feature Selection and Engineering:
• EDA helps in selecting relevant features for modelling.
• Feature engineering, based on EDA findings, can enhance model performance.
• Efficient feature selection reduces computational complexity.
Communication of Insights:
• EDA results are often presented through visualizations and summaries.
• Clear communication of insights to stakeholders is vital.
• Visualizations make complex data more understandable to non-technical audiences.
Risk Mitigation:
• EDA can uncover risks and potential issues within the data.
• Addressing these issues early in the analysis process can prevent costly errors and
wrong decisions.
• Risk mitigation is a crucial aspect of data-driven decision-making.
Continuous Improvement:
• As more data becomes available or as research questions evolve, EDA should be
revisited.
• EDA is not a one-time process but an iterative one.
• Continuous EDA ensures that insights remain up-to-date and relevant.
In summary, EDA is of paramount importance in the data analysis process as it lays the
foundation for informed decision-making. It uncovers insights, assesses data quality,
generates hypotheses, and helps communicate findings effectively, ultimately leading to
more robust and reliable decisions.
Together, these activities form the EDA process, which is essential for understanding data,
making informed decisions, and preparing the data for advanced analytics and modelling.
Python Libraries for EDA:
NumPy: NumPy is a fundamental package for scientific computing with Python. It provides
support for working with arrays and matrices.
import numpy as np
# Create a 1D array and perform basic operations
arr = np.array([1, 2, 3, 4, 5])
mean = np.mean(arr)
std_dev = np.std(arr)
print(f"Mean: {mean}, Standard Deviation: {std_dev}")
Plotly: Plotly is a library that allows creating interactive and dynamic visualizations, including
2D and 3D charts.
import plotly.express as px
# Sample data
data = px.data.iris()
# Create a scatter plot using Plotly
fig = px.scatter(data, x='sepal_width', y='sepal_length', color='species')
# Show the interactive plot
fig.show()
Seaborn: Seaborn is built on top of Matplotlib and provides a higher-level interface for
creating informative and attractive statistical graphics.
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample dataset
tips = sns.load_dataset('tips')
# Create a scatter plot
sns.scatterplot(x='total_bill', y='tip', data=tips)
# Add labels and title
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Scatter Plot of Tips vs Total Bill')
# Display the plot
plt.show()
Matplotlib: Matplotlib is a widely used Python library for creating static, interactive, and
animated visualizations.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]
# Create a line plot
plt.plot(x, y, marker='o')
# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample Line Plot')
# Display the plot
plt.show()
Pandas: Pandas is a popular Python library used for data manipulation and analysis. It
provides data structures like DataFrame for handling tabular data.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]}
df = pd.DataFrame(data)
# Display basic statistics
print(df.describe())
# Display first few rows
print(df.head())
2.EDA:
2.1 Making Sense of Data:
Making sense of complex datasets through EDA involves employing various techniques and
approaches to gain insights and understand data patterns. This process is crucial for making
informed decisions and drawing meaningful conclusions. Here, we'll explore EDA techniques
for different types of data, including numerical data (both discrete and continuous),
categorical data, and understanding different measuring scales.
1.Numerical Data:
Discrete Numerical Data:
• Refers to data that can take only specific, distinct values.
• Common examples include counts, integers, and whole numbers.
• EDA techniques:
• Create frequency distributions and histograms to visualize the data distribution.
• Calculate summary statistics such as mean, median, mode, and standard deviation.
• Explore measures of central tendency and dispersion.
Continuous Numerical Data:
• Refers to data that can take any value within a range, such as measurements of height, weight, or temperature.
• EDA techniques:
• Visualize the distribution with histograms, density plots, and box plots.
• Calculate summary statistics and examine spread and skewness.
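A brief sketch of these numerical summaries, using a small made-up sample of discrete counts:
import pandas as pd
# Made-up discrete data, e.g., items purchased per store visit
counts = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 5, 7])
# Frequency distribution of the distinct values
print(counts.value_counts().sort_index())
# Measures of central tendency and dispersion
print('Mean:', counts.mean())
print('Median:', counts.median())
print('Mode:', counts.mode().iloc[0])
print('Standard deviation:', counts.std())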
2.Categorical Data:
Categorical Data:
• Consists of non-numeric categories or labels.
• Examples include gender, colors, or product categories.
• EDA techniques:
• Create bar charts or pie charts to display category frequencies.
• Calculate proportions and percentages.
• Explore cross-tabulations to understand relationships between categorical variables.
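A brief sketch of these categorical summaries; the columns and values below are made up for illustration:
import pandas as pd
# Made-up categorical data
df = pd.DataFrame({'gender': ['F', 'M', 'F', 'F', 'M', 'M'],
                   'product_category': ['Books', 'Books', 'Toys', 'Books', 'Toys', 'Toys']})
# Category frequencies and percentages
print(df['gender'].value_counts())
print(df['gender'].value_counts(normalize=True) * 100)
# Cross-tabulation of two categorical variables
print(pd.crosstab(df['gender'], df['product_category']))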
Measuring Scales:
1.Nominal Scale:
• Represents data with unordered categories.
• EDA considerations:
• Use frequency tables and bar charts.
• Calculate mode (most frequent category).
• No meaningful arithmetic operations.
2.Ordinal Scale:
• Involves data with ordered categories but unknown intervals.
• EDA considerations:
• Visualize using ordered bar charts.
• Calculate mode and median.
• Limited arithmetic operations (e.g., ranking).
3.Interval Scale:
• Numerical data with known intervals but no true zero point.
• EDA considerations:
• Visualize with histograms or density plots.
• Calculate mean, median, and standard deviation.
• Arithmetic operations (e.g., addition and subtraction) are meaningful, but ratios are
not (no true zero).
4.Ratio Scale:
• Numerical data with known intervals and a true zero point.
• EDA considerations:
• Employ various numerical and visual techniques (histograms, box plots).
• Calculate mean, median, mode, standard deviation, and coefficients of variation.
• All arithmetic operations are meaningful (e.g., multiplication, division).
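To make the nominal and ordinal scales above concrete, here is a short Pandas sketch; the color and rating values are made up:
import pandas as pd
# Nominal scale: unordered categories, so frequencies and the mode are meaningful
colors = pd.Series(['red', 'blue', 'red', 'green', 'red'])
print(colors.value_counts())
print('Mode:', colors.mode()[0])
# Ordinal scale: ordered categories with unknown intervals
ratings = pd.Categorical(['low', 'high', 'medium', 'low', 'high'],
                         categories=['low', 'medium', 'high'], ordered=True)
s = pd.Series(ratings)
print(s.min(), '<', s.max())      # ordering comparisons are valid
print(s.sort_values().tolist())   # ranking is valid; means and ratios are not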
2.2 Comparing EDA with Classical and Bayesian Analysis:
1.Classical Analysis:
Definition: Classical statistical analysis, also known as frequentist statistics, relies on fixed
parameters and emphasizes hypothesis testing.
Focus:
• Hypothesis Testing: Classical analysis often starts with predefined hypotheses and
aims to accept or reject them based on data.
• Parameter Estimation: It involves estimating fixed model parameters (e.g., mean,
variance) with point estimates.
Data Exploration:
• Limited Emphasis: Classical analysis focuses more on hypothesis testing and less on
data exploration.
• EDA is typically not the primary focus but may be used alongside classical methods
for initial data assessment.
Assumptions:
• Assumes fixed, unchanging parameters.
• Assumes large sample sizes for accurate results.
• Relies on p-values for significance testing.
Applicability:
• Suitable for well-defined research questions with clear hypotheses.
• Less flexible when dealing with complex or dynamic data.
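As a small illustration of the classical workflow (not taken from the original notes), a one-sample t-test with SciPy; the sample values and the hypothesized mean of 50 are made up:
import numpy as np
from scipy import stats
# Made-up sample; null hypothesis: the population mean equals 50
sample = np.array([48.2, 51.5, 49.8, 52.1, 50.6, 47.9, 53.0, 49.4])
# Classical (frequentist) test: a point estimate plus a p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Decision at the 5% significance level
print('Reject H0' if p_value < 0.05 else 'Fail to reject H0')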
2.Bayesian Analysis:
Definition: Bayesian analysis treats model parameters as random variables. It combines prior
beliefs with observed data, via Bayes' theorem, to produce posterior distributions.
Focus:
• Updating Beliefs: Prior knowledge is revised as new data arrives.
• Uncertainty Quantification: Results are expressed as posterior distributions and credible
intervals rather than point estimates and p-values.
3.EDA Approach:
Definition: EDA is an approach to data analysis that focuses on visually and numerically
exploring data to gain insights.
Focus:
• Data Exploration: EDA's primary focus is to understand the data's structure, uncover
patterns, and generate hypotheses.
• Visualizations: It heavily relies on data visualizations, such as histograms, scatter
plots, and box plots.
Data Exploration:
• Emphasizes data exploration, data cleaning, and initial insights.
• Typically used in the early stages of analysis to guide subsequent steps.
Assumptions:
• Assumption-free: EDA doesn't rely on specific assumptions about data distribution.
• Flexible and adaptable to various data types.
Applicability:
• Valuable for understanding data, identifying outliers, and formulating hypotheses.
• Serves as a foundation for choosing appropriate statistical methods.
2.3 Software tools for EDA:
1.Polymer Search: Polymer Search is a tool that allows users to harness the power of AI to
generate insights from their data and create interactive databases that allow for easy
filtering and data exploration. The main features include an interactive spreadsheet, an
interactive pivot table, interactive graphs/charts, an ‘auto-explainer’ feature that instantly
generates summaries and rankings and finds anomalies within the data, and ‘Smart Start’,
which uses AI to suggest insights about your data. This tool is especially useful for marketers,
salespeople, and business intelligence professionals looking to analyze and present their data.
2.Pandas Profiling: Pandas Profiling is an open-source Python module that allows both
non-technical users and data scientists to quickly perform EDA and present the information
in a web-based interactive report. Using Pandas Profiling, you can generate interactive
graphs/charts and visualize the distribution of each variable in the dataset with just a few
lines of code. Data scientists often use Pandas Profiling to save hours of time in the EDA
process.
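A minimal sketch of how such a report is generated, following the Pandas Profiling documentation; the CSV path is a placeholder:
import pandas as pd
from pandas_profiling import ProfileReport  # newer releases ship this as ydata-profiling
# Load any tabular dataset (placeholder path)
df = pd.read_csv('data.csv')
# Build the interactive EDA report and save it as a standalone HTML page
profile = ProfileReport(df, title='EDA Report')
profile.to_file('eda_report.html')
The resulting HTML file can be opened in any browser or shared with non-technical stakeholders.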
3.DataPrep: DataPrep is a Python tool that saves countless hours of cleansing and preparing
data and performing EDA. It works similarly to Pandas Profiling: within a couple of lines of
code, you can plot a series of interactive graphs and distribution charts to get an overall
sense of the data. You can also find and analyze missing values and outliers within seconds.
This makes the user aware of the data quality in each column and helps uncover possible
reasons for missing values or outliers. Overall, DataPrep is a very powerful tool for cleansing
data, analyzing missing values, checking correlations, and seeing the distribution of each
variable.
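A minimal sketch using the dataprep.eda module, following the DataPrep documentation; the CSV path is a placeholder:
import pandas as pd
from dataprep.eda import plot, plot_missing, plot_correlation, create_report
# Load any tabular dataset (placeholder path)
df = pd.read_csv('data.csv')
plot(df)              # distribution of every variable
plot_missing(df)      # where the missing values are and how they relate
plot_correlation(df)  # pairwise correlations
create_report(df)     # full interactive EDA report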
4.Trifacta: Trifacta allows you to prepare and explore any dataset in a cloud data warehouse
or cloud data lakehouse through an interactive user interface. The tool uses built-in machine
learning algorithms to guide you through the exploration of your data. One of its features is
data profiling: determining how accurate, complete, and valid a dataset is, which Trifacta
does automatically. Another feature is its no-code ETL (extract, transform, load) or ELT: you
can transform your dataset simply by providing an example of the target format, and the
machine learning algorithm will fill in the rest.
5.KNIME: KNIME is a tool that allows you to dive deep into data processing without learning
how to code. It is often used by data scientists, especially in the chemical and biotech
industries, for data processing and for building production-grade applications. It has plenty
of features that come in handy for exploratory data analysis, including data cleansing and
manipulation, merging datasets, creating interactive visualizations, and building models.
6.Excel: For many datasets, Excel is all that’s needed for data analysis. Its advantages are
that it’s easy to cleanse and manipulate a dataset using basic Excel functions, and it’s very
convenient for quickly creating graphs/charts. Although Excel is a paid program, Google
Sheets is a free alternative that offers much of the same functionality.
7.IBM Cognos: IBM Cognos Analytics is a business intelligence tool designed for business
professionals who aren’t data-savvy. Using its built-in AI tools, users can explore and
generate insights about their data in a matter of clicks. The tool also automates data
preparation tasks such as cleansing and aggregating data.
8.R Language: R is a statistical programming language commonly used for data analysis and
visualization. It has a rich ecosystem of packages for EDA.
# Load data
data <- read.csv('data.csv')
# Summary statistics
summary(data)
# Create a histogram
hist(data$variable, main="Histogram", xlab="Values")
2.4 Visual Aids for EDA:
1. Line Chart:
• Definition: A line chart displays data points connected by straight lines, representing
trends and changes over time or continuous data.
• Usage: It's used to show patterns, trends, and relationships between data points.
• Variable: Typically used for continuous data on the x-axis, such as time or numerical
values.
• Inference: Easily identifies trends and fluctuations in data over time.
• Code:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Chart')
plt.show()
2. Bar Chart:
• Definition: A bar chart uses rectangular bars to represent categorical data. The length
of each bar is proportional to the value it represents.
• Usage: Used to compare data among different categories or discrete items.
• Variable: Discrete categories or items on the x-axis.
• Inference: Allows quick comparisons between categories and identifies the highest or
lowest values.
• Code:
import matplotlib.pyplot as plt
categories = ['Category A', 'Category B', 'Category C']
values = [25, 40, 30]
plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')
plt.show()
3. Scatter Plot:
• Definition: A scatter plot displays individual data points as dots on a two-dimensional
plane. It shows the relationship between two continuous variables.
• Usage: Used to visualize the correlation or distribution of data points.
• Variable: Two continuous variables, one on each axis.
• Inference: Reveals patterns, clusters, or outliers in the relationship between the
variables.
• Code:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()
4. Area Plot:
• Definition: An area plot represents data with
shaded areas, depicting cumulative quantities over time or categories.
• Usage: Shows the composition or cumulative change in data.
• Variable: Typically used for continuous data on the x-axis, such as time.
• Inference: Highlights the total contribution of each category to the whole.
• Code:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
plt.fill_between(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Area Plot')
plt.show()
5. Pie Chart:
• Definition: A pie chart represents data as sectors of a circle, showing the proportion
of each category in relation to the whole.
• Usage: Used to display the percentage distribution of different parts of a whole.
• Variable: Discrete categories.
• Inference: Easily compares the proportions of different categories within a dataset.
• Code:
import matplotlib.pyplot as plt
labels = ['A', 'B', 'C', 'D']
sizes = [15, 30, 45, 10]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
6. Table Chart:
• Definition: A table chart presents data in a tabular format with rows and columns.
• Usage: Used to display detailed data, allowing for precise examination.
• Variable: Both discrete and continuous data can be shown in a table.
• Inference: Enables viewers to read specific values and make detailed comparisons.
• Code:
import pandas as pd
# Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
7. Polar Chart:
• Definition: A polar chart displays data on a circular grid, using radial lines to
represent different values.
• Usage: Used to show data patterns that involve angles or cyclical trends.
• Variable: Continuous data on the radial axis.
• Inference: Highlights cyclical patterns and comparisons across different angles.
• Code:
import matplotlib.pyplot as plt
import numpy as np
theta = np.linspace(0, 2*np.pi, 6)
r = [1, 2, 3, 4, 5, 6]
plt.polar(theta, r)
plt.title('Polar Chart')
plt.show()
8. Histogram:
• Definition: A histogram groups continuous data into intervals (bins) and represents
the frequency of data points in each bin.
• Usage: Shows the distribution and frequency of data within specific ranges.
• Variable: Continuous data on the x-axis, divided into intervals.
• Inference: Reveals the underlying distribution of the data, including peaks, valleys,
and skewness.
• Code:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=20)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
9. Lollipop Chart:
• Definition: A lollipop chart combines a scatter plot with vertical lines, emphasizing
data points on a continuous scale.
• Usage: Used to compare individual data points within a dataset.
• Variable: Categories or discrete items on the x-axis, with numeric values.
• Inference: Highlights specific data points and their values compared to others.
• Code:
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values = [5, 10, 15, 8]
# Plot stems at numeric positions, then label them with the category names
positions = range(len(categories))
plt.stem(positions, values, basefmt=' ')
plt.xticks(positions, categories)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Lollipop Chart')
plt.show()
2.5 Data Transformation:
1. Merging Databases:
Description: Merging databases involves combining data from multiple sources or tables to
create a unified dataset. It is often used when dealing with related data stored in separate
databases or files.
# Example Code (Python - Pandas):
import pandas as pd
# Sample DataFrames sharing an 'ID' column (values match the outputs below)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 28]})
# inner join: keep only IDs present in both frames
inner_join_df = pd.merge(df1, df2, on='ID', how='inner')
print(inner_join_df)
# left join: keep all rows of df1
left_join_df = pd.merge(df1, df2, on='ID', how='left')
print(left_join_df)
# right join: keep all rows of df2
right_join_df = pd.merge(df1, df2, on='ID', how='right')
print(right_join_df)
# outer join: keep all rows of both frames
outer_join_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join_df)
# Merge on index
df1_indexed = df1.set_index('ID')
df2_indexed = df2.set_index('ID')
index_merge_df = pd.merge(df1_indexed, df2_indexed, left_index=True, right_index=True, how='inner')
print(index_merge_df)
Output:
Inner join (only IDs present in both frames):
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
Left join (all rows of df1):
   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
Right join (all rows of df2):
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
2   4      NaN   28
Outer join (all rows of both frames):
   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
3   4      NaN  28.0
Merge on index:
       Name  Age
ID
2       Bob   25
3   Charlie   30
2. Reshaping and Pivoting:
Description: Reshaping involves changing the structure of data, often from a wide format to
a long format or vice versa. It is useful for different types of analyses and visualizations.
import pandas as pd
data = {'Date': ['2022-01-01', '2022-01-02'], 'Temperature': [32, 35], 'Humidity': [50, 45]}
df = pd.DataFrame(data)
# Stack the measurement columns into rows (wide format to long format)
stacked_df = df.set_index('Date').stack().reset_index(name='Value')
print(stacked_df)
Output:
         Date      level_1  Value
0  2022-01-01  Temperature     32
1  2022-01-01     Humidity     50
2  2022-01-02  Temperature     35
3  2022-01-02     Humidity     45
Unstacking the long-format frame restores the wide layout:
level_1     Humidity  Temperature
Date
2022-01-01        50           32
2022-01-02        45           35
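The second table above corresponds to reversing the stack operation; a minimal sketch, assuming the stacked_df created above:
# Unstack the long-format frame back into one column per measurement
wide_df = stacked_df.set_index(['Date', 'level_1'])['Value'].unstack()
print(wide_df)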
3. Handling Missing Data:
Description: The tables below appear to show a small DataFrame at successive stages of
missing-value handling: the original data, values replaced (with a row appended), rows
containing NaN dropped, columns containing NaN dropped, NaN filled with a constant, and
NaN filled by linear interpolation.
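The code for this example is not included in the notes; the following is a minimal sketch that would reproduce tables like those below, assuming the sample values shown in the first table:
import pandas as pd
# Sample frame with missing values in column B (values assumed from the output)
df = pd.DataFrame({'A': [1, 2, 2, 3, 4], 'B': [4.0, 5.0, None, None, 8.0]})
print(df)
# Replace the value 2 with 999 and append a copy of the last row
df = df.replace(2, 999)
df = pd.concat([df, df.iloc[[-1]]], ignore_index=True)
print(df)
print(df.dropna())        # drop rows that contain NaN
print(df.dropna(axis=1))  # drop columns that contain NaN
print(df.fillna(0))       # fill NaN with a constant
print(df.interpolate())   # fill NaN by linear interpolation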
Output:
Original DataFrame (missing values in column B):
   A    B
0  1  4.0
1  2  5.0
2  2  NaN
3  3  NaN
4  4  8.0
After replacing 2 with 999 and appending a copy of the last row:
     A    B
0    1  4.0
1  999  5.0
2  999  NaN
3    3  NaN
4    4  8.0
5    4  8.0
After dropping rows that contain NaN:
     A    B
0    1  4.0
1  999  5.0
4    4  8.0
5    4  8.0
After dropping columns that contain NaN:
     A
0    1
1  999
2  999
3    3
4    4
5    4
After filling NaN with 0:
     A    B
0    1  4.0
1  999  5.0
2  999  0.0
3    3  0.0
4    4  8.0
5    4  8.0
After filling NaN by linear interpolation:
     A    B
0    1  4.0
1  999  5.0
2  999  6.0
3    3  7.0
4    4  8.0
5    4  8.0
Data transformation is a crucial part of the data preprocessing pipeline, enabling data
analysts and data scientists to prepare data for analysis, modeling, and visualization. These
techniques, along with others like feature engineering and missing data handling, play a
significant role in making data more informative and actionable.