IDA Question Bank Ch2
Date : 05/09/2024
Created by AJS
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process where we
examine and summarize the main characteristics of a dataset. The purpose is to understand the
data better before applying more complex statistical methods or machine learning algorithms.
Here’s a breakdown of what EDA is, why it’s important, and its main goals:
What is EDA?
EDA involves using various techniques to explore and analyze the dataset. These techniques
often include:
• Descriptive Statistics: This includes measures like mean, median, mode, standard
deviation, and variance that summarize the central tendency and spread of the data.
• Data Visualization: Creating charts, graphs, and plots (such as histograms, scatter
plots, and box plots) to visually inspect the data.
• Data Cleaning: Identifying and addressing issues like missing values, outliers, and
inconsistencies.
Goals of EDA
1. Understand Data Distribution: EDA helps you grasp the distribution of data points.
For example, you can see if your data is normally distributed or if there are any skewed
patterns.
2. Detect Outliers: It allows you to identify anomalies or outliers that might affect the
analysis.
3. Identify Relationships: By visualizing the data, you can spot potential correlations or
patterns between variables.
4. Check Assumptions: Before applying statistical models or machine learning
algorithms, EDA helps verify assumptions (like linearity or normality) that these
methods rely on.
5. Guide Further Analysis and Modeling: The insights gained from EDA guide you in selecting appropriate statistical techniques, models, and algorithms for deeper analysis.
In summary, EDA is like a preliminary investigation into your data. It’s an essential part of the
data analysis process because it helps you understand what you’re working with, which in turn
guides you in making informed decisions about how to analyze and interpret the data.
Univariate EDA
Univariate EDA focuses on analyzing a single variable at a time. The goal is to understand
the distribution and characteristics of that individual variable. Common techniques used in
univariate EDA include:
• Descriptive Statistics: Measures like mean, median, mode, range, variance, and
standard deviation.
• Visualizations: Histograms, bar charts, and box plots.
For example, suppose your dataset contains the ages of a group of people:
1. Descriptive Statistics: Calculate the average age, median age, and standard deviation
of the ages.
o Mean age: 35 years
o Median age: 34 years
o Standard deviation: 10 years
2. Visualization: Create a histogram to see the distribution of ages.
o The histogram might show that most people are between 25 and 45 years old,
with fewer individuals in the younger or older age brackets.
By focusing on one variable (age), univariate EDA helps you understand its basic properties
and distribution.
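As a rough sketch, the same kind of univariate summary can be produced with pandas and matplotlib (the ages below are made-up sample values):
python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample of ages for illustration
ages = pd.Series([23, 28, 31, 34, 34, 36, 41, 45, 52, 60])

# Descriptive statistics for the single variable
print(ages.mean())    # average age
print(ages.median())  # middle value
print(ages.std())     # spread around the mean

# Histogram of the age distribution
ages.plot(kind='hist', bins=5, title='Age distribution')
plt.show()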
Multivariate EDA
Multivariate EDA examines two or more variables at the same time to understand how they relate to and interact with one another. Let's extend the previous example to include not just ages, but also income and education level in your dataset.
1. Correlation Analysis: Compute the correlation coefficients between age, income, and
education level.
o You might find that age is positively correlated with income (older people earn
more), while education level might be positively correlated with income as well.
2. Scatter Plots: Create scatter plots to explore the relationships between two variables.
o A scatter plot of age vs. income might reveal a trend that older people generally
have higher incomes.
o A scatter plot of education level vs. income might show that higher education
is associated with higher income.
3. Pair Plots: Display scatter plots for all pairs of variables to visually assess
relationships.
o This can help you see if there are any complex interactions or trends among age,
income, and education level.
By analyzing multiple variables together, multivariate EDA helps you understand how
different factors interact with each other and how they collectively influence the outcomes in
your dataset.
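A minimal sketch of these multivariate steps, assuming a pandas DataFrame with numeric columns named age, income, and education_level (the values below are made up):
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data for the three variables discussed above
df = pd.DataFrame({
    'age': [25, 32, 40, 47, 55],
    'income': [30000, 42000, 50000, 61000, 72000],
    'education_level': [12, 14, 16, 16, 18],
})

# 1. Correlation analysis
print(df.corr())

# 2. Scatter plot of age vs. income
sns.scatterplot(x='age', y='income', data=df)
plt.show()

# 3. Pair plot for all pairs of variables
sns.pairplot(df)
plt.show()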
Summary
• Univariate EDA: Focuses on one variable to understand its distribution and basic
characteristics. Useful for initial insights into individual features.
• Multivariate EDA: Examines multiple variables simultaneously to uncover
relationships and interactions. Helps in understanding how different variables relate to
each other and influence the dataset as a whole.
Both approaches are complementary. Univariate EDA provides foundational insights into
individual variables, while multivariate EDA helps you see the bigger picture and understand
complex interactions between variables.
Scatterplots
A scatterplot displays paired values of two continuous variables as points on a two-dimensional graph.
Characteristics of a Scatterplot:
• Axes: The x-axis corresponds to one variable, and the y-axis corresponds to another.
The position of each point reflects the values of both variables for a single observation.
• Points: Each point on the graph represents an individual data observation. The position
of the point shows the relationship between the two variables.
Example of a Scatterplot:
Suppose you have a dataset with students' heights and weights. You could create a scatterplot where the x-axis represents height and the y-axis represents weight. Each point on the plot would then represent one student, showing their height and weight together.
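For instance, a quick matplotlib sketch (the heights and weights below are made-up values):
python
import matplotlib.pyplot as plt

# Made-up heights (cm) and weights (kg) for a few students
heights = [150, 160, 165, 170, 175, 180]
weights = [50, 55, 60, 65, 72, 80]

plt.scatter(heights, weights)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Height vs. Weight')
plt.show()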
Applications of Scatterplots:
Scatterplots are useful in a wide range of applications where relationships between two
continuous variables are analyzed.
Conclusion:
Scatterplots are invaluable for identifying relationships between two continuous variables.
They help detect trends, patterns, correlations, and outliers, making them a widely used tool
across various fields like business, healthcare, environmental studies, and social sciences. By
visualizing data in this way, scatterplots allow you to explore and understand complex
relationships in a simple and intuitive manner.
Three-dimensional (3D) plotting allows you to visualize data in three dimensions, adding depth
to your plots. This is particularly useful for understanding relationships between three variables
or presenting complex data in a more informative way.
1. Import Libraries
Before you can create 3D plots, you need to import the necessary libraries. Specifically, you
need matplotlib and the Axes3D class from mpl_toolkits.mplot3d.
python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
2. Create Data
Generate or load data for the three dimensions. For demonstration purposes, you can create
sample data using numpy.
python
# Generate sample data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))
3. Initialize a 3D Plot
python
# Create a new figure
fig = plt.figure()
# Add a 3D subplot
ax = fig.add_subplot(111, projection='3d')
4. Plot Data
Use the plotting functions provided by the Axes3D class to create different types of 3D plots.
Here are a few common types:
Surface Plot
python
# Create a surface plot
ax.plot_surface(x, y, z, cmap='viridis')
Wireframe Plot
A wireframe plot shows the structure of the surface without solid coloring.
python
# Create a wireframe plot
ax.plot_wireframe(x, y, z, color='black')
Scatter Plot
python
# Create a scatter plot
ax.scatter(x, y, z, c='r', marker='o')
5. Customize Plot
python
# Add labels and title
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.set_title('3D Plot Example')
6. Show Plot
python
# Show the plot
plt.show()
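Combining the snippets above, a complete runnable example of a 3D surface plot:
python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # makes the '3d' projection available

# Sample data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))

# Create the figure and 3D axes, then plot the surface
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.set_title('3D Plot Example')
plt.show()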
Summary
This process enables you to visualize and analyze data in three dimensions, providing a deeper
understanding of complex relationships.
Importance of EDA in Data Analytics
1. Data Understanding
• EDA helps in familiarizing with the data: its structure, key features, and overall
patterns.
• It enables analysts to see distributions, trends, and potential biases that could impact
future analyses.
2. Data Quality Assessment
• During EDA, you identify data quality issues such as missing values, outliers, or
incorrect data types.
• Ensuring data quality early helps avoid misleading results in the later stages of analysis.
3. Hypothesis Generation
• EDA helps generate hypotheses about the relationships and interactions between
variables.
• It allows for testing of initial ideas and assumptions, guiding the direction for further
in-depth analysis.
4. Pattern and Relationship Detection
• EDA helps detect underlying patterns, correlations, or clusters within the data.
• For instance, scatterplots and correlation matrices can reveal relationships between
variables that may be useful for predictive modeling.
5. Informed Model Selection
• Based on insights from EDA, you can make better decisions about which statistical
models or algorithms are appropriate for the data.
• EDA can show whether data requires transformation or which variables should be
prioritized in modeling.
6. Error Reduction
• By closely examining the data during EDA, you reduce the chances of making
analytical mistakes, such as applying incorrect assumptions or overlooking key data
issues.
7. Better Communication
• The plots and summaries produced during EDA make it easier to explain findings and support conclusions when presenting results to others.
Conclusion:
EDA is essential in data analytics because it forms the foundation for accurate, insightful, and
reliable analysis. It ensures that data is well understood, cleaned, and ready for advanced
techniques, minimizing the risk of errors and maximizing the effectiveness of the analysis.
Interval Estimation
Interval Estimation is a statistical method that uses sample data to calculate a range of values (an interval) likely to contain an unknown population parameter, rather than a single point estimate.
Key Concepts:
• Point Estimate: A single value estimate of a population parameter (e.g., the sample
mean as an estimate of the population mean).
• Confidence Interval: A range of values, derived from the sample data, that is likely to
contain the population parameter.
• Confidence Level: The probability that the interval estimate will contain the population
parameter. Common confidence levels are 90%, 95%, and 99%.
Suppose you are analyzing the average income of a sample of people in a city. You calculate a
sample mean of $50,000. To account for sampling variability, you might compute a 95%
confidence interval, say from $48,000 to $52,000. This means you are 95% confident that the
true average income of the entire population lies within this range.
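As a rough sketch, such an interval can be computed with numpy using the normal approximation (z = 1.96 for 95% confidence); the income values below are made up:
python
import numpy as np

# Made-up sample of incomes
incomes = np.array([48000, 52000, 50000, 47000, 53000, 49000, 51000, 50500])

mean = incomes.mean()
sem = incomes.std(ddof=1) / np.sqrt(len(incomes))  # standard error of the mean

# 95% confidence interval using the normal approximation (z = 1.96)
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem
print(f"95% CI: ({lower:.0f}, {upper:.0f})")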
Importance in EDA:
Hypothesis Testing
Hypothesis Testing is a statistical method used to make decisions about a population based on
sample data. It involves testing an assumption (the hypothesis) about a population parameter.
Key Concepts:
• Null Hypothesis (H₀): The default assumption that there is no effect or no difference.
For example, "the average income in the city is $50,000."
• Alternative Hypothesis (H₁): The competing assumption that there is an effect or a
difference. For example, "the average income in the city is not $50,000."
• Test Statistic: A standardized value derived from the sample data, used to decide
whether to reject the null hypothesis.
• p-Value: The probability of obtaining a test statistic at least as extreme as the one
observed, assuming the null hypothesis is true.
• Significance Level (α): The threshold for deciding whether to reject the null
hypothesis, commonly set at 0.05.
Suppose you want to test whether the average income in the city is different from $50,000. You
perform a hypothesis test with the null hypothesis that the mean income is $50,000. If your p-
value is less than 0.05, you might reject the null hypothesis, suggesting that the average income
is likely different from $50,000.
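A small sketch of this one-sample test using scipy (the income sample is made up):
python
import numpy as np
from scipy import stats

# Made-up sample of incomes
incomes = np.array([52000, 51000, 49500, 53000, 54000, 50500, 52500, 51500])

# H0: mean income = 50,000   vs   H1: mean income != 50,000
t_stat, p_value = stats.ttest_1samp(incomes, popmean=50000)
print(t_stat, p_value)

if p_value < 0.05:
    print("Reject H0: the average income likely differs from $50,000")
else:
    print("Fail to reject H0")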
Importance in EDA:
Steps in Exploratory Data Analysis
1. Data Collection and Familiarization
• Get the Data: First, you load the dataset into your analysis environment (like Python
or R).
• Familiarize: Check what kind of data you're working with, the size of the dataset
(number of rows and columns), and the types of variables (numerical, categorical, etc.).
2. Data Cleaning
• Handle Missing Data: Identify any missing or incomplete values in your dataset and
decide how to deal with them (e.g., filling them with averages, removing them, etc.).
• Fix Errors: Look for and correct any inconsistencies or mistakes in the data (e.g.,
incorrect values, duplicates).
• Convert Data Types: Ensure that variables are in the correct format (e.g., converting
strings that represent numbers into actual numerical values).
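A small pandas sketch of these cleaning steps (the file name data.csv and the column names age and salary are placeholders):
python
import pandas as pd

df = pd.read_csv('data.csv')  # placeholder file name

# Handle missing data: fill numeric gaps with the column average
df['age'] = df['age'].fillna(df['age'].mean())

# Fix errors: drop exact duplicate rows
df = df.drop_duplicates()

# Convert data types: turn strings that represent numbers into numeric values
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')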
3. Descriptive Statistics
• Summarize Data: Calculate basic statistics like the mean (average), median,
minimum, maximum, and standard deviation for numerical variables.
• Frequency Counts: For categorical data, count how often each category appears.
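For example, assuming a small DataFrame with a numeric age column and a categorical gender column:
python
import pandas as pd

df = pd.DataFrame({'age': [21, 25, 25, 30, 34],
                   'gender': ['F', 'M', 'F', 'M', 'F']})

# Summary statistics for a numerical variable
print(df['age'].describe())        # count, mean, std, min, quartiles, max

# Frequency counts for a categorical variable
print(df['gender'].value_counts())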
4. Data Visualization
• Visualize Distributions: Use plots like histograms, box plots, and bar charts to see the
distribution (spread) of each variable.
• Check Relationships: Create scatter plots or correlation matrices to explore
relationships between different variables.
5. Identify Patterns and Outliers
• Spot Trends: Look for any trends or patterns in the data (e.g., increases or decreases
over time, clustering of points).
• Outliers: Identify any outliers (unusually high or low values) that might need special
attention or removal.
6. Hypothesis Generation
• Form Hypotheses: Based on what you’ve discovered in your analysis, generate ideas
about how different variables might be related. These hypotheses will guide future
analyses and tests.
7. Prepare the Data for Further Analysis
• Feature Engineering: Create new features or variables that might be useful for future
modeling.
• Finalize Clean Data: Make sure the data is clean, structured, and ready for more
advanced analysis like machine learning or statistical modeling.
Summary:
These steps help you explore the dataset, identify important insights, and prepare it for deeper
analysis.
1. Box Plots
A box plot summarizes the spread of a variable using quartiles and highlights potential outliers.
• Components:
o Box: Shows the interquartile range (IQR), which contains the middle 50% of
the data (from Q1 to Q3).
o Line inside the box: Represents the median (Q2).
o Whiskers: Extend from the box to the minimum and maximum values within
1.5 times the IQR.
o Outliers: Points that lie outside the whiskers are considered outliers and are
plotted as individual points.
Example:
A box plot of students' test scores might show that the majority scored between 60 and 80, with
a median score of 70. There could also be a few outliers representing very low or very high
scores.
2. Histograms
A histogram shows the distribution of a continuous variable by dividing the data into bins and counting how many observations fall into each bin.
Example:
A histogram of students' ages might show most students are between 18-22 years old, with a
peak (highest bar) around age 20.
3. Scatter Plots
A scatter plot is a graph used to visualize the relationship between two continuous variables.
Each point on the scatter plot represents one observation in the dataset, with one variable on
the x-axis and the other on the y-axis.
• Points: Each point represents an observation, showing how the two variables relate to
each other.
Example:
A scatter plot of students' heights and weights might show a positive correlation, where taller
students tend to weigh more. The points would form an upward trend, indicating a relationship
between height and weight.
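A compact sketch that produces all three plots from made-up student data:
python
import matplotlib.pyplot as plt

scores  = [45, 60, 62, 65, 70, 72, 75, 78, 80, 95]            # test scores
ages    = [18, 19, 19, 20, 20, 20, 21, 21, 22, 23]            # student ages
heights = [150, 155, 158, 160, 165, 168, 170, 172, 178, 182]  # cm
weights = [48, 52, 55, 58, 60, 63, 66, 68, 74, 80]            # kg

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].boxplot(scores)            # spread, median, and outliers of scores
axes[0].set_title('Box plot of test scores')

axes[1].hist(ages, bins=5)         # distribution of ages in bins
axes[1].set_title('Histogram of ages')

axes[2].scatter(heights, weights)  # relationship between two variables
axes[2].set_title('Height vs. weight')

plt.tight_layout()
plt.show()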
Summary of Terms:
• Box Plots: Visualize the spread and identify outliers in a dataset using quartiles.
• Histograms: Show the distribution of a continuous variable by dividing data into bins.
• Scatter Plots: Display the relationship between two continuous variables using points
on a graph.
Each plot serves different purposes in data analysis, helping to visualize and understand data
distributions and relationships.
Comparison of focus:
• Univariate analysis: Describing the data and its central tendency, dispersion, etc.
• Bivariate analysis: Understanding the relationship (correlation, association) between two variables.
• Multivariate analysis: Understanding complex interactions and dependencies among multiple variables.
Summary:
1. df.info()
• Description: This function is used with pandas DataFrames to get a concise summary
of the dataset.
• Working:
o It displays information about the DataFrame, including the total number of
entries (rows), the number of columns, the data types of each column, and the
number of non-null values in each column.
o It helps quickly identify data types and whether there are any missing values in
the dataset.
Example:
python
import pandas as pd
df = pd.read_csv('data.csv')
df.info()
This will print a summary of the dataset, such as column names, types, and how many non-null
values each column contains.
2. sns.countplot()
• Description: A Seaborn function used to create a count plot. It shows the frequency of
categories in a categorical variable by plotting bars.
• Working:
o It takes a column from the DataFrame and creates a bar plot where the x-axis
represents the unique categories and the y-axis represents the count (frequency)
of each category.
o It is particularly useful for visualizing the distribution of categorical data.
Example:
python
import seaborn as sns
sns.countplot(x='gender', data=df)
This will create a bar plot showing the number of occurrences for each category in the gender
column.
3. replace()
• Description: A pandas method that replaces specified values in a DataFrame or Series with new values, often used to recode categories (e.g., text labels) into numbers.
Example:
python
df['gender'] = df['gender'].replace({'Male': 1, 'Female': 0})
This replaces the string values 'Male' and 'Female' with 1 and 0, respectively, in the gender
column.
4. df.corr()
• Description: This pandas function calculates the correlation matrix for the numerical
columns in the DataFrame.
• Working:
o The function computes the pairwise correlation between columns, typically
using Pearson's correlation coefficient, which measures the linear relationship
between variables.
o The output is a matrix where each element represents the correlation between
two variables, with values ranging from -1 to 1. A value close to 1 indicates a
strong positive correlation, while a value close to -1 indicates a strong negative
correlation.
Example:
python
correlation_matrix = df.corr()
print(correlation_matrix)
This will print a matrix showing the correlation between all numerical columns in the
DataFrame.
Summary:
• df.info(): Provides a quick summary of the DataFrame, including data types and
missing values.
• sns.countplot(): Visualizes the frequency of categories in a categorical variable
using a bar plot.
• replace(): Replaces specified values in the DataFrame or Series with other values.
• df.corr(): Calculates the correlation matrix for numerical columns to measure
relationships between variables.
1. Data
• Definition: Data refers to raw facts, figures, or observations that are collected for
analysis or reference. It can come in various forms, such as numbers, text, images, or
measurements, and is the foundation of any analytical or computational process.
• Example: Temperature readings (e.g., 22°C, 25°C, 30°C) or survey responses (e.g.,
"Yes", "No").
2. DataFrame
• Definition: A DataFrame is a two-dimensional, table-like data structure with labeled rows and columns (for example, as provided by the pandas library in Python). Each column can hold a different type of data.
• Example: A table of customers where each row is one customer and the columns are name, age, and city.
3. Dataset
• Definition: A dataset is an organized collection of related data, often stored as a table, a file (such as a CSV), or a database, that is used for analysis.
• Example: A CSV file containing a year of daily temperature readings for a city.
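A tiny sketch that ties the three terms together (the file name temperatures.csv is just a placeholder):
python
import pandas as pd

# Data: individual temperature readings (raw facts/observations)
readings = [22, 25, 30]

# DataFrame: the data organized into a labeled, tabular structure
df = pd.DataFrame({'day': ['Mon', 'Tue', 'Wed'], 'temperature_c': readings})

# Dataset: the collection saved to (or loaded from) a file for analysis
df.to_csv('temperatures.csv', index=False)
dataset = pd.read_csv('temperatures.csv')
print(dataset)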
Summary:
Exploratory Data Analysis (EDA) is a crucial step in data analysis, used to summarize the
main characteristics of a dataset, often with visual methods. It helps in understanding the data's
underlying patterns, spotting anomalies, and forming hypotheses. EDA is broadly categorized
into different types and uses various tools for performing the analysis.
Types of EDA:
1. Univariate Analysis:
o Definition: Examines each variable individually to understand its distribution
and summary statistics.
o Purpose: Understand central tendency (mean, median), spread (range,
variance), and shape (skewness, kurtosis).
o Tools: Histograms, Box plots, Frequency tables.
2. Bivariate Analysis:
o Definition: Analyzes the relationship between two variables.
o Purpose: Discover associations, correlations, or trends between variables.
o Tools: Scatter plots, Correlation matrices, Bar plots, Box plots.
3. Multivariate Analysis:
o Definition: Examines the interactions among three or more variables.
o Purpose: Understand joint effects, interactions, and dependencies among several variables.
o Tools: Pair plots, correlation heatmaps, 3D plots.
Tools for EDA:
1. Pandas (Python):
o Description: A powerful library for data manipulation and analysis in Python.
It provides functions like df.describe(), df.info(), and df.corr()
to quickly summarize data.
o Use: Cleaning, summarizing, and basic statistics for data.
2. Matplotlib:
o Description: A low-level plotting library for creating static, animated, and
interactive visualizations in Python.
o Use: Building custom plots such as histograms, scatter plots, and line plots.
3. Seaborn:
o Description: A higher-level data visualization library built on top of Matplotlib.
Seaborn simplifies complex visualizations with a more intuitive API and better
aesthetics.
o Use: Creating statistical visualizations like pair plots, count plots, and
correlation heatmaps.
4. R Language:
o Description: A popular language for data analysis and statistical computing,
with powerful packages like ggplot2 for visualization and dplyr for data
manipulation.
o Use: Extensive data analysis, visualizations, and statistical modeling.
5. Tableau/Power BI:
o Description: User-friendly data visualization tools that enable interactive and
dynamic visual exploration of data without needing to write code.
o Use: Creating dashboards, reports, and visual analysis with drag-and-drop
functionality.
Summary:
EDA is a critical step in understanding and preparing data, using types like univariate, bivariate,
and multivariate analysis. Tools like Pandas, Matplotlib, Seaborn, R, and Tableau help in
summarizing, visualizing, and exploring data to uncover insights.
Exploratory Data Analysis (EDA) is the initial process of analyzing datasets to summarize
their main characteristics, often using statistical graphics and other data visualization methods.
The goal of EDA is to understand the structure, patterns, and relationships within the data,
identify anomalies, and develop hypotheses for further analysis.
1. Load and Inspect the Data
• Action: Load and inspect the dataset to understand its structure, dimensions, and types
of variables.
• Tools: df.info(), df.head(), df.shape
• Example:
python
import pandas as pd
df = pd.read_csv('student_grades.csv')
df.info()          # Get column information
print(df.head())   # Preview the first few rows
2. Data Cleaning
• Action: Identify and handle missing values, duplicate records, or erroneous data entries.
• Tools: df.isnull().sum(), df.dropna(), df.duplicated()
• Example:
python
print(df.isnull().sum()) # Check for missing values
df.dropna(inplace=True) # Remove rows with missing data
df.drop_duplicates(inplace=True) # Remove duplicate rows
3. Univariate Analysis
• Action: Examine each variable on its own, using summary statistics and distribution plots.
• Tools: Histograms, df.describe()
• Example:
python
import matplotlib.pyplot as plt
df['math_score'].hist(bins=10)  # Histogram for a single variable
plt.show()
print(df['math_score'].describe()) # Summary statistics
4. Bivariate Analysis
• Action: Explore relationships between two variables, often using scatter plots or
correlation matrices for numerical variables and bar plots for categorical variables.
• Tools: Scatter plots, Correlation matrices, Bar plots
• Example:
python
import seaborn as sns
sns.scatterplot(x='study_hours', y='math_score', data=df)  # Scatter plot
plt.show()
5. Multivariate Analysis
• Action: Examine several variables together to spot joint patterns and interactions.
• Tools: Pair plots
• Example:
python
sns.pairplot(df[['study_hours', 'math_score', 'reading_score']])  # Pair plot for multiple variables
plt.show()
6. Outlier Detection
• Action: Identify and handle outliers in the dataset using visual methods or statistical
techniques.
• Tools: Box plots, Z-scores, IQR method
• Example:
python
sns.boxplot(x='math_score', data=df)  # Box plot to visualize outliers
plt.show()
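Continuing with the same df, the IQR rule mentioned in the Tools above can also be applied directly; a rough sketch:
python
# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df['math_score'].quantile(0.25)
q3 = df['math_score'].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df['math_score'] < lower) | (df['math_score'] > upper)]
print(outliers)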
7. Feature Engineering
• Action: Create new features or transform existing features that could improve the
analysis or predictive models.
• Tools: Combining features, creating interaction terms, normalization, etc.
• Example:
python
df['total_score'] = df['math_score'] + df['reading_score']  # Create a new feature
8. Insights and Hypothesis Generation
• Action: Summarize findings and generate hypotheses based on the patterns observed
in the data.
• Example:
o Observation: Students who study more hours tend to have higher math scores.
o Hypothesis: Increasing study hours could improve performance in math.
Example Summary:
Suppose the dataset contains student records with variables like math_score,
reading_score, study_hours, and attendance. Through EDA, you might identify, for example, that students who study more hours tend to have higher math scores.
This exploration helps in forming hypotheses and sets the foundation for predictive modeling
or further statistical testing.
Conclusion:
EDA is a crucial step in data analysis, enabling you to understand the dataset, spot patterns and
trends, and prepare the data for more advanced analytics or machine learning models. By
following the step-by-step process, you ensure that your data is ready for deeper insights and
that your analysis is based on sound foundations.
Data Analytics is the process of examining and interpreting data to extract useful insights and
support decision-making. It involves collecting, cleaning, and analyzing data to uncover
patterns, trends, and relationships that can help organizations make informed decisions.
1. Define the Objective
• Action: Identify the problem or question you want to answer with data.
• Example: A company wants to know why their sales have dropped in the last quarter.
2. Collect Data
• Action: Gather the relevant data from sources such as databases, spreadsheets, or surveys.
3. Clean and Prepare the Data
• Action: Process and clean the data to ensure it's accurate and usable.
• Tasks: Handle missing values, remove duplicates, and correct errors.
• Example: Fill in missing sales figures, correct typos in customer names.
4. Explore the Data
• Action: Perform exploratory data analysis (EDA) to understand the data and find
patterns.
• Techniques: Summary statistics, data visualization (charts, graphs), and correlation
analysis.
• Example: Create histograms of sales figures, scatter plots to see if there's a relationship
between marketing spend and sales.
5. Build Models
• Action: Use statistical models or machine learning algorithms to analyze data and make
predictions.
• Techniques: Regression analysis, classification models, clustering.
• Example: Build a regression model to predict future sales based on historical data and
marketing efforts.
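As an illustrative sketch (not an actual company model), a simple linear regression with scikit-learn on made-up monthly figures:
python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: marketing spend and sales, both in thousands of dollars
marketing_spend = np.array([[10], [15], [20], [25], [30]])
sales = np.array([100, 130, 155, 185, 210])

model = LinearRegression()
model.fit(marketing_spend, sales)

# Predict sales for a planned spend of 35 (thousand dollars)
predicted = model.predict(np.array([[35]]))
print(predicted)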
6. Interpret Results
• Action: Analyze the results of the models and translate them into actionable insights.
• Tasks: Identify key findings, trends, and implications for decision-making.
• Example: Determine that increased marketing spend is likely to boost sales based on
the model results.
7. Communicate Findings
• Action: Present the insights and recommendations in a clear and understandable way.
• Tools: Reports, dashboards, presentations.
• Example: Create a presentation showing how marketing spend affects sales and
recommend strategies to increase budget allocation.
8. Make Decisions and Act
• Action: Apply the insights to make business decisions and monitor their impact.
• Tasks: Implement changes, track performance, and adjust strategies as needed.
• Example: Increase the marketing budget and track sales performance over the next
quarter to see if there’s an improvement.
Summary
This process helps organizations make data-driven decisions, solve problems, and achieve their
goals effectively.
Here's a brief comparison of Data Analytics and Data Science in simple terms:
• Scope: Data Analytics focuses on examining existing data to answer specific business questions, while Data Science covers a broader workflow that also includes building predictive models and machine learning solutions.
• Techniques: Data Analytics relies mainly on statistics, querying, and visualization, whereas Data Science adds machine learning and more extensive programming.
• Output: Data Analytics typically delivers reports, dashboards, and insights; Data Science delivers models, predictions, and data products.
Summary
Both fields use data to create value: Data Analytics is narrower and question-driven, while Data Science spans the full pipeline from data collection to model building and deployment.