UNIT 1 Exploratory Data Analysis


1.Fundamentals of EDA:
Exploratory Data Analysis (EDA):
• EDA is a crucial step in the data analysis process.
• It involves the initial examination of data sets to discover patterns, anomalies, and
insights.
• EDA aims to understand the data's structure, identify relationships, and generate
hypotheses.
Principles of EDA:
• Data summarization through descriptive statistics and visualizations.
• Handling missing data and outliers effectively.
• Exploring data distributions, correlations, and trends.
• Iterative process involving data cleaning and feature engineering.
Tools for EDA:
• Popular tools include Python libraries like Pandas, Matplotlib, and Seaborn.
• Data visualization tools such as Tableau and Power BI.
• Statistical software like R for in-depth analysis.

1.1 Understanding data science


Explanation: Contextualizing EDA within the broader field of data science and its relevance.
Contextualizing EDA:
• EDA is a foundational component of data science.
• It helps data scientists gain initial insights into data before advanced analysis.
• EDA aids in data preprocessing, which is crucial for building robust models.
Data Science in a Nutshell:
• Data science is a multidisciplinary field that uses scientific methods, algorithms,
processes, and systems to extract knowledge and insights from structured and
unstructured data.
• It encompasses various stages, including data collection, data cleaning, EDA, model
building, and deployment.
Relevance of Data Science:
• Data science has widespread applications in industries such as healthcare, finance,
marketing, and technology.
• It plays a pivotal role in making data-driven decisions, improving business strategies,
and predicting future trends.
• Data scientists are in high demand due to their ability to harness data for competitive
advantages.
Skills in Data Science:
• Data scientists require a strong foundation in statistics, programming, and domain-
specific knowledge.
• Proficiency in tools like Python, R, SQL, and machine learning frameworks.
• Communication skills to convey findings and insights effectively.

There are several phases of data analysis, including data requirements, data collection, data
processing, data cleaning, exploratory data analysis, modelling and algorithms, and data
product and communication.

• Data requirements: There can be various sources of data for an organization. It is
important to understand what type of data the organization needs to collect, curate,
and store.
For example, an application tracking the sleeping pattern of patients suffering from
dementia requires storage for several types of sensor data, such as sleep data, heart
rate, electro-dermal activity, and user activity patterns. All of these data points are
required to correctly diagnose the mental state of the person.

• Data collection: Data collected from several sources must be stored in the correct
format and transferred to the right information technology personnel within a
company. As mentioned previously, data can be collected from several objects on
several events using different types of sensors and storage tools.

• Data processing: Preprocessing involves curating the dataset before the actual
analysis. Common tasks include correctly exporting the data, placing it in the right
tables, structuring it, and saving it in the correct format.

• Data cleaning: Pre-processed data is still not ready for detailed analysis. It must be
checked for incompleteness, duplicates, errors, and missing values. These tasks are
performed in the data cleaning stage, which involves matching the correct records,
finding inaccuracies in the dataset, understanding the overall data quality, removing
duplicate items, and filling in the missing values.

• EDA: Exploratory data analysis, as mentioned before, is the stage where we actually
start to understand the message contained in the data. It should be noted that
several types of data transformation techniques might be required during the
process of exploration.

• Modeling and algorithm: From a data science perspective, generalized models or
mathematical formulas can represent or exhibit relationships among different
variables, such as correlation or causation. These models or equations involve one or
more variables that depend on other variables to cause an event.
In general, a model always describes the relationship between independent and
dependent variables. Inferential statistics deals with quantifying relationships
between particular variables.
For example, when buying pens, the total price of the pens (Total) = the price of one
pen (UnitPrice) * the number of pens bought (Quantity). Hence, our model would be
Total = UnitPrice * Quantity. Here, the total price depends on the unit price and the
quantity. Hence, the total price is referred to as the dependent variable, and the unit
price and the quantity are referred to as independent variables.
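
The pen example can be written directly as a small function. This is a minimal sketch: the function name `total_price` and the sample numbers are illustrative assumptions, not part of the original notes.

```python
def total_price(unit_price: float, quantity: int) -> float:
    """Model from the text: Total = UnitPrice * Quantity."""
    return unit_price * quantity


# Independent variables go in, the dependent variable comes out
for quantity in (1, 5, 10):
    print(quantity, total_price(unit_price=2.5, quantity=quantity))
```

Changing either input changes the output, which is what makes Total the dependent variable.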

• Data Product: Any computer software that uses data as inputs, produces outputs,
and provides feedback based on the output to control the environment is referred to
as a data product. A data product is generally based on a model developed during
data analysis, for example, a recommendation model that inputs user purchase
history and recommends a related item that the user is highly likely to buy.

• Communication: This stage deals with disseminating the results to end stakeholders
to use the result for business intelligence. One of the most notable steps in this stage
is data visualization. Visualization deals with information relay techniques such as
tables, charts, summary diagrams, and bar charts to show the analyzed result.

1.2 Significance of EDA:


Explanation: Understanding why EDA is crucial in the data analysis process and decision-
making.

EDA as a Foundation:
• Exploratory Data Analysis (EDA) is the cornerstone of data analysis.
• It forms the initial phase in the data analysis pipeline.
• EDA sets the stage for all subsequent data-driven decisions.
Uncovering Patterns and Anomalies:
• EDA helps reveal hidden patterns, trends, and anomalies within data.
• These discoveries can provide valuable insights for decision-makers.
• Identifying outliers or unusual data points can highlight data quality issues.
Data Quality Assessment:
• EDA aids in assessing the quality and reliability of data.
• It allows for the identification and handling of missing values, duplications, and
errors.
• Ensuring data quality is essential for meaningful analysis and decision-making.
Hypothesis Generation:
• EDA encourages the formulation of hypotheses about the data.
• These hypotheses can guide further analysis and experimentation.
• Data-driven hypotheses are more likely to lead to actionable insights.
Feature Selection and Engineering:
• EDA helps in selecting relevant features for modelling.
• Feature engineering, based on EDA findings, can enhance model performance.
• Efficient feature selection reduces computational complexity.
Communication of Insights:
• EDA results are often presented through visualizations and summaries.
• Clear communication of insights to stakeholders is vital.
• Visualizations make complex data more understandable to non-technical audiences.
Risk Mitigation:
• EDA can uncover risks and potential issues within the data.
• Addressing these issues early in the analysis process can prevent costly errors and
wrong decisions.
• Risk mitigation is a crucial aspect of data-driven decision-making.
Continuous Improvement:
• As more data becomes available or as research questions evolve, EDA should be
revisited.
• EDA is not a one-time process but an iterative one.
• Continuous EDA ensures that insights remain up-to-date and relevant.

In summary, EDA is of paramount importance in the data analysis process as it lays the
foundation for informed decision-making. It uncovers insights, assesses data quality,
generates hypotheses, and helps communicate findings effectively, ultimately leading to
more robust and reliable decisions.

1.3 Steps for EDA:


1.Data Collection:
• Gather relevant data from various sources.
• Ensure data is complete, accurate, and well-documented.
2.Data Cleaning:
• Identify and handle missing values, duplicates, and outliers.
• Ensure data quality for meaningful analysis.
3.Data Summarization:
• Compute descriptive statistics to understand data distribution.
• Gain initial insights into data characteristics.
4.Data Visualization:
• Create graphs and plots to visually represent data.
• Reveal patterns, trends, and relationships.
5.Correlation Analysis:
• Examine the strength and direction of relationships between variables.
• Identify dependencies and associations.
6.Categorical Data Exploration:
• Analyze categorical variables using frequency tables, bar charts, etc.
• Understand distributions and proportions.
7.Time Series Analysis (if applicable):
• Explore data over time to identify trends and seasonality.
• Look for time-based patterns and anomalies.
8.Dimensionality Reduction (if applicable):
• Reduce the number of features while preserving essential information.
• Use techniques like PCA for high-dimensional data.
9.Feature Selection and Engineering:
• Investigate feature importance and relevance for modelling.
• Select or engineer features based on EDA findings.
10.Interactive Exploration:
• Use interactive tools and dashboards for in-depth exploration.
• Allows for dynamic analysis and drilling down into data.
11.Hypothesis Generation:
• Formulate hypotheses based on EDA insights.
• Guide further analysis and experimentation.
12.Iteration and Refinement:
• EDA is an iterative process.
• Revisit and refine analysis as needed, especially with new data or questions.
13.Documentation and Communication:
• Document EDA procedures, findings, and visualizations.
• Effectively communicate insights to stakeholders.

These steps collectively form the EDA process, which is essential for understanding data,
making informed decisions, and preparing the data for advanced analytics and modelling.
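
Several of the steps above can be sketched in a few lines of pandas. The data below is hypothetical and was invented for illustration; filling missing ages with the median is one possible cleaning choice, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one duplicate row and one missing value
df = pd.DataFrame({
    "age": [25, 30, 30, 41, np.nan, 35],
    "income": [40_000, 52_000, 52_000, 75_000, 61_000, 58_000],
    "segment": ["A", "B", "B", "A", "C", "B"],
})

# Step 2, data cleaning: drop exact duplicates, fill missing ages with the median
clean = df.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Step 3, data summarization
print(clean.describe())

# Step 5, correlation analysis on the numeric columns
print(clean[["age", "income"]].corr())

# Step 6, categorical exploration: frequency table
print(clean["segment"].value_counts())
```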

1.4 Python Packages:

NumPy: NumPy is a fundamental package for scientific computing with Python. It provides
support for working with arrays and matrices.
import numpy as np
# Create a 1D array and perform basic operations
arr = np.array([1, 2, 3, 4, 5])
mean = np.mean(arr)
# np.std returns the population standard deviation (ddof=0);
# pass ddof=1 for the sample estimate
std_dev = np.std(arr)
print(f"Mean: {mean}, Standard Deviation: {std_dev}")
Plotly: Plotly is a library that allows creating interactive and dynamic visualizations, including
2D and 3D charts.
import plotly.express as px
# Sample data
data = px.data.iris()
# Create a scatter plot using Plotly
fig = px.scatter(data, x='sepal_width', y='sepal_length', color='species')
# Show the interactive plot
fig.show()

Seaborn: Seaborn is built on top of Matplotlib and provides a higher-level interface for
creating informative and attractive statistical graphics.
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample dataset
tips = sns.load_dataset('tips')
# Create a scatter plot
sns.scatterplot(x='total_bill', y='tip', data=tips)
# Add labels and title
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Scatter Plot of Tips vs Total Bill')
# Display the plot
plt.show()

Matplotlib: Matplotlib is a widely used Python library for creating static, interactive, and
animated visualizations.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]
# Create a line plot
plt.plot(x, y, marker='o')
# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample Line Plot')
# Display the plot
plt.show()

Pandas: Pandas is a popular Python library used for data manipulation and analysis. It
provides data structures like DataFrame for handling tabular data.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]}
df = pd.DataFrame(data)
# Display basic statistics
print(df.describe())
# Display first few rows
print(df.head())

2.EDA:
2.1 Making Sense of Data:
Making sense of complex datasets through EDA involves employing various techniques and
approaches to gain insights and understand data patterns. This process is crucial for making
informed decisions and drawing meaningful conclusions. Here, we'll explore EDA techniques
for different types of data, including numerical data (both discrete and continuous),
categorical data, and understanding different measuring scales.

1.Numerical Data:
Discrete Numerical Data:
• Refers to data that can only take specific, distinct values.
• Common examples include counts, integers, and whole numbers.
• EDA techniques:
• Create frequency distributions and histograms to visualize data distribution.
• Calculate summary statistics such as mean, median, mode, and standard deviation.
• Explore measures of central tendency and dispersion.
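
As a minimal sketch of these techniques, the snippet below builds a frequency distribution and summary statistics for a small set of counts. The ticket counts are hypothetical.

```python
import pandas as pd

# Hypothetical counts: number of support tickets per customer
tickets = pd.Series([0, 1, 1, 2, 2, 2, 3, 5], name="tickets")

# Frequency distribution, sorted by value rather than by count
freq = tickets.value_counts().sort_index()
print(freq)

# Central tendency and dispersion
print("mean:", tickets.mean())
print("median:", tickets.median())
print("mode:", tickets.mode().tolist())
print("std (sample):", tickets.std())
```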

Continuous Numerical Data:


• Represents data that can take any value within a range.
• Examples include real numbers, measurements, and decimals.
• EDA techniques:
• Construct density plots or kernel density estimations for distribution visualization.
• Compute summary statistics and examine skewness and kurtosis.
• Use box plots to identify outliers and understand data variability.

2.Categorical Data:
Categorical Data:
• Consists of non-numeric categories or labels.
• Examples include gender, colors, or product categories.
• EDA techniques:
• Create bar charts or pie charts to display category frequencies.
• Calculate proportions and percentages.
• Explore cross-tabulations to understand relationships between categorical variables.
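
A short sketch of frequencies, proportions, and a cross-tabulation with pandas, using hypothetical survey responses:

```python
import pandas as pd

# Hypothetical survey responses
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M"],
    "product": ["Basic", "Basic", "Pro", "Pro", "Basic", "Pro", "Pro", "Basic"],
})

# Category frequencies and proportions
counts = df["product"].value_counts()
shares = df["product"].value_counts(normalize=True)
print(counts)
print(shares)

# Cross-tabulation of two categorical variables, with row and column totals
table = pd.crosstab(df["gender"], df["product"], margins=True)
print(table)
```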
Measuring Scales:
1.Nominal Scale:
• Represents data with unordered categories.
• EDA considerations:
• Use frequency tables and bar charts.
• Calculate mode (most frequent category).
• No meaningful arithmetic operations.
2.Ordinal Scale:
• Involves data with ordered categories but unknown intervals.
• EDA considerations:
• Visualize using ordered bar charts.
• Calculate mode and median.
• Limited arithmetic operations (e.g., ranking).
3.Interval Scale:
• Numerical data with known intervals but no true zero point.
• EDA considerations:
• Visualize with histograms or density plots.
• Calculate mean, median, and standard deviation.
• Arithmetic operations (e.g., addition and subtraction) are meaningful, but ratios are
not (no true zero).
4.Ratio Scale:
• Numerical data with known intervals and a true zero point.
• EDA considerations:
• Employ various numerical and visual techniques (histograms, box plots).
• Calculate mean, median, mode, standard deviation, and coefficients of variation.
• All arithmetic operations are meaningful (e.g., multiplication, division).
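
To illustrate how the scale limits the statistics that make sense, the sketch below treats a hypothetical rating as ordinal (mode and median only) and a hypothetical weight as ratio (coefficient of variation allowed). The sample has an odd number of ratings, so the median falls on a single category.

```python
import pandas as pd

# Ordinal scale: ordered categories with unknown spacing (hypothetical ratings)
levels = ["low", "medium", "high"]
ratings = pd.Series(
    pd.Categorical(["low", "high", "medium", "medium", "high", "high", "low"],
                   categories=levels, ordered=True)
)
print("mode:", ratings.mode()[0])

# Median via the integer rank codes, mapped back to a label
median_code = int(ratings.cat.codes.median())
print("median:", levels[median_code])

# Ratio scale: true zero, so the coefficient of variation is meaningful
weights_kg = pd.Series([50.0, 60.0, 70.0])
cv = weights_kg.std() / weights_kg.mean()
print("coefficient of variation:", round(cv, 3))
```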
2.2 Comparing EDA with Classical and Bayesian Analysis:

1.Classical Analysis:

Definition: Classical statistical analysis, also known as frequentist statistics, relies on fixed
parameters and emphasizes hypothesis testing.
Focus:
• Hypothesis Testing: Classical analysis often starts with predefined hypotheses and
aims to reject or fail to reject them based on data.
• Parameter Estimation: It involves estimating fixed model parameters (e.g., mean,
variance) with point estimates.
Data Exploration:
• Limited Emphasis: Classical analysis focuses more on hypothesis testing and less on
data exploration.
• EDA is typically not the primary focus but may be used alongside classical methods
for initial data assessment.
Assumptions:
• Treats parameters as fixed but unknown quantities.
• Often relies on large sample sizes for accurate results.
• Relies on p-values for significance testing.
Applicability:
• Suitable for well-defined research questions with clear hypotheses.
• Less flexible when dealing with complex or dynamic data.
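
A classical test can be sketched with a one-sample t statistic computed by hand. The sample and the null value of 50 are hypothetical, and the critical value 2.262 is the standard two-sided 5% value for 9 degrees of freedom from a t-table.

```python
import math
import statistics

# Hypothetical sample; H0: the population mean equals 50
sample = [52, 48, 55, 51, 53, 49, 54, 50, 56, 52]
mu0 = 50

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
t_stat = (mean - mu0) / (sd / math.sqrt(n))

# Two-sided critical value for alpha = 0.05 with n - 1 = 9 degrees of freedom
t_crit = 2.262
print("t =", round(t_stat, 3))
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```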
2.Bayesian Analysis:

Definition: Bayesian analysis is a probabilistic approach that models uncertainty through
probability distributions.
Focus:
• Probabilistic Inference: Bayesian analysis focuses on estimating probability
distributions over parameters.
• Updating Beliefs: It incorporates prior knowledge and continuously updates beliefs as
more data becomes available.
Data Exploration:
• Bayesian analysis naturally incorporates data exploration within its framework.
• EDA principles can align well with Bayesian techniques, especially in the initial stages
of analysis.
Assumptions:
• Employs prior distributions to represent prior beliefs about parameters.
• Updates these beliefs based on observed data.
Applicability:
• Well-suited for scenarios with limited data or where prior information is available.
• Particularly valuable when dealing with uncertain or evolving data.
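
The idea of updating beliefs can be illustrated with a Beta-Binomial model, one of the simplest conjugate examples. The prior Beta(2, 2), the batch counts, and the function name `update` are assumptions chosen for illustration.

```python
# Beta prior on a conversion rate: Beta(2, 2) is a weak belief centred on 0.5
alpha, beta = 2.0, 2.0


def update(alpha, beta, successes, failures):
    """Conjugate Beta-Binomial update: add the observed counts to the prior."""
    return alpha + successes, beta + failures


# Hypothetical data arriving in two batches as (successes, failures)
for successes, failures in [(3, 7), (12, 28)]:
    alpha, beta = update(alpha, beta, successes, failures)
    posterior_mean = alpha / (alpha + beta)
    print(f"Beta({alpha:.0f}, {beta:.0f}) -> posterior mean {posterior_mean:.3f}")
```

Each batch shifts the posterior, and the posterior after one batch serves as the prior for the next.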

3.EDA (Exploratory Data Analysis):

Definition: EDA is an approach to data analysis that focuses on visually and numerically
exploring data to gain insights.
Focus:
• Data Exploration: EDA's primary focus is to understand the data's structure, uncover
patterns, and generate hypotheses.
• Visualizations: It heavily relies on data visualizations, such as histograms, scatter
plots, and box plots.
Data Exploration:
• Emphasizes data exploration, data cleaning, and initial insights.
• Typically used in the early stages of analysis to guide subsequent steps.
Assumptions:
• Few assumptions: EDA doesn't rely on specific assumptions about the data distribution.
• Flexible and adaptable to various data types.
Applicability:
• Valuable for understanding data, identifying outliers, and formulating hypotheses.
• Serves as a foundation for choosing appropriate statistical methods.
2.3 Software tools for EDA:

1.Polymer Search: Polymer Search is a tool that uses AI to generate insights from data and
create interactive databases that allow for easy filtering and data exploration. The main
features include an interactive spreadsheet, an interactive pivot table, interactive
graphs/charts, the ‘auto-explainer’ feature, which instantly generates summaries and rankings
and finds anomalies within the data, and ‘Smart Start’, where the AI suggests insights about
your data. This tool is especially powerful for marketers, salespeople and business
intelligence people looking to perform analysis and present their data.

2.Rattle (R): R can be difficult to learn, and Rattle is designed to lower that barrier. It is a
graphical interface for R which allows in-depth data mining and requires no coding and no
command-line prompts, just clicks. Rattle allows you to easily explore your data and create
quick visualizations. You can also use it to clean and transform your data and build models.
The tool is fast and suits users who need to work with large datasets but don't know how to
code.

3.Pandas Profiling: Pandas Profiling (distributed as ydata-profiling in newer releases) is an
open source Python module which allows both non-technical users and data scientists to
quickly perform EDA and present the information in a web-based interactive report. Using
Pandas Profiling, you can generate interactive graphs/charts and visualize the distribution of
each variable in the dataset using just a few lines of code. Data scientists often use Pandas
Profiling to save hours of time in the EDA process.

4.DataPrep: DataPrep is a Python library that can save hours of cleansing, preparing
data and performing EDA. It works similarly to Pandas Profiling: within a couple of lines of
code, you can plot a series of interactive graphs and distribution charts to get an overall
sense of the data. You can also find and analyze missing values and outliers within seconds
using a few lines of code. This makes the user aware of data quality in each column
and helps find possible reasons for missing values or outliers. Overall, DataPrep is a very
powerful tool for cleansing data, analyzing missing values, checking correlations and
seeing the distribution of each variable.

5.Trifacta: Trifacta allows you to prepare and explore any dataset on a cloud data warehouse
or cloud data lakehouse through an interactive user interface. The tool uses built-in machine
learning algorithms to guide you through the exploration of your data. One of its features is
data profiling: determining how accurate, complete and valid a dataset is. Trifacta does this
automatically. Another feature is its no-code ETL (extract, transform, load) or ELT. You can
transform your dataset simply by providing an example format, and the machine learning
algorithm will fill in the rest.

6.Knime: KNIME is a tool that allows you to dive deep into data processing without learning
how to code. KNIME is often used by data scientists, especially from the chem/biotech
industry, for data processing and building production-grade applications. It has plenty of
features that come in handy for exploratory data analysis, including data cleansing and
manipulation, merging datasets, creating interactive visualizations and building
models.

7.Excel: For many datasets, Excel is all that’s needed for data analysis. The advantages of
Excel are that it’s easy to cleanse and manipulate the dataset using basic Excel functions,
and it’s very convenient for quickly creating graphs/charts. Although Excel is a paid program,
Google Sheets is a free alternative that covers most of the same functionality.

8.Rapidminer: Rapidminer is a no-code solution that lets non-technical people do advanced
data mining. Tasks such as using text mining to build predictive models can take several
months to learn in R or Python, but with Rapidminer you can learn to do them in days or
weeks. Rapidminer also allows more advanced users to pull in their R or Python scripts
seamlessly. Although Rapidminer handles big data relatively well and can be used for
machine learning, note that it is generally slower and less flexible than R or Python.

9.IBM Cognos: IBM Cognos Analytics is a business intelligence tool designed for business
professionals who aren’t data-savvy. Using its built-in AI tools, users can explore and
generate insights about their data in a matter of clicks. The tool also automates data
preparation for cleansing and aggregating data.

10.R Language: R is a statistical programming language commonly used for data analysis and
visualization. It has a rich ecosystem of packages for EDA.

# Load data
data <- read.csv('data.csv')

# Summary statistics
summary(data)

# Create a histogram
hist(data$variable, main="Histogram", xlab="Values")

11.Python: See Packages in topic 1.4

3.Visual aids for EDA:

1. Line Chart:
• Definition: A line chart displays data points connected by straight lines, representing
trends and changes over time or continuous data.
• Usage: It's used to show patterns, trends, and relationships between data points.
• Variable: Typically used for continuous data on the x-axis, such as time or numerical
values.
• Inference: Easily identifies trends and fluctuations in data over time.
• Code:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Chart')
plt.show()

2. Bar Chart:
• Definition: A bar chart uses rectangular bars to represent categorical data. The length
of each bar is proportional to the value it represents.
• Usage: Used to compare data among different categories or discrete items.
• Variable: Discrete categories or items on the x-axis.
• Inference: Allows quick comparisons between categories and identifies the highest or
lowest values.
• Code:
import matplotlib.pyplot as plt
categories = ['Category A', 'Category B', 'Category C']
values = [25, 40, 30]
plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')
plt.show()

3. Scatter Plot:
• Definition: A scatter plot displays individual data points as dots on a two-dimensional
plane. It shows the relationship between two continuous variables.
• Usage: Used to visualize the correlation or distribution of data points.
• Variable: Two continuous variables, one on each axis.
• Inference: Reveals patterns, clusters, or outliers in the relationship between the
variables.
• Code:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()

4. Area Plot:
• Definition: An area plot represents data with shaded areas, depicting cumulative
quantities over time or categories.
• Usage: Shows the composition or cumulative change in data.
• Variable: Typically used for continuous data on the x-axis, such as time.
• Inference: Highlights the total contribution of each category to the whole.
• Code:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
plt.fill_between(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Area Plot')
plt.show()

5. Pie Chart:
• Definition: A pie chart represents data as sectors of a circle, showing the proportion
of each category in relation to the whole.
• Usage: Used to display the percentage distribution of different parts of a whole.
• Variable: Discrete categories.
• Inference: Easily compares the proportions of different categories within a dataset.
• Code:
import matplotlib.pyplot as plt
labels = ['A', 'B', 'C', 'D']
sizes = [15, 30, 45, 10]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()

6. Table Chart:
• Definition: A table chart presents data in a tabular format with rows and columns.
• Usage: Used to display detailed data, allowing for precise examination.
• Variable: Both discrete and continuous data can be shown in a table.
• Inference: Enables viewers to read specific values and make detailed comparisons.
• Code:
import pandas as pd
# Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)

7. Polar Chart:
• Definition: A polar chart displays data on a circular grid, using radial lines to
represent different values.
• Usage: Used to show data patterns that involve angles or cyclical trends.
• Variable: Continuous data on the radial axis.
• Inference: Highlights cyclical patterns and comparisons across different angles.
• Code:
import matplotlib.pyplot as plt
import numpy as np
theta = np.linspace(0, 2*np.pi, 6)
r = [1, 2, 3, 4, 5, 6]
plt.polar(theta, r)
plt.title('Polar Chart')
plt.show()

8. Histogram:
• Definition: A histogram groups continuous data into intervals (bins) and represents
the frequency of data points in each bin.
• Usage: Shows the distribution and frequency of data within specific ranges.
• Variable: Continuous data on the x-axis, divided into intervals.
• Inference: Reveals the underlying distribution of the data, including peaks, valleys,
and skewness.
• Code:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=20)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

9. Lollipop Chart:
• Definition: A lollipop chart combines a scatter plot with vertical lines (stems),
emphasizing the position of each data point on a continuous value scale.
• Usage: Used to compare individual data points within a dataset.
• Variable: Categories or ordered values on the x-axis, with a numeric value for each.
• Inference: Highlights specific data points and their values compared to others.
• Code:
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values = [5, 10, 15, 8]
plt.stem(categories, values, basefmt=' ')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Lollipop Chart')
plt.show()

4.Data Transformation Techniques:


Data transformation is a critical step in data preprocessing and analysis. It involves modifying
the structure or content of data to make it more suitable for analysis, reporting, or
visualization. Some common data transformation techniques include merging databases,
reshaping, and other transformation methods.

4.1 Merging Databases:

Description: Merging databases involves combining data from multiple sources or tables to
create a unified dataset. It is often used when dealing with related data stored in separate
databases or files.
# Example Code (Python - Pandas):
import pandas as pd

# Create two sample dataframes
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 28]})

# inner join: merge dataframes using a common key (ID)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

# left join
left_join_df = pd.merge(df1, df2, on='ID', how='left')
print(left_join_df)

# right join
right_join_df = pd.merge(df1, df2, on='ID', how='right')
print(right_join_df)

# outer join
outer_join_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join_df)

# Set ID as the index so the dataframes can be merged on it
df1_indexed = df1.set_index('ID')
df2_indexed = df2.set_index('ID')

# Merge on index
index_merge_df = pd.merge(df1_indexed, df2_indexed, left_index=True, right_index=True,
how='inner')
print(index_merge_df)

Output:
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30

   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0

   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
2   4      NaN   28

   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
3   4      NaN  28.0

       Name  Age
ID
2       Bob   25
3   Charlie   30
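
When merging during EDA, it often helps to see which rows matched. One possible way, sketched here with the same two sample dataframes, is the `indicator` and `validate` parameters of `pd.merge`; the variable name `audit` is an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 28]})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
# validate='one_to_one' raises an error if ID is duplicated in either frame
audit = pd.merge(df1, df2, on='ID', how='outer',
                 indicator=True, validate='one_to_one')
print(audit)
print(audit['_merge'].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner join would silently drop.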

4.2 Reshaping and Pivoting:

Description: Reshaping involves changing the structure of data, often from a wide format to
a long format or vice versa. It is useful for different types of analyses and visualization.

# Example Code (Python - Pandas):
import pandas as pd

# Create a sample dataframe in wide format
data = {'Date': ['2022-01-01', '2022-01-02'],
        'Temperature': [32, 35],
        'Humidity': [50, 45]}
df = pd.DataFrame(data)

# Wide to long: stack the measurement columns into rows
stacked_df = df.set_index('Date').stack().reset_index(name='Value')
print(stacked_df)

# Long to wide: unstack the measurement level back into columns
unstacked_df = stacked_df.set_index(['Date', 'level_1']).unstack('level_1')
unstacked_df.columns = unstacked_df.columns.droplevel(0)
print(unstacked_df)

# Same reshape with pivot_table (aggregates with the mean if a pair repeats)
pivot_df = stacked_df.pivot_table(index='Date', columns='level_1', values='Value')
print(pivot_df)

Output:
         Date      level_1  Value
0  2022-01-01  Temperature     32
1  2022-01-01     Humidity     50
2  2022-01-02  Temperature     35
3  2022-01-02     Humidity     45

level_1     Humidity  Temperature
Date
2022-01-01        50           32
2022-01-02        45           35

level_1     Humidity  Temperature
Date
2022-01-01        50           32
2022-01-02        45           35
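
An alternative way to go from wide to long is `DataFrame.melt`, which lets you name the resulting columns directly. This is a sketch on the same sample data; the column name `Measurement` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2022-01-01', '2022-01-02'],
                   'Temperature': [32, 35],
                   'Humidity': [50, 45]})

# Wide to long: one row per (Date, Measurement) pair
long_df = df.melt(id_vars='Date', var_name='Measurement', value_name='Value')
print(long_df)

# Long back to wide; pivot() requires each (Date, Measurement) pair to be unique
wide_df = long_df.pivot(index='Date', columns='Measurement', values='Value')
print(wide_df)
```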

4.3 Transformation Techniques:

Description: Transformation techniques involve modifying data values or creating new
features to make data more suitable for analysis. Common transformations include scaling,
encoding categorical variables, and creating derived features, as well as removing
duplicates, replacing values, and handling missing data, which the example below
demonstrates.

# Example Code (Python - Pandas):
import pandas as pd
import numpy as np

data = {'A': [1, 2, 2, 3, 4, 4],
        'B': [4, 5, np.nan, np.nan, 8, 8]}
df = pd.DataFrame(data)

# Remove duplicate rows based on all columns
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

# Replace a specific value in a column (this changes df for the steps below)
df['A'] = df['A'].replace(2, 999)
print(df)

# Remove rows with NaN values
df_no_na_rows = df.dropna()
print(df_no_na_rows)

# Remove columns with NaN values
df_no_na_columns = df.dropna(axis=1)
print(df_no_na_columns)

# Fill NaN values with a specific value (e.g., 0)
df_filled = df.fillna(0)
print(df_filled)

# Interpolate missing values using linear interpolation
df_interpolated = df.interpolate()
print(df_interpolated)

Output:
   A    B
0  1  4.0
1  2  5.0
2  2  NaN
3  3  NaN
4  4  8.0

     A    B
0    1  4.0
1  999  5.0
2  999  NaN
3    3  NaN
4    4  8.0
5    4  8.0

     A    B
0    1  4.0
1  999  5.0
4    4  8.0
5    4  8.0

     A
0    1
1  999
2  999
3    3
4    4
5    4

     A    B
0    1  4.0
1  999  5.0
2  999  0.0
3    3  0.0
4    4  8.0
5    4  8.0

     A    B
0    1  4.0
1  999  5.0
2  999  6.0
3    3  7.0
4    4  8.0
5    4  8.0
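
The description of this section also mentions scaling, encoding categorical variables, and creating derived features. A minimal sketch of those three, assuming hypothetical customer data and column names invented for illustration:

```python
import pandas as pd

# Hypothetical customer data
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 72.0, 90.0],
    "city": ["Chicago", "Houston", "Chicago", "New York"],
})

# Scaling: min-max to [0, 1] and z-score standardization
h = df["height_cm"]
df["height_minmax"] = (h - h.min()) / (h.max() - h.min())
df["height_z"] = (h - h.mean()) / h.std(ddof=0)

# Derived feature: body mass index from two existing columns
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(encoded)
```

Min-max scaling keeps the original shape of the distribution within a fixed range, while z-scores express each value in standard deviations from the mean; which one fits depends on the downstream method.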

Data transformation is a crucial part of the data preprocessing pipeline, enabling data
analysts and data scientists to prepare data for analysis, modeling, and visualization. These
techniques, along with others like feature engineering and missing data handling, play a
significant role in making data more informative and actionable.
