OOM Unit 2


UNIT 2

Data Transformation
Data Transformation
• Data transformation is a set of techniques used to convert data from one
format or structure to another format or structure.

• Data deduplication
• Data cleansing
• Data validation
• Format revisioning
• Data derivation
• Data aggregation
• Data integration
• Data filtering
• Data joining
Python libraries
• We will be using the following
• Pandas
• Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data.
• NumPy
• NumPy is a Python library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices.
• Seaborn
• Python Seaborn library is a widely popular data visualization library that is commonly used
for data science and machine learning tasks
• Matplotlib
• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python
Pandas DataFrames
• A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.
• Create a simple Pandas DataFrame:
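A minimal sketch (the values are illustrative, not from the original slide):

import pandas as pd

data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)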
Named Indexes

• With the index argument, you can name your own indexes.
• Add a list of names to give each row a name:
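A sketch continuing the example above, with hypothetical row names:

df = pd.DataFrame(data, index=['day1', 'day2', 'day3'])
print(df)
print(df.loc['day2'])  # refer to a row by its named index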
Load Files Into a DataFrame

• If your data sets are stored in a file, Pandas can load them into a
DataFrame.
• Load a comma separated file (CSV file) into a DataFrame:
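A sketch, assuming a file named data.csv exists in the working directory:

df = pd.read_csv('data.csv')
print(df.to_string())  # to_string() prints the entire DataFrame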
append, concat, merge, or join.
Many beginner developers get confused when working with
pandas dataframes, especially regarding when to use append,
concat, merge, or join.

• merge() for combining data on common columns or indices


• join() is used to combine two DataFrames on the index but
not on columns
• concat() for combining DataFrames across rows or columns
concat()
• concat() combines data across rows or columns.

• With concatenation, your datasets are just stitched together along an axis — either the row axis or the column axis.

• axis represents the axis that you'll concatenate along. The default value is 0, which concatenates along the index, or row axis. Alternatively, a value of 1 will concatenate horizontally, along the columns. You can also use the string values "index" or "columns".

• ignore_index takes a Boolean True or False value. It defaults to False. If True, then the new combined dataset won't preserve the original index values from the axis specified in the axis parameter. This lets you have entirely new index values.
database-style dataframes – Use Case 1
• In the below dataset, the first column contains information about
student identifiers and the second column contains their respective
scores in any subject.
• The structure of the dataframes is the same in both cases. In this
case, we would need to concatenate them.

dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe
• Combining the dataframes along axis=0 (the default) stacks them together in the same direction, one below the other.
• To combine them side by side, specify axis=1.

pd.concat([dataFrame1, dataFrame2], axis=1)


USE CASE 2
Suppose you are teaching two courses: Software Engineering and Introduction to Machine Learning. You will get two dataframes from each subject:
• Two for the Software Engineering course
• Another two for the Introduction to Machine Learning course
Important details you need to note in the dataframes:
• There are some students who are not taking the Software Engineering exam.
• There are some students who are not taking the Machine Learning exam.
• There are students who appeared in both courses.
Questions to be answered
• How many students appeared for the exams in total?
• How many students only appeared for the Software Engineering course?
• How many students only appeared for the Machine Learning course?
Concatenating along an axis: the pd.concat() method

The code for combining the dataframes is sketched below.
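A minimal sketch of this step, using hypothetical dataframe names (two score dataframes per course; the data is illustrative):

import pandas as pd

se_batch1 = pd.DataFrame({'student_id': ['S1', 'S2', 'S3'], 'score': [78, 85, 90]})
se_batch2 = pd.DataFrame({'student_id': ['S4', 'S5'], 'score': [66, 72]})
ml_batch1 = pd.DataFrame({'student_id': ['S2', 'S3', 'S6'], 'score': [88, 91, 70]})
ml_batch2 = pd.DataFrame({'student_id': ['S7'], 'score': [84]})

# Concatenate the two dataframes of each course along the row axis
software_eng = pd.concat([se_batch1, se_batch2], ignore_index=True)
machine_learning = pd.concat([ml_batch1, ml_batch2], ignore_index=True)
print(software_eng)
print(machine_learning)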
Merge
• Combining Data on Common Columns or Indices
• You can use merge() anytime you want functionality similar to a
database’s join operations
• how defines what kind of merge to make. It defaults to 'inner', but other possible options include 'outer', 'left', and 'right'.
• on tells merge() which columns or indices, also called key
columns or key indices, you want to join on.
Merge
• left_on and right_on specify a column or index that’s present only in
the left or right object that you’re merging. Both default to None.

• left_index and right_index both default to False, but if you want to use the index of the left or right object to be merged, then you can set the relevant argument to True.
Different Type of Joins
These are the following types of joins
• The inner join takes the intersection from two or more dataframes. It
corresponds to the INNER JOIN in Structured Query Language (SQL).
• The outer join takes the union from two or more dataframes. It
corresponds to the FULL OUTER JOIN in SQL.
• The left join uses the keys from the left-hand dataframe only. It
corresponds to the LEFT OUTER JOIN in SQL.
• The right join uses the keys from the right-hand dataframe only. It
corresponds to the RIGHT OUTER JOIN in SQL.
Using df.merge with an inner join : df.merge() method
List of students who appeared in both the courses
• Concatenate the individual dataframes from each of the subjects,
• then use the df.merge() method.
• An inner join returns a new dataframe containing only the rows whose keys exist in both dataframes.
21 students took both the courses.
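A minimal sketch of the inner merge, continuing with the hypothetical software_eng and machine_learning dataframes from the earlier sketch:

# Inner join keeps only the students present in both course dataframes
both_courses = software_eng.merge(machine_learning, on='student_id',
                                  how='inner', suffixes=('_se', '_ml'))
print(len(both_courses))  # number of students who appeared in both courses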
Using the pd.merge() method with a left join
A left outer join is a method of combining tables. The result includes unmatched rows only from the table that is specified before the LEFT OUTER JOIN clause.

This tells us how many students appeared only for the Software Engineering course. The total number is 5.
Using the pd.merge() method with a right join

Right join to get a list of all the students who appeared only in the Machine Learning course.
Using the pd.merge() method with an outer join

The outer join gives the total number of students who appeared for the examination. Sketches of the left, right, and outer variants follow below.
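A sketch of the left, right, and outer variants, again using the hypothetical software_eng and machine_learning dataframes:

# Left join: every Software Engineering student; ML score is NaN for SE-only students
left = software_eng.merge(machine_learning, on='student_id', how='left', suffixes=('_se', '_ml'))
se_only = left[left['score_ml'].isnull()]

# Right join: every Machine Learning student; SE score is NaN for ML-only students
right = software_eng.merge(machine_learning, on='student_id', how='right', suffixes=('_se', '_ml'))
ml_only = right[right['score_se'].isnull()]

# Outer join: the union of both, i.e. every student who appeared for any exam
total = software_eng.merge(machine_learning, on='student_id', how='outer', suffixes=('_se', '_ml'))
print(len(se_only), len(ml_only), len(total))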
Merging on index
• Sometimes the keys for merging dataframes are located in the dataframes' index. In such a situation, we can pass left_index=True or right_index=True to indicate that the index should be accepted as the merge key.

Merging on index is done in the following steps:
1. Consider the following two dataframes:
• Try merging using an inner join, which is the default type of merge.
• In this case, the default merge is the intersection of the keys.
• Since there is no 'cat' key in the second dataframe, it is not included in the final table.
• Next, let's try merging using an outer join; both variants are sketched below.
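A minimal sketch of both index merges, with hypothetical dataframes that include the 'cat' key mentioned above:

import pandas as pd

left = pd.DataFrame({'count_left': [1, 2, 3]}, index=['cat', 'dog', 'fish'])
right = pd.DataFrame({'count_right': [10, 20]}, index=['dog', 'fish'])

# Inner merge on the index: 'cat' is dropped because it is missing from `right`
print(left.merge(right, left_index=True, right_index=True))

# Outer merge on the index: 'cat' is kept, with NaN in the right-hand column
print(left.merge(right, left_index=True, right_index=True, how='outer'))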
join
• join() is used to combine two DataFrames on the index.
• lsuffix: suffix to use for the left frame's overlapping columns.
• rsuffix: suffix to use for the right frame's overlapping columns.
>>> df >>> other
key A key B
0 K0 A0 0 K0 B0
1 K1 A1 1 K1 B1
2 K2 A2 2 K2 B2
3 K3 A3
4 K4 A4
5 K5 A5
Write a Pandas program to combine the columns of two potentially
differently-indexed DataFrames into a single result DataFrame.
import pandas as pd
data1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])

data2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'],


'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])

print("Original DataFrames:")
print(data1)
print("--------------------")
print(data2)
print("\nMerged Data (Joining on index):")
result = data1.join(data2)
print(result)
append
• The pandas DataFrame.append() function is used to append rows of other data frames to the end of the given data frame, returning a new data frame object.
• Columns not in the original data frames are added as new columns and the new cells are populated with NaN values.
• Note: append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the recommended replacement.
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [5, 6, 7, 8]})

df2 = pd.DataFrame({"a": [1, 2, 3],
                    "b": [5, 6, 7],
                    "c": [1, 5, 4]})

# for appending df2 at the end of df1
df1 = df1.append(df2, ignore_index=True)  # on pandas >= 2.0 use: pd.concat([df1, df2], ignore_index=True)
df1
Write a Pandas program to join the two given dataframes along rows
and assign all data.
import pandas as pd

student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({
'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
'marks': [201, 200, 198, 219, 201]})

print("Original DataFrames:")
print(student_data1)
print("-------------------------------------")
print(student_data2)
print("\nJoin the said two dataframes along rows:")
result_data = pd.concat([student_data1, student_data2])
print(result_data)
Write a Pandas program to join the two given dataframes along columns
and assign all data.
import pandas as pd

student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({
'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
'marks': [201, 200, 198, 219, 201]})

print("Original DataFrames:")
print(student_data1)
print("-------------------------------------")
print(student_data2)
print("\nJoin the said two dataframes along columns:")
result_data = pd.concat([student_data1, student_data2], axis = 1)
print(result_data)
Write a Pandas program to append rows to an existing DataFrame and
display the combined data.
import pandas as pd
student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
'marks': [200, 210, 190, 222, 199]})

s6 = pd.Series(['S6', 'Scarlette Fisher', 205], index=['student_id', 'name', 'marks'])


print("Original DataFrames:")
print(student_data1)
print("\nNew Row(s)")
print(s6)

combined_data = student_data1.append(s6, ignore_index=True)  # removed in pandas >= 2.0; one alternative: pd.concat([student_data1, s6.to_frame().T], ignore_index=True)

print("\nCombined Data:")
print(combined_data)
Write a Pandas program to join the two dataframes using the common
column of both dataframes.
import pandas as pd
student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({
'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
'marks': [201, 200, 198, 219, 201]})

merged_data = pd.merge(student_data1, student_data2, on='student_id', how='inner')


print("Merged data (inner join):")
print(merged_data)
Write a Pandas program to join the two dataframes with matching
records from both sides where available
import pandas as pd
student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({
'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
'marks': [201, 200, 198, 219, 201]})

print("Original DataFrames:")
print(student_data1)
print(student_data2)
merged_data = pd.merge(student_data1, student_data2, on='student_id', how='outer')
print("Merged data (outer join):")
print(merged_data)
Write a Pandas program to create a new DataFrame based on existing
series, using specified argument and override the existing columns
names.
import pandas as pd
s1 = pd.Series([0, 1, 2, 3], name='col1')
s2 = pd.Series([0, 1, 2, 3])
s3 = pd.Series([0, 1, 4, 5], name='col3')
df = pd.concat([s1, s2, s3], axis=1,
keys=['column1', 'column2', 'column3'])
print(df)
Write a Pandas program to merge two given dataframes with different
columns.
import pandas as pd
data1 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'P': ['P0', 'P1', 'P2', 'P3'],
'Q': ['Q0', 'Q1', 'Q2', 'Q3']})
data2 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'R': ['R0', 'R1', 'R2', 'R3'],
'S': ['S0', 'S1', 'S2', 'S3']})
print("Original DataFrames:")
print(data1)
print("--------------------")
print(data2)
print("\nMerge two dataframes with different columns:")
result = pd.concat([data1,data2], axis=0, ignore_index=True)
print(result)
Data deduplication
• The dataframe may contain duplicate rows.
• Removing them is essential to enhance the quality of the dataset.
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3
Data deduplication
• The pandas dataframe comes with a duplicated() method that returns
a Boolean series stating which of the rows are duplicates:

frame3.duplicated()

The rows that say True are the ones that contain duplicated data.
Drop duplication
• We can drop these duplicates using the drop_duplicates() method:

frame4 = frame3.drop_duplicates()
frame4

Both the duplicated and drop_duplicates methods keep the first observed value during the duplicate-removal process. If we pass the keep='last' argument, the methods return the last one instead.
You can use the duplicated() function to find duplicate values in a
pandas DataFrame.

This function uses the following basic syntax:

df.duplicated(subset=None, keep='first')
Find Duplicate Rows Across All Columns
Use keep='last' to mark the earlier occurrences as duplicates and keep the last occurrence instead of the first.
Find Duplicate Rows Across Specific
Columns
The following code shows how to find duplicate rows across just the ‘team’
and ‘points’ columns of the DataFrame:
Find Duplicate Rows in One Column
The following code shows how to find duplicate rows in just the ‘team’ column of the
DataFrame:
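Illustrative calls, assuming a small DataFrame with 'team' and 'points' columns as in the examples above:

import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B'],
                   'points': [10, 10, 8, 8, 9]})

print(df[df.duplicated()])                           # duplicates across all columns
print(df[df.duplicated(keep='last')])                # mark earlier occurrences, keep the last
print(df[df.duplicated(subset=['team', 'points'])])  # duplicates across specific columns
print(df[df.duplicated(subset=['team'])])            # duplicates in one column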
Exercise 1
Scenario: You are working as a data analyst for an e-commerce
company that has a massive dataset containing customer information.
The dataset has accumulated duplicate entries due to various reasons
like multiple registrations, system errors, or data entry mistakes. Your
task is to use the Pandas library in Python to identify and remove
these duplicate entries efficiently
Exercise 2
You're analyzing a sales dataset from a retail store chain that
contains transactions from multiple branches. Due to system
errors and data integration issues, the dataset has accumulated
duplicate records of sales transactions. Your task is to use
Pandas to identify these duplicates based on specific columns
and retain only the unique transactions.
Replacing Values

• Use the replace() method to find and replace values inside a dataframe, as in the sketch below.
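A minimal sketch of replace(), with illustrative data:

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'N.Y.', 'Boston'],
                   'sales': [100, 250, 180]})

# Replace a single value
df['city'] = df['city'].replace('N.Y.', 'NY')

# Replace several values at once using a per-column mapping
df = df.replace({'city': {'NY': 'New York'}})
print(df)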
Exercise
• Scenario:
You're working with a dataset that contains information about product sales.
However, you've noticed that certain entries have inconsistent or incorrect values
in the 'Category' column. You need to standardize these categories by replacing
specific values with a unified category name using Pandas' replace() method.

data = {
'Product_ID': [1, 2, 3, 4, 5],
'Product_Name': ['Apple', 'Banana', 'Orange', 'Pear', 'Grapes'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Veggie', 'Fruit']
}
Handling Missing Data
Whenever there are missing values, a NaN value is used, which indicates
that there is no value specified for that particular index. There could be
several reasons why a value could be NaN:
• It can happen when data is retrieved from an external source and there are
some incomplete values in the dataset.
• It can also happen when we join two different datasets and some values are
not matched.
• Missing values due to data collection errors.
• When the shape of the data changes, new rows or columns may be added whose values are not yet determined.
• Reindexing of data can result in incomplete data.
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi',
'grapes', 'mango'], columns=['store1', 'store2', 'store3'])
dfx

The dataframe is showing sales of different fruits from different stores.

Add some missing values to our dataframe:
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.0  # use .loc instead of chained indexing such as dfx['store4']['apple']
dfx
The following characteristics of missing values in the preceding
dataframe:
• An entire row can contain NaN values.
• An entire column can contain NaN values.
• Some (but not necessarily all) values in both a row and a column can be
NaN.
NaN Values in Pandas Objects
isnull() function from the pandas library to identify NaN values:

dfx.isnull()
True values indicate the values that are NaN
NaN Values in Pandas Objects
notnull() function from the pandas library to identify not NaN values:

dfx.notnull()
True values indicate the values that are not NaN
NaN Values in Pandas Objects
sum() method to count the number of NaN values

dfx.isnull().sum()
True is 1 and False is 0 is the main logic for summing
NaN Values in Pandas Objects
To find the total number of missing values
dfx.isnull().sum().sum()

15
NaN Values in Pandas Objects
To count the number of reported values
dfx.count()
NaN Values in Pandas Objects
Check for missing values in a specific column
df['Column_Name'].isnull().sum()
dfx['store4'].isnull().sum()

5
Exercise -1
• Scenario: You've received a dataset containing information about students' performance,
but the dataset has missing values across multiple columns due to various reasons. Your
task is to analyze and handle these missing values using Pandas' methods.
• Questions:
1. How many missing values are there in the 'Grade' column of the dataset?
2. What is the total count of missing values in the entire dataset?
3. Can you identify and display rows where 'Absenteeism' values are missing?
4. How many rows have complete data (non-missing) in the 'Study Hours' column?
5. What is the sum of missing values across each column in the dataset?
6. How many missing values are there in each column of the dataset?
Dropping Missing Values
One of the ways to handle missing values is to simply remove them from
dataset.

The dropna() method just returns a copy of the dataframe by dropping the rows with NaN. The original dataframe is not changed.

dataframe.dropna(axis, how, thresh, subset, inplace)
Dropping Missing Values
• Dropping by rows
• Use the how='all' argument to drop only those rows whose values are entirely NaN:

dfx.dropna(how='all')
Dropping by columns
• Pass axis=1 to indicate a check for NaN by columns.

dfx.dropna(how='all', axis=1)
Drop rows where specific columns have missing values
df.dropna(subset=['Column_Name'],inplace=True)
import pandas as pd
data = {
'A': [1, 2, None, 4, 5],
'B': ['a', 'b', 'c', None, 'e'],
'C': [10.5, None, 30.2, 40.1, None]
}
df=pd.DataFrame(data)
print(df)
# Dropping rows where 'B' column has missing values
df_dropped_b = df.dropna(subset=['B'])
print("\nDataFrame after dropping rows with missing values in 'B' column:")
print(df_dropped_b)
thresh specifies the minimum number of non-NA values required to keep a row

df.dropna(thresh=4)

thresh with axis=1: drop any column that does not have at least the given number of non-NA values

df.dropna(thresh=4, axis=1)
Drop rows with any missing values
# Use inplace=True to modify the original DataFrame

df.dropna(inplace=True)
You are analyzing a dataset containing information about employees in
a company. However, the dataset is not entirely clean and has missing
values in various columns due to data entry errors and incomplete
records. Your task is to preprocess the data by handling missing values
using Pandas' dropna() method.
Questions:
1. How would you drop all rows with any missing values from the entire dataset using
Pandas?
2. If the 'Salary' column is critical for analysis, how can you drop only the rows where
'Salary' values are missing while preserving the rest of the dataset?
3. In some columns, missing values are acceptable up to a certain limit. How would
you drop rows with more than two missing values across any column?
4. For the 'Department' column, there are missing values that can't be imputed. How
can you drop rows where 'Department' is missing while keeping rows with other
missing values intact?
5. Considering the 'Address' and 'Phone' columns have a significant number of
missing values, how can you drop these columns entirely from the dataset?
6. If you want to drop rows where 'Years of Experience' are missing and 'Salary' is
also missing or zero, how would you achieve this?
Mathematical operations with NaN
The pandas and numpy libraries handle NaN values differently for
mathematical operations.

• When a NumPy function encounters NaN values, it returns NaN.

• Pandas, on the other hand, ignores the NaN values and moves
ahead with processing. When performing the sum operation, NaN
is treated as 0. If all the values are NaN, the result is also NaN.
Example

import numpy as np
import pandas as pd

ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)
ar1.mean(), ser1.mean()

Output: (nan, 200.0)
The total quantity of fruits sold by store4
and its mean value
ser2 = dfx.store4
ser2.sum()
ser2.mean()
Output of the preceding code
38.0
19.0

store4 has five NaN values. However, during the summing process, these values are treated
as 0 and the result is 38.0
Cumulative summing
ser2 = dfx.store4
ser2.cumsum()
Output of the preceding code
apple 20.0
banana NaN
kiwi NaN
grapes NaN
mango NaN
watermelon 38.0
oranges NaN
Name: store4, dtype: float64
Filling missing values
Use the fillna() method to replace NaN values with any particular value.
filledDf = dfx.fillna(0)
filledDf
dfx.mean() filledDf.mean()

There are slightly different values. Hence, filling with 0 might not be the optimal solution.
Backward and forward filling
• NaN values can be filled based on the last known values.
Forward-filling technique
‘ffill’ stands for ‘forward fill’ and will propagate last valid observation forward.

dfx.store4.fillna(method='ffill')
Forward-filling technique – Row Axis
• ffill() function to fill the missing values along the index axis
df.ffill(axis = 0)

Values in the first row are still NaN because there is no row above from which a non-NA value could be propagated.
Forward-filling technique – Column axis
• When ffill is applied across the column axis, then missing values are
filled by the value in previous column in the same row.
df.ffill(axis = 1)
Backward filling
bfill() is used to backward fill the missing values in the dataset.

dfx.store4.fillna(method='bfill')
Backward filling
bfill() is used to populate missing (NaN) values in the dataframe across rows.
When axis='rows', the value in a NaN cell is filled from the corresponding value in the next row. If the next row is also NaN, it won't be populated.

df.bfill(axis ='rows')
Backward filling
bfill() is used to populate missing (NaN) values in the dataframe across the columns.
When axis='columns', the current NaN cells will be filled from the value present in the next column in the same row. If the next column is also a NaN cell then it won't be filled.

df.bfill(axis='columns')
Interpolating missing values
• Linear interpolation is a method for calculating intermediate data
between known values

• interpolate() function both for the series and the dataframe. By default, it
performs a linear interpolation of our missing values
Example

ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()

• It is done by taking the first value before and after any sequence of the NaN values.

• In the preceding series, ser3, the first and the last values are 100 and 292 respectively. Hence, it calculates the step between values as (292 - 100) / (5 - 1) = 48.

• So, the next value after 100 is 100 + 48 = 148.
Exercise
Consider a DataFrame representing student information with columns for
"Name," "Age," "Grade," and "Attendance." However, due to various reasons,
there are missing values in the "Age," "Grade," and "Attendance" columns

1. Create the dataframe.
2. Print the original DataFrame.
3. Explore the DataFrame to identify the missing values.
4. Decide on an appropriate strategy for handling missing values in each column (e.g., mean imputation, forward fill, etc.).
5. Implement your chosen strategy to fill missing values in the DataFrame.
6. Print the DataFrame after filling missing values.
Replacing Missing Values

Replace missing values in the Quantity column with the mean, the Price column with the median, and the Bought column with the standard deviation. Fill the Forenoon column with the minimum value in that column and the Afternoon column with the maximum value in that column.
Replacing Missing Values

Mean: data=data.fillna(data.mean())

Median: data=data.fillna(data.median())

Standard Deviation: data=data.fillna(data.std())

Min: data=data.fillna(data.min())

Max: data=data.fillna(data.max())
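The one-liners above apply the same statistic to every column; to use a different statistic per column, as the task requires, fill each column separately. A sketch, assuming the columns are named as in the task description:

data['Quantity'] = data['Quantity'].fillna(data['Quantity'].mean())
data['Price'] = data['Price'].fillna(data['Price'].median())
data['Bought'] = data['Bought'].fillna(data['Bought'].std())
data['Forenoon'] = data['Forenoon'].fillna(data['Forenoon'].min())
data['Afternoon'] = data['Afternoon'].fillna(data['Afternoon'].max())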
Exercise
Consider a DataFrame representing sales data for a company.
The DataFrame contains columns for "Product", "Quantity Sold",
"Price", and "Date". However, due to some issues, there are
missing values in the "Quantity Sold" and "Price" columns.
1. Create the dataframe as given.
2. Print the original DataFrame.
3. Fill the missing values in the "Quantity Sold" column with the mean value of that column.
4. Fill the missing values in the "Price" column with the median value of that column.
5. Print the DataFrame after filling the missing values.
Renaming axis indexes

Consider the example from the Reshaping and pivoting section. Say
you want to transform the index terms to capital letters:
dframe1.index = dframe1.index.map(str.upper)
dframe1
Discretization and binning
• Continuous datasets need to be converted into discrete or interval
forms.
• Each interval is referred to as a bin

data on the heights of a group of students

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

dataset into intervals of 118 to 125, 126 to 135, 136 to 160, and finally 160 and higher.
cut() method is used to convert the dataset into intervals

import pandas as pd
bins = [118, 125, 135, 160, 200]
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141,132]
category = pd.cut(height, bins)
category

• A parenthesis indicates that the side is open.
• A square bracket means that it is closed or inclusive.
• Here 118 is not included, but anything greater than 118 is included, while 125 is included in the interval.
• Set a right=False argument to change the form of interval:

import pandas as pd
bins = [118, 126, 136, 161, 200]
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141,132]
category2 = pd.cut(height, bins, right=False)
category2

The output form of closedness has changed. Now, the results are in the form of left-closed, right-open intervals.
We can also indicate the bin names by passing a list of labels:

bin_names = ['Short Height', 'Average height', 'Good Height', 'Taller']

pd.cut(height, bins, labels=bin_names)

The labelled output is shown on the next slide.
• Assign labels to bins and count the number of values in each bin

import pandas as pd
bins = [118, 125, 135, 160, 200]
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bin_names = ['Short Height', 'Average height', 'Good Height', 'Taller']
category = pd.cut(height, bins, labels=bin_names)
category

Output is
[Short Height, Short Height, Short Height, Average height, Short
Height, ..., Average height, Taller, Good Height, Good Height, Average height]

Length: 12
Categories (4, object): [Short Height < Average height < Good
Height < Taller]
pd.value_counts(category)
Exercise
• Suppose you have a DataFrame with a column of ages and you want
to discretize these ages into different age groups. Assign suitable
labels for each bin
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 42, 31, 19, 50]}
df = pd.DataFrame(data)

# Define bins for different age groups
bins = [0, 18, 30, 40, 60]  # Age groups: 0-18, 19-30, 31-40, 41-60

# Define labels for the bins
labels = ['0-18', '19-30', '31-40', '41-60']

# Discretize the 'Age' column into bins using cut() function
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df)
Discretization and Binning
• Equal Width Discretization:
Divides the range of values into equally spaced intervals.
Example:
import pandas as pd

data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=3) # Divides data into 3 equal-width bins

• Equal Frequency Discretization (Quantile-Based):
Divides the data into intervals with approximately the same number of observations in each bin.
Example:
bins = pd.qcut(data['A'], q=3) # Divides data into 3 quantile-based bins
Equal Width Discretization
• Dataset points: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54

• Step 1: Sort the dataset in ascending order: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54.

• Step 2: Determine the width of each bin using the formula: w = (max-min) / N ( N is number of bin)

• w is calculated as (54 - 10) / 4 = 11.

• BIN 1 : [lower bound , upper bound] = [(min) , (min + w -1)] = [10, 20]
• BIN 2 : [lower bound , upper bound] = [(min + w) , (min + 2w -1)] = [21, 31]
• BIN 3 : [lower bound , upper bound] = [(min + 2w) , (min + 3w -1)] = [32, 42]
• BIN 4 : [lower bound , upper bound] = [(min + 3w) , (max)] = [43, 54]

• BIN 1 : [10, 15, 18, 20]
• BIN 2 : [31]
• BIN 3 : [34, 41]
• BIN 4 : [46, 51, 53, 54]
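A sketch reproducing the worked example with pd.cut (the edges are chosen so that the four intervals match the bin bounds computed above):

import pandas as pd

points = [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]
edges = [9, 20, 31, 42, 54]        # gives (9, 20], (20, 31], (31, 42], (42, 54]
binned = pd.cut(points, bins=edges)
print(pd.value_counts(binned, sort=False))  # 4, 1, 2, 4 values per bin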
Equal Frequency Discretization
• Dataset: 10,15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60

• Step 1: sort the data.

• Step 2: find the frequency. To calculate the frequency:
• total number of data points / number of bins.

• The total number of data points is 12, and the number of bins required is 3.
• Therefore, the frequency comes out to be 4.

• BIN 1: 10, 15, 18, 20
• BIN 2: 31, 34, 41, 46
• BIN 3: 51, 53, 54, 60
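The same grouping can be reproduced with pd.qcut, which cuts at sample quantiles; a sketch:

import pandas as pd

points = [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54, 60]
binned = pd.qcut(points, q=3)  # three bins with roughly 4 observations each
print(pd.value_counts(binned, sort=False))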
• Compute equal-length bins based on the minimum and maximum
values in the data.

import numpy as np
pd.cut(np.random.rand(40), 5, precision=2)

[(0.81, 0.99], (0.094, 0.27], (0.81, 0.99], (0.45, 0.63], (0.63,0.81], ..., (0.81, 0.99], (0.45, 0.63], (0.45, 0.63],
(0.81, 0.99],
(0.81, 0.99]] Length: 40

Categories (5, interval[float64]): [(0.094, 0.27] < (0.27, 0.45] <(0.45, 0.63] < (0.63, 0.81] < (0.81, 0.99]]
Quantiles are points taken at regular intervals from the cumulative
distribution function of a random variable. They are used to partition
a dataset into intervals with an equal number of data points or to
represent divisions of the probability distribution of a dataset.

• Median (2-Quantile): Divides the data into two equal halves. It's
the value below which 50% of the data falls.

• Quartiles (4-Quantiles): Divide the data into four equal parts.

• The qcut method forms the bins based on sample quantiles.

randomNumbers = np.random.rand(2000)
# cut into quartiles
category3 = pd.qcut(randomNumbers, 4)
category3

[(0.77, 0.999], (0.261, 0.52], (0.261, 0.52], (-0.000565, 0.261], (-0.000565, 0.261], ..., (0.77, 0.999], (0.77, 0.999], (0.261, 0.52], (-0.000565, 0.261], (0.261, 0.52]]
Length: 2000
Categories (4, interval[float64]): [(-0.000565, 0.261] < (0.261, 0.52] < (0.52, 0.77] < (0.77, 0.999]]
Outlier detection and filtering
•Outliers are data points that diverge from other observations
for several reasons.
•An outlier is an observation that lies an abnormal distance
from other values in a random sample from a population
•During the EDA phase, one of our common tasks is to detect
and filter these outliers.
•The main reason for this detection and filtering of outliers is
that the presence of such outliers can cause serious issues
in statistical analysis
Outlier detection and filtering

• Several methods can be employed to detect outliers.
• Filtering can involve handling them by removing, capping, or transforming them.
Outlier Detection Methods
• Statistical Methods:

• Z-Score: Identifying values far from the mean in terms of standard deviations.
• IQR (Interquartile Range): Using quartiles to identify values outside a specific range.

• Visualization Techniques:

• Boxplots: Visual representation of the data's distribution to spot outliers.


• Scatterplots: Visualizing relationships and identifying points distant from others.

• Machine Learning Methods:

• Clustering Algorithms: Detecting data points that don’t cluster well with others.
• Isolation Forest, Local Outlier Factor (LOF): Algorithms specifically designed for outlier detection.
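A minimal sketch of the IQR rule mentioned above, with illustrative data:

import pandas as pd

prices = pd.Series([120, 130, 128, 135, 140, 132, 129, 600])  # 600 is an obvious outlier

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the IQR fences
filtered = prices[(prices >= lower) & (prices <= upper)]
print(filtered)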
Filtering Outliers
• Removal:
• Deleting the identified outlier data points from the dataset.
• Capping or Flooring:
• Setting a threshold and capping or flooring outlier values to a specific limit.
• Transformations:
• Applying mathematical transformations (log, square root) to reduce the impact of outliers.
Example
import pandas as pd
import numpy as np

# Load housing price data
housing_data = pd.read_csv('housing_data.csv')

# Calculate Z-Score for 'Price' column
mean_price = np.mean(housing_data['Price'])
std_price = np.std(housing_data['Price'])
z_scores = (housing_data['Price'] - mean_price) / std_price

# Detect and filter outliers (considering a Z-Score threshold of 3)
outlier_threshold = 3
filtered_data = housing_data[abs(z_scores) < outlier_threshold]

# Explore filtered data
print(filtered_data.head())
Random sampling
• Random sampling is a method used in statistics and data analysis to
select a subset of individuals or data points from a larger population.
• The key characteristic of random sampling is that each member of
the population has an equal chance of being chosen and that the
selection process is entirely by chance.
Example of Random Sampling
• Suppose you have a population of 1,000 students, and you
want to conduct a survey on their favorite subjects. To perform
random sampling:
• Random Sampling: You could assign each student a number
and use a random number generator to select, let's say, 100
students without bias.

It aims to ensure that the sample reflects the characteristics of the larger population.
Permutation and Random Sampling

import numpy as np
import pandas as pd
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)
df
Using the numpy.random.permutation() function, we can randomly select or permute a series of rows in a dataframe.

import numpy as np
import pandas as pd
dat = np.arange(80).reshape(10, 8)
df = pd.DataFrame(dat)
df
sampler = np.random.permutation(10)
sampler

Output
array([1, 5, 3, 6, 2, 4, 9, 0, 7, 8])

df.take(sampler)

The take() function is used to take the original data from the dataframe using the index values of the array generated by the permutation function.
Random sampling without replacement
To compute random sampling without replacement, follow these steps:
1. To perform random sampling without replacement, we first create a
permutation array.
2. Next, we slice off the first n elements of the array where n is the desired size
of the subset you want to sample.
3. Then we use the df.take() method to obtain actual samples:

df.take(np.random.permutation(len(df))[:3])
Random sampling :Sample method
In pandas, the DataFrame.sample method is used to randomly sample
a specified number of rows or a fraction of rows from a DataFrame.

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Random sampling with replacement
Random sampling with replacement refers to the process of
selecting elements from a dataset where each element has an
equal probability of being chosen, and after each selection, the
element is placed back into the dataset, allowing it to be chosen
again
Random sampling with replacement
• To generate random sampling with replacement, follow the given steps.
We can generate a random sample with replacement using the numpy.random.randint() method and drawing random integers:

sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size=10)
sampler

Output: array([3, 3, 0, 4, 0, 0, 1, 2, 1, 4])
To draw the required samples:

draw = sack.take(sampler)
draw

Output: array([ 7, 7, 4, 5, 4, 4, 8, -2, 8, 5])
Random Sampling using sample() method
import pandas as pd

# Creating a sample DataFrame
data = {
    'Student_ID': range(1, 21),
    'Test_Score': [82, 85, 88, 90, 85, 95, 78, 85, 100, 92, 85, 87, 80, 98, 85, 82, 85, 87, 88, 91]
}
df = pd.DataFrame(data)

# Performing random sampling to select 5 students randomly
sampled_data = df.sample(n=5, random_state=42)
# Specifying random_state for reproducibility

print("Randomly Sampled Data:")
print(sampled_data)
import pandas as pd
# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)

# Example 1: Sample 2 random rows
sampled_rows = df.sample(n=2)
print("Sampled 2 random rows:")
print(sampled_rows)

# Example 2: Sample 30% of the rows
sampled_frac = df.sample(frac=0.3)
print("\nSampled 30% of the rows:")
print(sampled_frac)

# Example 3: Sample with replacement
sampled_with_replace = df.sample(n=5, replace=True)
print("\nSampled with replacement:")
print(sampled_with_replace)
Exercise
Suppose you have a dataset containing information about customers'
purchases at a store. The dataset (customer_data.csv) includes
columns: 'Customer_ID', 'Age', 'Gender', 'Purchase_Amount'. Your task
is to perform random sampling to select a subset of 20 customers from
this dataset for a survey.
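A possible sketch, assuming customer_data.csv is present with the columns listed above:

import pandas as pd

customers = pd.read_csv('customer_data.csv')

# Randomly select 20 customers; random_state makes the draw reproducible
survey_sample = customers.sample(n=20, random_state=1)
print(survey_sample[['Customer_ID', 'Age', 'Gender', 'Purchase_Amount']])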
Benefits of data transformation
• Data transformation promotes interoperability between several applications. The
main reason for creating a similar format and structure in the dataset is that it
becomes compatible with other systems.
• Comprehensibility for both humans and computers is improved when using
better-organized data compared to messier data.
• Data transformation ensures a higher degree of data quality and protects applications from several computational challenges such as null values, unexpected duplicates, and incorrect indexing, as well as incompatible structures or formats.
• Data transformation ensures higher performance and scalability for modern analytical databases and dataframes.
Reshaping and pivoting
•Often need to rearrange data in a dataframe in some
consistent manner. This can be done with hierarchical
indexing using two actions:

• Stacking: stack rotates from any particular column in the data to the rows.
• Unstacking: unstack rotates from the rows into the columns.
• Let's create a dataframe that records the rainfall, humidity, and wind
conditions of five different counties in Norway:
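A sketch of such a dataframe (the county names and values are illustrative):

import numpy as np
import pandas as pd

dframe1 = pd.DataFrame(np.round(np.random.rand(5, 3) * 100, 1),
                       index=['oslo', 'bergen', 'trondheim', 'stavanger', 'tromso'],
                       columns=['rainfall', 'humidity', 'wind'])
dframe1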
Using the stack() method on the preceding dframe1, we can pivot the
columns into rows to produce a series:

stacked = dframe1.stack()
stacked
• The preceding series, stored in the variable stacked, can be rearranged into a dataframe using the unstack() method:
stacked.unstack()

• This should revert the series into the original dataframe.

• Note that there is a chance that unstacking will create missing data if all the values are not present in each of the sub-groups.
Let's unstack the concatenated frame.

Since in series1 there are no fours, fives, and sixes, their values are stored as NaN during the unstacking process. Similarly, there are no ones, twos, and zeros in series2, so the corresponding values are stored as NaN.
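A sketch consistent with the description above, using hypothetical series1 and series2 whose index labels only partly overlap:

import pandas as pd

series1 = pd.Series(['a', 'b', 'c', 'd'], index=[0, 1, 2, 3])
series2 = pd.Series(['e', 'f', 'g', 'h'], index=[3, 4, 5, 6])

combined = pd.concat([series1, series2], keys=['series1', 'series2'])

# Index labels missing from one of the series become NaN after unstacking
print(combined.unstack())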
