OOM Unit 2
Data Transformation
• Data transformation is a set of techniques used to convert data from one
format or structure to another.
• Data deduplication
• Data cleansing
• Data validation
• Format revisioning
• Data derivation
• Data aggregation
• Data integration
• Data filtering
• Data joining
Python libraries
• We will be using the following libraries:
• Pandas
• Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data.
• NumPy
• NumPy is a Python library used for working with arrays. It also has functions for working in
the domains of linear algebra, Fourier transforms, and matrices.
• Seaborn
• Seaborn is a widely popular Python data visualization library, commonly used
for data science and machine learning tasks.
• Matplotlib
• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
Pandas DataFrames
• A Pandas DataFrame is a 2-dimensional data structure, like a
2-dimensional array or a table with rows and columns.
• Create a simple Pandas DataFrame:
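A minimal sketch (the data values are illustrative, not from the original slide):
import pandas as pd
# two columns, three rows; the index defaults to 0, 1, 2
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
})
print(df)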
Named Indexes
• With the index argument, you can name your own indexes.
• Add a list of names to give each row a name:
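A minimal sketch, reusing the illustrative values above (the row names are assumptions):
import pandas as pd
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}, index=["day1", "day2", "day3"])  # each row gets a name instead of 0, 1, 2
print(df.loc["day2"])  # a named row can then be selected by its label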
Load Files Into a DataFrame
• If your data sets are stored in a file, Pandas can load them into a
DataFrame.
• Load a comma separated file (CSV file) into a DataFrame:
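A minimal sketch; 'data.csv' is a placeholder path, not a file from the slide:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())  # to_string() prints the whole DataFrame, not just the edges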
append, concat, merge, or join.
Many beginner developers get confused when working with
pandas dataframes, especially regarding when to use append,
concat, merge, or join.
• axis represents the axis that you’ll concatenate along. The default value
is 0, which concatenates along the index, or row axis. Alternatively, a
value of 1 will concatenate horizontally, along the columns. You can also use the
string values "index" or "columns".
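A small demonstration of the axis parameter (the frames are illustrative, not from the slide):
import pandas as pd
left = pd.DataFrame({'A': [1, 2]})
right = pd.DataFrame({'B': [3, 4]})
# axis=0 (default): stack the frames on top of each other, row-wise
print(pd.concat([left, right], axis=0))
# axis=1: place the frames side by side, aligned on the index
print(pd.concat([left, right], axis=1))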
print("Original DataFrames:")
print(data1)
print("--------------------")
print(data2)
print("\nMerged Data (Joining on index):")
result = data1.join(data2)
print(result)
append
• The Pandas dataframe.append() function was used to append rows of other
data frames to the end of the given data frame, returning a new data
frame object. (append() was deprecated in pandas 1.4 and removed in
pandas 2.0; pd.concat() is the replacement.)
• Columns not in the original data frames are added as new columns, and
the new cells are populated with NaN values.
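A short sketch of this NaN-filling behaviour, using pd.concat() since append() is gone (the frames are illustrative):
import pandas as pd
df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3], 'y': [4]})
# 'y' does not exist in df_a, so its cells for df_a's rows become NaN
print(pd.concat([df_a, df_b], ignore_index=True))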
import pandas as pd
student_data1 = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
    'marks': [200, 210, 190, 222, 199]})
student_data2 = pd.DataFrame({
    'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
    'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
    'marks': [201, 200, 198, 219, 201]})
print("Original DataFrames:")
print(student_data1)
print("-------------------------------------")
print(student_data2)
print("\nJoin the said two dataframes along rows:")
result_data = pd.concat([student_data1, student_data2])
print(result_data)
Write a Pandas program to join the two given dataframes along columns
and assign all data.
import pandas as pd
student_data1 = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
    'marks': [200, 210, 190, 222, 199]})
student_data2 = pd.DataFrame({
    'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
    'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
    'marks': [201, 200, 198, 219, 201]})
print("Original DataFrames:")
print(student_data1)
print("-------------------------------------")
print(student_data2)
print("\nJoin the said two dataframes along columns:")
result_data = pd.concat([student_data1, student_data2], axis=1)
print(result_data)
Write a Pandas program to append rows to an existing DataFrame and
display the combined data.
import pandas as pd
student_data1 = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
    'marks': [200, 210, 190, 222, 199]})
# The rows to append were not shown on this slide; a single new student is
# assumed here. pd.concat() replaces the removed DataFrame.append().
new_row = pd.DataFrame({'student_id': ['S6'], 'name': ['Scarlette Fisher'], 'marks': [205]})
combined_data = pd.concat([student_data1, new_row], ignore_index=True)
print("\nCombined Data:")
print(combined_data)
Write a Pandas program to join the two dataframes using the common
column of both dataframes.
import pandas as pd
student_data1 = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],
    'marks': [200, 210, 190, 222, 199]})
student_data2 = pd.DataFrame({
    'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
    'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],
    'marks': [201, 200, 198, 219, 201]})
print("Original DataFrames:")
print(student_data1)
print(student_data2)
merged_data = pd.merge(student_data1, student_data2, on='student_id', how='outer')
print("Merged data (outer join):")
print(merged_data)
Write a Pandas program to create a new DataFrame based on existing
Series, using the keys argument to override the existing column names.
import pandas as pd
s1 = pd.Series([0, 1, 2, 3], name='col1')
s2 = pd.Series([0, 1, 2, 3])
s3 = pd.Series([0, 1, 4, 5], name='col3')
df = pd.concat([s1, s2, s3], axis=1,
               keys=['column1', 'column2', 'column3'])
print(df)
Write a Pandas program to merge two given dataframes with different
columns.
import pandas as pd
data1 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K1'],
                      'P': ['P0', 'P1', 'P2', 'P3'],
                      'Q': ['Q0', 'Q1', 'Q2', 'Q3']})
data2 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'R': ['R0', 'R1', 'R2', 'R3'],
                      'S': ['S0', 'S1', 'S2', 'S3']})
print("Original DataFrames:")
print(data1)
print("--------------------")
print(data2)
print("\nMerge two dataframes with different columns:")
result = pd.concat([data1, data2], axis=0, ignore_index=True)
print(result)
Data deduplication
• The dataframe may contain duplicate rows.
• Removing them is essential to enhance the quality of the dataset.
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3
Data deduplication
• The pandas dataframe comes with a duplicated() method that returns
a Boolean series stating which of the rows are duplicates:
frame3.duplicated()
The rows that say True are the ones that contain duplicated data.
Drop duplication
• We can drop these duplicates using the drop_duplicates() method:
frame4 = frame3.drop_duplicates()
frame4
Both the duplicated and drop_duplicates methods keep the first observed value during the
duplicate-removal process. If we pass the keep='last' argument, the methods keep the
last one instead. (Older pandas versions used take_last=True, which has since been removed.)
You can use the duplicated() function to find duplicate values in a
pandas DataFrame.
import pandas as pd
data = {
    'Product_ID': [1, 2, 3, 4, 5],
    'Product_Name': ['Apple', 'Banana', 'Orange', 'Pear', 'Grapes'],
    'Category': ['Fruit', 'Fruit', 'Fruit', 'Veggie', 'Fruit']
}
df = pd.DataFrame(data)
# No full row repeats here, so duplicated() is checked on a single column
print(df.duplicated(subset=['Category']))
Handling Missing Data
Whenever there are missing values, a NaN value is used, which indicates
that there is no value specified for that particular index. There could be
several reasons why a value could be NaN:
• It can happen when data is retrieved from an external source and there are
some incomplete values in the dataset.
• It can also happen when we join two different datasets and some values are
not matched.
• Missing values due to data collection errors.
• When the shape of the data changes, new rows or columns may appear
whose values are not yet determined.
• Reindexing of data can result in incomplete data.
import numpy as np
import pandas as pd
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx
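dfx as created above has no missing values, yet later slides refer to a store4 column and watermelon/oranges rows containing NaN. The step that introduced them is missing from the extracted slide; a hedged reconstruction (the two store4 values are assumptions chosen to match the cumulative-sum output shown later):
# Reindexing with extra rows and an extra column introduces NaN values,
# as described in the bullet list above
dfx = dfx.reindex(index=['apple', 'banana', 'kiwi', 'grapes', 'mango',
                         'watermelon', 'oranges'],
                  columns=['store1', 'store2', 'store3', 'store4'])
dfx.loc['apple', 'store4'] = 20.0       # assumed value
dfx.loc['watermelon', 'store4'] = 18.0  # assumed value
dfx
Note that the totals reported on the following slides (such as the 15 missing values) come from the slide's original, unrecoverable dfx and may differ slightly under this reconstruction.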
dfx.isnull()
True values indicate the values that are NaN
NaN Values in Pandas Objects
Use the notnull() function from the pandas library to identify non-NaN values:
dfx.notnull()
True values indicate the values that are not NaN
NaN Values in Pandas Objects
Use the sum() method on the Boolean result to count the number of NaN values per column:
dfx.isnull().sum()
Summing works because True counts as 1 and False counts as 0.
NaN Values in Pandas Objects
To find the total number of missing values:
dfx.isnull().sum().sum()
15
NaN Values in Pandas Objects
To count the number of reported values
dfx.count()
NaN Values in Pandas Objects
Check for missing values in a specific column:
df['Column_Name'].isnull().sum()
dfx['store4'].isnull().sum()
5
Exercise 1
• Scenario: You've received a dataset containing information about students' performance,
but the dataset has missing values across multiple columns due to various reasons. Your
task is to analyze and handle these missing values using Pandas' methods.
• Questions:
1. How many missing values are there in the 'Grade' column of the dataset?
2. What is the total count of missing values in the entire dataset?
3. Can you identify and display rows where 'Absenteeism' values are missing?
4. How many rows have complete data (non-missing) in the 'Study Hours' column?
5. What is the sum of missing values across each column in the dataset?
6. How many missing values are there in each column of the dataset?
Dropping Missing Values
One of the ways to handle missing values is simply to remove them from the
dataset. With how='all', only the rows in which all values are NaN are dropped:
dfx.dropna(how='all')
Dropping by columns
• Pass axis=1 to indicate a check for NaN by columns.
dfx.dropna(how='all', axis=1)
Drop rows where specific columns have missing values
df.dropna(subset=['Column_Name'], inplace=True)
import pandas as pd
data = {
    'A': [1, 2, None, 4, 5],
    'B': ['a', 'b', 'c', None, 'e'],
    'C': [10.5, None, 30.2, 40.1, None]
}
df = pd.DataFrame(data)
print(df)
# Dropping rows where 'B' column has missing values
df_dropped_b = df.dropna(subset=['B'])
print("\nDataFrame after dropping rows with missing values in 'B' column:")
print(df_dropped_b)
thresh specifies the minimum number of non-null values required to keep a row
df.dropna(thresh=4)
With axis=1, thresh drops any column that has fewer than the given number of non-null values
df.dropna(thresh=4, axis=1)
Drop rows with any missing values
# Use inplace=True to modify the original DataFrame
df.dropna(inplace=True)
You are analyzing a dataset containing information about employees in
a company. However, the dataset is not entirely clean and has missing
values in various columns due to data entry errors and incomplete
records. Your task is to preprocess the data by handling missing values
using Pandas' dropna() method.
Questions:
1. How would you drop all rows with any missing values from the entire dataset using
Pandas?
2. If the 'Salary' column is critical for analysis, how can you drop only the rows where
'Salary' values are missing while preserving the rest of the dataset?
3. In some columns, missing values are acceptable up to a certain limit. How would
you drop rows with more than two missing values across any column?
4. For the 'Department' column, there are missing values that can't be imputed. How
can you drop rows where 'Department' is missing while keeping rows with other
missing values intact?
5. Considering the 'Address' and 'Phone' columns have a significant number of
missing values, how can you drop these columns entirely from the dataset?
6. If you want to drop rows where 'Years of Experience' is missing and 'Salary' is
also missing or zero, how would you achieve this?
Mathematical operations with NaN
The pandas and numpy libraries handle NaN values differently in
mathematical operations.
• NumPy propagates NaN: an aggregation such as mean() over an array that
contains NaN returns NaN (unless nan-aware functions such as np.nanmean() are used).
• Pandas, on the other hand, ignores the NaN values and moves
ahead with processing. When performing the sum operation, NaN
is treated as 0. If all the values are NaN, the result is also NaN.
Example
import numpy as np
import pandas as pd
ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)
ar1.mean(), ser1.mean()
Output
(nan, 200.0)
The total quantity of fruits sold by store4
and its mean value
ser2 = dfx.store4
ser2.sum()
ser2.mean()
Output of the preceding code
38.0
19.0
store4 has five NaN values. However, during the summing process, these values are treated
as 0 and the result is 38.0
Cumulative summing
ser2 = dfx.store4
ser2.cumsum()
Output of the preceding code
apple 20.0
banana NaN
kiwi NaN
grapes NaN
mango NaN
watermelon 38.0
oranges NaN
Name: store4, dtype: float64
Filling missing values
Use the fillna() method to replace NaN values with a particular value:
filledDf = dfx.fillna(0)
filledDf
dfx.mean()
filledDf.mean()
The means differ slightly, because fillna(0) treats the missing entries as real zeros. Hence, filling with 0 might not be the optimal solution.
Backward and forward filling
• NaN values can be filled based on the last known values.
Forward-filling technique
'ffill' stands for 'forward fill' and will propagate the last valid observation forward.
dfx.store4.fillna(method='ffill')
(The method= argument is deprecated in recent pandas; dfx.store4.ffill() is the equivalent.)
Forward-filling technique – Row Axis
• Use the ffill() function to fill the missing values along the index axis.
df.ffill(axis=0)
Values in the first row are still NaN because there is no row above it from which a
non-NA value could be propagated.
Forward-filling technique – Column axis
• When ffill is applied across the column axis, missing values are
filled by the value in the previous column in the same row.
df.ffill(axis=1)
Backward filling
bfill() is used to backward fill the missing values in the dataset.
dfx.store4.fillna(method='bfill')
Backward filling
Use the bfill() function to populate missing NaN values in the dataframe across
rows.
When axis='rows', each NaN cell is filled from the corresponding value in the next row.
If the next row is also NaN, the cell is not populated.
df.bfill(axis='rows')
Backward filling
Use the bfill() function to populate missing NaN values in the dataframe across
the columns.
When axis='columns', each NaN cell is filled from the value in the next column of the
same row. If the next column is also NaN, the cell is not filled.
df.bfill(axis='columns')
Interpolating missing values
• Linear interpolation is a method for calculating intermediate data
between known values.
• The interpolate() function exists both for Series and DataFrames. By default, it
performs a linear interpolation of the missing values.
Example
• In the preceding series, ser3, the first and the last values are
100 and 292 respectively. With five points, the interpolation step is
(292 - 100) / (5 - 1) = 48, so the missing values are filled in as 148, 196, and 244.
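ser3 itself is not shown on the slide; a minimal reconstruction consistent with the description above:
import numpy as np
import pandas as pd
# hypothetical ser3: endpoints 100 and 292 with three missing values between
ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
print(ser3.interpolate())  # fills 148.0, 196.0, 244.0 in steps of 48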
1. Create the DataFrame.
2. Print the original DataFrame.
3. Explore the DataFrame to identify the missing values.
4. Decide on an appropriate strategy for handling missing values in each
column (e.g., mean imputation, forward fill, etc.).
5. Implement your chosen strategy to fill missing values in the DataFrame.
6. Print the DataFrame after filling missing values.
Replacing Missing Values
Replace missing values in the quantity column with the mean, the price column with the
median, and the Bought column with the standard deviation. Replace the Forenoon column with
the minimum value in that column, and the Afternoon column with the maximum value in that column.
Replacing Missing Values
Mean: data = data.fillna(data.mean())
Median: data = data.fillna(data.median())
Min: data = data.fillna(data.min())
Max: data = data.fillna(data.max())
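A sketch of the column-wise replacement described on the previous slide (the data values are assumptions; note that DataFrame-wide mean()/median() need numeric_only=True in recent pandas when non-numeric columns are present):
import numpy as np
import pandas as pd
data = pd.DataFrame({
    'quantity': [10, np.nan, 14, 8],
    'price': [2.5, 3.0, np.nan, 4.5],
    'Bought': [1.0, np.nan, 3.0, 5.0],
    'Forenoon': [np.nan, 20.0, 25.0, 30.0],
    'Afternoon': [40.0, 35.0, np.nan, 45.0],
})
data['quantity'] = data['quantity'].fillna(data['quantity'].mean())
data['price'] = data['price'].fillna(data['price'].median())
data['Bought'] = data['Bought'].fillna(data['Bought'].std())
data['Forenoon'] = data['Forenoon'].fillna(data['Forenoon'].min())
data['Afternoon'] = data['Afternoon'].fillna(data['Afternoon'].max())
print(data)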
Exercise
Consider a DataFrame representing sales data for a company.
The DataFrame contains columns for "Product", "Quantity Sold",
"Price", and "Date". However, due to some issues, there are
missing values in the "Quantity Sold" and "Price" columns.
1. Create the dataframe as given.
2. Print the original DataFrame.
3. Fill the missing values in the "Quantity Sold" column with the mean
value of that column.
4. Fill the missing values in the "Price" column with the median value
of that column.
5. Print the DataFrame after filling the missing values.
Renaming axis indexes
Consider the example from the Reshaping and pivoting section. Say
you want to transform the index terms to capital letters:
dframe1.index = dframe1.index.map(str.upper)
dframe1
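dframe1 comes from the reshaping slides and is not defined here; a minimal stand-in to make the example runnable:
import pandas as pd
# hypothetical dframe1
dframe1 = pd.DataFrame({'store1': [1, 2], 'store2': [3, 4]},
                       index=['apple', 'banana'])
dframe1.index = dframe1.index.map(str.upper)
print(dframe1)  # the index becomes APPLE, BANANA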
Discretization and binning
• Continuous datasets often need to be converted into discrete or interval
forms.
• Each interval is referred to as a bin.
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
Say we want to divide the dataset into intervals of 118 to 125, 126 to 135, 136 to 160, and finally 160 and higher.
The cut() method is used to convert the dataset into intervals:
import pandas as pd
bins = [118, 125, 135, 160, 200]
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141,132]
category = pd.cut(height, bins)
category
import pandas as pd
bins = [118, 126, 136, 161, 200]
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141,132]
category2 = pd.cut(height, bins, right=False)
category2
The interval closure has changed: the results are now in the form of
left-closed, right-open intervals, e.g. [118, 126).
. We can also indicate the bin names by passing a list of labels:
• Labels to bins and count the number of values in each bin
import pandas as pd
bins = [118, 125, 135, 160, 200]
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bin_names = ['Short Height', 'Average height', 'Good Height', 'Taller']
category = pd.cut(height, bins, labels=bin_names)
category
Output is
[Short Height, Short Height, Short Height, Average height, Short
Height, ..., Average height, Taller, Good Height, Good Height, Average height]
Length: 12
Categories (4, object): [Short Height < Average height < Good
Height < Taller]
pd.value_counts(category)  # in recent pandas, prefer category.value_counts()
Exercise
• Suppose you have a DataFrame with a column of ages and you want
to discretize these ages into different age groups. Assign suitable
labels for each bin
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 42, 31, 19, 50]}
df = pd.DataFrame(data)
print(df)
Discretization and Binning
• Equal Width Discretization:
Divides the range of values into equally spaced intervals.
Example:
import pandas as pd
data = pd.DataFrame({'A': [10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54]})
bins = pd.cut(data['A'], bins=3) # Divides data into 3 equal-width bins
• Step 1: Sort the dataset in ascending order: 10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54.
• Step 2: Determine the width of each bin using the formula w = (max - min) / N, where N is the number of bins.
• With N = 4 bins, w = (54 - 10) / 4 = 11:
• BIN 1: [lower bound, upper bound] = [(min), (min + w - 1)] = [10, 20]
• BIN 2: [lower bound, upper bound] = [(min + w), (min + 2w - 1)] = [21, 31]
• BIN 3: [lower bound, upper bound] = [(min + 2w), (min + 3w - 1)] = [32, 42]
• BIN 4: [lower bound, upper bound] = [(min + 3w), (max)] = [43, 54]
• Equal-frequency discretization, by contrast, puts roughly the same number of points in
each bin: for example, 12 data points split into 3 bins gives a frequency of 4 per bin.
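A short sketch contrasting the two approaches on the dataset above (pd.qcut is the equal-frequency counterpart of pd.cut):
import pandas as pd
data = pd.Series([10, 15, 18, 20, 31, 34, 41, 46, 51, 53, 54])
equal_width = pd.cut(data, bins=4)   # equally wide intervals
equal_freq = pd.qcut(data, q=4)      # intervals holding roughly equal counts
print(equal_width.value_counts())
print(equal_freq.value_counts())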
import numpy as np
pd.cut(np.random.rand(40), 5, precision=2)
[(0.81, 0.99], (0.094, 0.27], (0.81, 0.99], (0.45, 0.63], (0.63,0.81], ..., (0.81, 0.99], (0.45, 0.63], (0.45, 0.63],
(0.81, 0.99],
(0.81, 0.99]] Length: 40
Categories (5, interval[float64]): [(0.094, 0.27] < (0.27, 0.45] <(0.45, 0.63] < (0.63, 0.81] < (0.81, 0.99]]
Quantiles are points taken at regular intervals from the cumulative
distribution function of a random variable. They are used to partition
a dataset into intervals with an equal number of data points or to
represent divisions of the probability distribution of a dataset.
• Median (2-Quantile): Divides the data into two equal halves. It's
the value below which 50% of the data falls.
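A small illustration of quantiles in pandas (the series is illustrative):
import pandas as pd
s = pd.Series(range(1, 11))   # 1 .. 10
print(s.quantile(0.5))        # the median (2-quantile): 5.5
print(pd.qcut(s, q=4))        # partition into 4 bins with equal counts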
Outliers
• Outliers are data points that deviate markedly from the rest of the dataset.
• Several methods can be employed to detect outliers.
• Filtering can involve handling them by removing, capping,
or transforming them.
Outlier Detection Methods
• Statistical Methods:
• Z-Score: identifying values far from the mean in terms of standard deviations.
• IQR (Interquartile Range): using quartiles to identify values outside a specific range (see the sketch after this list).
• Visualization Techniques: box plots and scatter plots make extreme values easy to spot.
• Machine Learning Methods:
• Clustering Algorithms: detecting data points that don’t cluster well with others.
• Isolation Forest, Local Outlier Factor (LOF): algorithms specifically designed for outlier detection.
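A minimal IQR-based detection sketch (the data and the conventional 1.5 multiplier are assumptions):
import pandas as pd
s = pd.Series([10, 12, 11, 13, 12, 95])  # one obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])  # flags 95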
Filtering Outliers
• Removal: drop the rows that contain outlier values.
• Capping or Flooring: limit values to chosen upper/lower bounds.
• Transformations: apply a transform (e.g., log) that reduces the influence of extreme values.
A short sketch of all three follows.
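A sketch of the three filtering strategies on the same illustrative series as above:
import numpy as np
import pandas as pd
s = pd.Series([10, 12, 11, 13, 12, 95])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
removed = s[(s >= lower) & (s <= upper)]   # removal: keep only in-range values
capped = s.clip(lower=lower, upper=upper)  # capping/flooring: limit to bounds
transformed = np.log1p(s)                  # transformation: compress large values
print(removed, capped, transformed, sep='\n')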
import numpy as np
import pandas as pd
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)
df
Using the numpy.random.permutation() function, we can randomly
select or permute a series of rows in a dataframe.
import numpy as np
import pandas as pd
dat = np.arange(80).reshape(10, 8)
df = pd.DataFrame(dat)
df
sampler = np.random.permutation(10)
sampler
Output
array([1, 5, 3, 6, 2, 4, 9, 0, 7, 8])
df.take(sampler)
The take() function is then used to reorder the rows of the original
dataframe according to the index values in the array generated by the
permutation function.
Random sampling without replacement
To compute random sampling without replacement, follow these steps:
1. To perform random sampling without replacement, we first create a
permutation array.
2. Next, we slice off the first n elements of the array where n is the desired size
of the subset you want to sample.
3. Then we use the df.take() method to obtain actual samples:
df.take(np.random.permutation(len(df))[:3])
Random sampling: sample method
In pandas, the DataFrame.sample method is used to randomly sample
a specified number of rows or a fraction of rows from a DataFrame.
Sampling with replacement can also be done with take() and random integers:
# 'sack' and 'sampler' were not defined on this slide; the definitions below
# are a reconstruction consistent with the output shown
import numpy as np
sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size=10)
draw = sack.take(sampler)
draw
Output
array([ 7, 7, 4, 5, 4, 4, 8, -2, 8, 5])
Random Sampling using sample() method
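The sample() example itself is missing from the extracted slide; a minimal sketch (the DataFrame and arguments are assumptions):
import pandas as pd
df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
print(df.sample(n=3))                 # 3 random rows, without replacement
print(df.sample(frac=0.5))            # a random 50% of the rows
print(df.sample(n=12, replace=True))  # with replacement, n may exceed len(df)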
Stacking and unstacking
import pandas as pd
stacked = dframe1.stack()
stacked
• The preceding series, stored in the variable stacked, can be
rearranged into a dataframe using the unstack() method:
stacked.unstack()
Since series1 contains no fours, fives, or sixes, their values are stored as NaN during the unstacking
process. Similarly, series2 contains no zeros, ones, or twos, so the corresponding values are stored as NaN.
References
• https://www.w3resource.com/python-exercises/pandas/joining-and-merging/index.php
• https://www.geeksforgeeks.org/python-pandas-merging-joining-and-concatenating/
• https://www.statology.org/pandas-find-duplicates/
• https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/
• https://medium.com/nerd-for-tech/dealing-with-missing-data-using-python-3fd785b77a05