Exercise 3
Introduction to Pandas:
Import Pandas
Once Pandas is installed, import it into your applications with the import keyword: import pandas
Now Pandas is imported and ready to use.
Example
import pandas
mydataset = {'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}
myvar = pandas.DataFrame(mydataset)
print(myvar)
What is a DataFrame?
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns. Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Result
calories    420
duration     50
Name: 0, dtype: int64
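The loc attribute also accepts a list of indexes, returning the matching rows as a new DataFrame. A minimal sketch, reusing the small calories/duration DataFrame from above:

```python
import pandas as pd

# The same small DataFrame used in the examples above
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# A list of indexes returns a DataFrame (a single index returns a Series)
rows = df.loc[[0, 1]]
print(rows)
```

Note the double brackets: df.loc[0] returns a Series, while df.loc[[0, 1]] returns a DataFrame with the selected rows.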
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas. In our examples we will be using a CSV file called 'data.csv'.
https://www.w3schools.com/python/pandas/data.csv
Example
Load the CSV into a DataFrame:
import pandas as pd

df = pd.read_csv('data.csv')
print(df.to_string())
Tip: use to_string() to print the entire DataFrame.
If you have a large DataFrame with many rows, Pandas will only return the first 5 rows and the last 5 rows:
Example
Print the DataFrame without the to_string() method:
import pandas as pd

df = pd.read_csv('data.csv')
print(df)
max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the pd.options.display.max_rows statement.
Example
Check the number of maximum returned rows:
import pandas as pd

print(pd.options.display.max_rows)
In my system the number is 60, which means that if the DataFrame contains
more than 60 rows, the print(df) statement will return only the headers and the
first and last 5 rows.
You can change the maximum rows number with the same statement.
Example
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd

pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
The head() method returns the headers and a specified number of rows, starting from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head(10))
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
Print the first 5 rows of the DataFrame:
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
Info About the Data Set
The info() method gives you more information about the data set:
Example
print(df.info())
Result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
Result Explained
The result tells us there are 169 rows and 4 columns:
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
Null Values
The info() method also tells us how many Non-Null values there are present in each
column, and in our data set it seems like there are 164 of 169 Non-Null values in the
"Calories" column.
This means that there are 5 rows with no value at all in the "Calories" column, for whatever reason. Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data.
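Those missing values can also be counted directly with isnull().sum(). A minimal sketch on a toy "Calories" column (the values here are illustrative, not the full data.csv file):

```python
import pandas as pd
import numpy as np

# Toy column with two missing entries, standing in for "Calories"
df = pd.DataFrame({"Calories": [409.1, np.nan, 340.0, np.nan, 282.4]})

# Count the NaN cells in the column
missing = df["Calories"].isnull().sum()
print(missing)  # 2
```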
Pandas - Cleaning Data
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
● Empty cells
● Data in wrong format
● Wrong data
● Duplicates
In this tutorial you will learn how to deal with all of them.
Our Data Set
In the next chapters we will use this data set:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
The data set contains some empty cells ("Date" in row 22, and "Calories" in rows 18 and 28). The data set contains wrong format ("Date" in row 26). The data set contains wrong data ("Duration" in row 7). The data set contains duplicates (rows 11 and 12).
Pandas - Cleaning Empty Cells
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will
not have a big impact on the result.
Example
Return a new Data Frame with no empty cells:
import pandas as pd

df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Note: By default, the dropna() method returns a new DataFrame, and will not change the original. If you want to change the original DataFrame, use the inplace = True argument:
Example
Remove all rows with NULL values:
import pandas as pd

df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Note: Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows
containing NULL values from the original DataFrame.
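Instead of removing rows, the fillna() method replaces empty cells with a value of your choice. A minimal sketch on a toy DataFrame (column names borrowed from the tutorial's data set, values invented):

```python
import pandas as pd
import numpy as np

# Toy DataFrame with one empty cell in each column
df = pd.DataFrame({"Duration": [60, 45, np.nan],
                   "Calories": [300.0, np.nan, 250.0]})

# Replace every NaN in the whole DataFrame with 130
df_filled = df.fillna(130)
print(df_filled)
```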
To only replace empty values for one column, specify the column name for the DataFrame:
Example
Replace NULL values in the "Calories" column with the number 130:
import pandas as pd

df = pd.read_csv('data.csv')
df["Calories"].fillna(130, inplace = True)
Replace Using Mean, Median, or Mode
A common way to replace empty cells is to calculate the mean, median or mode value of the column.
Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column:
Example
Calculate the MEAN, and replace any empty values with it:
import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)
Mean = the average value (the sum of all values divided by number of values).
Example
Calculate the MEDIAN, and replace any empty values with it:
import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"].fillna(x, inplace = True)
Median = the value in the middle, after you have sorted all values ascending.
Example
Calculate the MODE, and replace any empty values with it:
import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"].fillna(x, inplace = True)
Mode = the value that appears most frequently.
Pandas - Cleaning Data of Wrong Format
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into
the same format.
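The conversion itself is done with the to_datetime() method; on the full data set it would be df['Date'] = pd.to_datetime(df['Date']). A self-contained sketch on a toy 'Date' column with one empty cell, mimicking row 22:

```python
import pandas as pd
import numpy as np

# Toy 'Date' column: two valid dates plus one empty cell
df = pd.DataFrame({"Date": ["2020/12/25", "2020/12/26", np.nan]})

# Convert the whole column into pandas datetime objects;
# the empty cell becomes NaT (Not a Time)
df["Date"] = pd.to_datetime(df["Date"])
print(df)
```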
Result:
As you can see from the result, the date in row 26 was fixed, but the empty date in row
22 got a NaT (Not a Time) value, in other words an empty value. One way to deal with
empty values is simply removing the entire row.
Removing Rows
The result from the converting in the example above gave us a NaT value, which can
be handled as a NULL value, and we can remove the row by using the dropna()
method.
Example
Remove rows with a NULL value in the "Date" column:
df.dropna(subset=['Date'], inplace = True)
Pandas - Fixing Wrong Data
Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong,
like if someone registered "199" instead of "1.99".
Sometimes you can spot wrong data by looking at the data set, because you have an
expectation of what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for
all the other rows the duration is between 30 and 60.
It doesn't have to be wrong, but taking into consideration that this is the data set of someone's workout sessions, we conclude that this person did not work out for 450 minutes.
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
How can we fix wrong values, like the one for "Duration" in row 7?
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", so we could just insert "45" in row 7:
Example
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
For small data sets you might be able to replace the wrong data one by one, but not for
big data sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some
boundaries for legal values, and replace any values that are outside of the boundaries.
Example
Loop through all values in the "Duration" column; if a value is higher than 120, set it to 120:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
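The same capping rule can be written without an explicit loop by using boolean indexing, which is the more idiomatic (and faster) pandas style. A sketch on a toy "Duration" column:

```python
import pandas as pd

# Toy column containing one out-of-bounds value (450)
df = pd.DataFrame({"Duration": [60, 450, 45, 30]})

# Cap every value above 120 at 120 in one vectorized assignment
df.loc[df["Duration"] > 120, "Duration"] = 120
print(df["Duration"].tolist())  # [60, 120, 45, 30]
```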
Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good
chance you do not need them to do your analyses.
Example
Delete rows where "Duration" is higher than 120:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
Pandas - Removing Duplicates
Discovering Duplicates
Discovering
Duplicates
Duplicate rows are rows that have been registered more than one time.
The duplicated() method returns a Boolean value for each row:
Example
Return True for every row that is a duplicate, otherwise False:
print(df.duplicated())
Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
Example
Remove all duplicates:
df.drop_duplicates(inplace = True)
Pandas - Data Correlations
Finding Relationships
Finding Relationships
A great aspect of the Pandas module is the corr() method.
The corr() method calculates the relationship between each column in your data set. The examples on this page use a CSV file called 'data.csv'.
Download data.csv - https://www.w3schools.com/python/pandas/data.csv
Example
Show the relationship between the columns:
df.corr()
Result
          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000
Note: The corr() method ignores "not numeric" columns.
Result Explained
The result of the corr() method is a table with a lot of numbers that represent how well the relationship is between two columns.
The number varies from -1 to 1.
1 means that there is a 1 to 1 relationship (a perfect correlation), and for this data set, each time a value went up in the first column, the other one went up as well.
0.9 is also a good relationship, and if you increase one value, the other will probably increase as well.
-0.9 would be just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
0.2 means NOT a good relationship, meaning that if one value goes up, that does not mean the other will.
What is a good correlation? It depends on the use, but it is safe to say you have to have at least 0.6 (or -0.6) to call it a good correlation.
Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which makes sense,
each column always has a perfect relationship with itself.
Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more calories you
burn, and the other way around: if you burned a lot of calories, you probably had a
long work out.
Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation,
meaning that we can not predict the max pulse by just looking at the duration of the
work out, and vice versa.
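The behaviour of corr() is easy to verify on a small hand-made DataFrame; a sketch (column names borrowed from the tutorial, values invented for illustration; on newer pandas versions, pass numeric_only=True if text columns are present):

```python
import pandas as pd

# Two closely related toy columns
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30, 60],
    "Calories": [300.0, 240.0, 310.0, 180.0, 295.0],
})

corr = df.corr()
# The diagonal is always 1.0: a column correlates perfectly with itself;
# Duration and Calories move together, so their correlation is close to 1
print(corr)
```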
Pandas - Plotting
Plotting
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the
screen.
Example
Import pyplot from Matplotlib and visualize our DataFrame:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df.plot()
plt.show()
The examples on this page use a CSV file called 'data.csv'.
https://www.w3schools.com/python/pandas/data.csv
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x- and a y-axis.
In the example below we will use "Duration" for the x-axis and
"Calories" for the y-axis. Include the x and y arguments like this:
x = 'Duration', y = 'Calories'
Example
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Result
Remember: In the previous example, we learned that the correlation between "Duration" and "Calories" was 0.922721, and we concluded that a higher duration means more calories burned.
By looking at the scatterplot, I agree.
Let's create another scatterplot, where there is a bad relationship between the
columns, like "Duration" and "Maxpulse", with the correlation 0.009403:
Example
A scatterplot where there is no relationship between the columns:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()
Result
Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram needs only one column.
A histogram shows us the frequency of each interval, e.g. how many workouts lasted
between 50 and 60 minutes?
In the example below we will use the "Duration" column to create the histogram:
Example
df["Duration"].plot(kind = 'hist')
Write a Pandas program to calculate the sum of the examination attempts by the students.
Sample DataFrame:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Python Code :
import pandas as pd
import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
  'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
  'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
  'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("\nSum of the examination attempts by the students:")
print(df['attempts'].sum())
Write a Pandas program to select a specific row of given series/dataframe by integer index.
Test Data:
0 s001 V  Alberto Franco 15/05/2002 35 street1 t1
1 s002 V  Gino Mcneill   17/05/2002 32 street2 t2
2 s003 VI Ryan Parkes    16/02/1999 33 street3 t3
3 s001 VI Eesha Hinton   25/09/1998 30 street1 t4
4 s002 V  Gino Mcneill   11/05/2002 31 street2 t5
5 s004 VI David Parkes   15/09/1997 32 street4 t6
Python Code :
import pandas as pd

ds = pd.Series([1,3,5,7,9,11,13,15], index=[0,1,2,3,4,5,7,8])
print("Original Series:")
print(ds)
print("\nPrint specified row from the said series using location based indexing:")
print("\nThird row:")
print(ds.iloc[[2]])
print("\nFifth row:")
print(ds.iloc[[4]])

df = pd.DataFrame({
  'school_code': ['s001','s002','s003','s001','s002','s004'],
  'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],
  'name': ['Alberto Franco','Gino Mcneill','Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill', 'David Parkes'],
  'date_of_birth': ['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
  'weight': [35, 32, 33, 30, 31, 32]})
print("Original DataFrame with single index:")
print(df)
print("\nPrint specified row from the said DataFrame using location based indexing:")
print("\nThird row:")
print(df.iloc[[2]])
print("\nFifth row:")
print(df.iloc[[4]])
Write a Pandas program to add, subtract, multiple and divide two Pandas Series.
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Python Code :
import pandas as pd

ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 9])
ds = ds1 + ds2
print("Add two Series:")
print(ds)
print("Subtract two Series:")
ds = ds1 - ds2
print(ds)
print("Multiply two Series:")
ds = ds1 * ds2
print(ds)
print("Divide Series1 by Series2:")
ds = ds1 / ds2
print(ds)
Write a Pandas program to compare the elements of the two Pandas Series.
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 10]
Python Code :
import pandas as pd

ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 10])
print("Series1:")
print(ds1)
print("Series2:")
print(ds2)
print("Compare the elements of the said Series:")
print("Equals:")
print(ds1 == ds2)
print("Greater than:")
print(ds1 > ds2)
print("Less than:")
print(ds1 < ds2)
Create the Excel file with the name 'coalpublic2013.xlsx' and write programs for the following.
Year  MSHA ID  Mine_Name                       Production  Labor_Hours
2013  103381   Tacoa Highwall Miner            56,004      22,392
2013  103404   Reid School Mine                28,807      8,447
2013  100759   North River #1 Underground Min  14,40,115   4,74,784
2013  103246   Bear Creek                      87,587      29,193
2013  103451   Knight Mine                     1,47,499    46,393
2013  103433   Crane Central Mine              69,339      47,195
2013  100329   Concord Mine                    0           1,44,002
2013  100851   Oak Grove Mine                  22,69,014   10,01,809
2013  102901   Shoal Creek Mine                0           12,396
2013  102901   Shoal Creek Mine                14,53,024   12,37,415
2013  103180   Sloan Mountain Mine             3,27,780    1,96,963
2013  103182   Fishtrap                        1,75,058    87,314
2013  103285   Narley Mine                     1,54,861    90,584
2013  103332   Powhatan Mine                   1,40,521    61,394
2013  103375   Johnson Mine                    580         1,900
2013  103419   Maxine-Pratt Mine               1,25,824    1,07,469
2013  103432   Skelton Creek                   8,252       220
2013  103437   Black Warrior Mine No           11,45,924   70,926
2013  102976   Piney Woods Preparation Plant   0           14,828
2013  102976   Piney Woods Preparation Plant   0           23,193
2013  103380   Calera                          0           12,621
2013  103380   Calera                          0           1,402
2013  103422   Clark No 1 Mine                 1,22,727    1,40,250
2013  103323   Deerlick Mine                   1,33,452    46,381
2013  103364   Brc Alabama No. 7 Llc           0           14,324
2013  103436   Swann's Crossing                1,37,511    77,190
2013  100347   Choctaw Mine                    5,37,429    2,15,295
2013  101362   Manchester Mine                 2,19,457    1,16,914
2013  102996   Jap Creek Mine                  3,75,715    1,64,093
2013  103370   Cresent Valley Mine             2,860       621
Write a Pandas program to import given excel data into a Pandas dataframe.
Python Code :
import pandas as pd
import numpy as np

df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print(df.head())
Write a Pandas program to import some excel data, skipping the first twenty rows, into a Pandas dataframe.
Python Code :
import pandas as pd
import numpy as np

df = pd.read_excel(r'E:\coalpublic2013.xlsx', skiprows = 20)
print(df)
Write a Pandas program to find the sum, mean, max, min value of the 'Production' column of the excel file.
Python Code :
import pandas as pd
import numpy as np

df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print("Sum: ", df["Production"].sum())
print("Mean: ", df["Production"].mean())
print("Maximum: ", df["Production"].max())
print("Minimum: ", df["Production"].min())
Write a Pandas program to insert a column in the sixth position of the said excel sheet and fill it with NaN values.
Python Code :
import pandas as pd
import numpy as np

df = pd.read_excel(r'E:\coalpublic2013.xlsx')
df.insert(5, "column1", np.nan)  # position index 5 = sixth column
print(df.head())
Write a Pandas program to import given excel data into a dataframe and find all records that include two specific MSHA IDs.
Python Code :
import pandas as pd
import numpy as np

df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print(df[df["MSHA ID"].isin([102976, 103380])].head())
Result:
Thus, the study of and work with Pandas DataFrames was carried out, and the exercises were completed successfully.