
Ex. No. : 3.

Working with Pandas data frames


Date :
Aim
To study and work with pandas data frames

Introduction to Pandas:
Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:

import pandas

Now Pandas is imported and ready to use.
Example

import pandas
mydataset = {'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}
myvar = pandas.DataFrame(mydataset)
print(myvar)

What is a DataFrame?
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.

Example
Create a simple Pandas DataFrame:

import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result

calories duration
0 420 50
1 380 40
2 390 45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns. Pandas uses the loc attribute to return one or more specified row(s).

Example
Return row 0:

#refer to the row index:
print(df.loc[0])
Result
calories    420
duration     50
Name: 0, dtype: int64
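The loc attribute also accepts a list of indexes to return more than one row. A minimal sketch, reusing the same small DataFrame as above:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# pass a list of indexes to return rows 0 and 1 as a DataFrame
print(df.loc[[0, 1]])
```

Note that with a single index (df.loc[0]) the result is a Series, while a list of indexes returns a DataFrame.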
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas. In our examples we will be using a CSV file called 'data.csv'.
https://www.w3schools.com/python/pandas/data.csv

Example
Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

If you have a large DataFrame with many rows, Pandas will only return the first 5 rows and the last 5 rows:

Example
Print the DataFrame without the to_string() method:

import pandas as pd

df = pd.read_csv('data.csv')
print(df)

max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the pd.options.display.max_rows statement.

Example
Check the maximum number of returned rows:

import pandas as pd

print(pd.options.display.max_rows)

In my system the number is 60, which means that if the DataFrame contains more than 60 rows, the print(df) statement will return only the headers and the first and last 5 rows.

You can change the maximum rows number with the same statement.

Example
Increase the maximum number of rows to display the entire DataFrame:

import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')
print(df)
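Pandas also provides set_option() and get_option() as a documented alternative to assigning pd.options directly; a small sketch:

```python
import pandas as pd

# equivalent to: pd.options.display.max_rows = 9999
pd.set_option("display.max_rows", 9999)

# read the current value back
print(pd.get_option("display.max_rows"))
```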

Pandas - Analyzing DataFrames

Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.

Example
Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.head(10))

In our examples we will be using a CSV file called 'data.csv'.
https://www.w3schools.com/python/pandas/data.csv
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())

There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the bottom.

Example
Print the last 5 rows of the DataFrame:

print(df.tail())

Info About the Data


The DataFrames object has a method called info(), that gives you more information about
the data set.
Example
Print information about the data:

print(df.info())
Result

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
None

Result Explained
The result tells us there are 169 rows and 4 columns:

RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):

And the name of each column, with the data type:

 #   Column    Non-Null Count  Dtype
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64

Null Values
The info() method also tells us how many Non-Null values are present in each column, and in our data set it seems like there are 164 of 169 Non-Null values in the "Calories" column.
This means that there are 5 rows with no value at all in the "Calories" column, for whatever reason. Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data.
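Before removing or filling empty values, it can help to count them per column. A minimal sketch using isnull(), shown on a small inline DataFrame instead of data.csv:

```python
import pandas as pd
import numpy as np

# tiny stand-in for data.csv, with one empty Calories cell
df = pd.DataFrame({"Duration": [60, 45, 60],
                   "Calories": [409.1, np.nan, 300.0]})

# isnull() marks empty cells as True; sum() counts them per column
print(df.isnull().sum())
```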
Pandas - Cleaning Data

Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
● Empty cells
● Data in wrong format
● Wrong data
● Duplicates
In this tutorial you will learn how to deal with all of them.
Our Data Set
In the next chapters we will use this data set:

Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
The data set contains some empty cells ("Date" in row 22, and "Calories" in
row 18 and 28). The data set contains wrong format ("Date" in row 26).
The data set contains wrong data
("Duration" in row 7). The data set
contains duplicates (row 11 and 12).
Pandas - Cleaning Empty Cells

Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will
not have a big impact on the result.
Example
Return a new DataFrame with no empty cells:

import pandas as pd

df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Note: By default, the dropna() method returns a new DataFrame, and will not change the original. If you want to change the original DataFrame, use the inplace = True argument:

Example
Remove all rows with NULL values:

import pandas as pd

df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())

Note: Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.

Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead. This way you do not have to delete entire rows just because of some empty cells. The fillna() method allows us to replace empty cells with a value:

Example
Replace NULL values with the number 130:

import pandas as pd

df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
Replace Only For Specified Columns
The example above replaces all empty cells in the whole DataFrame.
To only replace empty values for one column, specify the column name for the DataFrame:

Example
Replace NULL values in the "Calories" column with the number 130:

import pandas as pd

df = pd.read_csv('data.csv')
df["Calories"].fillna(130, inplace = True)
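Note that newer pandas versions warn about calling fillna() with inplace = True on a single column (it is a form of chained assignment); assigning the result back to the column avoids the warning. A sketch on inline data:

```python
import pandas as pd
import numpy as np

# tiny stand-in for the Calories column of data.csv
df = pd.DataFrame({"Calories": [409.1, np.nan, 300.0]})

# assign the filled column back instead of using inplace = True
df["Calories"] = df["Calories"].fillna(130)
print(df)
```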
Replace Using Mean, Median, or Mode
A common way to replace empty cells is to calculate the mean, median or mode value of the column.
Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column:

Example
Calculate the MEAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)
Mean = the average value (the sum of all values divided by the number of values).

Example
Calculate the MEDIAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"].fillna(x, inplace = True)
Median = the value in the middle, after you have sorted all values ascending.

Example
Calculate the MODE, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"].fillna(x, inplace = True)

Pandas - Cleaning Data of Wrong Format

Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.

Convert Into a Correct Format
In our DataFrame, we have two cells with the wrong format. Check out rows 22 and 26; the 'Date' column should be a string that represents a date:

Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Let's try to convert all cells in the 'Date' column into dates. Pandas has a to_datetime() method for this:

Example
Convert to date:

import pandas as pd

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

Result:

Duration Date Pulse Maxpulse Calories
0 60 2020-12-01 110 130 409.1
1 60 2020-12-02 117 145 479.0
2 60 2020-12-03 103 135 340.0
3 45 2020-12-04 109 175 282.4
4 45 2020-12-05 117 148 406.0
5 60 2020-12-06 102 127 300.0
6 60 2020-12-07 110 136 374.0
7 450 2020-12-08 104 134 253.3
8 30 2020-12-09 109 133 195.1
9 60 2020-12-10 98 124 269.0
10 60 2020-12-11 103 147 329.3
11 60 2020-12-12 100 120 250.7
12 60 2020-12-12 100 120 250.7
13 60 2020-12-13 106 128 345.3
14 60 2020-12-14 104 132 379.3
15 60 2020-12-15 98 123 275.0
16 60 2020-12-16 98 120 215.2
17 60 2020-12-17 100 120 300.0
18 45 2020-12-18 90 112 NaN
19 60 2020-12-19 103 123 323.0
20 45 2020-12-20 97 125 243.0
21 60 2020-12-21 108 131 364.2
22 45 NaT 100 119 282.0
23 60 2020-12-23 130 101 300.0
24 45 2020-12-24 105 132 246.0
25 60 2020-12-25 102 126 334.5
26 60 2020-12-26 100 120 250.0
27 60 2020-12-27 92 118 241.0
28 60 2020-12-28 103 132 NaN
29 60 2020-12-29 100 132 280.0
30 60 2020-12-30 102 129 380.3
31 60 2020-12-31 92 115 243.0

As you can see from the result, the date in row 26 was fixed, but the empty date in row
22 got a NaT (Not a Time) value, in other words an empty value. One way to deal with
empty values is simply removing the entire row.

Removing Rows
The result from the converting in the example above gave us a NaT value, which can
be handled as a NULL value, and we can remove the row by using the dropna()
method.
Example
Remove rows with a NULL value in the "Date" column:

df.dropna(subset=['Date'], inplace = True)

Pandas - Fixing Wrong Data

Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong,
like if someone registered "199" instead of "1.99".
Sometimes you can spot wrong data by looking at the data set, because you have an
expectation of what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for
all the other rows the duration is between 30 and 60.
It doesn't have to be wrong, but taking in consideration that this is the data set of
someone's workout sessions, we conclude with the fact that this person did not work
out in 450 minutes.
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
How can we fix wrong values, like the one for "Duration" in row 7?

Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", so we could just insert "45" in row 7:

Example
Set "Duration" = 45 in row 7:

df.loc[7, 'Duration'] = 45

For small data sets you might be able to replace the wrong data one by one, but not for
big data sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some
boundaries for legal values, and replace any values that are outside of the boundaries.

for x in df.index:
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
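The same boundary rule can be written without an explicit loop, using a boolean mask with loc. A sketch on inline data:

```python
import pandas as pd

# tiny stand-in for the Duration column, with one out-of-range value
df = pd.DataFrame({"Duration": [60, 450, 45]})

# select every row where Duration exceeds 120 and cap it at 120
df.loc[df["Duration"] > 120, "Duration"] = 120
print(df)
```

The mask version does the same work as the for loop, but lets pandas apply the rule to all rows at once.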

Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good
chance you do not need them to do your analyses.
Example
Delete rows where "Duration" is higher than 120:

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)

Pandas - Removing Duplicates

Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.

Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:

Example
Returns True for every row that is a duplicate, otherwise False:

print(df.duplicated())
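Since duplicated() returns booleans, summing the result gives the number of duplicate rows. A sketch on inline data:

```python
import pandas as pd

# the second row repeats the first, so exactly one row is a duplicate
df = pd.DataFrame({"Duration": [60, 60, 45],
                   "Pulse": [100, 100, 90]})

print(df.duplicated())
print("duplicates:", df.duplicated().sum())
```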

Removing Duplicates
To remove duplicates, use the drop_duplicates() method.

Example
Remove all duplicates:

df.drop_duplicates(inplace = True)

Pandas - Data Correlations

Finding Relationships
A great aspect of the Pandas module is the corr() method.
The corr() method calculates the relationship between each column in your data set.
The examples on this page use a CSV file called 'data.csv'.
Download data.csv - https://www.w3schools.com/python/pandas/data.csv

Example
Show the relationship between the columns:

df.corr()
Result

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000
Note: The corr() method ignores "not numeric" columns.
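In recent pandas versions (2.0 and later), corr() raises an error for non-numeric columns instead of silently ignoring them; passing numeric_only = True restores the behaviour described here. A sketch on inline data:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [409.1, 282.4, 300.0, 195.1],
    "Date": ["2020/12/01", "2020/12/02", "2020/12/03", "2020/12/04"],
})

# numeric_only=True drops the non-numeric "Date" column before correlating
print(df.corr(numeric_only=True))
```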
Result Explained
The result of the corr() method is a table with a lot of numbers that represent how strong the relationship between two columns is.
The number varies from -1 to 1.
1 means that there is a 1 to 1 relationship (a perfect correlation), and for this data set, each time a value went up in the first column, the other one went up as well.
0.9 is also a good relationship, and if you increase one value, the other will probably increase as well.
-0.9 would be just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
0.2 means NOT a good relationship, meaning that if one value goes up, it does not mean that the other will.
What is a good correlation? It depends on the use, but it is safe to say you have to have at least 0.6 (or -0.6) to call it a good correlation.
Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which makes sense,
each column always has a perfect relationship with itself.
Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more calories you
burn, and the other way around: if you burned a lot of calories, you probably had a
long work out.
Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation,
meaning that we can not predict the max pulse by just looking at the duration of the
work out, and vice versa.
Pandas - Plotting
Plotting
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the
screen.
Example
Import pyplot from Matplotlib and visualize our DataFrame:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot()
plt.show()

The examples on this page use a CSV file called 'data.csv'.
https://www.w3schools.com/python/pandas/data.csv

Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x- and a y-axis.
In the example below we will use "Duration" for the x-axis and
"Calories" for the y-axis. Include the x and y arguments like this:
x = 'Duration', y = 'Calories'
Example

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Result

Remember: In the previous example, we learned that the correlation between "Duration" and "Calories" was 0.922721, and we concluded that a higher duration means more calories burned.
By looking at the scatter plot, I will agree.
Let's create another scatter plot, where there is a bad relationship between the columns, like "Duration" and "Maxpulse", with the correlation 0.009403:

Example
A scatter plot where there is no relationship between the columns:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()

Result

Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram needs only one column.
A histogram shows us the frequency of each interval, e.g. how many workouts lasted
between 50 and 60 minutes?
In the example below we will use the "Duration" column to create the histogram:
Example
df["Duration"].plot(kind = 'hist')

Pandas - DataFrame Reference


All properties and methods of the DataFrame object, with explanations and examples:
Property/Method Description
abs() Return a DataFrame with the absolute value of each value
add() Adds the values of a DataFrame with the specified value(s)
add_prefix() Prefix all labels
add_suffix() Suffix all labels
agg() Apply a function or a function name to one of the axis of the DataFrame
aggregate() Apply a function or a function name to one of the axis of the DataFrame
align() Aligns two DataFrames with a specified join method
all() Return True if all values in the DataFrame are True, otherwise False
any() Returns True if any of the values in the DataFrame are True, otherwise False
append() Append new columns
applymap() Execute a function for each element in the DataFrame
apply() Apply a function to one of the axis of the DataFrame
assign() Assign new columns
astype() Convert the DataFrame into a specified dtype
at Get or set the value of the item with the specified label
axes Returns the labels of the rows and the columns of the DataFrame
bfill() Replaces NULL values with the value from the next row
bool() Returns the Boolean value of the DataFrame
columns Returns the column labels of the DataFrame
combine() Compare the values in two DataFrames, and let a function decide which values
to keep
combine_first() Compare two DataFrames, and if the first DataFrame has a NULL value, it will
be filled with the respective value from the second DataFrame
compare() Compare two DataFrames and return the differences
convert_dtypes() Converts the columns in the DataFrame into new dtypes
corr() Find the correlation (relationship) between each column
count() Returns the number of not empty cells for each column/row
cov() Find the covariance of the columns
copy() Returns a copy of the DataFrame
cummax() Calculate the cumulative maximum values of the DataFrame
cummin() Calculate the cumulative minimum values of the DataFrame
cumprod() Calculate the cumulative product over the DataFrame
cumsum() Calculate the cumulative sum over the DataFrame
describe() Returns a description summary for each column in the DataFrame
diff() Calculate the difference between a value and the value of the same column in
the previous row
div() Divides the values of a DataFrame with the specified value(s)
dot() Multiplies the values of a DataFrame with values from another array-like object,
and add the result

drop() Drops the specified rows/columns from the DataFrame


drop_duplicates() Drops duplicate values from the DataFrame
droplevel() Drops the specified index/column(s)
dropna() Drops all rows that contains NULL values
dtypes Returns the dtypes of the columns of the DataFrame
duplicated() Returns True for duplicated rows, otherwise False
empty Returns True if the DataFrame is empty, otherwise False
eq() Returns True for values that are equal to the specified value(s), otherwise False
equals() Returns True if two DataFrames are equal, otherwise False
eval Evaluate a specified string
explode() Converts each element into a row
ffill() Replaces NULL values with the value from the previous row
fillna() Replaces NULL values with the specified value
filter() Filter the DataFrame according to the specified filter
first() Returns the first rows of a specified date selection
floordiv() Divides the values of a DataFrame with the specified value(s), and floor the values
ge() Returns True for values greater than, or equal to the specified value(s),
otherwise False
get() Returns the item of the specified key
groupby() Groups the rows/columns into specified groups
gt() Returns True for values greater than the specified value(s), otherwise False
head() Returns the header row and the first 5 rows, or the specified number of rows
iat Get or set the value of the item in the specified position
idxmax() Returns the label of the max value in the specified axis
idxmin() Returns the label of the min value in the specified axis
iloc Get or set the values of a group of elements in the specified positions
index Returns the row labels of the DataFrame
infer_objects() Change the dtype of the columns in the DataFrame
info() Prints information about the DataFrame
insert() Insert a column in the DataFrame
interpolate() Replaces not-a-number values with the interpolated method
isin() Returns True if each element in the DataFrame is in the specified values
isna() Finds not-a-number values
isnull() Finds NULL values
items() Iterate over the columns of the DataFrame
iteritems() Iterate over the columns of the DataFrame
iterrows() Iterate over the rows of the DataFrame
itertuples() Iterate over the rows as named tuples
join() Join columns of another DataFrame
last() Returns the last rows of a specified date selection
le() Returns True for values less than, or equal to the specified value(s), otherwise False
loc Get or set the value of a group of elements specified using their labels
lt() Returns True for values less than the specified value(s), otherwise False
keys() Returns the keys of the info axis
kurtosis() Returns the kurtosis of the values in the specified axis
mask() Replace all values where the specified condition is True
max() Return the max of the values in the specified axis
mean() Return the mean of the values in the specified axis
median() Return the median of the values in the specified axis
melt() Reshape the DataFrame from a wide table to a long table
memory_usage() Returns the memory usage of each column
merge() Merge DataFrame objects
min() Returns the min of the values in the specified axis

mod() Modulo (find the remainder) of the values of a DataFrame with the specified value(s)


mode() Returns the mode of the values in the specified axis
mul() Multiplies the values of a DataFrame with the specified value(s)
ndim Returns the number of dimensions of the DataFrame
ne() Returns True for values that are not equal to the specified value(s), otherwise False
nlargest() Sort the DataFrame by the specified columns, descending, and return the specified
number of rows
notna() Finds values that are not not-a-number
notnull() Finds values that are not NULL
nsmallest() Sort the DataFrame by the specified columns, ascending, and return the specified
number of rows
nunique() Returns the number of unique values in the specified axis
pct_change() Returns the percentage change between the previous and the current value
pipe() Apply a function to the DataFrame
pivot() Re-shape the DataFrame
pivot_table() Create a spreadsheet pivot table as a DataFrame
pop() Removes an element from the DataFrame
pow() Raise the values of one DataFrame to the values of another DataFrame
prod() Returns the product of all values in the specified axis
product() Returns the product of the values in the specified axis
quantile() Returns the values at the specified quantile of the specified axis
query() Query the DataFrame
radd() Reverse-adds the values of one DataFrame with the values of another DataFrame
rdiv() Reverse-divides the values of one DataFrame with the values of another DataFrame
reindex() Change the labels of the DataFrame
reindex_like() Change the labels of the DataFrame to match the labels of another DataFrame
rename() Change the labels of the axes
rename_axis() Change the name of the axis
reorder_levels() Re-order the index levels
replace() Replace the specified values
reset_index() Reset the index
rfloordiv() Reverse-divides the values of one DataFrame with the values of another DataFrame
rmod() Reverse-modules the values of one DataFrame to the values of another DataFrame
rmul() Reverse-multiplies the values of one DataFrame with the values of another
DataFrame
round() Returns a DataFrame with all values rounded into the specified format
rpow() Reverse-raises the values of one DataFrame up to the values of another DataFrame
rsub() Reverse-subtracts the values of one DataFrame to the values of another DataFrame
rtruediv() Reverse-divides the values of one DataFrame with the values of another DataFrame
sample() Returns a random selection of elements
sem() Returns the standard error of the mean in the specified axis
select_dtypes() Returns a DataFrame with columns of selected data types
shape Returns the number of rows and columns of the DataFrame
set_axis() Sets the index of the specified axis
set_flags() Returns a new DataFrame with the specified flags
set_index() Set the Index of the DataFrame
size Returns the number of elements in the DataFrame
skew() Returns the skew of the values in the specified axis
sort_index() Sorts the DataFrame according to the labels
sort_values() Sorts the DataFrame according to the values
squeeze() Converts a single column DataFrame into a Series
stack() Reshape the DataFrame from a wide table to a long table
std() Returns the standard deviation of the values in the specified axis
sum() Returns the sum of the values in the specified axis

sub() Subtracts the values of a DataFrame with the specified value(s)


swaplevel() Swaps the two specified levels
T Turns rows into columns and columns into rows
tail() Returns the headers and the last rows
take() Returns the specified elements
to_xarray() Returns an xarray object
transform() Execute a function for each value in the DataFrame
transpose() Turns rows into columns and columns into rows
truediv() Divides the values of a DataFrame with the specified value(s)
truncate() Removes elements outside of a specified set of values
update() Update one DataFrame with the values from another DataFrame
value_counts() Returns the number of unique rows
values Returns the DataFrame as a NumPy array
var() Returns the variance of the values in the specified axis
where() Replace all values where the specified condition is False
xs() Returns the cross-section of the DataFrame
__iter__() Returns an iterator of the info axis
Sample Programs

Write a Pandas program to get the powers of an array values element-wise.
Note: First array elements raised to powers from second array
Sample data: {'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]}
Python Code :

import pandas as pd

df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86], 'Z':[86,97,96,72,83]})
print(df)
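As written, the solution only builds and prints the DataFrame. One way to actually take an element-wise power, as the exercise title describes, is the pow() method, shown here with exponent 2 as one possible reading of the exercise (a sketch, not the only valid answer):

```python
import pandas as pd

df = pd.DataFrame({'X': [78, 85, 96, 80, 86],
                   'Y': [84, 94, 89, 83, 86],
                   'Z': [86, 97, 96, 72, 83]})

# raise every value in the DataFrame to the power 2, element-wise
print(df.pow(2))
```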

Write a Pandas program to calculate the sum of the examination attempts by the students.
Sample DataFrame:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Python Code :
import pandas as pd import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("\nSum of the examination attempts by the students:")
print(df['attempts'].sum())
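Beyond `sum()`, the same pattern works for other aggregates; note that pandas skips NaN values by default, which matters for the 'score' column. A small sketch on a stand-alone Series holding the same score values:

```python
import numpy as np
import pandas as pd

scores = pd.Series([12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19])

# sum(), count() and mean() all ignore NaN entries by default
print("Sum:", scores.sum())
print("Non-missing scores:", scores.count())
print("Mean:", scores.mean())
```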

Write a Pandas program to iterate over rows in a DataFrame.


Sample Python dictionary data and list labels:
exam_data = [{'name':'Anastasia', 'score':12.5}, {'name':'Dima','score':9},
{'name':'Katherine','score':16.5}]
Python Code :
import pandas as pd
import numpy as np
exam_data = [{'name': 'Anastasia', 'score': 12.5}, {'name': 'Dima', 'score': 9},
{'name': 'Katherine', 'score': 16.5}]
df = pd.DataFrame(exam_data)
for index, row in df.iterrows():
    print(row['name'], row['score'])
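`iterrows()` yields (index, Series) pairs; for larger frames, `itertuples()` is generally faster and preserves dtypes. A sketch of the same loop with `itertuples()`:

```python
import pandas as pd

df = pd.DataFrame([{'name': 'Anastasia', 'score': 12.5},
                   {'name': 'Dima', 'score': 9},
                   {'name': 'Katherine', 'score': 16.5}])

# itertuples() yields namedtuples; index=False drops the index field
names = []
for row in df.itertuples(index=False):
    names.append(row.name)
    print(row.name, row.score)
```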

Write a Pandas program to drop a list of rows from a specified DataFrame.


Sample data:
Original DataFrame
   col1  col2  col3
0     1     4     7
1     4     5     8
2     3     6     9
3     4     7     0
4     5     8     1
New DataFrame after removing 2nd & 4th rows:
   col1  col2  col3
0     1     4     7
1     4     5     8
3     4     7     0
Python Code :
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(d)
print("Original DataFrame")
print(df)
print("New DataFrame after removing 2nd & 4th rows:")
df = df.drop(df.index[[2, 4]])
print(df)
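`df.index[[2, 4]]` converts positions into index labels before dropping. With the default RangeIndex, labels and positions coincide, so `df.drop([2, 4])` behaves the same; a sketch, also showing `reset_index` to renumber the surviving rows:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 4, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [7, 8, 9, 0, 1]})

# drop() removes rows by index label; labels equal positions here
by_label = df.drop([2, 4])

# reset_index(drop=True) renumbers the remaining rows 0..n-1
renumbered = by_label.reset_index(drop=True)
print(renumbered)
```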

Write a Pandas program to select a specific row of given series/dataframe by integer index.
Test Data:
0  s001  V   Alberto Franco  15/05/2002  35  street1  t1
1  s002  V   Gino Mcneill    17/05/2002  32  street2  t2
2  s003  VI  Ryan Parkes     16/02/1999  33  street3  t3
3  s001  VI  Eesha Hinton    25/09/1998  30  street1  t4
4  s002  V   Gino Mcneill    11/05/2002  31  street2  t5
5  s004  VI  David Parkes    15/09/1997  32  street4  t6
Python Code :
import pandas as pd
ds = pd.Series([1,3,5,7,9,11,13,15], index=[0,1,2,3,4,5,7,8])
print("Original Series:")
print(ds)
print("\nPrint specified row from the said series using location based indexing:")
print("\nThird row:")
print(ds.iloc[[2]])
print("\nFifth row:")
print(ds.iloc[[4]])
df = pd.DataFrame({
'school_code': ['s001', 's002', 's003', 's001', 's002', 's004'],
'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],
'name': ['Alberto Franco', 'Gino Mcneill', 'Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill', 'David Parkes'],
'date_of_birth': ['15/05/2002', '17/05/2002', '16/02/1999', '25/09/1998', '11/05/2002', '15/09/1997'],
'weight': [35, 32, 33, 30, 31, 32]})
print("Original DataFrame with single index:")
print(df)
print("\nPrint specified row from the said DataFrame using location based indexing:")
print("\nThird row:")
print(df.iloc[[2]])
print("\nFifth row:")
print(df.iloc[[4]])
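Since the series above has index labels 0–5, 7, 8 (there is no label 6), it is a good case for contrasting positional `iloc` with label-based `loc`; a sketch:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9, 11, 13, 15], index=[0, 1, 2, 3, 4, 5, 7, 8])

# iloc selects by position, loc by index label
print(s.iloc[6])  # seventh element by position
print(s.loc[7])   # element whose label is 7 (the same element here)
```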

Write a Pandas program to add, subtract, multiple and divide two Pandas Series.
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Python Code :
import pandas as pd
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 9])
ds = ds1 + ds2
print("Add two Series:")
print(ds)
print("Subtract two Series:")
ds = ds1 - ds2
print(ds)
print("Multiply two Series:")
ds = ds1 * ds2
print(ds)
print("Divide Series1 by Series2:")
ds = ds1 / ds2
print(ds)
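The arithmetic above works element by element because both Series share the same default index. In general pandas aligns on index labels, and unmatched labels yield NaN; a sketch, with `fill_value` as the usual remedy:

```python
import pandas as pd

a = pd.Series([2, 4, 6], index=['x', 'y', 'z'])
b = pd.Series([1, 3], index=['x', 'y'])

# '+' aligns on labels: 'z' has no partner in b, so it becomes NaN
summed = a + b
print(summed)

# add() with fill_value treats the missing partner as 0 instead
filled = a.add(b, fill_value=0)
print(filled)
```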

Write a Pandas program to compare the elements of the two Pandas Series.
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 10]
Python Code :
import pandas as pd
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 10])
print("Series1:")
print(ds1)
print("Series2:")
print(ds2)
print("Compare the elements of the said Series:")
print("Equals:")
print(ds1 == ds2)
print("Greater than:")
print(ds1 > ds2)
print("Less than:")
print(ds1 < ds2)
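The operators ==, > and < have method equivalents (`eq`, `gt`, `lt`, plus `ne`, `ge`, `le`), which are handy when extra arguments such as a fill value are needed; a sketch:

```python
import pandas as pd

ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 10])

# eq() mirrors ==; the result is a boolean Series
equal = ds1.eq(ds2)
print(equal)
print("Number of equal elements:", equal.sum())
```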

Create an Excel file named 'coalpublic2013.xlsx' with the following data and write programs for the tasks below:
Year  MSHA ID  Mine_Name                        Production  Labor_Hours
2013  103381   Tacoa Highwall Miner                 56,004       22,392
2013  103404   Reid School Mine                     28,807        8,447
2013  100759   North River #1 Underground Mine   1,440,115      474,784
2013  103246   Bear Creek                           87,587       29,193
2013  103451   Knight Mine                         147,499       46,393
2013  103433   Crane Central Mine                   69,339       47,195
2013  100329   Concord Mine                              0      144,002
2013  100851   Oak Grove Mine                    2,269,014    1,001,809
2013  102901   Shoal Creek Mine                          0       12,396
2013  102901   Shoal Creek Mine                  1,453,024    1,237,415
2013  103180   Sloan Mountain Mine                 327,780      196,963
2013  103182   Fishtrap                            175,058       87,314
2013  103285   Narley Mine                         154,861       90,584
2013  103332   Powhatan Mine                       140,521       61,394
2013  103375   Johnson Mine                            580        1,900
2013  103419   Maxine-Pratt Mine                   125,824      107,469
2013  103432   Skelton Creek                         8,252          220
2013  103437   Black Warrior Mine No             1,145,924       70,926
2013  102976   Piney Woods Preparation Plant             0       14,828
2013  102976   Piney Woods Preparation Plant             0       23,193
2013  103380   Calera                                    0       12,621
2013  103380   Calera                                    0        1,402
2013  103422   Clark No 1 Mine                     122,727      140,250
2013  103323   Deerlick Mine                       133,452       46,381
2013  103364   Brc Alabama No. 7 Llc                     0       14,324
2013  103436   Swann's Crossing                    137,511       77,190
2013  100347   Choctaw Mine                        537,429      215,295
2013  101362   Manchester Mine                     219,457      116,914
2013  102996   Jap Creek Mine                      375,715      164,093
2013  103370   Cresent Valley Mine                   2,860          621

Write a Pandas program to import given excel data into a Pandas dataframe.
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print(df.head())

Write a Pandas program to import some excel data skipping first twenty rows into a Pandas
dataframe.
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx', skiprows=20)
print(df)

Write a Pandas program to find the sum, mean, max, min value of 'Production
(short tons)' column of excel file.
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print("Sum: ", df["Production"].sum())
print("Mean: ", df["Production"].mean())
print("Maximum: ", df["Production"].max())
print("Minimum: ", df["Production"].min())

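The four statistics above can also be computed in one pass with `agg()`, or summarized together with `describe()`. A sketch using hypothetical stand-in values for the 'Production' column (the Excel file may not be at hand):

```python
import pandas as pd

# Hypothetical stand-in for the Production column of coalpublic2013.xlsx
production = pd.Series([56004, 28807, 1440115, 87587], name="Production")

# agg() computes several statistics in one call; describe() adds quartiles
stats = production.agg(['sum', 'mean', 'max', 'min'])
print(stats)
print(production.describe())
```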
Write a Pandas program to insert a column in the sixth position of the said
excel sheet and fill it with NaN values.
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
# loc=5 makes the new column the sixth one
df.insert(5, "column1", np.nan)
print(df.head())

Write a Pandas program to import excel data into a Pandas dataframe and display the last ten rows.
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print(df.tail(n=10))

Write a Pandas program to import given excel data into a dataframe and
find all records that include two specific MSHA ID.
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
print(df[df["MSHA ID"].isin([102976, 103380])].head())
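`isin()` builds a boolean mask that can be inspected or reused before filtering; a small sketch on a hypothetical subset of the MSHA IDs:

```python
import pandas as pd

# Hypothetical subset of the MSHA ID column
df = pd.DataFrame({'MSHA ID': [103381, 102976, 103380, 103404]})

# isin() marks rows whose value appears in the given list
mask = df['MSHA ID'].isin([102976, 103380])
print(df[mask])
```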

Write a Pandas program to create a subtotal of "Labor Hours" against MSHA ID from the given excel data.
Sample Solution:
Python Code :
import pandas as pd
import numpy as np
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
df_sub = df[["MSHA ID", "Labor_Hours"]].groupby('MSHA ID').sum()
print(df_sub)
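The groupby/sum pattern can be checked on a small stand-in frame: IDs that occur twice collapse into one combined subtotal row. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical rows mimicking the MSHA ID / Labor_Hours columns
df = pd.DataFrame({'MSHA ID': [102976, 102976, 103380, 103380],
                   'Labor_Hours': [14828, 23193, 12621, 1402]})

# groupby() + sum() yields one subtotal per distinct MSHA ID
subtotal = df.groupby('MSHA ID')['Labor_Hours'].sum()
print(subtotal)
```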

Write a Pandas program to import given excel data (coalpublic2013.xlsx) into a dataframe and draw a bar plot comparing Year, MSHA ID, Production and Labor_Hours of the first ten records.
Python Code :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel(r'E:\coalpublic2013.xlsx')
df.head(10).plot(kind='bar', figsize=(20, 8))
plt.show()

Result:

Thus, working with Pandas data frames was studied and the exercises were completed successfully.
