Pandas
# What is Pandas?
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The
library was created by Wes McKinney in 2008.
Pandas can clean messy data sets and make them readable and relevant.
Pandas can also delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
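As a minimal sketch of what such cleaning can look like, the snippet below builds a small DataFrame with one missing value and removes the affected row with "dropna()". The column names and values are invented for illustration and do not come from a real data set.

```python
import pandas as pd

# Hypothetical data: the second row has a missing (None) duration.
raw = pd.DataFrame({
    "duration": [60, None, 45],
    "calories": [409.1, 479.0, 282.4],
})

# dropna() returns a new DataFrame without the rows that contain missing values.
cleaned = raw.dropna()

print(cleaned)
```

The original DataFrame is left unchanged; "dropna()" returns a cleaned copy.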
1. Pandas Getting Started
1.1 Installation of Pandas
If Python and pip are already installed on your system, installing Pandas is
easy: run "pip install pandas" from the command line.
import pandas
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
OUTPUT :
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
1.3 Pandas as pd
Alias: in Python, an alias is an alternate name for referring to the same thing.
Create an alias with the "as" keyword while importing:
import pandas as pd
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
OUTPUT :
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
import pandas as pd
print(pd.__version__)
OUTPUT :
1.2.2
2. Pandas Series
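2.1 What is a Series?
A Pandas Series is a one-dimensional labeled array that can hold data of any
type, much like a single column in a table.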
Example 2.1 : Create a simple Pandas Series from a list - int, float, string
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0    1
1    2
2    3
dtype: int64
Pandas decides the datatype (dtype) of a Series based on the values it
contains.
import pandas as pd
a = [1.1, 2.2, 3.3]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0    1.1
1    2.2
2    3.3
dtype: float64
import pandas as pd
a = ["apple", "banana", "orange"]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0     apple
1    banana
2    orange
dtype: object
import pandas as pd
a = [1, "banana", 3]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0         1
1    banana
2         3
dtype: object
2.2 Labels
If nothing else is specified, the values are labeled with their index number:
the first value has index 0, the second value has index 1, and so on.
import pandas as pd
a = [1, 2, 3, 4, 5, 6, 7]
myvar = pd.Series(a)
print(myvar[1])
OUTPUT :
2
import pandas as pd
a = [1, 2, 3, 4, 5, 6, 7]
myvar = pd.Series(a)
print(myvar[1:4])
OUTPUT :
1    2
2    3
3    4
dtype: int64
2.3 Create Labels
With the "index" argument, you can name your own labels.
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
OUTPUT :
x    1
y    2
z    3
dtype: int64
When you have created labels, you can access an item by referring to the label.
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar["y"])
OUTPUT :
2
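The difference between a label and a position can be made explicit with the "loc" and "iloc" accessors. The sketch below reuses the x, y, z labels from the example above; it is one illustration of label-based and position-based access, not the only way to do it.

```python
import pandas as pd

a = [1, 2, 3]
myvar = pd.Series(a, index=["x", "y", "z"])

# loc looks an item up by its label.
by_label = myvar.loc["y"]

# iloc looks an item up by its integer position (counting from 0).
by_position = myvar.iloc[1]

print(by_label, by_position)
```

Both lookups refer to the same item here, because the label "y" sits at position 1.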
3. Pandas DataFrames
3.1 What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a table with rows
and columns.
To create a DataFrame from different sources of data or from other Python
datatypes, use the "DataFrame()" constructor.
import pandas as pd
df = pd.DataFrame()
print(df)
OUTPUT :
Empty DataFrame
Columns: []
Index: []
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
OUTPUT :
   calories  duration
0       420        50
1       380        40
2       390        45
Example 3.3 Create a simple Pandas DataFrame with Labels - Index
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
OUTPUT :
      calories  duration
day1       420        50
day2       380        40
day3       390        45
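Once a DataFrame has named rows, a single row can be looked up by its label. The following sketch assumes the same day1 to day3 data as above and uses the "loc" accessor for the row and square brackets for a column; both return a Series.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# loc with a single row label returns that row as a Series.
row = df.loc["day2"]
print(row)

# Selecting a column by name also returns a Series.
calories = df["calories"]
print(calories)
```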
To create a Pandas DataFrame from a list of lists, pass the list of lists as
the data argument to "pandas.DataFrame()".
Each inner list becomes a row in the resulting DataFrame.
import pandas as pd
#list of lists
data = [['a1', 'b1', 'c1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3']]
df = pd.DataFrame(data)
print(df)
OUTPUT :
    0   1   2
0  a1  b1  c1
1  a2  b2  c2
2  a3  b3  c3
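Example 3.5: Create DataFrame from List of Lists with Column Names & Index
import pandas as pd
#list of lists
data = [['a1', 'b1', 'c1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3']]
df = pd.DataFrame(data, columns=['C1', 'C2', 'C3'], index=['R1', 'R2', 'R3'])
print(df)
OUTPUT :
    C1  C2  C3
R1  a1  b1  c1
R2  a2  b2  c2
R3  a3  b3  c3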
Example 3.5: Create DataFrame from List of Lists with Different List Lengths
import pandas as pd
#list of lists
data = [['a1', 'b1', 'c1', 'd1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3']]
df = pd.DataFrame(data)
print(df)
OUTPUT :
    0   1   2     3
0  a1  b1  c1    d1
1  a2  b2  c2  None
2  a3  b3  c3  None
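The empty cells produced by the shorter lists can be counted or filled afterwards. The sketch below assumes the same list of lists as above; the placeholder text 'missing' is an arbitrary choice made for illustration.

```python
import pandas as pd

data = [['a1', 'b1', 'c1', 'd1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3']]
df = pd.DataFrame(data)

# isna() marks missing cells with True; sum() then counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# fillna() returns a copy in which missing cells hold the given placeholder.
filled = df.fillna('missing')
print(filled)
```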
Example 3.6: Create DataFrame from Dictionary
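import pandas as pd
df_marks = pd.DataFrame({'names': ['raju', 'ramu', 'ravi', 'akash'],
                         'physics': [68, 74, 77, 78],
                         'chemistry': [84, 56, 73, 69],
                         'algebra': [78, 88, 82, 87]})
print(df_marks)
OUTPUT :
   names  physics  chemistry  algebra
0   raju       68         84       78
1   ramu       74         56       88
2   ravi       77         73       82
3  akash       78         69       87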
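You can also get the number of rows or the number of columns by indexing the
"shape" attribute of the DataFrame: "shape[0]" is the number of rows and
"shape[1]" is the number of columns.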
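import pandas as pd
data = [['a1', 'b1', 'c1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3'],
        ['a4', 'b4', 'c4']]
df = pd.DataFrame(data, columns=['C1', 'C2', 'C3'],
                  index=['R1', 'R2', 'R3', 'R4'])
print('The DataFrame is :')
print(df)
print('Number of rows :', df.shape[0])
print('Number of columns :', df.shape[1])
OUTPUT :
The DataFrame is :
    C1  C2  C3
R1  a1  b1  c1
R2  a2  b2  c2
R3  a3  b3  c3
R4  a4  b4  c4
Number of rows : 4
Number of columns : 3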
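Because the shape is a (rows, columns) tuple, it can also be unpacked into two names in one step. This is a small sketch under the assumption that the DataFrame has the same four rows and three columns as in the output above; "len()" is shown as an alternative way to get the same counts.

```python
import pandas as pd

data = [['a1', 'b1', 'c1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3'],
        ['a4', 'b4', 'c4']]
df = pd.DataFrame(data, columns=['C1', 'C2', 'C3'],
                  index=['R1', 'R2', 'R3', 'R4'])

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# len(df) counts the rows; len(df.columns) counts the columns.
print(n_rows, len(df))
print(n_cols, len(df.columns))
```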
The DataFrame.info() method returns nothing (None); it just prints summary
information about the DataFrame.
import pandas as pd
df = pd.DataFrame(
[['abc', 22],
['xyz', 25],
['pqr', 31]],
columns=['name', 'age'])
print(df)
df.info()
OUTPUT :
   name  age
0   abc   22
1   xyz   25
2   pqr   31
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    3 non-null      object
 1   age     3 non-null      int64
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
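That info() prints rather than returns can be checked directly. The sketch below assumes the same name and age data as above and passes an in-memory text buffer through the "buf" parameter, so the report ends up in a string instead of on the screen.

```python
import io
import pandas as pd

df = pd.DataFrame(
    [['abc', 22],
     ['xyz', 25],
     ['pqr', 31]],
    columns=['name', 'age'])

# When a buffer is passed via "buf", info() writes its report there.
buffer = io.StringIO()
result = df.info(buf=buffer)

# The return value is None; the report text is held by the buffer.
report = buffer.getvalue()
print(result)
print(report)
```

Capturing the report this way can be useful when the summary should go to a log file rather than to standard output.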
import pandas as pd
data = [['a1', 'b1', 'c1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3'],
        ['a4', 'b4', 'c4']]
df = pd.DataFrame(data, columns=['C1', 'C2', 'C3'],
                  index=['R1', 'R2', 'R3', 'R4'])
print(df)
df.info()
OUTPUT :
    C1  C2  C3
R1  a1  b1  c1
R2  a2  b2  c2
R3  a3  b3  c3
R4  a4  b4  c4
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, R1 to R4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   C1      4 non-null      object
 1   C2      4 non-null      object
 2   C3      4 non-null      object
dtypes: object(3)
memory usage: 128.0+ bytes
import pandas as pd
df_marks = pd.DataFrame({'names': ['raju', 'ramu', 'ravi', 'akash'],
                         'physics': [68, 74, 77, 78],
                         'chemistry': [84, 56, 73, 69],
                         'algebra': [78, 88, 82, 87]})
print(df_marks)
df_marks.info()
OUTPUT :
   names  physics  chemistry  algebra
0   raju       68         84       78
1   ramu       74         56       88
2   ravi       77         73       82
3  akash       78         69       87
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   names      4 non-null      object
 1   physics    4 non-null      int64
 2   chemistry  4 non-null      int64
 3   algebra    4 non-null      int64
dtypes: int64(3), object(1)
memory usage: 256.0+ bytes
CSV (comma-separated values) files contain plain text and are a well-known
format that can be read by most tools, including Pandas.
import pandas as pd
#load the CSV file into a dataframe
df = pd.read_csv('pandas.csv')
#print dataframe
print(df)
OUTPUT :
   Name  maths  physics  chemisry
0     a     11       21        31
1     b     12       22        32
2     c     13       23        32
3     d     14       24        34
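To try this without a file on disk, the same kind of table can be read from a string. The CSV text below is an assumption inferred from the printed output above (the file contents themselves are not shown here), and "io.StringIO" is used only as a stand-in for a real file.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file: a header line followed by data rows.
csv_text = """Name,maths,physics,chemisry
a,11,21,31
b,12,22,32
c,13,23,32
d,14,24,34
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```

The first line of the text becomes the column names, and each following line becomes one row.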
Note: If a DataFrame has many rows, print(df) shows only the first 5 rows
and the last 5 rows:
import pandas as pd
#'data.csv' is a placeholder name; use the path to your own CSV file
df = pd.read_csv('data.csv')
#print dataframe
print(df)
OUTPUT :
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
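The first and last rows can also be requested explicitly instead of relying on the shortened printout. The sketch below uses a synthetic 169-row DataFrame (an assumption standing in for the real data set) together with "head()" and "tail()", and reads the "display.max_rows" option, which controls when the printout is shortened.

```python
import pandas as pd

# Synthetic stand-in for a large data set: 169 rows, one column.
df = pd.DataFrame({"value": range(169)})

# head() and tail() return the first and last rows (5 each by default).
first_rows = df.head()
last_rows = df.tail()
print(first_rows)
print(last_rows)

# display.max_rows is the row count above which print(df) is shortened.
print(pd.get_option("display.max_rows"))
```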
import pandas as pd
#'data.csv' is a placeholder name; use the path to your own CSV file
df = pd.read_csv('data.csv')
#print the entire dataframe
print(df.to_string())
OUTPUT :
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
…
…..
2. Create an Excel Writer with the name of the output Excel file to which
you would like to write the DataFrame.
3. Call the to_excel() function on the DataFrame, passing the Excel Writer as
an argument.
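import pandas as pd
# create dataframe
df_marks = pd.DataFrame({'name': ['raju', 'ramu', 'ravi', 'akash'],
                         'physics': [68, 74, 77, 78],
                         'chemistry': [84, 56, 73, 69],
                         'algebra': [78, 88, 82, 87]})
# write the dataframe through an Excel writer
# ('output.xlsx' is a placeholder name; an Excel engine such as openpyxl must be installed)
with pd.ExcelWriter('output.xlsx') as writer:
    df_marks.to_excel(writer)
print('DataFrame is written successfully to Excel File.')
print(df_marks)
OUTPUT :
DataFrame is written successfully to Excel File.
    name  physics  chemistry  algebra
0   raju       68         84       78
1   ramu       74         56       88
2   ravi       77         73       82
3  akash       78         69       87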
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('pandas.csv')
print(df)
df.plot()
plt.show()
OUTPUT :
   Name  maths  physics  chemisry
0     a     11       21        31
1     b     12       22        32
2     c     13       23        32
3     d     14       24        34