Pandas

In [1]: import numpy as np
        import pandas as pd

In [2]: print(pd.__version__)

1.4.2

Introduction to Pandas
Today, Python is considered the most popular programming language for data science work. The reason behind this popularity is that
Python provides excellent packages for data analysis and visualization.

Pandas is one of those packages that makes analysing data much easier. Pandas is an open source library for data analysis in Python. It was
developed by Wes McKinney in 2008. Over the years, it has become the standard library for data analysis using Python.

According to the Wikipedia page on Pandas,

"Pandas offers data structures and operations for manipulating numerical tables and time series. It is free software released under the
three-clause BSD license. The name is derived from the term 'panel data', an econometrics term for data sets that include observations
over multiple time periods for the same individuals."

In this project, I explore Pandas and various data analysis tools provided by Pandas.
Key features of Pandas
Some key features of Pandas are as follows:-

1. It provides tools for reading and writing data from a wide variety of sources such as CSV files, Excel files, SQL databases and JSON files.
2. It provides data structures like Series and DataFrame for data manipulation and indexing.
3. It can handle a wide variety of data sets in different formats – time series, heterogeneous data, tabular and matrix data.
4. It can perform a variety of operations on datasets, including subsetting, slicing, filtering, merging, joining, groupby, reordering and reshaping operations.
5. It can deal with missing data by either deleting it or filling it with zeros or a suitable summary statistic (such as the mean or median).
6. It can be used for parsing and conversion of data.
7. It provides data filtration techniques.
8. It provides time series functionality – date range generation, frequency conversion, moving window statistics, data shifting and lagging.
9. It integrates well with other Python libraries such as Scikit-learn, statsmodels and SciPy.
10. It delivers fast performance. Also, it can be sped up even more by making use of Cython (C extensions to Python).
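A compact sketch touching a couple of these features – missing-data filling and groupby aggregation. The data here is made up purely for illustration:

```python
import pandas as pd

# A tiny made-up frame to exercise a few of the listed features.
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai"],
    "sales": [10.0, None, 30.0],
})

# Missing data can be filled with a suitable statistic (feature 5).
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Subsetting, filtering and groupby operations (feature 4).
totals = df.groupby("city")["sales"].sum()
print(totals["Delhi"], totals["Mumbai"])   # 30.0 30.0
```

The missing Delhi value is filled with the mean of the non-missing values (20.0), so both cities total 30.0.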

Advantages of Pandas
Pandas is a core component of the Python data analysis toolkit. It provides data structures and operations that are particularly useful
for data analysis. There are several advantages of using Pandas for data analysis.

These advantages are as follows:-

Data representation

It represents data in a form that is very much suited for data analysis through its Dataframe and Series data structures.

Data subsetting and filtering

It provides easy subsetting and filtering of data, with procedures well suited to data analysis.

Concise and clear code

It provides functionality to write clear and concise code, allowing us to focus on the task at hand rather than having to write tedious code.

Data structures in Pandas
Pandas provides easy-to-use data structures.

There are two data structures in Pandas. They are:-

• Series
• Dataframe

These data structures are built on top of NumPy arrays, which means they are fast. I describe these data structures in the following sections.

Pandas Series
A Pandas Series is a one-dimensional array-like structure with homogeneous data.

The data can be of any single type (integer, string, float, etc.). The axis labels are collectively called the index.

For example, a series might hold the collection of integers 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.

Key Points of Pandas Series


• Homogeneous data
• The size of a Series is immutable
• The values in a Series are mutable
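A quick sketch of what these key points mean in practice (a minimal illustration, not from the original notebook):

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Values are mutable: an element can be reassigned in place.
s[0] = 99
print(s[0])              # 99

# The size is immutable in the sense that the underlying array cannot
# be resized in place; "appending" builds a brand-new Series instead.
s2 = pd.concat([s, pd.Series([40])], ignore_index=True)
print(len(s), len(s2))   # 3 4
```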

Creating Series
A Series is a one-dimensional labelled array. It can accommodate data of any type.

Creating an empty Series


In [3]: # Creating an empty series
        ser = pd.Series()
        ser

C:\Users\JAINIL PARIKH\AppData\Local\Temp\ipykernel_1944\1922658513.py:2: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  ser = pd.Series()

Out[3]: Series([], dtype: float64)

Creating a series from array

In [4]: # simple array
        data = np.array(['A', 'p', 'p', 'l', 'i', 'e', 'd'])

        ser = pd.Series(data)
        print(ser)

0 A
1 p
2 p
3 l
4 i
5 e
6 d
dtype: object

Creating a series from array with index


In [5]: # simple array
        data = np.array(['A', 'p', 'p', 'l', 'i', 'e', 'd'])

        # providing an index
        ser = pd.Series(data, index=[10, 11, 12, 13, 14, 15, 16])
        print(ser)

10 A
11 p
12 p
13 l
14 i
15 e
16 d
dtype: object

Creating a series from Lists

In [6]: # a simple list (named 'chars' to avoid shadowing the built-in list)
        chars = ['A', 'p', 'p', 'l', 'i', 'e', 'd']

        # create series from a list
        ser = pd.Series(chars)
        print(ser)

0 A
1 p
2 p
3 l
4 i
5 e
6 d
dtype: object

Creating a series from Dictionary


In [7]: # a simple dictionary {key: value} (named 'd' to avoid shadowing the built-in dict)
        d = {'ABC': 10,
             'DEF': 20,
             'GHI': 30}

        # create series from dictionary
        ser = pd.Series(d)
        print(ser)

ABC 10
DEF 20
GHI 30
dtype: int64

Creating a series from Scalar value

In [8]: # giving a scalar value with index
        ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
        print(ser)

0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64

Creating a series using NumPy functions


In [9]: # series with numpy linspace()
        ser1 = pd.Series(np.linspace(3, 33, 3))
        print(ser1)

        # series with numpy linspace()
        ser2 = pd.Series(np.linspace(1, 100, 10))
        print("\n", ser2)

0 3.0
1 18.0
2 33.0
dtype: float64

0 1.0
1 12.0
2 23.0
3 34.0
4 45.0
5 56.0
6 67.0
7 78.0
8 89.0
9 100.0
dtype: float64

In [10]: calories = {"day1": 420, "day2": 380, "day3": 390}

         myvar = pd.Series(calories)
         print(myvar)

day1 420
day2 380
day3 390
dtype: int64

Pandas DataFrame
A DataFrame is a two-dimensional data structure, so data is aligned in a tabular fashion in rows and columns. Its columns can be
heterogeneous – that is, of varying types. It is similar to a structured array in NumPy, with mutability added.

Properties of Dataframe are as follows:-


• A DataFrame is conceptually analogous to a table or spreadsheet of data.

• Its columns can be of different types – float64, int, bool, and so on.

• A DataFrame column is a Series structure.

• Its size is mutable – columns can be inserted and deleted.

• It has labelled axes (rows and columns).

• It can be thought of as a dictionary of Series structures in which both the rows and columns are indexed; the row labels are called the index and the column labels are called columns.

• It supports arithmetic operations on rows and columns.
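A small sketch illustrating a few of these properties – each column is a Series with its own dtype, and columns can be added and deleted:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

# Each column is a Series, and columns may have different dtypes.
print(type(df["a"]))                    # <class 'pandas.core.series.Series'>
print(df.dtypes["a"], df.dtypes["b"])   # int64 float64

# Size is mutable: columns can be inserted and deleted,
# and arithmetic works across columns.
df["c"] = df["a"] + df["b"]
del df["b"]
print(list(df.columns))                 # ['a', 'c']
```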

Dataframe Creation
A pandas Dataframe can be created using various inputs like −

• Lists

• dict

• Series

• Numpy ndarrays

• Another Dataframe

Creating DataFrame
The following examples show several ways of constructing a DataFrame.

In [11]: # list of strings
         lst = ['ball', 'bat', 'cat', 'dog',
                'blue', 'black', 'green']

         # Calling DataFrame constructor on list
         df = pd.DataFrame(lst)
         print(df)

0
0 ball
1 bat
2 cat
3 dog
4 blue
5 black
6 green

In [12]: # initialize data of lists.
         data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

         # Create DataFrame
         df = pd.DataFrame(data)

         # Print the output.
         df

Out[12]:
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
In [13]: # Initialize data of lists.
         data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

         # Create DataFrame
         df = pd.DataFrame(data)

         # Print the output.
         print(df)

Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18

In [14]: data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32],
                 'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
                 'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

         # Convert the dictionary into DataFrame
         df = pd.DataFrame(data)
         df

Out[14]:
     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd


In [15]: # initialize data of lists.
         data = {'Name': ['Tom', 'Jack', 'nick', 'juli'], 'marks': [99, 98, 95, 90]}

         # Creates pandas DataFrame.
         df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

         # print the data
         df

Out[15]:
       Name  marks
rank1   Tom     99
rank2  Jack     98
rank3  nick     95
rank4  juli     90

In [16]: # Initialize data of lists
         data = [{'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]

         # Creates pandas DataFrame by passing
         # lists of dictionaries and a row index.
         df = pd.DataFrame(data, index=['first', 'second'])

         # Print the data
         df

Out[16]:
         b   c     a
first    2   3   NaN
second  20  30  10.0
In [17]: # List1
         Name = ['tom', 'krish', 'nick', 'juli']

         # List2
         Age = [25, 30, 26, 22]

         # merge the two lists into tuples using zip();
         # note: a set is unordered, so the row order may vary between runs
         list_of_tuples = set(zip(Name, Age))
         print(list_of_tuples)

         # Converting the tuples into a
         # pandas DataFrame.
         df = pd.DataFrame(list_of_tuples,
                           columns=['Name', 'Age'])

         # Print data.
         df

{('tom', 25), ('juli', 22), ('nick', 26), ('krish', 30)}

Out[17]:
    Name  Age
0    tom   25
1   juli   22
2   nick   26
3  krish   30
In [18]: listScoville = [50, 5000, 500000]
         listName = ["Bell pepper", "Espelette pepper", "Chocolate habanero"]
         listFeeling = ["Not even spicy", "Uncomfortable", "Practically ate pepper spray"]

         dataFrame1 = pd.DataFrame(zip(listScoville, listName, listFeeling), columns=['Scoville', 'Name', 'Feeling'])

         # Print the dataframe
         dataFrame1

Out[18]:
   Scoville                Name                       Feeling
0        50         Bell pepper                Not even spicy
1      5000    Espelette pepper                 Uncomfortable
2    500000  Chocolate habanero  Practically ate pepper spray

In [19]: # Initialize data to a dict of Series.
         d = {'one': pd.Series([10, 20, 30, 40],
                               index=['a', 'b', 'c', 'd']),
              'two': pd.Series([1, 2, 30, 40],
                               index=['a', 'b', 'c', 'd'])}

         # creates DataFrame.
         df = pd.DataFrame(d)

         # print the data.
         df

Out[19]:
   one  two
a   10    1
b   20    2
c   30   30
d   40   40
Reading and Writing CSV Files
A Comma-Separated Values (CSV) file uses a comma as the delimiter by default, but other delimiters (separators), such as the semicolon (;),
can be used as well. Each row of the table is a new line of the CSV file, making it a very compact and concise way to represent tabular data.
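For example, a non-default delimiter can be handled with the `sep` parameter of `pd.read_csv`. A minimal sketch, using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

# A semicolon-delimited "file"; an in-memory buffer stands in for a real file.
csv_text = "city;state\nSacramento;California\nMiami;Florida\n"

df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df.shape)              # (2, 2)
print(list(df.columns))      # ['city', 'state']
```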

In [20]: titanic_data = pd.read_csv('titanic.csv')
         titanic_data.tail(10)

Out[20]:
     PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
881          882         0       3                         Markun, Mr. Johann    male  33.0      0      0            349257   7.8958   NaN        S
882          883         0       3               Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0              7552  10.5167   NaN        S
883          884         0       2              Banfield, Mr. Frederick James    male  28.0      0      0  C.A./SOTON 34068  10.5000   NaN        S
884          885         0       3                     Sutehall, Mr. Henry Jr    male  25.0      0      0   SOTON/OQ 392076   7.0500   NaN        S
885          886         0       3       Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250   NaN        Q
886          887         0       2                      Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
887          888         1       1               Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN        S
889          890         1       1                      Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                        Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

In [21]: cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
         cities

Out[21]:
         City       State
0  Sacramento  California
1       Miami     Florida
In [22]: cities.to_csv('Area.csv', index=False)

In [23]: data = pd.read_csv('Area.csv')
         data.head()

Out[23]:
         City       State
0  Sacramento  California
1       Miami     Florida

In [24]: data.columns = ['city', 'state']
         data

Out[24]:
         city       state
0  Sacramento  California
1       Miami     Florida

Reading and Writing Excel (XLSX) Files

The openpyxl, xlsxwriter and xlrd packages are required. They can be installed from within the notebook:

!pip install openpyxl xlsxwriter xlrd

In [25]: df = pd.DataFrame({'States': ['California', 'Florida', 'Montana', 'Colorodo', 'Washington', 'Virginia'],
                            'Capitals': ['Sacramento', 'Tallahassee', 'Helena', 'Denver', 'Olympia', 'Richmond'],
                            'Population': [508529, 193551, 32315, 619968, 52555, 227032]})
         df

Out[25]:
       States     Capitals  Population
0  California   Sacramento      508529
1     Florida  Tallahassee      193551
2     Montana       Helena       32315
3    Colorodo       Denver      619968
4  Washington      Olympia       52555
5    Virginia     Richmond      227032

In [26]: df.to_excel('states.xlsx')

In [27]: df.to_excel('./states.xlsx', sheet_name='States')

In [28]: df.to_excel('./states.xlsx', sheet_name='States', index=False)


In [29]: states = pd.read_excel('states.xlsx')
         states

Out[29]:
       States     Capitals  Population
0  California   Sacramento      508529
1     Florida  Tallahassee      193551
2     Montana       Helena       32315
3    Colorodo       Denver      619968
4  Washington      Olympia       52555
5    Virginia     Richmond      227032

Reading and Writing JSON Files


What is a JSON File?

JavaScript Object Notation (JSON) is a data format that stores data in a human-readable form. While it can technically be used for storage,
JSON files are primarily used for serialization and information exchange between a client and a server.

Although it was derived from JavaScript, it is platform-agnostic and a widely used format – most prevalently in REST APIs.

In [30]: import json

In [31]: data = {'employees': [{'name': 'John Doe', 'department': 'Marketing', 'place': 'Remote'},
                               {'name': 'Jane Doe', 'department': 'Software Engineering', 'place': 'Remote'},
                               {'name': 'Don Joe', 'department': 'Software Engineering', 'place': 'Office'}]}

         json_string = json.dumps(data)
         print(json_string)

{"employees": [{"name": "John Doe", "department": "Marketing", "place": "Remote"}, {"name": "Jane Doe", "department": "Software Engineering", "place": "Remote"}, {"name": "Don Joe", "department": "Software Engineering", "place": "Office"}]}
In [32]: with open('json_data.json', 'w') as outfile:
             json.dump(json_string, outfile)

In [33]: python_dictionary = json.loads(json_string)
         print(python_dictionary)

{'employees': [{'name': 'John Doe', 'department': 'Marketing', 'place': 'Remote'}, {'name': 'Jane Doe', 'department': 'Software Engineering', 'place': 'Remote'}, {'name': 'Don Joe', 'department': 'Software Engineering', 'place': 'Office'}]}

In [34]: with open('json_data.json') as json_file:
             data = json.load(json_file)
         print(data)

{"employees": [{"name": "John Doe", "department": "Marketing", "place": "Remote"}, {"name": "Jane Doe", "department": "Software Engineering", "place": "Remote"}, {"name": "Don Joe", "department": "Software Engineering", "place": "Office"}]}

Note that the result is printed with double quotes because In [32] dumped json_string – an already-encoded string – so json.load returns a string here rather than a dictionary.

In [35]: data = {'people': [{'name': 'Scott', 'website': 'stackabuse.com', 'from': 'Nebraska'}]}
         print(json.dumps(data, indent=3))

{
   "people": [
      {
         "name": "Scott",
         "website": "stackabuse.com",
         "from": "Nebraska"
      }
   ]
}

In [36]: patients = {
             "Name": {"0": "John", "1": "Nick", "2": "Ali", "3": "Joseph"},
             "Gender": {"0": "Male", "1": "Male", "2": "Female", "3": "Male"},
             "Nationality": {"0": "UK", "1": "French", "2": "USA", "3": "Brazil"},
             "Age": {"0": 10, "1": 25, "2": 35, "3": 29}
         }

In [37]: with open('patients.json', 'w') as f:
             json.dump(patients, f)

In [38]: patients_df = pd.read_json('patients.json')
         patients_df

Out[38]:
     Name  Gender Nationality  Age
0    John    Male          UK   10
1    Nick    Male      French   25
2     Ali  Female         USA   35
3  Joseph    Male      Brazil   29


In [39]: iris_data = pd.read_json("https://raw.githubusercontent.com/domoritz/maps/master/data/iris.json")
         iris_data

Out[39]:
     sepalLength  sepalWidth  petalLength  petalWidth    species
0            5.1         3.5          1.4         0.2     setosa
1            4.9         3.0          1.4         0.2     setosa
2            4.7         3.2          1.3         0.2     setosa
3            4.6         3.1          1.5         0.2     setosa
4            5.0         3.6          1.4         0.2     setosa
..           ...         ...          ...         ...        ...
145          6.7         3.0          5.2         2.3  virginica
146          6.3         2.5          5.0         1.9  virginica
147          6.5         3.0          5.2         2.0  virginica
148          6.2         3.4          5.4         2.3  virginica
149          5.9         3.0          5.1         1.8  virginica

150 rows × 5 columns


In [40]: import seaborn as sns

         dataset = sns.load_dataset('tips')
         dataset.head()

Out[40]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

In [41]: dataset.to_json('tips.json')

In [42]: with open('tips.json') as json_file:
             data = json.load(json_file)
         print(data)

{'total_bill': {'0': 16.99, '1': 10.34, '2': 21.01, '3': 23.68, '4': 24.59, '5': 25.29, '6': 8.77, '7': 26.88, '8': 15.04, '9': 14.78, '10': 10.27, '11': 35.26, '12': 15.42, '13': 18.43, '14': 14.83, '15': 21.58, '16': 10.33, '17': 16.29, '18': 16.97, '19': 20.65, ...

Reading and Writing pickle Files

The Python pickle module is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved
on disk. What pickle does is "serialize" the object before writing it to file. Pickling is a way to convert a Python object (list, dict, etc.) into a
character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another Python script.

Why pickle? In real-world scenarios, pickling and unpickling are widely used, as they allow us to easily transfer data from one
server/system to another and then store it in a file or database.
In [43]: # dictionary of data
         dct = {"f1": range(6), "b1": range(6, 12)}

         # forming dataframe
         data = pd.DataFrame(dct)

         # using the to_pickle function to create
         # a file named 'pickle_data'
         pd.to_pickle(data, './pickle_data.pkl')

In [44]: # unpickle the data by using the
         # pd.read_pickle method
         unpickled_data = pd.read_pickle("./pickle_data.pkl")
         print(unpickled_data)

f1 b1
0 0 6
1 1 7
2 2 8
3 3 9
4 4 10
5 5 11

In [45]: # dictionary of data
         dct = {'ID': {0: 23, 1: 43, 2: 12, 3: 13, 4: 67, 5: 89, 6: 90, 7: 56, 8: 34},
                'Name': {0: 'Ram', 1: 'Deep', 2: 'Yash', 3: 'Aman', 4: 'Arjun', 5: 'Aditya', 6: 'Divya', 7: 'Chalsea', 8: 'Akash'},
                'Marks': {0: 89, 1: 97, 2: 45, 3: 78, 4: 56, 5: 76, 6: 100, 7: 87, 8: 81},
                'Grade': {0: 'B', 1: 'A', 2: 'F', 3: 'C', 4: 'E', 5: 'C', 6: 'A', 7: 'B', 8: 'B'}}

         # forming dataframe
         data = pd.DataFrame(dct)

         # using the to_pickle function to create
         # a file named 'pickle_file'
         pd.to_pickle(data, './pickle_file.pkl')
In [46]: # unpickle the data by using the
         # pd.read_pickle method
         unpickled_data = pd.read_pickle("./pickle_file.pkl")
         print(unpickled_data)

   ID     Name  Marks Grade
0  23      Ram     89     B
1  43     Deep     97     A
2  12     Yash     45     F
3  13     Aman     78     C
4  67    Arjun     56     E
5  89   Aditya     76     C
6  90    Divya    100     A
7  56  Chalsea     87     B
8  34    Akash     81     B

Check the type of df


In [47]: titanic = pd.read_csv(r'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
         titanic.head()

Out[47]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S


In [48]: type(titanic)

Out[48]: pandas.core.frame.DataFrame

In [49]: titanic[['Age', 'Survived']].head()

Out[49]:
    Age  Survived
0  22.0         0
1  38.0         1
2  26.0         1
3  35.0         1
4  35.0         0

Check shape of dataframe


In [50]: titanic.shape

Out[50]: (891, 12)

View concise summary of dataframe


We can view a concise summary of the dataframe with the info() method as follows:-

In [51]: titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [52]: titanic.describe()

Out[52]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200


Check missing values with pandas
We can check the total number of missing values in each column in the dataset with the following command:-

In [53]: titanic.isnull().sum()

Out[53]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

isna() and notna() functions to detect 'NA' values

Pandas provides isna() and notna() functions to detect 'NA' values.

These are also methods on Series and DataFrame objects.

Examples of isna() and notna() commands:

detect 'NA' values in the dataframe:
df.isna().sum()

detect 'NA' values in a particular column of the dataframe:
pd.isna(df['col_name'])
df['col_name'].notna()
In [54]: titanic.isna().sum()

Out[54]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

In [55]: pd.isna(titanic["Fare"])

Out[55]: 0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Fare, Length: 891, dtype: bool
In [56]: titanic["Age"].isna()

Out[56]: 0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 True
889 False
890 False
Name: Age, Length: 891, dtype: bool

In [58]: titanic["Age"].notna()

Out[58]: 0 True
1 True
2 True
3 True
4 True
...
886 True
887 True
888 False
889 True
890 True
Name: Age, Length: 891, dtype: bool

Handling Missing Data - fillna, dropna


In [59]: df = pd.read_csv("weather_data.csv", parse_dates=['day'])
         df

Out[59]:
         day  temperature  windspeed   event
0 2017-01-01         32.0        6.0    Rain
1 2017-01-04          NaN        9.0   Sunny
2 2017-01-05         28.0        NaN    Snow
3 2017-01-06          NaN        7.0     NaN
4 2017-01-07         32.0        NaN    Rain
5 2017-01-08          NaN        NaN   Sunny
6 2017-01-09          NaN        NaN     NaN
7 2017-01-10         34.0        8.0  Cloudy
8 2017-01-11         40.0       12.0   Sunny


Fill NA

In [60]: new_df = df.fillna(0)
         new_df

Out[60]:
         day  temperature  windspeed   event
0 2017-01-01         32.0        6.0    Rain
1 2017-01-04          0.0        9.0   Sunny
2 2017-01-05         28.0        0.0    Snow
3 2017-01-06          0.0        7.0       0
4 2017-01-07         32.0        0.0    Rain
5 2017-01-08          0.0        0.0   Sunny
6 2017-01-09          0.0        0.0       0
7 2017-01-10         34.0        8.0  Cloudy
8 2017-01-11         40.0       12.0   Sunny

Fill a specific value in each column


In [61]: new_df = df.fillna({'temperature': 0, 'windspeed': 0, 'event': 'No Event'})
         new_df

Out[61]:
         day  temperature  windspeed     event
0 2017-01-01         32.0        6.0      Rain
1 2017-01-04          0.0        9.0     Sunny
2 2017-01-05         28.0        0.0      Snow
3 2017-01-06          0.0        7.0  No Event
4 2017-01-07         32.0        0.0      Rain
5 2017-01-08          0.0        0.0     Sunny
6 2017-01-09          0.0        0.0  No Event
7 2017-01-10         34.0        8.0    Cloudy
8 2017-01-11         40.0       12.0     Sunny

Drop NA

In [62]: new_df = df.dropna()
         new_df

Out[62]:
         day  temperature  windspeed   event
0 2017-01-01         32.0        6.0    Rain
7 2017-01-10         34.0        8.0  Cloudy
8 2017-01-11         40.0       12.0   Sunny

Handling Missing Data - replace method


Replacing single value

In [63]: df = pd.read_csv("weather_data_missing.csv")
         df

Out[63]:
        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017       -99999          7  Sunny
2  1/3/2017           28     -99999   Snow
3  1/4/2017       -99999          7      0
4  1/5/2017           32     -99999   Rain
5  1/6/2017           31          2  Sunny
6  1/6/2017           34          5      0

In [64]: new_df = df.replace(-99999, value=np.NaN)
         new_df

Out[64]:
        day  temperature  windspeed  event
0  1/1/2017         32.0        6.0   Rain
1  1/2/2017          NaN        7.0  Sunny
2  1/3/2017         28.0        NaN   Snow
3  1/4/2017          NaN        7.0      0
4  1/5/2017         32.0        NaN   Rain
5  1/6/2017         31.0        2.0  Sunny
6  1/6/2017         34.0        5.0      0

Replacing a list of values with a single value


In [65]: new_df = df.replace(to_replace=[-99999, -88888], value=0)
         new_df

Out[65]:
        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017            0          7  Sunny
2  1/3/2017           28          0   Snow
3  1/4/2017            0          7      0
4  1/5/2017           32          0   Rain
5  1/6/2017           31          2  Sunny
6  1/6/2017           34          5      0

Replacing per column


In [66]: new_df = df.replace({
             'temperature': -99999,
             'windspeed': -99999,
             'event': '0'
         }, np.nan)
         new_df

Out[66]:
        day  temperature  windspeed  event
0  1/1/2017         32.0        6.0   Rain
1  1/2/2017          NaN        7.0  Sunny
2  1/3/2017         28.0        NaN   Snow
3  1/4/2017          NaN        7.0    NaN
4  1/5/2017         32.0        NaN   Rain
5  1/6/2017         31.0        2.0  Sunny
6  1/6/2017         34.0        5.0    NaN

Indexing and slicing in pandas


In this section, I will discuss how to slice and dice the data and get subsets of a pandas dataframe.

Pandas provides three types of multi-axis indexing, listed below:-

1. .loc - label based
2. .iloc - integer based
3. .ix - both label and integer based

Starting with pandas 0.20.0, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers. So, I will not discuss it here and limit the
discussion to the .loc and .iloc indexers.

Label based indexing using .loc indexer


Pandas provides the .loc indexer for purely label based indexing. When slicing, the start bound is also included. Integers are valid labels, but they
refer to the label and not the position.

The .loc indexer has multiple access methods, such as −

• A single scalar label
• A list of labels
• A slice object
• A Boolean array

Syntax-

.loc takes two single/list/range operators separated by ','. The first one indicates the rows and the second one indicates the columns.

Below are examples of selecting data using the .loc indexer:-

In [67]: # select first row of dataframe
         titanic.loc[0]

Out[67]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
In [68]: # select first five rows for a specific column
         titanic.loc[:4, 'Name']

Out[68]:
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

In [69]: # select all rows for a specific column
         titanic.loc[:, 'Name']

Out[69]:
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
In [70]: # Select all rows for multiple columns, given as a list []
         titanic.loc[:, ['Age', 'Name']]

Out[70]:
      Age                                               Name
0    22.0                            Braund, Mr. Owen Harris
1    38.0  Cumings, Mrs. John Bradley (Florence Briggs Th...
2    26.0                             Heikkinen, Miss. Laina
3    35.0       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4    35.0                           Allen, Mr. William Henry
..    ...                                                ...
886  27.0                              Montvila, Rev. Juozas
887  19.0                       Graham, Miss. Margaret Edith
888   NaN           Johnston, Miss. Catherine Helen "Carrie"
889  26.0                              Behr, Mr. Karl Howell
890  32.0                                Dooley, Mr. Patrick

891 rows × 2 columns


In [71]: # Select first five rows for multiple columns, given as a list []
         titanic.loc[[0, 1, 2, 3, 4], ['Age', 'Name']]

Out[71]:
    Age                                               Name
0  22.0                            Braund, Mr. Owen Harris
1  38.0  Cumings, Mrs. John Bradley (Florence Briggs Th...
2  26.0                             Heikkinen, Miss. Laina
3  35.0       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4  35.0                           Allen, Mr. William Henry

In [72]: # Select a range of rows for all columns
         titanic.loc[0:4]   # equivalent to df.head()

Out[72]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Integer position based indexing using .iloc indexer


Pandas provides the .iloc indexer for integer position based indexing.
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a
requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing. Allowed inputs for the .iloc indexer are:-

• An integer e.g. 5.

• A list or array of integers [4, 3, 0].

• A slice object with ints 1:7.

• A boolean array.

In [73]: # select first row of dataframe
         titanic.iloc[0]

Out[73]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
In [74]: # select last row of dataframe
         titanic.iloc[-1]

Out[74]: PassengerId                    891
         Survived                         0
         Pclass                           3
         Name           Dooley, Mr. Patrick
         Sex                           male
         Age                           32.0
         SibSp                            0
         Parch                            0
         Ticket                      370376
         Fare                          7.75
         Cabin                          NaN
         Embarked                         Q
         Name: 890, dtype: object

In [75]: 1 # select first column of dataframe


2 ​
3 titanic.iloc[:,0]
4 ​

Out[75]: 0 1
1 2
2 3
3 4
4 5
...
886 887
887 888
888 889
889 890
890 891
Name: PassengerId, Length: 891, dtype: int64
In [76]: 1 #select second last column of dataframe
2 ​
3 titanic.iloc[:,-2]
4 ​

Out[76]: 0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object

In [77]: 1 # select first five rows of dataframe


2 ​
3 titanic.iloc[0:5]
4 ​

Out[77]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

Cumings, Mrs. John Bradley (Florence


1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Briggs Th...

STON/O2.
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S
3101282

Futrelle, Mrs. Jacques Heath (Lily May


3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Peel)

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S


In [78]: 1 # select first three columns of data frame with all rows
2 ​
3 titanic.iloc[:, 0:3]
4 ​

Out[78]:
PassengerId Survived Pclass

0 1 0 3

1 2 1 1

2 3 1 3

3 4 1 1

4 5 0 3

... ... ... ...

886 887 0 2

887 888 1 1

888 889 0 3

889 890 1 1

890 891 0 3

891 rows × 3 columns


In [79]: 1 # select 1st, 5th and 10th rows with 1st, 4th and 7th columns
2 ​
3 titanic.iloc[[0,4,9], [0,3,6]]
4 ​

Out[79]:
PassengerId Name SibSp

0 1 Braund, Mr. Owen Harris 1

4 5 Allen, Mr. William Henry 0

9 10 Nasser, Mrs. Nicholas (Adele Achem) 1

In [80]: 1 # select first 5 rows and columns at positions 5, 6 and 7
2 ​
3 titanic.iloc[0:5, 5:8]
4 ​

Out[80]:
Age SibSp Parch

0 22.0 1 0

1 38.0 1 0

2 26.0 0 0

3 35.0 1 0

4 35.0 0 0

Indexing a single value with .at and .iat


Pandas provides the .at and .iat indexers to access a single value for a row/column pair: .at by label and .iat by integer position.
In [81]: 1 # get value at row label 1 and the 'Name' column
2 ​
3 titanic.at[1, 'Name']
4 ​

Out[81]: 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [82]: 1 # get value at row position 1 and column position 3


2 ​
3 titanic.iat[1, 3]
4 ​

Out[82]: 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

Indexing the first occurrence of maximum or minimum values with idxmax() and idxmin()


Pandas provides two functions, idxmax() and idxmin(), that return the index of the first occurrence of the maximum or minimum value over the
requested axis. NA/null values are excluded from the output.

In [83]: 1 # get index of first occurrence of maximum value


2 ​
3 titanic['Age'].idxmax()
4 ​

Out[83]: 630

In [84]: 1 titanic.at[630, 'Age']

Out[84]: 80.0

In [85]: 1 titanic['Age'].idxmin()

Out[85]: 803
In [86]: 1 titanic.at[803, 'Age']

Out[86]: 0.42

In [87]: 1 # get the row with the minimum value


2 ​
3 titanic.loc[titanic['Age'].idxmin()] #titanic.loc[803]
4 ​

Out[87]: PassengerId 804


Survived 1
Pclass 3
Name Thomas, Master. Assad Alexander
Sex male
Age 0.42
SibSp 0
Parch 1
Ticket 2625
Fare 8.5167
Cabin NaN
Embarked C
Name: 803, dtype: object

Boolean indexing in pandas


Boolean indexing is the use of boolean vectors to filter and select the data. The operators for boolean indexing are -

1. | for or,
2. & for and,
3. ~ for not.

These must be grouped by using parentheses. Using a boolean vector to index a Series works exactly as in a NumPy ndarray.

Conditional selection with boolean arrays using df.loc[selection] is the most common method used with Pandas DataFrames. With boolean
indexing (logical selection), we pass an array or Series of True/False values to the .loc indexer, which returns the rows where the Series has True
values. We can then make selections based on the values of different columns in the dataset.
A second argument can be passed to the .loc indexer to select particular columns of the matching rows. The columns are referred to by label and
can be a single string, a list of column names, or a slice (the ":" operator).
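The | and ~ operators are not exercised in the cells below, so here is a small sketch on made-up data (not the Titanic frame) showing both, with each condition wrapped in its own parentheses:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 71.0],
                   'Sex': ['male', 'female', 'female', 'male']})

# each condition must sit inside its own parentheses before combining
either = df[(df['Age'] < 25) | (df['Age'] > 70)]   # "or"
not_male = df[~(df['Sex'] == 'male')]              # "not"
print(either)
print(not_male)
```

Without the parentheses, Python's operator precedence would apply | or ~ to the raw values first and raise an error.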
In [88]: 1 age = titanic["Age"] == 30
2 titanic[age]
3 ​

Out[88]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

79 80 1 3 Dowdell, Miss. Elizabeth female 30.0 0 0 364516 12.4750 NaN S

SOTON/OQ
157 158 0 3 Corn, Mr. Harry male 30.0 0 0 8.0500 NaN S
392090

178 179 0 2 Hale, Mr. Reginald male 30.0 0 0 250653 13.0000 NaN S

213 214 0 2 Givard, Mr. Hans Kristensen male 30.0 0 0 250646 13.0000 NaN S

219 220 0 2 Harris, Mr. Walter male 30.0 0 0 W/C 14208 10.5000 NaN S

244 245 0 3 Attalah, Mr. Sleiman male 30.0 0 0 2694 7.2250 NaN C

253 254 0 3 Lobb, Mr. William Arthur male 30.0 1 0 A/5. 3336 16.1000 NaN S

257 258 1 1 Cherry, Miss. Gladys female 30.0 0 0 110152 86.5000 B77 S

286 287 1 3 de Mulder, Mr. Theodore male 30.0 0 0 345774 9.5000 NaN S

308 309 0 2 Abelson, Mr. Samuel male 30.0 1 0 P/PP 3381 24.0000 NaN C

309 310 1 1 Francatelli, Miss. Laura Mabel female 30.0 0 0 PC 17485 56.9292 E36 C

322 323 1 2 Slayter, Miss. Hilda Mary female 30.0 0 0 234818 12.3500 NaN Q

365 366 0 3 Adahl, Mr. Mauritz Nils Martin male 30.0 0 0 C 7076 7.2500 NaN S

418 419 0 2 Matthews, Mr. William John male 30.0 0 0 28228 13.0000 NaN S

452 453 0 1 Foreman, Mr. Benjamin Laventall male 30.0 0 0 113051 27.7500 C111 C

488 489 0 3 Somerton, Mr. Francis William male 30.0 0 0 A.5. 18509 8.0500 NaN S

520 521 1 1 Perreault, Miss. Anne female 30.0 0 0 12749 93.5000 B73 S

534 535 0 3 Cacic, Miss. Marija female 30.0 0 0 315084 8.6625 NaN S

537 538 1 1 LeRoy, Miss. Bertha female 30.0 0 0 PC 17761 106.4250 NaN C

606 607 0 3 Karaic, Mr. Milan male 30.0 0 0 349246 7.8958 NaN S

Renouf, Mrs. Peter Henry (Lillian


726 727 1 2 female 30.0 3 0 31027 21.0000 NaN S
Jefferys)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

747 748 1 2 Sinkkonen, Miss. Anna female 30.0 0 0 250648 13.0000 NaN S

798 799 0 3 Ibrahim Shawah, Mr. Yousseff male 30.0 0 0 2685 7.2292 NaN C

Van Impe, Mrs. Jean Baptiste


799 800 0 3 female 30.0 1 1 345773 24.1500 NaN S
(Rosalie Paula Go...

842 843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 113798 31.0000 NaN C

In [89]: 1 age = titanic["Age"] > 70


2 titanic[age]

Out[89]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C

116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q

493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C

630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S

851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
In [90]: 1 titanic.loc[((titanic["Age"] > 50) & (titanic["Sex"] == "female") & (titanic["Survived"] == 1))]

Out[90]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S

15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S

PC
195 196 1 1 Lurette, Miss. Elise female 58.0 0 0 146.5208 B80 C
17569

Graham, Mrs. William Thompson (Edith PC


268 269 1 1 female 58.0 0 1 153.4625 C125 S
Junkins) 17582

275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S

Warren, Mrs. Frank Manley (Anna Sophia


366 367 1 1 female 60.0 1 0 110813 75.2500 D37 C
Atkinson)

483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S

496 497 1 1 Eustis, Miss. Elizabeth Mussey female 54.0 1 0 36947 78.2667 D20 C

PC
513 514 1 1 Rothschild, Mrs. Martin (Elizabeth L. Barrett) female 54.0 1 0 59.4000 NaN C
17603

571 572 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.0 2 0 11769 51.4792 C101 S

Stephenson, Mrs. Walter Bertram (Martha


591 592 1 1 female 52.0 1 0 36947 78.2667 D20 C
Eustis)

765 766 1 1 Hogeboom, Mrs. John C (Anna Andrews) female 51.0 1 0 13502 77.9583 D11 S

774 775 1 2 Hocking, Mrs. Elizabeth (Eliza Needs) female 54.0 1 3 29105 23.0000 NaN S

Hays, Mrs. Charles Melville (Clara Jennings


820 821 1 1 female 52.0 1 1 12749 93.5000 B69 S
Gr...

829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN

879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C

Indexing with isin() method


The isin() method of Series returns a boolean vector that is True wherever the Series elements exist in the passed list. This lets us select rows
where one or more columns contain the values we are interested in. The same method is available on Index objects, which is useful when we
don't know in advance which of the sought labels are actually present.

DataFrame also has an isin() method. When calling isin, we pass a set of values as either an array or dict. If values is an array, isin returns a
DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.
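The dict form of DataFrame.isin() mentioned above can be sketched on a tiny made-up frame like this, where each key names a column and each value lists the values allowed in it:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 1, 2],
                   'Embarked': ['S', 'C', 'Q']})

# dict form: keys are column names, values are the allowed values per column
mask = df.isin({'Pclass': [1, 2], 'Embarked': ['C']})
print(mask)
```

The result is a boolean DataFrame of the same shape, which can be combined with .all(axis=1) or .any(axis=1) to filter rows.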

In [91]: 1 # creating a bool series from isin()


2 new = titanic["Sex"].isin(["male"])
3 ​
4 # displaying data with gender = male only
5 titanic[new]
6 ​

Out[91]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S

7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

... ... ... ... ... ... ... ... ... ... ... ... ...

883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S

884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

577 rows × 12 columns


In [92]: 1 # creating filters of bool series from isin()
2 filter1 = titanic["Sex"].isin(["female"])
3 filter2 = titanic["Pclass"].isin([1])
4 # displaying data with both filter applied and mandatory
5 titanic[filter1 & filter2]

Out[92]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

Cumings, Mrs. John Bradley (Florence Briggs PC


1 2 1 1 female 38.0 1 0 71.2833 C85 C
Th... 17599

3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S

Spencer, Mrs. William Augustus (Marie PC


31 32 1 1 female NaN 1 0 146.5208 B78 C
Eugenie) 17569

PC
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 76.7292 D33 C
17572

... ... ... ... ... ... ... ... ... ... ... ... ...

856 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45.0 1 1 36928 164.8667 NaN S

Swift, Mrs. Frederick Joel (Margaret Welles


862 863 1 1 female 48.0 0 0 17466 25.9292 D17 S
Ba...

Beckwith, Mrs. Richard Leonard (Sallie


871 872 1 1 female 47.0 1 1 11751 52.5542 D35 S
Monypeny)

879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C

887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S

94 rows × 12 columns

Indexing with query() method


There is a query() method in the DataFrame objects that allows selection using an expression. This method queries the columns of a DataFrame
with a boolean expression.

In [93]: 1 # filtering with query method


2 titanic.query("Sex == 'male'", inplace = True)
3 ​
4 # display
5 titanic
6 ​

Out[93]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S

7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

... ... ... ... ... ... ... ... ... ... ... ... ...

883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S

884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

577 rows × 12 columns


In [94]: 1 # filtering with query method
2 titanic.query("Sex == 'male' and Embarked == 'C'", inplace = True)
3 ​
4 # display
5 titanic
6 ​

Out[94]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C

30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C

34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C

36 37 1 3 Mamee, Mr. Hanna male NaN 0 0 2677 7.2292 NaN C

42 43 0 3 Kraeff, Mr. Theodor male NaN 0 0 349253 7.8958 NaN C

... ... ... ... ... ... ... ... ... ... ... ... ...

839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C

843 844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 NaN C

847 848 0 3 Markoff, Mr. Marin male 35.0 0 0 349213 7.8958 NaN C

859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

95 rows × 12 columns
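query() can also reference local Python variables with the @ prefix; this is a general pandas feature not used elsewhere in this notebook, so here is a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                   'Fare': [7.25, 71.28, 7.93, 53.10]})

cutoff = 30
# '@' lets the expression reference local Python variables
older = df.query("Age > @cutoff")
print(older)
```

This avoids building the expression string by hand with formatting.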

Indexing and reindexing in pandas


Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a
particular axis.

Reindexing accomplishes several operations:

• Reordering the existing data to match a new set of labels.

• Inserting missing value (NA) markers in label locations where no data existed for a label.
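The reindex() method itself is not called in the cells below, so here is a minimal sketch of both behaviours (reordering and NA insertion) on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

# 'w' has no existing data, so it is filled with NaN; 'x' is dropped
out = df.reindex(['z', 'y', 'w'])
print(out)
```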

In [95]: 1 # let's create a new dataframe


2 ​
3 food = pd.DataFrame({'Place':['Home', 'Home', 'Hotel', 'Hotel'],
4 'Time': ['Lunch', 'Dinner', 'Lunch', 'Dinner'],
5 'Food':['Soup', 'Rice', 'Soup', 'Chapati'],
6 'Price($)':[10, 20, 30, 40]})
7 ​
8 food
9 ​

Out[95]:
Place Time Food Price($)

0 Home Lunch Soup 10

1 Home Dinner Rice 20

2 Hotel Lunch Soup 30

3 Hotel Dinner Chapati 40

Set an index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). This method sets
the dataframe index using existing columns.

I will create a new, re-indexed DataFrame with set_index() method as follows:-


In [96]: 1 food_indexed1=food.set_index('Place')
2 ​
3 food_indexed1
4 ​

Out[96]:
Time Food Price($)

Place

Home Lunch Soup 10

Home Dinner Rice 20

Hotel Lunch Soup 30

Hotel Dinner Chapati 40

In [97]: 1 food_indexed2=food.set_index(['Place', 'Time'])


2 ​
3 food_indexed2
4 ​

Out[97]:
Food Price($)

Place Time

Home Lunch Soup 10

Dinner Rice 20

Hotel Lunch Soup 30

Dinner Chapati 40

Reset the index

There is a function called reset_index() which transfers the index values into the DataFrame’s columns and sets a simple integer index. This is the
inverse operation of set_index().
In [98]: 1 food_indexed2.reset_index()

Out[98]:
Place Time Food Price($)

0 Home Lunch Soup 10

1 Home Dinner Rice 20

2 Hotel Lunch Soup 30

3 Hotel Dinner Chapati 40

Sorting in pandas
Pandas provides two kinds of sorting:

1. Sorting by label
2. Sorting by value

They are described below.

1. Sorting by label
We can use the sort_index() method to sort the object by its labels. A DataFrame can be sorted by passing the axis argument and the order of
sorting. By default, sorting is done on row labels in ascending order.

The following examples illustrate sorting by label.


In [99]: 1 # sort the dataframe df2 by label
2 ​
3 titanic.sort_index()
4 ​

Out[99]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C

30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C

34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C

36 37 1 3 Mamee, Mr. Hanna male NaN 0 0 2677 7.2292 NaN C

42 43 0 3 Kraeff, Mr. Theodor male NaN 0 0 349253 7.8958 NaN C

... ... ... ... ... ... ... ... ... ... ... ... ...

839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C

843 844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 NaN C

847 848 0 3 Markoff, Mr. Marin male 35.0 0 0 349213 7.8958 NaN C

859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

95 rows × 12 columns

Order of sorting
The order of sorting can be controlled by passing a boolean value to the ascending parameter. For example, to sort the dataframe by label in
reverse order:

titanic.sort_index(ascending=False)

Sorting by columns
By passing the axis argument with a value of 0 or 1, the sorting can be done on the row or column labels. The default, axis=0, sorts by row labels;
setting axis=1 sorts by column labels instead:

titanic.sort_index(axis=1)

In [100]: 1 titanic.sort_index(ascending=False)

Out[100]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C

847 848 0 3 Markoff, Mr. Marin male 35.0 0 0 349213 7.8958 NaN C

843 844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 NaN C

839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C

... ... ... ... ... ... ... ... ... ... ... ... ...

42 43 0 3 Kraeff, Mr. Theodor male NaN 0 0 349253 7.8958 NaN C

36 37 1 3 Mamee, Mr. Hanna male NaN 0 0 2677 7.2292 NaN C

34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C

30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C

26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C

95 rows × 12 columns
In [101]: 1 titanic.sort_index(axis=1)

Out[101]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket

26 NaN NaN C 7.2250 Emir, Mr. Farred Chehab 0 27 3 male 0 0 2631

30 40.0 NaN C 27.7208 Uruchurtu, Don. Manuel E 0 31 1 male 0 0 PC 17601

34 28.0 NaN C 82.1708 Meyer, Mr. Edgar Joseph 0 35 1 male 1 0 PC 17604

36 NaN NaN C 7.2292 Mamee, Mr. Hanna 0 37 3 male 0 1 2677

42 NaN NaN C 7.8958 Kraeff, Mr. Theodor 0 43 3 male 0 0 349253

... ... ... ... ... ... ... ... ... ... ... ... ...

839 NaN C47 C 29.7000 Marechal, Mr. Pierre 0 840 1 male 0 1 11774

843 34.5 NaN C 6.4375 Lemberopolous, Mr. Peter L 0 844 3 male 0 0 2683

847 35.0 NaN C 7.8958 Markoff, Mr. Marin 0 848 3 male 0 0 349213

859 NaN NaN C 7.2292 Razi, Mr. Raihed 0 860 3 male 0 0 2629

889 26.0 C148 C 30.0000 Behr, Mr. Karl Howell 0 890 1 male 0 1 111369

95 rows × 12 columns

2. Sorting by values
The second method of sorting is sorting by values. Pandas provides the sort_values() method for this. It accepts a 'by' argument that takes the
column name (or list of column names) of the DataFrame by which the values are to be sorted.

The following examples illustrate the idea:-


In [102]: 1 titanic.sort_values(by=['Age'])

Out[102]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C

827 828 1 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C

731 732 0 3 Hassan, Mr. Houssein G N male 11.00 0 0 2699 18.7875 NaN C

125 126 1 3 Nicola-Yarred, Master. Elias male 12.00 1 0 2651 11.2417 NaN C

352 353 0 3 Elias, Mr. Tannous male 15.00 1 1 2695 7.2292 NaN C

... ... ... ... ... ... ... ... ... ... ... ... ...

773 774 0 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C

793 794 0 1 Hoyt, Mr. William Fisher male NaN 0 0 PC 17600 30.6958 NaN C

832 833 0 3 Saad, Mr. Amin male NaN 0 0 2671 7.2292 NaN C

839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C

859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C

95 rows × 12 columns
In [103]: 1 titanic.sort_values(by=['Age', 'Fare'])

Out[103]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C

827 828 1 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C

731 732 0 3 Hassan, Mr. Houssein G N male 11.00 0 0 2699 18.7875 NaN C

125 126 1 3 Nicola-Yarred, Master. Elias male 12.00 1 0 2651 11.2417 NaN C

352 353 0 3 Elias, Mr. Tannous male 15.00 1 1 2695 7.2292 NaN C

... ... ... ... ... ... ... ... ... ... ... ... ...

295 296 0 1 Lewy, Mr. Ervin G male NaN 0 0 PC 17612 27.7208 NaN C

839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C

793 794 0 1 Hoyt, Mr. William Fisher male NaN 0 0 PC 17600 30.6958 NaN C

766 767 0 1 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C

557 558 0 1 Robbins, Mr. Victor male NaN 0 0 PC 17757 227.5250 NaN C

95 rows × 12 columns
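sort_values() also accepts a per-column list for ascending and an na_position argument controlling where missing keys land; a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, None, 35.0, 35.0],
                   'Fare': [7.25, 8.05, 53.10, 8.05]})

# sort Age descending, break ties on Fare ascending,
# and put missing ages first instead of last
out = df.sort_values(by=['Age', 'Fare'],
                     ascending=[False, True],
                     na_position='first')
print(out)
```

By default na_position is 'last', which is why the NaN ages appear at the bottom of the Titanic outputs above.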

Categorical data in pandas


We can check the data types of variables in the dataset with the following command:-
In [104]: 1 titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 26 to 889
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 95 non-null int64
1 Survived 95 non-null int64
2 Pclass 95 non-null int64
3 Name 95 non-null object
4 Sex 95 non-null object
5 Age 69 non-null float64
6 SibSp 95 non-null int64
7 Parch 95 non-null int64
8 Ticket 95 non-null object
9 Fare 95 non-null float64
10 Cabin 32 non-null object
11 Embarked 95 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 9.6+ KB

In [105]: 1 titanic["Sex"].describe()

Out[105]: count 95
unique 1
top male
freq 95
Name: Sex, dtype: object

In [106]: 1 titanic["Embarked"].describe()

Out[106]: count 95
unique 1
top C
freq 95
Name: Embarked, dtype: object
Unique values in categorical data
We can get the unique values in a series object by unique() method. It returns categories in the order of appearance, and it only includes values
that are actually present.

In [107]: 1 titanic["Embarked"].unique()

Out[107]: array(['C'], dtype=object)

In [108]: 1 titanic["Pclass"].unique()

Out[108]: array([3, 1, 2], dtype=int64)

In [109]: 1 titanic['Age'].unique()

Out[109]: array([ nan, 40. , 28. , 65. , 28.5 , 22. , 26. , 71. , 23. ,
24. , 32.5 , 12. , 33. , 51. , 56. , 45.5 , 30. , 37. ,
36. , 23.5 , 15. , 29. , 25. , 27. , 20. , 49. , 58. ,
18. , 17. , 50. , 60. , 35. , 32. , 48. , 11. , 46. ,
0.42, 31. , 1. , 34.5 ])

In [110]: 1 titanic["Sex"] = titanic["Sex"].astype("string")


In [111]: 1 titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 26 to 889
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 95 non-null int64
1 Survived 95 non-null int64
2 Pclass 95 non-null int64
3 Name 95 non-null object
4 Sex 95 non-null string
5 Age 69 non-null float64
6 SibSp 95 non-null int64
7 Parch 95 non-null int64
8 Ticket 95 non-null object
9 Fare 95 non-null float64
10 Cabin 32 non-null object
11 Embarked 95 non-null object
dtypes: float64(2), int64(5), object(4), string(1)
memory usage: 9.6+ KB
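Besides the string dtype used above, a column with a small set of repeated values can be converted to pandas' dedicated category dtype, which stores each distinct value only once; a minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(['S', 'C', 'Q', 'S'])

# convert to the category dtype and inspect the inferred categories
cat = s.astype('category')
print(cat.dtype)
print(cat.cat.categories)
```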

Frequency counts of categorical data


Series methods like Series.value_counts() will return the frequency counts of the categories present in the series.

In [112]: 1 titanic['Sex'].value_counts()

Out[112]: male 95
Name: Sex, dtype: Int64

In [113]: 1 titanic['Survived'].value_counts()

Out[113]: 0 66
1 29
Name: Survived, dtype: int64
In [114]: 1 titanic['Pclass'].value_counts(ascending=True)

Out[114]: 2 10
1 42
3 43
Name: Pclass, dtype: int64
In [115]: 1 titanic['Age'].value_counts(ascending=True)

Out[115]: 15.00 1
31.00 1
0.42 1
46.00 1
11.00 1
48.00 1
32.00 1
60.00 1
50.00 1
18.00 1
1.00 1
23.50 1
37.00 1
45.50 1
34.50 1
32.50 1
51.00 1
28.00 1
12.00 1
65.00 1
28.50 1
33.00 2
29.00 2
58.00 2
56.00 2
17.00 2
24.00 2
23.00 2
71.00 2
22.00 3
26.00 3
40.00 3
49.00 3
20.00 3
27.00 3
36.00 3
35.00 3
25.00 4
30.00 4
Name: Age, dtype: int64
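value_counts() also supports normalize and dropna arguments; a quick sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([3, 1, 3, 3, 2, None])

# relative frequencies instead of raw counts (NaN excluded by default)
print(s.value_counts(normalize=True))

# include missing values in the counts
print(s.value_counts(dropna=False))
```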

Aggregations in pandas
Apply aggregation on a single column of a dataframe

In [116]: 1 titanic['Pclass'].aggregate(np.sum)

Out[116]: 191

In [117]: 1 titanic['Age'].aggregate(np.sum)

Out[117]: 2276.92

In [118]: 1 titanic['Fare'].aggregate(np.sum)

Out[118]: 4584.9003999999995

Apply multiple functions on a single column of a dataframe

In [119]: 1 titanic['Pclass'].aggregate([np.sum, np.mean])

Out[119]: sum 191.000000


mean 2.010526
Name: Pclass, dtype: float64

Apply aggregation on multiple columns of a dataframe


In [120]: 1 titanic[['Pclass', 'Age', 'Fare']].aggregate(np.mean)

Out[120]: Pclass 2.010526


Age 32.998841
Fare 48.262109
dtype: float64

Apply multiple functions on multiple columns of a dataframe

In [121]: 1 titanic[['Pclass', 'Age', 'Fare']].aggregate([np.sum, np.mean])

Out[121]:
Pclass Age Fare

sum 191.000000 2276.920000 4584.900400

mean 2.010526 32.998841 48.262109
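aggregate()/agg() can also take a dict mapping each column to its own list of functions; a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                   'Fare': [7.25, 71.28, 7.93]})

# different functions per column, passed as a dict of column -> functions;
# cells with no applicable function come back as NaN
out = df.agg({'Age': ['min', 'max'], 'Fare': ['mean']})
print(out)
```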

Pandas GroupBy operations


In [122]: 1 titanic.groupby('Sex').groups

Out[122]: {'male': [26, 30, 34, 36, 42, 48, 54, 57, 60, 64, 65, 73, 96, 97, 118, 122, 125, 130, 135, 139, 155, 174, 181, 203, 20
7, 209, 244, 273, 285, 292, 295, 296, 308, 352, 354, 361, 370, 373, 377, 378, 420, 452, 453, 455, 484, 487, 493, 495, 5
05, 522, 524, 531, 532, 544, 547, 550, 553, 557, 568, 583, 584, 587, 598, 599, 604, 620, 622, 632, 645, 647, 659, 661,
679, 681, 685, 693, 698, 709, 731, 737, 762, 766, 773, 789, 793, 798, 803, 817, 827, 832, 839, 843, 847, 859, 889]}
In [123]: 1 titanic.groupby(['Sex', 'Age','Pclass']).groups

Out[123]: {('male', nan, 3): [26, 36, 42, 48, 65, 354, 420, 495, 522, 524, 531, 568, 584, 598, 709, 773, 832, 859], ('male', 0.4
2, 3): [803], ('male', 1.0, 2): [827], ('male', 11.0, 3): [731], ('male', 15.0, 3): [352], ('male', 17.0, 1): [550],
('male', 17.0, 3): [532], ('male', 18.0, 1): [505], ('male', 20.0, 3): [378, 622, 762], ('male', nan, 2): [181, 547],
('male', 12.0, 3): [125], ('male', 22.0, 1): [373], ('male', 22.0, 3): [60, 553], ('male', 23.0, 1): [97], ('male', 23.
0, 2): [135], ('male', 24.0, 1): [118, 139], ('male', 25.0, 2): [685], ('male', 25.0, 3): [693], ('male', nan, 1): [64,
295, 557, 766, 793, 839], ('male', 23.5, 3): [296], ('male', 25.0, 1): [370, 484], ('male', 26.0, 1): [889], ('male', 2
6.0, 3): [73, 207], ('male', 27.0, 1): [377, 681], ('male', 27.0, 3): [620], ('male', 28.0, 1): [34], ('male', 28.5,
3): [57], ('male', 29.0, 2): [361], ('male', 29.0, 3): [455], ('male', 30.0, 1): [452], ('male', 30.0, 2): [308], ('mal
e', 30.0, 3): [244, 798], ('male', 31.0, 2): [817], ('male', 32.0, 1): [632], ('male', 32.5, 2): [122], ('male', 33.0,
3): [130, 285], ('male', 34.5, 3): [843], ('male', 35.0, 1): [604, 737], ('male', 35.0, 3): [847], ('male', 36.0, 1):
[583, 679], ('male', 36.0, 2): [292], ('male', 37.0, 1): [273], ('male', 40.0, 1): [30, 209], ('male', 40.0, 3): [661],
('male', 45.5, 3): [203], ('male', 46.0, 1): [789], ('male', 48.0, 1): [645], ('male', 49.0, 1): [453, 599, 698], ('mal
e', 50.0, 1): [544], ('male', 51.0, 1): [155], ('male', 56.0, 1): [174, 647], ('male', 58.0, 1): [487, 659], ('male', 6
0.0, 1): [587], ('male', 65.0, 1): [54], ('male', 71.0, 1): [96, 493]}

In [124]: 1 titanic.groupby('Sex').sum()

Out[124]:
PassengerId Survived Pclass Age SibSp Parch Fare

Sex

male 42896 29 191 2276.92 25 25 4584.9004

In [125]: 1 titanic.groupby('Sex')['Survived'].agg([np.sum, np.mean])

Out[125]:
sum mean

Sex

male 29 0.305263
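groupby().agg() also supports named aggregation, where each keyword argument defines one output column from a (column, function) pair; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3],
                   'Fare': [80.0, 60.0, 8.0, 7.0, 9.0]})

# named aggregation: each keyword becomes an output column
out = df.groupby('Pclass').agg(mean_fare=('Fare', 'mean'),
                               n=('Fare', 'size'))
print(out)
```

This gives readable column names directly, instead of the generic 'sum'/'mean' labels in the output above.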

Pandas merging and joining


Pandas has full-featured, high performance in-memory join operations that are very similar to relational databases like SQL. These methods perform
significantly better than other open source implementations like base::merge.data.frame in R. The reason for this is careful algorithmic design and
the internal layout of the data in DataFrame.

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects. The syntax of
the merge function is as follows:-

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

The description of the parameters used is as follows−

• left − A DataFrame object.

• right − Another DataFrame object.

• on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.

• left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the
DataFrame.

• right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the
DataFrame.

• left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical),
the number of levels must match the number of join keys from the right DataFrame.

• right_index − Same usage as left_index for the right DataFrame.

• how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner.

• sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; setting it to True can substantially degrade
performance in many cases.

Now, I will create two different DataFrames and perform the merging operations on them as follows:-
In [126]: 1 # let's create two dataframes
2 ​
3 batsmen = pd.DataFrame({ 'id':[1,2,3,4,5],
4 'Name': ['Rohit', 'Dhawan', 'Virat', 'Dhoni', 'Kedar'],
5 'subject_id':['sub1','sub2','sub4','sub6','sub5']})
6 ​
7 bowler = pd.DataFrame(
8 {'id':[1,2,3,4,5],
9 'Name': ['Kumar', 'Bumrah', 'Shami', 'Kuldeep', 'Chahal'],
10 'subject_id':['sub2','sub4','sub3','sub6','sub5']})
11 ​
12 ​
13 print(batsmen)
14 print("-------------------------")
15 print(bowler)
16 ​

id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
-------------------------
id Name subject_id
0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
In [127]: 1 # merge two dataframes on a key
2 ​
3 pd.merge(batsmen, bowler, on='id')
4 ​

Out[127]:
id Name_x subject_id_x Name_y subject_id_y

0 1 Rohit sub1 Kumar sub2

1 2 Dhawan sub2 Bumrah sub4

2 3 Virat sub4 Shami sub3

3 4 Dhoni sub6 Kuldeep sub6

4 5 Kedar sub5 Chahal sub5

In [128]: 1 # merge two dataframes on multiple keys


2 ​
3 pd.merge(batsmen, bowler, on=['id', 'subject_id'])
4 ​

Out[128]:
id Name_x subject_id Name_y

0 4 Dhoni sub6 Kuldeep

1 5 Kedar sub5 Chahal

Merge using 'how' argument


The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in
either the left or the right tables, the values in the joined table will be NA.

Here is a summary of the how options and their SQL equivalent names −

• Merge Method - SQL Equivalent - Description


• left - LEFT OUTER JOIN - Use keys from left object

• right - RIGHT OUTER JOIN - Use keys from right object

• outer - FULL OUTER JOIN - Use union of keys

• inner - INNER JOIN - Use intersection of keys

In [129]: 1 # left join


2 ​
3 pd.merge(batsmen, bowler, on='subject_id', how='left')
4 ​

Out[129]:
id_x Name_x subject_id id_y Name_y

0 1 Rohit sub1 NaN NaN

1 2 Dhawan sub2 1.0 Kumar

2 3 Virat sub4 2.0 Bumrah

3 4 Dhoni sub6 4.0 Kuldeep

4 5 Kedar sub5 5.0 Chahal


In [130]: 1 # right join
2 ​
3 pd.merge(batsmen, bowler, on='subject_id', how='right')
4 ​

Out[130]:
id_x Name_x subject_id id_y Name_y

0 2.0 Dhawan sub2 1 Kumar

1 3.0 Virat sub4 2 Bumrah

2 NaN NaN sub3 3 Shami

3 4.0 Dhoni sub6 4 Kuldeep

4 5.0 Kedar sub5 5 Chahal

In [131]: 1 # outer join


2 ​
3 pd.merge(batsmen, bowler, on='subject_id', how='outer')
4 ​

Out[131]:
id_x Name_x subject_id id_y Name_y

0 1.0 Rohit sub1 NaN NaN

1 2.0 Dhawan sub2 1.0 Kumar

2 3.0 Virat sub4 2.0 Bumrah

3 4.0 Dhoni sub6 4.0 Kuldeep

4 5.0 Kedar sub5 5.0 Chahal

5 NaN NaN sub3 3.0 Shami


In [132]: 1 # inner join
2 ​
3 pd.merge(batsmen, bowler, on='subject_id', how='inner')
4 ​

Out[132]:
id_x Name_x subject_id id_y Name_y

0 2 Dhawan sub2 1 Kumar

1 3 Virat sub4 2 Bumrah

2 4 Dhoni sub6 4 Kuldeep

3 5 Kedar sub5 5 Chahal
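All the merges above join on columns that share a name. When the key columns are named differently in the two frames, the left_on and right_on parameters name them explicitly. A small sketch with hypothetical data:

```python
import pandas as pd

# key columns with different names in each frame (hypothetical data)
runs = pd.DataFrame({'player_id': [1, 2, 3], 'runs': [45, 80, 12]})
wkts = pd.DataFrame({'bowler_id': [2, 3, 4], 'wickets': [1, 3, 2]})

# left_on / right_on join on differently named keys; both columns are kept
out = pd.merge(runs, wkts, left_on='player_id', right_on='bowler_id')
print(out)
```

The default inner join keeps only ids 2 and 3, which appear in both frames.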

Pandas concatenation operation


Pandas provides various facilities for easily combining Series and DataFrame objects. (Older versions also supported Panel objects, but Panel was removed in pandas 1.0.)

The concat() function does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or
intersection) of the indexes (if any) on the other axes.

The syntax of the concat() function is as follows:-

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

The description of the arguments is as follows:-

objs − This is a sequence or mapping of Series or DataFrame objects.

axis − {0, 1, ...}, default 0. This is the axis to concatenate along.

join − {'inner', 'outer'}, default 'outer'. How to handle indexes on other axis(es). Outer for union and inner for intersection.

ignore_index − boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.

join_axes − A list of index objects specifying the indexes to use for the other (n-1) axes instead of performing inner/outer set logic. Note that this argument was removed in pandas 1.0; reindex the result instead.
keys − sequence, default None. Construct a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, it should contain tuples.

levels − list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise, they will be inferred from the keys.

names − list, default None. Names for the levels in the resulting hierarchical index.

verify_integrity − boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

copy − boolean, default True. If False, do not copy data unnecessarily.
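The effect of the join argument is easiest to see when concatenating along axis=1 with partially overlapping row indexes. A minimal sketch with hypothetical data:

```python
import pandas as pd

# frames with partially overlapping row indexes (hypothetical data)
a = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'y': [10, 20]}, index=[1, 2])

# along axis=1, join controls the row index: union vs intersection
outer = pd.concat([a, b], axis=1, join='outer')
inner = pd.concat([a, b], axis=1, join='inner')
print(outer.shape)  # (3, 2) -- row 0 gets NaN in column y
print(inner.shape)  # (2, 2) -- only rows present in both frames
```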

Now, I will create two dataframes and do concatenation:-


In [133]: 1 # let's create two dataframes
2 ​
3 batsmen = pd.DataFrame({ 'id':[1,2,3,4,5],
4 'Name': ['Rohit', 'Dhawan', 'Virat', 'Dhoni', 'Kedar'],
5 'subject_id':['sub1','sub2','sub4','sub6','sub5']})
6 ​
7 bowler = pd.DataFrame(
8 {'id':[1,2,3,4,5],
9 'Name': ['Kumar', 'Bumrah', 'Shami', 'Kuldeep', 'Chahal'],
10 'subject_id':['sub2','sub4','sub3','sub6','sub5']})
11 ​
12 ​
13 print(batsmen)
14 print(" ")
15 print(bowler)
16 ​

id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5

id Name subject_id
0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
In [134]: 1 # concatenate the dataframes
2 ​
3 ​
4 team=[batsmen, bowler]
5 ​
6 pd.concat([batsmen, bowler])
7 ​

Out[134]:
id Name subject_id

0 1 Rohit sub1

1 2 Dhawan sub2

2 3 Virat sub4

3 4 Dhoni sub6

4 5 Kedar sub5

0 1 Kumar sub2

1 2 Bumrah sub4

2 3 Shami sub3

3 4 Kuldeep sub6

4 5 Chahal sub5
In [135]: 1 # associate keys with the dataframes
2 ​
3 pd.concat(team, keys=['x', 'y'])
4 ​

Out[135]:
id Name subject_id

x 0 1 Rohit sub1

1 2 Dhawan sub2

2 3 Virat sub4

3 4 Dhoni sub6

4 5 Kedar sub5

y 0 1 Kumar sub2

1 2 Bumrah sub4

2 3 Shami sub3

3 4 Kuldeep sub6

4 5 Chahal sub5
In [136]: 1 pd.concat(team, keys=['x', 'y'], ignore_index=True)

Out[136]:
id Name subject_id

0 1 Rohit sub1

1 2 Dhawan sub2

2 3 Virat sub4

3 4 Dhoni sub6

4 5 Kedar sub5

5 1 Kumar sub2

6 2 Bumrah sub4

7 3 Shami sub3

8 4 Kuldeep sub6

9 5 Chahal sub5

In [137]: 1 pd.concat(team, axis=1)

Out[137]:
id Name subject_id id Name subject_id

0 1 Rohit sub1 1 Kumar sub2

1 2 Dhawan sub2 2 Bumrah sub4

2 3 Virat sub4 3 Shami sub3

3 4 Dhoni sub6 4 Kuldeep sub6

4 5 Kedar sub5 5 Chahal sub5


In [138]: 1 batsmen.append(bowler)

C:\Users\JAINIL PARIKH\AppData\Local\Temp\ipykernel_1944\40450914.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
batsmen.append(bowler)

Out[138]:
id Name subject_id

0 1 Rohit sub1

1 2 Dhawan sub2

2 3 Virat sub4

3 4 Dhoni sub6

4 5 Kedar sub5

0 1 Kumar sub2

1 2 Bumrah sub4

2 3 Shami sub3

3 4 Kuldeep sub6

4 5 Chahal sub5
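As the FutureWarning above says, DataFrame.append is deprecated (it was removed entirely in pandas 2.0). The supported replacement is pd.concat, which produces the same result:

```python
import pandas as pd

batsmen = pd.DataFrame({'id': [1, 2], 'Name': ['Rohit', 'Dhawan']})
bowler = pd.DataFrame({'id': [1, 2], 'Name': ['Kumar', 'Bumrah']})

# pd.concat replaces the deprecated DataFrame.append;
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1
combined = pd.concat([batsmen, bowler], ignore_index=True)
print(combined)
```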
