Pandas
In [1]: import pandas as pd
In [2]: print(pd.__version__)
1.4.2
Introduction to Pandas
Today, Python is considered the most popular programming language for data science work, largely because it provides excellent packages for data analysis and visualization.
Pandas is one of those packages, and it makes analysing data much easier. Pandas is an open-source library for data analysis in Python. It was
developed by Wes McKinney in 2008. Over the years, it has become the standard library for data analysis in Python.
"Pandas offers data structures and operations for manipulating numerical tables and time series. It is free software released under the
three-clause BSD license. The name is derived from the term 'panel data', an econometrics term for data sets that include observations
over multiple time periods for the same individuals."
In this project, I explore Pandas and various data analysis tools provided by Pandas.
Key features of Pandas
Some key features of Pandas are as follows:-
1. It provides tools for reading and writing data from a wide variety of sources such as CSV files, Excel files, SQL databases and JSON files.
2. It provides data structures like Series and DataFrame for data manipulation and indexing. (The older Panel structure was deprecated and later removed in pandas 1.0.)
3. It can handle a wide variety of data sets in different formats: time series, heterogeneous data, tabular and matrix data.
4. It can perform a variety of operations on datasets, including subsetting, slicing, filtering, merging, joining, groupby, reordering and reshaping operations.
5. It can deal with missing data by either deleting it or filling it with zeros or a suitable summary statistic.
6. It can be used for parsing and conversion of data.
7. It provides data filtration techniques.
8. It provides time series functionality: date range generation, frequency conversion, moving window statistics, data shifting and lagging.
9. It integrates well with other Python libraries such as scikit-learn, statsmodels and SciPy.
10. It delivers fast performance, and it can be sped up even further by making use of Cython (C extensions for Python).
Advantages of Pandas
Pandas is a core component of the Python data analysis toolkit. It provides data structures and operations that are particularly useful
for data analysis. There are several advantages of using Pandas for data analysis.
Data representation
It represents data in a form that is well suited to data analysis, through its DataFrame and Series data structures.
It provides for easy subsetting and filtering of data, along with procedures that are suited to data analysis.
It lets us write clear and concise code and focus on the task at hand, rather than having to write tedious boilerplate.
Data structures in Pandas
Pandas provides easy-to-use data structures.
• Series
• Dataframe
These data structures are built on top of NumPy arrays, which makes them fast. I have described these data structures in the following sections.
Pandas Series
A Pandas Series is a one-dimensional array-like structure with homogeneous data.
The data can be of any type (integer, string, float, etc.), but all values in one Series share a single dtype. The axis labels are collectively called the index.
For example, a series could hold the collection of integers 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
Creating Series
The series is a one-dimensional labelled array. It can hold data of any type.
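The Series outputs shown below come from input cells that did not survive the export. A minimal sketch of how such Series can be created (the variable names are my own; the data mirrors the outputs):

```python
import pandas as pd

# From a list of characters: pandas assigns a default integer index 0..6.
s1 = pd.Series(list("Applied"))

# The same data with a custom index starting at 10.
s2 = pd.Series(list("Applied"), index=range(10, 17))

# From a dict: the keys become the index labels.
s3 = pd.Series({"ABC": 10, "DEF": 20, "GHI": 30})

# From a scalar: the value is repeated for every index label.
s4 = pd.Series(10, index=range(6))

print(s1)
print(s3)
</imports>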
0 A
1 p
2 p
3 l
4 i
5 e
6 d
dtype: object
10 A
11 p
12 p
13 l
14 i
15 e
16 d
dtype: object
0 A
1 p
2 p
3 l
4 i
5 e
6 d
dtype: object
ABC 10
DEF 20
GHI 30
dtype: int64
0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64
0 3.0
1 18.0
2 33.0
dtype: float64
0 1.0
1 12.0
2 23.0
3 34.0
4 45.0
5 56.0
6 67.0
7 78.0
8 89.0
9 100.0
dtype: float64
day1 420
day2 380
day3 390
dtype: int64
Pandas DataFrame
A DataFrame is a two-dimensional data structure, so data is aligned in a tabular fashion in rows and columns. Its column types can be
heterogeneous: that is, of varying types. It is similar to structured arrays in NumPy, with mutability added.
• Its columns can be of different types: float64, int, bool, and so on.
• It can be thought of as a dictionary of Series structures in which both the rows and the columns are indexed, denoted as index in the case of rows and
columns in the case of columns.
Dataframe Creation
A pandas DataFrame can be created from various inputs such as:
• Lists
• dict
• Series
• Numpy ndarrays
• Another Dataframe
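The inputs listed above can be sketched as follows (toy data of my own, not from the notebook):

```python
import pandas as pd

# From a dict of lists: keys become column names.
df1 = pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20, 21]})

# From a dict of Series: the indexes are aligned and
# missing labels are filled with NaN.
df2 = pd.DataFrame({
    "one": pd.Series([10, 20], index=["a", "b"]),
    "two": pd.Series([1, 2, 3], index=["a", "b", "c"]),
})

# From a list of dicts: each dict becomes one row.
df3 = pd.DataFrame(
    [{"b": 2, "c": 3}, {"a": 10, "b": 20, "c": 30}],
    index=["first", "second"],
)
```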
Creating DataFrame
The dataframe is a two-dimensional data structure that holds data in rows and columns.
0
0 ball
1 bat
2 cat
3 dog
4 blue
5 black
6 green
Out[12]:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
In [13]: # Initialise a dict of lists.
         data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}

         # Create DataFrame
         df = pd.DataFrame(data)

         # Print the output.
         print(df)
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
In [14]: data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],
                 'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
                 'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

         # Convert the dictionary into DataFrame
         df = pd.DataFrame(data)
         df
Out[14]:
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd
Out[15]:
Name marks
rank1 Tom 99
rank2 Jack 98
rank3 nick 95
rank4 juli 90
Out[16]:
b c a
first 2 3 NaN
second 20 30 10.0
In [17]: # List1
         Name = ['tom', 'krish', 'nick', 'juli']

         # List2
         Age = [25, 30, 26, 22]

         # Pair the two lists element-wise using zip(). Note that wrapping
         # the result in set() removes duplicates but does not preserve order.
         list_of_tuples = set(zip(Name, Age))
         print(list_of_tuples)

         # Convert the pairs into a pandas DataFrame.
         df = pd.DataFrame(list_of_tuples, columns=['Name', 'Age'])

         # Print data.
         df
Out[17]:
Name Age
0 tom 25
1 juli 22
2 nick 26
3 krish 30
In [18]: listScoville = [50, 5000, 500000]
         listName = ["Bell pepper", "Espelette pepper", "Chocolate habanero"]
         listFeeling = ["Not even spicy", "Uncomfortable", "Practically ate pepper spray"]

         dataFrame1 = pd.DataFrame(zip(listScoville, listName, listFeeling),
                                   columns=['Scoville', 'Name', 'Feeling'])

         # Print the dataframe
         dataFrame1
Out[18]:
Scoville Name Feeling
0 50 Bell pepper Not even spicy
1 5000 Espelette pepper Uncomfortable
2 500000 Chocolate habanero Practically ate pepper spray
Out[19]:
one two
a 10 1
b 20 2
c 30 30
d 40 40
Reading and Writing CSV Files
Although Comma-Separated Values (CSV) files use the comma as the delimiter by default, other delimiters (separators) can be used as well, such as the semicolon (;).
Each row of the table is a new line of the CSV file, which makes it a very compact and concise way to represent tabular data.
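A minimal round-trip sketch of writing and reading CSV (here through an in-memory buffer rather than a file on disk):

```python
from io import StringIO

import pandas as pd

cities = pd.DataFrame({"City": ["Sacramento", "Miami"],
                       "State": ["California", "Florida"]})

# to_csv with no path returns the CSV text; index=False drops the row labels.
csv_text = cities.to_csv(index=False)

# read_csv accepts a path or any file-like object; a different
# delimiter can be chosen with sep, e.g. sep=";".
round_trip = pd.read_csv(StringIO(csv_text))
```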
Out[20]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
Out[21]:
City State
0 Sacramento California
1 Miami Florida
In [22]: cities.to_csv('Area.csv', index=False)
Out[23]:
City State
0 Sacramento California
1 Miami Florida
Out[24]:
city state
0 Sacramento California
1 Miami Florida
Out[25]:
States Capitals Population
In [26]: df.to_excel('states.xlsx')
Out[29]:
States Capitals Population
JavaScript Object Notation (JSON) is a data format that stores data in a human-readable form. While it can technically be used for storage,
JSON files are primarily used for serialization and information exchange between a client and a server.
Although it was derived from JavaScript, it is platform-agnostic and a widely used format, most prevalently in REST APIs.
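The JSON workflow used in the cells below can be sketched like this (the employees data mirrors the output shown; pd.json_normalize is one way to flatten such records into a dataframe):

```python
import json

import pandas as pd

employees = {"employees": [
    {"name": "John Doe", "department": "Marketing", "place": "Remote"},
    {"name": "Jane Doe", "department": "Software Engineering", "place": "Remote"},
]}

# json.dumps serialises a Python object to a JSON string,
# and json.loads parses it back.
text = json.dumps(employees)
parsed = json.loads(text)

# pd.json_normalize flattens a list of records into a dataframe.
df = pd.json_normalize(parsed["employees"])
```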
{"employees": [{"name": "John Doe", "department": "Marketing", "place": "Remote"}, {"name": "Jane Doe", "department":
"Software Engineering", "place": "Remote"}, {"name": "Don Joe", "department": "Software Engineering", "place": "Offic
e"}]}
In [32]: with open('json_data.json', 'w') as outfile:
             json.dump(json_string, outfile)
{'employees': [{'name': 'John Doe', 'department': 'Marketing', 'place': 'Remote'}, {'name': 'Jane Doe', 'department':
'Software Engineering', 'place': 'Remote'}, {'name': 'Don Joe', 'department': 'Software Engineering', 'place': 'Offic
e'}]}
{"employees": [{"name": "John Doe", "department": "Marketing", "place": "Remote"}, {"name": "Jane Doe", "department":
"Software Engineering", "place": "Remote"}, {"name": "Don Joe", "department": "Software Engineering", "place": "Offic
e"}]}
{
"people": [
{
"name": "Scott",
"website": "stackabuse.com",
"from": "Nebraska"
}
]
}
In [36]: patients = {
             "Name":{"0":"John","1":"Nick","2":"Ali","3":"Joseph"},
             "Gender":{"0":"Male","1":"Male","2":"Female","3":"Male"},
             "Nationality":{"0":"UK","1":"French","2":"USA","3":"Brazil"},
             "Age":{"0":10,"1":25,"2":35,"3":29}
         }

In [37]: with open('patients.json', 'w') as f:
             json.dump(patients, f)
Out[38]:
Name Gender Nationality Age
0 John Male UK 10
Out[39]:
sepalLength sepalWidth petalLength petalWidth species
Out[40]:
total_bill tip sex smoker day time size
In [41]: dataset.to_json('tips.json')

In [42]: with open('tips.json') as json_file:
             data = json.load(json_file)
         print(data)
{'total_bill': {'0': 16.99, '1': 10.34, '2': 21.01, '3': 23.68, '4': 24.59, '5': 25.29, '6': 8.77, '7': 26.88, '8':
15.04, '9': 14.78, '10': 10.27, '11': 35.26, '12': 15.42, '13': 18.43, '14': 14.83, '15': 21.58, '16': 10.33, '17':
16.29, '18': 16.97, '19': 20.65, '20': 17.92, '21': 20.29, '22': 15.77, '23': 39.42, '24': 19.82, '25': 17.81, '26':
13.37, '27': 12.69, '28': 21.7, '29': 19.65, '30': 9.55, '31': 18.35, '32': 15.06, '33': 20.69, '34': 17.78, '35': 2
4.06, '36': 16.31, '37': 16.93, '38': 18.69, '39': 31.27, '40': 16.04, '41': 17.46, '42': 13.94, '43': 9.68, '44': 3
0.4, '45': 18.29, '46': 22.23, '47': 32.4, '48': 28.55, '49': 18.04, '50': 12.54, '51': 10.29, '52': 34.81, '53': 9.
94, '54': 25.56, '55': 19.49, '56': 38.01, '57': 26.41, '58': 11.24, '59': 48.27, '60': 20.29, '61': 13.81, '62': 1
1.02, '63': 18.29, '64': 17.59, '65': 20.08, '66': 16.45, '67': 3.07, '68': 20.23, '69': 15.01, '70': 12.02, '71': 1
7.07, '72': 26.86, '73': 25.28, '74': 14.73, '75': 10.51, '76': 17.92, '77': 27.2, '78': 22.76, '79': 17.29, '80': 1
9.44, '81': 16.66, '82': 10.07, '83': 32.68, '84': 15.98, '85': 34.83, '86': 13.03, '87': 18.28, '88': 24.71, '89':
21.16, '90': 28.97, '91': 22.49, '92': 5.75, '93': 16.32, '94': 22.75, '95': 40.17, '96': 27.28, '97': 12.03, '98':
21.01, '99': 12.46, '100': 11.35, '101': 15.38, '102': 44.3, '103': 22.42, '104': 20.92, '105': 15.36, '106': 20.49,
'107': 25.21, '108': 18.24, '109': 14.31, '110': 14.0, '111': 7.25, '112': 38.07, '113': 23.95, '114': 25.71, '115':
17.31, '116': 29.93, '117': 10.65, '118': 12.43, '119': 24.08, '120': 11.69, '121': 13.42, '122': 14.26, '123': 15.9
5, '124': 12.48, '125': 29.8, '126': 8.52, '127': 14.52, '128': 11.38, '129': 22.82, '130': 19.08, '131': 20.27, '13
2': 11.17, '133': 12.26, '134': 18.26, '135': 8.51, '136': 10.33, '137': 14.15, '138': 16.0, '139': 13.16, '140': 1
7.47, '141': 34.3, '142': 41.19, '143': 27.05, '144': 16.43, '145': 8.35, '146': 18.64, '147': 11.87, '148': 9.78,
'149': 7.51, '150': 14.07, '151': 13.13, '152': 17.26, '153': 24.55, '154': 19.77, '155': 29.85, '156': 48.17, '15
7': 25.0, '158': 13.39, '159': 16.49, '160': 21.5, '161': 12.66, '162': 16.21, '163': 13.81, '164': 17.51, '165': 2
4.52, '166': 20.76, '167': 31.71, '168': 10.59, '169': 10.63, '170': 50.81, '171': 15.81, '172': 7.25, '173': 31.85, ...
Why pickle? In real-world scenarios, pickling and unpickling are widespread because they allow us to easily transfer data from one
server/system to another and then store it in a file or database.
In [43]: # dictionary of data
         dct = {"f1": range(6), "b1": range(6, 12)}

         # forming dataframe
         data = pd.DataFrame(dct)

         # using to_pickle to write a file named 'pickle_data.pkl'
         pd.to_pickle(data, './pickle_data.pkl')
f1 b1
0 0 6
1 1 7
2 2 8
3 3 9
4 4 10
5 5 11
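The full round trip, pickling and unpickling the same dataframe, can be sketched as follows (here through an in-memory buffer; a file path such as 'pickle_data.pkl' works the same way):

```python
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"f1": range(6), "b1": range(6, 12)})

# Serialise the dataframe to an in-memory buffer.
buf = BytesIO()
df.to_pickle(buf)

# Rewind and unpickle it back into an identical dataframe.
buf.seek(0)
restored = pd.read_pickle(buf)
```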
Out[47]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Out[48]: pandas.core.frame.DataFrame
Out[49]:
Age Survived
0 22.0 0
1 38.0 1
2 26.0 1
3 35.0 1
4 35.0 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [52]: titanic.describe()
Out[52]:
PassengerId Survived Pclass Age SibSp Parch Fare
In [53]: titanic.isnull().sum()
Out[53]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Examples of the isna() and notna() methods, which detect missing ('NA') values in a dataframe:
df.isna().sum()
pd.isna(df['col_name'])
df['col_name'].notna()
In [54]: titanic.isna().sum()
Out[54]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In [55]: pd.isna(titanic["Fare"])
Out[55]: 0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Fare, Length: 891, dtype: bool
In [56]: titanic["Age"].isna()
Out[56]: 0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 True
889 False
890 False
Name: Age, Length: 891, dtype: bool
In [58]: titanic["Age"].notna()
Out[58]: 0 True
1 True
2 True
3 True
4 True
...
886 True
887 True
888 False
889 True
890 True
Name: Age, Length: 891, dtype: bool
Out[59]:
day temperature windspeed event
Out[60]:
day temperature windspeed event
Out[61]:
day temperature windspeed event
Drop NA
Out[62]:
day temperature windspeed event
In [63]: df = pd.read_csv("weather_data_missing.csv")
         df
Out[63]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
3 1/4/2017 -99999 7 0
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0
Out[64]:
day temperature windspeed event
Out[65]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 0 7 Sunny
2 1/3/2017 28 0 Snow
3 1/4/2017 0 7 0
4 1/5/2017 32 0 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0
Out[66]:
day temperature windspeed event
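The dropna/fillna operations shown in the weather cells above can be sketched on a small stand-in frame (toy data of my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017"],
    "temperature": [32.0, np.nan, 28.0],
    "event": ["Rain", "Sunny", np.nan],
})

dropped = df.dropna()      # drop every row containing a missing value
filled = df.fillna(0)      # fill all missing values with 0
# fill per column, e.g. with the column mean
mean_filled = df.fillna({"temperature": df["temperature"].mean()})
```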
Pandas provides three types of multi-axes indexing: .loc (label based), .iloc (integer position based) and the older .ix (mixed).
Starting with pandas 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers. So, I will not discuss it here and limit the
discussion to the .loc and .iloc indexers.
Syntax-
df.loc[<row selection>, <column selection>] and df.iloc[<row selection>, <column selection>]: the first argument indicates the rows and the second one indicates the columns.
Out[67]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
In [68]: #select first five rows for a specific column
         titanic.loc[:4,'Name']
Out[70]:
Age Name
Out[71]:
Age Name
Out[72]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
STON/O2.
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S
3101282
The .iloc indexer accepts purely position-based inputs, e.g.:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with integers, e.g. 1:7.
• A boolean array.
Out[73]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
In [74]: # select last row of dataframe
         titanic.iloc[-1]
Out[75]: 0 1
1 2
2 3
3 4
4 5
...
886 887
887 888
888 889
889 890
890 891
Name: PassengerId, Length: 891, dtype: int64
In [76]: #select second last column of dataframe
         titanic.iloc[:,-2]
Out[76]: 0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
Out[77]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
STON/O2.
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S
3101282
Out[78]:
PassengerId Survived Pclass
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Out[79]:
PassengerId Name SibSp
In [80]: # select first 5 rows and columns at positions 5, 6, 7
         titanic.iloc[0:5, 5:8]
Out[80]:
Age SibSp Parch
0 22.0 1 0
1 38.0 1 0
2 26.0 0 0
3 35.0 1 0
4 35.0 0 0
Out[83]: 630
Out[84]: 80.0
In [85]: titanic['Age'].idxmin()
Out[85]: 803
In [86]: titanic.at[803, 'Age']
Out[86]: 0.42
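idxmax() and idxmin() return the index label of the extreme value, which .at can then use for a fast scalar lookup. A small sketch with a toy stand-in for the Age column (values chosen to mirror the titanic results above):

```python
import pandas as pd

# Toy stand-in for the titanic Age column.
age = pd.Series([22.0, 80.0, 0.42], index=[0, 630, 803], name="Age")

oldest_label = age.idxmax()    # index label of the maximum value
youngest_label = age.idxmin()  # index label of the minimum value
```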
Boolean vectors can be used to filter data. The operators are:
1. | for or,
2. & for and,
3. ~ for not.
These must be grouped by using parentheses. Using a boolean vector to index a Series works exactly as in a NumPy ndarray.
Conditional selection with boolean arrays using df.loc[selection] is the most common method used with Pandas DataFrames. With boolean
indexing, or logical selection, we pass an array or Series of True/False values to the .loc indexer to select the rows where the Series has True
values. We can then make selections based on the values of different columns in the dataset.
We can use a boolean True/False series to select the rows of a pandas dataframe where the values are True. A second argument can then be
passed to the .loc indexer to select other columns of the dataframe. For the .loc indexer the columns are referred to by name and
can be a single string, a list of columns, or a slice ":" operation.
In [88]: age = titanic["Age"] == 30
         titanic[age]
Out[88]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
SOTON/OQ
157 158 0 3 Corn, Mr. Harry male 30.0 0 0 8.0500 NaN S
392090
178 179 0 2 Hale, Mr. Reginald male 30.0 0 0 250653 13.0000 NaN S
213 214 0 2 Givard, Mr. Hans Kristensen male 30.0 0 0 250646 13.0000 NaN S
219 220 0 2 Harris, Mr. Walter male 30.0 0 0 W/C 14208 10.5000 NaN S
244 245 0 3 Attalah, Mr. Sleiman male 30.0 0 0 2694 7.2250 NaN C
253 254 0 3 Lobb, Mr. William Arthur male 30.0 1 0 A/5. 3336 16.1000 NaN S
257 258 1 1 Cherry, Miss. Gladys female 30.0 0 0 110152 86.5000 B77 S
286 287 1 3 de Mulder, Mr. Theodore male 30.0 0 0 345774 9.5000 NaN S
308 309 0 2 Abelson, Mr. Samuel male 30.0 1 0 P/PP 3381 24.0000 NaN C
309 310 1 1 Francatelli, Miss. Laura Mabel female 30.0 0 0 PC 17485 56.9292 E36 C
322 323 1 2 Slayter, Miss. Hilda Mary female 30.0 0 0 234818 12.3500 NaN Q
365 366 0 3 Adahl, Mr. Mauritz Nils Martin male 30.0 0 0 C 7076 7.2500 NaN S
418 419 0 2 Matthews, Mr. William John male 30.0 0 0 28228 13.0000 NaN S
452 453 0 1 Foreman, Mr. Benjamin Laventall male 30.0 0 0 113051 27.7500 C111 C
488 489 0 3 Somerton, Mr. Francis William male 30.0 0 0 A.5. 18509 8.0500 NaN S
520 521 1 1 Perreault, Miss. Anne female 30.0 0 0 12749 93.5000 B73 S
534 535 0 3 Cacic, Miss. Marija female 30.0 0 0 315084 8.6625 NaN S
537 538 1 1 LeRoy, Miss. Bertha female 30.0 0 0 PC 17761 106.4250 NaN C
606 607 0 3 Karaic, Mr. Milan male 30.0 0 0 349246 7.8958 NaN S
747 748 1 2 Sinkkonen, Miss. Anna female 30.0 0 0 250648 13.0000 NaN S
798 799 0 3 Ibrahim Shawah, Mr. Yousseff male 30.0 0 0 2685 7.2292 NaN C
842 843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 113798 31.0000 NaN C
Out[89]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
In [90]: titanic.loc[((titanic["Age"] > 50) & (titanic["Sex"] == "female") & (titanic["Survived"] == 1))]
Out[90]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PC
195 196 1 1 Lurette, Miss. Elise female 58.0 0 0 146.5208 B80 C
17569
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
496 497 1 1 Eustis, Miss. Elizabeth Mussey female 54.0 1 0 36947 78.2667 D20 C
PC
513 514 1 1 Rothschild, Mrs. Martin (Elizabeth L. Barrett) female 54.0 1 0 59.4000 NaN C
17603
571 572 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.0 2 0 11769 51.4792 C101 S
765 766 1 1 Hogeboom, Mrs. John C (Anna Andrews) female 51.0 1 0 13502 77.9583 D11 S
774 775 1 2 Hocking, Mrs. Elizabeth (Eliza Needs) female 54.0 1 3 29105 23.0000 NaN S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
DataFrame also has an isin() method. When calling isin, we pass a set of values as either an array or dict. If values is an array, isin returns a
DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.
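The two forms of isin() described above can be sketched on a toy frame (my own data):

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "C"],
                   "Pclass": [3, 1, 3, 2]})

# On a Series, isin() gives a boolean mask of membership.
from_c = df[df["Embarked"].isin(["C"])]

# On a DataFrame, passing a dict checks each listed column;
# columns not named in the dict come back all False.
mask = df.isin({"Pclass": [1, 2]})
```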
Out[91]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
Out[92]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
PC
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 76.7292 D33 C
17572
... ... ... ... ... ... ... ... ... ... ... ... ...
856 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45.0 1 1 36928 164.8667 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
94 rows × 12 columns
Out[93]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
Out[94]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
... ... ... ... ... ... ... ... ... ... ... ... ...
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
843 844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 NaN C
847 848 0 3 Markoff, Mr. Marin male 35.0 0 0 349213 7.8958 NaN C
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
95 rows × 12 columns
Out[95]:
Place Time Food Price($)
Set an index
DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). This method sets
the dataframe index using existing columns.
Out[96]:
Time Food Price($)
Place
Out[97]:
Food Price($)
Place Time
Dinner Rice 20
Dinner Chapati 40
There is a function called reset_index() which transfers the index values into the DataFrame’s columns and sets a simple integer index. This is the
inverse operation of set_index().
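The set_index()/reset_index() pair can be sketched as follows (toy food data echoing the columns above):

```python
import pandas as pd

food = pd.DataFrame({"Place": ["Home", "Cafe"],
                     "Time": ["Dinner", "Lunch"],
                     "Food": ["Rice", "Pizza"],
                     "Price($)": [20, 150]})

indexed = food.set_index("Place")           # single-column index
multi = food.set_index(["Place", "Time"])   # MultiIndex from two columns
restored = indexed.reset_index()            # index moved back into a column
```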
In [98]: food_indexed2.reset_index()
Out[98]:
Place Time Food Price($)
Sorting in pandas
Pandas provides two kinds of sorting. They are:-
1. Sorting by label
2. Sorting by actual value
They are described below:-
1. Sorting by label
We can use the sort_index() method to sort the object by labels. DataFrame can be sorted by passing the axis arguments and the order of sorting.
By default, sorting is done on row labels in ascending order.
Out[99]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
... ... ... ... ... ... ... ... ... ... ... ... ...
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
843 844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 NaN C
847 848 0 3 Markoff, Mr. Marin male 35.0 0 0 349213 7.8958 NaN C
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
95 rows × 12 columns
Order of sorting
By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled.
df2.sort_index(ascending=False)
Sorting by columns
By passing the axis argument a value of 0 or 1, sorting can be done on the row or column labels. The default, axis=0, sorts by row labels; axis=1 sorts the columns into label order.
df2.sort_index(axis=1)
In [100]: titanic.sort_index(ascending=False)
Out[100]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
847 848 0 3 Markoff, Mr. Marin male 35.0 0 0 349213 7.8958 NaN C
843 844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 NaN C
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
... ... ... ... ... ... ... ... ... ... ... ... ...
95 rows × 12 columns
In [101]: titanic.sort_index(axis=1)
Out[101]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
... ... ... ... ... ... ... ... ... ... ... ... ...
839 NaN C47 C 29.7000 Marechal, Mr. Pierre 0 840 1 male 0 1 11774
843 34.5 NaN C 6.4375 Lemberopolous, Mr. Peter L 0 844 3 male 0 0 2683
847 35.0 NaN C 7.8958 Markoff, Mr. Marin 0 848 3 male 0 0 349213
859 NaN NaN C 7.2292 Razi, Mr. Raihed 0 860 3 male 0 0 2629
889 26.0 C148 C 30.0000 Behr, Mr. Karl Howell 0 890 1 male 0 1 111369
95 rows × 12 columns
2. Sorting by values
The second method of sorting is sorting by values. Pandas provides the sort_values() method to sort by values. It accepts a by argument, which
takes the column name (or a list of column names) of the DataFrame by which the values are to be sorted.
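A small sketch of sort_values() (toy data of my own; note that rows with NaN in the sort column are placed last by default):

```python
import pandas as pd

df = pd.DataFrame({"Age": [34.5, 0.42, 11.0, None],
                   "Fare": [6.4375, 8.5167, 18.7875, 7.2292]})

by_age = df.sort_values(by="Age")             # NaN rows are placed last
by_both = df.sort_values(by=["Age", "Fare"])  # ties broken by Fare
desc = df.sort_values(by="Age", ascending=False)
```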
Out[102]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
827 828 1 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
731 732 0 3 Hassan, Mr. Houssein G N male 11.00 0 0 2699 18.7875 NaN C
125 126 1 3 Nicola-Yarred, Master. Elias male 12.00 1 0 2651 11.2417 NaN C
352 353 0 3 Elias, Mr. Tannous male 15.00 1 1 2695 7.2292 NaN C
... ... ... ... ... ... ... ... ... ... ... ... ...
773 774 0 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C
793 794 0 1 Hoyt, Mr. William Fisher male NaN 0 0 PC 17600 30.6958 NaN C
832 833 0 3 Saad, Mr. Amin male NaN 0 0 2671 7.2292 NaN C
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
95 rows × 12 columns
In [103]: titanic.sort_values(by=['Age', 'Fare'])
Out[103]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
827 828 1 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
731 732 0 3 Hassan, Mr. Houssein G N male 11.00 0 0 2699 18.7875 NaN C
125 126 1 3 Nicola-Yarred, Master. Elias male 12.00 1 0 2651 11.2417 NaN C
352 353 0 3 Elias, Mr. Tannous male 15.00 1 1 2695 7.2292 NaN C
... ... ... ... ... ... ... ... ... ... ... ... ...
295 296 0 1 Lewy, Mr. Ervin G male NaN 0 0 PC 17612 27.7208 NaN C
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
793 794 0 1 Hoyt, Mr. William Fisher male NaN 0 0 PC 17600 30.6958 NaN C
766 767 0 1 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C
557 558 0 1 Robbins, Mr. Victor male NaN 0 0 PC 17757 227.5250 NaN C
95 rows × 12 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 26 to 889
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 95 non-null int64
1 Survived 95 non-null int64
2 Pclass 95 non-null int64
3 Name 95 non-null object
4 Sex 95 non-null object
5 Age 69 non-null float64
6 SibSp 95 non-null int64
7 Parch 95 non-null int64
8 Ticket 95 non-null object
9 Fare 95 non-null float64
10 Cabin 32 non-null object
11 Embarked 95 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 9.6+ KB
In [105]: titanic["Sex"].describe()
Out[105]: count 95
unique 1
top male
freq 95
Name: Sex, dtype: object
In [106]: titanic["Embarked"].describe()
Out[106]: count 95
unique 1
top C
freq 95
Name: Embarked, dtype: object
Unique values in categorical data
We can get the unique values in a series object with the unique() method. It returns categories in the order of appearance, and it only includes
values that are actually present.
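The behaviour of unique() (and its companion value_counts(), used further below) can be sketched on a toy series of my own:

```python
import pandas as pd

s = pd.Series(["S", "C", "S", "Q", "C", "S"])

labels = s.unique()        # values in order of first appearance
counts = s.value_counts()  # frequencies, most common first
```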
In [107]: titanic["Embarked"].unique()
In [108]: titanic["Pclass"].unique()
In [109]: titanic['Age'].unique()
Out[109]: array([ nan, 40. , 28. , 65. , 28.5 , 22. , 26. , 71. , 23. ,
24. , 32.5 , 12. , 33. , 51. , 56. , 45.5 , 30. , 37. ,
36. , 23.5 , 15. , 29. , 25. , 27. , 20. , 49. , 58. ,
18. , 17. , 50. , 60. , 35. , 32. , 48. , 11. , 46. ,
0.42, 31. , 1. , 34.5 ])
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 26 to 889
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 95 non-null int64
1 Survived 95 non-null int64
2 Pclass 95 non-null int64
3 Name 95 non-null object
4 Sex 95 non-null string
5 Age 69 non-null float64
6 SibSp 95 non-null int64
7 Parch 95 non-null int64
8 Ticket 95 non-null object
9 Fare 95 non-null float64
10 Cabin 32 non-null object
11 Embarked 95 non-null object
dtypes: float64(2), int64(5), object(4), string(1)
memory usage: 9.6+ KB
In [112]: titanic['Sex'].value_counts()
Out[112]: male 95
Name: Sex, dtype: Int64
In [113]: titanic['Survived'].value_counts()
Out[113]: 0 66
1 29
Name: Survived, dtype: int64
In [114]: titanic['Pclass'].value_counts(ascending=True)
Out[114]: 2 10
1 42
3 43
Name: Pclass, dtype: int64
In [115]: titanic['Age'].value_counts(ascending=True)
Out[115]: 15.00 1
31.00 1
0.42 1
46.00 1
11.00 1
48.00 1
32.00 1
60.00 1
50.00 1
18.00 1
1.00 1
23.50 1
37.00 1
45.50 1
34.50 1
32.50 1
51.00 1
28.00 1
12.00 1
65.00 1
28.50 1
33.00 2
29.00 2
58.00 2
56.00 2
17.00 2
24.00 2
23.00 2
71.00 2
22.00 3
26.00 3
40.00 3
49.00 3
20.00 3
27.00 3
36.00 3
35.00 3
25.00 4
30.00 4
Name: Age, dtype: int64
Aggregations in pandas
Apply aggregation on a single column of a dataframe
In [116]: titanic['Pclass'].aggregate(np.sum)
Out[116]: 191
In [117]: titanic['Age'].aggregate(np.sum)
Out[117]: 2276.92
In [118]: titanic['Fare'].aggregate(np.sum)
Out[118]: 4584.9003999999995
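aggregate() also accepts several columns and several functions at once. A sketch on a toy frame (my own data; string function names like "sum" work as well as NumPy callables):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 2],
                   "Fare": [7.25, 71.2833, 13.0]})

# Aggregate a single column...
total_fare = df["Fare"].aggregate("sum")

# ...or several columns with several functions at once.
summary = df[["Pclass", "Fare"]].aggregate(["sum", "mean"])
```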
Out[121]:
Pclass Age Fare
Out[122]: {'male': [26, 30, 34, 36, 42, 48, 54, 57, 60, 64, 65, 73, 96, 97, 118, 122, 125, 130, 135, 139, 155, 174, 181, 203, 20
7, 209, 244, 273, 285, 292, 295, 296, 308, 352, 354, 361, 370, 373, 377, 378, 420, 452, 453, 455, 484, 487, 493, 495, 5
05, 522, 524, 531, 532, 544, 547, 550, 553, 557, 568, 583, 584, 587, 598, 599, 604, 620, 622, 632, 645, 647, 659, 661,
679, 681, 685, 693, 698, 709, 731, 737, 762, 766, 773, 789, 793, 798, 803, 817, 827, 832, 839, 843, 847, 859, 889]}
In [123]: 1 titanic.groupby(['Sex', 'Age','Pclass']).groups
Out[123]: {('male', nan, 3): [26, 36, 42, 48, 65, 354, 420, 495, 522, 524, 531, 568, 584, 598, 709, 773, 832, 859], ('male', 0.4
2, 3): [803], ('male', 1.0, 2): [827], ('male', 11.0, 3): [731], ('male', 15.0, 3): [352], ('male', 17.0, 1): [550],
('male', 17.0, 3): [532], ('male', 18.0, 1): [505], ('male', 20.0, 3): [378, 622, 762], ('male', nan, 2): [181, 547],
('male', 12.0, 3): [125], ('male', 22.0, 1): [373], ('male', 22.0, 3): [60, 553], ('male', 23.0, 1): [97], ('male', 23.
0, 2): [135], ('male', 24.0, 1): [118, 139], ('male', 25.0, 2): [685], ('male', 25.0, 3): [693], ('male', nan, 1): [64,
295, 557, 766, 793, 839], ('male', 23.5, 3): [296], ('male', 25.0, 1): [370, 484], ('male', 26.0, 1): [889], ('male', 2
6.0, 3): [73, 207], ('male', 27.0, 1): [377, 681], ('male', 27.0, 3): [620], ('male', 28.0, 1): [34], ('male', 28.5,
3): [57], ('male', 29.0, 2): [361], ('male', 29.0, 3): [455], ('male', 30.0, 1): [452], ('male', 30.0, 2): [308], ('mal
e', 30.0, 3): [244, 798], ('male', 31.0, 2): [817], ('male', 32.0, 1): [632], ('male', 32.5, 2): [122], ('male', 33.0,
3): [130, 285], ('male', 34.5, 3): [843], ('male', 35.0, 1): [604, 737], ('male', 35.0, 3): [847], ('male', 36.0, 1):
[583, 679], ('male', 36.0, 2): [292], ('male', 37.0, 1): [273], ('male', 40.0, 1): [30, 209], ('male', 40.0, 3): [661],
('male', 45.5, 3): [203], ('male', 46.0, 1): [789], ('male', 48.0, 1): [645], ('male', 49.0, 1): [453, 599, 698], ('mal
e', 50.0, 1): [544], ('male', 51.0, 1): [155], ('male', 56.0, 1): [174, 647], ('male', 58.0, 1): [487, 659], ('male', 6
0.0, 1): [587], ('male', 65.0, 1): [54], ('male', 71.0, 1): [96, 493]}
In [124]: 1 titanic.groupby('Sex').sum()
Out[124]:
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
In [125]: 1 titanic.groupby('Sex')['Survived'].aggregate([np.sum, np.mean])
Out[125]:
sum mean
Sex
male 29 0.305263
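The split-apply-combine pattern behind the groupby calls above can be sketched stand-alone; the data here is illustrative, not the Titanic set:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                   'Survived': [0, 1, 1, 1],
                   'Fare': [7.25, 71.28, 8.05, 53.10]})

# split the rows by key, then apply aggregations to each group
print(df.groupby('Sex')['Survived'].aggregate(['sum', 'mean']))

# several columns and functions can be combined in a single call
print(df.groupby('Sex').agg({'Survived': 'sum', 'Fare': 'mean'}))
```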
Merging and joining in pandas
Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects. The syntax of the merge function is as follows:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)
• left − A DataFrame object.
• right − Another DataFrame object.
• how − One of 'left', 'right', 'outer' or 'inner'. Defaults to 'inner'. Each method is summarised below.
• on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
• left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the
DataFrame.
• right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the
DataFrame.
• left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical),
the number of levels must match the number of join keys from the right DataFrame.
• right_index − Same usage as left_index, but for the right DataFrame.
• sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; setting it to True can substantially reduce
performance in many cases.
Now, I will create two different DataFrames and perform the merging operations on them as follows:-
In [126]: 1 # let's create two dataframes
2
3 batsmen = pd.DataFrame({ 'id':[1,2,3,4,5],
4 'Name': ['Rohit', 'Dhawan', 'Virat', 'Dhoni', 'Kedar'],
5 'subject_id':['sub1','sub2','sub4','sub6','sub5']})
6
7 bowler = pd.DataFrame(
8 {'id':[1,2,3,4,5],
9 'Name': ['Kumar', 'Bumrah', 'Shami', 'Kuldeep', 'Chahal'],
10 'subject_id':['sub2','sub4','sub3','sub6','sub5']})
11
12
13 print(batsmen)
14 print("-------------------------")
15 print(bowler)
16
id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
-------------------------
id Name subject_id
0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
In [127]: 1 # merge two dataframes on a key
2
3 pd.merge(batsmen, bowler, on='id')
4
Out[127]:
id Name_x subject_id_x Name_y subject_id_y
0 1 Rohit sub1 Kumar sub2
1 2 Dhawan sub2 Bumrah sub4
2 3 Virat sub4 Shami sub3
3 4 Dhoni sub6 Kuldeep sub6
4 5 Kedar sub5 Chahal sub5
In [128]: 1 # merge two dataframes on multiple keys
2
3 pd.merge(batsmen, bowler, on=['id', 'subject_id'])
Out[128]:
id Name_x subject_id Name_y
0 4 Dhoni sub6 Kuldeep
1 5 Kedar sub5 Chahal
Here is a summary of the how options and their SQL equivalent names −
Merge method SQL join name Description
left LEFT OUTER JOIN Use keys from the left frame only
right RIGHT OUTER JOIN Use keys from the right frame only
outer FULL OUTER JOIN Use union of keys from both frames
inner INNER JOIN Use intersection of keys from both frames
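The four how options can be demonstrated directly with the batsmen and bowler frames defined above; the sketch below assumes we join on subject_id:

```python
import pandas as pd

batsmen = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                        'Name': ['Rohit', 'Dhawan', 'Virat', 'Dhoni', 'Kedar'],
                        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
bowler = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                       'Name': ['Kumar', 'Bumrah', 'Shami', 'Kuldeep', 'Chahal'],
                       'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# inner join (the default): keep only subject_ids present in both frames
inner = pd.merge(batsmen, bowler, on='subject_id', how='inner')
print(inner)             # 4 rows: sub2, sub4, sub6, sub5

# left join: keep every batsmen row; unmatched bowler columns become NaN
left = pd.merge(batsmen, bowler, on='subject_id', how='left')
print(left)              # 5 rows, sub1 has NaN on the bowler side

# outer join: union of keys from both frames
outer = pd.merge(batsmen, bowler, on='subject_id', how='outer')
print(outer)             # 6 rows, one per distinct subject_id
```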
Concatenation in pandas
The concat() function does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or
intersection) of the indexes (if any) on the other axes. The syntax of the concat function is as follows:
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)
objs − a sequence or mapping of Series or DataFrame objects to be concatenated.
axis − {0, 1}, default 0. The axis to concatenate along.
join − {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es). Outer for union and inner for intersection.
ignore_index − boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.
keys − sequence, default None. Construct hierarchical index using the passed keys as the outermost level. If multiple levels passed, should contain
tuples.
levels − list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the
keys.
names − list, default None. Names for the levels in the resulting hierarchical index.
verify_integrity − boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the
actual data concatenation.
In [133]: 1 # print the two dataframes again before concatenating them
2
3 print(batsmen)
4 print(bowler)
id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
id Name subject_id
0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
In [134]: 1 # concatenate the dataframes
2
3
4 team=[batsmen, bowler]
5
6 pd.concat([batsmen, bowler])
7
Out[134]:
id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
In [135]: 1 # associate keys with the dataframes
2
3 pd.concat(team, keys=['x', 'y'])
4
Out[135]:
id Name subject_id
x 0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
y 0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
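Once keys are attached, each original frame can be retrieved from the resulting hierarchical index with loc. A small stand-alone sketch (using shortened frames for brevity):

```python
import pandas as pd

batsmen = pd.DataFrame({'id': [1, 2], 'Name': ['Rohit', 'Dhawan']})
bowler = pd.DataFrame({'id': [1, 2], 'Name': ['Kumar', 'Bumrah']})

# the keys become the outermost level of a MultiIndex on the rows
combined = pd.concat([batsmen, bowler], keys=['x', 'y'])

# select all rows that came from the first frame
print(combined.loc['x'])

# a single original row is addressed by a (key, row label) tuple
print(combined.loc[('y', 0)])
```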
In [136]: 1 pd.concat(team, keys=['x', 'y'], ignore_index=True)
Out[136]:
id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
5 1 Kumar sub2
6 2 Bumrah sub4
7 3 Shami sub3
8 4 Kuldeep sub6
9 5 Chahal sub5
In [137]: 1 # concatenate the dataframes along axis=1
2
3 pd.concat(team, axis=1)
Out[137]:
id Name subject_id id Name subject_id
0 1 Rohit sub1 1 Kumar sub2
1 2 Dhawan sub2 2 Bumrah sub4
2 3 Virat sub4 3 Shami sub3
3 4 Dhoni sub6 4 Kuldeep sub6
4 5 Kedar sub5 5 Chahal sub5
In [138]: 1 # append one dataframe to another
2
3 batsmen.append(bowler)
Out[138]:
id Name subject_id
0 1 Rohit sub1
1 2 Dhawan sub2
2 3 Virat sub4
3 4 Dhoni sub6
4 5 Kedar sub5
0 1 Kumar sub2
1 2 Bumrah sub4
2 3 Shami sub3
3 4 Kuldeep sub6
4 5 Chahal sub5
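A closing note on row-wise stacking: DataFrame.append produces the same result as the concat call above, but it was deprecated in pandas 1.4, and pd.concat is the recommended spelling going forward. A minimal sketch:

```python
import pandas as pd

batsmen = pd.DataFrame({'id': [1, 2], 'Name': ['Rohit', 'Dhawan']})
bowler = pd.DataFrame({'id': [1, 2], 'Name': ['Kumar', 'Bumrah']})

# idiomatic replacement for the deprecated batsmen.append(bowler):
# stack the frames row-wise and renumber the index 0..n-1
stacked = pd.concat([batsmen, bowler], ignore_index=True)
print(stacked)
```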