Unit 4 PPT Part2 - Pandas

Unit IV of the document covers Python libraries for data wrangling, focusing on NumPy and Pandas. It discusses data manipulation techniques using Pandas, including data indexing, selection, handling missing data, and operations on DataFrames. The document also highlights the advantages of Pandas over NumPy, particularly in managing structured data and performing complex data operations.


Unit-4 PANDAS

JAL1301 DATA SCIENCE USING PYTHON

UNIT IV: PYTHON LIBRARIES FOR DATA WRANGLING

SYLLABUS
Basics of NumPy arrays – Aggregations – Computations on arrays – Comparisons, masks, boolean logic – Fancy indexing – Sorting arrays – Structured data
Data manipulation with Pandas – Data indexing and selection – Operating on data – Handling missing data – Hierarchical indexing – Combining datasets – Aggregation and grouping – Pivot tables.
DATA MANIPULATION WITH PANDAS
PANDAS
 NumPy and its ndarray object provide efficient storage and manipulation of dense, typed arrays in Python.
 NumPy’s ndarray data structure provides the essential features for the clean, well-organized data typically seen in numerical computing tasks.
 Limitations:
   less flexibility (attaching labels to data, working with missing data, etc.), and
   operations that do not map well to element-wise broadcasting (groupings, pivots, etc.).
 Both are important pieces of analyzing the less structured data available in many forms in the world around us.

PANDAS
 The Pandas library provides efficient data structures.
 Pandas  a newer package built on top of NumPy  provides an efficient implementation of a DataFrame.
 DataFrames  essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
 Pandas, with its Series and DataFrame objects, builds on the NumPy array structure.
 Provides efficient access to the sorts of data munging (data wrangling) tasks that occupy much of a data scientist’s time.
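As a quick illustration (a minimal sketch, not from the slides), labeled axes and built-in missing-data handling are exactly what a plain ndarray lacks:

import numpy as np
import pandas as pd

# NumPy: positional indexing only; NaN must be managed by hand
arr = np.array([1.0, np.nan, 3.0])

# Pandas: labels attach to the data, and aggregations are NaN-aware
s = pd.Series(arr, index=['a', 'b', 'c'])
print(s['a'])     # access by label
print(s.mean())   # 2.0 -- NaN is skipped by default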

DATA INDEXING AND SELECTION

DATA INDEXING AND SELECTION
• Data Selection in Series
  • Series as dictionary
  • Series as 1-d array
  • Indexers: loc, iloc, and ix
• Data Selection in DataFrame
  • DataFrame as a dictionary
  • DataFrame as a 2-d array
  • Additional indexing conventions

DATA INDEXING AND SELECTION
• Methods and tools to access, set, and modify values in NumPy arrays  indexing, slicing, masking, fancy indexing, and combinations thereof.
• Similar means of accessing and modifying values are available for Pandas Series and DataFrame objects.
• Simpler case  data selection in a 1-d Series object
• Complicated case  data selection in a 2-d DataFrame object

DATA SELECTION IN SERIES – SERIES AS DICTIONARY
Series object
• Provides a mapping from a collection of keys to a collection of values

import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)
print(data['b'])

Output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
0.5

DATA SELECTION IN SERIES – SERIES AS DICTIONARY
Series object
• Dictionary-like Python expressions and methods are available to examine the keys/indices and values.
• A Series can be extended by assigning to a new index value:

print('b' in data)
print(data.keys())
print(list(data.items()))
data['e'] = 1.62

Output:
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

• Pandas makes the necessary decisions about memory layout and data copying behind the scenes.

DATA SELECTION IN SERIES – SERIES AS 1-DIMENSIONAL ARRAY
Series object
• Can also be used for slicing, masking, and fancy indexing

print(data['a' : 'c'])                    # slicing by explicit index
print(data[0 : 2])                        # slicing by implicit integer index
print(data[(data > 0.3) & (data < 0.8)])  # masking
print(data[['a', 'e']])                   # fancy indexing

Output 1:
a    0.25
b    0.50
c    0.75
dtype: float64

Output 2:
a    0.25
b    0.50
dtype: float64

Output 3:
b    0.50
c    0.75
dtype: float64

Output 4:
a    0.25
e    1.62
dtype: float64

DATA SELECTION IN SERIES – INDEXERS: LOC, ILOC, AND IX
Series object
• If a Series has an explicit integer index, indexing and slicing can be confusing: indexing uses the explicit index, while slicing uses the implicit positional index.

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])
print(data[1])      # explicit index when indexing
print(data[0 : 2])  # implicit index when slicing

Output 1:
0.25

Output 2:
1    0.25
2    0.50
dtype: float64

DATA SELECTION IN SERIES – INDEXERS: LOC, ILOC, AND IX
Series object
• To avoid confusion, loc (for explicit indexing) and iloc (for implicit indexing) can be used. (The ix indexer was used in the context of DataFrame objects.)

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])
print(data.loc[1])       # explicit index when indexing
print(data.loc[1 : 3])   # explicit index when slicing
print(data.iloc[1])      # implicit index when indexing
print(data.iloc[1 : 3])  # implicit index when slicing

Output 1:
0.25

Output 2:
1    0.25
2    0.50
3    0.75
dtype: float64

Output 3:
0.5

Output 4:
2    0.50
3    0.75
dtype: float64

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• A DataFrame acts like a two-dimensional or structured array  like a dictionary of Series structures sharing the same index.

import pandas as pd
area = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
pop = pd.Series([100, 150, 125, 85], index=['a', 'b', 'c', 'd'])
data = pd.DataFrame({'AREA' : area, 'POPULATION' : pop})
print(data)

Output:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
d  1.00          85

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• The same DataFrame can be built from Series constructed directly from Python dictionaries:

import pandas as pd
area = pd.Series({'a' : 0.25, 'b' : 0.5, 'c' : 0.75, 'd' : 1.0})
pop = pd.Series({'a' : 100, 'b' : 150, 'c' : 125, 'd' : 85})
data = pd.DataFrame({'AREA' : area, 'POPULATION' : pop})

Output:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
d  1.00          85

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• Accessing an individual Series (column):

print(data['AREA'])   (or)   print(data.AREA)

Output for both cases:
a    0.25
b    0.50
c    0.75
d    1.00
Name: AREA, dtype: float64

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• Modifying the object  e.g., adding a new column:

data['DENSITY'] = data['POPULATION'] / data['AREA']
print(data)

Output:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667
d  1.00          85   85.000000

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object
• The full DataFrame can be transposed to swap rows and columns:

print(data)
print(data.T)

Output 1:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667
d  1.00          85   85.000000

Output 2:
                 a       b           c     d
AREA          0.25    0.50    0.750000   1.0
POPULATION  100.00  150.00  125.000000  85.0
DENSITY     400.00  300.00  166.666667  85.0

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object
• The raw underlying data array can be accessed using the values attribute:

print(data.values)
print(data.values[0])  # similar to array indexing

Output:
[[  0.25        100.          400.        ]
 [  0.5         150.          300.        ]
 [  0.75        125.          166.66666667]
 [  1.           85.           85.        ]]
[   0.25  100.    400.  ]

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object

print(data['AREA'])         # a single "index" passed to a DataFrame accesses a column
print(data.iloc[:3, :2])    # array slicing – using implicit indexing
print(data.loc['a' : 'c'])  # array slicing – using explicit indexing

Output 1:
a    0.25
b    0.50
c    0.75
d    1.00
Name: AREA, dtype: float64

Output 2:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125

Output 3:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object

print(data.ix[:3, :'POPULATION'])  # array slicing – using HYBRID indexing

• Rows 0 to 2, columns 'AREA' to 'POPULATION'

Output:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
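Note that the ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0. A minimal equivalent of the hybrid slice above, chaining iloc (positional rows) with loc (label-based columns):

import pandas as pd

area = pd.Series({'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0})
pop = pd.Series({'a': 100, 'b': 150, 'c': 125, 'd': 85})
data = pd.DataFrame({'AREA': area, 'POPULATION': pop})
data['DENSITY'] = data['POPULATION'] / data['AREA']

# positional row slice first, then label-based column slice
print(data.iloc[:3].loc[:, :'POPULATION'])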

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object
• To select the rows with population >= 100:

print(data.loc[data.POPULATION >= 100])  # array masking
print(data.loc[data.POPULATION >= 100, ['AREA', 'POPULATION']])

Output 1:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667

Output 2:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125

OPERATING ON DATA
• Ufuncs: Index Preservation
• Ufuncs: Index Alignment
  • Index alignment in Series
  • Index alignment in DataFrame

OPERATING ON DATA
• Pandas inherits much of NumPy’s ufunc machinery to perform quick element-wise operations, both basic arithmetic (addition, subtraction, multiplication, etc.) and more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
• Pandas will preserve index and column labels in the output.
• Pandas will automatically align indices when passing the objects to a ufunc.

OPERATING ON DATA – UFUNCS: INDEX PRESERVATION
• Consider example Pandas Series objects:

import pandas as pd
import numpy as np
range1 = np.random.RandomState(42)
series1 = pd.Series(range1.randint(0, 10, 4))
series2 = pd.Series(np.random.randint(0, 10, 4))
print(series1)  # same set of values for every execution (seeded)
print(series2)  # different set of values for each execution

Output 1:
0    6
1    3
2    7
3    4
dtype: int32

Output 2 (varies):
0    3
1    2
2    9
3    3
dtype: int32

OPERATING ON DATA – UFUNCS: INDEX PRESERVATION
• Consider an example Pandas DataFrame object:

import pandas as pd
import numpy as np
range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
print(df1)

Output:
   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4

OPERATING ON DATA – UFUNCS: INDEX PRESERVATION
• When a ufunc is applied to these objects, the result is another Pandas object with the indices preserved – the indices appear in the output.

print(np.exp(series1))  # np.log, np.log2, np.log10 can also be used
print(np.sin(df1))

Output 1:
0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

Output 2:
          A         B         C         D
0 -0.279415  0.412118  0.909297 -0.279415
1  0.656987 -0.756802  0.141120  0.656987
2  0.656987  0.909297 -0.958924 -0.756802

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN SERIES
• For binary operations on two Series or DataFrame objects, Pandas aligns indices in the process of performing the operation:

area = pd.Series({'a' : 0.25, 'b' : 0.5, 'c' : 0.75, 'd' : 1.0}, name = 'AREA')
pop = pd.Series({'a' : 100, 'b' : 150, 'c' : 125, 'd' : 85}, name = 'POPULATION')
print(pop / area)
area1 = pd.Series({'a' : 0.25, 'b' : 0.5, 'd' : 1.0}, name = 'AREA')
pop1 = pd.Series({'b' : 150, 'c' : 125, 'a' : 100}, name = 'POP')
print(pop1 / area1)

Output 1:
a    400.000000
b    300.000000
c    166.666667
d     85.000000
dtype: float64

Output 2:
a    400.0
b    300.0
c      NaN
d      NaN
dtype: float64

• The result contains the union of all indices.
• NaN  Not a Number  missing data

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN SERIES
• For binary operations on two Series or DataFrame objects, Pandas aligns indices in the process of performing the operation:

A = pd.Series([2, 4, 6, 5], index=[1, 2, 3, 4])
B = pd.Series([1, 3, 5, 7], index=[1, 2, 3, 4])
print(A + B)
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)

Output 1:
1     3
2     7
3    11
4    12
dtype: int64

Output 2:
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

• The result contains the union of all indices; entries present in only one Series become NaN (missing data).

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN SERIES
• The fill value can be specified explicitly for any elements in A or B that might be missing:

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A.add(B, fill_value=0))

Output:
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN DATAFRAME
• For binary operations on DataFrames, Pandas aligns both row indices and column labels:

range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 20, (2, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(range1.randint(0, 10, (3, 3)), columns=list('BAC'))
print(df1)
print(df2)
print(df1 + df2)

Output 1 (df1):
    A   B
0   6  19
1  14  10

Output 2 (df2):
   B  A  C
0  7  4  6
1  9  2  6
2  7  4  3

Output 3 (df1 + df2):
      A     B   C
0  10.0  26.0 NaN
1  16.0  19.0 NaN
2   NaN   NaN NaN

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN DATAFRAME
• Fill the missing values with the mean of all values in df1:

range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 20, (2, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(range1.randint(0, 10, (3, 3)), columns=list('BAC'))
fill = df1.stack().mean()  # fill = 12.25 for this output
print(df1.add(df2, fill_value=fill))

Output:
       A      B      C
0  10.00  26.00  18.25
1  16.00  19.00  18.25
2  16.25  19.25  15.25

HANDLING MISSING DATA
• Trade-Offs in Missing Data Conventions
• Missing Data in Pandas
  • None: Pythonic missing data
  • NaN: Missing numerical data
  • NaN and None in Pandas
• Operating on Null Values
  • Detecting null values
  • Dropping null values
  • Filling null values

HANDLING MISSING DATA
• Real-world data is rarely clean and homogeneous.
• It will usually have some amount of data missing.
• Different data sources may indicate missing data in different ways.
• Pandas chooses to represent missing data as null, NaN, or NA values.
• Built-in Pandas tools are available for handling missing data in Python.

TRADE-OFFS IN MISSING DATA CONVENTIONS
• Two strategies to indicate the presence of missing data in a table or DataFrame:
1. Masking approach – using a mask that globally indicates missing values – a separate Boolean array, or one bit in the data representation to locally indicate the null status of a value.
2. Sentinel approach – choosing a sentinel value that indicates a missing entry – some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or a more global convention, such as indicating a missing floating-point value with the special value NaN (Not a Number).
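A minimal sketch contrasting the two conventions in NumPy (illustrative, not from the slides):

import numpy as np
import numpy.ma as ma

vals = np.array([1.0, -9999.0, 3.0])

# Masking approach: a separate Boolean mask marks the missing entry
masked = ma.masked_equal(vals, -9999.0)
print(masked.mean())         # 2.0 -- masked entries are ignored

# Sentinel approach: an in-band special value (here NaN) marks it
sentinel = np.where(vals == -9999.0, np.nan, vals)
print(np.nanmean(sentinel))  # 2.0 -- NaN-aware aggregation required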

TRADE-OFFS IN MISSING DATA CONVENTIONS
• Trade-offs between the two strategies:
• Masking approach – use of a separate mask array requires allocation of an additional Boolean array  adds overhead in both storage and computation.
• Sentinel approach – a sentinel value reduces the range of valid values, and special values like NaN are not available for all data types.

MISSING DATA IN PANDAS – NONE: PYTHONIC MISSING DATA
• None  a Python singleton object often used for missing data in Python code  can be used only in arrays with data type 'object' (i.e., arrays of Python objects).

import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
print(vals1)
print(sum(vals1))

Output:
[1 None 3 4]    # dtype=object
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

MISSING DATA IN PANDAS – NAN: MISSING NUMERICAL DATA
• NaN  acronym for Not a Number  a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
• NaN is a bit like a data virus  it infects any other object it touches  aggregates over the values are well defined (i.e., they don’t result in an error) but the result is nan  special aggregations exist that ignore these missing values (e.g., np.nansum).

import numpy as np
import pandas as pd
vals2 = np.array([1, np.nan, 3, 4])
print(vals2)
print(vals2.dtype)
print(sum(vals2))
print(1 + np.nan)
print(np.nansum(vals2))

Output:
[ 1. nan  3.  4.]
float64
nan
nan
8.0

MISSING DATA IN PANDAS – NAN AND NONE IN PANDAS
• Pandas is built to handle NaN and None nearly interchangeably, converting between them where appropriate.
• For example, if we set a value in an integer array to np.nan, it is automatically upcast to a floating-point type to accommodate the NA.
• Likewise, Pandas automatically converts None to a NaN value.

print(pd.Series([1, np.nan, 2, None]))

x = pd.Series(range(2), dtype=int)
print(x)
x[0] = None
print(x)

Output 1:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Output 2:
0    0
1    1
dtype: int64

Output 3:
0    NaN
1    1.0
dtype: float64
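As an aside (a newer pandas feature, beyond what the slides cover), nullable extension dtypes keep integers intact by using the pd.NA marker instead of upcasting to float:

import pandas as pd

# nullable integer dtype: the missing value becomes <NA>, dtype stays Int64
s = pd.Series([1, None, 3], dtype='Int64')
print(s)
print(s.dtype)  # Int64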

OPERATING ON NULL VALUES
• Pandas methods for detecting, removing, and replacing null values in Pandas data structures:
• isnull()  generate a Boolean mask indicating missing values
• notnull()  opposite of isnull()
• dropna()  return a filtered version of the data

# Detecting null values
data = pd.Series([10, np.nan, 20, None])
print(data.isnull())
print(data.notnull())

# Dropping null values in a Series
print(data.dropna())

Output 1 (isnull):
0    False
1     True
2    False
3     True
dtype: bool

Output 2 (notnull):
0     True
1    False
2     True
3    False
dtype: bool

Output 3 (dropna):
0    10.0
2    20.0
dtype: float64

OPERATING ON NULL VALUES
• dropna()  return a filtered version of the data

# Dropping null values in a DataFrame
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
print(df.dropna())        # rows containing NaN are dropped
print(df.dropna(axis=1))  # columns containing NaN are dropped
df.dropna(axis='columns', how='all')  # drops a column only if ALL values are NaN
df.dropna(axis='columns', how='any')  # drops a column if ANY value is NaN (default)

Output 1 (rows dropped):
     0    1  2
1  2.0  3.0  5

Output 2 (columns dropped):
   2
0  2
1  5
2  6
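dropna() also accepts a thresh parameter (standard pandas API), specifying the minimum number of non-null values a row or column must have to be kept; a small sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
# keep only rows with at least 3 non-null values (here, only row 1)
print(df.dropna(thresh=3))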

OPERATING ON NULL VALUES
• fillna()  return a copy of the data with missing values filled

# Filling null values in a Series
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data.fillna(0))
print(data.fillna(method='ffill'))  # forward-fill
print(data.fillna(method='bfill'))  # backward-fill

Output 1:
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

Output 2:
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Output 3:
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
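In recent pandas releases the method= keyword of fillna() is deprecated; the dedicated .ffill() and .bfill() methods are the forward-compatible spelling (a note beyond the slides):

import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data.ffill())  # same result as fillna(method='ffill')
print(data.bfill())  # same result as fillna(method='bfill')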

COMBINING DATASETS
• Concat and append
  • np.concatenate function
  • pd.concat method
  • pd.concat with inner join
  • append method
• Merge and join
  • One-to-one join
  • Many-to-one join

COMBINING DATASETS
• Combining different data sources ranges from:
  • simple concatenation of two different datasets, to
  • complicated database-style joins and merges that correctly handle any overlaps between the datasets.
• Pandas includes functions and methods that make this sort of data wrangling fast and straightforward – on both Series and DataFrames.
• Function used – pd.concat

COMBINING DATASETS
• Creating a DataFrame:

import numpy as np
import pandas as pd
cols = pd.Series(['A', 'B', 'C'])
ind = np.arange(3)
vals = {c: [str(c) + str(i) for i in ind] for c in cols}
data = pd.DataFrame(vals, ind)
print(data)

Output:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

COMBINING DATASETS
• Concatenation of NumPy arrays  done via the np.concatenate function:

x = [1, 2, 3]
y = [4, 5, 6]
data = np.concatenate([x, y])
print(data)
x = [[1, 2], [3, 4]]
data1 = np.concatenate([x, x], axis=0)
data2 = np.concatenate([x, x], axis=1)
print(data1, data2)

Output:
[1 2 3 4 5 6]

[[1 2]
 [3 4]
 [1 2]
 [3 4]]

[[1 2 1 2]
 [3 4 3 4]]

COMBINING DATASETS – CONCATENATE & APPEND
• Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays 
  • done via the pd.concat function
  • done via the append method

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
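Two commonly used options from this signature (standard pd.concat parameters): ignore_index=True discards the incoming indices and builds a fresh RangeIndex, while keys= labels each source with an outer MultiIndex level. A short sketch:

import pandas as pd

df1 = pd.DataFrame({'COL1': ['A1', 'A2']}, index=[1, 2])
df2 = pd.DataFrame({'COL1': ['A3', 'A4']}, index=[1, 2])

print(pd.concat([df1, df2], ignore_index=True))  # index becomes 0..3
print(pd.concat([df1, df2], keys=['x', 'y']))    # hierarchical ('x'/'y') index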

COMBINING DATASETS – CONCATENATE & APPEND
• To use the pd.concat function for row-wise concatenation, the column names must be the same in both DataFrames:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL1':['A3', 'A4'], 'COL2':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(pd.concat([df1, df2]))  # row-wise concatenation

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL1 COL2
3   A3   B3
4   A4   B4

Output 3:
  COL1 COL2
1   A1   B1
2   A2   B2
3   A3   B3
4   A4   B4

COMBINING DATASETS – CONCATENATE & APPEND
• To use the pd.concat function for column-wise concatenation (with axis=1), the row indices must be the same in both DataFrames:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL3':['A3', 'A4'], 'COL4':['B3', 'B4']}, index=[1,2])
print(df1)
print(df2)
print(pd.concat([df1, df2], axis=1))  # column-wise concatenation

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL3 COL4
1   A3   B3
2   A4   B4

Output 3:
  COL1 COL2 COL3 COL4
1   A1   B1   A3   B3
2   A2   B2   A4   B4

COMBINING DATASETS – CONCATENATE & APPEND
• With pd.concat row-wise, if the column names are not the same in both DataFrames, the union of the column names is taken, and NaN is used for missing values:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL2':['A3', 'A4'], 'COL3':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(pd.concat([df1, df2]))  # row-wise concatenation

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL2 COL3
3   A3   B3
4   A4   B4

Output 3:
  COL1 COL2 COL3
1   A1   B1  NaN
2   A2   B2  NaN
3  NaN   A3   B3
4  NaN   A4   B4

COMBINING DATASETS – CONCATENATE & APPEND
• To avoid NaN, an intersection of the columns can be taken using join='inner' in pd.concat:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL2':['A3', 'A4'], 'COL3':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(pd.concat([df1, df2], join='inner'))  # row-wise inner join

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL2 COL3
3   A3   B3
4   A4   B4

Output 3:
  COL2
1   B1
2   B2
3   A3
4   A4


COMBINING DATASETS – CONCATENATE & APPEND
• Concatenation using the append method – returns a new DataFrame object – not as efficient, since it creates a new index and data buffer.
• Multiple append() calls are equivalent to a single pd.concat()  prefer pd.concat to concatenate several DataFrames.

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL1':['A3', 'A4'], 'COL2':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(df1.append(df2))

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL1 COL2
3   A3   B3
4   A4   B4

Output 3:
  COL1 COL2
1   A1   B1
2   A2   B2
3   A3   B3
4   A4   B4
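Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the supported replacement:

# equivalent, forward-compatible spelling
print(pd.concat([df1, df2]))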


COMBINING DATASETS – MERGE & JOIN
• pd.merge() implements a subset of what is known as relational algebra:
  • a formal set of rules for manipulating relational data, which forms the conceptual foundation of the operations available in most databases
  • several primitive operations, which become the building blocks of more complicated operations on any dataset
• Pandas provides the pd.merge() function and the related join() method of Series and DataFrames.
• The pd.merge() function implements a number of types of joins: one-to-one, many-to-one, and many-to-many joins.

COMBINING DATASETS – MERGE & JOIN
• One-to-one join:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']})
df2 = pd.DataFrame({'COL1':['A2', 'A1'], 'COL3':['B3', 'B4']})
print(df1)
print(df2)
print(pd.merge(df1, df2))

Output 1:
  COL1 COL2
0   A1   B1
1   A2   B2

Output 2:
  COL1 COL3
0   A2   B3
1   A1   B4

Output 3:
  COL1 COL2 COL3
0   A1   B1   B4
1   A2   B2   B3

COMBINING DATASETS – MERGE & JOIN
• One-to-one join  similar to the column-wise concatenation seen earlier.
• The number of key values in both DataFrames to be merged is the same.

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1)
print(df2)
df3 = pd.merge(df1, df2)
print(df3)

Output (df3):
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014


COMBINING DATASETS – MERGE & JOIN
• Many-to-one join  one of the two key columns contains duplicate entries:

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2)
print(df3)
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(pd.merge(df3, df4))

COMBINING DATASETS – MERGE & JOIN
• Many-to-one join  one of the two key columns contains duplicate entries.
• df3 and df4 merged (the 'group' key of df4 repeats for each matching row of df3):

Output:
  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
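For completeness (the third join type pd.merge supports, sketched on the same employee example): in a many-to-many join the key column contains duplicates in both DataFrames, so matching rows are combined pairwise.

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
print(pd.merge(df1, df5))  # each employee row repeats once per matching skill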

AGGREGATION AND GROUPING
• Simple Aggregation in Pandas
• GroupBy: Split, Apply, Combine

SIMPLE AGGREGATION IN PANDAS
• Efficient summarization – computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.
• For a Series object, each aggregate returns a single value, as sketched below.
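A minimal sketch of these aggregates on a Series and a DataFrame (illustrative values, not from the slides; on a DataFrame each aggregate returns one value per column):

import pandas as pd

ser = pd.Series([0.25, 0.5, 0.75, 1.0])
print(ser.sum())    # 2.5
print(ser.mean())   # 0.625

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
print(df.mean())      # column-wise: A 2.0, B 20.0
print(df.describe())  # count, mean, std, min, quartiles, max per column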

GROUPBY: SPLIT, APPLY, COMBINE
• The GroupBy accomplishes the following (see the sketch after this list):
   The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
   The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
   The combine step merges the results of these operations into an output array.
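A minimal sketch of split-apply-combine (hypothetical data, not from the slides):

import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15]})
# split on 'key', apply sum() within each group, combine into one result
print(df.groupby('key').sum())

Output:
     data
key
A       5
B      15
C      25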

PIVOT TABLES
• Pivot tables
• Multilevel pivot tables
• Additional pivot table options

PIVOT TABLES
• The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
• Pivot tables  a multidimensional version of GroupBy aggregation  the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.
• Method used  pivot_table()

PIVOT TABLES
• Sample dataset used  the database of passengers on the Titanic:

import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

PIVOT TABLES
• Example GroupBy operation  to find the survival rate by gender:

titanic.groupby('sex')[['survived']].mean()

• This finds the groups based on 'sex', then takes the mean of 'survived' within each group (female, male).

Output:
        survived
sex
female  0.742038
male    0.188908

• Roughly three of every four females on board survived (0.742 ≈ 3/4), while only about one in five males survived (0.189 ≈ 1/5).

PIVOT TABLES
• Example GroupBy operation  to find the survival rate by gender and class:
  group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack to reveal the hidden multidimensionality.

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

Output:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

• Class-wise, gender-wise survival  a multidimensional aggregation operation.

PIVOT TABLES
• Multidimensional aggregation can be handled simply using a pivot table  the method used is pivot_table():

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

• values  the column(s) to be aggregated in the pivot table
• index  row titles
• columns  column titles
• aggfunc  type of aggregation applied  'mean' by default
• fill_value & dropna  to handle missing values
• margins  if True  totals are computed along each grouping (row-wise and column-wise)
• observed  if True, shows only observed values for categorical groupers
• sort  if True, the results are sorted

PIVOT TABLES
• The multidimensional aggregation operation using groupby:

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

• The equivalent operation using pivot_table():

titanic.pivot_table('survived', index='sex', columns='class')

Output:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

• Class-wise, gender-wise survival  a two-dimensional pivot operation.

MULTILEVEL PIVOT TABLES
• The grouping in pivot tables can be specified with multiple levels, and via a number of options.
• Example – let age be a third dimension (apart from class & sex):

age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

• pd.cut() splits the age data into 2 bins  (0, 18] and (18, 80].

Output:
class               First    Second     Third
sex    age
female (0, 18]   0.909091  1.000000  0.511628
       (18, 80]  0.972973  0.900000  0.423729
male   (0, 18]   0.800000  0.600000  0.215686
       (18, 80]  0.375000  0.071429  0.133663

• Class-wise, gender-wise, age-wise survival  a three-dimensional operation.

MULTILEVEL PIVOT TABLES
• Example – let fare be a fourth dimension (apart from sex, age, class):

fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])

• pd.qcut() splits the 'fare' data into 2 bins based on quantiles (equal numbers of passengers per bin).
• Output: a four-dimensional aggregation (table omitted here).

ADDITIONAL PIVOT TABLE OPTIONS
• Example: different aggregation functions applied per column (when aggfunc is a dict, the values keyword is omitted  the dict keys name the columns):

titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': 'sum', 'fare': 'mean'})

ADDITIONAL PIVOT TABLE OPTIONS
• Example: computation of totals along each grouping – using margins=True  the margin labels (row name and column name for the totals) are 'All' by default.
• Instead of 'All', the margin label can be specified with the margins_name keyword.

titanic.pivot_table('survived', index='sex', columns='class', margins=True)
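A short sketch of the margins_name keyword (a standard pivot_table parameter):

# totals row/column labeled 'Total' instead of the default 'All'
titanic.pivot_table('survived', index='sex', columns='class',
                    margins=True, margins_name='Total')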
