Unit 4 PPT Part 2 - Pandas
JAL1301 DATA SCIENCE USING PYTHON
UNIT IV: PYTHON LIBRARIES FOR DATA WRANGLING
DATA MANIPULATION WITH PANDAS
PANDAS
NumPy and its ndarray object provide efficient storage and manipulation of dense, typed arrays in Python.
NumPy's ndarray data structure provides the essential features for the clean, well-organized data typically seen in numerical computing tasks.
Limitations:
• less flexibility (attaching labels to data, working with missing data, etc.), and
• awkwardness when attempting operations that do not map well to element-wise broadcasting (groupings, pivots, etc.).
Both of these are important parts of analyzing the less structured data available in many forms in the world around us.
The Pandas library provides efficient data structures for working with labeled data.
Pandas, a newer package built on top of NumPy, provides an efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
Pandas, with its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to the sorts of data munging (data wrangling) tasks that occupy much of a data scientist's time.
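As a minimal sketch (not taken from the slides) of the two objects just described, assuming only NumPy and Pandas are available:

import numpy as np
import pandas as pd

# Series: a one-dimensional labeled array built on a NumPy array
ser = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(ser)

# DataFrame: attached row and column labels; heterogeneous types
# and missing data are allowed
df = pd.DataFrame({'value': ser, 'label': ['low', 'low', 'high', None]})
print(df)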
DATA INDEXING AND SELECTION
Output 1:
0.25

Output 2:
1    0.25
2    0.50
dtype: float64
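The code that produced these outputs is not part of the extracted slides; the following sketch reproduces outputs of this shape under the assumption that a Series with an explicit integer index is being indexed and sliced (the index values 1 to 4 and the data values are assumptions):

import pandas as pd

data = pd.Series([0.25, 0.50, 0.75, 1.00], index=[1, 2, 3, 4])

# Label-based selection returns a scalar (Output 1)
print(data.loc[1])     # 0.25

# Label-based slicing with loc is inclusive of the endpoint (Output 2)
print(data.loc[1:2])   # 1    0.25
                       # 2    0.50
                       # dtype: float64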
Output 1:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667

Output 2:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
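The slide code behind these outputs is likewise missing; a sketch under the assumption that a DataFrame of AREA and POPULATION is built, a DENSITY column is computed from them, and a subset of columns is then selected:

import pandas as pd

df = pd.DataFrame({'AREA': [0.25, 0.50, 0.75],
                   'POPULATION': [100, 150, 125]},
                  index=['a', 'b', 'c'])

# Adding a computed column (Output 1 shows the frame with DENSITY attached)
df['DENSITY'] = df['POPULATION'] / df['AREA']
print(df)

# Selecting a subset of columns with a list of labels (Output 2)
print(df[['AREA', 'POPULATION']])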
OPERATING ON DATA
• Ufuncs: Index Preservation
• Ufuncs: Index Alignment
• Index Alignment in Series
• Index Alignment in DataFrame
• Pandas inherits much of NumPy's ufunc machinery, so quick element-wise operations work on Pandas objects, both basic arithmetic (addition, subtraction, multiplication, etc.) and more sophisticated operations (trigonometric, exponential, and logarithmic functions, etc.).
• Pandas will preserve index and column labels in the output.
• Pandas will automatically align indices when passing the objects to the ufunc, as illustrated in the sketch below.
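A short sketch (not from the slides) illustrating both behaviours: index preservation under a NumPy ufunc, and index alignment when two Series with different indices are combined; the state names and figures are illustrative assumptions:

import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(np.exp(ser))        # the ufunc result keeps the index a, b, c

area = pd.Series({'Alaska': 1723337, 'Texas': 695662})
population = pd.Series({'California': 38332521, 'Texas': 26448193})
print(population / area)  # indices are aligned; Alaska and California become NaN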
import numpy as np
import pandas as pd

range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 20, (2, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(range1.randint(0, 10, (3, 3)), columns=list('BAC'))
fill = df1.stack().mean()  # fill = 12.25 for the output below
print(df1.add(df2, fill_value=fill))

Output 1 (df1):
    A   B
0   6  19
1  14  10

Output (df2):
   B  A  C
0  7  4  6
1  9  2  6
2  7  4  3

Output 2 (df1.add(df2, fill_value=fill)):
       A      B      C
0  10.00  26.00  18.25
1  16.00  19.00  18.25
2  16.25  19.25  15.25
HANDLING MISSING DATA
• Trade-Offs in Missing Data Conventions
• Missing Data in Pandas
  • None: Pythonic missing data
  • NaN: Missing numerical data
  • NaN and None in Pandas
• Operating on Null Values
  • Detecting null values
  • Dropping null values
  • Filling null values
Output:
array([1, None, 3, 4], dtype=object)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
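The code for this slide is missing from the extract; a sketch that produces output of this form, assuming a NumPy array containing None followed by an aggregation:

import numpy as np

# None forces NumPy to fall back to a Python object array
vals1 = np.array([1, None, 3, 4])
print(repr(vals1))   # array([1, None, 3, 4], dtype=object)

# Aggregating an object array that contains None raises a TypeError
vals1.sum()          # TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'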
# Dropping null values in a DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
print(df.dropna())                    # entire row containing NaN dropped
print(df.dropna(axis=1))              # entire column containing NaN dropped
df.dropna(axis='columns', how='all')  # drops a column only if all of its values are NaN
df.dropna(axis='columns', how='any')  # drops a column if any of its values is NaN

Output of df.dropna():
     0    1  2
1  2.0  3.0  5

Output of df.dropna(axis=1):
   2
0  2
1  5
2  6
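The outline above also lists detecting and filling null values; those slides' code is not in the extract, so the following sketch shows the corresponding Pandas methods (isnull(), notnull(), fillna(), ffill()) on made-up data:

import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())          # Boolean mask marking the null entries
print(data[data.notnull()])   # keep only the non-null entries

ser = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(ser.fillna(0))          # replace nulls with a fixed value
print(ser.ffill())            # forward-fill: propagate the previous valid value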
COMBINING DATASETS
• Concat and append
  • np.concatenate function
  • pd.concat function
  • pd.concat with inner join
  • append (row-wise concatenation)
• Merge and join
  • One-to-one join
  • Many-to-one join
• Combining data from different sources ranges from:
  • simple concatenation of two different datasets, to
  • complicated database-style joins and merges, handling any overlaps between the datasets.
• Pandas includes functions and methods that make this sort of data wrangling fast and straightforward, on both Series and DataFrames.
• Function used: pd.concat (examples of np.concatenate, pd.concat, and pd.merge follow below).
• A DataFrame created:
import numpy as np
import pandas as pd

cols = pd.Series(['A', 'B', 'C'])
ind = np.arange(3)
vals = {c: [str(c) + str(i) for i in ind] for c in cols}
data = pd.DataFrame(vals, ind)
print(data)

Output:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
• Concatenation of NumPy arrays is done via the np.concatenate function:
x = [1, 2, 3]
y = [4, 5, 6]
data = np.concatenate([x, y])
print(data)

x = [[1, 2], [3, 4]]
data1 = np.concatenate([x, x], axis=0)
data2 = np.concatenate([x, x], axis=1)
print(data1, data2)

Output:
[1 2 3 4 5 6]

[[1 2]
 [3 4]
 [1 2]
 [3 4]]
[[1 2 1 2]
 [3 4 3 4]]
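The pd.concat slides themselves are not included in this extract; a sketch of pd.concat on Series and DataFrames, including an inner join and row-wise appending (the example data are assumptions):

import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
print(pd.concat([ser1, ser2]))              # simple concatenation of two Series

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4']})
print(pd.concat([df5, df6]))                # union of columns; missing entries become NaN
print(pd.concat([df5, df6], join='inner'))  # inner join keeps only the shared column B

# Appending rows is concatenation along the row axis with a fresh index
print(pd.concat([df5, df6], ignore_index=True))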
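The merge-and-join slides are also missing from the extract; a sketch of one-to-one and many-to-one joins with pd.merge, using made-up employee data:

import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# One-to-one join: the key 'employee' appears exactly once in each table
df3 = pd.merge(df1, df2)
print(df3)

# Many-to-one join: 'group' repeats in df3 but is unique in df4
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(pd.merge(df3, df4))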
AGGREGATION AND GROUPING
• Simple Aggregation in Pandas
• GroupBy: Split, Apply, Combine (see the sketch following this list)
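The aggregation and GroupBy example slides are not in the extract; a short sketch of simple aggregation and of the split-apply-combine pattern on a small made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 1, 2, 3, 4, 5]})

# Simple aggregations summarize an entire column at once
print(df['data'].sum(), df['data'].mean())
print(df.describe())

# GroupBy: split the rows by key, apply an aggregate, combine the results
print(df.groupby('key')['data'].sum())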
PIVOT TABLES
• Pivot tables
• Multilevel pivot tables
• Additional pivot table options
• A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
• Pivot tables are essentially a multidimensional version of GroupBy aggregation: the split and the combine happen not along a one-dimensional index, but across a two-dimensional grid.
• Method used: pivot_table()
• Sample dataset used: the database of passengers on the Titanic.
import numpy as np
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head()
• Example: a GroupBy operation to find the survival rate by gender:
titanic.groupby('sex')[['survived']].mean()
• This splits the rows into groups by 'sex', then finds the mean of 'survived' for each value of 'sex' (female, male) in the dataset.

Output:
        survived
sex
female  0.742038
male    0.188908

Roughly three of every four females on board survived, while only about one in five males survived.
• Example: a GroupBy operation to find the survival rate by gender and class:
• Group by sex and class, select survival, apply a mean aggregate, combine the resulting groups, and then unstack to reveal the hidden multidimensionality.
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
• Multidimensional aggregation can be handled simply using the pivot table method pivot_table():
DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
• values: the column(s) on which the pivot table is to be generated (aggregated)
• index: row titles
• columns: column titles
• aggfunc: type of aggregation applied; 'mean' by default
• fill_value & dropna: how to handle missing values
• margins: if True, totals are computed along each grouping (row-wise and column-wise)
• observed: if True, only observed values are shown for categorical groupers
• sort: if True, the results are sorted
• The multidimensional aggregation operation using GroupBy:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
• The equivalent operation using pivot_table():
titanic.pivot_table('survived', index='sex', columns='class')

Output:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

Class-wise, gender-wise survival rates: a two-dimensional operation.
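The multilevel pivot table and additional-options slides are not in the extract; a sketch of both on the same Titanic data (the age bins, the per-column aggfunc dictionary, and the use of margins are assumptions):

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Multilevel pivot table: bin age to add a third dimension on the rows
age = pd.cut(titanic['age'], [0, 18, 80])
print(titanic.pivot_table('survived', index=['sex', age], columns='class'))

# Additional options: a different aggregate per column, and totals via margins
print(titanic.pivot_table(index='sex', columns='class',
                          aggfunc={'survived': 'sum', 'fare': 'mean'}))
print(titanic.pivot_table('survived', index='sex', columns='class', margins=True))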