Pandas- Data manipulation and Analysis Library - Educative

The document provides an overview of the pandas library in Python, highlighting its capabilities for data manipulation, analysis, and visualization. It covers key concepts such as Series and DataFrame structures, data import/export methods, data cleaning, transformation, and visualization techniques using examples from the Titanic dataset. Additionally, it explains how to perform operations like filtering, grouping, and applying functions to data.

Uploaded by

Jitendra Arethiya


Pandas for Data Manipulation and Analysis

Introduction to pandas
1: What is pandas?
Pandas is an open-source library in Python used for data cleaning, manipulation and analysis. It
provides the data structures and functions needed to work efficiently with tabular or structured
data (such as CSV files, Excel sheets, SQL databases, etc.).

2: Pandas installation
pip install pandas

3: Using Python modules:

import pandas as pd # To import pandas

Example: Titanic dataset

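We will use the Titanic dataset as an example. The column layout shown below matches the copy
distributed with seaborn's sample datasets; pandas itself does not bundle the data. It contains
information about passengers aboard the Titanic, including their survival status, age, gender,
ticket class, and fare.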
survived  pclass  sex     age   sibsp  parch  fare     embarked  class  who    adult_male  deck  embark_town  alive  alone
0         3       male    22.0  1      0      7.2500   S         Third  man    True        NaN   Southampton  no     False
1         1       female  38.0  1      0      71.2833  C         First  woman  False       C     Cherbourg    yes    False
1         3       female  26.0  0      0      7.9250   S         Third  woman  False       NaN   Southampton  yes    True
1         1       female  35.0  1      0      53.1000  S         First  woman  False       C     Southampton  yes    False
0         3       male    35.0  0      0      8.0500   S         Third  man    True        NaN   Southampton  no     True

Pandas Data Structures

1: Series:
A Series is a one-dimensional labeled array capable of holding data of any type (integer,
float, string, etc.).

Creating series:
From lists, arrays, or dictionaries: pd.Series(data)

Accessing elements:
By position: series.iloc[index]
By indexes/labels: series.loc[label]

# Creating a series from a list

age = [22.0, 38.0, 26.0, 35.0, 35.0]
s = pd.Series(age)

# Accessing elements by position

print(s.iloc[0]) # Output: 22.0
print(s.iloc[2]) # Output: 26.0

# Accessing elements by index/labels

print(s.loc[0]) # Output: 22.0
print(s.loc[2]) # Output: 26.0

# Creating a series from a dictionary

data_dict = {'survived': 0, 'pclass': 3, 'sex': 'male', 'age': 22.0, 'sibsp': 1,
             'parch': 0}
series_from_dict = pd.Series(data_dict)
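
The difference between position-based and label-based access is easier to see when the index is not the default 0, 1, 2. The following is a minimal sketch; the labels 'p1', 'p2', 'p3' are hypothetical and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical passenger labels used as the index (assumption for illustration)
ages = pd.Series([22.0, 38.0, 26.0], index=['p1', 'p2', 'p3'])

first_by_position = ages.iloc[0]   # position 0, regardless of the labels
second_by_label = ages.loc['p2']   # the element whose label is 'p2'

# A dictionary becomes a Series whose index is the dictionary keys
passenger = pd.Series({'survived': 0, 'pclass': 3, 'sex': 'male', 'age': 22.0})

print(first_by_position, second_by_label)
print(passenger['sex'], list(passenger.index))
```

With the default integer index, iloc and loc happen to return the same elements, which is why the two look interchangeable in small examples.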
2: DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially
different types.

Example of a DataFrame

[Figure: the first five Titanic rows (columns survived, pclass, sex, age, sibsp, parch, fare) shown as a DataFrame. Rows run along axis=0 (the index) and columns run along axis=1 (the columns). The row labels and the column labels are each a pd.Index, and each column is a pd.Series.]

Creating DataFrames:
From a dictionary: pd.DataFrame(data)
From a CSV file: pd.read_csv('Titanic.csv')
From a list of lists: pd.DataFrame(data, columns=['survived', 'pclass', 'sex', 'age', ...])

Accessing elements:
By position: df.iloc[row_index, col_index]
By indexes/labels: df.loc[row_label, col_label]

Example
# Creating a DataFrame from a dictionary
data = {
'survived': [0, 1, 1, 1, 0],
'pclass': [3, 1, 3, 1, 3],
'sex': ['male', 'female', 'female', 'female', 'male'],
'age': [22.0, 38.0, 26.0, 35.0, 35.0],
'sibsp': [1, 1, 0, 1, 0],
'parch': [0, 0, 0, 0, 0]
}

df = pd.DataFrame(data)

# Accessing elements by position

print("Accessing by position:")
print(df.iloc[0, 1]) # Output: 3 (pclass of first row)
print(df.iloc[2, 2]) # Output: female (sex of third row)
print()

# Accessing elements by index/labels

print("Accessing by index/labels:")
print(df.loc[0, 'sex']) # Output: male (sex of first row)
print(df.loc[3, 'age']) # Output: 35.0 (age of fourth row)
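
A DataFrame can also be built from a list of lists, as listed under "Creating DataFrames". The sketch below uses three rows taken from the sample table and shows single-cell access plus a label-based slice; the variable names are illustrative only.

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately
rows = [
    [0, 3, 'male', 22.0],
    [1, 1, 'female', 38.0],
    [1, 3, 'female', 26.0],
]
df = pd.DataFrame(rows, columns=['survived', 'pclass', 'sex', 'age'])

by_position = df.iloc[1, 3]            # second row, fourth column
by_label = df.loc[1, 'age']            # row label 1, column 'age'
subset = df.loc[0:1, ['sex', 'age']]   # label slices include both endpoints

print(by_position, by_label)
print(subset)
```

Note that df.loc[0:1] returns two rows because label slices are inclusive, whereas df.iloc[0:1] would return only one.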

Importing and Storing Data

We can import and export data using efficient pandas functions.

Reading and Writing Data

CSV [Reading and Writing Data]
df = pd.read_csv('Titanic.csv')
df.to_csv('Titanic.csv')

SQL [Reading and Writing Data]
pd.read_sql('SELECT * FROM table_name', connection)
df.to_sql('table_name', connection)

Parquet [Reading and Writing Data]
df = pd.read_parquet('Titanic.parquet')
df.to_parquet('Titanic.parquet')

HDF5 [Reading and Writing Data]
df = pd.read_hdf('Titanic.h5')
df.to_hdf('Titanic.h5', key='df', mode='w')

(key specifies the HDF5 group to write to, and mode specifies the write mode: 'w' to write, 'a' to append.)

Excel [Reading and Writing Data]
pd.read_excel('Titanic.xlsx', sheet_name='Sheet1')
df.to_excel('Titanic.xlsx', sheet_name='Sheet1')

JSON [Reading and Writing Data]
pd.read_json('Titanic.json')
df.to_json('Titanic.json')
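
As a rough illustration of a write/read round trip, the sketch below writes a small DataFrame to CSV text and reads it back. It uses an in-memory io.StringIO buffer instead of a file on disk, and the two-row DataFrame is made up for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({'survived': [0, 1], 'age': [22.0, 38.0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
csv_text = buffer.getvalue()

buffer.seek(0)                   # rewind before reading the same buffer
restored = pd.read_csv(buffer)

print(csv_text)
print(restored.equals(df))
```

Without index=False, to_csv() also writes the index, and reading the file back produces an extra unnamed column.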

Viewing and Other Operations

We can view the head and tail of the DataFrame by calling the following functions:
print(df.head()) # View the first few rows
print(df.tail()) # View the last few rows

We can obtain statistical information, the length and shape of the DataFrame, information
about columns, and a summary of the DataFrame by calling the following functions:

print(df.describe()) # Summary statistics

df.info() # Information about the DataFrame (info() prints directly and returns None)
print(len(df)) # Length of the DataFrame
print(df.shape) # Shape of the DataFrame
print(df.columns) # Column names
print(df['column_name'].value_counts()) # Counts of unique values

Selecting and Filtering Data

We can select and filter data using pandas functions.
Selecting columns:
df['age'] # Selecting one column
df[['survived', 'pclass', 'sex']] # Selecting multiple columns

Selecting rows:

df.iloc[row_index], df.loc[row_label]
df.iloc[10] # 11th row of DataFrame
df.loc[df['age'] > 18] # Rows where age is greater than 18

Filtering data based on specific criteria:

df[df['column'] > value], df.query('column > value')
print(df[df['age'] < 30])
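
Filters can be combined. The sketch below, on a small made-up subset of the sample rows, expresses the same two-condition filter once with a boolean mask and once with query().

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'pclass': [3, 1, 3, 1, 3],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_women = df[(df['sex'] == 'female') & (df['age'] < 30)]

# The same filter written as a query string
young_women_q = df.query("sex == 'female' and age < 30")

print(young_women)
```

Python's plain `and`/`or` keywords do not work on boolean Series in a mask, which is why the mask form uses `&` and `|`.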

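We can call statistics and aggregation functions to analyze the data. In pandas 2.0 and later, pass numeric_only=True when the DataFrame also contains non-numeric columns such as 'sex'; otherwise mean(), median(), var(), and std() raise an error:
df.mean(numeric_only=True), df.median(numeric_only=True), df.mode(), df.var(numeric_only=True), df.std(numeric_only=True)
print(df.groupby('sex').mean(numeric_only=True))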
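
A small sketch of these aggregations on made-up rows follows. It passes numeric_only=True so that the text column 'sex' is skipped rather than averaged.

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'pclass': [3, 1, 3, 1, 3],
})

overall_mean = df.mean(numeric_only=True)                 # one value per numeric column
mean_by_sex = df.groupby('sex').mean(numeric_only=True)   # one row per group

print(overall_mean)
print(mean_by_sex)
```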

Data Cleaning
We can clean and preprocess data using pandas functions.

Handling missing values:

df['deck'] = df['deck'].fillna('Unknown') # Fill missing values in the 'deck'
# column with the string 'Unknown'; fillna() returns a new Series, so assign it back.

Removing unnecessary rows/columns:

df.dropna(subset=['deck'], inplace=True) # Drop rows where the 'deck' column has NaN values
df.drop(columns=['embarked']) # Drop the 'embarked' column (returns a new DataFrame)

Handling duplicates:
df.drop_duplicates()
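
The cleaning steps above can be chained. The following sketch runs them on a small made-up DataFrame; isna().sum() is added here only to count missing values first and is not mentioned in the text above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 26.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'embarked': ['S', 'C', 'S', 'S'],
})

missing_per_column = df.isna().sum()        # number of NaN values in each column
df['deck'] = df['deck'].fillna('Unknown')   # replace NaN with a placeholder string
df = df.dropna(subset=['age'])              # drop rows that have no age
df = df.drop(columns=['embarked'])          # drop a column entirely
df = df.drop_duplicates()                   # drop rows that repeat an earlier row

print(missing_per_column)
print(df)
```

Each of these methods returns a new object by default, so the result is assigned back rather than relying on inplace=True.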

Data type conversion:

# Convert 'fare' column to float

df['fare'] = df['fare'].astype(float)

Data normalization:

def normalize_column(col):
    return (col - col.min()) / (col.max() - col.min())

# Normalize the 'age' and 'fare' columns
df['age'] = normalize_column(df['age'])
df['fare'] = normalize_column(df['fare'])
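
To make the effect of min-max normalization concrete, here is a short sketch applying the same formula to the five fares from the sample table. After scaling, the smallest value maps to 0 and the largest to 1.

```python
import pandas as pd

def normalize_column(col):
    # Min-max scaling: (x - min) / (max - min)
    return (col - col.min()) / (col.max() - col.min())

fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])
scaled = normalize_column(fares)

print(scaled)
```

If every value in a column is identical, max minus min is zero and the result is NaN, so such columns need separate handling.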

Renaming columns:
df.rename(columns={'pclass': 'PassengerClass', 'sibsp': 'SiblingsSpouses',
                   'parch': 'ParentsChildren'}, inplace=True)
Data Transformation
We can transform data using various pandas functions.

Sorting data:
df.sort_values('col'), df.sort_index()

Merging and joining:

pd.merge(df1, df2, on='key')
other_df = pd.read_csv('other_data.csv')
pd.merge(df, other_df, on='Ticket')
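
A minimal sketch of how the join type changes the result of a merge is shown below. Both tables, the 'ticket' key values, and the 'cabin' column are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical tables sharing a 'ticket' key (assumption for illustration)
passengers = pd.DataFrame({'ticket': ['A1', 'B2', 'C3'], 'age': [22.0, 38.0, 26.0]})
cabins = pd.DataFrame({'ticket': ['A1', 'B2', 'D4'], 'cabin': ['C85', 'E46', 'G6']})

inner = pd.merge(passengers, cabins, on='ticket')              # default: only matching keys
left = pd.merge(passengers, cabins, on='ticket', how='left')   # keep every passenger

print(inner)
print(left)
```

In the left join, the passenger with ticket 'C3' is kept and its 'cabin' value is NaN because no matching row exists in the second table.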

Grouping by a column:

df.groupby('col').sum(), df.groupby('col').mean()
df.groupby('pclass')['survived'].mean()

# The result is a Series whose index holds the unique values in the 'pclass'
# column (1, 2, 3 in this case) and whose values are the average survival rates
# for each class.

Output (illustrative values):

pclass
1 0.65
2 0.45
3 0.25
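
Several aggregates can be computed per group in one call with agg(). The sketch below uses five made-up rows, so its numbers differ from the full-dataset output.

```python
import pandas as pd

df = pd.DataFrame({
    'pclass': [3, 1, 3, 1, 3],
    'survived': [0, 1, 1, 1, 0],
})

# One row per class, one column per aggregate
summary = df.groupby('pclass')['survived'].agg(['mean', 'count'])

print(summary)
```

Because 'survived' is coded as 0 or 1, its mean per group is the survival rate for that group.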

Arithmetic operations:
df['new_column'] = df['column1'] + df['column2']
df['FamilySize'] = df['sibsp'] + df['parch']

Reshaping:
df.pivot(index='index_col', columns='col')
df.pivot_table(values='survived', index='pclass', columns='sex', aggfunc='mean')
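
As a sketch of what pivot_table() returns, the example below builds the survival-rate table by class and sex from five made-up rows. Combinations with no rows come out as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'sex': ['male', 'female', 'female', 'female', 'male'],
})

# Rows: ticket class; columns: sex; cells: mean of 'survived'
table = df.pivot_table(values='survived', index='pclass', columns='sex', aggfunc='mean')

print(table)
```

Unlike pivot(), pivot_table() aggregates when several rows share the same index/column pair, so it also works when those pairs are not unique.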

Lambda Functions
We can apply customized functions with Lambda expressions to modify data in the DataFrame.
The apply() function on one column

# Applying a Lambda function to one column

df['Fare_doubled'] = df['fare'].apply(lambda x: x * 2)

The apply() function on multiple columns

# Applying a Lambda function to multiple columns

df['Family_size'] = df.apply(lambda x: x['sibsp'] + x['parch'] + 1, axis=1)

The apply() function on one row

# Applying a Lambda function to one row
row = df.iloc[0]
result = row.apply(lambda x: x * 2)
The apply() function on multiple rows

# Applying a Lambda function to multiple rows

rows = df.iloc[1:3]
result = rows.apply(lambda x: x['fare'] * x['age'], axis=1)

The assign() function on one column

# Using assign() to create a new column based on an existing column
df = df.assign(Age_doubled=df['age'] * 2)

The assign() function on multiple columns

# Using assign() to create multiple new columns
df = df.assign(Fare_age_ratio=df['fare'] / df['age'],
Fare_class_ratio=df['fare'] / df['pclass'])
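
For simple arithmetic, a row-wise apply() and a plain column expression give the same values; the column expression is usually faster on large data. The sketch below compares the two on made-up rows and also shows that assign() accepts a function of the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'sibsp': [1, 1, 0],
    'parch': [0, 0, 0],
    'fare': [7.25, 71.2833, 7.925],
})

# Row-wise apply: the lambda is called once per row
with_apply = df.apply(lambda row: row['sibsp'] + row['parch'] + 1, axis=1)

# Vectorized equivalent: operates on whole columns at once
vectorized = df['sibsp'] + df['parch'] + 1

# assign() returns a new DataFrame; a callable receives that DataFrame as its argument
df = df.assign(family_size=vectorized, fare_doubled=lambda d: d['fare'] * 2)

print(df)
```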

Visualization
We can use pandas functions for data visualization.
hist()

df['age'].hist()

We used the hist() function to plot a histogram for the distribution of ages among
Titanic passengers.

[Figure: histogram of passenger ages; x-axis from 0 to 80, y-axis showing counts]

df.plot(kind='scatter', x='age', y='fare')

We used the plot() function to draw a scatter plot for presenting the relationship between age and
fare paid by Titanic passengers.

[Figure: scatter plot of Fare (y-axis) against Age (x-axis)]

The code below counts the occurrences of each unique value in the 'embarked' column
(value_counts() method) and plots a horizontal bar chart.

colors = ['steelblue', 'seagreen', 'salmon'] # Example colors; any valid matplotlib colors work
df['embarked'].value_counts().plot(kind='barh', color=colors)

[Figure: horizontal bar chart titled "Passenger Count by Embarked"; y-axis: Embarked, x-axis: Count]

We can combine different pandas functions to create more complex visualizations. Here, we used the
plot() function to draw a horizontal bar chart showing the count of passengers
at each embarkation point ('C', 'Q', 'S').
