Pandas
Pandas is a Python library for working with data. It helps you organize, clean, and
analyze data, especially when that data is in the form of a table, like what you'd see
in Excel.
Imagine you have a table of people's names, ages, and where they live. Pandas lets you
sort, filter, and run calculations on that table efficiently, even when it holds a lot
of data. You can also use Pandas to read data from CSV (Comma-Separated Values) files,
Excel spreadsheets, and databases.
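As a rough sketch of that workflow, the snippet below reads a small CSV, filters it, and sorts it. The CSV text, the names, and the cities are made up for illustration, and io.StringIO stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path such as "people.csv".
csv_text = """Name,Age,City
Asha,38,Patna
Ritik,21,Delhi
Kajal,19,Mumbai
"""

# read_csv accepts a file path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))

# Filter: keep only the rows where Age is greater than 20.
adults = people[people["Age"] > 20]

# Sort the filtered rows by Age, youngest first.
adults_sorted = adults.sort_values("Age")

# Calculate: average age across all rows.
average_age = people["Age"].mean()

print(adults_sorted)
print("Average age:", average_age)
```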
DataFrame
A DataFrame is like a full table of data with rows and columns. If a Series is
like a single column, then a DataFrame is like several columns (each column
is a Series). It’s essentially a collection of Series that are aligned by their
index.
Key Points About DataFrame:
It is two-dimensional (like a table).
Each column is a Series, and the columns are aligned based on their
index (row labels).
You can think of it as an Excel sheet where each cell belongs to a row
and a column.
A DataFrame can have any number of columns. Even with just one column it
is still a DataFrame, because the structure is designed to handle multiple
columns.
How to Use a DataFrame in Pandas
import pandas as pd
data = {
    "Name": ["Ritik Kumar", "Roshan Kumar", "Jitendra Kumar",
             "Asha Devi", "Krishna Kumar", "Kajal Kumari"],
    "Age": [21, 19, 47, 38, 28, 19],
}
df = pd.DataFrame(data)
df
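The key points above (each column is a Series, and a one-column table is still a DataFrame) can be checked directly. This is a small sketch that reuses three of the rows from the example; the variable names ages and ages_frame are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ritik Kumar", "Roshan Kumar", "Jitendra Kumar"],
    "Age": [21, 19, 47],
})

# Selecting one column with single brackets returns a Series.
ages = df["Age"]
print(type(ages).__name__)          # Series

# Selecting with a list of column names returns a DataFrame,
# even when the list holds only one column.
ages_frame = df[["Age"]]
print(type(ages_frame).__name__)    # DataFrame

# The column shares the DataFrame's row labels (its index).
print(ages.index.equals(df.index))  # True

# shape is (rows, columns), reflecting the two dimensions.
print(df.shape)                     # (3, 2)
```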
You can also fill missing values with the mean or median of a column:
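As a minimal sketch, assuming a small made-up DataFrame whose Age column has one missing value (the data and the new column names Age_mean_filled and Age_median_filled are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ritik", "Roshan", "Asha", "Kajal"],
    "Age": [21.0, np.nan, 38.0, 19.0],
})

# mean() and median() skip missing values by default.
mean_age = df["Age"].mean()      # (21 + 38 + 19) / 3 = 26.0
median_age = df["Age"].median()  # middle value of 19, 21, 38 = 21.0

# fillna() returns a new Series, so the result is assigned to a column.
df["Age_mean_filled"] = df["Age"].fillna(mean_age)
df["Age_median_filled"] = df["Age"].fillna(median_age)

print(df)
```

The median is often the safer choice when the column contains extreme values, since the mean is pulled toward them.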
4. Handling Outliers
Outliers are data points that differ significantly from other observations in the
dataset. They can distort your analysis.
Identifying Outliers
Using Statistics: Outliers can be identified using summary statistics like the
mean and standard deviation, or using visualizations such as box plots.
# Identify outliers using a simple method (values greater than mean + 3*std)
threshold = df['Age'].mean() + 3 * df['Age'].std()
outliers = df[df['Age'] > threshold]
print(outliers)
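The rule that box plots conventionally use (points more than 1.5 times the interquartile range beyond the quartiles) can also be applied numerically. The sketch below is one possible version on made-up ages with a single extreme value added; the 1.5 multiplier is the usual convention, not something fixed by Pandas.

```python
import pandas as pd

# Made-up ages with one extreme value (95) added for illustration.
ages = pd.Series([21, 19, 47, 38, 28, 19, 95], name="Age")

# Quartiles and interquartile range (IQR).
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional box-plot fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)

# For comparison: the mean + 3*std threshold on the same data.
three_sigma = ages.mean() + 3 * ages.std()
print(three_sigma)
```

On this small sample the IQR rule flags 95, while the mean + 3*std threshold does not, because the extreme value itself inflates the standard deviation. Which rule fits better depends on the data.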
Removing or Transforming Outliers
Transforming or Capping Outliers: You can choose to transform outliers (e.g., using
a log transformation) or cap them at a certain threshold.
# Cap Age values at a maximum of 40
df['Age'] = df['Age'].clip(upper=40)
print(df)
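For the log transformation mentioned above, a minimal sketch might look like this. The Income column and its values are hypothetical, and np.log1p is one common choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column with one very large value.
df = pd.DataFrame({"Income": [20_000, 25_000, 30_000, 1_000_000]})

# log1p computes log(1 + x), so it is also defined when x is 0.
df["Income_log"] = np.log1p(df["Income"])

# expm1 is the inverse of log1p and recovers the original values.
recovered = np.expm1(df["Income_log"])

print(df)
```

The transformation keeps every row but compresses the gap between the largest and smallest values, which is why it is often preferred over capping when the extreme values are genuine.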
5. String Manipulation
Text data often needs cleaning to remove unwanted characters or correct
formatting.
Removing Unwanted Characters
str.replace(): Replaces a specific pattern or character in strings.
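A short sketch of str.replace() on a column, assuming a made-up Phone column with inconsistent formatting (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Phone": ["(555) 123-4567", "555.987.6543", " 555 000 1111 "]})

# regex=True treats the pattern as a regular expression;
# \D matches any character that is not a digit.
df["Phone_clean"] = df["Phone"].str.replace(r"\D", "", regex=True)

# regex=False replaces a literal substring, so "." means an actual dot.
df["Phone_dashes"] = df["Phone"].str.replace(".", "-", regex=False)

print(df)
```

Passing regex explicitly avoids surprises, since characters such as "." have a special meaning in regular expressions.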
6. Renaming Columns
Sometimes, columns may have unclear or inconsistent names that need to be
corrected.
Renaming Columns
rename(): Changes the names of columns.
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# Rename columns
df = df.rename(columns={'col1': 'First Column', 'col2': 'Second Column'})
print(df)
Merging, Joining, and Concatenating DataFrames
1. Concatenating DataFrames
Concatenation means combining two or more DataFrames along a particular axis
(either rows or columns). You can think of it like stacking data on top of each other
or side by side.
pd.concat(): This function concatenates DataFrames either vertically (by
default) or horizontally. You can concatenate along rows (axis=0) or columns
(axis=1).
Example 1: Concatenating Along Rows (Vertical Concatenation)
import pandas as pd
df1 = pd.DataFrame({'Name': ['John', 'Alice'],
'Age': [28, 30]})
df2 = pd.DataFrame({'Name': ['Bob', 'Eve'],
'Age': [35, 22]})
# Concatenating DataFrames vertically
df_vertical_concat = pd.concat([df1, df2], axis=0)
print(df_vertical_concat)
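Two related variations may be useful here: renumbering the rows after a vertical concatenation, and concatenating side by side with axis=1. This is a sketch; the scores DataFrame is a hypothetical addition for the horizontal case.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice'], 'Age': [28, 30]})
df2 = pd.DataFrame({'Name': ['Bob', 'Eve'], 'Age': [35, 22]})

# Vertical concatenation keeps each DataFrame's own row labels: 0, 1, 0, 1.
stacked = pd.concat([df1, df2], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0, 1, 2, 3.
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)
print(renumbered)

# Horizontal concatenation places columns side by side,
# matching rows by their index labels.
scores = pd.DataFrame({'Score': [88, 90]})  # hypothetical extra column
side_by_side = pd.concat([df1, scores], axis=1)
print(side_by_side)
```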
2. Merging DataFrames
Merging is similar to SQL joins. You merge DataFrames based on one or more
common columns (known as "keys"). This operation allows you to combine rows
from two DataFrames where there is a match on the key(s).
pd.merge(): This function merges DataFrames using a common column or
index. The type of merge can be controlled using the how parameter.
The main types of merges are:
Inner Join (default): Returns rows that have matching values in both
DataFrames.
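Left Join: Returns all rows from the left DataFrame, and matching rows from
the right DataFrame. Left rows without a match get NaN in the right columns.
Right Join: Returns all rows from the right DataFrame, and matching rows
from the left DataFrame. Right rows without a match get NaN in the left columns.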
Outer Join: Returns all rows from both DataFrames, filling missing values
with NaN where there is no match.
Example 1: Inner Join
df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
'Age': [28, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
'Score': [88, 90, 92]})
# Merging DataFrames using an inner join (default)
df_inner_merge = pd.merge(df1, df2, on='Name')
print(df_inner_merge)
Example 2: Left Join
# Merging DataFrames using a left join
df_left_merge = pd.merge(df1, df2, on='Name', how='left')
print(df_left_merge)
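The remaining two join types can be sketched the same way, using the same df1 and df2. The indicator=True argument is an optional extra that adds a _merge column showing where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                    'Age': [28, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                    'Score': [88, 90, 92]})

# Right join: every row of df2 is kept; Eve has no match in df1, so her Age is NaN.
df_right_merge = pd.merge(df1, df2, on='Name', how='right')
print(df_right_merge)

# Outer join: rows from both DataFrames are kept.
# indicator=True adds a _merge column: left_only, right_only, or both.
df_outer_merge = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(df_outer_merge)
```

The row order of an outer join can differ between Pandas versions, so it is safer not to rely on it.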
Concatenation (pd.concat()): Stacking DataFrames either vertically or
horizontally.
Merging (pd.merge()): Combining DataFrames based on one or more
common columns (like SQL joins).
Joining (df.join()): Combining DataFrames based on their index.
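Since df.join() works on the index rather than on a column, a minimal sketch needs DataFrames whose row labels carry the key. Here the names are used as the index; the ages and scores DataFrames are illustrative.

```python
import pandas as pd

# The person's name is the index (row label) of each DataFrame.
ages = pd.DataFrame({'Age': [28, 30, 35]}, index=['John', 'Alice', 'Bob'])
scores = pd.DataFrame({'Score': [88, 90, 92]}, index=['Alice', 'Bob', 'Eve'])

# join() defaults to a left join on the index:
# all rows of ages are kept, and John gets NaN for Score.
joined = ages.join(scores)
print(joined)

# how='inner' keeps only the labels present in both DataFrames.
inner = ages.join(scores, how='inner')
print(inner)
```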