Pandas: Data Manipulation and Analysis Library - Educative
Introduction to pandas
1: What is pandas?
Pandas is an open-source library in Python used for data cleaning, manipulation and analysis. It
provides the data structures and functions needed to work efficiently with tabular or structured
data (such as CSV files, Excel sheets, SQL databases, etc.).
2: Pandas installation
pip install pandas
Creating Series:
From lists, arrays, dictionaries: pd.Series(data)
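As a minimal sketch of the three input types, the snippet below builds one Series from each. The values and the variable names (ages, fares, port_counts) are illustrative and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...)
ages = pd.Series([22.0, 38.0, 26.0])

# From a NumPy array, attaching a name to the Series
fares = pd.Series(np.array([7.25, 71.2833, 7.925]), name='fare')

# From a dictionary: the keys become the index labels
port_counts = pd.Series({'S': 3, 'C': 1, 'Q': 1})

print(ages)
print(fares.name)
print(port_counts['S'])
```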
[Figure: Example of a DataFrame, labeling the index (pd.Index), a single column (pd.Series), and axis=1 (columns).]
Creating DataFrames:
From a dictionary: pd.DataFrame(data)
CSV file: pd.read_csv('Titanic.csv')
A list of lists: pd.DataFrame(data, columns=['survived', 'pclass', 'sex', 'age',...])
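For the list-of-lists case, each inner list becomes one row and the columns argument names the fields. The sketch below uses three illustrative rows and only four of the columns.

```python
import pandas as pd

# Each inner list is one row; values are illustrative
rows = [
    [0, 3, 'male', 22.0],
    [1, 1, 'female', 38.0],
    [1, 3, 'female', 26.0],
]

df = pd.DataFrame(rows, columns=['survived', 'pclass', 'sex', 'age'])
print(df)
print(df.shape)
```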
Accessing elements:
By position: df.iloc[row_index, col_index]
By indexes/labels: df.loc[row_label, col_label]
Example
# Creating a DataFrame from a dictionary
import pandas as pd

data = {
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'sibsp': [1, 1, 0, 1, 0],
    'parch': [0, 0, 0, 0, 0]
}
df = pd.DataFrame(data)
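To see the difference between positional and label-based access, the sketch below uses a small DataFrame with string row labels. The labels 'p1' to 'p3' are hypothetical and chosen only so that positions and labels do not coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {'sex': ['male', 'female', 'female'], 'age': [22.0, 38.0, 26.0]},
    index=['p1', 'p2', 'p3'],  # hypothetical row labels
)

# By position: second row, second column (both zero-based)
age_by_position = df.iloc[1, 1]

# By label: row 'p2', column 'age'
age_by_label = df.loc['p2', 'age']

print(age_by_position, age_by_label)
```

Both expressions point at the same cell; iloc ignores the labels entirely, while loc ignores the positions.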
Parquet [Reading and Writing Data]
df = pd.read_parquet('Titanic.parquet')
df.to_parquet('Titanic.parquet')
HDF5 [Reading and Writing Data]
df = pd.read_hdf('Titanic.h5')
df.to_hdf('Titanic.h5', key='df', mode='w')
(key names the HDF5 group to write to; mode sets the write mode: 'w' to write a
new file, 'a' to append to an existing one.)
Excel [Reading and Writing Data]
pd.read_excel('Titanic.xlsx', sheet_name='Sheet1')
df.to_excel('Titanic.xlsx', sheet_name='Sheet1')
JSON [Reading and Writing Data]
pd.read_json('Titanic.json')
df.to_json('Titanic.json')
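The readers and writers above come in matching pairs. As a sketch that runs without touching the disk, the JSON pair can be exercised in memory: to_json() returns the JSON text when no path is given, and read_json() accepts a file-like object. The column values are illustrative.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'pclass': [3, 1, 3], 'sex': ['male', 'female', 'female']})

# With no path argument, to_json() returns the JSON text instead of writing a file
json_text = df.to_json()

# read_json() accepts a path or a file-like object such as StringIO
restored = pd.read_json(StringIO(json_text))

print(json_text)
print(restored)
```

Note that read_json() infers dtypes again on the way back in, so a round trip is not guaranteed to preserve every column type exactly.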
We can obtain statistical information (df.describe()), the length (len(df)) and shape (df.shape)
of the DataFrame, information about columns (df.columns), and a summary of the DataFrame (df.info()).
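A minimal sketch of these inspection calls on a five-row sample follows. The choice of describe(), len(), shape, columns and info() is an assumption about which functions are meant here; all of them are standard pandas calls.

```python
import pandas as pd

df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

n_rows = len(df)       # number of rows
shape = df.shape       # (rows, columns)
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column

print(n_rows)
print(shape)
print(df.columns)      # column labels
print(summary)
df.info()              # dtypes and non-null counts; prints directly
```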
Selecting rows:
df.iloc[row_index], df.loc[row_label]
df.iloc[10] # 11th row of DataFrame
df.loc[df['age'] > 18] # Rows where age is greater than 18
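The sketch below applies both styles of row selection to a small illustrative sample, and also shows how two conditions can be combined with & (each condition needs its own parentheses).

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 4.0, 35.0],  # illustrative values
})

# A single row by position, returned as a Series
third_row = df.iloc[2]

# Rows matching a boolean condition
adults = df.loc[df['age'] > 18]

# Two conditions combined with &
adult_women = df.loc[(df['age'] > 18) & (df['sex'] == 'female')]

print(third_row)
print(adults)
print(adult_women)
```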
We can call advanced statistics and aggregation functions to analyze the data:
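df.mean(numeric_only=True), df.median(numeric_only=True), df.mode(), df.var(numeric_only=True), df.std(numeric_only=True)
# numeric_only=True skips text columns such as 'sex', which would otherwise raise an error in pandas 2.x
print(df.groupby('sex').mean(numeric_only=True))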
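As a small worked example, the statistics can be computed on one column or per group. The sketch uses five illustrative rows and passes numeric_only=True to the grouped mean so that any text columns are skipped.

```python
import pandas as pd

df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

mean_age = df['age'].mean()
median_age = df['age'].median()
mode_age = df['age'].mode()  # a Series, because several values can tie

# One row per group, one column per numeric field
by_sex = df.groupby('sex').mean(numeric_only=True)

print(mean_age, median_age)
print(mode_age)
print(by_sex)
```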
Data Cleaning
We can clean and preprocess data using pandas functions.
Handling duplicates:
df.drop_duplicates()
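A short sketch of duplicate handling: duplicated() flags repeated rows and drop_duplicates() returns a new DataFrame that keeps the first occurrence of each. The rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'age': [22.0, 38.0, 22.0],  # the third row repeats the first
})

# True for every row that repeats an earlier one
n_duplicates = int(df.duplicated().sum())

# Returns a new DataFrame; df itself is left unchanged
deduped = df.drop_duplicates()

print(n_duplicates)
print(deduped)
```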
Data normalization:
def normalize_column(col):
    # Min-max scaling: maps the smallest value to 0 and the largest to 1
    return (col - col.min()) / (col.max() - col.min())
# Normalize the 'age' and 'fare' columns
df['age'] = normalize_column(df['age'])
df['fare'] = normalize_column(df['fare'])
Renaming columns:
df.rename(columns={'pclass': 'PassengerClass', 'sibsp': 'SiblingsSpouses',
                   'parch': 'ParentsChildren'}, inplace=True)
Pandas for Data Manipulation and Analysis
Data Transformation
We can transform data using various pandas functions.
Sorting data:
df.sort_values('col'), df.sort_index()
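The sketch below sorts a small illustrative DataFrame by a column in descending order, then uses sort_index() to return to the original row order.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'fare': [7.25, 71.2833, 7.925],
})

# Sort by a column; the original index labels travel with their rows
by_age_desc = df.sort_values('age', ascending=False)

# Sorting by index restores the original order
restored = by_age_desc.sort_index()

print(by_age_desc)
print(restored)
```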
Grouping by a column:
df.groupby('col').sum(), df.groupby('col').mean()
df.groupby('pclass')['survived'].mean()
# The result is a Series whose index holds the unique values of the 'pclass'
# column (1, 2, 3 in this case) and whose values are the average survival
# rates for each class.
Output:
pclass
1 0.65
2 0.45
3 0.25
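The same grouping can be tried on a five-row sample. The rates it produces differ from the output listed above, since the sample is tiny and contains no second-class passengers.

```python
import pandas as pd

df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
})

# Selecting one column after groupby yields a Series indexed by group key
survival_rate = df.groupby('pclass')['survived'].mean()

print(type(survival_rate).__name__)
print(survival_rate)
```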
Arithmetic operations:
df['new_column'] = df['column1'] + df['column2']
df['FamilySize'] = df['sibsp'] + df['parch']
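Column arithmetic is applied row by row. A minimal sketch with the sibsp and parch values from the example DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'sibsp': [1, 1, 0, 1, 0],
    'parch': [0, 0, 0, 0, 0],
})

# Each row of the new column is the sum of that row's two inputs
df['FamilySize'] = df['sibsp'] + df['parch']

print(df)
```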
Reshaping:
df.pivot(index='index_col', columns='col')
df.pivot_table(values='survived', index='pclass', columns='sex', aggfunc='mean')
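To make the pivot_table() call concrete, the sketch below runs it on five illustrative rows. Each cell holds the mean of survived for one (pclass, sex) pair, and pairs with no rows come out as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'sex': ['male', 'female', 'female', 'female', 'male'],
})

# Rows: passenger class; columns: sex; cells: mean survival
table = df.pivot_table(values='survived', index='pclass',
                       columns='sex', aggfunc='mean')

print(table)
```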
Lambda Functions
We can apply custom functions with lambda expressions to modify data in the DataFrame.
The apply() function on one column
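A minimal sketch of apply() with a lambda on a single column follows. The age_group column and the cutoff of 18 are hypothetical choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [22.0, 4.0, 38.0]})  # illustrative values

# The lambda runs once per value in the column
df['age_group'] = df['age'].apply(
    lambda age: 'child' if age < 18 else 'adult'  # hypothetical cutoff
)

print(df)
```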
Visualization
We can use pandas functions for data visualization.
hist()
df['age'].hist()
We used the hist() function to plot a histogram for the distribution of ages among
Titanic passengers.
[Figure: histogram of passenger ages; x-axis from 0 to 80, y-axis (count) from 0 to 175.]
We used the plot() function to draw a scatter plot showing the relationship between age and
fare paid by Titanic passengers.
[Figure: scatter plot of Fare (y-axis, 0 to 500) against Age (x-axis, 0 to 80).]
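The call behind this scatter plot is not shown here, so the sketch below is one plausible form of it rather than the original code. It assumes matplotlib is installed, uses four illustrative rows, and selects the non-interactive Agg backend so it can run without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook

import pandas as pd

df = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.2833, 7.925, 53.1],
})

# kind='scatter' needs the x and y column names
ax = df.plot(kind='scatter', x='age', y='fare')
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
```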
The code below counts the occurrences of each unique value in the 'embarked' column
(value_counts() method) and plots the counts as a horizontal bar chart. Here, colors is
assumed to be a list of color names defined earlier.
df['embarked'].value_counts().plot(kind='barh', color=colors)
[Figure: horizontal bar chart of passenger Count for each Embarked value.]
We can combine different pandas functions to create complex visualizations. Here, we used the
plot() function to draw a horizontal bar chart showing the count of passengers
at each embarkation location ('C', 'Q', 'S').
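A self-contained sketch of the bar chart follows. It assumes matplotlib is installed; the colors list and the five embarked values are hypothetical stand-ins, and the counts are taken directly from the column with value_counts().

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook

import pandas as pd

df = pd.DataFrame({'embarked': ['S', 'C', 'S', 'Q', 'S']})  # illustrative values
colors = ['tab:blue', 'tab:orange', 'tab:green']  # hypothetical color choice

# One count per unique value, sorted from most to least frequent
counts = df['embarked'].value_counts()

# kind='barh' draws one horizontal bar per index entry
ax = counts.plot(kind='barh', color=colors)
ax.set_xlabel('Count')
ax.set_ylabel('Embarked')
```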