Data Cleaning in Python
Introduction
• Data cleaning is a vital step in data analysis, and Python, with libraries
like `pandas`, offers powerful tools for this process. Below is a guide to
common data cleaning tasks using Python:
1. Import Libraries and Load Data
• Python code
• import pandas as pd
• import numpy as np
•
• # Load data
• df = pd.read_csv('data.csv') # Replace with your dataset path
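Before cleaning anything, it usually helps to look at what was loaded. The sketch below is self-contained: it reads a small made-up CSV from an in-memory string instead of `data.csv`, and the column names (`name`, `age`, `income`) are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for data.csv
raw_csv = io.StringIO(
    "name,age,income\n"
    "Alice,30,52000\n"
    "Bob,,48000\n"
    "Alice,30,52000\n"
)
df = pd.read_csv(raw_csv)

print(df.shape)         # (rows, columns)
print(df.dtypes)        # inferred type of each column
print(df.isna().sum())  # number of missing values per column
```

The per-column missing counts and inferred dtypes are a reasonable guide to which of the following steps a dataset actually needs.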
2. Handle Missing Data
• Python code
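A minimal sketch of this step, assuming a small made-up DataFrame with hypothetical `age` and `city` columns: count the missing values, fill numeric gaps with the column mean, and drop rows where a required field is still missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age, one missing city
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Paris", "Rome", None],
})

missing_counts = df.isna().sum()                # missing values per column
df["age"] = df["age"].fillna(df["age"].mean())  # fill numeric gaps with the mean
df = df.dropna(subset=["city"])                 # drop rows still missing a city
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a selected column, avoids modifying a temporary copy that pandas may discard. Whether to fill or drop depends on the dataset; mean imputation is only one option.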
3. Remove Duplicates
• Python code
• df.drop_duplicates(inplace=True) # Remove duplicate rows
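Exact duplicates are not the only kind. The sketch below, built on a made-up DataFrame with hypothetical `email` and `plan` columns, counts fully identical rows and then shows how `subset` and `keep` can reduce the data to one row per key.

```python
import pandas as pd

# Hypothetical data: one exact duplicate row and one repeated email
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "plan":  ["free",    "free",    "free",    "pro"],
})

n_exact = df.duplicated().sum()    # rows identical to an earlier row
deduped = df.drop_duplicates()     # keeps the first of each exact duplicate

# One row per email, keeping the last occurrence of each
latest = df.drop_duplicates(subset="email", keep="last")
```

Checking `duplicated().sum()` first makes it clear how many rows a call to `drop_duplicates` is about to remove.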
4. Standardize Data Formats
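As a sketch of what standardizing formats can look like, the example below trims and lowercases a text column and parses date strings into a datetime column. The DataFrame and its `name` and `signup_date` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with inconsistent casing, stray whitespace, and string dates
df = pd.DataFrame({
    "name": ["  Alice ", "BOB", "carol"],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

df["name"] = df["name"].str.strip().str.lower()        # trim whitespace, lowercase
df["signup_date"] = pd.to_datetime(df["signup_date"])  # parse strings to datetime
```

Once the dates are real datetimes, accessors such as `df["signup_date"].dt.month` become available for later analysis.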
• Handle outliers using IQR:
• Python code
• Q1 = df['numeric_column'].quantile(0.25)
• Q3 = df['numeric_column'].quantile(0.75)
• IQR = Q3 - Q1
• df = df[(df['numeric_column'] >= Q1 - 1.5 * IQR) & (df['numeric_column'] <= Q3 + 1.5 * IQR)]
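To make the IQR rule concrete, here is a small worked sketch on made-up numbers. The `income` values are hypothetical, and `Series.between` is used as an equivalent, slightly shorter way to write the two-sided comparison.

```python
import pandas as pd

# Hypothetical incomes with one obvious outlier
df = pd.DataFrame({"income": [40, 42, 45, 47, 50, 52, 55, 500]})

q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

# Series.between is inclusive on both ends by default
filtered = df[df["income"].between(lower, upper)]
```

With this sample and pandas' default linear interpolation, the fences work out to 31.5 and 65.5, so only the row with 500 is removed. The 1.5 multiplier is a convention rather than a rule; a larger value keeps more extreme points.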
10. Save Cleaned Data
• Python code
• df.to_csv('cleaned_data.csv', index=False) # Save cleaned data to a new file
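A quick way to confirm the saved file looks as intended is to write it and read it back. The sketch below uses an in-memory buffer as a stand-in for `cleaned_data.csv`, with a made-up two-column DataFrame.

```python
import io
import pandas as pd

# Hypothetical cleaned data
df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})

buffer = io.StringIO()          # in-memory stand-in for cleaned_data.csv
df.to_csv(buffer, index=False)  # index=False omits the row-index column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
```

Without `index=False`, the row index is written as an extra unnamed column, which then reappears as a spurious column when the file is reloaded.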
Example Workflow
# Full Example
import pandas as pd
df = pd.read_csv('data.csv')

# Identify and fill missing values
df['age'] = df['age'].fillna(df['age'].mean())

# Remove duplicates
df.drop_duplicates(inplace=True)

# Standardize case
df['name'] = df['name'].str.lower()

# Handle outliers
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['income'] >= Q1 - 1.5 * IQR) & (df['income'] <= Q3 + 1.5 * IQR)]

# Save the cleaned data
df.to_csv('cleaned_data.csv', index=False)
Conclusion
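• Applied in sequence, these pandas operations (filling missing values, removing duplicates, standardizing formats, and filtering outliers) produce cleaner, more consistent data for analysis.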