Data Cleaning in Python

The document provides a comprehensive guide on data cleaning using Python, particularly with the `pandas` library. It covers essential tasks such as handling missing data, removing duplicates, standardizing formats, correcting invalid data, and saving cleaned data. Each task includes Python code snippets for practical implementation.

DATA CLEANING IN PYTHON
Working with Python made easier
Introduction

• Data cleaning is a vital step in data analysis, and Python, with libraries
like `pandas`, offers powerful tools for this process. Below is a guide to
common data cleaning tasks using Python:
1. Import Libraries and Load Data

Python code
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')  # Replace with your dataset path
2. Handle Missing Data

Identify Missing Values:

Python code
print(df.isnull().sum())  # Count missing values per column
print(df[df.isnull().any(axis=1)])  # Display rows with missing values

Fill Missing Values:

Python code
df['column_name'] = df['column_name'].fillna('Default Value')  # Fill with a default value
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())  # Fill with the mean

Drop Missing Values:

Python code
df.dropna(inplace=True)  # Drop rows with missing values
df.dropna(axis=1, inplace=True)  # Or drop columns with missing values
3. Remove Duplicates

Python code
df.drop_duplicates(inplace=True)  # Remove duplicate rows
4. Standardize Data Formats

Trim Whitespace:

Python code
df['column_name'] = df['column_name'].str.strip()

Change Case:

Python code
df['column_name'] = df['column_name'].str.lower()  # Convert to lowercase

Format Dates:

Python code
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')
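A short sketch of these formatting steps on invented values (the `city` and `joined` columns are hypothetical toy data):

```python
import pandas as pd

# Messy strings: stray whitespace, mixed case, ISO-style date strings
df = pd.DataFrame({
    'city': ['  Paris ', 'LONDON', ' Rome'],
    'joined': ['2024-01-05', '2024-02-10', '2024-03-15'],
})

df['city'] = df['city'].str.strip().str.lower()                 # trim, then lowercase
df['joined'] = pd.to_datetime(df['joined'], format='%Y-%m-%d')  # parse to datetime64

print(df['city'].tolist())             # ['paris', 'london', 'rome']
print(df['joined'].dt.month.tolist())  # [1, 2, 3]
```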
5. Correct Invalid Data

Replace Invalid Values:

Python code
df['column_name'] = df['column_name'].replace(['Invalid Value'], 'Valid Value')

Remove Outliers:

Python code
# Using Z-score
from scipy.stats import zscore
df = df[(np.abs(zscore(df['numeric_column'])) < 3)]
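If `scipy` is not available, the same Z-score filter can be sketched with pandas alone; `scipy.stats.zscore` uses the population standard deviation (`ddof=0`), so that is matched here. The column values are made-up toy data with one obvious outlier:

```python
import pandas as pd

# Ten inliers around 10-13 plus one extreme value
df = pd.DataFrame({'numeric_column': [10, 11, 12, 11, 10, 12, 13, 11, 12, 10, 300]})

col = df['numeric_column']
z = (col - col.mean()) / col.std(ddof=0)  # population std, matching scipy.stats.zscore
df = df[z.abs() < 3]                      # keep rows with |z| < 3; 300 is dropped

print(len(df))  # 10
```

Note that with very small samples a single outlier inflates the standard deviation enough that |z| can never exceed roughly sqrt(n - 1), so the 3-sigma rule needs a reasonable number of rows to bite.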
6. Handle Inconsistent Data

Unify Categories:

Python code
df['category_column'] = df['category_column'].replace({
    'Variation1': 'Standardized Value',
    'Variation2': 'Standardized Value'
})

Split and Combine Columns:

Python code
# Split a column
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Combine columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']
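The split-and-combine round trip can be checked on a tiny hypothetical `full_name` column:

```python
import pandas as pd

df = pd.DataFrame({'full_name': ['ada lovelace', 'alan turing']})

# Split on the first space into two new columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Recombine to confirm the round trip
df['full_name'] = df['first_name'] + ' ' + df['last_name']

print(df['last_name'].tolist())  # ['lovelace', 'turing']
```

`expand=True` is what turns the list-of-parts result into separate DataFrame columns; without it, `str.split` returns a single column of Python lists.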
7. Drop Unnecessary Columns or Rows

Drop Columns:

Python code
df.drop(['unnecessary_column'], axis=1, inplace=True)

Drop Rows:

Python code
df = df[df['column_name'] != 'Unwanted Value']
8. Validate and Clean Data Types

Convert Data Types:

Python code
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Coerce invalid values to NaN
df['string_column'] = df['string_column'].astype(str)

Check for Invalid Types:

Python code
print(df.dtypes)
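The effect of `errors='coerce'` is easiest to see on a small invented column that mixes digits with a bad value:

```python
import pandas as pd

df = pd.DataFrame({'numeric_column': ['1', '2', 'oops', '4']})

# Non-numeric entries become NaN instead of raising an error
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')

print(df['numeric_column'].isnull().sum())  # 1  ('oops' became NaN)
print(df['numeric_column'].dtype)           # float64
```

The resulting dtype is `float64` rather than an integer type because NaN is a float; the coerced rows can then be filled or dropped with the missing-data techniques from section 2.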
9. Handle Outliers

Using IQR:

Python code
Q1 = df['numeric_column'].quantile(0.25)
Q3 = df['numeric_column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['numeric_column'] >= Q1 - 1.5 * IQR) & (df['numeric_column'] <= Q3 + 1.5 * IQR)]
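Worked on toy values (the column contents here are hypothetical), the IQR fence cleanly isolates an extreme point:

```python
import pandas as pd

df = pd.DataFrame({'numeric_column': [10, 11, 12, 13, 14, 1000]})

Q1 = df['numeric_column'].quantile(0.25)  # 11.25
Q3 = df['numeric_column'].quantile(0.75)  # 13.75
IQR = Q3 - Q1                             # 2.5

# Fence: [Q1 - 1.5*IQR, Q3 + 1.5*IQR] = [7.5, 17.5], so 1000 is dropped
df = df[(df['numeric_column'] >= Q1 - 1.5 * IQR) & (df['numeric_column'] <= Q3 + 1.5 * IQR)]

print(df['numeric_column'].tolist())  # [10, 11, 12, 13, 14]
```

Unlike the Z-score approach, the IQR fence is based on quantiles, so a single extreme value cannot inflate the spread estimate and hide itself.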
10. Save Cleaned Data

Python code
df.to_csv('cleaned_data.csv', index=False)  # Save cleaned data to a new file
Example Workflow

Python code
# Full example
df = pd.read_csv('data.csv')

# Identify and fill missing values
df['age'] = df['age'].fillna(df['age'].mean())

# Remove duplicates
df.drop_duplicates(inplace=True)

# Standardize case
df['name'] = df['name'].str.lower()

# Handle outliers
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['income'] >= Q1 - 1.5 * IQR) & (df['income'] <= Q3 + 1.5 * IQR)]

# Save the cleaned data
df.to_csv('cleaned_data.csv', index=False)
Conclusion

• Together, these pandas techniques help produce clean, structured, and consistent data ready for analysis.
