3-Introduction To Data Cleaning Outliers

The document introduces data cleaning, focusing on outlier detection methods such as the Interquartile Range (IQR) Method, Z-score Method, and Boxplot Visualization. Each method has its own pros and cons, with IQR being effective for skewed data and Z-score suitable for normally distributed data. Additionally, it explains Label Encoding as a technique for converting categorical data into numerical values for machine learning models.

Uploaded by

mymopop
Copyright
© All Rights Reserved
Introduction to Data Cleaning

• Outlier Detection Methods


Outliers are data points that deviate significantly from the rest of the
dataset. Detecting them is crucial because they can distort analysis results
and degrade model performance.

1. Interquartile Range (IQR) Method


Concept: Uses quartiles to define a range for normal data points.
Formula:

• IQR = Q3 - Q1

• Lower Bound = Q1 - 1.5 × IQR

• Upper Bound = Q3 + 1.5 × IQR

• Outliers lie outside this range.

Pros: Works well for skewed data.


Cons: The 1.5 × IQR threshold is an empirical convention, not a value derived from the data.
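The IQR formula above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `iqr_outliers`, the parameter `k`, and the sample data are assumptions made for the example, and the quartiles are computed with the standard library's `statistics.quantiles` using linear interpolation (other quartile conventions give slightly different bounds).

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_bound, upper_bound)) using the IQR rule.

    k is the multiplier on the IQR; 1.5 is the conventional choice.
    """
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical sample: nine typical values and one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers, bounds = iqr_outliers(data)
print(outliers)  # [95]
print(bounds)    # (9.375, 24.375)
```

Here Q1 = 15 and Q3 = 18.75, so IQR = 3.75 and only the value 95 falls outside the range.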

2. Z-score Method

Concept: Measures how far a data point is from the mean in terms of standard
deviations.
Formula:

• Z = (X − μ) / σ, where μ is the mean and σ is the standard deviation.

• Data points with |Z| > 3 are considered outliers.

Pros: Best for normally distributed data.


Cons: Not reliable for skewed data.
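
As a rough sketch of the Z-score rule, the snippet below flags values whose |Z| exceeds 3. The function name `zscore_outliers`, the `threshold` parameter, and the sample data are assumptions for illustration. The population standard deviation (`statistics.pstdev`) is used here; using the sample standard deviation instead is an equally common choice and changes the scores slightly.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so no point deviates from the mean.
        return []
    return [v for v in values if abs((v - mu) / sigma) > threshold]


# Hypothetical sample: twenty values near 50 and one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]
print(zscore_outliers(data))  # [120]
```

The sample has 21 points on purpose: with the population standard deviation, the largest possible |Z| in a dataset of n points is (n − 1)/√n, so in very small samples (n ≤ 10) no point can exceed 3.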

3. Boxplot Visualization
Concept: Graphical method to detect outliers. Outliers appear as points
outside the whiskers.
Steps:

1. Create a boxplot for numerical features.

2. Identify points outside the whiskers.
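
The two steps above might look like this with matplotlib, assuming that library is installed. The function name `boxplot_outliers`, the output file name, and the sample data are assumptions for the example. By default matplotlib draws whiskers out to the most extreme points within 1.5 × IQR of the box, so the points plotted beyond them correspond to the IQR rule.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def boxplot_outliers(values, path="boxplot.png"):
    """Save a boxplot of `values` to `path` and return the points drawn
    outside the whiskers."""
    fig, ax = plt.subplots()
    # Step 1: create the boxplot for one numerical feature.
    result = ax.boxplot(values)
    ax.set_ylabel("value")
    fig.savefig(path)
    plt.close(fig)
    # Step 2: result["fliers"] holds one line object per box; its y-data
    # are the individual points plotted beyond the whiskers.
    fliers = result["fliers"][0].get_ydata()
    return sorted(float(y) for y in fliers)


# Hypothetical sample: nine typical values and one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

if __name__ == "__main__":
    print(boxplot_outliers(data))
```

Opening the saved image shows the extreme value as a separate marker above the upper whisker.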


Label Encoding
What is Label Encoding?

• Label Encoding is a method of converting categorical (text-based) data into numerical values.

• This technique is used when machine learning models cannot process text data directly.
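
A minimal sketch of the idea in plain Python follows. The function name `label_encode` and the sample categories are assumptions for illustration; assigning codes in sorted order of the category names is one possible convention (scikit-learn's `LabelEncoder` follows the same ordering), not the only one.

```python
def label_encode(values):
    """Map each distinct category to an integer code.

    Returns (encoded_values, mapping). Codes are assigned in sorted
    order of the category names.
    """
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    encoded = [mapping[v] for v in values]
    return encoded, mapping


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
encoded, mapping = label_encode(colors)
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 1, 0, 1]
```

The integer codes imply an order (0 < 1 < 2) that the original categories may not have, which is worth keeping in mind when choosing a model.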
