
Data Collection Cleaning Preprocessing Presentation

The document outlines the essential steps in data science, focusing on data collection, cleaning, and transformation. It emphasizes the importance of accurate data collection for informed decision-making and discusses various data sources and methods. Additionally, it covers the necessity of data cleaning to ensure reliability and introduces data transformation techniques like normalization and standardization.

Uploaded by

Anish Patnaik

Data Collection, Cleaning and Transformation
INTRODUCTION TO ESSENTIAL DATA SCIENCE STEPS
Agenda
- Data Collection
- Data Cleaning
- Data Transformation
Importance of Data Collection
• Why is data collection crucial?
Data collection is crucial because it forms the foundation for informed decision-making in any field. By gathering accurate and relevant data, organizations can identify trends, measure performance, and gain insight into customer behavior, market dynamics, and operational efficiency.

• Impact of good data collection on analysis and results
Good data collection enhances the accuracy and reliability of analysis, leading to more precise and actionable results. It ensures that insights are based on solid evidence, reducing the risk of errors and improving decision-making outcomes.
Types of Data
- Structured Data (organized in a fixed schema, e.g., rows and columns in a relational database)
- Unstructured Data (no predefined schema, e.g., free text, images, audio)
Common Data Sources
- Surveys and Questionnaires
- Databases and Data Warehouses
- Web Scraping
- APIs and Public Data Sets
Data Collection Methods
Manual Data Collection
◦ Pros
  ◦ Flexibility and Customization
  ◦ Human Insight
  ◦ Cost-Effective for Small-Scale Projects
◦ Cons
  ◦ Time-Consuming
  ◦ Prone to Human Error
  ◦ Scalability Issues

Automated Data Collection
◦ Pros
  ◦ Speed and Efficiency
  ◦ Accuracy and Consistency
  ◦ Scalability
◦ Cons
  ◦ High Initial Costs
  ◦ Lack of Flexibility
  ◦ Technical Issues
Introduction to Data Cleaning
The necessity of cleaning data before analysis
◦ Data cleaning is essential to remove inaccuracies, inconsistencies, and errors
from datasets, ensuring the reliability of analysis. Clean data leads to more
accurate insights and better decision-making, preventing misleading
conclusions.

Brief overview of common issues in raw data


◦ Missing Data
◦ Duplicate Entries
◦ Inconsistent Formats
◦ Outliers
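Two of the issues above, duplicate entries and inconsistent formats, can be sketched in a few lines of pure Python. The records and date formats below are hypothetical, chosen only to illustrate the idea:

```python
from datetime import datetime

# Hypothetical raw column: the same date appears in two formats,
# and one entry is duplicated.
records = ["2024-01-05", "05/01/2024", "2024-01-05", "2024-02-10"]

def to_iso(value):
    """Parse either ISO (YYYY-MM-DD) or DD/MM/YYYY into ISO format."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value}")

normalised = [to_iso(r) for r in records]       # consistent format
deduplicated = list(dict.fromkeys(normalised))  # drop duplicates, keep order
```

Note that the duplicate only becomes visible after the formats are unified, which is why format cleaning usually comes first.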
Handling Missing Values
Types of missing data
◦ Missing Completely at Random (MCAR)
◦ Missing at Random (MAR)
◦ Missing Not at Random (MNAR)

Techniques for handling missing values (e.g., removal, imputation)


◦ Deletion Methods
◦ Listwise Deletion
◦ Pairwise Deletion
◦ Imputation Methods
◦ Mean/Median/Mode Imputation
◦ Predictive Imputation
◦ Multiple Imputation
◦ Time Series Imputation
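Listwise deletion and mean imputation, the two simplest techniques above, can be sketched on a toy column where missing values are represented as `None` (the data is hypothetical):

```python
from statistics import mean

# Toy column with two missing values.
ages = [25, None, 31, 40, None, 28]

# Listwise deletion: drop entries with a missing value.
deleted = [a for a in ages if a is not None]

# Mean imputation: replace each missing value with the mean
# of the observed values.
col_mean = mean(deleted)
imputed = [a if a is not None else col_mean for a in ages]
```

Deletion shrinks the dataset; imputation keeps its size but pulls the filled entries toward the center, which is why predictive or multiple imputation is often preferred for serious analysis.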
Dealing with Outliers
Definition of outliers
◦ Outliers are data points that significantly deviate from the rest of the
dataset. They can be much higher or lower than the other values and can
skew or mislead statistical analyses.

Handling Outliers
◦ Identification
◦ Transformation
◦ Removal
◦ Imputation
◦ Segmentation
◦ Modeling
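The identification and removal steps above can be sketched with the common 1.5 × IQR rule (one of several identification methods), using a hypothetical dataset with one extreme value:

```python
from statistics import quantiles

# Toy data: one extreme value (120) among values near 10-13.
data = [10, 12, 11, 13, 12, 11, 120, 12, 13, 11]

# Identification: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(data, n=4)   # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]

# Removal: keep only the points inside the fences.
cleaned = [x for x in data if lower <= x <= upper]
```

Removal is not always the right choice: a flagged point may be a legitimate extreme observation, which is why the slide also lists transformation, imputation, and segmentation as alternatives.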
Data Transformation
Normalization vs. Standardization

Definition
◦ Normalization: Rescales data to a fixed range, usually [0, 1] or [-1, 1].
◦ Standardization: Transforms data to have a mean of 0 and a standard deviation of 1.

Effect on Distribution
◦ Normalization: Does not alter the shape of the distribution; only scales it.
◦ Standardization: Centers the distribution around 0 and scales it by the standard deviation.

Sensitivity to Outliers
◦ Normalization: More sensitive, since outliers stretch the min-max range.
◦ Standardization: Less sensitive; outliers remain present but are scaled differently.

Use Case
◦ Normalization: Commonly used where data must fit within a bounded range, e.g., image processing.
◦ Standardization: Preferred in statistical analyses and machine learning algorithms that assume normally distributed data, e.g., linear regression.

Assumption
◦ Normalization: Assumes data lies within a known, bounded range.
◦ Standardization: Assumes data is approximately normally distributed and unbounded.
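Both transformations are short formulas; a minimal pure-Python sketch on hypothetical values (in practice, scikit-learn's scalers would do this) makes the contrast concrete:

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0]  # toy data

# Normalization (min-max): rescale into [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): mean 0, standard deviation 1
# (population standard deviation used here).
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]
```

After min-max scaling the smallest value maps to 0 and the largest to 1; after z-scoring the values have mean 0 and standard deviation 1, but a single extreme value would distort the min-max range far more than the z-scores.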
Example Workflow
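The pipeline covered in the preceding slides, collect, clean, transform, can be sketched end to end on a hypothetical survey export (names and scores are invented for illustration):

```python
from statistics import mean

# 1. Collection: toy "raw" records, e.g., from a survey export.
raw = [
    {"name": "Ann", "score": 80},
    {"name": "Ben", "score": None},  # missing value
    {"name": "Ann", "score": 80},    # duplicate entry
    {"name": "Cara", "score": 95},
]

# 2. Cleaning: drop exact duplicates, then impute missing scores
# with the mean of the observed scores.
seen, deduped = set(), []
for row in raw:
    key = (row["name"], row["score"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

observed = [r["score"] for r in deduped if r["score"] is not None]
fill = mean(observed)
for r in deduped:
    if r["score"] is None:
        r["score"] = fill

# 3. Transformation: min-max normalize scores into [0, 1].
lo = min(r["score"] for r in deduped)
hi = max(r["score"] for r in deduped)
for r in deduped:
    r["norm"] = (r["score"] - lo) / (hi - lo)
```

The ordering matters: deduplicating before imputing keeps the duplicate from double-counting in the mean, and normalizing last ensures the imputed value is scaled consistently with the rest.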
Tools for Data Cleaning and Preprocessing
• Python Libraries:
  ◦ Pandas
  ◦ NumPy
  ◦ SciPy
  ◦ Scikit-learn

• SQL-Based Tools:
  ◦ SQL
  ◦ Apache Hive

• Data Visualization Tools:
  ◦ Tableau Prep
  ◦ Power BI
Q&A

Questions?
