Data Collection Cleaning Preprocessing Presentation
Cleaning and Transformation
INTRODUCTION TO ESSENTIAL DATA SCIENCE STEPS
Agenda
- Data Collection
- Data Cleaning
- Data Transformation
Importance of Data Collection
• Why is data collection crucial?
Data collection is crucial because it forms the foundation for informed
decision-making in any field. By gathering accurate and relevant data,
organizations can identify trends, measure performance, and gain insights into
customer behavior, market dynamics, and operational efficiency.
- Unstructured Data
Common Data Sources
- Surveys and Questionnaires
- Databases and Data Warehouses
- Web Scraping
- APIs and Public Data Sets
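As a minimal sketch of two of the sources listed above, the snippet below reads a survey export (CSV) and a database table into plain Python rows. The survey columns, the `customers` table, and all sample values are hypothetical and exist only for illustration; an in-memory SQLite database stands in for a real database or data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical survey export; in practice this would be a file on disk.
SURVEY_CSV = """respondent_id,age,satisfaction
1,34,4
2,,5
3,29,3
"""


def load_survey(text):
    """Parse CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def load_from_database(conn):
    """Read rows from the hypothetical `customers` table into dicts."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute("SELECT id, name, city FROM customers").fetchall()
    return [dict(row) for row in rows]


# Build a throwaway in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Linus", None)],
)

survey_rows = load_survey(SURVEY_CSV)
customer_rows = load_from_database(conn)

print(survey_rows[1])    # the missing age arrives as an empty string
print(customer_rows[1])  # the missing city arrives as None
```

Note that the two sources represent missing values differently (an empty string versus `None`), which is one reason a cleaning step usually follows collection.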
Data Collection Methods
Manual Data Collection
Pros:
◦ Flexibility and Customization
◦ Human Insight
◦ Cost-Effective for Small-Scale Projects
Cons:
◦ Time-Consuming
◦ Prone to Human Error
◦ Scalability Issues
Handling Outliers
◦ Identification
◦ Transformation
◦ Removal
◦ Imputation
◦ Segmentation
◦ Modeling
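Several of the strategies above can be sketched in a few lines. The example below assumes the common 1.5 × IQR rule for identification (the slides do not name a specific rule) and then shows removal, capping at the fences as one possible transformation, and imputation with the median of the remaining values. The sample data is made up; segmentation and modeling are not shown.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule (k=1.5 is an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Identification: values that fall outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


def remove_outliers(values, k=1.5):
    """Removal: keep only values inside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


def cap_outliers(values, k=1.5):
    """Transformation: clip extreme values to the nearest fence."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


def impute_outliers(values, k=1.5):
    """Imputation: replace outliers with the median of the inliers."""
    lower, upper = iqr_bounds(values, k)
    median = statistics.median(remove_outliers(values, k))
    return [v if lower <= v <= upper else median for v in values]


data = [12, 14, 15, 15, 16, 18, 19, 20, 95]  # made-up sample with one extreme value

print(find_outliers(data))        # [95]
print(cap_outliers(data)[-1])     # 27.0
print(impute_outliers(data)[-1])  # 15.5
```

Which strategy is appropriate depends on whether the extreme value is an error or a genuine observation, so identification should come before any of the others.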
Data Transformation
Aspect | Normalization | Standardization
Definition | Rescales data to a fixed range, usually [0, 1] or [-1, 1]. | Transforms data to have a mean of 0 and a standard deviation of 1.
Effect on Distribution | Does not alter the shape of the distribution; only scales it. | Centers the data around 0 and scales it by the standard deviation; the shape itself is unchanged.
Sensitivity to Outliers | More sensitive to outliers, as they can skew the range. | Less sensitive; outliers may still be present but are scaled differently.
Assumption | Assumes data is within a known range and is bounded. | Often used when data is approximately normally distributed; the result is unbounded.
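The two definitions in the comparison above can be written out directly. This is a minimal sketch using only the standard library: normalization computes (x - min) / (max - min), and standardization computes (x - mean) / std. Using the population standard deviation is an assumption made here; it matches the default of scikit-learn's `StandardScaler`. The sample data is made up.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize constant data")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def standardize(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2

normalized = min_max_normalize(data)
standardized = standardize(data)

print(normalized[0], normalized[-1])  # smallest value maps to 0.0, largest to 1.0
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Because min-max scaling depends only on the two extreme values, a single outlier compresses every other point into a narrow band, which is the sensitivity the comparison refers to.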
Example Workflow
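One possible end-to-end workflow, following the agenda order of collection, cleaning, and transformation, is sketched below. The column names, the sample records, and the specific cleaning choices (dropping duplicates and incomplete rows) are assumptions for illustration, not a prescribed procedure.

```python
import csv
import io

# Hypothetical raw export with one missing value and one duplicate row.
RAW = """id,age,income
1,25,30000
2,,42000
3,31,58000
3,31,58000
4,47,90000
"""


def collect(text):
    """Collection: parse CSV text into dict rows (values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: drop duplicates and incomplete rows, then cast to numbers."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row.values())
        if key in seen or any(value == "" for value in row.values()):
            continue
        seen.add(key)
        cleaned.append(
            {"id": int(row["id"]), "age": int(row["age"]), "income": float(row["income"])}
        )
    return cleaned


def transform(rows, column):
    """Transformation: add a min-max scaled copy of `column` in [0, 1]."""
    lo = min(r[column] for r in rows)
    hi = max(r[column] for r in rows)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    for r in rows:
        r[column + "_scaled"] = (r[column] - lo) / span
    return rows


rows = transform(clean(collect(RAW)), "income")
for r in rows:
    print(r)
```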
Tools for Data Cleaning and Preprocessing
Python Libraries:
• Pandas
• NumPy
• SciPy
• Scikit-learn
SQL-Based Tools:
• SQL
• Apache Hive
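To illustrate cleaning with SQL, the sketch below runs a single query that removes exact duplicates, trims and lowercases a text column, and fills a missing amount. An in-memory SQLite database is used here only so the example is self-contained; the `raw_orders` table and its rows are hypothetical, and filling missing amounts with 0.0 is a placeholder choice rather than a recommendation. The functions used (`TRIM`, `LOWER`, `COALESCE`) are widely supported across SQL engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "  alice ", 20.0),
        (1, "  alice ", 20.0),  # exact duplicate
        (2, "BOB", None),       # missing amount
        (3, "Carol", 35.5),
    ],
)

CLEANING_QUERY = """
SELECT DISTINCT
    order_id,
    LOWER(TRIM(customer)) AS customer,   -- normalize whitespace and case
    COALESCE(amount, 0.0) AS amount      -- placeholder default (assumption)
FROM raw_orders
ORDER BY order_id
"""

clean_rows = conn.execute(CLEANING_QUERY).fetchall()
for row in clean_rows:
    print(row)
```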
Questions?