Obtaining Data
Data can come from various sources and formats. Understanding how to
collect data is a critical step in data analysis.
Web scraping involves fetching the HTML structure of a webpage and parsing
it to extract the required information.
Downloading Files: data is often distributed as downloadable files (such as CSV or Excel) that you load directly into your analysis tool.
Data cleaning ensures the data is accurate, consistent, and ready for
analysis.
Common issues:
Missing values.
Duplicates.
Techniques:
1. Handle Missing Values: Python: pandas.DataFrame.fillna().
2. Remove Duplicates: Python: pandas.DataFrame.drop_duplicates().
3. Standardize Data: Python: str.title() for text, pandas.to_datetime() for dates.
5. Handle Outliers: detect with Z-scores or box plots, then remove or correct them.
Tidy data is structured in a way that makes it easy to analyze and visualize.
Reshape from wide to long when data is in a wide format (e.g., multiple columns for similar variables).
Tools:
Python: pandas for cleaning; pandas.melt() for reshaping; Matplotlib and seaborn for data visualization (to identify cleaning needs).
R and Excel offer comparable options.
Practical Example
Steps:
2. Standardize Names: convert the Name column to title case.
3. Format Dates: convert date columns with pandas.to_datetime().
4. Make Tidy: reshape wide columns into long format with pandas.melt().
Getting and cleaning data is a critical part of data analysis. It ensures that
raw data is transformed into a usable format for further processing. By
mastering data collection and cleaning, you can build a solid foundation for
any data science project.
1. Obtaining Data
Data is everywhere, but how we get it depends on the source. Each source
requires specific tools and techniques.
Web scraping:
Example: You want the list of top movies from a website like IMDb.
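A minimal sketch of that workflow, using requests to fetch the page and BeautifulSoup to parse it. The URL and the assumption that titles sit in <h3> tags are hypothetical; a real site like IMDb has its own markup and terms of use, so treat this as the pattern rather than a working scraper.

import requests
from bs4 import BeautifulSoup

# Fetch the HTML of the page (hypothetical URL)
response = requests.get("https://example.com/top-movies")
response.raise_for_status()

# Parse the HTML and collect the text of every <h3> tag (assumed to hold the titles)
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3")]
print(titles[:10])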
APIs are like waiters in a restaurant. You (the client) ask for specific data (the
meal), and the API provides it (serves you the food).
Send a request to the API with parameters like the city name.
Authentication: Some APIs need keys to verify you’re authorized to use them.
import requests

response = requests.get(
    "https://api.weatherapi.com/v1/current.json?key=YOUR_API_KEY&q=London"
)
print(response.json())
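As a small variation, requests can build the query string for you from a params dictionary, which keeps the API key and the city as separate, readable values (same endpoint as above):

import requests

response = requests.get(
    "https://api.weatherapi.com/v1/current.json",
    params={"key": "YOUR_API_KEY", "q": "London"},
)
print(response.json())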
When someone hands you a dataset, always ask for a data dictionary (a document explaining each column and value).
Example issues:
Missing documentation about what “N/A” means (is it missing data or “not
applicable”?).
2. Basics of Data Cleaning
Missing Values:
Imagine a dataset of students’ grades, but some rows don’t have scores.
Duplicates: the same record appears more than once, such as two identical entries for one customer.
Inconsistent Formatting: the same value written in different ways, such as "john smith" and "John Smith", or dates in mixed formats.
Outliers:
If a person’s age is listed as 500, it’s likely an error.
Techniques:
1. Handling Missing Values:
Example in Python:
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
2. Removing Duplicates:
Example: Two identical entries for the same customer.
df = df.drop_duplicates()
3. Standardizing Data:
df['Name'] = df['Name'].str.title()
5. Handling Outliers:
Example: If you’re analyzing employee ages, and one entry is 300, it’s likely
a mistake. Decide to remove it or fix it.
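A short sketch of both options for the age example, on a small illustrative DataFrame (the 120 cutoff is arbitrary):

import numpy as np
import pandas as pd

# Illustrative data: one entry has an impossible age
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cara'], 'Age': [34, 300, 29]})

# Option 1: remove rows with implausible ages
df_removed = df[df['Age'] <= 120]

# Option 2: treat the bad value as missing so it can be filled or reviewed later
df.loc[df['Age'] > 120, 'Age'] = np.nan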
Wide to Long:
Example: Columns like Sales_Q1 and Sales_Q2 hold the same kind of measurement; reshaping them into rows gives one observation per row, which is easier to analyze and visualize.
Steps (worked in full in the real-life example below):
2. Standardize names: convert Name to title case.
Key Takeaways
Getting and cleaning data transforms raw, messy inputs into a reliable foundation for analysis. The sections below walk through each step with pandas code.
1. Loading Data
import pandas as pd

# Load the dataset from a CSV file
df = pd.read_csv('data.csv')
2. Handling Missing Values
# Count missing values in each column
print(df.isnull().sum())
# Fill missing values with a constant
df['Column_Name'] = df['Column_Name'].fillna(0)
# Or fill them with the column mean
df['Column_Name'] = df['Column_Name'].fillna(df['Column_Name'].mean())
Drop Rows with Missing Values
# Drop any row that contains a missing value
df = df.dropna()
# Drop rows only when a specific column is missing
df = df.dropna(subset=['Column_Name'])
3. Removing Duplicates
# Drop exact duplicate rows
df = df.drop_duplicates()
# Verify that no duplicates remain
print(df.duplicated().sum())
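drop_duplicates() can also be told which columns define a duplicate and which copy to keep; subset and keep are standard pandas parameters (the Name column here is just for illustration):

# Treat rows with the same Name as duplicates and keep the first occurrence
df = df.drop_duplicates(subset=['Name'], keep='first')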
4. Standardizing Data
# Convert to lowercase
df['Name'] = df['Name'].str.lower()
# Convert to title case (capitalize each word)
df['Name'] = df['Name'].str.title()
Standardize Dates
# Parse the column as datetime values
df['Date_Column'] = pd.to_datetime(df['Date_Column'])
print(df['Date_Column'].head())
Convert Data Types
# Convert to integer
df['Column_Name'] = df['Column_Name'].astype(int)
# Convert to float
df['Column_Name'] = df['Column_Name'].astype(float)
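Real-world columns often contain values that cannot be converted cleanly. Both pd.to_numeric() and pd.to_datetime() accept errors='coerce', which turns unparseable entries into missing values (NaN/NaT) instead of raising an error, so they can be handled with the missing-value techniques above:

# Unparseable values become NaN / NaT instead of stopping the script
df['Column_Name'] = pd.to_numeric(df['Column_Name'], errors='coerce')
df['Date_Column'] = pd.to_datetime(df['Date_Column'], errors='coerce')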
5. Handling Outliers
Outliers can be detected using the Z-score, which measures how far a value
is from the mean.
from scipy.stats import zscore

# Compute how many standard deviations each value is from the mean
df['Z_Score'] = zscore(df['Column_Name'])
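A common rule of thumb (not a fixed rule) is to treat values more than 3 standard deviations from the mean as outliers. Building on the Z_Score column just computed:

# Keep only rows within 3 standard deviations of the mean
df = df[df['Z_Score'].abs() <= 3]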
Visualize Outliers
import matplotlib.pyplot as plt

# A box plot makes extreme values easy to spot
plt.boxplot(df['Column_Name'])
plt.show()
6. Tidying Data
When your data has columns like Sales_Q1, Sales_Q2, and you want them as
rows:
# Original data in wide format (values taken from the output below)
df = pd.DataFrame({'Name': ['John', 'Jane'],
                   'Sales_Q1': [100, 200], 'Sales_Q2': [120, 220]})
# Wide to long: one row per person per quarter
df_long = pd.melt(df, id_vars='Name', var_name='Quarter', value_name='Sales')
print(df_long)
# Long back to wide
df_wide = df_long.pivot(index='Name', columns='Quarter', values='Sales')
print(df_wide)
Output (wide format):
| Name | Sales_Q1 | Sales_Q2 |
|------|----------|----------|
| John | 100 | 120 |
| Jane | 200 | 220 |
7. Real-Life Example
# Illustrative sample data (the original example values were not preserved)
data = {'Name': ['john smith', 'Jane Doe', 'john smith'],
        'Date_Joined': ['2021-01-15', None, '2021-01-15']}
df = pd.DataFrame(data)
Step-by-Step Cleaning:
1. Standardize Names:
df['Name'] = df['Name'].str.title()
2. Handle Missing Values:
df = df.dropna(subset=['Date_Joined'])
3. Format Dates:
df['Date_Joined'] = pd.to_datetime(df['Date_Joined'])
4. Remove Duplicates:
df = df.drop_duplicates()
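After these steps it is worth a quick sanity check that the cleaned result looks as expected, for example:

# Inspect column types, non-null counts, and the first few rows of the cleaned data
df.info()
print(df.head())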