
1. Obtaining Data

Data can come from various sources and formats. Understanding how to
collect data is a critical step in data analysis.

1.1. Obtaining Data from the Web

Web Scraping: Extracting data from websites using tools like:

Python libraries: BeautifulSoup, Scrapy, or Selenium.

Web scraping involves fetching the HTML structure of a webpage and parsing
it to extract the required information.

Ethical considerations: Always check a website’s robots.txt file for permissions.

Downloading Files:

Websites often provide downloadable files (e.g., CSV, Excel, or JSON).

Use tools like requests in Python to automate downloads.
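A minimal download sketch with requests (the URL and file name are placeholders, not a real dataset):

import requests

# Fetch the file and write its raw bytes to disk
response = requests.get("https://example.com/data.csv")
response.raise_for_status()  # stop if the download failed

with open("data.csv", "wb") as f:
    f.write(response.content)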

1.2. Obtaining Data from APIs


APIs (Application Programming Interfaces) allow programs to interact with
web services to fetch data.

Examples: Twitter API, Google Maps API.

Tools for working with APIs:

requests for making HTTP requests.

json (or response.json()) for parsing API responses.

Steps to use an API:

1. Obtain an API key if required.

2. Send a request to the API endpoint.

3. Process and store the response.
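A sketch of those three steps against a hypothetical endpoint (the URL and parameter names are placeholders; real APIs document their own):

import requests

# 1. Obtain an API key if required (placeholder value)
api_key = "YOUR_API_KEY"

# 2. Send a request to the API endpoint, passing parameters in the query string
url = "https://api.example.com/v1/data"
response = requests.get(url, params={"q": "London", "key": api_key})
response.raise_for_status()  # fail loudly on HTTP errors

# 3. Process and store the response
data = response.json()
print(data)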


1.3. Obtaining Data from Databases

Databases store structured data that can be queried.

Examples: MySQL, PostgreSQL, MongoDB.

Use SQL (Structured Query Language) to interact with relational databases.

Tools like pymysql, psycopg2, or SQLAlchemy help connect Python to databases.
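As a sketch, pandas can read a query result straight into a DataFrame through an SQLAlchemy engine; the connection string below is a placeholder for a local PostgreSQL database:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials: user "user", password "secret", database "shop"
engine = create_engine("postgresql+psycopg2://user:secret@localhost/shop")

# Run the query and load the result as a DataFrame
df = pd.read_sql("SELECT * FROM books WHERE year > 2000", engine)
print(df.head())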

1.4. Obtaining Data from Colleagues

Data is often shared in formats like CSV, Excel, or JSON.

Ensure consistent communication about:

Format of the data.

Column names and definitions.

Handling missing or incomplete data.


2. Basics of Data Cleaning

Data cleaning ensures the data is accurate, consistent, and ready for
analysis.

2.1. Common Issues in Raw Data

Missing values.

Duplicates.

Inconsistent formatting (e.g., date formats, case sensitivity).

Outliers and anomalies.

Incorrect data types.

2.2. Steps in Data Cleaning

1. Handle Missing Data:

Techniques:

Fill missing values using mean, median, or mode.


Drop rows or columns with too many missing values.

Use domain-specific imputation strategies.

2. Remove Duplicates:

Identify and drop duplicate rows using tools like:

Python: pandas.DataFrame.drop_duplicates().

3. Standardize Data:

Convert text to consistent case (e.g., lowercase).

Format dates consistently.

Ensure numerical data is scaled appropriately.


4. Correct Data Types:

Convert columns to their correct types (e.g., float, int, datetime).

5. Handle Outliers:

Identify outliers using statistical methods (e.g., Z-score, IQR).

Decide whether to remove or transform outliers.
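Step 5 can be sketched with the IQR rule, which flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; the toy column below is only for illustration:

import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 300]})  # 300 is an obvious outlier

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only the rows inside the 1.5 * IQR fences
df_clean = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df_clean)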

3. Making Data “Tidy”

Tidy data is structured in a way that makes it easy to analyze and visualize.

3.1. Characteristics of Tidy Data

Each variable forms a column.

Each observation forms a row.


Each type of observational unit forms a table.

3.2. Reshaping Data

Wide to Long Format:

Use when data is in a wide format (e.g., multiple columns for similar
variables).

Tools:

Python: pandas.melt().

Long to Wide Format:

Pivot data to create a summary.

Tools:

Python: pandas.pivot() or pandas.pivot_table().


3.3. Tools for Tidying Data

Python:

Pandas library for data manipulation.

Functions: melt(), pivot(), dropna(), etc.

R:

tidyr package with functions like gather() and spread() (newer tidyr versions use pivot_longer() and pivot_wider()).

Key Tools and Libraries

Python:

pandas for data cleaning and transformation.

NumPy for numerical operations.

Matplotlib and Seaborn for data visualization (to identify cleaning needs).
R:

Tidyverse for data manipulation and tidying.

Excel:

Useful for small datasets and quick checks.

Practical Example

Problem: Cleaning a Sales Dataset

Raw Data:

| ID | Name       | Sales (Q1) | Sales (Q2) | Date       |
|----|------------|------------|------------|------------|
| 1  | John Smith | 100        | 200        | 01/01/21   |
| 2  | jane doe   | 300        |            | 2021-02-01 |
| 3  | Bob Brown  | 150        | 400        | 2021/03/01 |

Steps:

1. Handle Missing Data:

Fill missing sales with 0 or the average of Q2.


2. Standardize Names:

Convert to title case: “Jane Doe”.

3. Format Dates:

Ensure consistent date format (e.g., YYYY-MM-DD).

4. Make Tidy:

Reshape to have one row per quarter per person.

Cleaned Data:

| ID | Name       | Quarter | Sales | Date       |
|----|------------|---------|-------|------------|
| 1  | John Smith | Q1      | 100   | 2021-01-01 |
| 1  | John Smith | Q2      | 200   | 2021-04-01 |
| 2  | Jane Doe   | Q1      | 300   | 2021-01-01 |
| 2  | Jane Doe   | Q2      | 0     | 2021-04-01 |

Conclusion

Getting and cleaning data is a critical part of data analysis. It ensures that
raw data is transformed into a usable format for further processing. By
mastering data collection and cleaning, you can build a solid foundation for
any data science project.

1. Obtaining Data

Data is everywhere, but how we get it depends on the source. Each source
requires specific tools and techniques.

1.1. From the Web

Web scraping:

Think of web scraping like copying information from a webpage automatically using a program.

Example: You want the list of top movies from a website like IMDb.

Use requests to get the webpage’s HTML.

Use BeautifulSoup to find specific parts (like movie titles).


Challenges:

Websites may change structure, breaking your scraper.

Legal/Ethical issues: Always check if scraping is allowed in the site’s robots.txt.
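A minimal sketch of that idea; the URL is a placeholder and the <h3> tag is an assumption about where the titles live (real pages differ, which is exactly why scrapers break):

import requests
from bs4 import BeautifulSoup

# Fetch the page's HTML (placeholder URL)
html = requests.get("https://example.com/top-movies").text

# Parse it and pull out the elements that hold the titles (assumed <h3> here)
soup = BeautifulSoup(html, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3")]
print(titles[:10])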

1.2. From APIs

APIs are like waiters in a restaurant. You (the client) ask for specific data (the
meal), and the API provides it (serves you the food).

Example: You want weather data.

Use a weather API such as OpenWeather or WeatherAPI (the example code below uses the latter’s endpoint).

Send a request to the API with parameters like the city name.

Receive a JSON response containing weather details.


Key Terms:

Endpoint: The URL where you send your request.

Authentication: Some APIs need keys to verify you’re authorized to use them.

Example Python code:

import requests

response = requests.get(
    "https://api.weatherapi.com/v1/current.json?key=YOUR_API_KEY&q=London"
)
print(response.json())

1.3. From Databases

A database is a structured way to store data. Think of it as an organized library.

Example:

Tables in a database could look like:

Books Table: Title, Author, Year.

Borrowers Table: Name, Book Borrowed, Date.


To fetch data, you use SQL queries like:

SELECT * FROM Books WHERE Year > 2000;

Python libraries like pymysql can run these queries programmatically.
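A short sketch of running that query from Python; the connection details are placeholders for a local MySQL server:

import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="library")
try:
    with conn.cursor() as cur:
        # Parameterized form of the query above
        cur.execute("SELECT * FROM Books WHERE Year > %s", (2000,))
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()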

1.4. From Colleagues

When data comes from colleagues:

Always ask for a data dictionary (a document explaining each column and
value).

Example issues:

Column names like Cust_ID vs CustomerID.

Missing documentation about what “N/A” means (is it missing data or “not applicable”?).
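If you learn that “N/A” really does mean missing, you can tell pandas so when loading the file; pandas recognizes common markers like “NA” on its own, and na_values adds dataset-specific ones (the file name here is a placeholder):

import pandas as pd

# Treat the colleague-specific strings as missing values on load
df = pd.read_csv("shared_data.csv", na_values=["N/A", "not applicable"])
print(df.isnull().sum())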
2. Basics of Data Cleaning

Raw data is often messy. Cleaning it prepares it for analysis.

2.1. Common Problems

Missing Values:

Imagine a dataset of students’ grades, but some rows don’t have scores.

Duplicates:

If you’re analyzing customers, a duplicate might count the same person twice.

Inconsistent Formatting:

Dates might appear as 01/01/2023, 2023-01-01, or January 1, 2023.

Outliers:
If a person’s age is listed as 500, it’s likely an error.

2.2. Cleaning Techniques

1. Handling Missing Data:

If a column has many missing values, you might drop it.

For fewer missing values:

Fill with a meaningful default.

Example: Replace missing salaries with the average salary.

Example in Python:

df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

2. Removing Duplicates:
Example: Two identical entries for the same customer.

df = df.drop_duplicates()

3. Standardizing Data:

Make text consistent.

Convert names to title case (e.g., john smith → John Smith).

df['Name'] = df['Name'].str.title()

4. Fixing Data Types:

Example: Dates stored as text strings (“2023-01-01”) need to be converted into a date format.

5. Handling Outliers:

Example: If you’re analyzing employee ages and one entry is 300, it’s likely a mistake. Decide whether to remove it or fix it.

3. Making Data “Tidy”

“Tidy data” is a way of organizing your dataset so it’s easier to analyze.

3.1. Principles of Tidy Data

1. Columns are Variables:

Each column represents a single type of information.

Example:

Bad format:

| Name | Sales_Q1 | Sales_Q2 |
|------|----------|----------|
| John | 200      | 250      |

Tidy format:

| Name | Quarter | Sales |
|------|---------|-------|
| John | Q1      | 200   |
| John | Q2      | 250   |
2. Rows are Observations:

Each row is a single instance of observation.

3. One Table Per Observational Unit:

If your data contains multiple types of observations (e.g., customers and orders), keep them in separate tables.

3.2. Reshaping Data

Wide to Long:

Use pandas.melt() to make columns into rows.

Example:

df = pd.melt(df, id_vars=['Name'], var_name='Quarter', value_name='Sales')

Long to Wide:

Use pandas.pivot() or pivot_table().

Example:

df.pivot(index='Name', columns='Quarter', values='Sales')

Practical Example: Cleaning and Tidying Data

Raw Data:

| ID | Name  | Product1_Sales | Product2_Sales | Joined On  |
|----|-------|----------------|----------------|------------|
| 1  | John  | 100            | 200            | 01/01/21   |
| 2  | Jane  | 300            | NaN            | 02-02-2021 |
| 3  | Smith | 150            | 400            | 03.03.2021 |

Steps:

1. Handle missing data:

Fill NaN in Product2_Sales with 0.

2. Standardize names:
Convert Name to title case.

3. Standardize date format:

Convert Joined On to YYYY-MM-DD.

4. Tidy the dataset:

Reshape data to have Product and Sales columns.
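A minimal end-to-end sketch of those four steps, assuming pandas 2.x (format='mixed' lets each differently formatted date be parsed on its own); it produces the tidy table below:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Smith'],
    'Product1_Sales': [100, 300, 150],
    'Product2_Sales': [200, None, 400],
    'Joined On': ['01/01/21', '02-02-2021', '03.03.2021']
})

# 1. Fill missing sales with 0
sales_cols = ['Product1_Sales', 'Product2_Sales']
df[sales_cols] = df[sales_cols].fillna(0)

# 2. Standardize names
df['Name'] = df['Name'].str.title()

# 3. Standardize the date format as YYYY-MM-DD
df['Joined On'] = pd.to_datetime(df['Joined On'], format='mixed').dt.strftime('%Y-%m-%d')

# 4. Reshape so each row holds one product's sales
tidy = df.melt(id_vars=['ID', 'Name', 'Joined On'],
               var_name='Product', value_name='Sales')
tidy['Product'] = tidy['Product'].str.replace('_Sales', '')
print(tidy.sort_values('ID'))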

Tidy Data:

| ID | Name  | Product  | Sales | Joined On  |
|----|-------|----------|-------|------------|
| 1  | John  | Product1 | 100   | 2021-01-01 |
| 1  | John  | Product2 | 200   | 2021-01-01 |
| 2  | Jane  | Product1 | 300   | 2021-02-02 |
| 2  | Jane  | Product2 | 0     | 2021-02-02 |
| 3  | Smith | Product1 | 150   | 2021-03-03 |
| 3  | Smith | Product2 | 400   | 2021-03-03 |


Key Takeaways

Data is messy by default; cleaning is essential before analysis.


A consistent process:

1. Get the data from reliable sources.

2. Identify and fix issues (missing data, duplicates, inconsistent formats).

3. Organize data into tidy formats for analysis.

1. Loading Data

First, we need to load the data into Python.

Example: Load a CSV File

import pandas as pd

# Load the CSV file
df = pd.read_csv('data.csv')

# Display the first few rows
print(df.head())

2. Handling Missing Data

Identify Missing Data

# Check for missing values
print(df.isnull().sum())

# Check the percentage of missing values
print(df.isnull().sum() / len(df) * 100)

Fill Missing Data

1. Replace missing values with a default value (e.g., 0):

df['Column_Name'] = df['Column_Name'].fillna(0)

2. Replace missing values with the column mean:

df['Column_Name'] = df['Column_Name'].fillna(df['Column_Name'].mean())

Drop Rows with Missing Values

# Drop rows with any missing values
df = df.dropna()

# Drop rows with missing values in a specific column
df = df.dropna(subset=['Column_Name'])

3. Removing Duplicates

# Remove duplicate rows
df = df.drop_duplicates()

# Check for duplicates
print(df.duplicated().sum())

4. Standardizing Data

Convert Text to Consistent Case

# Convert a column to lowercase
df['Name'] = df['Name'].str.lower()

# Convert to title case (capitalize each word)
df['Name'] = df['Name'].str.title()

Standardize Dates

# Convert to datetime format
df['Date_Column'] = pd.to_datetime(df['Date_Column'])

# Check the new format
print(df['Date_Column'].head())

Fix Data Types

# Convert a column to integer
df['Column_Name'] = df['Column_Name'].astype(int)

# Convert to float
df['Column_Name'] = df['Column_Name'].astype(float)

5. Handling Outliers

Detect Outliers Using Z-Score

Outliers can be detected using the Z-score, which measures how far a value
is from the mean.
from scipy.stats import zscore

# Calculate Z-scores for a column
df['Z_Score'] = zscore(df['Column_Name'])

# Keep only rows where the Z-score is between -3 and 3
df = df[(df['Z_Score'] < 3) & (df['Z_Score'] > -3)]

Visualize Outliers

import matplotlib.pyplot as plt

# Boxplot to detect outliers
plt.boxplot(df['Column_Name'])
plt.show()

6. Tidying Data

Wide to Long Format

When your data has columns like Sales_Q1, Sales_Q2, and you want them as
rows:

# Original data
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Smith'],
    'Sales_Q1': [100, 200, 150],
    'Sales_Q2': [120, 220, 180]
})

# Reshape to long format
df_long = pd.melt(df, id_vars=['Name'], var_name='Quarter', value_name='Sales')
print(df_long)

Output:

| Name  | Quarter  | Sales |
|-------|----------|-------|
| John  | Sales_Q1 | 100   |
| Jane  | Sales_Q1 | 200   |
| Smith | Sales_Q1 | 150   |
| John  | Sales_Q2 | 120   |
| Jane  | Sales_Q2 | 220   |
| Smith | Sales_Q2 | 180   |

Long to Wide Format

Reshape long data into wide format.

# Pivot to reshape the long data back to wide format
df_wide = df_long.pivot(index='Name', columns='Quarter', values='Sales')
print(df_wide)
Output (pivot sorts the Name index alphabetically):

| Name  | Sales_Q1 | Sales_Q2 |
|-------|----------|----------|
| Jane  | 200      | 220      |
| John  | 100      | 120      |
| Smith | 150      | 180      |


7. Real-Life Example

Imagine you have the following messy dataset:

data = {
    'Name': ['John', 'JANE', 'Smith', 'John'],
    'Age': [25, None, 30, 25],
    'Date_Joined': ['01/01/2021', '02-02-2021', 'March 3, 2021', None],
    'Salary': [50000, 60000, None, 50000]
}

df = pd.DataFrame(data)

Step-by-Step Cleaning:

1. Standardize Names:

df['Name'] = df['Name'].str.title()

2. Handle Missing Data:

# Fill missing Age with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Drop rows with missing Date_Joined
df = df.dropna(subset=['Date_Joined'])

3. Fix Date Format:

# format='mixed' (pandas 2.0+) parses each entry on its own,
# since these dates come in several different formats
df['Date_Joined'] = pd.to_datetime(df['Date_Joined'], format='mixed')

4. Remove Duplicates:

df = df.drop_duplicates()

Final Cleaned Data:

| Name  | Age   | Date_Joined | Salary  |
|-------|-------|-------------|---------|
| John  | 25.0  | 2021-01-01  | 50000.0 |
| Jane  | 26.67 | 2021-02-02  | 60000.0 |
| Smith | 30.0  | 2021-03-03  | NaN     |

(Jane’s Age is filled with the mean of the observed ages, (25 + 30 + 25) / 3 ≈ 26.67.)
