Obtaining Data
Data can come from various sources and formats. Understanding how to
collect data is a critical step in data analysis.
Web scraping involves fetching the HTML structure of a webpage and parsing
it to extract the required information.
Downloading Files: data is often distributed as downloadable files (such as CSV or Excel) that you load directly into your analysis tool.
Data cleaning ensures the data is accurate, consistent, and ready for
analysis.
Common issues:
Missing values.
Duplicates.
Techniques:
1. Handle Missing Values: Python: pandas.DataFrame.fillna().
2. Remove Duplicates: Python: pandas.DataFrame.drop_duplicates().
3. Standardize Data: Python: str.title() for text, pandas.to_datetime() for dates.
5. Handle Outliers: detect with Z-scores or box plots, then remove or correct them.
Tidy data is structured in a way that makes it easy to analyze and visualize.
Reshape from wide to long when data is in a wide format (e.g., multiple columns for similar variables).
Tools:
Python: pandas for cleaning; pandas.melt() for reshaping; Matplotlib and seaborn for data visualization (to identify cleaning needs).
R and Excel offer comparable options.
Practical Example
Steps:
2. Standardize Names: convert the Name column to title case.
3. Format Dates: convert date columns with pandas.to_datetime().
4. Make Tidy: reshape wide columns into long format with pandas.melt().
Getting and cleaning data is a critical part of data analysis. It ensures that
raw data is transformed into a usable format for further processing. By
mastering data collection and cleaning, you can build a solid foundation for
any data science project.
1. Obtaining Data
Data is everywhere, but how we get it depends on the source. Each source
requires specific tools and techniques.
Web scraping:
Example: You want the list of top movies from a website like IMDb.
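A minimal sketch of that workflow, using requests to fetch the page and BeautifulSoup to parse it. The URL and the assumption that titles sit in <h3> tags are hypothetical; a real site like IMDb has its own markup and terms of use, so treat this as the pattern rather than a working scraper.

import requests
from bs4 import BeautifulSoup

# Fetch the HTML of the page (hypothetical URL)
response = requests.get("https://example.com/top-movies")
response.raise_for_status()

# Parse the HTML and collect the text of every <h3> tag (assumed to hold the titles)
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3")]
print(titles[:10])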
APIs are like waiters in a restaurant. You (the client) ask for specific data (the
meal), and the API provides it (serves you the food).
Send a request to the API with parameters like the city name.
Authentication: Some APIs need keys to verify you’re authorized to use them.
import requests

response = requests.get(
    "https://api.weatherapi.com/v1/current.json?key=YOUR_API_KEY&q=London"
)
print(response.json())
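As a small variation, requests can build the query string for you from a params dictionary, which keeps the API key and the city as separate, readable values (same endpoint as above):

import requests

response = requests.get(
    "https://api.weatherapi.com/v1/current.json",
    params={"key": "YOUR_API_KEY", "q": "London"},
)
print(response.json())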
When someone hands you a dataset, always ask for a data dictionary (a document explaining each column and value).
Example issues:
Missing documentation about what “N/A” means (is it missing data or “not
applicable”?).
2. Basics of Data Cleaning
Missing Values:
Imagine a dataset of students’ grades, but some rows don’t have scores.
Duplicates: the same record appears more than once, such as two identical entries for one customer.
Inconsistent Formatting: the same value written in different ways, such as "john smith" and "John Smith", or dates in mixed formats.
Outliers:
If a person’s age is listed as 500, it’s likely an error.
Techniques:
1. Handling Missing Values:
Example in Python:
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
2. Removing Duplicates:
Example: Two identical entries for the same customer.
df = df.drop_duplicates()
3. Standardizing Data:
df['Name'] = df['Name'].str.title()
5. Handling Outliers:
Example: If you’re analyzing employee ages, and one entry is 300, it’s likely
a mistake. Decide to remove it or fix it.
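A short sketch of both options for the age example, on a small illustrative DataFrame (the 120 cutoff is arbitrary):

import numpy as np
import pandas as pd

# Illustrative data: one entry has an impossible age
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cara'], 'Age': [34, 300, 29]})

# Option 1: remove rows with implausible ages
df_removed = df[df['Age'] <= 120]

# Option 2: treat the bad value as missing so it can be filled or reviewed later
df.loc[df['Age'] > 120, 'Age'] = np.nan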
Wide to Long:
Example: Columns like Sales_Q1 and Sales_Q2 hold the same kind of measurement; reshaping them into rows gives one observation per row, which is easier to analyze and visualize.
Steps (worked in full in the real-life example below):
2. Standardize names: convert Name to title case.
Key Takeaways
Getting and cleaning data transforms raw, messy inputs into a reliable foundation for analysis. The sections below walk through each step with pandas code.
1. Loading Data
import pandas as pd

# Load the dataset from a CSV file
df = pd.read_csv('data.csv')
2. Handling Missing Values
# Count missing values in each column
print(df.isnull().sum())
# Fill missing values with a constant
df['Column_Name'] = df['Column_Name'].fillna(0)
# Or fill them with the column mean
df['Column_Name'] = df['Column_Name'].fillna(df['Column_Name'].mean())
Drop Rows with Missing Values
# Drop any row that contains a missing value
df = df.dropna()
# Drop rows only when a specific column is missing
df = df.dropna(subset=['Column_Name'])
3. Removing Duplicates
# Drop exact duplicate rows
df = df.drop_duplicates()
# Verify that no duplicates remain
print(df.duplicated().sum())
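drop_duplicates() can also be told which columns define a duplicate and which copy to keep; subset and keep are standard pandas parameters (the Name column here is just for illustration):

# Treat rows with the same Name as duplicates and keep the first occurrence
df = df.drop_duplicates(subset=['Name'], keep='first')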
4. Standardizing Data
# Convert to lowercase
df['Name'] = df['Name'].str.lower()
# Convert to title case (capitalize each word)
df['Name'] = df['Name'].str.title()
Standardize Dates
# Parse the column as datetime values
df['Date_Column'] = pd.to_datetime(df['Date_Column'])
print(df['Date_Column'].head())
Convert Data Types
# Convert to integer
df['Column_Name'] = df['Column_Name'].astype(int)
# Convert to float
df['Column_Name'] = df['Column_Name'].astype(float)
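Real-world columns often contain values that cannot be converted cleanly. Both pd.to_numeric() and pd.to_datetime() accept errors='coerce', which turns unparseable entries into missing values (NaN/NaT) instead of raising an error, so they can be handled with the missing-value techniques above:

# Unparseable values become NaN / NaT instead of stopping the script
df['Column_Name'] = pd.to_numeric(df['Column_Name'], errors='coerce')
df['Date_Column'] = pd.to_datetime(df['Date_Column'], errors='coerce')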
5. Handling Outliers
Outliers can be detected using the Z-score, which measures how far a value
is from the mean.
from scipy.stats import zscore

# Compute how many standard deviations each value is from the mean
df['Z_Score'] = zscore(df['Column_Name'])
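A common rule of thumb (not a fixed rule) is to treat values more than 3 standard deviations from the mean as outliers. Building on the Z_Score column just computed:

# Keep only rows within 3 standard deviations of the mean
df = df[df['Z_Score'].abs() <= 3]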
Visualize Outliers
import matplotlib.pyplot as plt

# A box plot makes extreme values easy to spot
plt.boxplot(df['Column_Name'])
plt.show()
6. Tidying Data
When your data has columns like Sales_Q1, Sales_Q2, and you want them as
rows:
# Original data in wide format (values taken from the output below)
df = pd.DataFrame({'Name': ['John', 'Jane'],
                   'Sales_Q1': [100, 200], 'Sales_Q2': [120, 220]})
# Wide to long: one row per person per quarter
df_long = pd.melt(df, id_vars='Name', var_name='Quarter', value_name='Sales')
print(df_long)
# Long back to wide
df_wide = df_long.pivot(index='Name', columns='Quarter', values='Sales')
print(df_wide)
Output (wide format):
| Name | Sales_Q1 | Sales_Q2 |
|------|----------|----------|
| John | 100 | 120 |
| Jane | 200 | 220 |
7. Real-Life Example
# Illustrative sample data (the original example values were not preserved)
data = {'Name': ['john smith', 'Jane Doe', 'john smith'],
        'Date_Joined': ['2021-01-15', None, '2021-01-15']}
df = pd.DataFrame(data)
Step-by-Step Cleaning:
1. Standardize Names:
df['Name'] = df['Name'].str.title()
2. Handle Missing Values:
df = df.dropna(subset=['Date_Joined'])
3. Format Dates:
df['Date_Joined'] = pd.to_datetime(df['Date_Joined'])
4. Remove Duplicates:
df = df.drop_duplicates()
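After these steps it is worth a quick sanity check that the cleaned result looks as expected, for example:

# Inspect column types, non-null counts, and the first few rows of the cleaned data
df.info()
print(df.head())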