Data Science Unit-4

Exploratory Data Analysis (EDA) is a crucial process for data scientists to understand data sets through visualization and statistical methods. The EDA process includes steps such as data understanding, collection, cleaning, transformation, exploration, visualization, and hypothesis testing. Techniques and tools used in EDA help identify patterns, relationships, and outliers, ultimately refining hypotheses and informing model development.

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis-testing task, and it provides a better understanding of the variables in a data set and the relationships between them.

Steps Involved in Exploratory Data Analysis

1. Understand the Data

Familiarize yourself with the data set, understand the domain, and identify the
objectives of the analysis.

2. Data Collection

Collect the required data from various sources such as databases, web scraping,
or APIs.
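For instance, a minimal sketch of two common collection paths, loading a CSV export with pandas and calling a JSON REST API with requests; the file name and URL here are placeholders, not real sources.

```python
import pandas as pd
import requests

# Load a local CSV export (file name is a placeholder).
sales = pd.read_csv("sales_2023.csv")

# Pull records from a REST API (URL is hypothetical; assumes the endpoint
# returns a list of JSON records).
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()
orders = pd.DataFrame(response.json())

print(sales.shape, orders.shape)
```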

3. Data Cleaning

• Handle missing values: Impute or remove missing data.
• Remove duplicates: Ensure there are no duplicate records.
• Correct data types: Convert data types to appropriate formats.
• Fix errors: Address any inconsistencies or errors in the data (see the sketch below).
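A minimal pandas sketch of these cleaning steps, assuming a hypothetical data set with columns such as age, target, order_date, category, and city:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")   # placeholder file name

# Handle missing values: impute numeric gaps, drop rows missing the target.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["target"])

# Remove duplicates: ensure there are no duplicate records.
df = df.drop_duplicates()

# Correct data types: convert columns to appropriate formats.
df["order_date"] = pd.to_datetime(df["order_date"])
df["category"] = df["category"].astype("category")

# Fix errors: address simple inconsistencies such as stray whitespace and casing.
df["city"] = df["city"].str.strip().str.title()
```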

4. Data Transformation

• Normalize or standardize the data if necessary.
• Create new features through feature engineering.
• Aggregate or disaggregate data based on analysis needs (see the sketch below).
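One way these transformations might look with pandas and scikit-learn, using a small made-up frame with income, age, and household_size columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame; real data would come from the cleaning step.
df = pd.DataFrame({
    "income": [42_000, 55_000, 61_000, 38_000],
    "age": [23, 35, 41, 29],
    "household_size": [1, 2, 4, 2],
})

# Standardize numeric columns to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["income", "age"]])
df["income_z"], df["age_z"] = scaled[:, 0], scaled[:, 1]

# Feature engineering: derive a new feature from existing ones.
df["income_per_member"] = df["income"] / df["household_size"]

# Aggregate: average income per household size.
print(df.groupby("household_size")["income"].mean())
print(df.round(2))
```
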
5. Data Integration

Integrate data from various sources to create a complete data set.
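A hedged sketch of integration with pandas, joining two made-up tables on a shared key:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["North", "South", "East"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 3],
                       "amount": [120.0, 80.0, 45.5]})

# Join the two sources on their shared key to form one analysis table.
combined = orders.merge(customers, on="customer_id", how="left")
print(combined)
```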

6. Data Exploration

• Univariate Analysis: Analyze individual variables using summary statistics and visualizations (e.g., histograms, box plots).
• Bivariate Analysis: Analyze the relationship between two variables with scatter plots, correlation coefficients, and cross-tabulations.
• Multivariate Analysis: Investigate interactions between multiple variables using pair plots and correlation matrices (see the sketch after this list).
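A short sketch of the three levels of exploration, assuming the penguins sample data set that ships with seaborn is available:

```python
import pandas as pd
import seaborn as sns

df = sns.load_dataset("penguins").dropna()

# Univariate: summary statistics for a single variable.
print(df["body_mass_g"].describe())

# Bivariate: correlation coefficient and a cross-tabulation.
print(df["flipper_length_mm"].corr(df["body_mass_g"]))
print(pd.crosstab(df["species"], df["island"]))

# Multivariate: correlation matrix and a pair plot across several variables.
print(df.corr(numeric_only=True))
sns.pairplot(df, hue="species")
```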

7. Data Visualization

Visualize data distributions and relationships using visual tools such as bar
charts, line charts, scatter plots, heat maps, and box plots.
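A sketch of these chart types with seaborn and matplotlib, again assuming the seaborn penguins sample data set:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("penguins").dropna()
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar chart of category counts.
df["species"].value_counts().plot.bar(ax=axes[0, 0], title="Counts by species")

# Box plot of a numeric variable split by category.
sns.boxplot(data=df, x="species", y="body_mass_g", ax=axes[0, 1])

# Heat map of the correlation matrix.
sns.heatmap(df.corr(numeric_only=True), annot=True, ax=axes[1, 0])

# Line chart: body mass ordered by flipper length, smoothed with a rolling mean.
mass = df.sort_values("flipper_length_mm")["body_mass_g"].reset_index(drop=True)
mass.rolling(20).mean().plot(ax=axes[1, 1], title="Rolling mean of body mass")

plt.tight_layout()
plt.show()
```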

8. Descriptive Statistics

Calculate central tendency measures (mean, median, mode) and dispersion measures (range, variance, standard deviation).
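For example, with a toy series of seven values:

```python
import pandas as pd

values = pd.Series([12, 15, 15, 18, 21, 24, 80])   # toy data with one extreme value

# Central tendency.
print("mean:", values.mean())
print("median:", values.median())
print("mode:", values.mode().tolist())

# Dispersion.
print("range:", values.max() - values.min())
print("variance:", values.var())
print("standard deviation:", values.std())
```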

9. Identify Patterns and Outliers

Detect patterns, trends, and outliers in the data using visualizations and
statistical methods.
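A common rule of thumb for flagging outliers is the 1.5 * IQR fence; a sketch on the same toy series as above:

```python
import pandas as pd

values = pd.Series([12, 15, 15, 18, 21, 24, 80])

# Flag points falling outside the 1.5 * IQR fences.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)   # the extreme value 80 is flagged
```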

10. Hypothesis Testing

Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square
tests) to validate assumptions or relationships in the data.
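A minimal scipy.stats sketch of the two tests mentioned above, run on made-up samples:

```python
from scipy import stats

# Two-sample t-test: do the two groups have different means?
group_a = [5.1, 4.8, 5.4, 5.0, 5.2]
group_b = [5.9, 6.1, 5.7, 6.0, 5.8]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Chi-square test of independence on a small contingency table.
table = [[30, 10], [20, 40]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```
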
11. Data Summarization

Summarize findings with descriptive statistics, visualizations, and key insights.

12. Documentation and Reporting

• Document the EDA process, findings, and insights in a clear, structured way.
• Create reports and presentations to convey results to stakeholders.

13. Iterate and Refine

Continuously refine the analysis based on feedback and additional questions during the process.

Common multivariate statistical techniques used to visualize high-dimensional data

Visualization Techniques:
• Scatter Plots: Scatter plots are a fundamental visualization technique that displays data points on a 2D plane, allowing for the exploration of relationships between two variables.
• Scatterplot Matrices: Scatterplot matrices display all possible pairwise scatter plots for a set of variables, providing a comprehensive overview of relationships between multiple variables (see the sketch after this list).
• Parallel Coordinate Plots: Parallel coordinate plots represent data points as lines connecting different variables, which can be useful for visualizing relationships between multiple variables and identifying patterns.
• Heatmaps: Heatmaps use color to represent data values, which can be particularly useful for visualizing relationships between variables in a matrix format.
• Spider Plots/Radar Charts: These plots are useful for comparing multiple variables across different categories or individuals, allowing for a clear visualization of the relative values of each variable.
• Table Lens: This technique combines the strengths of tables and charts, allowing for the visualization of large datasets with multiple variables.
• Small Multiples: Small multiples involve displaying a series of similar plots, each representing a different subset of the data, which can be useful for identifying patterns across different groups or conditions.
• Data Context Maps: These maps are useful for visualizing complex data by combining multiple data layers and visual elements, allowing for a more comprehensive understanding of the data.
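A sketch of three of these techniques (scatterplot matrix, parallel coordinate plot, heatmap) using pandas' built-in plotting helpers and seaborn, assuming the iris sample data set that ships with seaborn:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates, scatter_matrix

df = sns.load_dataset("iris")

# Scatterplot matrix: all pairwise scatter plots of the numeric columns.
scatter_matrix(df, figsize=(8, 8))

# Parallel coordinate plot: each line is one observation across all variables.
plt.figure()
parallel_coordinates(df, class_column="species")

# Heatmap of the correlation matrix.
plt.figure()
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")

plt.show()
```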
Eliminating or sharpening potential hypotheses about the world that can be addressed by the data

A hypothesis is a tentative statement that expresses a relationship between variables or phenomena that can be tested empirically.

Generating hypotheses for a data analytics project follows some general steps that make the process easier and more effective.

Firstly, define the research question or goal and review the background
knowledge and literature.

Secondly, identify the data sources and methods, then brainstorm possible
hypotheses.

Finally, prioritize and select the most relevant, feasible, and impactful
hypotheses using certain criteria.

EDA facilitates hypothesis refinement:

• Identifying Patterns and Relationships: EDA techniques like visualization and summary statistics reveal patterns and relationships within the data that might not be immediately apparent.
• Uncovering Outliers and Anomalies: Identifying outliers or unusual data points can lead to the refinement of hypotheses by suggesting that certain assumptions or models might not hold true for all cases.
• Informing Model Development: The insights gained from EDA can inform the development of more complex statistical models by suggesting relevant variables, transformations, or even alternative modeling approaches.
• Refining Hypotheses: By understanding the data's characteristics, researchers can refine their initial hypotheses, making them more focused and testable.

Examples of EDA Techniques:
• Univariate Analysis: Analyzing individual variables (e.g., histograms, box plots).
• Bivariate Analysis: Examining the relationships between two variables (e.g., scatter plots, correlation matrices).
• Multivariate Analysis: Exploring relationships among multiple variables (e.g., clustering, dimensionality reduction).

Tools for EDA:
• Statistical Software: R, Python (with libraries like Pandas, NumPy, Scikit-learn), SAS, SPSS.
• Spreadsheet Software: Excel, Google Sheets.
• Data Visualization Tools: Tableau, Power BI.
