datascience unit-4
Exploratory data analysis (EDA) is used to investigate data sets and summarize
their main characteristics, often employing data visualization methods.
EDA is primarily used to see what data can reveal beyond the formal modeling
or hypothesis testing task, and it provides a better understanding of data set
variables and the relationships between them.
1. Understanding the Data
Familiarize yourself with the data set, understand the domain, and identify the
objectives of the analysis.
2. Data Collection
Collect the required data from various sources such as databases, web scraping,
or APIs.
3. Data Cleaning
4. Data Transformation
6. Data Exploration
7. Data Visualization
Visualize data distributions and relationships using visual tools such as bar
charts, line charts, scatter plots, heat maps, and box plots.
8. Descriptive Statistics
9. Pattern and Outlier Detection
Detect patterns, trends, and outliers in the data using visualizations and
statistical methods.
10. Hypothesis Testing
Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square
tests) to validate assumptions or relationships in the data.
11. Data Summarization
12. Documentation and Reporting
Document the EDA process, findings, and insights in a clear, structured way.
Create reports and presentations to convey results to stakeholders.
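Several of the steps above (cleaning, descriptive statistics, outlier
detection, hypothesis testing) can be sketched in a few lines. This is a
minimal illustration, assuming pandas, NumPy, and SciPy are installed. The data
set, the column names, the 1.5 * IQR outlier rule, and the choice of Welch's
t-test are assumptions made for the example, not part of the notes.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: scores for two groups, with one missing value and
# one implausible value planted on purpose.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "score": np.concatenate([rng.normal(70, 5, 50), rng.normal(75, 5, 50)]),
})
df.loc[3, "score"] = np.nan    # missing value
df.loc[10, "score"] = 250.0    # implausible value

# Data cleaning: drop rows whose score is missing.
clean = df.dropna(subset=["score"])

# Descriptive statistics: count, mean, std, and quartiles per group.
summary = clean.groupby("group")["score"].describe()

# Outlier detection with the 1.5 * IQR rule (one common convention).
q1, q3 = clean["score"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["score"] < q1 - 1.5 * iqr) | (clean["score"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]
kept = clean[~is_outlier]

# Hypothesis testing: Welch's t-test for a difference in group means.
a = kept.loc[kept["group"] == "A", "score"]
b = kept.loc[kept["group"] == "B", "score"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

print(summary)
print("outliers flagged:", len(outliers))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Whether a flagged value should be removed or kept is a judgment call that
depends on the domain; the rule only points at candidates.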
Visualization Techniques:
Scatter Plots:
Scatter plots are a fundamental visualization technique that displays data
points on a 2D plane, allowing for the exploration of relationships between
two variables.
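As a rough sketch, a scatter plot can be drawn with matplotlib as follows. The
variables (hours studied, exam marks), the synthetic data, and the output file
name are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: marks rise roughly linearly with hours studied.
rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 80)
marks = 40 + 5 * hours + rng.normal(0, 6, 80)

fig, ax = plt.subplots()
ax.scatter(hours, marks, alpha=0.7)
ax.set_xlabel("hours studied")
ax.set_ylabel("exam marks")
ax.set_title("Scatter plot of two variables")
fig.savefig("scatter.png")

# A correlation coefficient puts a number on what the plot shows.
r = np.corrcoef(hours, marks)[0, 1]
print(f"Pearson r = {r:.2f}")
```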
Scatterplot Matrices:
Scatterplot matrices display all possible pairwise scatter plots for a set of
variables, providing a comprehensive overview of relationships between
multiple variables.
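One way to produce a scatterplot matrix is pandas' `scatter_matrix` helper. The
sketch below assumes pandas and matplotlib are available; the three variables
and their values are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: weight depends on height, age is independent of both.
rng = np.random.default_rng(2)
n = 100
height = rng.normal(170, 10, n)
df = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height - 85 + rng.normal(0, 6, n),
    "age": rng.uniform(18, 60, n),
})

# One panel per pair of variables; histograms on the diagonal.
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
axes[0, 0].figure.savefig("scatter_matrix.png")
```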
Parallel Coordinate Plots:
Parallel coordinate plots represent data points as lines connecting different
variables, which can be useful for visualizing relationships between multiple
variables and identifying patterns.
Heatmaps:
Heatmaps use color to represent data values, which can be particularly useful
for visualizing relationships between variables in a matrix format.
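A common use of a heatmap in EDA is to display a correlation matrix. The
following is a minimal sketch with matplotlib's `imshow`; the variables and the
synthetic data are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: one variable rises with temperature, one falls.
rng = np.random.default_rng(4)
n = 200
temp = rng.normal(25, 5, n)
df = pd.DataFrame({
    "temperature": temp,
    "ice_cream_sales": 20 * temp + rng.normal(0, 40, n),
    "umbrella_sales": -3 * temp + rng.normal(0, 30, n),
})

# Pairwise Pearson correlations, shown as a colored grid.
corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="correlation")
fig.tight_layout()
fig.savefig("heatmap.png")
```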
Spider Plots/Radar Charts:
These plots are useful for comparing multiple variables across different
categories or individuals, allowing for a clear visualization of the relative
values of each variable.
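A radar chart can be approximated with a polar axis in matplotlib. In this
sketch the categories, the two items being compared ("Tool A", "Tool B"), and
their ratings are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical ratings on a 1-5 scale.
labels = ["speed", "accuracy", "cost", "usability", "support"]
scores = {"Tool A": [4, 3, 2, 5, 4], "Tool B": [3, 5, 4, 2, 3]}

# One angle per variable; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
closed_angles = np.append(angles, angles[0])

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, values in scores.items():
    closed_values = values + values[:1]
    ax.plot(closed_angles, closed_values, label=name)
    ax.fill(closed_angles, closed_values, alpha=0.2)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.legend(loc="upper right")
fig.savefig("radar.png")
```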
Table Lens:
This technique combines the strengths of tables and charts, allowing for the
visualization of large datasets with multiple variables.
Small Multiples:
Small multiples involve displaying a series of similar plots, each representing
a different subset of the data, which can be useful for identifying patterns
across different groups or conditions.
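Small multiples map naturally onto a grid of subplots with shared axes. The
sketch below is one possible layout; the regions and the monthly sales figures
are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one random-walk sales series per region.
rng = np.random.default_rng(5)
months = np.arange(1, 13)
regions = ["North", "South", "East", "West"]

# Shared axes keep the panels directly comparable.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 5))
for ax, region in zip(axes.flat, regions):
    sales = 100 + rng.normal(0, 10, 12).cumsum()
    ax.plot(months, sales)
    ax.set_title(region)
for ax in axes[1, :]:
    ax.set_xlabel("month")
for ax in axes[:, 0]:
    ax.set_ylabel("sales")
fig.tight_layout()
fig.savefig("small_multiples.png")
```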
Data Context Maps:
These maps are useful for visualizing complex data by combining multiple
data layers and visual elements, allowing for a more comprehensive
understanding of the data.
EDA also helps to eliminate or sharpen potential hypotheses about the world
that can be addressed by the data.
Generating hypotheses for a data analytics project is easier and more
effective when it follows a few general steps.
First, define the research question or goal and review the background
knowledge and literature.
Second, identify the data sources and methods, then brainstorm possible
hypotheses.
Finally, rank the candidate hypotheses by relevance, feasibility, and impact,
and select the strongest ones.
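The final prioritization step could be made explicit with a simple weighted
score. This is only a sketch: the candidate hypotheses, the 1-5 ratings, and
the weights are all hypothetical, and in practice they would be agreed on with
the project's stakeholders.

```python
# Hypothetical candidates, each rated 1-5 on the three criteria.
hypotheses = [
    {"id": "H1", "text": "Discounts increase repeat purchases",
     "relevance": 5, "feasibility": 3, "impact": 4},
    {"id": "H2", "text": "Page load time affects sign-ups",
     "relevance": 3, "feasibility": 5, "impact": 2},
    {"id": "H3", "text": "Churn is higher among monthly subscribers",
     "relevance": 4, "feasibility": 4, "impact": 5},
]

# Assumed weights; they sum to 1 so the score stays on the 1-5 scale.
weights = {"relevance": 0.4, "feasibility": 0.3, "impact": 0.3}


def score(hypothesis):
    """Weighted sum of the criterion ratings."""
    return sum(weights[name] * hypothesis[name] for name in weights)


# Highest score first.
ranked = sorted(hypotheses, key=score, reverse=True)
for h in ranked:
    print(f"{h['id']}: {score(h):.1f}  {h['text']}")
```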