Unit-5 new

The document provides an overview of data visualization in data science, detailing its principles, types, and importance in analyzing and interpreting data. It discusses various visualization techniques, tools, and best practices, emphasizing the need for clarity, interactivity, and audience consideration. Additionally, it highlights the advantages and disadvantages of data visualization, as well as the workflow involved in creating effective visual representations of data.

Uploaded by

PARTH BHARADIA
Copyright
© All Rights Reserved

Unit-5

Contents

• Data Visualization: Basic principles, ideas and tools for data visualization.

Data Visualization

• Data visualization in data science refers to the process of generating graphical representations
of information.
• These graphical depictions, often known as plots or charts, are pivotal in the realm of data science
for effective analysis and interpretation.
• Understanding the various types of data visualization in data science is crucial for selecting the
appropriate visual method for the dataset at hand.
• Different types serve different analytical needs, from understanding distributions with
histograms to spotting trends with line charts.
• The benefits of data visualization include communicating your results or findings, monitoring the
model’s performance at the evaluation stage, tuning hyperparameters, identifying trends, patterns and
correlations between dataset features, cleaning data (for example, detecting outliers), and validating
model assumptions.

Data Visualization: Examples

• Here are some popular data visualization examples.

• Weather reports: Maps and other plot types are commonly used in weather reports.
• Internet websites: Analytics services such as Social Blade and Google Analytics use
data visualization techniques to analyze and compare the performance of websites and social media channels.
• Astronomy: Space agencies use advanced data visualization techniques in their reports and
presentations.
• Geography
• Gaming industry

What Makes Data Visualization Effective?

• Clarity: Data should be visualized in a way that everyone can understand.
• Problem domain: When presenting data, the visualizations should be related to the business
problem.
• Interactivity: Interactive plots are useful to compare and highlight certain things within the plot.
• Comparability: Good plots make it easy to compare things.
• Aesthetics: Quality plots are visually appealing.
• Informative: A good plot summarizes all relevant information.

Importance of Data Visualization

1. Data cleaning
Data visualization plays an important role in data cleaning. Good examples are detecting outliers and
detecting multicollinearity. We can create scatterplots to detect outliers and generate heatmaps to
check for multicollinearity.

2. Data Exploration
Before building any model, we need to do some exploratory data analysis to identify dataset
characteristics. For example, we can create histograms for continuous variables to check for normality
in the data. We can create scatterplots between two features to check whether they are correlated.
Likewise, we can create a bar chart for the label column with two or more classes to identify class
imbalance.
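
The class-imbalance check above comes down to counting labels before any bars are drawn. A minimal sketch using only the standard library; the `labels` data and the `class_shares` helper are made up for illustration and are not part of the original notes:

```python
from collections import Counter


def class_shares(labels):
    """Return {class: fraction of rows} for a label column."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical label column with an obvious imbalance.
labels = ["no"] * 90 + ["yes"] * 10

shares = class_shares(labels)
for cls, share in sorted(shares.items()):
    # One '#' per 2% of rows gives a rough text bar chart.
    print(f"{cls:>4} {'#' * round(share * 50)} {share:.0%}")
```

The same counts can be passed to any bar chart function; the text bars are only there to keep the sketch free of plotting dependencies.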

3. Evaluation of modeling outputs
We can create a confusion matrix to evaluate a classification model's predictions and a learning curve
to monitor its performance during training. Plots are also useful in validating model assumptions. For
example, we can create a residuals plot and a histogram of the residuals to validate the assumptions of
a linear regression model.
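
To show what a confusion matrix plot actually displays, here is a small sketch that builds the matrix by hand. The labels and predictions are invented, and `confusion_matrix` is a hypothetical helper written for this example rather than a library call:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical labels and predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
# Correct predictions sit on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)

for row in cm:
    print(row)
print("accuracy:", accuracy)
```

Drawing this matrix as a colored grid makes the off-diagonal cells (the misclassifications) stand out at a glance.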

4. Identifying trends
Time and seasonal plots are useful in time series analysis to identify certain trends over time.
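
A moving average is one common way to make the trend in a time plot visible by smoothing out the seasonal pattern. The sketch below assumes a made-up monthly series whose pattern repeats every four periods; `moving_average` is a hypothetical helper:

```python
def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical sales figures with a repeating 4-period seasonal pattern.
sales = [10, 14, 8, 12, 12, 16, 10, 14, 14, 18, 12, 16]

# A window equal to the season length averages the seasonality away.
trend = moving_average(sales, window=4)
print(trend)
```

Plotting `sales` and `trend` on the same axes would show the raw zigzag alongside a steadily rising line.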

5. Presenting results
As a data scientist, you need to present your findings to the company or to other stakeholders who may
not have much knowledge of the subject domain. So, you need to explain everything in plain English.
You can use informative plots that summarize your findings.

Types of Data Visualization in Data Science

1. Distribution plot
A distribution plot visualizes how the values of a variable are distributed, for example as a
probability distribution plot or a density curve.

2. Box and whisker plot
This plot shows the variation in the values of a numerical feature. You can read off the
minimum, maximum, median, and lower and upper quartiles of the values.
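
The five values a box and whisker plot is built from can be computed directly. This is a minimal sketch with invented data; it uses the "inclusive" quartile method from Python's `statistics` module, and since quartile conventions differ between tools, a plotted box may differ slightly from these numbers:

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical feature values.
ages = [22, 25, 27, 30, 31, 35, 38, 41, 60]
print(five_number_summary(ages))
```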

3. Violin plot
Similar to the box and whisker plot, the violin plot is used to plot the variation of a numerical
feature. But it contains a kernel density curve in addition to the box plot. The kernel density curve
estimates the underlying distribution of data.
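
As an illustration of the kernel density curve, the sketch below estimates a density with Gaussian kernels using only the standard library. The samples and the bandwidth are assumptions chosen for the example (plotting libraries usually pick a bandwidth automatically), and `make_kde` is a hypothetical helper:

```python
import math


def make_kde(samples, bandwidth):
    """Return a function estimating the density at x with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes one bell curve centered on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical samples; the bandwidth is hand-picked for illustration.
samples = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0]
density = make_kde(samples, bandwidth=0.5)

# Evaluate on a grid; a violin plot draws this curve mirrored around an axis.
grid = [i * 0.1 for i in range(-30, 91)]
curve = [density(x) for x in grid]
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother, flatter curve.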

4. Line plot
A line plot is created by connecting a series of data points with straight lines. When it shows a
time series, the time periods go on the x-axis.

5. Bar plot
A bar plot is used to plot how frequently each category occurs in categorical data. Each category is
represented by a bar. The bars can be drawn vertically or horizontally. Their heights or lengths are
proportional to the values they represent.

6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or non-linear and
positive or negative) between two numerical variables. They are commonly used in regression
analysis.

7. Histogram
A histogram represents the distribution of numerical data. Looking at a histogram, we can
decide whether the values are normally distributed (a bell-shaped curve), skewed to the right or
skewed to the left. A histogram of residuals is useful to validate important assumptions in regression
analysis.
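
The sketch below shows the binning step behind a histogram and one rough way to confirm a right skew (the mean sitting above the median). The data are invented and `histogram_counts` is a hypothetical helper using equal-width bins, which is only one of several binning choices:

```python
import statistics


def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; cannot form bins")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls in the last bin rather than a new one.
        i = min(int((v - low) / width), bins - 1)
        counts[i] += 1
    return counts


# Hypothetical right-skewed data: most values are small, one is large.
values = [1, 1, 2, 2, 2, 3, 3, 4, 5, 9]
counts = histogram_counts(values, bins=4)

for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
print("mean:", statistics.mean(values), "median:", statistics.median(values))
```

The long tail of nearly empty bins on the right, together with a mean larger than the median, is the typical signature of a right-skewed distribution.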

8. Pie chart
A pie chart shows the categories of a categorical variable as slices whose sizes are
proportional to the quantity they represent. It is a circular graph with one slice per
category.

9. Area plot
The area plot is based on the line chart. We get the area plot when we shade the area between
the line and the x-axis.

10. Hexbin plot
Similar to the scatter plot, a hexbin plot represents the relationship between two numerical
variables. It is useful when there are many data points: in a scatter plot they would overlap, so the
hexbin plot groups them into hexagonal bins and colors each bin by the number of points it contains.

11. Heatmap
A heatmap visualizes the correlation coefficients of numerical features with a color
map. Which colors indicate a strong correlation depends on the color map chosen, so the plot should
include a color bar. The heatmap is extremely useful for identifying multicollinearity, which occurs
when an input feature is highly correlated with one or more of the other features in the dataset.
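
The matrix behind a correlation heatmap can be sketched as follows. The feature names and values are invented, the 0.9 cut-off is an arbitrary judgment call, and `pearson` and `correlation_matrix` are hypothetical helpers written for this example:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(features):
    """features: {name: list of values}. Returns (names, matrix)."""
    names = list(features)
    matrix = [[pearson(features[a], features[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical features: area_sqft is area_m2 in other units (rounded), so
# the pair is almost perfectly collinear; age is unrelated.
features = {
    "area_m2": [50, 65, 80, 95, 120],
    "area_sqft": [538, 700, 861, 1023, 1292],
    "age": [30, 5, 18, 42, 11],
}
names, corr = correlation_matrix(features)

# Flag feature pairs whose absolute correlation exceeds the threshold.
THRESHOLD = 0.9  # assumption: the cut-off is a judgment call
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i][j]) > THRESHOLD
]
print(flagged)
```

A heatmap simply colors each cell of `corr`; the flagged pairs are the cells that would stand out away from the diagonal.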

Write Python code to display the types of visualizations.
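
One possible answer to this exercise, as a sketch rather than a model solution: the code below draws ten of the plot types described above in a single figure. It assumes `numpy` and `matplotlib` are installed; all data are synthetic, and the output file name `plot_types.png` is an arbitrary choice.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=500)
y = 0.8 * x + rng.normal(scale=8, size=500)
groups = [rng.normal(loc=m, scale=5, size=200) for m in (40, 50, 60)]
periods = np.arange(1, 13)
series = 100 + 5 * periods + rng.normal(scale=6, size=12)
categories = ["A", "B", "C", "D"]
counts = [23, 45, 12, 20]

fig, axes = plt.subplots(2, 5, figsize=(22, 8))
ax = axes.ravel()

ax[0].hist(x, bins=30, density=True)
ax[0].set_title("Histogram (density)")

ax[1].boxplot(groups)
ax[1].set_title("Box and whisker plot")

ax[2].violinplot(groups, showmedians=True)
ax[2].set_title("Violin plot")

ax[3].plot(periods, series, marker="o")
ax[3].set_title("Line plot")

ax[4].bar(categories, counts)
ax[4].set_title("Bar plot")

ax[5].scatter(x, y, s=8)
ax[5].set_title("Scatter plot")

ax[6].pie(counts, labels=categories, autopct="%1.0f%%")
ax[6].set_title("Pie chart")

# Shade the area between the line and the x-axis.
ax[7].fill_between(periods, series)
ax[7].set_title("Area plot")

hb = ax[8].hexbin(x, y, gridsize=25, cmap="viridis")
fig.colorbar(hb, ax=ax[8], label="points per bin")
ax[8].set_title("Hexbin plot")

# Correlation matrix of x, y and an unrelated noise column.
corr = np.corrcoef(np.vstack([x, y, rng.normal(size=500)]))
im = ax[9].imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(im, ax=ax[9], label="correlation")
ax[9].set_title("Heatmap")

fig.tight_layout()
fig.savefig("plot_types.png", dpi=100)
```

In an interactive session, `plt.show()` can replace the `savefig` call. Higher-level libraries such as seaborn offer shorter ways to draw several of these plots, but plain matplotlib keeps the dependencies minimal.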

Data Visualization Process/Workflow

The data visualization process or workflow includes the following key steps.

1. Develop your research question
This may be a business problem or any other related problem that could be solved with a
data-driven approach. You should note all the objectives and outcomes plus required resources such
as datasets, open-source software libraries, etc.

2. Get or create your data
The next step is collecting data. You can use existing datasets if they’re relevant to your
research question. Alternatively, you can download open datasets from the internet or do web
scraping to collect data.

3. Clean your data
Real-world data are messy. So, you need to clean them before using them for visualization.
You can identify missing values and outliers and treat them accordingly. You can perform feature
selection and remove unnecessary features from the data. You can create a new set of features based
on the original features.
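
Two of the cleaning steps above, treating missing values and identifying outliers, can be sketched with the standard library. Median imputation and the 1.5 × IQR rule are common choices but not the only ones; the `income` column and both helpers are invented for this example:

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Hypothetical column with two missing entries and one suspicious value.
income = [42, 45, None, 47, 50, 52, None, 55, 400]

filled = impute_median(income)
outliers = iqr_outliers(filled)
print(filled)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment that depends on the data, so flagging and removing are best kept as separate steps.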

4. Choose a chart type
The chart type depends on many factors. For example, it depends on the feature type
(numerical or categorical). It also depends on the type of visualization you need. Let’s say you have
two numerical features. If you want to find their distributions, you can create a histogram for each
feature. If you want to plot their variations, you can create a box and whisker plot for each feature.
You can create a scatterplot if you want to find a relationship (linear or non-linear, positive or
negative) between the two features.
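
The examples in this step can be written down as a small lookup table. This is only a sketch: `suggest_chart` is a hypothetical helper, and it encodes just the cases mentioned in these notes rather than a complete decision procedure.

```python
def suggest_chart(goal, feature_types):
    """Suggest a chart for a goal and a sequence of feature types.

    Hypothetical helper that encodes only the examples from the notes.
    """
    rules = {
        ("distribution", ("numerical",)): "histogram",
        ("variation", ("numerical",)): "box and whisker plot",
        ("relationship", ("numerical", "numerical")): "scatter plot",
        ("frequency", ("categorical",)): "bar plot",
    }
    return rules.get((goal, tuple(feature_types)), "no rule: decide case by case")


print(suggest_chart("distribution", ["numerical"]))
print(suggest_chart("relationship", ["numerical", "numerical"]))
print(suggest_chart("relationship", ["categorical", "numerical"]))
```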

5. Choose your tool
You can use open-source data visualization tools such as Matplotlib, Seaborn, Plotly and
ggplot2. You can also use commercial software such as MATLAB, Minitab, SPSS, etc.

6. Prepare data
You can extract relevant features. You can do feature standardization if the values of the
features are not on the same scale. You can apply data preprocessing steps such as PCA to reduce the
dimensionality of the data. That will allow you to visualize high-dimensional data in 2D and 3D plots!
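
As a sketch of this preparation step, the code below standardizes the features and then projects them onto two principal components using a singular value decomposition. It assumes `numpy` is installed; the data are synthetic, and `standardize` and `pca_project` are hypothetical helpers rather than library functions:

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


# Synthetic 5-feature data on very different scales, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1000, 0.1])

Z = standardize(X)
coords, explained = pca_project(Z, n_components=2)
print(coords.shape)  # (200, 2): ready for a 2D scatter plot
```

Without the standardization step, the feature with the largest scale would dominate the first component, which is why the two steps are usually applied together.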

7. Create a chart
This is the final step. Here, you define the title and the names of the axes. You should also
choose a proper chart background to ensure the content is easily readable.

Data Visualization Tools
There are multiple tools and software available for data visualization.
1. Python provides open-source libraries such as
• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair
2. R provides open-source libraries such as
• ggplot2
• lattice
3. Other data visualization software
• Tableau
• Microsoft Power BI
• IBM SPSS
• Minitab

Write detailed notes on the following tools for data visualization:
• Tableau
• Microsoft Power BI
• IBM SPSS
• Minitab

Data Visualization Techniques
Some of the main data visualization techniques in data science are:

1. Univariate Analysis

2. Bivariate Analysis

3. Multivariate Analysis

Advantages of Data Visualization
There are many advantages of data visualization. Data visualization is used to:

• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions

Disadvantages of Data Visualization
There are also some disadvantages of data visualization.

• We need to download, install and configure software and open-source libraries. The process will be difficult and
time-consuming for beginners.
• Some data visualization tools are not available for free. We need to pay for those.
• When we summarize the data in a plot, we lose the exact underlying values.

Data Visualization Best Practices
1. Set the context
We need to develop a research question that could be solved with a data-driven approach.

2. Know your audience
This is very important as the visualizations depend on the type of audience you have. To present your
findings to a business audience, you need to create visualizations closely related to money,
profits, and revenue, the terms that business people are familiar with.

3. Choose an effective visual
You need to create the right plot for your requirement. To see the correlations between
multiple variables, you can create a scatterplot for each pair of variables. But that is not very
effective. Instead, you can create a heatmap, which is an effective way of visualizing correlations.
When you have many categories, the pie chart is not suitable. Instead, you can create a bar chart.
These are some examples of choosing an effective visual for your requirements.

4. Keep it simple
Simple plots are easily readable. We can remove unnecessary backgrounds to make things stand out.
We should not include too much content in the plot. A title, axis names, a scale, and a legend are
enough.

Assignment

1. How do challenges and best practices associated with visualizing datasets with multiple dimensions or
high-dimensional characteristics affect the capability of data visualization to discern patterns, trends, or
outliers?

2. What strategies can be employed to improve the effectiveness of data visualization in complex datasets
with numerous dimensions, considering the challenges and best practices associated with such
visualizations?

Happy Learning
