Unit-5 new
Contents
• Data Visualization: Basic principles, ideas and tools for data visualization.
Data Visualization
• Data visualization in data science refers to the process of generating graphical representations
of information.
• These graphical depictions, often known as plots or charts, are pivotal in the realm of data science
for effective analysis and interpretation.
• Understanding the various types of data visualization in data science is crucial for selecting the
appropriate visual method for the dataset at hand.
• Different types serve different analytical needs, from understanding distributions with
histograms to spotting trends with line charts.
• The benefits of data visualization include communicating results or findings, monitoring a
model's performance at the evaluation stage, supporting hyperparameter tuning, identifying trends,
patterns and correlations between dataset features, cleaning data (for example, detecting outliers),
and validating model assumptions.
Data Visualization: Examples
• Weather reports: Maps and other plot types are commonly used in weather reports.
• Internet websites: Web and social media analytics services such as Social Blade and Google Analytics
use data visualization techniques to analyze and compare the performance of websites.
• Astronomy: Space agencies use advanced data visualization techniques in their reports and
presentations.
• Geography
• Gaming industry
What Makes Data Visualization Effective?
Importance of Data Visualization
1. Data cleaning
Data visualization plays an important role in data cleaning. Good examples are detecting outliers and
detecting multicollinearity. We can create scatterplots to detect outliers and generate heatmaps to
check for multicollinearity.
2. Data Exploration
Before building any model, we need to do some exploratory data analysis to identify dataset
characteristics. For example, we can create histograms for continuous variables to check for normality
in the data. We can create scatterplots between two features to check whether they are correlated.
Likewise, we can create a bar chart for the label column with two or more classes to identify class
imbalance.
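As a rough illustration of the class imbalance check described above, the sketch below counts the values of a label column and prints a text bar chart. The `labels` data and the `class_shares` helper are hypothetical, made up for this example; in practice a plotting library would draw the bars.

```python
from collections import Counter

# Hypothetical label column for a two-class problem (invented data).
labels = ["spam"] * 5 + ["ham"] * 45


def class_shares(values):
    """Return each class's share of the total, largest class first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


shares = class_shares(labels)

# Rough text bar chart: one '#' per 2 percentage points.
for label, share in shares.items():
    bar = "#" * round(share * 50)
    print(f"{label:<5} {bar} {share:.0%}")
```

A large difference between the bar lengths suggests that the classes are imbalanced.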
3. Evaluation of modeling outputs
We can create a confusion matrix to measure the performance of a trained model and a learning curve
to monitor its performance during training. Plots are also useful in validating model assumptions. For
example, we can create a residuals plot and a histogram of the residuals to validate the assumptions
of a linear regression model.
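The numbers behind a confusion matrix plot can be computed in a few lines. This is a minimal sketch, assuming hypothetical true labels `y_true` and predictions `y_pred`; the `confusion_matrix` function here is written for this example and is not a library function.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [
        [pair_counts[(actual, predicted)] for predicted in labels]
        for actual in labels
    ]


# Hypothetical true labels and model predictions (invented data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Correct predictions lie on the diagonal of the matrix.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(y_true)

for row in matrix:
    print(row)
print("accuracy:", accuracy)
```

Drawing this matrix as a colored grid gives the familiar confusion matrix plot.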
4. Identifying trends
Time plots and seasonal plots are useful in time series analysis for identifying trends and seasonal
patterns over time.
5. Presenting results
As a data scientist, you need to present your findings to the company or to other stakeholders who may
not have much knowledge of the subject domain. So, you need to explain everything in plain English.
You can use informative plots that summarize your findings.
Types of Data Visualization in Data Science
1. Distribution plot
A distribution plot is used to visualize how data are distributed, for example as a probability
distribution plot or a density curve.
2. Box and whisker plot
This plot shows the variation in the values of a numerical feature. From it you can read the
minimum, maximum, median, and lower and upper quartiles of the values.
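The five values that a box and whisker plot displays can be computed directly. The sketch below uses Python's standard `statistics.quantiles` with the "inclusive" method; the `feature` data and the `five_number_summary` helper are hypothetical, and plotting libraries may use a slightly different quartile convention.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, lower quartile, median, upper quartile and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical values of one numerical feature (invented data).
feature = [2, 4, 4, 5, 7, 8, 9, 11, 12]
summary = five_number_summary(feature)
print(summary)
```

The box spans `q1` to `q3`, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum.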
3. Violin plot
Like the box and whisker plot, the violin plot shows the variation of a numerical
feature, but it adds a kernel density curve to the box plot. The kernel density curve
estimates the underlying distribution of the data.
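To make the idea of a kernel density curve concrete, here is a minimal sketch of a Gaussian kernel density estimate. The `samples` data, the `gaussian_kde` helper and the bandwidth of 0.5 are assumptions chosen for illustration; plotting libraries pick the bandwidth automatically and may use other kernels.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical values of one numerical feature (invented data).
samples = [1.0, 1.2, 1.9, 2.1, 2.4, 3.8]
density = gaussian_kde(samples, bandwidth=0.5)

# Evaluate the curve on a grid from -2.0 to 8.0 in steps of 0.1.
step = 0.1
grid = [i * step for i in range(-20, 81)]
curve = [density(x) for x in grid]

# The area under a density curve should be close to 1.
area = sum(curve) * step
print(round(area, 3))
```

A violin plot draws this curve mirrored on both sides of the box plot, so wider parts of the violin mark values that occur more often.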
4. Line plot
A line plot is created by connecting a series of data points with straight lines. For time
series data, the time periods are on the x-axis.
5. Bar plot
A bar plot is used to plot the frequency of each category in categorical data. Each category is
represented by a bar. The bars can be drawn vertically or horizontally. Their heights or lengths are
proportional to the values they represent.
6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or non-linear and
positive or negative) between two numerical variables. They are commonly used in regression
analysis.
7. Histogram
A histogram represents the distribution of numerical data. Looking at a histogram, we can
judge whether the values are normally distributed (a bell-shaped curve), skewed right or
skewed left. A histogram of residuals is useful for validating important assumptions in regression
analysis.
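A histogram is built from bin counts, and the direction of skew can be checked numerically. The following sketch assumes equal-width bins and uses the moment-based skewness coefficient; the `values` data and both helper functions are hypothetical and written only for this example.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data (invented).
values = [1, 1, 2, 2, 2, 3, 3, 4, 5, 9]
counts = histogram_counts(values, bins=4)
skew = skewness(values)

for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
print("skewness:", round(skew, 2))
```

Most of the counts fall in the first bins and the skewness is positive, which matches a distribution that is skewed right.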
8. Pie chart
A pie chart of a categorical variable shows each category as a slice whose size is
proportional to the quantity it represents. It is a circular graph with as many slices as there are
categories.
9. Area plot
The area plot is based on the line chart. We get the area plot when we fill the area between
the line and the x-axis.
10. Hexbin plot
Similar to the scatter plot, a hexbin plot represents the relationship between two numerical
variables. It is useful when the two variables have a lot of data points, because such points
overlap when represented in a scatter plot. A hexbin plot avoids this by grouping the points into
hexagonal bins and coloring each bin by the number of points it contains.
11. Heatmap
A heatmap visualizes the correlation coefficients of numerical features with a color
map. In the example here, light colors show a high correlation, while dark colors show a low
correlation, though the mapping depends on the color map chosen. The heatmap is extremely useful for
identifying multicollinearity, which occurs when input features are highly correlated with one or
more of the other features in the dataset.
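The values shown in a correlation heatmap come from a correlation matrix. Below is a minimal sketch that computes Pearson correlation coefficients and flags highly correlated feature pairs. The `features` data, the `pearson` helper and the 0.9 threshold are assumptions made for illustration, not a fixed rule.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numerical features (invented data).
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 43, 41, 44],
}

# Correlation matrix as a nested dictionary: corr[a][b].
corr = {
    a: {b: pearson(features[a], features[b]) for b in features}
    for a in features
}

# Flag pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.9
high_pairs = [
    (a, b) for a, b in combinations(features, 2) if abs(corr[a][b]) > threshold
]
print(high_pairs)
```

Here the two height columns carry almost the same information, so one of them could be dropped before modeling.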
Write Python code to display these types of visualizations.
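One possible starting point for this exercise is sketched below, assuming that Matplotlib and NumPy are installed. All data are synthetic and invented for the example, and the file name `plot_types.png` is arbitrary. The density curve (distribution plot) is left out because it needs a kernel density estimate, which libraries such as seaborn provide.

```python
import random

import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line to use an interactive window
import matplotlib.pyplot as plt
import numpy as np

random.seed(0)  # fixed seed so the synthetic data are reproducible

# Synthetic data, invented only for this illustration.
x = [random.gauss(0, 1) for _ in range(300)]
y = [2 * xi + random.gauss(0, 1) for xi in x]
z = [random.gauss(0, 1) for _ in range(300)]
categories = ["A", "B", "C"]
counts = [12, 30, 18]
months = list(range(1, 13))
sales = [10 + 3 * m + random.gauss(0, 2) for m in months]

fig, axes = plt.subplots(2, 5, figsize=(20, 8))
ax = axes.ravel()

ax[0].hist(x, bins=20)
ax[0].set_title("Histogram")

ax[1].boxplot(x)
ax[1].set_title("Box and whisker plot")

ax[2].violinplot(x, showmedians=True)
ax[2].set_title("Violin plot")

ax[3].plot(months, sales, marker="o")
ax[3].set_title("Line plot")

ax[4].bar(categories, counts)
ax[4].set_title("Bar plot")

ax[5].scatter(x, y, s=8)
ax[5].set_title("Scatter plot")

ax[6].pie(counts, labels=categories)
ax[6].set_title("Pie chart")

ax[7].plot(months, sales)
ax[7].fill_between(months, sales, alpha=0.4)  # fill between the line and y = 0
ax[7].set_title("Area plot")

ax[8].hexbin(x, y, gridsize=15, cmap="viridis")
ax[8].set_title("Hexbin plot")

# Heatmap of the correlation matrix of x, y and z.
ax[9].imshow(np.corrcoef([x, y, z]), cmap="coolwarm", vmin=-1, vmax=1)
ax[9].set_xticks([0, 1, 2])
ax[9].set_xticklabels(["x", "y", "z"])
ax[9].set_yticks([0, 1, 2])
ax[9].set_yticklabels(["x", "y", "z"])
ax[9].set_title("Heatmap")

fig.tight_layout()
fig.savefig("plot_types.png")
```

Each panel uses only a title to stay compact; a full answer would also label the axes of every plot.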
Data Visualization Process/Workflow
The data visualization process or workflow includes the following key steps.
3. Clean your data
Real-world data are messy. So, you need to clean them before using them for visualization.
You can identify missing values and outliers and treat them accordingly. You can perform feature
selection and remove unnecessary features from the data. You can create a new set of features based
on the original features.
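A minimal sketch of two of these cleaning steps follows: counting missing values and flagging outliers. It assumes missing values are stored as `None` and uses the common 1.5 × IQR rule for outliers, which is only one of several possible treatments. The `raw` data are invented.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value (invented data).
raw = [12.0, 15.5, None, 14.2, 13.8, 98.0, 15.1, None, 14.9, 13.3]

# Step 1: separate the missing values from the observed ones.
present = [v for v in raw if v is not None]
missing_count = len(raw) - len(present)

# Step 2: flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < lower or v > upper]

# Step 3: keep only the values inside the fences.
cleaned = [v for v in present if lower <= v <= upper]

print("missing:", missing_count)
print("outliers:", outliers)
print("cleaned:", cleaned)
```

Whether to drop, cap or keep a flagged value depends on the dataset, so the result should be reviewed rather than applied blindly.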
5. Choose your tool
You can use open-source data visualization tools such as Matplotlib, seaborn, Plotly and
ggplot2. You can also use commercial software such as MATLAB, Minitab and SPSS.
6. Prepare data
You can extract relevant features. You can do feature standardization if the values of the
features are not on the same scale. You can apply data preprocessing steps such as PCA to reduce the
dimensionality of the data. That will allow you to visualize high-dimensional data in 2D and 3D plots!
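Feature standardization can be sketched as follows: each value is shifted by the feature's mean and divided by its standard deviation. The `income` and `age` data and the `standardize` helper are hypothetical, and the population standard deviation is an assumption; some libraries use the sample standard deviation instead.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Note: a constant feature has std == 0 and cannot be standardized.
    return [(v - mean) / std for v in values]


# Two hypothetical features on very different scales (invented data).
income = [32000, 45000, 51000, 60000, 72000]
age = [23, 35, 41, 52, 64]

income_z = standardize(income)
age_z = standardize(age)

print([round(v, 2) for v in income_z])
print([round(v, 2) for v in age_z])
```

After standardization both features are on the same scale, so neither dominates a plot or a method such as PCA just because of its units.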
7. Create a chart
This is the final step. Here, you define the title and the names of the axes. You should also
choose a proper chart background to ensure the content is easily readable.
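As an illustration of this final step, the sketch below creates a line chart with a title, axis names and a plain background. It assumes Matplotlib is installed; the rainfall data and the output file name `rainfall.png` are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line to use an interactive window
import matplotlib.pyplot as plt

# Hypothetical data (invented for this example).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
rainfall_mm = [78, 62, 55, 41, 35, 20]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, rainfall_mm, marker="o")

# Title and axis names tell the reader what the chart shows.
ax.set_title("Monthly rainfall (hypothetical data)")
ax.set_xlabel("Month")
ax.set_ylabel("Rainfall (mm)")

# A plain background with a faint grid keeps the content readable.
ax.set_facecolor("white")
ax.grid(True, alpha=0.3)

fig.tight_layout()
fig.savefig("rainfall.png")
```

The same calls apply to other chart types, since the title and axis labels are set on the axes object rather than on the plot itself.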
Data Visualization Tools
There are multiple tools and software available for data visualization.
1. Python provides open-source libraries such as
• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair
2. R provides open-source libraries such as
• ggplot2
• lattice
3. Other data visualization software
• Tableau
• Microsoft Power BI
• IBM SPSS
• Minitab
Write detailed notes on the following tools for data visualization:
• Tableau
• Microsoft Power BI
• IBM SPSS
• Minitab
Data Visualization Techniques
Some of the main data visualization techniques in data science are:
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
Advantages of Data Visualization
There are many advantages of data visualization. Data visualization is used to:
Disadvantages of Data Visualization
There are also some disadvantages of data visualization.
• We need to download, install and configure software and open-source libraries. The process can be difficult and
time-consuming for beginners.
• Some data visualization tools are not available for free. We need to pay for those.
• When we summarize the data in a plot, we lose some of the exact information.
Data Visualization Best Practices
1. Set the context
We need to develop a research question that can be answered with a data-driven approach.
3. Choose an effective visual
You need to create the right plot for your requirement. To see the correlations between
multiple variables, you can create scatterplots for each pair of variables, but that is not very
effective. Instead, you can create a heatmap, which is an effective way of visualizing correlations.
When you have many categories, the pie chart is not suitable. Instead, you can create a bar chart.
These are some examples of choosing an effective visual for your requirements.
4. Keep it simple
Simple plots are easily readable. We can remove unnecessary backgrounds to make things stand out.
We should not include too much content in the plot. A title, axis names, scales and legends are
usually enough.
Assignment
1. How do challenges and best practices associated with visualizing datasets with multiple dimensions or
high-dimensional characteristics affect the capability of data visualization to discern patterns, trends, or
outliers?
2. What strategies can be employed to improve the effectiveness of data visualization in complex datasets
with numerous dimensions, considering the challenges and best practices associated with such
visualizations?
Happy Learning