13_Data Visualization
Matplotlib
Matplotlib is the library responsible for plotting numerical data, which is why
it is used in data analysis. It is an open-source library that produces
high-quality figures such as pie charts, scatter plots, box plots, and
graphs, among other things.
NumPy
Images, sound waves, and other raw binary streams can be represented as
multidimensional arrays of real values using the NumPy interface, which makes
them available for visualization. Full-stack developers must be familiar with
NumPy to use this library in machine learning work.
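As a minimal sketch of the idea above, the snippet below stores a small grayscale image and a short sound wave as NumPy arrays. The image size, sample rate, and tone frequency are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical 4x6 grayscale "image": one brightness value in [0, 1] per pixel
image = np.linspace(0.0, 1.0, 24).reshape(4, 6)

# Hypothetical one-second 440 Hz tone sampled 8000 times per second
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440 * t)

# The image is a 2-D array (rows x columns); the sound wave is a 1-D array
print(image.shape, image.ndim)
print(wave.shape, wave.ndim)
```

Once the data is in array form, it can be passed directly to the plotting functions used in the rest of this section.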
Pandas
SciPy
Scikit-learn
Seaborn
TensorFlow
Keras
Keras is a Python-based open-source neural network library that makes it
possible for us to experiment with deep neural networks. As deep learning
becomes more common, Keras emerges as a viable option because,
according to its creators, it is an API (Application Programming Interface)
designed for humans, not machines. Compared to TensorFlow or Theano,
Keras has a greater adoption rate in the research community and industry.
Before installing Keras, the user should first install the TensorFlow
backend engine.
Statsmodels
In the world of Big Data, data visualization tools and technologies are essential to
analyse massive amounts of information and make data-driven decisions.
While there are many advantages, some of the disadvantages may seem less obvious.
For example, when viewing a visualization with many different data points, it’s easy to
draw an inaccurate conclusion. Sometimes the visualization itself is poorly designed,
so that it is biased or confusing.
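import numpy as np
from matplotlib import pyplot as plt
# figure size in inches and automatic layout adjustment
plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True
# horizontal line: y is constant at y_value for every x
y_value = 1
x = np.arange(10)
y = np.zeros_like(x) + y_value
# dotted red line with a line width of 5 points
plt.plot(x, y, ls='dotted', c='red', lw=5)
plt.show()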
We can easily plot 3-D figures in matplotlib. Now, we discuss some important
and commonly used 3-D plots.
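from mpl_toolkits.mplot3d import axes3d  # registers the 3-D projection on older matplotlib
import matplotlib.pyplot as plt
import numpy as np
# defining x, y, z co-ordinates
x = np.random.randint(0, 10, size=20)
y = np.random.randint(0, 10, size=20)
z = np.random.randint(0, 10, size=20)
# the original listing stops here; a 3-D scatter plot is assumed as the plot type
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)
plt.show()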
Box plots
Histograms
Heat maps
Charts
Tree maps
Kernel density estimate
Box Plots
A box plot (or boxplot) is a standardized way of displaying the
distribution of data based on a five-number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”). It can tell you about your outliers and
what their values are. It can also tell you if your data is symmetrical, how tightly your
data is grouped, and if and how your data is skewed.
A box plot is a graph that gives you a good indication of how the values in the data are
spread out. Although box plots may seem primitive in comparison to
a histogram or density plot, they have the advantage of taking up less space, which is
useful when comparing distributions between many groups or datasets. For some
distributions/datasets, you will find that you need more information than the measures
of central tendency (median, mean, and mode). You need to have information on the
variability or dispersion of the data.
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)
data = np.random.normal(100, 20, 200)
# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.boxplot(data)
# show plot
plt.show()
Minimum: Q1 - 1.5*IQR
First quartile (Q1/25th percentile): the middle number between the smallest number (not
the “minimum”) and the median of the dataset.
Third quartile (Q3/75th percentile): the middle value between the median and the highest
value (not the “maximum”) of the dataset.
Maximum: Q3 + 1.5*IQR
Here IQR is the interquartile range, Q3 - Q1.
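The quantities in the table above can be computed directly with NumPy. The following is a minimal sketch that reuses the normally distributed sample from the box plot listing; the variable names are illustrative. Note that Matplotlib draws each whisker to the most extreme data point that still lies inside these limits, so the whisker ends may sit slightly inside the computed values.

```python
import numpy as np

# Same sample as in the box plot listing
np.random.seed(10)
data = np.random.normal(100, 20, 200)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# "Minimum" and "maximum" in the box plot sense (whisker limits)
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are the ones drawn as outliers
outliers = data[(data < lower_limit) | (data > upper_limit)]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("IQR:", iqr)
print("limits:", lower_limit, upper_limit)
print("number of outliers:", outliers.size)
```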
Histograms
A histogram is a graphical display of data using bars of different heights. In a
histogram, each bar groups numbers into ranges. Taller bars show that more data falls
in that range. A histogram displays the shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency distribution
(shape) of a set of continuous data. This allows the inspection of the data for its
underlying distribution (e.g., normal distribution), outliers, skewness, etc. It is an
accurate representation of the distribution of numerical data and relates to only one
variable. To build a histogram, the entire range of values is divided into a series of
intervals, called bins or buckets, and the number of values falling into each interval
is counted. Bins are consecutive, non-overlapping intervals of a variable. Because
adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate
that the original variable is continuous.
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(256)
x = 10*np.random.rand(200,1)
y = (0.2 + 0.8*x) * np.sin(2*np.pi*x) + np.random.randn(200,1)
plt.hist(y, bins=20, color='purple')
plt.show()
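To make the binning step explicit, here is a minimal sketch that computes the bin edges and counts with np.histogram instead of drawing them. It generates data in the same way as the histogram listing, but as 1-D arrays; that change of shape is a choice made for this example.

```python
import numpy as np

# Noisy sine data, generated as in the histogram listing but as 1-D arrays
np.random.seed(256)
x = 10 * np.random.rand(200)
y = (0.2 + 0.8 * x) * np.sin(2 * np.pi * x) + np.random.randn(200)

# 20 bins give 21 edges; counts[i] is the number of values in bin i
counts, edges = np.histogram(y, bins=20)

# Consecutive edges define adjacent, non-overlapping intervals of equal width
widths = np.diff(edges)

print("number of bins:", counts.size)
print("bin width:", widths[0])
print("total count:", counts.sum())
```

Every value falls into exactly one bin, so the counts add up to the number of data points.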
Heat Maps
A heat map is a data visualization tool that uses colour the way a bar graph uses
height and width.
If you’re looking at a web page and you want to know which areas get the most
attention, a heat map shows you in a visual way that’s easy to take in and act
on. More generally, it is a graphical representation of data where the individual values
contained in a matrix are represented as colours. Heat maps are useful for two purposes:
visualizing correlation tables and visualizing missing values in the data. In both
cases, the information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of values, but they
are not a replacement for more precise graphical displays, such as bar charts,
because colour differences cannot be perceived accurately.
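As a minimal sketch of the correlation-table use case, the snippet below draws a correlation matrix as a heat map with Matplotlib's imshow. The random data, the column names, and the colour map are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 100 observations of 4 variables
rng = np.random.RandomState(0)
data = rng.normal(size=(100, 4))
data[:, 1] += 0.8 * data[:, 0]  # make the first two variables correlated

# Correlation matrix: one row and one column per variable
corr = np.corrcoef(data, rowvar=False)

# Each cell of the matrix is drawn as a coloured square
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="correlation")

names = ["a", "b", "c", "d"]  # illustrative variable names
ax.set_xticks(list(range(4)))
ax.set_xticklabels(names)
ax.set_yticks(list(range(4)))
ax.set_yticklabels(names)
ax.set_title("Correlation heat map")
```

Calling plt.show() afterwards displays the figure. Fixing the colour range to [-1, 1] keeps the colours comparable between different correlation tables.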
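import matplotlib.pyplot as plt
# illustrative data (assumed; the original listing omits the dataset)
courses = ["C", "C++", "Java", "Python"]
students = [20, 15, 30, 35]
# bar chart with one bar per course (plot type assumed from the axis labels)
plt.bar(courses, students, color="maroon", width=0.4)
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()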
Area Chart: It combines the line chart and the bar chart to show how the numeric
values of one or more groups change over the progression of a second variable.
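import plotly.express as px
df = px.data.iris()
# one filled area per species; the choice of columns is assumed for illustration
fig = px.area(df, x="sepal_width", y="sepal_length", color="species")
fig.show()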
Line Graph: The data points are connected by straight line segments, creating
a representation of the changing trend.
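import matplotlib.pyplot as plt
import numpy as np
# sine of a quadratically growing phase, so the oscillation speeds up with x
x = np.linspace(0, 1, 201)
y = np.sin((2*np.pi*x)**2)
plt.plot(x, y, 'purple')
plt.show()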
Pie Chart: It is a chart where the components of a data set are presented as
slices of a pie, each slice representing its proportion of the entire data set.
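import matplotlib.pyplot as plt
y = [35, 25, 25, 15]  # illustrative values (assumed); slice sizes must be non-negative
plt.pie(y)
plt.show()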
Scatter Charts
import matplotlib.pyplot as plt
import numpy as np
# noisy sine wave whose amplitude grows with x
np.random.seed(256)
x = 10*np.random.rand(200,1)
y = (0.2 + 0.8*x) * np.sin(2*np.pi*x) + np.random.randn(200,1)
plt.scatter(x, y, color='purple')
plt.show()
Tree Map
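import matplotlib.pyplot as plt
import squarify  # third-party package (pip install squarify)
# illustrative data (assumed; the original listing omits these definitions)
sizes = [50, 25, 12, 6]
labels = ["A", "B", "C", "D"]
colors = ["red", "green", "blue", "grey"]
squarify.plot(sizes=sizes,
              label=labels,
              color=colors,
              alpha=.7,
              bar_kwargs=dict(linewidth=1, edgecolor="#222222"))
plt.show()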
Kernel density estimate (KDE) plot
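import numpy as np

def generate_data(seed=17):
    # Fix the seed to reproduce the results
    rand = np.random.RandomState(seed)
    x = []
    # First half of the sample: log-normal values
    dat = rand.lognormal(0, 0.3, 1000)
    x = np.concatenate((x, dat))
    # Second half of the sample: normal values centred at 3
    dat = rand.normal(3, 1, 1000)
    x = np.concatenate((x, dat))
    return x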
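The listing above only generates the sample. As a minimal sketch of what a kernel density estimate computes, the snippet below places a Gaussian kernel on every data point and averages them on a grid. The function name gaussian_kde_1d, the grid, and the bandwidth of 0.3 are hypothetical choices for this example, not values taken from the text.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    u = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at those distances
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Average over the samples and rescale by the bandwidth
    return kernel.mean(axis=1) / bandwidth

# Same mixture as in the listing above: log-normal plus normal values
rng = np.random.RandomState(17)
samples = np.concatenate((rng.lognormal(0, 0.3, 1000),
                          rng.normal(3, 1, 1000)))

grid = np.linspace(-2.0, 8.0, 500)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)  # bandwidth is an assumed value

# A density is non-negative and integrates to approximately 1
area = density.sum() * (grid[1] - grid[0])
print("approximate area under the curve:", area)
```

The resulting curve can be drawn with plt.plot(grid, density). A smaller bandwidth follows the data more closely, while a larger one gives a smoother curve.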