Lecture 7


COSC 3107: Machine Learning

Lecture 7
Statistical Data Visualization

Shahzad Hussain
Lecturer

Previous Lecture Summary


• Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
• Summarization

Shahzad Hussain, Lecturer, Khawaja Fareed University of Engineering and Information Technology

Today’s Lecture Outline
• Statistical Visualization
i. The Purpose of Visualization Tools: Graphs
ii. Plotting an Analytical Function
iii. Components of a Graph
iv. Creating Graphs for Mathematical Functions
v. Seaborn
vi. Which Tool Should I Use
vii. Types of Graphs
viii. Line Graphs
ix. Creating Line Graphs Using Different Libraries
x. Pandas DataFrames and Grouped Data


1. The Purpose of Visualization Tools: Graphs

• An analysis is not complete without visualizations,
even with big datasets, so knowing how to
generate images and graphs from data in Python is
relevant to our goal of big data analysis.
• There are several visualization libraries for
Python, such as Plotly and Bokeh.
• One of the oldest, most flexible, and most widely used is
Matplotlib.
• Every analysis, whether on small or large datasets,
involves a descriptive statistics step, where the
data is summarized and described by statistics such
as mean, median, percentages, and correlation.
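
The descriptive statistics step mentioned above can be sketched with pandas. This is only an illustration: the DataFrame and its column names (`hours_studied`, `exam_score`, `passed`) are hypothetical and do not come from the lecture.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10],
    "exam_score": [50, 60, 65, 80, 95],
    "passed": [False, True, True, True, True],
})

# Mean and median of a numeric column
mean_score = df["exam_score"].mean()
median_score = df["exam_score"].median()

# Percentage of rows where a boolean condition holds
pass_percentage = df["passed"].mean() * 100

# Pearson correlation between two numeric columns
correlation = df["hours_studied"].corr(df["exam_score"])

# describe() gives count, mean, std, min, quartiles, and max in one call
print(df[["hours_studied", "exam_score"]].describe())
print(mean_score, median_score, pass_percentage, correlation)
```

A summary table like this is usually the starting point; the graphs discussed in the rest of the lecture show the same information visually.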


• This step is commonly the first step in the analysis workflow,
allowing a preliminary understanding of the data and its
general patterns and behaviors, providing grounds for the
analyst to formulate hypotheses, and directing the next steps in the
analysis.
• Graphs are powerful tools to aid in this step, enabling the analyst
to visualize the data, create new views and concepts, and
communicate them to a larger audience.

• A graph used for analysis and for communicating
information, including statistics, should have a few qualities:
– Show the data
– Avoid distorting what the data has to say
– Make large datasets coherent
– Serve a reasonably clear purpose: description, exploration, tabulation, or
decoration

• Graphs must reveal information.
• We should keep these principles in mind whenever we
create graphs for an analysis.
• A graph should also be able to stand on its own,
outside the analysis.
• Suppose you are writing an analysis report that
becomes extensive, and you need to summarize it.
A graph can make the main points of the analysis clear,
but it must support the summary without the full report
behind it. For the graph to stand on its own, we have to
add more information to it, such as a title and axis labels.
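
As a minimal sketch of a graph that can stand on its own, the example below adds a title, axis labels, and a legend using Matplotlib. The plotted function and all label texts are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: a simple quadratic function
x = np.linspace(-50, 50, 100)
y = np.power(x, 2)

fig, ax = plt.subplots()
ax.plot(x, y, label="f(x) = x^2")

# Information that lets the graph be read without the surrounding report
ax.set_title("Quadratic function on [-50, 50]")
ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.legend()
```

Outside a notebook, call `plt.show()` or `fig.savefig(...)` to display or save the figure.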


2. Plotting an Analytical Function

• x = interval [-50, 50] with 100 sample points
• y = f(x) = x^2
# Required packages
# (%matplotlib inline is a Jupyter magic; omit it in a plain script)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Generate the data
x = np.linspace(-50, 50, 100)
y = np.power(x, 2)

# Plot the data on a graph using the Pyplot API
plt.plot(x, y)
Figure: Basic plot of x and y

• x = interval [-50, 50] with 100 sample points
• y = f(x) = x^3
# Required packages
# (%matplotlib inline is a Jupyter magic; omit it in a plain script)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Generate the data
x = np.linspace(-50, 50, 100)
y = np.power(x, 3)

# Plot the data on a graph using the Pyplot API
plt.plot(x, y)
Figure: Basic plot of x and y

3. Components of a Graph


• Each graph has a set of common components that can be
adjusted. The names that Matplotlib uses for these
components are described below.

The components of a graph are as follows:
• Figure: The base of the graph, where all the other components are
drawn.
• Axes: Contains most of the figure elements and sets the coordinate system.
• Title: The title gives the graph its name.
• X-axis label: The name of the x-axis, usually named with the units.
• Y-axis label: The name of the y-axis, usually named with the units.
• Legend: A description of the data plotted in the graph, allowing you to
identify the curves and points in the graph.
• Ticks and tick labels: They indicate the points of reference on a scale for
the graph, where the values of the data are. The labels indicate the values
themselves.
• Line plots: These are the lines that are plotted with the data.
• Markers: Markers are the pictograms that mark the point data.
• Spines: The lines that delimit the area of the graph where data is plotted.
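
The components listed above can be touched one by one through Matplotlib's object-oriented API. The following is a minimal sketch; the data, label texts, and tick positions are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 11)

# Figure (the canvas) and Axes (the plotting area with its coordinate system)
fig, ax = plt.subplots()

# Line plots, each with a marker at every data point
ax.plot(x, x, marker="o", label="y = x")
ax.plot(x, 2 * x, marker="s", label="y = 2x")

# Title and axis labels
ax.set_title("Components of a graph")
ax.set_xlabel("x (arbitrary units)")
ax.set_ylabel("y (arbitrary units)")

# Ticks and tick labels on the x-axis
ax.set_xticks([0, 5, 10])

# Legend built from the label= arguments above
ax.legend()

# Spines: hide the top and right borders of the plotting area
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```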


4. Creating Graphs for Mathematical Functions

• x = interval [0, 100] with 500 sample points
• y = f(x) = sin(2πx/100)
# Required packages
# (%matplotlib inline is a Jupyter magic; omit it in a plain script)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Generate the data
x = np.linspace(0, 100, 500)
y = np.sin(2*np.pi*x/100)

# Plot the data on a graph using the object-oriented API (fig and ax)
fig, ax = plt.subplots()
ax.plot(x, y)
Figure: Plot output using the object-oriented API
• x = interval [0, 100] with 200 sample points
• y = f(x) = sin(x)
# Required packages
# (%matplotlib inline is a Jupyter magic; omit it in a plain script)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Generate the data
x = np.linspace(0, 100, 200)
y = np.sin(x)  # lowercase x: the variable defined above

# Plot the data on a graph using the object-oriented API (fig and ax)
fig, ax = plt.subplots()
ax.plot(x, y)
Figure: Graph for a mathematical function
5. Seaborn

• Seaborn (https://seaborn.pydata.org/) is part of the
PyData family of tools and is a visualization library based on
Matplotlib with the goal of creating statistical graphs more
easily.
• It can operate directly on DataFrames and Series, doing
aggregations and mapping internally.
• Seaborn uses color palettes and styles to make
visualizations consistent and more informative.
• It also has functions that can calculate some statistics,
such as regression, estimation, and errors.
• Some specialized plots, such as violin plots and multi-facet
plots, are also easy to create with Seaborn.
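
A minimal sketch of the points above, assuming Seaborn 0.11 or later is installed. The DataFrame is synthetic and its column names (`group`, `x`, `y`) are hypothetical; the example shows Seaborn reading columns straight from a DataFrame, drawing a violin plot, and fitting a regression line internally.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 50),
    "x": rng.normal(size=150),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=150)

# Apply one of Seaborn's built-in styles to all following plots
sns.set_theme(style="whitegrid")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Violin plot: distribution of y within each group
sns.violinplot(data=df, x="group", y="y", ax=axes[0])

# Scatter plot plus a fitted linear regression line with a confidence band
sns.regplot(data=df, x="x", y="y", ax=axes[1])
```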

6. Which Tool Should I Use

• There is no rule to determine whether an analyst should use
only the pandas plotting interface, Matplotlib directly, or
Seaborn.
• Analysts should keep in mind the visualization
requirements and the level of configuration required to
create the desired graph.
• Pandas' plotting interface is easier to use but is more
constrained and limited.
• Seaborn has several graph patterns ready to use, including
common statistical graphs such as pair plots and boxplots,
but it requires the data to be in a tidy format and is
more opinionated about how the graphs should look.
• Matplotlib is the base for both cases and is more flexible
than both, but it demands a lot more code to create the
same visualizations as the two other options.
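
The "tidy format" mentioned above usually means one observation per row and one variable per column. As a sketch, a wide table can be reshaped with `DataFrame.melt`; the table and its column names (`day`, `sensor_a`, `sensor_b`) are hypothetical.

```python
import pandas as pd

# Wide format: one column per sensor (hypothetical data)
wide = pd.DataFrame({
    "day": [1, 2, 3],
    "sensor_a": [20.1, 20.5, 21.0],
    "sensor_b": [19.8, 20.0, 20.4],
})

# Tidy (long) format: one row per (day, sensor) observation
tidy = wide.melt(id_vars="day", var_name="sensor", value_name="reading")
print(tidy)
```

In the tidy table, a grouping column such as `sensor` can be mapped to color or facets by a plotting function, which is harder to express when each sensor is its own column.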

7. Types of Graphs

• Line Graph: A line graph displays data as a series of interconnected
points on two axes (x and y), usually Cartesian, commonly ordered by the
x-axis. Line charts are useful for demonstrating trends in data, such
as in time series.

• Scatter Graph: A scatter plot represents the data as points in
Cartesian coordinates. Usually, two variables are shown in this
graph, although more information can be conveyed if the data is
color-coded or size-coded by category, for example. Scatter plots are
useful for showing the relationship and possible correlation between
variables.

• Histograms are useful for representing the distribution of data.
Histograms show only one variable, usually on the x-axis, while the
y-axis shows the frequency of occurrence of the data.

• Boxplots can also be used for representing distributions, and they
help to compare groups of data using summary statistics such as the
median, the quartiles, and the interquartile range. Boxplots are used
to visualize the data distribution and outliers.
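
The four graph types described in this section can be sketched side by side with Matplotlib. All data below is randomly generated for illustration and has no connection to any dataset in the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for each graph type
rng = np.random.default_rng(42)
x = np.arange(50)
trend = 0.5 * x + rng.normal(scale=2.0, size=x.size)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.5, size=200)
groups = [rng.normal(loc=m, size=100) for m in (0.0, 1.0, 2.0)]

fig, axes = plt.subplots(2, 2, figsize=(9, 7))

# Line graph: points connected in x order, good for trends
axes[0, 0].plot(x, trend)
axes[0, 0].set_title("Line graph")

# Scatter graph: relationship between two variables
axes[0, 1].scatter(a, b, s=10)
axes[0, 1].set_title("Scatter graph")

# Histogram: distribution of a single variable
axes[1, 0].hist(a, bins=20)
axes[1, 0].set_title("Histogram")

# Boxplot: median, quartiles, and outliers for several groups
axes[1, 1].boxplot(groups)
axes[1, 1].set_xticklabels(["group 1", "group 2", "group 3"])
axes[1, 1].set_title("Boxplot")

fig.tight_layout()
```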


8. Line Graphs

• Line graphs connect the data points with a line.
• Line graphs are useful for demonstrating tendencies and trends.
• More than one line can be used on the same graph, for a
comparison between the behavior of each line, although
care must be taken so that the units on the graph are the
same.
• They can also demonstrate the relationship between an
independent and a dependent variable.
• A common case for this is time series.


Line Graph – Time Series


• Time series plots, as the name suggests, graph the
behavior of the data with respect to time.
• Time series graphs are used frequently in financial areas
and environmental sciences, for instance to show a historical
series of temperature anomalies.


Usually, a time series graph has the time variable on the x-axis.
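
A minimal time series sketch, with time on the x-axis as described above. The monthly "anomaly" values are synthetic (a linear drift plus noise), not real temperature data, and the 12-month rolling mean is an added illustration of comparing two lines in the same units.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly series: NOT real measurements
rng = np.random.default_rng(1)
dates = pd.date_range("2000-01-01", periods=120, freq="MS")
values = np.linspace(-0.2, 0.6, 120) + rng.normal(scale=0.1, size=120)
anomaly = pd.Series(values, index=dates)

# Smoothed version of the same series (same units, so it can share the axes)
smoothed = anomaly.rolling(window=12).mean()

fig, ax = plt.subplots()
ax.plot(anomaly.index, anomaly.values, label="Monthly value")
ax.plot(smoothed.index, smoothed.values, label="12-month rolling mean")
ax.set_xlabel("Date")
ax.set_ylabel("Anomaly (synthetic units)")
ax.set_title("Synthetic time series")
ax.legend()
```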

9. Creating Line Graphs Using Different Libraries

• Let us compare the creation process in Matplotlib, Pandas, and Seaborn. We will
create a Pandas DataFrame with random values and plot it using various methods:
• x = the integers 0 to 99 (100 sample points)
• y = random integers in [0, 200), one for each x sample

# Required packages
# (%matplotlib inline is a Jupyter magic; omit it in a plain script)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Generate the data
X = np.arange(0, 100)
Y = np.random.randint(0, 200, size=X.shape[0])

# Plot the data on a graph using the Pyplot API
plt.plot(X, Y)

# let's create a Pandas DataFrame with the created values:
df = pd.DataFrame({'x':X, 'y_col':Y})

# Plot it using the Pyplot interface, but with the data argument:
plt.plot('x', 'y_col', data=df)

# With the same DataFrame, we can also plot directly from the Pandas DataFrame:
df.plot('x', 'y_col')

# What about Seaborn? Let's create the same line plot with Seaborn.
# Recent Seaborn versions expect x and y as keyword arguments.
import seaborn as sns

sns.lineplot(x=X, y=Y)
sns.lineplot(x='x', y='y_col', data=df)

10. Pandas DataFrames and Grouped Data

• Pandas uses Matplotlib under the hood, so the two integrate well.
• Depending on the situation, we can either plot directly from
pandas or create a figure and an axes with Matplotlib and pass
the axes to pandas to plot on. For example, a GroupBy
separates the data by the GroupBy key.
• But how can we plot the results of a GroupBy? We have a few
approaches at our disposal. We can, for example, use pandas
directly, if the DataFrame is already in the right format:
fig, ax = plt.subplots()
df = pd.read_csv('dow_jones_index.data')
df[df.stock.isin(['MSFT', 'GE', 'PG'])].groupby('stock')['volume'].plot(ax=ax)

fig, ax = plt.subplots()
df.groupby('stock').volume.plot(ax=ax)
Note that we are using the plot functions from pandas but passing the
Axes that we created directly with Matplotlib as an argument.
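
Since the `dow_jones_index.data` file may not be at hand, the same idea can be sketched on a small synthetic table. The `stock` and `volume` column names follow the slide; the `week` column and all values are assumptions made for illustration. The explicit loop over groups is one alternative approach that makes the legend labels unambiguous.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the stock data: NOT the real dataset
rng = np.random.default_rng(7)
stocks = ["MSFT", "GE", "PG"]
df = pd.DataFrame({
    "stock": np.repeat(stocks, 10),
    "week": np.tile(np.arange(1, 11), 3),
    "volume": rng.integers(1_000, 5_000, size=30),
})

# Create the Axes with Matplotlib, then let pandas draw on it
fig, ax = plt.subplots()
for name, group in df.groupby("stock"):
    # One line per group, labeled with the GroupBy key
    group.set_index("week")["volume"].plot(ax=ax, label=name)

ax.set_xlabel("Week")
ax.set_ylabel("Volume")
ax.legend(title="Stock")
```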

Today’s Lecture Summary


• Statistical Visualization
i. The Purpose of Visualization Tools: Graphs
ii. Plotting an Analytical Function
iii. Components of a Graph
iv. Creating Graphs for Mathematical Functions
v. Seaborn
vi. Which Tool Should I Use
vii. Types of Graphs
viii. Line Graphs
ix. Creating Line Graphs Using Different Libraries
x. Pandas DataFrames and Grouped Data
