
Unit I

UNIT I EXPLORATORY DATA ANALYSIS


EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for EDA- Data
transformation techniques-merging database, reshaping and pivoting, Transformation techniques -
Grouping Datasets - data aggregation – Pivot tables and cross-tabulations.

Understanding EDA
Step 1

First, we will import all the Python libraries that are required for this, which include NumPy for
numerical calculations and scientific computing, Pandas for handling data,
and Matplotlib and Seaborn for visualization.

Step 2
Then we will load the data into the Pandas data frame.
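A minimal sketch of Steps 1 and 2; the file name happiness.csv is only a placeholder for whatever dataset you are working with:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data into a Pandas DataFrame (replace the file name with your own dataset)
df = pd.read_csv("happiness.csv")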
Step 3
We can observe the dataset by checking a few of the rows using the head() method, which returns the
first five records from the dataset.
Step 4
Using shape, we can observe the dimensions of the data.
Step 5
The info() method shows some of the characteristics of the data, such as the column names, the number of
non-null values in each column, the dtype of each column, and the memory usage.
Step 6
We will use the describe() method, which shows basic statistical characteristics of each numerical feature
(int64 and float64 types): the number of non-missing values, mean, standard deviation, minimum and
maximum, and the 25th, 50th (median), and 75th percentiles.
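Assuming df is the DataFrame loaded above, Steps 3 to 6 look like this:

print(df.head())       # first five rows of the dataset
print(df.shape)        # (number of rows, number of columns)
df.info()              # column names, non-null counts, dtypes, memory usage
print(df.describe())   # count, mean, std, min, 25%, 50% (median), 75%, max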
Step 7
Handling missing values in the dataset. Luckily, this dataset doesn't have any missing values, but real-world
data is rarely this clean, so a few values have been removed intentionally just to show how to handle this
particular case.
We can check whether our data contains any null values with isnull(), as in the sketch after the list below.
Now we can handle the missing values by using a few techniques:

● Drop the missing values – If the dataset is huge and the missing values are very few, then we can directly
drop them because doing so will not have much impact.
● Replace with the mean – We can replace the missing values with the mean, but this is not advisable if the
data has outliers.
● Replace with the median – We can replace the missing values with the median, which is recommended if
the data has outliers.
● Replace with the mode – We can do this in the case of a categorical feature.
● Regression – A regression model can be used to predict the null value using other details from the dataset.
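A minimal sketch of these techniques on a tiny, made-up DataFrame (each fill is an alternative, not a pipeline):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, np.nan, 31, 40], "City": ["Oslo", "Delhi", None, "Oslo"]})
print(df.isnull().sum())                                # count of missing values per column

dropped = df.dropna()                                   # drop rows that contain missing values
df["Age"] = df["Age"].fillna(df["Age"].median())        # or .mean(); the median is robust to outliers
df["City"] = df["City"].fillna(df["City"].mode()[0])    # the mode for a categorical feature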

Step 8

We can check for duplicate values in our dataset, as the presence of duplicate values will hamper the
accuracy of our ML model.

Step 9

Handling the outliers in the data, i.e. the extreme values in the data. We can find the outliers in our data
using a boxplot.
To handle them, we can either drop the outlier values or replace them using the IQR (Interquartile Range)
method, as sketched below.
The IQR is calculated as the difference between the 75th and the 25th percentile of the data. The
percentiles can be calculated by sorting the data and selecting values at specific indices. The IQR is used
to identify outliers by defining limits on the sample values that lie a factor k (commonly 1.5) of the IQR
beyond the quartiles.
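A minimal sketch of Step 9; the Income column and its values are made up:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"Income": [31, 35, 36, 38, 40, 41, 43, 120]})   # 120 is an obvious outlier

sns.boxplot(x=df["Income"])                        # the boxplot makes the outlier visible
plt.show()

q1, q3 = df["Income"].quantile([0.25, 0.75])
iqr = q3 - q1                                      # 75th percentile minus 25th percentile
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # limits at a factor k = 1.5 of the IQR
df_clean = df[(df["Income"] >= lower) & (df["Income"] <= upper)]   # drop the outliers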
Step 10

Normalizing and Scaling – Data normalization or feature scaling is a process of standardizing the range of
the features of the data, as the ranges may vary a lot, so we preprocess the data before applying ML
algorithms. For the numerical values we will use StandardScaler, which uses the formula
(x - mean) / standard deviation.
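A minimal sketch of Step 10, assuming scikit-learn is installed; the columns and values are made up:

import pandas as pd
from sklearn.preprocessing import StandardScaler   # from scikit-learn

df = pd.DataFrame({"height": [150, 160, 170, 180], "weight": [50, 65, 72, 90]})

scaler = StandardScaler()                  # applies (x - mean) / standard deviation
df[df.columns] = scaler.fit_transform(df)  # scale every numerical column
print(df)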
Step 11
We can find the pairwise correlation between the different columns of the data using
the corr() method.
The resulting coefficient is a value between -1 and 1 inclusive, where:

● 1: Total positive linear correlation

● 0: No linear correlation, the two variables most likely do not affect each other

● -1: Total negative linear correlation


Pearson Correlation is the default method of the function “corr”.
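A minimal sketch of Step 11 on a small, made-up DataFrame:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6], "z": [5, 4, 3, 2, 1]})

corr_matrix = df.corr()   # pairwise Pearson correlation by default
print(corr_matrix)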
Step 12
Now, using Seaborn, we will visualize the relation between Economy (GDP per Capita) and Happiness
Score by using a regression plot. As we can see, as the Economy increases, the Happiness Score
increases as well, denoting a positive relation.
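A minimal sketch of Step 12; the column names follow the happiness-report example in the text, and the values here are invented:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"Economy (GDP per Capita)": [0.5, 0.8, 1.0, 1.2, 1.4],
                   "Happiness Score": [4.2, 5.1, 5.8, 6.3, 7.0]})

sns.regplot(x="Economy (GDP per Capita)", y="Happiness Score", data=df)   # regression plot
plt.show()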

Understanding data science

Data Science is a combination of mathematics, statistics, machine learning, and computer science. Data
Science is collecting, analyzing and interpreting data to gather insights into the data that can help
decision-makers make informed decisions.
Data Science is used in almost every industry today to predict customer behavior and trends and to
identify new opportunities. Businesses can use it to make informed decisions about product
development and marketing. It is used as a tool to detect fraud and optimize processes. Governments
also use Data Science to improve efficiency in the delivery of public services.

In simple terms, Data Science helps to analyze data and extract meaningful insights from it by combining
statistics & mathematics, programming skills, and subject expertise.

Importance of Data Science

Nowadays, organizations are overwhelmed with data. Data Science helps extract meaningful
insights from that data by combining various methods, technologies, and tools. In fields such as e-commerce,
finance, medicine, and human resources, businesses come across huge amounts of data, and Data Science
tools and technologies help them process all of it.

History of Data Science

Early in the 1960s, the term “Data Science” was coined to help comprehend and analyze the massive
volumes of data being gathered at the time. Data science is a discipline that is constantly developing,
employing computer science and statistical methods to acquire insights and generate valuable
predictions in a variety of industries.

Data Science – Prerequisites

● Statistics
Data science relies on statistics to capture and transform data patterns into usable evidence through the
use of complex machine-learning techniques.

● Programming
Python, R, and SQL are the most common programming languages. To successfully execute a data
science project, it is important to instill some level of programming knowledge.

● Machine Learning
Making accurate forecasts and estimates is made possible by Machine Learning, which is a crucial
component of data science. You must have a firm understanding of machine learning if you want to
succeed in the field of data science.

● Databases
A clear understanding of the functioning of Databases, and skills to manage and extract data is a must in
this domain.
● Modeling
You may quickly calculate and predict using mathematical models based on the data you already know.
Modeling helps in determining which algorithm is best suited to handle a certain issue and how to train
these models.

What is Data Science used for?

● Descriptive Analysis
It helps in accurately displaying data points for patterns that may appear that satisfy all of the data’s
requirements. In other words, it involves organizing, ordering, and manipulating data to produce
information that is insightful about the supplied data. It also involves converting raw data into a form
that will make it simple to grasp and interpret.

● Predictive Analysis
It is the process of using historical data along with various techniques like data mining, statistical
modeling, and machine learning to forecast future results. Utilizing trends in this data, businesses use
predictive analytics to spot dangers and opportunities.

● Diagnostic Analysis
It is an in-depth examination to understand why something happened. Techniques like drill-down, data
discovery, data mining, and correlations are used to describe it. Multiple data operations and
transformations may be performed on a given data set to discover unique patterns in each of these
techniques.

● Prescriptive Analysis
Prescriptive analysis advances the use of predictive data. It foresees what is most likely to occur and
offers the best course of action for dealing with that result. It can assess the probable effects of various
decisions and suggest the optimal course of action. It makes use of machine learning recommendation
engines, complex event processing, neural networks, graph analysis, and simulation.

What is the Data Science process?

● Obtaining the data


The first step is to identify what type of data needs to be analyzed; this data then needs to be exported
to an Excel or CSV file.

● Scrubbing the data


It is essential because before you can read the data, you must ensure it is in a perfectly readable state,
without any mistakes, with no missing or wrong values.
● Exploratory Analysis
Analyzing the data is done by visualizing the data in various ways and identifying patterns to spot
anything out of the ordinary. To analyze the data, you must have excellent attention to detail to identify
if anything is out of place.

● Modeling or Machine Learning


A data engineer or scientist writes down instructions for the Machine Learning algorithm to follow based
on the Data that has to be analyzed. The algorithm iteratively uses these instructions to come up with
the correct output.

● Interpreting the data


In this step, you uncover your findings and present them to the organization. The most critical skill in this
would be your ability to explain your results.

What are different Data Science tools?

Here are a few examples of tools that will assist Data Scientists in making their job easier.

● Data Analysis – Informatica PowerCenter, RapidMiner, Excel, SAS

● Data Visualization – Tableau, QlikView, RAW, Jupyter

● Data Warehousing – Apache Hadoop, Informatica/Talend, Microsoft HDInsight

● Data Modelling – H2O.ai, DataRobot, Azure ML Studio, Mahout


Benefits of Data Science in Business

● Improves business predictions

● Interpretation of complex data

● Better decision making

● Product innovation

● Improves data security

● Development of user-centric products


Applications of Data Science

● Product Recommendation
The product recommendation technique can influence customers to buy similar products. For example,
a salesperson of Big Bazaar is trying to increase the store’s sales by bundling the products together and
giving discounts. So he bundled shampoo and conditioner together and gave a discount on them.
Furthermore, customers will buy them together for a discounted price.
● Future Forecasting
It is one of the widely applied techniques in Data Science. On the basis of various types of data that are
collected from various sources weather forecasting and future forecasting are done.

● Fraud and Risk Detection


It is one of the most logical applications of Data Science. Since online transactions are booming, the risk
of losing your money or data is real. For example, credit card fraud detection depends on the amount,
merchant, location, time, and other variables. If any of them looks unnatural, the transaction will be
automatically canceled, and the card will be blocked for 24 hours or more.

● Self-Driving Car
The self-driving car is one of the most successful inventions in today’s world. We train our car to make
decisions independently based on the previous data. In this process, we can penalize our model if it does
not perform well. The car becomes more intelligent with time when it starts learning through all the
real-time experiences.

● Image Recognition
When you want to recognize some images, data science can detect the object and classify it. The most
famous example of image recognition is face recognition – if you ask your smartphone to unlock itself, it
will scan your face. So first, the system will detect the face, then classify your face as a human face, and
after that, it will decide whether the phone belongs to the actual owner or not.

● Speech to text Convert


Speech recognition is a process of understanding natural language by the computer. We are quite
familiar with virtual assistants like Siri, Alexa, and Google Assistant.

● Healthcare
Data Science helps in various branches of healthcare such as Medical Image Analysis, Development of
new drugs, Genetics and Genomics, and providing virtual assistance to patients.

● Search Engines
Google, Yahoo, Bing, Ask, etc. provide us with a lot of results within a fraction of a second. This is made
possible by various data science algorithms.

How to become Data Scientist?

Role of a Data Scientist

As businesses generate more data than ever, it becomes clear that data is a valuable asset. However,
extracting meaningful insights from data requires analyzing the data, which is where Data Scientists
come in. A Data Scientist is a specialist in collecting, organizing, analyzing, and interpreting data to find
trends, patterns, and correlations.

Data Scientists play an essential role in ensuring that organizations make informed decisions. They work
closely with business leaders to identify specific objectives, such as identifying customer segmentation
and driving improvements in products and services. Using advanced machine learning algorithms and
statistical models, Data Scientists can examine large datasets to uncover patterns and insights that help
organizations make sound decisions.

Data Scientists generally have a combination of technical skills and knowledge of interpreting and
visualizing data. They must have expertise in statistical analysis, programming languages, machine
learning algorithms, and database systems.

An overview of the responsibilities handled by a professional Data Scientist:

● Gathering, cleaning, and organizing data to be used in predictive and prescriptive models

● Analyzing vast amounts of information to discover trends and patterns

● Using programming languages to structure the data and convert it into usable information

● Working with stakeholders to understand business problems and develop data-driven solutions

● Developing predictive models using statistical models to forecast future trends

● Building, maintaining, and monitoring machine learning models

● Developing and using advanced machine learning algorithms and other analytical methods to
create data-driven solutions
● Communicating data-driven solutions to stakeholders

● Discover hidden patterns and trends in massive datasets using a variety of data mining tools

● Developing and validating data solutions through data visualizations, reports, dashboards, and
presentations

Significance of EDA
Exploratory Data Analysis is a crucial step before you jump to machine learning or modeling of your
data. It provides the context needed to develop an appropriate model – and to interpret the results
correctly. Exploratory Data Analysis is essential for tackling specific tasks such as:
● Spotting missing and erroneous data

● Mapping and understanding the underlying structure of your data

● Identifying the most important variables in your dataset

● Testing a hypothesis or checking assumptions related to a specific model

● Establishing a parsimonious model (one that can explain your data using minimum variables)
Making sense of data
A dataset contains many observations about a particular object. For instance, a dataset about patients in
a hospital can contain many observations. A patient can be described by a patient identifier (ID), name,
address, weight, date of birth, email, and gender. Each of these features that describes a
patient is a variable. Each observation can have a specific value for each of these variables.

Numerical data
This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood
pressure, heart rate, temperature, number of teeth, number of bones, and number of family
members. This data is often referred to as quantitative data in statistics. A numerical dataset can be of
either the discrete or the continuous type.

Discrete data
This is data that is countable and its values can be listed out. For example, if we flip a coin, the number
of heads in 200 coin flips can take values from 0 to 200 (finite) cases. A variable that represents a
discrete dataset is referred to as a discrete variable. The discrete variable takes a fixed number of
distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and
Japan. It is fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so
on.
Continuous data
A variable that can have an infinite number of numerical values within a specific range is classified as
continuous data. A variable describing continuous data is a continuous variable. For example, what is the
temperature of your city today? Can the possible values be listed out as a finite set? Similarly, the weight
variable in the previous section is a continuous variable.

Categorical data
This type of data represents the characteristics of an object; for example, gender, marital status, type of
address, or categories of the movies. This data is often referred to as qualitative datasets in statistics.

● A binary categorical variable can take exactly two values and is also referred to as
a dichotomous variable. For example, when you create an experiment, the result is either
success or failure. Hence, results can be understood as a binary categorical variable.
● Polytomous variables are categorical variables that can take more than two possible values.

Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal, interval,
and ratio.

Nominal

These are used for labeling variables without any quantitative value. The scales are generally
referred to as labels.
Ordinal
The main difference in the ordinal and nominal scale is the order. In ordinal scales, the order of the
values is a significant factor.

Interval
In interval scales, both the order and exact differences between the values are significant.

Ratio

Ratio scales contain order, exact values, and an absolute zero, which makes it possible to use them in
descriptive and inferential statistics.

Comparing EDA with classical and Bayesian analysis

TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:


Some of the most common tools used to create an EDA are:
1. R: An open-source programming language and free software environment for statistical computing
and graphics supported by the R foundation for statistical computing. The R language is widely used
among statisticians in developing statistical observations and data analysis.
2. Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level,
built-in data structures, combined with dynamic binding, make it very attractive for rapid
application development, as well as for use as a scripting or glue language to connect existing
components together. Python and EDA are often used together to spot missing values in a data set,
which is vital so you can decide how to handle missing values for machine learning.
Apart from these functions described above, EDA can also:

● Perform k-means clustering: an unsupervised learning algorithm in which the data points are assigned
to clusters, also referred to as k-groups. k-means clustering is commonly used in market segmentation,
image compression, and pattern recognition.
● EDA is often used in predictive models such as linear regression, where it is used to predict outcomes.
● It is also used in univariate, bivariate, and multivariate visualization for summary statistics,
establishing relationships between variables, and understanding how different fields within the
data interact with one another.
Visual Aids for EDA

● Line chart

● Bar chart

● Scatter plot

● Area plot and stacked plot

● Pie chart

● Table chart

● Polar chart

● Histogram

(Note: each of the above can be studied separately.)


https://www.analyticsvidhya.com/blog/2021/06/exploratory-data-analysis-using-data-visualization-
techniques/ (Kindly refer this link)

Data transformation techniques

Data merging is the process of combining two or more data sets into a single data set. Most often, this
process is necessary when you have raw data stored in multiple files, worksheets, or data tables, that
you want to analyze all in one go.

There are two common examples in which a data analyst will need to merge new cases into a main, or
principal, data file:

1. They have collected data in a longitudinal study (tracker) – a project in which an analyst collects data
over a period of time and analyzes it at intervals.

2. They have collected data in a before-and-after project – where the analyst collects data before an event,
and then again after.

Similarly, some analysts collect data for a specific set of variables, but may at a later stage augment it
with either data in different variables, or with data that comes from a different source altogether. Thus,
there are three situations that may necessitate merging data into an existing file: you can add new
cases, new variables (or both new cases and variables), or data based on one or more look-up values.

Merging in New Cases


Merging in new cases, sometimes known as appending data (or in SQL, “unions”) or adding data by
rows (i.e. you’re adding new rows of data to each column), assumes that the variables in the two files
you’re merging are the same in nature. For instance, var1 in the example below should be numeric in
both files, and not a string (text) variable in one file and numeric in the other. Most software
matches up the data based on the variable name, and so the same names should be used across the two
files: “var1” in one file should be “var1” in the other.
This also assumes that the IDs for each case are different. If it should happen that you have a variable in
one file that doesn’t have a match in the other, then missing data (blank values) may be inserted for
those rows that do not have data.
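A minimal Pandas sketch of appending new cases; the file contents are made up, and concat() matches columns by name, filling missing columns with blank (NaN) values:

import pandas as pd

wave1 = pd.DataFrame({"id": [1, 2], "var1": [10, 12]})
wave2 = pd.DataFrame({"id": [3, 4], "var1": [11, 15], "var2": ["x", "y"]})

# Append rows; var2 does not exist in wave1, so those cells become NaN
combined = pd.concat([wave1, wave2], ignore_index=True)
print(combined)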

Merging in New Variables

Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the
two files to be the same, but the variable names should be different. In this scenario, which is
sometimes referred to as augmenting your data (or in SQL, “joins”) or merging data by columns (i.e.
you’re adding new columns of data to each row), you’re adding in new variables with information for
each existing case in your data file. As with merging new cases where not all variables are present, the
same thing applies if you merge in new variables where some cases are missing – these should simply be
given blank values.
It could also happen that you have a new file with both new cases and new variables. The approach
here will depend on the software you’re using for your merge. If the software cannot handle merging
both variables and cases at the same time, then consider first merging in only the new variables for the
existing sample (i.e. augment first), and then append the new cases across all variables as a second step
to your merge.
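A minimal sketch of augmenting data with new variables in Pandas: the IDs are shared, the variable names differ, and the values are made up:

import pandas as pd

main_file = pd.DataFrame({"id": [1, 2, 3], "age": [34, 29, 51]})
new_vars = pd.DataFrame({"id": [1, 2, 4], "score": [7, 9, 6]})

# Merge columns on the shared ID; cases present in only one file get NaN for the other file's variables
augmented = pd.merge(main_file, new_vars, on="id", how="outer")
print(augmented)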

Merging in Data using Look-ups


The above is all well and good if you have complete data sets to combine. You can, however, also augment
your data with information from other sources. Consider, for instance, a data file where you have
collected the zip codes (or postcodes) of your respondents, and you want to attach some demographic
data to your survey data – maybe the average income in each zip code.

You will have your survey data on the one hand, and a list of zip codes with corresponding income values
on the other. Here, the zip code would be referred to as a look-up code and would function as the ID
value did in our previous examples.
In other words, we use the look-up code as the identifier and add the income values into a new
variable in our data file: the data is matched up for each case by looking up the zip code and then
augmenting the original data with the income data for each matching zip code.
For those familiar with Excel, the formula that performs this type of augmentation is =VLOOKUP().
The look-up code should be unique in the file that contains the additional data (in our example, each zip-
code should only appear once, with a single associated income), but the same value can appear multiple
times in the file you’re wanting to augment. Think of it like this: lots of people can share a zip-code, but
there’s only one average income for each of those locations.
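A hedged sketch of the zip-code look-up described above, implemented as a many-to-one merge in Pandas (the zip codes and incomes are invented):

import pandas as pd

survey = pd.DataFrame({"respondent": [1, 2, 3], "zip": ["10001", "10002", "10001"]})
incomes = pd.DataFrame({"zip": ["10001", "10002"], "avg_income": [52000, 61000]})

# The zip code may repeat in the survey file but must be unique in the look-up file,
# just like the VLOOKUP() example; validate= enforces that rule
survey = survey.merge(incomes, on="zip", how="left", validate="many_to_one")
print(survey)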

Reshaping and pivoting


(Refer this link in detail) https://ai.plainenglish.io/reshaping-and-pivoting-in-pandas-a41678e72d68
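Since the detail is in the linked article, here is only a brief hedged sketch of reshaping and pivoting with Pandas, using made-up city/sales data:

import pandas as pd

long_df = pd.DataFrame({
    "date":  ["2024-01", "2024-01", "2024-02", "2024-02"],
    "city":  ["Oslo", "Delhi", "Oslo", "Delhi"],
    "sales": [10, 20, 15, 25],
})

wide = long_df.pivot(index="date", columns="city", values="sales")        # long -> wide
long_again = wide.reset_index().melt(id_vars="date", value_name="sales")  # wide -> long
stacked = wide.stack()          # push the columns into the row index
unstacked = stacked.unstack()   # and back again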

Grouping Datasets

Aggregation and Grouping

Computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives
insight into the nature of a potentially large dataset, is a core part of grouping and summarizing data.
Simple Aggregation in Pandas
"Aggregations: Min, Max, and Everything In Between"
Aggregation         Description
count()             Total number of items
first(), last()     First and last item
mean(), median()    Mean and median
min(), max()        Minimum and maximum
std(), var()        Standard deviation and variance
mad()               Mean absolute deviation
prod()              Product of all items
sum()               Sum of all items
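A minimal sketch of these aggregations on a small, made-up Series and DataFrame (note that mad() is deprecated in recent Pandas releases):

import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])
print(ser.sum(), ser.mean(), ser.median(), ser.min(), ser.max())
print(ser.std(), ser.var(), ser.prod(), ser.count())

df = pd.DataFrame({"A": np.arange(5), "B": np.arange(5) ** 2})
print(df.mean())       # one aggregate per column
print(df.describe())   # several common aggregates at once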

GroupBy: Split, Apply, Combine

Split, apply, combine


● The split step involves breaking up and grouping a DataFrame depending on the value of the
specified key.
● The apply step involves computing some function, usually an aggregate, transformation, or
filtering, within the individual groups.
● The combine step merges the results of these operations into an output array.
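A minimal sketch of the split, apply, combine steps above, using a made-up key column:

import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "A", "B", "A"],
                   "data": [1, 2, 3, 4, 5]})

# split on "key", apply sum() within each group, combine into one row per group
print(df.groupby("key").sum())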

Column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a
modified GroupBy object.

Iteration over groups


The GroupBy object supports direct iteration over the groups, returning each group as
a Series or DataFrame.

GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a
variety of useful operations before combining the grouped data.

Aggregation
GroupBy aggregations with sum(), median()

Filtering
A filtering operation allows you to drop data based on the group properties.

Transformation
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine.

The apply() method


The apply() method lets you apply an arbitrary function to the group results.
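A hedged sketch of these four methods on a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "A", "B"],
                   "data": [1, 2, 3, 4]})
grouped = df.groupby("key")

print(grouped.aggregate(["min", "median", "max"]))      # several aggregates at once
print(grouped.filter(lambda g: g["data"].sum() > 4))    # keep only groups whose sum exceeds 4
print(grouped.transform(lambda x: x - x.mean()))        # full-length, group-wise centered result
print(grouped.apply(lambda g: g["data"].sum() / g["data"].count()))  # arbitrary per-group function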

Specifying the split key


We split the DataFrame on a single column name.

(https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)

Data aggregation

Data aggregation refers to a process of collecting information from different sources and presenting it in
a summarized format so that business analysts can perform statistical analyses of business schemes. The
collected information may be gathered from various data sources to summarize these data sources into
a draft for data analysis.
Working of data aggregators
The working of data aggregators can be performed in three stages

o Collection of data
o Processing of data
o Presentation of data

Collection of data

As the name suggests, the collection of data means gathering data from different sources. The data can
be extracted using Internet of Things (IoT) sources such as

o Social media interactions
o News headlines
o Speech recognition, for example in call centers
o Browsing history and personal data from devices

Processing of data

Once data is collected, the data aggregator determines the atomic data and aggregates it. In the data
processing step, data aggregators use numerous algorithms from AI or ML techniques, and also utilize
statistical methodology, such as predictive analysis, to process the data.

Presentation of data

In this step, the gathered information is summarized, providing a desirable statistical output with
accurate data.

Choice of automated or manual data aggregators

Data aggregation can also be applied manually. When starting out, a startup can choose a manual
aggregator, using Excel sheets and creating charts to manage performance, marketing and budget.

A well-established organization uses a data aggregation middleware, typically third-party software, to
aggregate the data automatically using various marketing tools. In the case of huge datasets, an
automated data aggregator system is needed because it provides accurate outcomes.

Types of Data Aggregation


Data Aggregation can be divided into two different types

1. Time Aggregation
2. Spatial aggregation

Time Aggregation

Time aggregation provides the data point for an individual resource for a defined period.

Spatial aggregation
Spatial aggregation provides the data point for various groups of resources for a defined period.

Pivot tables and cross-tabulations

Pivot tables and crosstabs are ways to display and analyze sets of data. Both are similar to each other,
with pivot tables having just a few added features.

Pivot tables and crosstabs present data in tabular format, with rows and columns displaying certain
data. This data can be aggregated as a sum, count, max, min, or average if desired. These tools allow the
user to easily recognize trends, see relationships between their data, and access information quickly and
efficiently.
The Differences Between Pivot Tables and Crosstabs

Pivot tables and crosstabs are nearly identical in form, and the terms are often used interchangeably.
However, pivot tables present some added benefits that regular crosstabs do not.

● Pivot tables allow the user to create additional reports on the spot by easily rearranging, adding,

counting, and deleting certain data entries.

● Pivot tables work well with hierarchical organization, where data sets can be drilled into to reveal
more information. For example, when viewing the total sales at a store by month, you can drill
further into the data and see the sales data on individual products for each month. With a basic
crosstab, you would have to go back to the program and create a separate crosstab with the
information on individual products.

● Pivot tables let the user filter through their data, add or remove custom fields, and change the

appearance of their report.


When They Are Most Effective

Pivot tables and crosstabs work well with any sized data set. They both present quick and efficient ways
to analyze and summarize data. They are most useful with larger sets of data because the more data
there is, the more difficult it becomes to recognize relationships without pivot tables/crosstabs or other
visualization tools.
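A minimal Pandas sketch of both tools on a made-up sales table:

import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "product": ["Shampoo", "Conditioner", "Shampoo", "Conditioner", "Shampoo"],
    "units":   [30, 20, 25, 35, 10],
})

# Pivot table: total units sold, aggregated by month (rows) and product (columns)
print(pd.pivot_table(sales, index="month", columns="product", values="units", aggfunc="sum"))

# Cross-tabulation: counts of each month/product combination
print(pd.crosstab(sales["month"], sales["product"]))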
UNIT II VISUALIZING USING MATPLOTLIB
Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density and
contour plots – Histograms – legends – colors – subplots – text and annotation – customization –
three dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn.

Importing Matplotlib

Data Visualization is the process of presenting data in the form of graphs or charts. It helps in
understanding large and complex amounts of data very easily. It allows decision-makers to make
decisions very efficiently and also helps them identify new trends and patterns very easily. It is
also used in high-level data analysis for Machine Learning and Exploratory Data Analysis (EDA). Data
visualization can be done with various tools like Tableau, Power BI, and Python.
In this article, we will discuss how to visualize data with the help of the Matplotlib library of Python.
Matplotlib
Matplotlib is a low-level library of Python which is used for data visualization. It is easy to use and
produces MATLAB-like graphs and visualizations. This library is built on top of NumPy arrays and
consists of several plot types like line charts, bar charts, and histograms. It provides a lot of flexibility,
but at the cost of writing more code.
Installation
We will use the pip command to install this module. If you do not have pip installed then refer to the
article, Download and install pip Latest Version.

(https://www.naukri.com/learning/articles/data-visualization-using-matplotlib/)
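A minimal first plot, assuming Matplotlib has been installed with pip install matplotlib (the data values are arbitrary):

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])   # a first, minimal line plot
plt.xlabel("x")
plt.ylabel("x squared")
plt.show()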

Simple line plots

Single Line Plot


A single line plot presents data on x-y axes using a line joining the data points. To obtain such a graph,
Seaborn comes with an inbuilt function to draw a line plot called lineplot().

Syntax: lineplot(x,y,data)
where,

x– data variable for x-axis


y- data variable for y-axis
data- data to be plotted
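A minimal sketch of the syntax above on made-up data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"day": [1, 2, 3, 4, 5], "sales": [3, 7, 5, 9, 6]})

sns.lineplot(x="day", y="sales", data=df)   # x, y, data as in the syntax above
plt.show()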

Setting different styles


A line plot can also be displayed with a different background style using the set() function available in the
seaborn module itself.

Syntax:
set()

Attributes:

● Context: plotting context parameters

● Style: defines style

● Palette: sets a color palette

● Font

● Font_scale: sets font size

● Color_codes: If set True then palette is activated, short hand notations for colors can be remapped

from the palette.

● rc: parameter to over-ride the above parameters

Multiple Line Plot


At times, data needs to be compared against other data, and for such cases a multiple line plot can be
drawn. A multiple line plot helps differentiate between data sets so that each can be studied and
understood with respect to the others. Each line plot basically follows the concept of a single line
plot but differs in the way it is presented on the screen. The line plot of each data set can be made
distinct by changing its color, line style, size, or all of these, and a scale can be used to read it.

To differentiate on the basis of color


lineplot(x,y,data,hue)

where hue decides on the basis of which variable the data is to be separated and displayed

Error Bars in Line Plot


Error bars are used to show error rates in a line plot which can be used to study intervals in a plot. For
this, err_style attribute can be employed. This takes only two attributes, either band or bars.

Syntax:
lineplot(x,y,data,err_style)

Color Palette along the Line Plot


The color scheme depicted by lines can be changed using a palette attribute along with hue. Different
colors supported using palette can be chosen from- SEABORN COLOR PALETTE
Syntax:
lineplot(x,y,data,hue,palette)
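A hedged sketch combining hue, err_style, palette, and set(); the tips dataset is a small sample set fetched by Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # small sample dataset fetched by Seaborn

sns.set(style="darkgrid")                       # background style
sns.lineplot(x="size", y="total_bill", data=tips,
             hue="sex",                         # separate line per category
             err_style="band",                  # or "bars"
             palette="coolwarm")                # color palette for the hue levels
plt.show()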

Simple scatter plots

A scatter plot is a means to represent data in a graphical format. A simple scatter plot makes use of
the coordinate axes to plot the points, based on their values. The following data for age (of the child, in
years) and height (of the child, in feet) can be represented as a scatter plot.

Example:

Age of the Child (years)    Height (feet)
3                           2.3
4                           2.7
5                           3.1
6                           3.6
7                           3.8
8                           4.0
9                           4.3
10                          4.5

How to Construct a Scatter Plot?

There are three simple steps to plot a scatter plot.

● STEP I: Identify the x-axis and y-axis for the scatter plot.

● STEP II: Define the scale for each of the axes.

● STEP III: Plot the points based on their values.
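A minimal sketch of these three steps, plotting the age/height table above with Matplotlib:

import matplotlib.pyplot as plt

age = [3, 4, 5, 6, 7, 8, 9, 10]                   # STEP I/II: x-axis is age in years
height = [2.3, 2.7, 3.1, 3.6, 3.8, 4.0, 4.3, 4.5]  # y-axis is height in feet

plt.scatter(age, height)                           # STEP III: plot the points
plt.xlabel("Age of the Child (years)")
plt.ylabel("Height (feet)")
plt.show()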

Types of Scatter Plot

A scatter plot helps find the relationship between two variables. This relationship is referred to as a
correlation. Based on the correlation, scatter plots can be classified as follows.

● Scatter Plot for Positive Correlation


● Scatter Plot for Negative Correlation

● Scatter Plot for Null Correlation

Scatter Plot for Positive Correlation

A scatter plot with increasing values of both variables can be said to have a positive correlation.

Scatter Plot for Negative Correlation

A scatter plot with an increasing value of one variable and a decreasing value for another variable can be
said to have a negative correlation.

Scatter Plot for Null Correlation

A scatter plot with no clear increasing or decreasing trend in the values of the variables is said to have
no correlation. Here the points are distributed randomly across the graph.

What is Scatter Plot Analysis?

Analysis of a scatter plot helps us understand the following aspects of the data.

● The different levels of correlation among the data points are useful to understand the relationship

within the data.

● A line of best fit can be drawn for the given data and used to further predict new data values.

● The data points lying outside the given set of data can be easily identified to find the outliers.

● The grouping of data points in a scatter plot can be identified as different clusters within the data.

(https://www.cuemath.com/data/scatter-plot/)

visualizing errors
Error bars are a graphical enhancement that visualizes the variability of the plotted data on a
Cartesian graph. They can be applied to graphs to provide an additional layer of detail on the
presented data.
A short error bar shows that the values are concentrated, signaling that the plotted (averaged) value is
more likely, while a long error bar indicates that the values are more spread out and less reliable.

Also, depending on the type of data, the length of each pair of error bars tends to be equal on both
sides; however, if the data is skewed, the lengths on each side will be unbalanced.

Error bars always run parallel to the quantitative-scale axis, so they can be displayed either vertically or
horizontally depending on whether the quantitative scale is on the y-axis or the x-axis. If there are two
quantitative scales, two pairs of error bars can be used, one for each axis.
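A minimal sketch of vertical error bars in Matplotlib; the values and error sizes are made up:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 6)
y = np.array([2.0, 2.5, 3.1, 3.6, 4.0])      # plotted (e.g. averaged) values
yerr = np.array([0.2, 0.4, 0.1, 0.5, 0.3])   # variability of each value

plt.errorbar(x, y, yerr=yerr, fmt="-o", capsize=4)   # vertical error bars with caps
plt.show()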

Density and contour plots

A contour plot is a graphical method to visualize the 3-D surface by plotting constant Z slices called
contours in a 2-D format. The contour plot is an alternative to a 3-D surface plot

The contour plot is formed by:

● Vertical axis: Independent variable 2

● Horizontal axis: Independent variable 1

● Lines: iso-response values, which can be calculated as a function of (x, y).

The independent variables are usually restricted to a regular grid. The actual techniques for determining the
correct iso-response values are rather complex and almost always computer-generated.

The contour plot is used to depict the change in Z values as compared to X and Y values. If the data (or
function) do not form a regular grid, you typically need to perform a 2-D interpolation to form a regular
grid.

For one-variable data, a run-sequence plot or histogram is considered necessary. For two-variable data, a
scatter plot is considered necessary. Contour plots can also use polar coordinates (r, theta) instead of
traditional rectangular (x, y, z) coordinates.

Types of Contour Plot:

● Rectangular Contour plot: A projection of 2D-plot in 2D-rectangular canvas. It is the most common

form of the contour plot.


● Polar contour plot: Polar contour plot is plotted by using the polar coordinates r and theta. The

response variable here is the collection of values generated while passing r and theta into the given
function, where r is the distance from origin and theta is the angle from the positive x axis.

● Ternary contour plot: Ternary contour plot is used to represent the relationship between 3

explanatory variables and the response variable in the form of a filled triangle.
Contour plot can be plotted in different programming languages:

● Python/Matplotlib: Contour plots can be drawn using the plt.contour or plt.contourf functions,

where plt is matplotlib.pyplot. The difference between the two is that plt.contour generates a
hollow (line-only) contour plot, while plt.contourf generates a filled one.

● Matlab: functions such as contourf (2d-plot) and contour3 (3D-contour) can be used for contour

plotting

● R: can create a contour plot with filled.contour functions in R.
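A minimal Matplotlib sketch of the first option above, on a synthetic grid:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)               # regular grid of the independent variables
Z = np.sin(X) * np.cos(Y)              # response (Z) values on the grid

plt.contour(X, Y, Z)                   # hollow contour lines
plt.contourf(X, Y, Z, cmap="viridis")  # filled contours
plt.colorbar()
plt.show()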

Histograms

Histograms

A histogram is a variation of a bar chart in which data values are grouped together and put into different
classes. This grouping enables you to see how frequently the data in each class occurs in the dataset.

The histogram graphically shows the following:


● Frequency of different data points in the dataset.

● Location of the center of data.

● The spread of dataset.

● Skewness/variance of dataset.

● Presence of outliers in the dataset.

The features provide a strong indication of the proper distributional model in the data. The probability
plot or a goodness-of-fit test can be used to verify the distributional model.

The histogram contains the following axes:

● Vertical Axis: Frequency/count of each bin.

● Horizontal Axis: List of bins/categories.
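A minimal sketch using plt.hist() on randomly generated data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=1000)   # sample data from a normal distribution

plt.hist(data, bins=20)      # horizontal axis: bins, vertical axis: frequency per bin
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()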

Interpretations of Histogram:

● Normal Histogram: It is a classical bell-shaped histogram with most of the frequency counts focused

in the middle, with diminishing tails, and there is symmetry with respect to the median. Since the
normal distribution is most commonly observed in real-world scenarios, you are most likely to find
these. In a normally distributed histogram, the mean is almost equal to the median.

● Non-normal Short-tailed/Long-tailed Histogram: In a short-tailed distribution the tail approaches 0 very

fast as we move away from the median of the data, while in a long-tailed histogram the tail approaches 0
slowly as we move far from the median. Here, we refer to the tails as the extreme regions of the
histogram, on both sides of the peak, where most of the data is not concentrated.

● Bimodal Histogram: A mode of the data represents the most common values in the histogram (i.e. a peak

of the histogram). A bimodal histogram has two peaks. The histogram can be used to test the
unimodality of data; bimodality (or, more generally, non-unimodality) in the dataset suggests that
something is wrong with the process. A bimodal histogram may exhibit one or both of two
characteristics: a bimodal normal distribution and a symmetric distribution.
● Skewed Left/Right Histogram: Skewed histograms are those where the tail on one side is quite clearly

longer than the tail on the other side. A right-skewed histogram means that the right-side tail of the peak
is more stretched than the left, and vice versa for a left-skewed one. In a left-skewed histogram, the mean
is always less than the median, while in a right-skewed histogram the mean is greater than the
median.

● Uniform Histogram: In a uniform histogram, each bin contains approximately the same number of

counts (frequency). An example of a uniform histogram is rolling a die n (n >> 30) times and recording
the frequency of the different outcomes.

● Normal Distribution with an Outlier: This histogram is similar to a normal histogram except that it

contains an outlier whose count/probability is substantial. This is mostly due to some system error in
the process, which leads to faulty generation of products, etc.

Legends
Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the data displayed
in the graph’s Y-axis. It generally appears as the box containing a small sample of each color on the
graph and a small description of what this data means.

The attribute bbox_to_anchor=(x, y) of legend() function is used to specify the coordinates of the
legend, and the attribute ncol represents the number of columns that the legend has. Its default value is
1.

Syntax:
matplotlib.pyplot.legend([“name1”, “name2”], bbox_to_anchor=(x, y), ncol=1)
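A minimal sketch of the legend() syntax above (the labels and data are arbitrary):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
plt.plot(x, [1, 4, 9, 16])
plt.plot(x, [1, 2, 3, 4])

plt.legend(["squares", "linear"], bbox_to_anchor=(1.0, 1.0), ncol=1)   # legend position and columns
plt.show()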

Before moving any further with Matplotlib let’s discuss some important classes that will be used further
in the tutorial. These classes are –

● Figure

● Axes

Note: Matplotlib takes care of the creation of inbuilt defaults like Figure and Axes.
Figure class
Consider the figure class as the overall window or page on which everything is drawn. It is a top-level
container that contains one or more axes. A figure can be created using the figure() method.
Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None, facecolor=None, edgecolor=None, linewidth=0.0,
frameon=None, subplotpars=None, tight_layout=None, constrained_layout=None)

Axes Class
Axes class is the most basic and flexible unit for creating sub-plots. A given figure may contain many
axes, but a given axes can only be present in one figure. The axes() function creates the axes object.
Syntax:
axes([left, bottom, width, height])

Just like pyplot class, axes class also provides methods for adding titles, legends, limits, labels, etc. Let’s
see a few of them –

● Adding Title – ax.set_title()

● Adding X Label and Y label – ax.set_xlabel(), ax.set_ylabel()

● Setting Limits – ax.set_xlim(), ax.set_ylim()

● Tick labels – ax.set_xticklabels(), ax.set_yticklabels()

● Adding Legends – ax.legend()
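A hedged sketch combining the Figure and Axes methods listed above (the data values are arbitrary):

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 3))
ax = plt.axes()                          # add an Axes object to the figure

ax.plot([1, 2, 3], [2, 4, 1], label="demo line")
ax.set_title("Axes methods demo")        # adding a title
ax.set_xlabel("x")                       # x and y labels
ax.set_ylabel("y")
ax.set_xlim(0, 4)                        # setting limits
ax.set_ylim(0, 5)
ax.legend()                              # adding a legend
plt.show()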

Types of Color Palette

Three major types of color palette exist for data visualization:

● Qualitative palettes

● Sequential palettes

● Diverging palettes
The type of color palette that you use in a visualization depends on the nature of the data mapped to
color.

A qualitative palette is used when the variable is categorical in nature. Categorical variables are those
that take on distinct labels without inherent ordering. Examples include country or state, race, and
gender. Each possible value of the variable is assigned one color from a qualitative palette.

Sequential palettes

When the variable assigned to be colored is numeric or has inherently ordered values, then it can be
depicted with a sequential palette. Colors are assigned to data values in a continuum, usually based on
lightness, hue, or both.

Diverging palettes

If our numeric variable has a meaningful central value, like zero, then we can apply a diverging
palette. A diverging palette is essentially a combination of two sequential palettes with a shared
endpoint sitting at the central value. Values larger than the center are assigned to colors on one
side of the center, while smaller values get assigned to colors on the opposing side.
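A brief sketch of the three palette types using Seaborn's color_palette(); the palette names chosen here are common built-ins:

import seaborn as sns
import matplotlib.pyplot as plt

qualitative = sns.color_palette("deep", 6)      # distinct hues for categorical data
sequential = sns.color_palette("Blues", 6)      # light-to-dark ramp for ordered data
diverging = sns.color_palette("coolwarm", 6)    # two ramps meeting at a central value

sns.palplot(diverging)   # quick visual check of a palette
plt.show()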

Subplots:
Using subplot() method.
This method adds another plot at the specified grid position in the current figure.

Syntax:
subplot(nrows, ncols, index, **kwargs)

subplot(pos, **kwargs)

subplot(ax)

Using subplots() method


This function is used to create figures and multiple subplots at the same time.

Syntax:
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True,
subplot_kw=None, gridspec_kw=None, **fig_kw)
Using subplot2grid() method
This function creates axes object at a specified location inside a grid and also helps in spanning the axes
object across multiple rows or columns. In simpler words, this function is used to create multiple charts
within the same figure.

Syntax:
plt.subplot2grid(shape, location, rowspan, colspan)
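A minimal sketch of the three approaches; the grid shapes and data are arbitrary:

import matplotlib.pyplot as plt

# subplot(): add one plot at a grid position of the current figure
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3], [1, 2, 3])
plt.subplot(1, 2, 2)
plt.plot([1, 2, 3], [3, 2, 1])

# subplots(): create the figure and all axes at once
fig, axs = plt.subplots(nrows=2, ncols=2, sharex=True)
axs[0, 0].plot([1, 2, 3], [1, 4, 9])

# subplot2grid(): span an axes object across several grid cells
ax_wide = plt.subplot2grid((3, 3), (0, 0), colspan=3)
plt.show()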

Text and annotation

Transforms and Text Position

In Matplotlib, controlling where text is anchored (in data, axes, or figure coordinates) is done by modifying the transform.

Mathematically, such coordinate transformations are relatively straightforward, and Matplotlib has a
well-developed set of tools that it uses internally to perform them (these tools can be explored in
the matplotlib.transforms submodule).

There are three pre-defined transforms that can be useful in this situation:

● ax.transData: Transform associated with data coordinates

● ax.transAxes: Transform associated with the axes (in units of axes dimensions)

● fig.transFigure: Transform associated with the figure (in units of figure dimensions)

The transData coordinates give the usual data coordinates associated with the x- and y-axis labels.
The transAxes coordinates give the location from the bottom-left corner of the axes (here the white
box), as a fraction of the axes size. The transFigure coordinates are similar, but specify the position from
the bottom-left of the figure (here the gray box), as a fraction of the figure size.

Arrows and Annotation

plt.arrow()

plt.annotate()
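A minimal sketch of the three transforms together with annotate() and an arrow; all positions and labels are arbitrary:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

ax.text(2, 4, "data coords", transform=ax.transData)      # positioned in data units (the default)
ax.text(0.5, 0.1, "axes coords", transform=ax.transAxes)  # fraction of the axes box
fig.text(0.05, 0.95, "figure coords")                     # fraction of the figure (fig.transFigure)

ax.annotate("peak", xy=(3, 9), xytext=(1.5, 7),
            arrowprops=dict(arrowstyle="->"))             # arrow from the text to the point
plt.show()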
Introduction to 3D Plotting with Matplotlib
We will define the axes of our 3D plot, specifying that the projection of the plot will be in
“3D” format. This creates an empty 3D axes figure in the canvas. After this, if we show the
plot using plt.show(), it would look like an empty 3D figure.
Example: Creating an empty 3D figure using Matplotlib

● In this example (see the sketch after this list), first, we import packages from the Python library in order to have a 3D

plot in our empty canvas. So, for that, we import numpy, matplotlib.pyplot,
and mpl_toolkits.

● After importing all the necessary packages, we are creating an empty figure using plt.figure().

● After that, we are defining the axis of the plot where we are specifying that the plot will be of 3D

projection.

● After that, we are taking 3 arrays with a wide range of arbitrary points which will act as X, Y, and Z

coordinates for plotting the graph respectively. Now after initializing the points, we are plotting a 3D
plot using ax.plot3D() where we are using x,y,z as the X, Y, and Z coordinates respectively and the
color of the line will be red.

● All these things are sent as parameters inside the bracket.


● After that, we are also plotting a scatter plot with the same set of values and as we progress with

each point, the color of the points will be based on the values that the coordinates contain.

● After this, we are showing the above plot.
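The code for the example walked through above is not reproduced in this text, so the following is a hedged reconstruction along the lines described (the sine/cosine curve is just arbitrary demo data):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d   # enables the 3D projection

fig = plt.figure()                 # empty figure
ax = plt.axes(projection="3d")     # empty 3D axes

z = np.linspace(0, 15, 100)        # arbitrary X, Y, Z coordinates
x = np.sin(z)
y = np.cos(z)

ax.plot3D(x, y, z, color="red")              # 3D line plot with a red line
ax.scatter3D(x, y, z, c=z, cmap="viridis")   # 3D scatter, colored by the coordinate values
plt.show()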

Geographic Data with Basemap

One common type of visualization in data science is that of geographic data. Matplotlib's main tool for
this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live
under the mpl_toolkits namespace.

Installation of Basemap is straightforward; if you're using conda you can type this and the package will
be downloaded:

$ conda install basemap

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting


x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

Map Projections
Depending on the intended use of the map projection, there are certain map features (e.g., direction,
area, distance, shape, or other considerations) that are useful to maintain.
Cylindrical projections
This type of mapping represents equatorial regions quite well, but results in extreme distortions near
the poles. The spacing of latitude lines varies between different cylindrical projections, leading to
different conservation properties, and different distortion near the poles. In the following figure we
show an example of the equidistant cylindrical projection, which chooses a latitude scaling that
preserves distances along meridians. Other cylindrical projections are the Mercator (projection='merc')
and the cylindrical equal area (projection='cea') projections.

Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain
vertical; this can give better properties near the poles of the projection. The Mollweide projection
(projection='moll') is one common example of this, in which all meridians are elliptical arcs. It is
constructed so as to preserve area across the map: though there are distortions near the poles, the area
of small patches reflects the true area. Other pseudo-cylindrical projections are the sinusoidal
(projection='sinu') and Robinson (projection='robin') projections.

Perspective projections
One common example is the orthographic projection (projection='ortho'), which shows one side of the
globe as seen from a viewer at a very long distance. As such, it can show only half the globe at a time.
Other perspective-based projections include the gnomonic projection (projection='gnom') and
stereographic projection (projection='stere'). These are often the most useful for showing small portions
of the map.

Conic projections
One example of this is the Lambert Conformal Conic projection (projection='lcc'), which we saw earlier
in the map of North America. It projects the map onto a cone arranged in such a way that two standard
parallels (specified in Basemap by lat_1 and lat_2) have well-represented distances, with scale
decreasing between them and increasing outside of them. Other useful conic projections are the
equidistant conic projection (projection='eqdc') and the Albers equal-area projection (projection='aea').
Conic projections, like perspective projections, tend to be good choices for representing small to
medium patches of the globe.

Drawing a Map Background


● Physical boundaries and bodies of water

o drawcoastlines(): Draw continental coast lines


o drawlsmask(): Draw a mask between the land and sea, for use with projecting images on
one or the other
o drawmapboundary(): Draw the map boundary, including the fill color for oceans.
o drawrivers(): Draw rivers on the map
o fillcontinents(): Fill the continents with a given color; optionally fill lakes with another
color
● Political boundaries

o drawcountries(): Draw country boundaries


o drawstates(): Draw US state boundaries
o drawcounties(): Draw US county boundaries
● Map features

o drawgreatcircle(): Draw a great circle between two points


o drawparallels(): Draw lines of constant latitude
o drawmeridians(): Draw lines of constant longitude
o drawmapscale(): Draw a linear scale on the map
● Whole-globe images

o bluemarble(): Project NASA's blue marble image onto the map


o shadedrelief(): Project a shaded relief image onto the map
o etopo(): Draw an etopo relief image onto the map
o warpimage(): Project a user-provided image onto the map

Plotting Data on Maps

Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto a
map background. For simple plotting and text, any plt function works on the map; you can use
the Basemap instance to project latitude and longitude coordinates to (x, y) coordinates for plotting
with plt, as we saw earlier in the Seattle example.

In addition to this, there are many map-specific functions available as methods of the Basemap instance.
These work very similarly to their standard Matplotlib counterparts, but have an additional Boolean
argument latlon, which if set to True allows you to pass raw latitudes and longitudes to the method,
rather than projected (x, y) coordinates.

Some of these map-specific methods are:

● contour()/contourf() : Draw contour lines or filled contours

● imshow(): Draw an image


● pcolor()/pcolormesh() : Draw a pseudocolor plot for irregular/regular meshes

● plot(): Draw lines and/or markers.

● scatter(): Draw points with markers.

● quiver(): Draw vectors.

● barbs(): Draw wind barbs.

● drawgreatcircle(): Draw a great circle.

(https://www.geeksforgeeks.org/working-with-geospatial-data-in-python/)

Visualization with Seaborn

(https://www.geeksforgeeks.org/python-seaborn-tutorial/)
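The details are in the linked tutorial; as a minimal hedged sketch, Seaborn's high-level functions work directly on DataFrames (the tips dataset is a small sample set fetched by Seaborn):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.set_theme(style="whitegrid")
sns.scatterplot(x="total_bill", y="tip", hue="time", data=tips)   # scatter plot colored by category
plt.show()

sns.histplot(tips["total_bill"], kde=True)   # histogram with a kernel density curve
plt.show()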
