Understanding EDA
Step 1
First, we will import all the Python libraries that are required for this, which include NumPy for
numerical calculations and scientific computing, Pandas for handling data,
and Matplotlib and Seaborn for visualization.
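A minimal sketch of these imports (np, pd, plt, and sns are the conventional aliases):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns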
Step 2
Then we will load the data into the Pandas data frame.
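For example, assuming the imports from Step 1 and a CSV file (the file name data.csv is just a placeholder):
df = pd.read_csv("data.csv")   # load the file into a DataFrame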
Step 3
We can observe the dataset by checking a few of the rows using the head() method, which returns the
first five records from the dataset.
Step 4
Using shape, we can observe the dimensions of the data.
Step 5
The info() method shows some of the characteristics of the data, such as the column names, the number
of non-null values in each column, the dtype of each column, and the memory usage.
Step 6
We will use the describe() method, which shows basic statistical characteristics of each numerical feature
(int64 and float64 types): number of non-missing values, mean, standard deviation, range, median, 0.25,
0.50, 0.75 quartiles.
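A short sketch of the inspection calls from Steps 3 to 6, assuming the DataFrame df loaded in Step 2:
df.head()      # first five rows
df.shape       # (number of rows, number of columns)
df.info()      # column names, non-null counts, dtypes, memory usage
df.describe()  # count, mean, std, min, quartiles, max for numeric columns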
Step 7
Handling missing values in the dataset. Luckily, this dataset doesn't have any missing values, but real-world
data is rarely this clean.
So I have removed a few values intentionally just to depict how to handle this particular case.
We can check whether our data contains any null values with the following command.
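One common way to do this check, assuming the DataFrame df from the previous steps:
df.isnull().sum()   # number of missing values in each column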
So, now we can handle the missing values by using a few techniques (a code sketch follows this list), which are
● Drop the missing values – If the dataset is huge and missing values are very few then we can directly
drop the values because it will not have much impact.
● Replace with mean values – We can replace the missing values with mean values, but this is not
advisable if the data has outliers.
● Replace with median values – We can replace the missing values with median values, which is
recommended if the data has outliers.
● Replace with mode values – We can do this in the case of a Categorical feature.
● Regression – It can be used to predict the null value using other details from the dataset.
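A hedged sketch of these techniques, assuming the DataFrame df from Step 2; the column names age and city are hypothetical, and each line shows one alternative:
df = df.dropna()                                       # drop rows with missing values
df["age"] = df["age"].fillna(df["age"].mean())         # replace with the mean
df["age"] = df["age"].fillna(df["age"].median())       # replace with the median (robust to outliers)
df["city"] = df["city"].fillna(df["city"].mode()[0])   # replace with the mode for a categorical feature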
Step 8
We can check for duplicate values in our dataset, as the presence of duplicate values will hamper the
accuracy of our ML model.
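A minimal sketch, again assuming the DataFrame df:
df.duplicated().sum()      # count duplicate rows
df = df.drop_duplicates()  # remove them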
Step 9
Handling the outliers in the data, i.e. the extreme values in the data. We can find the outliers in our data
using a boxplot.
To handle them, we can either drop the outlier values or replace them using the IQR (Interquartile Range)
method, as shown in the sketch below.
The IQR is calculated as the difference between the 75th and the 25th percentile of the data. The
percentiles can be calculated by sorting the data and then selecting values at specific indices. The IQR is
used to identify outliers by defining limits on the sample values that are a factor k (commonly k = 1.5) of
the IQR below the 25th percentile or above the 75th percentile.
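A sketch of the IQR method for a single hypothetical numeric column (age), assuming the imports from Step 1 and the usual factor k = 1.5:
sns.boxplot(x=df["age"])               # visualize potential outliers
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["age"] >= lower) & (df["age"] <= upper)]   # drop the outlier rows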
Step 10
Normalizing and Scaling – Data normalization or feature scaling is a process to standardize the range of
features of the data, as the ranges may vary a lot, before the data is fed to ML algorithms. For this, we
will use StandardScaler for the numerical values, which uses the formula (x - mean) / standard deviation.
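A sketch using scikit-learn's StandardScaler; the list of numeric columns is hypothetical:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols = ["age", "height"]                             # hypothetical numeric columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])    # (x - mean) / std deviation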
Step 11
We can find the pairwise correlation between the different columns of the data using
the corr() method.
The resulting coefficient is a value between -1 and 1 inclusive, where:
● 1: Total positive linear correlation
● -1: Total negative linear correlation
● 0: No linear correlation, the two variables most likely do not affect each other
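A minimal sketch, assuming the imports from Step 1; the heatmap is one common way to visualize the correlation matrix:
corr_matrix = df.corr()                 # pairwise correlation of the numeric columns
sns.heatmap(corr_matrix, annot=True)    # optional: visualize it with Seaborn
plt.show()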
Data Science is a combination of mathematics, statistics, machine learning, and computer science. Data
Science is collecting, analyzing and interpreting data to gather insights into the data that can help
decision-makers make informed decisions.
Data Science is used in almost every industry today to predict customer behavior and trends and to
identify new opportunities. Businesses can use it to make informed decisions about product
development and marketing. It is used as a tool to detect fraud and optimize processes. Governments
also use Data Science to improve efficiency in the delivery of public services.
In simple terms, Data Science helps to analyze data and extract meaningful insights from it by combining
statistics & mathematics, programming skills, and subject expertise.
Nowadays, organizations are overwhelmed with data. Data Science will help in extracting meaningful
insights from that by combining various methods, technology, and tools. In the fields of e-commerce,
finance, medicine, human resources, etc, businesses come across huge amounts of data. Data Science
tools and technologies help them process all of them.
Early in the 1960s, the term “Data Science” was coined to help comprehend and analyze the massive
volumes of data being gathered at the time. Data science is a discipline that is constantly developing,
employing computer science and statistical methods to acquire insights and generate valuable
predictions in a variety of industries.
● Statistics
Data science relies on statistics to capture and transform data patterns into usable evidence through the
use of complex machine-learning techniques.
● Programming
Python, R, and SQL are the most common programming languages. To successfully execute a data
science project, it is important to have some level of programming knowledge.
● Machine Learning
Making accurate forecasts and estimates is made possible by Machine Learning, which is a crucial
component of data science. You must have a firm understanding of machine learning if you want to
succeed in the field of data science.
● Databases
A clear understanding of the functioning of Databases, and skills to manage and extract data is a must in
this domain.
● Modeling
You may quickly calculate and predict using mathematical models based on the data you already know.
Modeling helps in determining which algorithm is best suited to handle a certain issue and how to train
these models.
● Descriptive Analysis
It helps in accurately displaying data points for patterns that may appear that satisfy all of the data’s
requirements. In other words, it involves organizing, ordering, and manipulating data to produce
information that is insightful about the supplied data. It also involves converting raw data into a form
that will make it simple to grasp and interpret.
● Predictive Analysis
It is the process of using historical data along with various techniques like data mining, statistical
modeling, and machine learning to forecast future results. Utilizing trends in this data, businesses use
predictive analytics to spot dangers and opportunities.
● Diagnostic Analysis
It is an in-depth examination to understand why something happened. Techniques like drill-down, data
discovery, data mining, and correlations are used to describe it. Multiple data operations and
transformations may be performed on a given data set to discover unique patterns in each of these
techniques.
● Prescriptive Analysis
Prescriptive analysis advances the use of predictive data. It foresees what is most likely to occur and
offers the best course of action for dealing with that result. It can assess the probable effects of various
decisions and suggest the optimal course of action. It makes use of machine-learning recommendation
engines, complex event processing, neural networks, graph analysis, and simulation.
Here are a few examples of applications where Data Science makes the job easier.
● Product innovation
● Product Recommendation
The product recommendation technique can influence customers to buy similar products. For example,
a salesperson of Big Bazaar is trying to increase the store’s sales by bundling the products together and
giving discounts. So he bundled shampoo and conditioner together and gave a discount on them.
As a result, customers will buy them together for a discounted price.
● Future Forecasting
It is one of the most widely applied techniques in Data Science. Weather forecasting and other kinds of
future forecasting are done on the basis of various types of data collected from various sources.
● Self-Driving Car
The self-driving car is one of the most successful inventions in today’s world. We train our car to make
decisions independently based on the previous data. In this process, we can penalize our model if it does
not perform well. The car becomes more intelligent over time as it learns from all of its real-time
experiences.
● Image Recognition
When you want to recognize some images, data science can detect the object and classify it. The most
famous example of image recognition is face recognition – if you ask your smartphone to unlock itself, it
will scan your face. So first, the system will detect the face, then classify your face as a human face, and
after that, it will decide if the phone belongs to the actual owner or not.
● Healthcare
Data Science helps in various branches of healthcare such as Medical Image Analysis, Development of
new drugs, Genetics and Genomics, and providing virtual assistance to patients.
● Search Engines
Google, Yahoo, Bing, Ask, etc. provide us with a lot of results within a fraction of a second. This is made
possible using various data science algorithms.
As businesses generate more data than ever, it becomes clear that data is a valuable asset. However,
extracting meaningful insights from data requires analyzing the data, which is where Data Scientists
come in. A Data Scientist is a specialist in collecting, organizing, analyzing, and interpreting data to find
trends, patterns, and correlations.
Data Scientists play an essential role in ensuring that organizations make informed decisions. They work
closely with business leaders to identify specific objectives, such as identifying customer segmentation
and driving improvements in products and services. Using advanced machine learning algorithms and
statistical models, Data Scientists can examine large datasets to uncover patterns and insights that help
organizations make sound decisions.
Data Scientists generally have a combination of technical skills and knowledge of interpreting and
visualizing data. They must have expertise in statistical analysis, programming languages, machine
learning algorithms, and database systems.
● Gathering, cleaning, and organizing data to be used in predictive and prescriptive models
● Using programming languages to structure the data and convert it into usable information
● Working with stakeholders to understand business problems and develop data-driven solutions
● Developing and using advanced machine learning algorithms and other analytical methods to
create data-driven solutions
● Communicating data-driven solutions to stakeholders
● Discovering hidden patterns and trends in massive datasets using a variety of data mining tools
● Developing and validating data solutions through data visualizations, reports, dashboards, and
presentations
Significance of EDA
Exploratory Data Analysis is a crucial step before you jump to machine learning or modeling of your
data. It provides the context needed to develop an appropriate model – and interpret the results
correctly. Exploratory Data Analysis is essential for tackling specific tasks such as:
● Spotting missing and erroneous data
● Establishing a parsimonious model (one that can explain your data using minimum variables)
Making sense of data
A dataset contains many observations about a particular object. For instance, a dataset about patients in
a hospital can contain many observations. A patient can be described by a patient identifier (ID), name,
address, weight, date of birth, email, and gender. Each of these features that describes a
patient is a variable. Each observation can have a specific value for each of these variables.
Numerical data
This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood
pressure, heart rate, temperature, number of teeth, number of bones, and the number of family
members. This data is often referred to as quantitative data in statistics. The numerical dataset can be
either discrete or continuous types.
Discrete data
This is data that is countable and its values can be listed out. For example, if we flip a coin, the number
of heads in 200 coin flips can take values from 0 to 200 (finite) cases. A variable that represents a
discrete dataset is referred to as a discrete variable. The discrete variable takes a fixed number of
distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and
Japan. It is fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so
on.
Continuous data
A variable that can have an infinite number of numerical values within a specific range is classified as
continuous data. A variable describing continuous data is a continuous variable. For example, what is the
temperature of your city today? Can the possible values be listed as a finite set? No, they cannot. Similarly,
the weight variable in the previous section is a continuous variable.
Categorical data
This type of data represents the characteristics of an object; for example, gender, marital status, type of
address, or categories of the movies. This data is often referred to as qualitative datasets in statistics.
● A binary categorical variable can take exactly two values and is also referred to as
a dichotomous variable. For example, when you create an experiment, the result is either
success or failure. Hence, results can be understood as a binary categorical variable.
● Polytomous variables are categorical variables that can take more than two possible values.
Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal, interval,
and ratio.
Nominal
These are practiced for labeling variables without any quantitative value. The scales are generally
referred to as labels.
Ordinal
The main difference in the ordinal and nominal scale is the order. In ordinal scales, the order of the
values is a significant factor.
Interval
In interval scales, both the order and exact differences between the values are significant.
Ratio
Ratio scales contain order, exact values, and absolute zero, which makes it possible to be used in
descriptive and inferential statistics.
● Perform k-means clustering: k-means clustering is an unsupervised learning algorithm in which the data
points are assigned to clusters, also referred to as k-groups. K-means clustering is commonly used in
market segmentation, image compression, and pattern recognition.
● EDA is often used in predictive models such as linear regression, where it is used to predict outcomes.
● It is also used in univariate, bivariate, and multivariate visualization for summary statistics,
establishing relationships between each variable, and understanding how different fields in the
data interact with one another.
Visual Aids for EDA
● Line chart
● Bar chart
● Scatter plot
● Pie chart
● Table chart
● Polar chart
● Histogram
Data merging is the process of combining two or more data sets into a single data set. Most often, this
process is necessary when you have raw data stored in multiple files, worksheets, or data tables, that
you want to analyze all in one go.
There are two common examples in which a data analyst will need to merge new cases into a main, or
principal, data file:
1. They have collected data in a longitudinal study (tracker) – a project in which an analyst collects data
over a period of time and analyzes it at intervals.
2. They have collected data in a before-and-after project – where the analyst collects data before an event,
and then again after.
Similarly, some analysts collect data for a specific set of variables, but may at a later stage augment it
with either data in different variables, or with data that comes from a different source altogether. Thus,
there are three situations that may necessitate merging data into an existing file: you can add new
cases, new variables (or both new cases and variables), or data based on one or more look-up values.
Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the
two files to be the same, but the variable names should be different. In this scenario, which is
sometimes referred to as augmenting your data (or in SQL, “joins”) or merging data by columns (i.e.
you’re adding new columns of data to each row), you’re adding in new variables with information for
each existing case in your data file. As with merging new cases where not all variables are present, the
same thing applies if you merge in new variables where some cases are missing – these should simply be
given blank values.
It could also happen that you have a new file with both new cases and new variables. The approach
here will depend on the software you’re using for your merge. If the software cannot handle merging
both variables and cases at the same time, then consider first merging in only the new variables for the
existing sample (i.e. augment first), and then append the new cases across all variables as a second step
to your merge.
For example, suppose you want to augment your survey data with the average income per zip-code.
You will have your survey data on the one hand (left in the diagram below), and a list of zip-codes with
corresponding income values on the other (right in the diagram). Here, the zip-code would be referred
to as a look-up code and function as the ID value did in our previous examples.
In other words, we use the look-up code as the identifier and add in the income values into a new
variable in our data file. In the diagram, observe how the data is matched up for each case by looking up
the zip-code and then augmenting the original data with the income data for each matching zip-code.
For those familiar with Excel, for instance, the formula to perform this type of augmentation is
=VLOOKUP().
The look-up code should be unique in the file that contains the additional data (in our example, each zip-
code should only appear once, with a single associated income), but the same value can appear multiple
times in the file you’re wanting to augment. Think of it like this: lots of people can share a zip-code, but
there’s only one average income for each of those locations.
Grouping Datasets
An essential piece of the analysis of large data is efficient summarization: computing aggregations like
sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a
potentially large dataset.
Simple Aggregation in Pandas
Some of the built-in Pandas aggregations ("Aggregations: Min, Max, and Everything In Between") are:

Aggregation         Description
sum()               Sum of items
mean(), median()    Mean and median
min(), max()        Minimum and maximum
Column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a
modified GroupBy object.
GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a
variety of useful operations before combining the grouped data.
Aggregation
GroupBy aggregations can be computed with convenience methods such as sum() and median().
Filtering
A filtering operation allows you to drop data based on the group properties.
Transformation
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine.
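A short sketch of these GroupBy operations on a small made-up DataFrame (the column names key and value are assumptions):
import pandas as pd
df = pd.DataFrame({"key": ["A", "B", "A", "B"], "value": [1, 2, 3, 4]})
df.groupby("key").aggregate(["sum", "median"])             # aggregation per group
df.groupby("key").filter(lambda g: g["value"].sum() > 4)   # keep only groups whose sum exceeds 4
df.groupby("key").transform(lambda x: x - x.mean())        # center each value on its group mean
df.groupby("key").apply(lambda g: g["value"] / g["value"].sum())  # arbitrary per-group function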
(https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)
Data aggregation
Data aggregation refers to a process of collecting information from different sources and presenting it in
a summarized format so that business analysts can perform statistical analyses of business schemes. The
collected information may be gathered from various data sources to summarize these data sources into
a draft for data analysis.
Working of data aggregators
The working of data aggregators can be divided into three stages:
o Collection of data
o Processing of data
o Presentation of data
Collection of data
As the name suggests, the collection of data means gathering data from different sources, for example
from internet of things (IoT) devices.
Processing of data
Once data is collected, the data aggregator determines the atomic data and aggregates it. In the data
processing step, data aggregators use numerous algorithms from AI or ML techniques, and they also apply
statistical methodology, such as predictive analysis, to process the data.
Presentation of data
In this step, the gathered information is summarized, providing a desirable statistical output with
accurate data.
Data aggregation can also be performed manually. When starting out, a startup can choose manual
aggregation by using Excel sheets and creating charts to manage performance, marketing, and budget.
There are two main types of data aggregation:
1. Time Aggregation
2. Spatial aggregation
Time Aggregation
Time aggregation provides the data point for an individual resource for a defined period.
Spatial aggregation
Spatial aggregation provides the data point for various groups of resources for a defined period.
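A hedged Pandas sketch of both kinds of aggregation; the column names and values are made up:
import pandas as pd
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 20:00",
                                 "2023-01-02 08:00", "2023-01-02 20:00"]),
    "region": ["north", "south", "north", "south"],
    "usage": [10, 12, 9, 14],
})
# time aggregation: one data point per resource per defined period (here, per day)
daily = readings.resample("D", on="timestamp")["usage"].sum()
# spatial aggregation: one data point per group of resources (here, per region)
by_region = readings.groupby("region")["usage"].sum()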
Pivot tables and crosstabs are ways to display and analyze sets of data. Both are similar to each other,
with pivot tables having just a few added features.
Pivot tables and crosstabs present data in tabular format, with rows and columns displaying certain
data. This data can be aggregated as a sum, count, max, min, or average if desired. These tools allow the
user to easily recognize trends, see relationships between their data, and access information quickly and
efficiently.
The Differences Between Pivot Tables and Crosstabs
Pivot tables and crosstabs are nearly identical in form, and the terms are often used interchangeably.
However, pivot tables present some added benefits that regular crosstabs do not.
● Pivot tables allow the user to create additional reports on the spot by easily rearranging, adding, or
removing fields.
● Pivot tables work well with hierarchal organization where data sets can be drilled into to reveal
more information. For example, when viewing the total sales at a store by month, you can drill
further into the data and see the sales data on individual products for each month. With a basic
crosstab, you would have to go back to the program and create a separate crosstab with the
information on individual products.
● Pivot tables let the user filter through their data, add or remove custom fields, and change the
appearance of their report.
Pivot tables and crosstabs work well with any sized data set. They both present quick and efficient ways
to analyze and summarize data. They are most useful with larger sets of data because the more data
there is, the more difficult it becomes to recognize relationships without pivot tables/crosstabs or other
visualization tools.
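A brief Pandas sketch; the store-sales columns here are made up for illustration:
import pandas as pd
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "product": ["Soap", "Shampoo", "Soap", "Shampoo"],
    "amount":  [100, 150, 120, 130],
})
# pivot table: total sales per month, broken down by product
pd.pivot_table(sales, index="month", columns="product", values="amount", aggfunc="sum")
# crosstab: count of rows for each month/product combination
pd.crosstab(sales["month"], sales["product"])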
UNIT II VISUALIZING USING MATPLOTLIB
Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density and
contour plots – Histograms – legends – colors – subplots – text and annotation – customization –
three dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn.
Importing Matplotlib
Data Visualization is the process of presenting data in the form of graphs or charts. It helps to
understand large and complex amounts of data very easily. It allows the decision-makers to make
decisions very efficiently and also allows them in identifying new trends and patterns very easily. It is
also used in high-level data analysis for Machine Learning and Exploratory Data Analysis (EDA). Data
visualization can be done with various tools like Tableau, Power BI, Python.
In this article, we will discuss how to visualize data with the help of the Matplotlib library of Python.
Matplotlib
Matplotlib is a low-level library of Python which is used for data visualization. It is easy to use and
emulates MATLAB-like graphs and visualization. This library is built on top of NumPy arrays and
consists of several plots like line chart, bar chart, histogram, etc. It provides a lot of flexibility but at the
cost of writing more code.
Installation
We will use the pip command to install this module. If you do not have pip installed then refer to the
article, Download and install pip Latest Version.
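The usual command is:
pip install matplotlib
Once installed, Matplotlib's pyplot interface is conventionally imported as:
import matplotlib.pyplot as plt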
(https://www.naukri.com/learning/articles/data-visualization-using-matplotlib/)
Syntax: lineplot(x, y, data)
where x and y are the variables to be plotted and data is the dataset being used.
Syntax:
set()
Attributes:
● font
● color_codes: If set to True, the palette is activated and shorthand notations for colors can be remapped.
Syntax:
lineplot(x, y, data, hue)
where hue decides on the basis of which variable the data is to be grouped and displayed.
Syntax:
lineplot(x, y, data, err_style)
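A hedged Seaborn sketch of these calls; the tips dataset is one of Seaborn's built-in examples (downloaded on first use), and the chosen columns are just for illustration:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
tips = sns.load_dataset("tips")        # built-in example dataset
sns.lineplot(x="size", y="total_bill", data=tips, hue="sex", err_style="band")
plt.show()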
A scatter plot is a means to represent data in a graphical format. A simple scatter plot makes use of
the coordinate axes to plot the points, based on their values. The following example data for age (of the
child, in years) and height (of the child, in feet) can be represented as a scatter plot.
Example:
Age (years)   Height (feet)
3             2.3
4             2.7
5             3.1
6             3.6
7             3.8
8             4
9             4.3
10            4.5
● STEP I: Identify the x-axis and y-axis for the scatter plot.
A scatter plot helps find the relationship between two variables. This relationship is referred to as a
correlation. Based on the correlation, scatter plots can be classified as follows.
A scatter plot with increasing values of both variables can be said to have a positive correlation.
A scatter plot with an increasing value of one variable and a decreasing value for another variable can be
said to have a negative correlation.
A scatter plot with no clear increasing or decreasing trend in the values of the variables is said to have
no correlation. Here the points are distributed randomly across the graph.
Analysis of a scatter plot helps us understand the following aspects of the data.
● The different levels of correlation among the data points are useful to understand the relationship
between the variables.
● A line of best fit can be drawn for the given data and used to further predict new data values.
● The data points lying outside the given set of data can be easily identified to find the outliers.
● The grouping of data points in a scatter plot can be identified as different clusters within the data.
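A minimal Matplotlib sketch that plots the age/height example above as a scatter plot:
import matplotlib.pyplot as plt
age    = [3, 4, 5, 6, 7, 8, 9, 10]                # x-axis: age in years
height = [2.3, 2.7, 3.1, 3.6, 3.8, 4, 4.3, 4.5]   # y-axis: height in feet
plt.scatter(age, height)
plt.xlabel("Age (years)")
plt.ylabel("Height (feet)")
plt.show()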
(https://www.cuemath.com/data/scatter-plot/)
Visualizing errors
Error bars are a graphical enhancement that visualizes the variability of the plotted data on a Cartesian
graph. Error bars can be applied to graphs to provide an additional layer of detail on the presented data.
A short error bar shows that values are concentrated, signaling that the plotted average value is more
likely, while a long error bar indicates that the values are more spread out and less reliable.
Also, depending on the type of data, the length of each pair of error bars tends to be equal on both sides;
however, if the data is skewed, then the lengths on each side would be unbalanced.
Error bars always run parallel to the quantitative scale axis, so they can be displayed either vertically or
horizontally depending on whether the quantitative scale is on the y-axis or the x-axis. If there are two
quantitative scales, two pairs of error bars can be used, one for each axis.
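A small sketch of error bars in Matplotlib; the data and error values are made up:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
yerr = 0.2 * np.ones_like(x)            # constant, symmetric errors
plt.errorbar(x, y, yerr=yerr, fmt='o')  # vertical error bars on each point
plt.show()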
A contour plot is a graphical method to visualize the 3-D surface by plotting constant Z slices called
contours in a 2-D format. The contour plot is an alternative to a 3-D surface plot
The independent variables are usually restricted to a regular grid. The actual techniques for determining the
correct iso-response values are rather complex and almost always computer-generated.
The contour plot is used to depict the change in Z values as compared to X and Y values. If the data (or
function) do not form a regular grid, you typically need to perform a 2-D interpolation to form a regular
grid.
For one variable data, a run sequence/ histogram is considered necessary. For two-variable data, a
scatter plot is considered necessary. Contour plots can also use polar coordinates (r, theta) instead of
traditional rectangular (x, y, z) coordinates.
● Rectangular Contour plot: A projection of a 2D plot onto a 2D rectangular canvas. It is the most common
form of contour plot.
● Polar Contour plot: The response variable here is the collection of values generated while passing r and
theta into the given function, where r is the distance from the origin and theta is the angle from the
positive x axis.
● Ternary contour plot: Ternary contour plot is used to represent the relationship between 3
explanatory variables and the response variable in the form of a filled triangle.
Contour plot can be plotted in different programming languages:
● Python/Matplotlib: Contour plots can be plotted using the plt.contour or plt.contourf functions,
where plt is matplotlib.pyplot. The difference between the two is that plt.contour generates a
hollow (line) contour plot, while plt.contourf generates a filled one.
● Matlab: functions such as contourf (2d-plot) and contour3 (3D-contour) can be used for contour
plotting
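A short Matplotlib sketch of both variants on a regular grid; the function z = sin(x) * cos(y) is arbitrary:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) * np.cos(Y)
plt.contour(X, Y, Z)      # hollow (line) contours
plt.contourf(X, Y, Z)     # filled contours
plt.colorbar()
plt.show()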
Histograms
Histogram is a variation of a bar chart in which data values are grouped together and put into different
classes. This grouping enables you to see how frequently data in each class occur in the dataset.
Among other things, a histogram reveals the skewness/variance of the dataset.
The features provide a strong indication of the proper distributional model in the data. The probability
plot or a goodness-of-fit test can be used to verify the distributional model.
Interpretations of Histogram:
● Normal Histogram: It is a classical bell-shaped histogram with most of the frequency counts focused
in the middle with diminishing tails and there is symmetry with respect to the median. Since the
normal distribution is most commonly observed in real-world scenarios, you are most likely to find
these. In a normally distributed histogram, the mean is almost equal to the median.
● Short-tailed/Long-tailed Histogram: In a short-tailed histogram, the tail approaches 0 very fast as we
move away from the median of the data; in a long-tailed histogram, the tail approaches 0 slowly as we
move far from the median. Here, we refer to the tail as the extreme regions in the histogram where
most of the data is not concentrated, on both sides of the peak.
● Bimodal Histogram: A mode of the data represents the most common value in the histogram (i.e. the
peak of the histogram). A bimodal histogram means that there are two peaks in the histogram. The
histogram can be used to test the unimodality of data. Bimodality (or, more generally, non-unimodality)
in the dataset suggests that there is something wrong with the process. A bimodal histogram may show
one or both of two characteristics: a bimodal normal distribution and a symmetric distribution.
● Skewed Left/Right Histogram: Skewed histograms are those where the tail on one side is quite clearly
longer than the tail on the other side. A right-skewed histogram means that the right-sided tail of the peak
is more stretched than its left, and vice versa for a left-skewed one. In a left-skewed histogram, the mean
is always less than the median, while in a right-skewed histogram the mean is greater than the median.
● Uniform Histogram: In a uniform histogram, each bin contains approximately the same number of
counts (frequency). An example of a uniform histogram is when a die is rolled n (n >> 30) times and the
frequency of each outcome is recorded.
● Normal Distribution with an Outlier: This histogram is similar to a normal histogram except that it
contains an outlier whose count/probability of occurrence is substantial. This is mostly due to some
system error in the process, which leads to the faulty generation of products, etc.
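A minimal sketch of a (roughly normal) histogram in Matplotlib; the random data is generated only for illustration:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(loc=0, scale=1, size=1000)  # synthetic, approximately normal data
plt.hist(data, bins=30)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()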
Legends
Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the data displayed
in the graph’s Y-axis. It generally appears as the box containing a small sample of each color on the
graph and a small description of what this data means.
The attribute bbox_to_anchor=(x, y) of legend() function is used to specify the coordinates of the
legend, and the attribute ncol represents the number of columns that the legend has. Its default value is
1.
Syntax:
matplotlib.pyplot.legend([“name1”, “name2”], bbox_to_anchor=(x, y), ncol=1)
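A small sketch of legend(), bbox_to_anchor, and ncol in use; the data and labels are arbitrary:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])
plt.plot([1, 2, 3], [1, 2, 3])
plt.legend(["squares", "line"], bbox_to_anchor=(1.0, 1.0), ncol=1)
plt.show()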
Before moving any further with Matplotlib let’s discuss some important classes that will be used further
in the tutorial. These classes are –
● Figure
● Axes
Note: Matplotlib takes care of the creation of inbuilt defaults like Figure and Axes.
Figure class
Consider the figure class as the overall window or page on which everything is drawn. It is a top-level
container that contains one or more axes. A figure can be created using the figure() method.
Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None, facecolor=None, edgecolor=None, linewidth=0.0,
frameon=None, subplotpars=None, tight_layout=None, constrained_layout=None)
Axes Class
Axes class is the most basic and flexible unit for creating sub-plots. A given figure may contain many
axes, but a given axes can only be present in one figure. The axes() function creates the axes object.
Syntax:
axes([left, bottom, width, height])
Just like the pyplot class, the axes class also provides methods for adding titles, legends, limits, labels, etc.
Let's see a few of them –
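A brief sketch of a Figure with one Axes and a few of these methods (set_title, set_xlabel, set_ylabel, set_xlim, legend); the data is arbitrary:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4, 3))
ax = plt.axes()
ax.plot([1, 2, 3], [2, 4, 6], label="demo line")
ax.set_title("Axes methods demo")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_xlim(0, 4)
ax.legend()
plt.show()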
Colors
● Qualitative palettes
● Sequential palettes
● Diverging palettes
The type of color palette that you use in a visualization depends on the nature of the data mapped to
color.
Qualitative palettes
A qualitative palette is used when the variable is categorical in nature. Categorical variables are those
that take on distinct labels without inherent ordering. Examples include country or state, race, and
gender. Each possible value of the variable is assigned one color from a qualitative palette.
Sequential palettes
When the variable assigned to be colored is numeric or has inherently ordered values, then it can be
depicted with a sequential palette. Colors are assigned to data values in a continuum, usually based on
lightness, hue, or both.
Diverging palettes
If our numeric variable has a meaningful central value, like zero, then we can apply a diverging
palette. A diverging palette is essentially a combination of two sequential palettes with a shared
endpoint sitting at the central value. Values larger than the center are assigned to colors on one
side of the center, while smaller values get assigned to colors on the opposing side.
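A hedged Seaborn sketch of the three palette families; the palette names used (Set2, Blues, RdBu) are common built-ins chosen for illustration:
import seaborn as sns
qualitative = sns.color_palette("Set2")   # distinct colors for categories
sequential  = sns.color_palette("Blues")  # ordered light-to-dark colors
diverging   = sns.color_palette("RdBu")   # two sequential ramps around a center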
Subplots:
Using subplot() method.
This method adds another plot at the specified grid position in the current figure.
Syntax:
subplot(nrows, ncols, index, **kwargs)
subplot(pos, **kwargs)
subplot(ax)
Syntax:
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True,
subplot_kw=None, gridspec_kw=None, **fig_kw)
Using subplot2grid() method
This function creates axes object at a specified location inside a grid and also helps in spanning the axes
object across multiple rows or columns. In simpler words, this function is used to create multiple charts
within the same figure.
Syntax:
plt.subplot2grid(shape, location, rowspan, colspan)
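A short sketch of the three approaches; the plotted data is arbitrary:
import matplotlib.pyplot as plt
# subplot(): 1 row, 2 columns, activate each cell in turn
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3])
plt.subplot(1, 2, 2)
plt.plot([3, 2, 1])
plt.show()
# subplots(): create the whole grid of axes at once
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0, 0].plot([1, 2, 3])
plt.show()
# subplot2grid(): span an axes across several columns of the grid
ax_wide  = plt.subplot2grid((3, 3), (0, 0), colspan=3)
ax_small = plt.subplot2grid((3, 3), (1, 0))
plt.show()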
Mathematically, such coordinate transformations are relatively straightforward, and Matplotlib has a
well-developed set of tools that it uses internally to perform them (these tools can be explored in
the matplotlib.transforms submodule).
There are three pre-defined transforms that can be useful in this situation:
● ax.transData: Transform associated with data coordinates
● ax.transAxes: Transform associated with the axes (in units of axes dimensions)
● fig.transFigure: Transform associated with the figure (in units of figure dimensions)
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels.
The transAxes coordinates give the location from the bottom-left corner of the axes (here the white
box), as a fraction of the axes size. The transFigure coordinates are similar, but specify the position from
the bottom-left of the figure (here the gray box), as a fraction of the figure size.
Arrows and text annotations can be drawn with the plt.arrow() and plt.annotate() functions.
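A minimal sketch of plt.annotate() with an arrow; the curve, text, and points are arbitrary:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.annotate("local maximum", xy=(np.pi / 2, 1), xytext=(4, 1.3),
             arrowprops=dict(arrowstyle="->"))
plt.ylim(-1.5, 1.6)
plt.show()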
Introduction to 3D Plotting with Matplotlib
We will be defining the axes of our 3D plot, where we specify that the projection of the plot will be in
“3D” format. This helps us to create the 3D empty axes figure in the canvas. After this, if we show the
plot using plt.show(), then it would look like the one shown in the output.
Example: Creating an empty 3D figure using Matplotlib
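A sketch of such an empty 3D figure:
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d   # registers the 3D projection
fig = plt.figure()
ax = plt.axes(projection="3d")     # empty 3D axes
plt.show()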
● In the above example, first, we are importing packages from the Python library in order to have a 3D
plot in our empty canvas. So, for that, we are importing numpy, matplotlib.pyplot,
and mpl_toolkits.
● After importing all the necessary packages, we are creating an empty figure using plt.figure().
● After that, we are defining the axis of the plot where we are specifying that the plot will be of 3D
projection.
● After that, we are taking 3 arrays with a wide range of arbitrary points which will act as X, Y, and Z
coordinates for plotting the graph respectively. Now after initializing the points, we are plotting a 3D
plot using ax.plot3D() where we are using x,y,z as the X, Y, and Z coordinates respectively and the
color of the line will be red.
● Similarly, if we plot the same points with ax.scatter3D() and map the coordinate values to a colormap,
then for each point, the color of the points will be based on the values that the coordinates contain
(a sketch of this example follows the list).
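A hedged sketch of the line-and-scatter example described in the bullets above; the coordinate arrays are arbitrary:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection="3d")
z = np.linspace(0, 15, 500)
x = np.sin(z)
y = np.cos(z)
ax.plot3D(x, y, z, "red")                   # 3D line plot in red
ax.scatter3D(x, y, z, c=z, cmap="viridis")  # 3D scatter colored by the z values
plt.show()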
One common type of visualization in data science is that of geographic data. Matplotlib's main tool for
this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits which lives
under the mpl_toolkits namespace.
Installation of Basemap is straightforward; if you're using conda you can type this and the package will
be downloaded:
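The command referred to here is presumably:
conda install basemap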
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
# orthographic projection centered on latitude 50, longitude -100
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
# draw NASA's "Blue Marble" image as the map background
m.bluemarble(scale=0.5);
Map Projections
Depending on the intended use of the map projection, there are certain map features (e.g., direction,
area, distance, shape, or other considerations) that are useful to maintain.
Cylindrical projections
This type of mapping represents equatorial regions quite well, but results in extreme distortions near
the poles. The spacing of latitude lines varies between different cylindrical projections, leading to
different conservation properties, and different distortion near the poles. In the following figure we
show an example of the equidistant cylindrical projection, which chooses a latitude scaling that
preserves distances along meridians. Other cylindrical projections are the Mercator (projection='merc')
and the cylindrical equal area (projection='cea') projections.
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain
vertical; this can give better properties near the poles of the projection. The Mollweide projection
(projection='moll') is one common example of this, in which all meridians are elliptical arcs. It is
constructed so as to preserve area across the map: though there are distortions near the poles, the area
of small patches reflects the true area. Other pseudo-cylindrical projections are the sinusoidal
(projection='sinu') and Robinson (projection='robin') projections.
Perspective projections
One common example is the orthographic projection (projection='ortho'), which shows one side of the
globe as seen from a viewer at a very long distance. As such, it can show only half the globe at a time.
Other perspective-based projections include the gnomonic projection (projection='gnom') and
stereographic projection (projection='stere'). These are often the most useful for showing small portions
of the map.
Conic projections
One example of this is the Lambert Conformal Conic projection (projection='lcc'), which we saw earlier
in the map of North America. It projects the map onto a cone arranged in such a way that two standard
parallels (specified in Basemap by lat_1 and lat_2) have well-represented distances, with scale
decreasing between them and increasing outside of them. Other useful conic projections are the
equidistant conic projection (projection='eqdc') and the Albers equal-area projection (projection='aea').
Conic projections, like perspective projections, tend to be good choices for representing small to
medium patches of the globe.
Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto a
map background. For simple plotting and text, any plt function works on the map; you can use
the Basemap instance to project latitude and longitude coordinates to (x, y) coordinates for plotting
with plt, as we saw earlier in the Seattle example.
In addition to this, there are many map-specific functions available as methods of the Basemap instance.
These work very similarly to their standard Matplotlib counterparts, but have an additional Boolean
argument latlon, which if set to True allows you to pass raw latitudes and longitudes to the method,
rather than projected (x, y) coordinates.
(https://www.geeksforgeeks.org/working-with-geospatial-data-in-python/)
(https://www.geeksforgeeks.org/python-seaborn-tutorial/)