DS Assignment
Data Exploration
Submitted To
Dr. Tenvir Ali
Submitted By
Shahbaz Ahmad
Roll No
7
Data Exploration
Data exploration, also known as exploratory data analysis (EDA), is the initial step in the data
science process. It involves examining and analyzing a dataset to understand its underlying
structure, patterns, and characteristics. Data exploration can be broadly divided into two
types:
Descriptive statistics
Data visualization
Descriptive Statistics
Descriptive statistics play a crucial role in data exploration, helping to understand the
characteristics of the data and identify potential patterns or trends.
Data Visualization
Visualization is the process of projecting the data, or parts of it, into a
multi-dimensional space or abstract images. All the commonly used charts
fall under this category.
Data exploration in the context of data science uses both descriptive statistics and
visualization techniques.
Dataset
A dataset is a collection of data, typically in a table or spreadsheet format, containing
information about a particular phenomenon or set of phenomena. It consists of rows and columns.
Each row represents a single data point, and each column represents a variable or feature of
that data point. There are a few classic datasets that are simple to understand and easy to
explain. The Iris dataset is a classic and widely used dataset in machine learning and data
analysis. It was introduced by Ronald Fisher in 1936 and is also known as Fisher's Iris
dataset. The dataset contains 150 samples from three species of iris flowers (Iris setosa, Iris
virginica, and Iris versicolor). Each sample is described by four features: sepal length,
sepal width, petal length, and petal width.
Types of Data
Data comes in different formats and types. Understanding the properties of each attribute or
feature provides information about what kind of operations can be performed on that attribute.
For example, the temperature in weather data can be expressed in any of the following
formats:
Numeric values in centigrade (31 °C, 33.3 °C), in Fahrenheit (100 °F, 101.45 °F), or
on the Kelvin scale
Ordered labels, as in hot, mild, or cold
Number of days within a year below 0 °C (10 days in a year below freezing)
A few of these data types can be converted into other types.
Numeric or Continuous
Numeric or continuous data types are variables that can take on any value within a certain
range or interval. They are typically measured or quantified, and can be represented by
numbers or numerical values. Temperature expressed in Centigrade or Fahrenheit is numeric
and continuous because it can be denoted by numbers and take an infinite number of values
between digits.
An integer is a special form of the numeric data type which does not have decimals in the
value or, more precisely, does not have infinite values between consecutive numbers. Integers
usually denote a count of something: the number of days with a temperature below 0 °C, the
number of orders, the number of children in a family, etc. If a zero point is defined, numeric
data become a ratio or real data type. Examples include temperature on the Kelvin scale, bank
account balance, and income. Both integer and ratio data types are categorized as a numeric
data type in most data science tools.
Categorical or Nominal
Categorical data types are attributes treated as distinct symbols or just names. The color of the
iris of the human eye is a categorical data type because it takes a value like black, green, blue,
gray, etc. There is no direct relationship among the data values, and hence, mathematical
operators except the logical or “is equal” operator cannot be applied. They are also called a
nominal or polynominal data type, derived from the Latin word for name.
An ordered nominal data type is a special case of a categorical data type where there is some
kind of order among the values. An example of an ordered data type is temperature expressed
as hot, mild, cold.
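As a minimal sketch of converting between these types, the snippet below turns numeric temperatures into ordered labels and into an integer count. The function name, the thresholds, and the sample readings (apart from 31 and 33.3 °C, which appear above) are illustrative assumptions, not part of the original material.

```python
def to_ordered_label(temp_c, cold_below=10.0, hot_from=25.0):
    """Map a numeric temperature in degrees C to an ordered label.

    The two thresholds are assumed values chosen only for illustration.
    """
    if temp_c < cold_below:
        return "cold"
    if temp_c < hot_from:
        return "mild"
    return "hot"


# The explicit ordering is what makes this an ordered nominal type.
LABEL_ORDER = ["cold", "mild", "hot"]

readings_c = [31.0, 33.3, 18.5, -2.0, 4.0]  # numeric (continuous) readings

# Numeric -> ordered nominal
labels = [to_ordered_label(t) for t in readings_c]

# Numeric -> integer count (readings below freezing)
readings_below_freezing = sum(1 for t in readings_c if t < 0)

print(labels)
print(readings_below_freezing)
```

The conversion only works in one direction: once a reading is stored as "hot", the original numeric value cannot be recovered.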
Descriptive Statistics
Descriptive statistics refers to the study of the aggregate quantities of a
dataset. These measures are among the most commonly used in
everyday life. Descriptive statistics provide a concise summary of the data, including measures
of central tendency (mean, median, mode), measures of variability (range, variance, standard
deviation), and the data distribution (shape, skewness, kurtosis).
Descriptive statistics can be broadly classified into univariate and multivariate exploration.
Univariate Exploration
Univariate data exploration denotes the analysis of one attribute at a time. In the example Iris
dataset, each species, such as I. setosa, has 50 observations and 4 attributes. Here, some of the
descriptive statistics for the sepal length attribute are explored.
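The measures of central tendency (mean, median, mode) can be computed with Python's standard `statistics` module, as in the short sketch below. The ten sepal length values are a small sample assumed for illustration, not the full 50 observations of a species.

```python
import statistics

# A small illustrative sample of sepal lengths in cm (assumed values,
# not the complete set of observations for any species).
sepal_length_cm = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

mean_value = statistics.mean(sepal_length_cm)      # arithmetic average
median_value = statistics.median(sepal_length_cm)  # middle of the sorted values
modes = statistics.multimode(sepal_length_cm)      # all most-frequent values

print(f"mean   = {mean_value:.2f}")
print(f"median = {median_value}")
print(f"modes  = {sorted(modes)}")
```

`multimode` is used rather than `mode` because a sample can have several equally frequent values, as this one does.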
Measure Of Spread
In desert regions, it is common for the temperature to rise above 110 °F during the day and
drop below 30 °F during the night, while the average temperature for a 24-hour period is
around 70 °F. Obviously, the experience of living in the desert is not the same as living in a
tropical region with the same average daily temperature of around 70 °F, where the temperature
within the day stays between 60 °F and 80 °F. What matters here is not just the central location
of the temperature, but the spread of the temperature. There are two common metrics to quantify
spread.
Range: The range is the difference between the maximum value and the minimum value of
the attribute. In the example, the range for the temperature in the desert is 80 °F and the range
for the tropics is 20 °F. The desert region experiences larger temperature swings, as indicated
by the range.
Deviation: The variance and standard deviation measure the spread by considering all the
values of the attribute. Deviation is simply measured as the difference between any given
value ($x_i$) and the mean of the sample ($\mu$). The variance is the sum of the squared
deviations of all data points divided by the number of data points. For a dataset with $N$
observations, the variance is given by the following equation:
$$s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
Standard deviation is the square root of the variance. Since the standard deviation is measured
in the same units as the attribute, it is easy to understand the magnitude of the metric. High
standard deviation means the data points are spread widely around the central point. Low
standard deviation means data points are closer to the central point. Fig. 3.2 provides the
univariate summary of the Iris dataset with all 150 observations, for each of the four numeric
attributes.
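The range, variance, and standard deviation described in this section can be sketched in a few lines of Python. The two temperature lists are made-up readings, chosen only so that both have a mean of 70 °F and ranges of 80 °F and 20 °F as in the desert and tropics example; the helper names are assumptions.

```python
import math
import statistics

# Illustrative temperature readings in degrees F (assumed values).
desert_f = [30, 45, 70, 95, 110]
tropics_f = [60, 65, 70, 75, 80]


def value_range(values):
    """Difference between the maximum and the minimum value."""
    return max(values) - min(values)


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


for name, temps in (("desert", desert_f), ("tropics", tropics_f)):
    var = population_variance(temps)
    std = math.sqrt(var)  # same unit as the attribute itself
    print(f"{name}: range={value_range(temps)}, variance={var:.1f}, std={std:.2f}")

# The standard library offers the same population measure directly.
assert population_variance(desert_f) == statistics.pvariance(desert_f)
```

Both lists have the same mean, yet the desert readings have a much larger standard deviation. Note that many tools divide by N - 1 instead of N (the sample variance, `statistics.variance`), which gives slightly larger values for small samples.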
Multivariate Exploration
Multivariate exploration is the study of more than one attribute in the dataset simultaneously.
This technique is critical to understanding the relationship between the attributes, which is
central to data science methods. As in univariate exploration, measures of central
tendency and variance in the data are discussed. For the Iris dataset, the mean of each
attribute for each species is:
1. Iris Setosa:
Sepal length: 5.006
Sepal width: 3.418
Petal length: 1.464
Petal width: 0.244
2. Iris Versicolor:
Sepal length: 5.936
Sepal width: 2.770
Petal length: 4.260
Petal width: 1.326
3. Iris Virginica:
Sepal length: 6.588
Sepal width: 2.974
Petal length: 5.552
Petal width: 2.026
These centroids represent the average values for each feature (sepal length, sepal width, petal
length, and petal width) for each species of iris flower. They can be used as a reference point
for comparison and classification.
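To illustrate how centroids can serve as reference points, the sketch below computes a centroid as the column-wise mean of a set of rows and assigns a new observation to the species with the nearest centroid by Euclidean distance. The centroid values are the ones listed above; the function names and the two test flowers are hypothetical, and nearest-centroid assignment is only one simple way to use these values.

```python
import math

# Centroids from the text: (sepal length, sepal width, petal length, petal width)
CENTROIDS = {
    "Iris setosa": (5.006, 3.418, 1.464, 0.244),
    "Iris versicolor": (5.936, 2.770, 4.260, 1.326),
    "Iris virginica": (6.588, 2.974, 5.552, 2.026),
}


def centroid(rows):
    """Column-wise mean of a list of equally long numeric rows."""
    return tuple(sum(col) / len(col) for col in zip(*rows))


def nearest_centroid(sample):
    """Return the species whose centroid is closest to the sample."""
    return min(CENTROIDS, key=lambda species: math.dist(sample, CENTROIDS[species]))


# Hypothetical measurements, not taken from the dataset.
flower_a = (5.0, 3.4, 1.5, 0.2)
flower_b = (6.0, 2.8, 4.5, 1.4)

print(nearest_centroid(flower_a))
print(nearest_centroid(flower_b))
```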
Correlation
Correlation measures the statistical relationship between two attributes, particularly the
dependence of one attribute on another. When two attributes are highly correlated with each
other, they vary at the same rate, either in the same or in opposite directions.
Correlation is typically measured using the correlation coefficient (ρ or r), which ranges from
-1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation coefficient of
0 indicates no linear relationship. However, it is important to note that correlation does not
imply causation. In other words, just because two variables are correlated, it does not mean that
one causes the other.
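The text does not say which coefficient is meant; a common choice is the Pearson correlation coefficient, sketched below under that assumption. The function name and the three toy datasets are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # rises with x
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # falls as x rises

# y = x**2 on a symmetric range: fully dependent on x, but not linearly.
r_nonlinear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_positive, r_negative, r_nonlinear)
```

The last case shows why a coefficient of 0 only rules out a linear relationship: the second attribute is completely determined by the first, yet the coefficient is zero.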
Data Visualization
Visualizing data is one of the most important techniques of data discovery and exploration.
Though visualization is not considered a data science technique, terms like visual mining or
pattern discovery based on visuals are increasingly used in the context of data science,
particularly in the business world. The discipline of data visualization encompasses the
methods of expressing data in an abstract visual form. The motivation for using data
visualization includes:
Comprehension of dense information: A simple visual chart can easily include
thousands of data points. By using visuals, the user can see the big picture, as well as
longer-term trends that are extremely difficult to interpret purely by expressing data in
numbers.
Relationships: Visualizing data in Cartesian coordinates enables exploration of the
relationships between the attributes. Although representing more than three attributes
on the x, y, and z-axes is not feasible in Cartesian coordinates, there are a few
creative solutions available by changing properties like the size, color, and shape of
data markers or using flow maps (Tufte, 2001), where more than two attributes are
used in a two-dimensional medium.
As with descriptive statistics, visualization techniques are also categorized into univariate
visualization, multivariate visualization, and visualization of a large number of attributes using
parallel dimensions.
Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. The
techniques discussed in this section give an idea of how the attribute values are distributed
and the shape of the distribution.
Histogram
A histogram is one of the most basic visualization techniques to understand the frequency of
the occurrence of values. It shows the distribution of the data by plotting the frequency of
occurrence in a range. In a histogram, the attribute under inquiry is shown on the horizontal
axis and the frequency of occurrence is on the vertical axis. For a continuous numeric data
type, the bin width used to group a range of values needs to be specified. For example,
in the case of human height in centimeters, all the occurrences between 152.00 and 152.99 are
grouped under 152. There is no optimal number of bins or bin width that works for all the
distributions. If the bin width is too small, the distribution becomes more precise but reveals
the noise due to sampling. A general rule of thumb is to have a number of bins equal to the
square root or cube root of the number of data points.
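The binning step behind a histogram can be sketched without any plotting library. The snippet below groups heights by their whole-centimeter value, as in the 152 cm example, and also builds equal-width bins using the square-root rule of thumb. The height values and function name are assumptions made for illustration.

```python
import math
from collections import Counter

# Illustrative heights in cm (assumed values).
heights_cm = [150.0, 152.3, 152.9, 155.0, 158.0, 160.0, 161.0, 163.0, 170.0]

# Grouping as in the text: every value from 152.00 to 152.99 counts under 152.
floor_counts = Counter(math.floor(h) for h in heights_cm)


def equal_width_histogram(values, n_bins=None):
    """Return (bin_edges, counts) for equal-width bins.

    If n_bins is not given, the square-root rule of thumb is used.
    """
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(len(values))))
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:  # all values identical: fall back to a unit-wide bin
        width = 1.0
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts


edges, counts = equal_width_histogram(heights_cm)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {'#' * count}")
```

Passing a larger `n_bins` makes each bar narrower and the counts noisier, which is the trade-off described above.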
Quartile
A quartile is a statistical term that refers to one of three values that divide a dataset or a
distribution into four equal parts, each containing 25% of the data. The quartiles are:
1. First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of
the data points fall.
2. Second Quartile (Q2): Also known as the 50th percentile or the median, it is the value
below which 50% of the data points fall.
3. Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of
the data points fall.
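As a small sketch, the three quartiles can be obtained with `statistics.quantiles` from the Python standard library (available since Python 3.8). The seven data values are made up for illustration.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # illustrative values

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")

# The second quartile is the median.
assert q2 == statistics.median(data)
```

Different tools interpolate between data points in slightly different ways, so Q1 and Q3 may differ a little from one package to another on small datasets; `statistics.quantiles` exposes this choice through its `method` parameter.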
Distribution Chart
A distribution chart, also known as a distribution plot or frequency distribution graph, is a
graphical representation of the distribution of a dataset or a statistical population. It shows the
number of observations or frequencies against the values of a variable, typically in a graphical
format. Distribution charts help to:
Multivariate Visualization
The multivariate visual exploration considers more than one attribute in the same visual. The
techniques discussed in this section focus on the relationship of one attribute with another
attribute. These visualizations examine two to four attributes simultaneously.
Scatterplot
In multivariate data visualization, scatterplots can be used to visualize the relationship
between multiple variables by coloring the points or adding shapes or sizes.
Scatter Multiple
A scatter multiple is an enhanced form of a simple scatterplot where more than two
dimensions can be included in the chart and studied simultaneously. The primary attribute is
used for the x-axis coordinate. The secondary axis is shared with more attributes or
dimensions. In this example, the values on the y-axis are shared between sepal length,
sepal width, and petal width. Here, sepal length is represented by data points occupying
the topmost part of the chart, sepal width occupies the middle portion, and petal width is in
the bottom portion. Note that the data points are duplicated for each attribute on the y-axis.
Data points are color-coded for each dimension on the y-axis, while the x-axis is anchored with
one attribute, petal length. All the attributes sharing the y-axis should be of the same unit or
normalized.
Scatter Matrix
A scatter matrix, also known as a pairs plot or splom (scatter plot matrix), is a graphical
representation of the relationship between multiple variables in a dataset. It displays the
distribution of each variable against every other variable, using scatter plots or other
visualization techniques. A scatter matrix typically consists of a matrix of scatter plots, where:
The diagonal cells show the distribution of each variable (often as a histogram or density
plot).
The off-diagonal cells display the scatter plots of each variable against every other variable.
Bubble chart
A bubble chart, also known as a bubble plot or scatter plot with bubbles, is a type of data
visualization used in data exploration to display three dimensions of data.
Each bubble represents a single data point, with the x and y coordinates determining its
position and the z-value determining its size. This allows for the visualization of relationships
between three variables at once.
Density Chart
A density chart, also known as a density plot or kernel density estimate (KDE) plot, is a
graphical representation of the distribution of a continuous variable. It shows the density of
the data points at different values, creating a smooth curve that estimates the underlying
distribution of the data. Density charts are particularly useful when dealing with large
datasets, working with continuous variables, or visualizing complex distributions.
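The smooth curve of a density chart can be sketched as a kernel density estimate. The snippet below assumes a Gaussian kernel and a hand-picked bandwidth; the function name, the sample values, and the bandwidth of 0.2 are illustrative choices, not values from the text.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x.

    Each sample contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Illustrative sample of a continuous variable (assumed values).
samples = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
density = gaussian_kde(samples, bandwidth=0.2)

# Evaluate the curve on a grid; these points are what a density chart plots.
step = 0.01
grid = [3.4 + i * step for i in range(301)]
curve = [density(x) for x in grid]

# A density should integrate to about 1 over its range.
area = sum(curve) * step
print(f"approximate area under the curve: {area:.3f}")
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one follows the individual samples more closely, much like the bin width of a histogram.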
Parallel Chart
A parallel chart visualizes a data point quite innovatively by transforming or projecting multi-
dimensional data into a two-dimensional chart medium. In this chart, every attribute or
dimension is linearly arranged in one coordinate (x-axis) and all the measures are arranged in
the other coordinate (y-axis). Since the x-axis is multivariate, each data point is represented as
a line in a parallel space. In the case of the Iris dataset, all four attributes are arranged along
the x-axis. The y-axis represents a generic distance and it is “shared” by all these attributes on
the x-axis. Hence, parallel charts work only when attributes share a common unit of
numerical measure or when the attributes are normalized. This visualization is called a
parallel axis because all four attributes are represented in four parallel axes parallel to the y-
axis.
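Because a parallel chart needs all attributes on a common scale, a normalization step usually comes first. The sketch below assumes min-max normalization to the range 0 to 1 (one of several possible choices) and uses the three species centroids from the multivariate exploration section as sample rows; the function and variable names are hypothetical.

```python
ATTRIBUTES = ["sepal length", "sepal width", "petal length", "petal width"]

# Sample rows: the species centroids listed earlier in the document.
rows = [
    (5.006, 3.418, 1.464, 0.244),  # Iris setosa
    (5.936, 2.770, 4.260, 1.326),  # Iris versicolor
    (6.588, 2.974, 5.552, 2.026),  # Iris virginica
]


def min_max_normalize(rows):
    """Rescale every column to the range 0..1 (constant columns become 0)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


normalized = min_max_normalize(rows)

# Each data point becomes one line: a y-value on every parallel axis.
for row in normalized:
    line = ", ".join(f"{name}={value:.2f}" for name, value in zip(ATTRIBUTES, row))
    print(line)
```

After normalization, the position of a line on each axis shows where the point lies between that attribute's minimum and maximum, so attributes measured on different scales can share the y-axis.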
Deviation Chart
A deviation chart is very similar to a parallel chart, as it has parallel axes for all the attributes
on the x-axis. Data points are extended across the dimensions as lines and there is one
common y-axis. Instead of plotting all data lines, deviation charts only show the mean and
standard deviation statistics. For each class, deviation charts show the mean line connecting
the mean of each attribute; the standard deviation is shown as the band above and below the
mean line. The mean line does not have to correspond to a data point (line). With this method,
information is elegantly displayed, and the essence of a parallel chart is maintained.
Andrews Curves
For the Iris dataset, Andrews curves can be used to visualize the four features (sepal
length, sepal width, petal length, and petal width) of the three species (Setosa, Versicolor, and
Virginica) in a two-dimensional plot. By examining the Andrews curves, you can identify
patterns and relationships between the features and species, such as:
Sepal length and petal length are highly correlated for the Setosa species.
Sepal width and petal width are highly correlated for the Versicolor species.
The Virginica species has a more variable and dispersed distribution of features.
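The text does not give the formula behind the curves. In the usual definition, a data point with features x1, x2, x3, x4 is mapped to the function f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) for t between -π and π, so every observation becomes one curve. The sketch below assumes that definition and evaluates it for the Setosa centroid listed earlier; the function name is hypothetical.

```python
import math


def andrews_curve(features, t):
    """Evaluate the Andrews curve of one data point at angle t.

    Terms follow the pattern x1/sqrt(2), x2*sin(t), x3*cos(t),
    x4*sin(2t), x5*cos(2t), and so on.
    """
    value = features[0] / math.sqrt(2)
    for i, x in enumerate(features[1:], start=1):
        harmonic = (i + 1) // 2
        if i % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value


# Setosa centroid from the multivariate exploration section.
setosa = (5.006, 3.418, 1.464, 0.244)

# Sample the curve at evenly spaced angles between -pi and pi.
n_points = 9
angles = [-math.pi + i * (2 * math.pi) / (n_points - 1) for i in range(n_points)]
curve = [andrews_curve(setosa, t) for t in angles]

for t, y in zip(angles, curve):
    print(f"t={t:+.2f}  f(t)={y:.3f}")
```

Observations with similar feature values produce curves that stay close together, which is why curves of the same species tend to form visible bundles.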
Step 3: Data Visualization
3.1. Explore individual variables (univariate analysis)
* Visualize distributions and patterns
* Identify outliers and anomalies
3.2. Explore relationships between variables (bivariate analysis)
* Visualize correlations and patterns
* Identify relationships and interactions