DS Assignment

The document discusses data exploration, a critical initial step in data science that involves analyzing datasets to understand their structure and characteristics through descriptive statistics and data visualization. It outlines the objectives of data exploration, types of data, and methods of descriptive statistics, including univariate and multivariate exploration, using the Iris dataset as a primary example. Additionally, it covers various visualization techniques to comprehend data relationships and distributions effectively.

Uploaded by

aqsachishti5892

Assignment

Data Exploration

Submitted To
Dr. Tenvir Ali

Submitted By
Shahbaz Ahmad

Roll No
7

MSCS 1st Semester

Department of Computer Science & IT


The Islamia University of Bahawalpur,
Bahawalnagar Campus

Data Exploration
Data exploration, also known as exploratory data analysis (EDA), is the initial step in the data
science process. It involves examining and analyzing a dataset to understand its underlying
structure, patterns, and characteristics. Data exploration can be broadly divided into two
types:
● Descriptive statistics
● Data visualization

Descriptive Statistics
Descriptive statistics play a crucial role in data exploration, helping to understand the
characteristics of the data and identify potential patterns or trends.

Data Visualization
Visualization is the process of projecting the data or parts
of it into multi-dimensional space or abstract images. All the useful charts
fall under this category.
Data exploration in the context of data science uses both descriptive statistics and
visualization techniques.

Objectives of Data Exploration


The primary objectives of data exploration are:
1. Data understanding: Data exploration provides a high-level overview of each
attribute in the dataset and of the interactions between attributes. It helps
answer questions such as: what is the typical value of an attribute, how much do the
data points differ from the typical value, and are there extreme values?
2. Data preparation: Before a data science algorithm is applied, the dataset has to be
prepared to handle any anomalies that may be present in the data. These
anomalies include outliers, missing values, and highly correlated attributes. Some data
science algorithms do not work well when input attributes are correlated with each
other, so correlated attributes need to be identified and, where appropriate, removed.
3. Data science tasks: Basic data exploration can sometimes substitute for the entire data
science process. For example, scatterplots can identify clusters in low-dimensional
data or can help develop regression or classification models with simple visual rules.
4. Interpreting the results: Finally, data exploration is used to understand the
prediction, classification, and clustering results of the data science process. Histograms help to
comprehend the distribution of an attribute and can also be useful for visualizing
numeric predictions and error rate estimates.

Dataset
A dataset is a collection of data, typically in a table or spreadsheet format, containing
information about a particular phenomenon. It consists of rows and columns.
Each row represents a single data point, and each column represents a variable or feature of
that data point. A few classic datasets are simple to understand and easy to
explain. The Iris dataset is one of them and is widely used in machine learning and data
analysis. It was introduced by Ronald Fisher in 1936 and is also known as Fisher's Iris
dataset. The dataset contains 150 samples from three species of iris flowers (Iris setosa, Iris
virginica, and Iris versicolor). Each sample is described by four features:

1. Sepal length (cm)
2. Sepal width (cm)
3. Petal length (cm)
4. Petal width (cm)

The Iris dataset is often used for:
1. Classification tasks (e.g., predicting the species of an iris flower based on its features)
2. Clustering analysis (e.g., grouping similar iris flowers together)
3. Dimensionality reduction (e.g., reducing the number of features while preserving important
information)
4. Data visualization (e.g., exploring the relationships between features).
This dataset is free and publicly available at the UCI Machine Learning Repository. The
Iris dataset is used for learning data science mainly because it is simple to understand and explore,
and it can be used to illustrate how different data science algorithms approach the same problem on a
standard dataset.

Types of Data
Data comes in different formats and types. Understanding the properties of each attribute or
feature provides information about what kind of operations can be performed on that attribute.
For example, the temperature in weather data can be expressed in any of the following
formats:
● Numeric, in centigrade (31 °C, 33.3 °C), in Fahrenheit (100 °F, 101.45 °F), or
on the Kelvin scale
● Ordered labels, as in hot, mild, or cold
● Number of days within a year below 0 °C (10 days in a year below freezing)
A few of these data types can be converted into another type.

Numeric or Continuous
Numeric or continuous data types are variables that can take on any value within a certain
range or interval. They are typically measured or quantified, and can be represented by
numbers or numerical values. Temperature expressed in Centigrade or Fahrenheit is numeric
and continuous because it can be denoted by numbers and take an infinite number of values
between digits.
An integer is a special form of the numeric data type that does not have decimals in the
value or, more precisely, does not have infinite values between consecutive numbers. Integers usually
denote a count of something: number of days with temperature less than 0 °C, number of
orders, number of children in a family, etc. If a zero point is defined, numeric data become a
ratio or real data type. Examples include temperature on the Kelvin scale, bank account balance,
and income. Both integer and ratio data types are categorized as a numeric data type in most
data science tools.

Categorical or Nominal
Categorical data types are attributes treated as distinct symbols or just names. The color of the
iris of the human eye is a categorical data type because it takes a value like black, green, blue,
gray, etc. There is no direct relationship among the data values, and hence, mathematical
operators except the logical or “is equal” operator cannot be applied. They are also called a
nominal or polynominal data type, derived from the Latin word for name.
An ordered nominal data type is a special case of a categorical data type where there is some
kind of order among the values. An example of an ordered data type is temperature expressed
as hot, mild, cold.
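
As a minimal sketch of converting between these data types, the snippet below turns a numeric temperature into an ordered label and into a count of days below freezing. The function name `temperature_label`, the two thresholds, and the sample readings are assumptions made for illustration; they do not come from the text.

```python
# Ordered labels, from lowest to highest; the order is what makes this an
# ordered nominal type rather than a plain categorical one.
ORDERED_LABELS = ["cold", "mild", "hot"]


def temperature_label(celsius, cold_below=10.0, hot_from=30.0):
    """Map a numeric temperature to an ordered label.

    The thresholds are hypothetical and chosen only for this example.
    """
    if celsius < cold_below:
        return "cold"
    if celsius >= hot_from:
        return "hot"
    return "mild"


# Hypothetical daily readings in degrees centigrade.
readings = [31.0, 33.3, -2.0, 18.5]

# Numeric -> ordered label.
labels = [temperature_label(c) for c in readings]

# Numeric -> integer count (number of days below 0 degrees centigrade).
days_below_freezing = sum(1 for c in readings if c < 0)

print(labels, days_below_freezing)
```

The conversion only goes one way: once a reading is stored as "hot", the original numeric value cannot be recovered.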

Descriptive Statistics
Descriptive statistics refers to the study of the aggregate quantities of a
dataset. Many of these measures are familiar from
everyday life. Descriptive statistics provide a concise summary of the data, including measures of central
tendency (mean, median, mode), measures of variability (range, variance, standard deviation),
and the shape of the data distribution (skewness, kurtosis).
Descriptive statistics can be broadly classified into univariate and multivariate exploration.

Univariate Exploration
Univariate data exploration denotes the analysis of one attribute at a time. In the example Iris
dataset, one species, I. setosa, has 50 observations and 4 attributes. Some of the
descriptive statistics for its sepal length attribute are explored here.

Measure Of Central Tendency


The objective of finding the central location of an attribute is to quantify the dataset with one
central or most common number.
● Mean: The mean is the arithmetic average of all observations in the dataset. It is calculated
by summing all the data points and dividing by the number of data points. The mean for sepal
length in centimeters is 5.0060.
● Median: The median is the value of the central point in the distribution. The median is
calculated by sorting all the observations from small to large and selecting the mid-point
observation in the sorted list. If the number of data points is even, then the average of the
middle two data points is used as the median. The median for sepal length in centimeters is
5.0000.
● Mode: The mode is the most frequently occurring observation. In the dataset, data points
may be repetitive, and the most repetitive data point is the mode of the dataset. In this
example, the mode in centimeters is 5.1000.
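
The three measures above can be sketched with Python's standard `statistics` module. The sample below is a short illustrative list of sepal lengths, not the full 50-observation I. setosa data, so its results differ from the figures quoted in the text; the helper name `central_tendency` is an assumption.

```python
from statistics import mean, median, multimode

# Small illustrative sample of sepal lengths in cm (an assumption for this
# sketch; it is NOT the complete I. setosa data).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]


def central_tendency(values):
    """Return the mean, the median, and every most-frequent value."""
    return {
        "mean": mean(values),
        "median": median(values),
        # multimode returns all values tied for the highest frequency.
        "modes": multimode(values),
    }


summary = central_tendency(sepal_length)
print(summary)
```

`multimode` is used instead of `mode` because a sample can have several equally frequent values, as this one does.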

Measure Of Spread
In desert regions, it is common for the temperature to rise above 110 °F during the day and
drop below 30 °F during the night, while the average temperature for a 24-hour period is
around 70 °F. Obviously, the experience of living in the desert is not the same as living in a
tropical region with the same average daily temperature of around 70 °F, where the temperature
within the day is between 60 °F and 80 °F. What matters here is not just the central location of
the temperature, but the spread of the temperature. There are two common metrics to quantify
spread.
Range: The range is the difference between the maximum value and the minimum value of
the attribute. In the example, the range for the temperature in the desert is 80 °F and the range
for the tropics is 20 °F. The desert region experiences larger temperature swings, as indicated
by the range.
Deviation: The variance and standard deviation measure the spread by considering all the
values of the attribute. Deviation is simply measured as the difference between any given
value (x_i) and the mean of the sample (μ). The variance is the sum of the squared deviations
of all data points divided by the number of data points. For a dataset with N observations, the
variance is given by the following equation:
\[ \mathrm{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]
Standard deviation is the square root of the variance. Since the standard deviation is measured
in the same units as the attribute, it is easy to understand the magnitude of the metric. High
standard deviation means the data points are spread widely around the central point. Low
standard deviation means data points are closer to the central point. Fig. 3.2 provides the
univariate summary of the Iris dataset with all 150 observations, for each of the four numeric
attributes.
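
The spread measures can be illustrated with a small sketch based on the desert and tropics example. The temperature readings below are hypothetical values chosen so that both regions average 70 °F; the helper name `spread` is also an assumption. The variance divides by the number of data points, as described above, which corresponds to `pvariance` in the standard library.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical temperature readings in °F over one day (illustrative only).
desert = [30, 45, 70, 110, 95, 70]
tropics = [60, 65, 70, 80, 75, 70]


def spread(values):
    """Return the range, population variance, and population standard deviation."""
    return {
        "range": max(values) - min(values),
        # pvariance divides the summed squared deviations by N.
        "variance": pvariance(values),
        "std_dev": pstdev(values),
    }


desert_spread = spread(desert)
tropics_spread = spread(tropics)
print(mean(desert), desert_spread)
print(mean(tropics), tropics_spread)
```

Both lists have the same mean, so only the spread measures distinguish the two regions.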

Multivariate Exploration
Multivariate exploration is the study of more than one attribute in the dataset simultaneously.
This technique is critical to understanding the relationship between the attributes, which is
central to data science methods. Similar to univariate explorations, the measure of central
tendency and variance in the data will be discussed.

Central Data Point


In the Iris dataset, the central data point, also known as the centroid, is the point whose
coordinates are the mean values of the attributes. It varies with the species of iris flower.
The centroids for each species, in centimeters, are:

1. Iris setosa:
● Sepal length: 5.006
● Sepal width: 3.418
● Petal length: 1.464
● Petal width: 0.244
2. Iris versicolor:
● Sepal length: 5.936
● Sepal width: 2.770
● Petal length: 4.260
● Petal width: 1.326
3. Iris virginica:
● Sepal length: 6.588
● Sepal width: 2.974
● Petal length: 5.552
● Petal width: 2.026

These centroids represent the average values for each feature (sepal length, sepal width, petal
length, and petal width) for each species of iris flower. They can be used as a reference point
for comparison and classification.
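
A centroid of this kind can be computed by grouping the rows by species and averaging each attribute. The sketch below uses only four hypothetical rows, so its output does not reproduce the centroids listed above; the row layout and the function name `centroids` are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (species, sepal length, sepal width, petal length, petal width).
rows = [
    ("setosa", 5.1, 3.5, 1.4, 0.2),
    ("setosa", 4.9, 3.0, 1.4, 0.2),
    ("versicolor", 7.0, 3.2, 4.7, 1.4),
    ("versicolor", 6.4, 3.2, 4.5, 1.5),
]


def centroids(rows):
    """Return {species: [mean of each attribute]} for the given rows."""
    groups = defaultdict(list)
    for species, *features in rows:
        groups[species].append(features)
    # zip(*members) turns the list of rows into a sequence of columns.
    return {
        species: [mean(column) for column in zip(*members)]
        for species, members in groups.items()
    }


result = centroids(rows)
print(result)
```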
Correlation
Correlation measures the statistical relationship between two attributes, particularly the dependence
of one attribute on another. When two attributes are highly correlated with each other,
they vary together at a consistent rate, either in the same or in opposite directions.
Correlation is typically measured using the correlation coefficient (ρ or r), which ranges from
-1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation coefficient of
0 indicates no linear relationship. However, it is important to note that correlation does not
imply causation. In other words, just because two variables are correlated, it does not mean that
one causes the other.
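
The text does not say which correlation coefficient is meant; a common choice is the Pearson coefficient, which the following sketch computes from scratch. The function name `pearson_r` and the sample lists are assumptions for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (the covariance numerator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each attribute.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical attribute values.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # same direction
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # opposite direction
print(rising, falling)
```

A value near 0 from this function only rules out a linear relationship; two attributes can still be related in a nonlinear way.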

Data Visualization
Visualizing data is one of the most important techniques of data discovery and exploration.
Though visualization is not considered a data science technique, terms like visual mining or
pattern discovery based on visuals are increasingly used in the context of data science,
particularly in the business world. The discipline of data visualization encompasses the
methods of expressing data in an abstract visual form. The motivation for using data
visualization includes:
● Comprehension of dense information: A simple visual chart can easily include
thousands of data points. By using visuals, the user can see the big picture, as well as
longer-term trends that are extremely difficult to interpret purely by expressing data in
numbers.
● Relationships: Visualizing data in Cartesian coordinates enables exploration of the
relationships between the attributes. Although representing more than three attributes
on the x-, y-, and z-axes is not feasible in Cartesian coordinates, there are a few
creative solutions available by changing properties like the size, color, and shape of
data markers or using flow maps (Tufte, 2001), where more than two attributes are
used in a two-dimensional medium.
As with descriptive statistics, visualization techniques are also categorized into univariate
visualization, multivariate visualization, and visualization of a large number of attributes using
parallel dimensions.

Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. The
techniques discussed in this section give an idea of how the attribute values are distributed
and the shape of the distribution.
Histogram
A histogram is one of the most basic visualization techniques to understand the frequency of
the occurrence of values. It shows the distribution of the data by plotting the frequency of
occurrence in a range. In a histogram, the attribute under inquiry is shown on the horizontal
axis and the frequency of occurrence is on the vertical axis. For a continuous numeric data
type, the range or binning value used to group values needs to be specified. For example,
in the case of human height in centimeters, all the occurrences between 152.00 and 152.99 are
grouped under 152. There is no optimal number of bins or bin width that works for all
distributions. If the bin width is too small, the distribution becomes more precise but reveals
the noise due to sampling. A general rule of thumb is to have a number of bins equal to the
square root or cube root of the number of data points.
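
The binning step behind a histogram can be sketched without any plotting library. The example below uses the square-root rule of thumb to pick the number of equal-width bins; the height values and the function name `histogram_counts` are assumptions for illustration.

```python
from math import sqrt


def histogram_counts(values, bins=None):
    """Count values in equal-width bins; returns (lowest value, bin width, counts)."""
    if bins is None:
        # Rule of thumb: number of bins ~ square root of the number of points.
        bins = max(1, round(sqrt(len(values))))
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the end, so clamp the index.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return low, width, counts


# Hypothetical human heights in centimeters.
heights = [150.2, 151.7, 152.0, 152.5, 152.99, 153.4, 155.1, 158.8, 160.0]
low, width, counts = histogram_counts(heights)
print(low, width, counts)
```

Passing a larger `bins` value to the same function shows how narrower bins expose sampling noise.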

Quartile
A quartile is a statistical term that refers to one of three values that divide a dataset or a
distribution into four equal parts, each containing 25% of the data. The quartiles are:

1. First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of
the data points fall.
2. Second Quartile (Q2): Also known as the 50th percentile or the median, it is the value
below which 50% of the data points fall.
3. Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of
the data points fall.

Quartiles are used to:

1. Summarize and describe the distribution of a dataset
2. Identify the spread and variability of the data
3. Compare datasets or groups
4. Detect outliers and anomalies
5. Determine the interquartile range (IQR), which is the difference between Q3 and Q1, used
to measure the spread of the data.
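
As a sketch of these uses, the snippet below computes the three quartiles and the IQR with the standard library and flags outliers with the common 1.5 × IQR convention. That convention, the `inclusive` interpolation method, the sample data, and the name `quartile_summary` are assumptions not taken from the text; other quartile conventions give slightly different values.

```python
from statistics import quantiles


def quartile_summary(values, fence=1.5):
    """Return Q1, Q2, Q3, the IQR, and values outside the fence * IQR limits."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - fence * iqr, q3 + fence * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "q2": q2, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = quartile_summary(data)
print(summary)
```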

Distribution Chart
A distribution chart, also known as a distribution plot or frequency distribution graph, is a
graphical representation of the distribution of a dataset or a statistical population. It shows the
number of observations or frequencies against the values of a variable, typically in a graphical
format. Distribution charts help to:

1. Visualize the shape of the distribution (e.g., normal, skewed, bimodal)
2. Identify outliers and anomalies
3. Understand the central tendency and variability of the data
4. Compare the distribution of different variables or groups
5. Check for assumptions of statistical tests (e.g., normality)

Multivariate Visualization
The multivariate visual exploration considers more than one attribute in the same visual. The
techniques discussed in this section focus on the relationship of one attribute with another
attribute. These visualizations examine two to four attributes simultaneously.

Scatterplot
In multivariate data visualization, scatterplots can be used to visualize the relationship
between multiple variables by coloring the points or adding shapes or sizes.

Scatterplot multiple
A scatter multiple is an enhanced form of a simple scatterplot in which more than two
dimensions can be included in the chart and studied simultaneously. The primary attribute is
used for the x-axis coordinate. The secondary axis is shared with more attributes or
dimensions. In this example, the values on the y-axis are shared between sepal length,
sepal width, and petal width. Here, sepal length is represented by data points occupying
the topmost part of the chart, sepal width occupies the middle portion, and petal width is in
the bottom portion. Note that the data points are duplicated for each attribute on the y-axis.
Data points are color-coded for each dimension on the y-axis, while the x-axis is anchored with
one attribute, petal length. All the attributes sharing the y-axis should be of the same unit or
normalized.

Scatter Matrix
A scatter matrix, also known as a pairs plot or splom (scatter plot matrix), is a graphical
representation of the relationship between multiple variables in a dataset. It displays the
distribution of each variable against every other variable, using scatter plots or other
visualization techniques. A scatter matrix typically consists of a matrix of scatter plots, where:
The diagonal cells show the distribution of each variable (often as a histogram or density
plot).
The off-diagonal cells display the scatter plots of each variable against every other variable.

Bubble chart
A bubble chart, also known as a bubble plot or scatter plot with bubbles, is a type of data
visualization used in data exploration to display three dimensions of data:

1. X-axis (horizontal): One variable
2. Y-axis (vertical): Another variable
3. Bubble size (z-axis): A third variable, represented by the size of the bubbles

Each bubble represents a single data point, with the x and y coordinates determining its
position and the z-value determining its size. This allows for the visualization of relationships
between three variables at once.

Density Chart
A density chart, also known as a density plot or kernel density estimate (KDE) plot, is a
graphical representation of the distribution of a continuous variable. It shows the density of
the data points at different values, creating a smooth curve that estimates the underlying
distribution of the data. Density charts are particularly useful when dealing with large
datasets, working with continuous variables, or visualizing complex distributions.
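
The smooth curve of a density chart can be sketched as a kernel density estimate with a Gaussian kernel. The bandwidth value, the sample data, and the name `gaussian_kde` below are assumptions for illustration; in practice the bandwidth is usually chosen by a rule or by a plotting library.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    scale = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))

    def density(x):
        # Each sample contributes one bell curve centered on itself.
        return scale * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical sepal lengths in cm and a hand-picked bandwidth.
samples = [4.6, 4.9, 5.0, 5.0, 5.1, 5.4]
density = gaussian_kde(samples, bandwidth=0.2)

# Evaluate the curve on a grid, as a plotting routine would.
grid = [3.0 + 0.01 * i for i in range(401)]
curve = [density(x) for x in grid]
area = sum(curve) * 0.01  # Riemann-sum approximation of the area under the curve
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths them out.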

Visualizing High-Dimensional Data


Visualizing more than three attributes on a two-dimensional medium (like a paper or screen)
is challenging. This limitation can be overcome by using transformation techniques to project
the high-dimensional data points into parallel axis space. In this approach, a Cartesian axis is
shared by more than one attribute.

Parallel Chart
A parallel chart visualizes a data point quite innovatively by transforming or projecting multi-
dimensional data into a two-dimensional chart medium. In this chart, every attribute or
dimension is linearly arranged in one coordinate (x-axis) and all the measures are arranged in
the other coordinate (y-axis). Since the x-axis is multivariate, each data point is represented as
a line in a parallel space. In the case of the Iris dataset, all four attributes are arranged along
the x-axis. The y-axis represents a generic distance and it is “shared” by all these attributes on
the x-axis. Hence, parallel charts work only when attributes share a common unit of
numerical measure or when the attributes are normalized. This visualization is called a
parallel axis because all four attributes are represented in four parallel axes parallel to the
y-axis.
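
The text does not say how attributes should be normalized before they share the y-axis. One common option is min-max scaling to the range 0 to 1, sketched below; the column names, the sample values, and the function name `min_max_normalize` are assumptions.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so the minimum is 0 and the maximum is 1."""
    low, high = min(column), max(column)
    if high == low:
        # A constant attribute carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


# Hypothetical attribute columns with different scales.
table = {
    "sepal_length": [5.0, 6.0, 7.0],
    "petal_width": [0.2, 1.3, 2.4],
}
normalized = {name: min_max_normalize(values) for name, values in table.items()}
print(normalized)
```

After scaling, each observation's line in a parallel chart moves between 0 and 1 on every axis, so no single attribute dominates the picture.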

Deviation Chart
A deviation chart is very similar to a parallel chart, as it has parallel axes for all the attributes
on the x-axis. Data points are extended across the dimensions as lines and there is one
common y-axis. Instead of plotting all data lines, deviation charts only show the mean and
standard deviation statistics. For each class, deviation charts show the mean line connecting
the mean of each attribute; the standard deviation is shown as the band above and below the
mean line. The mean line does not have to correspond to a data point (line). With this method,
information is elegantly displayed, and the essence of a parallel chart is maintained.
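
The statistics behind a deviation chart can be sketched as a mean and a band of one standard deviation for each attribute of a single class. The rows below are hypothetical, the function name `deviation_bands` is an assumption, and the population standard deviation is used here, although a charting tool may use the sample version instead.

```python
from statistics import mean, pstdev


def deviation_bands(rows):
    """For one class, return (mean - sd, mean, mean + sd) for every attribute.

    rows is a list of equal-length numeric tuples, one tuple per observation.
    """
    bands = []
    for column in zip(*rows):
        center = mean(column)
        sd = pstdev(column)
        bands.append((center - sd, center, center + sd))
    return bands


# Hypothetical observations of one class with two attributes.
one_class = [(1.0, 10.0), (3.0, 10.0)]
bands = deviation_bands(one_class)
print(bands)
```

Connecting the middle values across attributes gives the mean line, and the outer values give the band drawn above and below it.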

Andrews Curves
For the Iris dataset, Andrews curves can be used to visualize the four features (sepal
length, sepal width, petal length, and petal width) of the three species (Setosa, Versicolor, and
Virginica) in a two-dimensional plot. Each observation is drawn as one curve, so observations
with similar feature values produce curves that lie close together. By examining the Andrews
curves, you can look for patterns and relationships between the features and species, such as:

● Sepal length and petal length are highly correlated for the Setosa species.
● Sepal width and petal width are highly correlated for the Versicolor species.
● Virginica species has a more variable and dispersed distribution of features.
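
For reference, an Andrews curve maps an observation (x1, x2, x3, x4, ...) to the function f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ..., usually drawn for t between -π and π. The sketch below evaluates that function for one observation; the function name `andrews_curve` and the sample values are assumptions for illustration.

```python
from math import cos, pi, sin, sqrt


def andrews_curve(features, t):
    """Evaluate the Andrews curve of one observation at angle t."""
    total = features[0] / sqrt(2)
    for i, value in enumerate(features[1:], start=1):
        # Frequencies go 1, 1, 2, 2, 3, 3, ... alternating sine and cosine.
        frequency = (i + 1) // 2
        if i % 2 == 1:
            total += value * sin(frequency * t)
        else:
            total += value * cos(frequency * t)
    return total


# Hypothetical observation: sepal length, sepal width, petal length, petal width.
flower = [5.1, 3.5, 1.4, 0.2]

# Sample the curve at evenly spaced angles, as a plotting routine would.
angles = [-pi + 2 * pi * k / 100 for k in range(101)]
curve = [andrews_curve(flower, t) for t in angles]
print(curve[50])  # value at t = 0
```

Plotting one such curve per flower, colored by species, gives the chart described above.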

Roadmap For Data Exploration


Step 1: Data Preparation
1.1. Collect and clean the data
1.2. Handle missing values and outliers
1.3. Transform and normalize data (if necessary)
1.4. Split data into training and testing sets (if necessary)

Step 2: Initial Data Investigation


2.1. Calculate summary statistics (mean, median, mode, range, variance, etc.)
2.2. Visualize data distribution (histograms, box plots, density plots)
2.3. Check for correlations and relationships (scatter plots, correlation matrix)

Step 3: Data Visualization
3.1. Explore individual variables (univariate analysis)
* Visualize distributions and patterns
* Identify outliers and anomalies
3.2. Explore relationships between variables (bivariate analysis)
* Visualize correlations and patterns
* Identify relationships and interactions

Step 4: Data Transformation and Feature Engineering


4.1. Transform variables (log, square root, normalization, etc.)
4.2. Create new features (interaction terms, polynomial terms, etc.)
4.3. Select relevant features (feature selection, dimensionality reduction)

Step 5: Pattern Detection and Hypothesis Generation


5.1. Identify patterns and relationships
5.2. Generate hypotheses and research questions
5.3. Develop a plan for further analysis and modeling

Step 6: Advanced Data Analysis


6.1. Cluster analysis (k-means, hierarchical clustering, etc.)
6.2. Dimensionality reduction (PCA, t-SNE, etc.)
6.3. Time series analysis (if applicable)

Step 7: Model Development and Evaluation


7.1. Develop and train models (regression, classification, clustering, etc.)
7.2. Evaluate model performance (accuracy, precision, recall, etc.)
7.3. Refine and optimize models

Step 8: Insight Generation and Communication


8.1. Interpret results and generate insights
8.2. Communicate findings and recommendations
8.3. Visualize results and present to stakeholders
