DS Assignment

The document discusses data exploration, a critical initial step in data science that involves analyzing datasets to understand their structure and characteristics through descriptive statistics and data visualization. It outlines the objectives of data exploration, types of data, and methods of descriptive statistics, including univariate and multivariate exploration, using the Iris dataset as a primary example. Additionally, it covers various visualization techniques to comprehend data relationships and distributions effectively.

Uploaded by

aqsachishti5892

Assignment

Data Exploration

Submitted To
Dr. Tenvir Ali

Submitted By
Shahbaz Ahmad

Roll No
7

MSCS 1st Semester

Department of Computer Science & IT

The Islamia University of Bahawalpur,
Bahawalnagar Campus

Data Exploration
Data exploration, also known as exploratory data analysis (EDA), is the initial step in the data
science process. It involves examining and analyzing a dataset to understand its underlying
structure, patterns, and characteristics. Data exploration can be broadly divided into two
types:
● Descriptive statistics
● Data visualization

Descriptive Statistics
Descriptive statistics play a crucial role in data exploration, helping to understand the
characteristics of the data and identify potential patterns or trends.

Data Visualization
Visualization is the process of projecting the data or parts
of it into multi-dimensional space or abstract images. All the useful charts
fall under this category.
Data exploration in the context of data science uses both descriptive statistics and
visualization techniques.

Objectives of Data Exploration

The primary objectives of data exploration are:
1. Data understanding: Data exploration provides a high-level overview of each
attribute in the dataset and the interaction between attributes. It helps answer
questions such as: what is the typical value of an attribute, how much do the
data points differ from the typical value, and are there extreme values?
2. Data preparation: Before applying a data science algorithm, the dataset has to be
prepared to handle any anomalies that may be present in the data. These
anomalies include outliers, missing values, and highly correlated attributes. Some data
science algorithms do not work well when input attributes are correlated with each
other, so correlated attributes need to be identified and removed.
3. Data science tasks: Basic data exploration can sometimes substitute for the entire data
science process. For example, scatterplots can identify clusters in low-dimensional
data or can help develop regression or classification models with simple visual rules.
4. Interpreting the results: Finally, data exploration is used to understand the
prediction, classification, and clustering results of the data science process. Histograms help to
comprehend the distribution of an attribute and can also be useful for visualizing
numeric predictions and error rate estimates.

Dataset
A dataset is a collection of data, typically in a table or spreadsheet format, containing
information about a particular phenomenon. It consists of rows and columns.
Each row represents a single data point, and each column represents a variable or feature of
that data point. A few classic datasets are simple to understand and easy to
explain. The Iris dataset is one of them, widely used in machine learning and data
analysis. It was introduced by Ronald Fisher in 1936 and is also known as Fisher's Iris
dataset. The dataset contains 150 samples from three species of iris flowers (Iris setosa, Iris
virginica, and Iris versicolor), 50 per species. Each sample is described by four features:

1. Sepal length (cm)
2. Sepal width (cm)
3. Petal length (cm)
4. Petal width (cm)

The Iris dataset is often used for:
1. Classification tasks (e.g., predicting the species of an iris flower based on its features)
2. Clustering analysis (e.g., grouping similar iris flowers together)
3. Dimensionality reduction (e.g., reducing the number of features while preserving important
information)
4. Data visualization (e.g., exploring the relationships between features).
This dataset is free and publicly available at the UCI Machine Learning Repository. The
Iris dataset is used for learning data science mainly because it is simple to understand and explore,
and it can be used to illustrate how different data science algorithms approach the problem on the
same standard dataset.

Types of Data
Data comes in different formats and types. Understanding the properties of each attribute or
feature provides information about what kind of operations can be performed on that attribute.
For example, the temperature in weather data can be expressed in any of the following
formats:
● Numeric Centigrade (31°C, 33.3°C) or Fahrenheit (100°F, 101.45°F), or
on the Kelvin scale
● Ordered labels, as in hot, mild, or cold
● Number of days within a year below 0°C (10 days in a year below freezing)
A few of these data types can be converted from one to another.
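The conversions between these temperature formats can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the thresholds that separate cold, mild, and hot are assumed values, not ones given in the assignment.

```python
def celsius_to_fahrenheit(celsius):
    """Numeric to numeric: change of scale."""
    return celsius * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(celsius):
    """Numeric to ratio: the Kelvin scale has a true zero point."""
    return celsius + 273.15


def to_ordered_label(celsius, cold_below=10.0, hot_above=25.0):
    """Numeric to ordered label. The two thresholds are assumed values."""
    if celsius < cold_below:
        return "cold"
    if celsius > hot_above:
        return "hot"
    return "mild"


def days_below_freezing(daily_celsius):
    """Numeric to integer count: number of days below 0 degrees Celsius."""
    return sum(1 for t in daily_celsius if t < 0)


if __name__ == "__main__":
    print(celsius_to_fahrenheit(31.0))
    print(to_ordered_label(31.0))
    print(days_below_freezing([-2.0, 0.0, 4.5, -7.1]))
```

Note that the conversion from numeric to ordered labels loses information, so it cannot be reversed.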

Numeric or Continuous
Numeric or continuous data types are variables that can take on any value within a certain
range or interval. They are typically measured or quantified, and can be represented by
numbers. Temperature expressed in Centigrade or Fahrenheit is numeric
and continuous because it can be denoted by numbers and can take an infinite number of values
between any two points on the scale.
An integer is a special form of the numeric data type which does not have decimals in the
value or, more precisely, does not have infinite values between consecutive numbers. Integers
usually denote a count of something: number of days with temperature less than 0°C, number of
orders, number of children in a family, etc. If a zero point is defined, numeric data become a
ratio or real data type. Examples include temperature on the Kelvin scale, bank account balance,
and income. Both integer and ratio data types are categorized as a numeric data type in most
data science tools.

Categorical or Nominal
Categorical data types are attributes treated as distinct symbols or just names. The color of the
iris of the human eye is a categorical data type because it takes a value like black, green, blue,
gray, etc. There is no direct relationship among the data values, and hence, mathematical
operators except the logical or “is equal” operator cannot be applied. They are also called a
nominal or polynominal data type, derived from the Latin word for name.
An ordered nominal data type is a special case of a categorical data type where there is some
kind of order among the values. An example of an ordered data type is temperature expressed
as hot, mild, cold.

Descriptive Statistics
Descriptive statistics refers to the study of the aggregate quantities of a
dataset. These measures are among the most commonly used in
everyday life. Descriptive statistics provide a concise summary of the data, including measures of central
tendency (mean, median, mode), measures of variability (range, variance, standard deviation),
and the shape of the data distribution (skewness, kurtosis).
Descriptive statistics can be broadly classified into univariate and multivariate exploration.

Univariate Exploration
Univariate data exploration denotes the analysis of one attribute at a time. In the example Iris
dataset, a single species such as I. setosa has 50 observations and 4 attributes. Here, some of the
descriptive statistics for the sepal length attribute are explored.

Measure Of Central Tendency

The objective of finding the central location of an attribute is to quantify the dataset with one
central or most common number.
● Mean: The mean is the arithmetic average of all observations in the dataset. It is calculated
by summing all the data points and dividing by the number of data points. The mean for sepal
length in centimeters is 5.0060.
● Median: The median is the value of the central point in the distribution. The median is
calculated by sorting all the observations from small to large and selecting the mid-point
observation in the sorted list. If the number of data points is even, then the average of the
middle two data points is used as the median. The median for sepal length in centimeters is
5.0000.
● Mode: The mode is the most frequently occurring observation. In the dataset, data points
may be repeated, and the most frequently repeated data point is the mode of the dataset. In this
example, the mode in centimeters is 5.1000.
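The three measures can be computed with Python's standard `statistics` module. The sketch below uses a small made-up list of measurements rather than the actual Iris values, so its results are not the figures quoted above. It has an even number of points, so the median comes out as the average of the two middle values.

```python
import statistics

# Hypothetical measurements in centimeters (not real Iris data).
sample = [4.5, 5.0, 5.0, 5.5, 5.0, 6.5]

mean_value = statistics.mean(sample)        # sum of values / number of values
median_value = statistics.median(sample)    # average of the two middle values here
mode_values = statistics.multimode(sample)  # every value tied for most frequent

print("mean:", mean_value)      # 5.25
print("median:", median_value)  # 5.0
print("mode(s):", mode_values)  # [5.0]
```

`multimode` is used instead of `mode` because it returns every value tied for the highest frequency, so ties are not hidden.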

Measure Of Spread
In desert regions, it is common for the temperature to rise above 110℉ during the day and
drop below 30℉ during the night while the average temperature for a 24-hour period is
around 70℉. Obviously, the experience of living in the desert is not the same as living in a
tropical region with the same average daily temperature around 70℉, where the temperature
within the day is between 60℉ and 80℉. What matters here is not just the central location of
the temperature, but the spread of the temperature. There are two common metrics to quantify
spread.
● Range: The range is the difference between the maximum value and the minimum value of
the attribute. In the example, the range for the temperature in the desert is 80℉ and the range
for the tropics is 20℉. The desert region experiences larger temperature swings, as indicated
by the range.
● Deviation: The variance and standard deviation measure the spread by considering all the
values of the attribute. Deviation is simply measured as the difference between any given
value (x_i) and the mean of the sample (μ). The variance is the sum of the squared deviations
of all data points divided by the number of data points. For a dataset with N observations, the
variance is given by the following equation:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
Standard deviation is the square root of the variance. Since the standard deviation is measured
in the same units as the attribute, it is easy to understand the magnitude of the metric. High
standard deviation means the data points are spread widely around the central point. Low
standard deviation means data points are closer to the central point. Fig. 3.2 provides the
univariate summary of the Iris dataset with all 150 observations, for each of the four numeric
attributes.
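The spread measures can be illustrated with the desert and tropics example. The readings below are made-up values, chosen only so that both regions average 70℉ and have ranges of 80℉ and 20℉. The sketch uses the population forms `pvariance` and `pstdev`, which divide by the number of data points N as in the definition of variance above.

```python
import statistics


def value_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)


# Hypothetical readings in degrees Fahrenheit over one day.
desert = [30, 50, 70, 90, 110]
tropics = [60, 65, 70, 75, 80]

for name, temps in (("desert", desert), ("tropics", tropics)):
    print(
        name,
        "mean:", statistics.mean(temps),
        "range:", value_range(temps),
        "variance:", statistics.pvariance(temps),        # divides by N
        "std dev:", round(statistics.pstdev(temps), 2),  # square root of variance
    )
```

Both regions have the same mean, but the desert's standard deviation is about four times that of the tropics.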

Multivariate Exploration
Multivariate exploration is the study of more than one attribute in the dataset simultaneously.
This technique is critical to understanding the relationship between the attributes, which is
central to data science methods. Similar to univariate explorations, the measure of central
tendency and variance in the data will be discussed.

Central Data Point

In the Iris dataset, the central data point, also known as the centroid, varies
depending on the species of iris flower. Here are the centroids for each species (all values in centimeters):

1. Iris setosa:
● Sepal length: 5.006
● Sepal width: 3.418
● Petal length: 1.464
● Petal width: 0.244
2. Iris versicolor:
● Sepal length: 5.936
● Sepal width: 2.770
● Petal length: 4.260
● Petal width: 1.326
3. Iris virginica:
● Sepal length: 6.588
● Sepal width: 2.974
● Petal length: 5.552
● Petal width: 2.026

These centroids represent the average values for each feature (sepal length, sepal width, petal
length, and petal width) for each species of iris flower. They can be used as a reference point
for comparison and classification.
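A centroid is the attribute-wise mean of the rows belonging to one class. The following is a minimal sketch under assumptions: the function name `centroids` and the toy rows are hypothetical, and each row is assumed to hold its numeric features followed by a class label.

```python
from collections import defaultdict
from statistics import fmean


def centroids(rows):
    """Return {label: [mean of each feature]} for rows shaped (f1, ..., fk, label)."""
    groups = defaultdict(list)
    for *features, label in rows:
        groups[label].append(features)
    # zip(*members) turns the rows of one class into per-feature columns.
    return {
        label: [fmean(column) for column in zip(*members)]
        for label, members in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical two-feature rows, not real Iris measurements.
    toy_rows = [
        (1.0, 2.0, "a"),
        (3.0, 4.0, "a"),
        (10.0, 20.0, "b"),
    ]
    print(centroids(toy_rows))
```

Applied to the 150 Iris rows with four features each, the same function should yield one four-value centroid per species.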
Correlation
Correlation measures the statistical relationship between two attributes, particularly the dependence
of one attribute on another. When two attributes are highly correlated with each other,
they vary at the same rate, either in the same or in opposite directions.
Correlation is typically measured using the correlation coefficient (ρ or r), which ranges from
-1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation coefficient of
0 indicates no linear relationship. However, it is important to note that correlation does not
imply causation: just because two variables are correlated does not mean that
one causes the other.
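The text does not say which correlation coefficient is meant. The sketch below assumes the Pearson coefficient, the usual choice for measuring a linear relationship, and computes it from its definition. The function name and the sample lists are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return co_dev / (dev_x * dev_y)


if __name__ == "__main__":
    print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # attributes move together
    print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # attributes move in opposite directions
```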

Data Visualization
Visualizing data is one of the most important techniques of data discovery and exploration.
Though visualization is not considered a data science technique, terms like visual mining or
pattern discovery based on visuals are increasingly used in the context of data science,
particularly in the business world. The discipline of data visualization encompasses the
methods of expressing data in an abstract visual form. The motivation for using data
visualization includes:
● Comprehension of dense information: A simple visual chart can easily include
thousands of data points. By using visuals, the user can see the big picture, as well as
longer-term trends that are extremely difficult to interpret purely by expressing data in
numbers.
● Relationships: Visualizing data in Cartesian coordinates enables exploration of the
relationships between the attributes. Although representing more than three attributes
on the x-, y-, and z-axes is not feasible in Cartesian coordinates, there are a few
creative solutions available by changing properties like the size, color, and shape of
data markers or using flow maps (Tufte, 2001), where more than two attributes are
used in a two-dimensional medium.
As with descriptive statistics, visualization techniques are also categorized into univariate
visualization, multivariate visualization, and visualization of a large number of attributes using
parallel dimensions.

Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. The
techniques discussed in this section give an idea of how the attribute values are distributed
and the shape of the distribution.
Histogram
A histogram is one of the most basic visualization techniques to understand the frequency of
the occurrence of values. It shows the distribution of the data by plotting the frequency of
occurrence in a range. In a histogram, the attribute under inquiry is shown on the horizontal
axis and the frequency of occurrence is on the vertical axis. For a continuous numeric data
type, the range or binning value used to group values needs to be specified. For example,
in the case of human height in centimeters, all the occurrences between 152.00 and 152.99 are
grouped under 152. There is no optimal number of bins or bin width that works for all
distributions. If the bin width is too small, the distribution becomes more precise but reveals
the noise due to sampling; if it is too large, the shape of the distribution is smoothed away.
A general rule of thumb is to have a number of bins equal to the
square root or cube root of the number of data points.
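The binning step can be sketched without any plotting library. The function names and the height values are hypothetical. Each value is assigned to the lower edge of its bin, so heights from 152.00 to 152.99 fall under 152 when the bin width is 1.

```python
import math
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def sqrt_rule_bins(n_points):
    """Rule-of-thumb number of bins: square root of the number of data points."""
    return math.ceil(math.sqrt(n_points))


if __name__ == "__main__":
    # Hypothetical heights in centimeters.
    heights = [152.0, 152.4, 152.99, 153.1, 154.7]
    for lower_edge, count in histogram(heights, 1).items():
        print(lower_edge, "#" * count)
    print("suggested bins for 150 points:", sqrt_rule_bins(150))
```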

Quartile
A quartile is a statistical term that refers to one of three values that divide a dataset or a
distribution into four equal parts, each containing 25% of the data. The quartiles are:

1. First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of
the data points fall.
2. Second Quartile (Q2): Also known as the 50th percentile or the median, it is the value
below which 50% of the data points fall.
3. Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of
the data points fall.

Quartiles are used to:

1. Summarize and describe the distribution of a dataset
2. Identify the spread and variability of the data
3. Compare datasets or groups
4. Detect outliers and anomalies
5. Determine the interquartile range (IQR), which is the difference between Q3 and Q1, used
to measure the spread of the data.
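Quartiles and the IQR can be computed with `statistics.quantiles` from the Python standard library. Two choices in this sketch are assumptions rather than something the text prescribes: the `inclusive` interpolation method, and the common 1.5 × IQR fence used to flag outliers.

```python
import statistics


def quartile_summary(values):
    """Return Q1, Q2 (median), Q3 and the interquartile range."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "q2": q2, "q3": q3, "iqr": q3 - q1}


def iqr_outliers(values, k=1.5):
    """Values beyond k * IQR from the quartiles (k = 1.5 is a common convention)."""
    summary = quartile_summary(values)
    lower = summary["q1"] - k * summary["iqr"]
    upper = summary["q3"] + k * summary["iqr"]
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical values
    print(quartile_summary(data))
    print(iqr_outliers(data + [40]))
```

Different tools interpolate quartiles slightly differently, so small datasets may give slightly different Q1 and Q3 values elsewhere.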

Distribution Chart
A distribution chart, also known as a distribution plot or frequency distribution graph, is a
graphical representation of the distribution of a dataset or a statistical population. It shows the
number of observations or frequencies against the values of a variable, typically in a graphical
format. Distribution charts help to:

1. Visualize the shape of the distribution (e.g., normal, skewed, bimodal)
2. Identify outliers and anomalies
3. Understand the central tendency and variability of the data
4. Compare the distribution of different variables or groups
5. Check for assumptions of statistical tests (e.g., normality)

Multivariate Visualization
The multivariate visual exploration considers more than one attribute in the same visual. The
techniques discussed in this section focus on the relationship of one attribute with another
attribute. These visualizations examine two to four attributes simultaneously.

Scatterplot
In multivariate data visualization, scatterplots can be used to visualize the relationship
between multiple variables by coloring the points or adding shapes or sizes.

Scatter Multiple
A scatter multiple is an enhanced form of a simple scatterplot where more than two
dimensions can be included in the chart and studied simultaneously. The primary attribute is
used for the x-axis coordinate. The secondary axis is shared with more attributes or
dimensions. In this example, the values on the y-axis are shared between sepal length,
sepal width, and petal width. Here, sepal length is represented by data points occupying
the topmost part of the chart, sepal width occupies the middle portion, and petal width is in
the bottom portion. Note that the data points are duplicated for each attribute on the y-axis.
Data points are color-coded for each dimension on the y-axis, while the x-axis is anchored with
one attribute, petal length. All the attributes sharing the y-axis should be of the same unit or
normalized.

Scatter Matrix
A scatter matrix, also known as a pairs plot or splom (scatter plot matrix), is a graphical
representation of the relationship between multiple variables in a dataset. It displays the
distribution of each variable against every other variable, using scatter plots or other
visualization techniques. A scatter matrix typically consists of a matrix of scatter plots, where:
The diagonal cells show the distribution of each variable (often as a histogram or density
plot).
The off-diagonal cells display the scatter plots of each variable against every other variable.

Bubble chart
A bubble chart, also known as a bubble plot or scatter plot with bubbles, is a type of data
visualization used in data exploration to display three dimensions of data:

1. X-axis (horizontal): One variable
2. Y-axis (vertical): Another variable
3. Bubble size (z-axis): A third variable, represented by the size of the bubbles

Each bubble represents a single data point, with the x and y coordinates determining its
position and the z-value determining its size. This allows for the visualization of relationships
between three variables at once.

Density Chart
A density chart, also known as a density plot or kernel density estimate (KDE) plot, is a
graphical representation of the distribution of a continuous variable. It shows the density of
the data points at different values, creating a smooth curve that estimates the underlying
distribution of the data. Density charts are particularly useful when dealing with large
datasets, working with continuous variables, or visualizing complex distributions.
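The smooth curve of a density chart can be sketched from first principles by placing a small bell curve on every data point and averaging them. The version below assumes a Gaussian kernel and a hand-picked bandwidth. The function name, sample values, and bandwidth are all illustrative, and real tools usually choose the bandwidth automatically.

```python
import math


def make_kde(samples, bandwidth):
    """Return a function estimating the density at x using a Gaussian kernel."""
    n = len(samples)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes a bell curve centered on itself.
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


if __name__ == "__main__":
    # Hypothetical measurements in centimeters; bandwidth chosen by hand.
    density = make_kde([4.6, 4.9, 5.0, 5.1, 5.4], bandwidth=0.3)
    for x in (4.0, 4.5, 5.0, 5.5, 6.0):
        print(x, round(density(x), 3))
```

A smaller bandwidth follows individual points more closely, while a larger one gives a smoother and flatter curve, much like the bin width in a histogram.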

Visualizing High-Dimensional Data

Visualizing more than three attributes on a two-dimensional medium (like a paper or screen)
is challenging. This limitation can be overcome by using transformation techniques to project
the high-dimensional data points into parallel axis space. In this approach, a Cartesian axis is
shared by more than one attribute.

Parallel Chart
A parallel chart visualizes a data point quite innovatively by transforming or projecting
multi-dimensional data into a two-dimensional chart medium. In this chart, every attribute or
dimension is linearly arranged in one coordinate (x-axis) and all the measures are arranged in
the other coordinate (y-axis). Since the x-axis is multivariate, each data point is represented as
a line in a parallel space. In the case of the Iris dataset, all four attributes are arranged along
the x-axis. The y-axis represents a generic distance and it is “shared” by all these attributes on
the x-axis. Hence, parallel charts work only when attributes share a common unit of
numerical measure or when the attributes are normalized. This visualization is called a
parallel axis because all four attributes are represented in four parallel axes parallel to the
y-axis.
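The normalization that a parallel chart relies on can be sketched as follows. Min-max scaling to the range 0 to 1 is assumed here as one common choice; the text does not name a method. The function names and the sample rows are hypothetical.

```python
def min_max_normalize(column):
    """Rescale one attribute to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]  # constant attribute: nothing to spread out
    return [(value - low) / (high - low) for value in column]


def parallel_lines(rows):
    """Turn rows of raw attributes into one normalized polyline per data point."""
    columns = list(zip(*rows))                      # rows -> attribute columns
    normalized = [min_max_normalize(c) for c in columns]
    return [list(line) for line in zip(*normalized)]  # back to one list per row


if __name__ == "__main__":
    # Hypothetical four-attribute rows, not real Iris measurements.
    rows = [
        (5.0, 3.0, 1.0, 0.2),
        (6.0, 2.0, 4.0, 1.2),
        (7.0, 4.0, 6.0, 2.2),
    ]
    for line in parallel_lines(rows):
        print(line)
```

Each output list holds the y-positions of one data point on the parallel axes, in attribute order.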

Deviation Chart
A deviation chart is very similar to a parallel chart, as it has parallel axes for all the attributes
on the x-axis. Data points are extended across the dimensions as lines and there is one
common y-axis. Instead of plotting all data lines, deviation charts only show the mean and
standard deviation statistics. For each class, deviation charts show the mean line connecting
the mean of each attribute; the standard deviation is shown as the band above and below the
mean line. The mean line does not have to correspond to a data point (line). With this method,
information is elegantly displayed, and the essence of a parallel chart is maintained.
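The numbers behind a deviation chart for one class can be sketched as a mean line with a band one standard deviation wide on each side. The function name and the sample rows are hypothetical, and the population standard deviation is assumed.

```python
from statistics import fmean, pstdev


def deviation_band(rows):
    """For each attribute, return (mean - sd, mean, mean + sd) over the given rows."""
    band = []
    for column in zip(*rows):   # one column per attribute
        center = fmean(column)
        spread = pstdev(column)  # population standard deviation (an assumption)
        band.append((center - spread, center, center + spread))
    return band


if __name__ == "__main__":
    # Hypothetical rows for a single class, two attributes each.
    one_class = [(1.0, 10.0), (3.0, 10.0)]
    for lower, center, upper in deviation_band(one_class):
        print(lower, center, upper)
```

Calling the function once per class gives the mean line and band that the chart draws for that class.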

Andrews Curves
For the Iris dataset, Andrews curves can be used to visualize the four features (sepal
length, sepal width, petal length, and petal width) of the three species (Setosa, Versicolor, and
Virginica) in a two-dimensional plot. Each observation is drawn as a curve, a finite Fourier
series whose coefficients are the observation's attribute values. By examining the Andrews curves, you can identify
patterns and relationships between the features and species, such as:

● Sepal length and petal length are highly correlated for the Setosa species.
● Sepal width and petal width are highly correlated for the Versicolor species.
● Virginica species has a more variable and dispersed distribution of features.
Roadmap For Data Exploration

Step 1: Data Preparation
1.1. Collect and clean the data
1.2. Handle missing values and outliers
1.3. Transform and normalize data (if necessary)
1.4. Split data into training and testing sets (if necessary)

Step 2: Initial Data Investigation

2.1. Calculate summary statistics (mean, median, mode, range, variance, etc.)
2.2. Visualize data distribution (histograms, box plots, density plots)
2.3. Check for correlations and relationships (scatter plots, correlation matrix)

Step 3: Data Visualization
3.1. Explore individual variables (univariate analysis)
* Visualize distributions and patterns
* Identify outliers and anomalies
3.2. Explore relationships between variables (bivariate analysis)
* Visualize correlations and patterns
* Identify relationships and interactions

Step 4: Data Transformation and Feature Engineering

4.1. Transform variables (log, square root, normalization, etc.)
4.2. Create new features (interaction terms, polynomial terms, etc.)
4.3. Select relevant features (feature selection, dimensionality reduction)

Step 5: Pattern Detection and Hypothesis Generation

5.1. Identify patterns and relationships
5.2. Generate hypotheses and research questions
5.3. Develop a plan for further analysis and modeling

Step 6: Advanced Data Analysis

6.1. Cluster analysis (k-means, hierarchical clustering, etc.)
6.2. Dimensionality reduction (PCA, t-SNE, etc.)
6.3. Time series analysis (if applicable)

Step 7: Model Development and Evaluation

7.1. Develop and train models (regression, classification, clustering, etc.)
7.2. Evaluate model performance (accuracy, precision, recall, etc.)
7.3. Refine and optimize models

Step 8: Insight Generation and Communication

8.1. Interpret results and generate insights
8.2. Communicate findings and recommendations
8.3. Visualize results and present to stakeholders
