DJ 14 Ai&ds 3

Data science is the field of applying advanced analytics and scientific principles to extract valuable information from data. It uses statistics, visualization, machine learning and other techniques. Data science is used in many industries to analyze past and current data and to predict future trends. Statistics are used to describe, analyze and make inferences from data through measures like mean, median, mode, variance and distributions.

3RD UNIT – Introduction to Data Science

What is Data Science


 Data Science is the field of applying advanced analytics and
scientific principles to extract valuable information from data
for business decision-making, strategic planning, and other
uses.
Definition: It is the study of extracting meaningful insights from
raw, structured and unstructured data.
 Data science is used in marketing, finance, human
resources, healthcare, government programmes, and any other
industry that generates data.
 Statistics, visualization, deep learning, and machine learning
are its core concepts.
 The term DATA SCIENCE was used by PETER NAUR in
1974.
 Data science concentrates on past and current data, and
also on predicting future data with the help of raw data.
 Both BI & DA are built on the multi-dimensional
model [3D, 4D, ...].
 Data science was recognised by the IASC [International Association
for Statistical Computing].
 A large number of algorithms are used by data scientists,
such as Linear Regression, KNN, K-means, Naive Bayes, etc.

Data MODEL: Organizes the data elements and standardizes
how the data elements relate to each other.
Data Visualization: The graphical representation of data, in which
the data is converted into charts such as scatter plots, histograms,
pie charts, etc.
Data Analysis: The core branch of data science. Here we
concentrate on past and current data.
Business Intelligence: Concentrates only on past data
(up to 10 years of data).
Big Data: A collection of data that is huge in volume, yet
growing exponentially with time.
 It describes massive amounts of both
structured and unstructured data, so large that it is
difficult to process using traditional techniques.
 It can be described by the 4 V's – Volume (size),
Velocity (how fast data is generated),
Variety (structured and unstructured), and Variability (how
much inconsistency there is).
 Approximately 30,000 hours of video are uploaded to
YouTube per minute.
 There are nearly 800 million videos on YouTube; that
huge amount of data is organised by special
techniques.

Machine Learning: It is a branch of AI that enables computers to self-
learn from training data.

That allows software applications to become more accurate at
predicting outcomes.

 Machine learning is totally an automation:

 It automatically takes the input.
 It automatically processes the input.
 It automatically gives the output.
 Types of ML algorithms:
 Supervised learning algorithms – [trained data set /
labeled input & output data]
 Unsupervised learning algorithms – [unlabeled data set]
 Semi-supervised algorithms – [combination of both]
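The "labeled input & output data" idea behind supervised learning can be sketched as a tiny 1-nearest-neighbour classifier written from scratch. The data set and category names below are hypothetical, invented only for illustration:

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbour
# classifier. It predicts the label of the closest labeled example.
import math

def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Labeled training data: (height_cm, weight_kg) -> category
train = [((150, 45), "light"), ((160, 55), "light"),
         ((180, 85), "heavy"), ((190, 95), "heavy")]

print(nearest_neighbour(train, (155, 50)))  # point near the "light" group
print(nearest_neighbour(train, (185, 90)))  # point near the "heavy" group
```

An unsupervised algorithm such as K-means would receive only the feature pairs, without the category labels.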

Datafication: It refers to the fact that the daily interactions of living
things can be rendered into a data format, for example by Netflix,
Instagram, Facebook, and other social media platforms.

 "The process of taking all aspects of life and turning them into
quantified data."
 Once we datafy things, we can transform their purposes and turn
the information into new forms of value.
 Applications:
 HRM – human resource management
 CRM – customer relationship management
 Banking
 Insurance agencies
 Commercial real estate.

For example,

o Twitter datafies stray thoughts.

o LinkedIn datafies professional networks.

The Current Landscape:

Skill set needed for Data Science:

Statistics for Data Science:

Terminology & Concepts

Statistics: The branch of mathematics which collects data,
interprets data, organises raw, structured and unstructured
data, and visualises data in the form of graphs, charts, etc.

There are two types of statistics:

 Descriptive:
 It represents the general properties of the data.
 It is used to summarize data in the form of the names of
the variables, the number of variables, and the number of samples.
 It can represent numerical (quantitative),
categorical (nominal), and discrete data.
 Inferential:
 It uses probability distributions to predict properties of
the population from a sample.

Modeling in Statistics: A mathematical model which shows the
relationship between two or more variables.

There are two types of models:

 Supervised models
 Linear regression model
 Logistic regression model
 Polynomial regression
 Classification models
 Decision tree model
 Naive Bayes model
 K-nearest neighbours (KNN) model
 Unsupervised models
 Clustering models.
 Statistical models are used to make predictions or draw
conclusions.
 These predictions are based on how two random variables are
connected.
 The models show a relationship between the variables.
For example:

Age   Weight
10    25
20    35
35    55
40    65

 Weight = A0 + A1(age) + ε (error term)
 Y = mx + c
Y = response (dependent) variable
m = slope
x = predictor (independent) variable
c = intercept
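The Y = mx + c model can be fitted to the age/weight table above by ordinary least squares; a minimal sketch using only the standard library:

```python
# Fitting Weight = m*age + c to the table above by least squares.
ages    = [10, 20, 35, 40]
weights = [25, 35, 55, 65]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(weights) / n

# slope m = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, weights))
     / sum((x - mean_x) ** 2 for x in ages))
# intercept c = mean_y - m * mean_x
c = mean_y - m * mean_x

print(f"Weight = {m:.3f} * age + {c:.3f}")
print(f"Predicted weight at age 30: {m * 30 + c:.1f}")
```

The fitted line then predicts a weight for any age, which is the "prediction" role of the model described above.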

Population: The collection of all observations or objects that share
at least one common characteristic, gathered for the purposes of
data collection and analysis. For example, the number of email
users (≈ 4 billion).
Sample: A subset of observations from the population that,
ideally, is a true representation of the population.

Normal Distribution: If the data is unimodal and symmetric about
the mean, then it is a perfect normal distribution.
 Unimodal: mean = median = mode
 When the data piles up around a single peak, the
distribution is easy to identify.
Measures of statistics:
Mean: The mean, also known as the average, is the central value of a
finite set of numbers.

 Let's assume a random variable X in the data has the
values x1, x2, x3, ..., xn.
 Then the sample mean, denoted by μ, is
μ = (x1 + x2 + ... + xn) / n.

Median: If n is odd, the middle value is the median.

If n is even, the median is the average of the two middle
values.

For class intervals, the median is L + ((n/2 − f)/F) × c, where L is
the lower boundary of the median class, f is the cumulative
frequency of the classes before it, F is the frequency of the median
class, and c is the class width.

Mode: The most frequently repeated value.

 Unimodal: one value is repeated most often.
 Bimodal: two values are repeated equally often.
 Multimodal: more than two values are repeated equally often.
 All of the above can be reported as a range.
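The three measures above can be computed with Python's standard `statistics` module; the sample below is hypothetical:

```python
# Mean, median and mode of a small hypothetical sample,
# using Python's standard `statistics` module.
import statistics

data = [10, 20, 20, 35, 40]

print(statistics.mean(data))    # (10+20+20+35+40)/5 = 25
print(statistics.median(data))  # n is odd, so the middle value: 20
print(statistics.mode(data))    # most frequently repeated value: 20
```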

Measures of Dispersion:

Range: The interval from the minimum to the maximum value
(max − min).

Interquartile Range (IQR): Represents the data in terms of
percentiles/quartiles.

 It is the range of the middle 50% of the data (Q3 − Q1),
centred on the median.

 The IQR is most often used when representing data
in box plots.

Variance: The variance measures how far the data points are spread
out from the average value.

Standard Deviation: The standard deviation is simply the square
root of the variance and measures the extent to which data varies
from its mean.

 The standard deviation, denoted by σ, can be expressed as
σ = √( Σ(xi − μ)² / n ).
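The variance and standard deviation can be computed by hand and checked against the standard `statistics` module (sample is hypothetical):

```python
# Population variance and standard deviation computed by hand,
# then checked against Python's `statistics` module.
import math
import statistics

data = [10, 20, 20, 35, 40]
mu = sum(data) / len(data)                           # mean = 25
var = sum((x - mu) ** 2 for x in data) / len(data)   # population variance
sd = math.sqrt(var)                                  # standard deviation

print(var, sd)
print(statistics.pvariance(data), statistics.pstdev(data))  # same values
```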
Skewness: It describes how much a statistical data distribution is
asymmetrical relative to the normal distribution, in which the data is
divided equally on each side.

 If the skewness is between -0.5 and 0.5, the data are fairly
symmetrical.
 If the skewness is between -1 and -0.5 or between 0.5 and 1,
the data are moderately skewed.
 If the skewness is less than -1 or greater than 1, the data are
highly skewed.
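One common way to compute skewness (the notes do not name a formula, so the Fisher-Pearson moment coefficient is assumed here) together with the rule-of-thumb labels above:

```python
# Fisher-Pearson moment coefficient of skewness (one common
# definition), plus the rule-of-thumb labels from the notes.
import math

def skewness(data):
    n = len(data)
    mu = sum(data) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(((x - mu) / sd) ** 3 for x in data) / n

def label(g):
    if -0.5 <= g <= 0.5:
        return "fairly symmetrical"
    if -1 <= g <= 1:
        return "moderately skewed"
    return "highly skewed"

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical, symmetric sample
print(skewness(sample), label(skewness(sample)))
```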

Kurtosis: It measures how the data of a distribution are spread out
in its graph.

 It is a measure of the combined weight of a distribution's tails
relative to the center of the distribution.
 It describes the "tailedness" of the distribution.
 It also describes the shape of the distribution, and is sometimes
treated as a measure of its "peakedness".
 A high-kurtosis distribution has a sharper peak.
 Kurtosis values are typically reported roughly in the range of
-5 to +5.

Logistic: Data is predicted in binary form.
Polynomial: Data takes a polynomial form:
Y = a0 + a1x + a2x² + a3x³ + ...


Neural network:

Probability distributions: The list of all values of a random variable
together with the probability of each value occurring is called a
probability distribution.

If the random variable has a finite number of possible values, it is a
discrete probability distribution.
If the random variable has an infinite number of possible values, it
is a continuous probability distribution.

 Normal distribution
 Bernoulli distribution
 Uniform distribution
 Binomial distribution
 Poisson distribution
 Exponential distribution
 Standard normal distribution

Normal distribution: The normal distribution is also known as the
Gaussian distribution.

It is a probability distribution that is symmetric about the mean.

The mean, median and mode are exactly the same.

Bernoulli Distribution:

The Bernoulli distribution for a Bernoulli trial has only two possible
outcomes: success or failure.

For example, tossing a coin can only yield two outcomes: heads or
tails.

The probability of getting heads for a single unbiased coin toss is
p = 0.5, as there is an equal chance of either result.

Then (1 − p) = 0.5, so P(X = 1) = p = 1/2.
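A quick simulation of Bernoulli trials shows the observed frequency of heads approaching p (the seed is fixed only so the sketch is reproducible):

```python
# Simulating Bernoulli trials: unbiased coin tosses with p = 0.5.
import random

random.seed(42)          # fixed seed for reproducibility
p = 0.5
n = 10_000
heads = sum(1 for _ in range(n) if random.random() < p)

print(heads / n)         # close to 0.5 for a large number of tosses
```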

Uniform Distribution:

The uniform distribution for a discrete random variable is a
symmetrical probability distribution in which each of a finite
number of values is observed equally often.

For example, when we roll a die or toss an unbiased coin, each
possible outcome is equally likely.

For a random variable X taking k equally likely values, the uniform
distribution function can be defined as f(X = x) = 1/k.

For example, rolling an unbiased die gives 6 possible
values: {1, 2, 3, 4, 5, 6}.

There is an equally likely chance of getting any one of these values,
so f(X = x) = 1/6 (the probability of getting any particular value).

Poisson Distribution:

The Poisson distribution is a discrete distribution derived by the
mathematician Siméon Denis Poisson.

As the random variable is discrete, an event can only be measured
as occurring or not occurring.

A random variable X is said to follow a Poisson distribution when it
assumes only non-negative integer values and its probability
function is given by

P(X = x) = (e^(−λ) λ^x) / x!
λ = Poisson parameter (the average rate of occurrence)
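The Poisson probability function above translates directly into code; the rate λ = 3 below is a hypothetical example:

```python
# The Poisson probability function P(X = x) = e**(-lam) * lam**x / x!,
# evaluated for a few values and checked to sum to (almost) 1.
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.0   # hypothetical rate, e.g. 3 events per minute
for x in range(5):
    print(x, round(poisson_pmf(x, lam), 4))

total = sum(poisson_pmf(x, lam) for x in range(50))
print(total)   # ≈ 1.0, as the probabilities of all outcomes must sum to 1
```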

Binomial distribution:
The probability of getting x successes out of n independent trials,
each with the same probability of success.
Exponential: The exponential distribution is a continuous
probability distribution.
It often concerns the amount of time until some specific event
happens.

Standard Normal: It is a special normal distribution where the mean
is 0 and the standard deviation is 1.
Because any normal variable can be converted to it by
standardizing, it does not depend on a particular mean, variance,
or standard deviation.

Hypothesis Testing: A hypothesis test is a statistical test used to
determine whether there is enough evidence in a sample of data to
support an assumption about the population.

 It tests an assumption regarding a population parameter.
 This assumption is called the null hypothesis and is denoted
by H0.
 The alternative hypothesis (denoted Ha) is the opposite
of what is stated in the null hypothesis.
 The null hypothesis H0 is the exact assumption made about
the population.
 When the null hypothesis is rejected, the alternative hypothesis
is accepted.
 A sample rarely matches the assumption exactly, so the test
quantifies the evidence.
 It is generally used when we want to compare: a single group
with an external standard, or two or more groups with each
other.

There are two types of errors:

 Type 1 error: rejecting the null hypothesis even though it
is true.
For example, the examiner fails the student (H0 = the student
passes) even though the student really would pass.
 Type 2 error: failing to reject the null hypothesis even
though it is false.
For example, the examiner passes the student even though the
student really would fail.

 Take the example of a claim that running 5 miles a day will
lead to a reduction of 10 kg of weight within a month.
 This is the hypothesis or claim which is required to be
proved (or otherwise).
 The alternate hypothesis is formulated first, as the
statement that "running 5 miles a day will lead to a reduction
of 10 kg of weight within a month".
 Hence, the null hypothesis is the opposite of the
alternate hypothesis, stated as the fact that "running 5
miles a day does not lead to a reduction of 10 kg of weight
within a month".

Null hypothesis: Running 5 miles a day does not result in the
reduction of 10 kg of weight within a month.

Alternate hypothesis: Running 5 miles a day results in the
reduction of 10 kg of weight within a month.

Null Hypothesis (H0): A hypothesis that says there is no
statistical significance between the two variables in the
hypothesis.
Alternative Hypothesis (Ha): A hypothesis that says there is a
statistical significance between the two variables in the
hypothesis.
Test Statistic (X): A quantity derived from the sample, used in
hypothesis testing. We will calculate it in our example.
P-value: The probability of obtaining a value of the test
statistic at least as extreme as what we observed, under the
assumption that the null hypothesis is true.
 It is the probability of the observation given that the null
hypothesis is true.
 Using this value we either reject the null hypothesis
or accept it.
Significance level: The probability of rejecting the null
hypothesis when it is true.

If p-value <= alpha: reject the null hypothesis and accept the
alternate hypothesis.

If p-value > alpha: accept (fail to reject) the null hypothesis.

For example,

 Now let us go back to our example and conduct the test of


hypothesis to find out if the claim of the food delivery
company is true or not. To conduct the test, we have taken a
sample of 300 deliveries and found out that the average
delivery time is 35 minutes. It is difficult to conduct the test for
the whole data so in such cases, we take out a sample and
conduct a test on that.
 The Null Hypothesis in our case is that the delivery time on
average is a maximum of 30 minutes i.e. μ ≤ 30.
 Alternative Hypothesis in our case that on average it takes
more than 30 minutes to deliver food i.e. μ > 30.
 P-value: We calculate the p-value here as
P(X ≥ 35 | the null hypothesis is true). X is the test statistic,
which here is the observed average delivery time of 35
minutes.
 We assume that the p-value we got here is 3%. We calculate
the p-value using a method - Resampling and
permutation, which we will explore in detail in the next
article.
 Significance Level: We will choose a significance level of 5%
here. 5% is usually a set standard used in the industry.
 As we observe that our p-value of 3% is less than the
significance level of 5%, we will reject the null hypothesis and
accept the alternative hypothesis.
 So we can conclude that the alternative hypothesis is true and
the research’s finding is true that on average, the company
takes more than 30 minutes to deliver food.

Hypothesis Testing Example


The best way to solve a problem on hypothesis testing is by applying the 5 steps
mentioned in the previous section. Suppose a researcher claims that the mean
weight of men is greater than 100 kg, with a standard deviation of
15 kg. 30 men are chosen, with an average weight of 112.5 kg. Using hypothesis
testing, check if there is enough evidence to support the researcher's claim. The
confidence level is given as 95%.

Step 1: This is an example of a right-tailed test. Set up the null hypothesis
as H0: μ = 100.

Step 2: The alternative hypothesis is given by H1: μ > 100.

Step 3: As this is a one-tailed test, α = 100% − 95% = 5%. This can be used to
determine the critical value.

1 − α = 1 − 0.05 = 0.95

0.95 gives the required area under the curve. Now using a normal distribution
table, the area 0.95 is at z = 1.645. A similar process can be followed for a t-test.
The only additional requirement is to calculate the degrees of freedom given by
n − 1.

Step 4: Calculate the z test statistic. This is because the sample size is 30.
Furthermore, the sample and population means are known along with the
standard deviation.

z = (x̄ − μ) / (σ / √n)

μ = 100, x̄ = 112.5, n = 30, σ = 15

z = (112.5 − 100) / (15 / √30) ≈ 4.56

Step 5: Conclusion. As 4.56 > 1.645, the null hypothesis can be rejected.
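The Step 4 calculation can be checked with a few lines of code:

```python
# Computing the z statistic z = (xbar - mu) / (sigma / sqrt(n))
# for the researcher's sample, and comparing it to the critical value.
import math

mu, xbar, n, sigma = 100, 112.5, 30, 15
z = (xbar - mu) / (sigma / math.sqrt(n))
z_critical = 1.645          # right-tailed test at alpha = 0.05

print(round(z, 2))          # ≈ 4.56
print(z > z_critical)       # True, so reject the null hypothesis
```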
