DJ 14 Ai&ds 3

Data science is the field of applying advanced analytics and scientific principles to extract valuable information from data. It uses statistics, visualization, machine learning and other techniques. Data science is used in many industries to analyze past and current data and to predict future trends. Statistics are used to describe, analyze and make inferences from data through measures like mean, median, mode, variance and distributions.

3RD UNIT – Introduction to Data Science

What is Data Science


 Data Science is the field of applying advanced analytics and
scientific principles to extract valuable information from data
for business decision-making, strategic planning, and other
uses.
Definition: It is the study of extracting meaningful insights from
raw, structured and unstructured data.
 Data science is used in marketing, finance, human
resources, healthcare, government programmes, and any other
industry that generates data.
 Statistics, visualization, deep learning, and machine learning
are its core concepts.
 The term DATA SCIENCE was used by PETER NAUR in
1974.
 Data science concentrates on past and current data, and
also on predicting future data with the help of raw data.
 Both BI & DA are built on the multi-dimensional
model [3D, 4D, ...].
 Data science was recognised by the IASC [International Association
for Statistical Computing].
 A large number of algorithms are used by data scientists,
such as Linear Regression, KNN, K-means, Naive Bayes, etc.

Data MODEL: Organizes the data elements and standardizes
how the data elements relate to each other.
Data Visualization: The graphical representation of data, in which
the data is converted into charts such as scatter plots, histograms,
pie charts, etc.
Data Analysis: The core branch of data science. Here we
concentrate on past and current data.
Business Intelligence: Concentrates only on past data
(up to 10 years of data).
Big Data: A collection of data that is huge in volume, yet
growing exponentially with time.
 It describes massive amounts of both
structured and unstructured data, so large that it is
difficult to process using traditional techniques.
 It can be described by the 4 V's – Volume (size),
Velocity (how fast data is generated),
Variety (structured and unstructured), and Variability (how
much inconsistency there is).
 Approximately 30,000 hours of video are uploaded to
YouTube per minute.
 There are nearly 800 million videos on YouTube; that
huge amount of data is organised by special
techniques.

Machine Learning: It is a branch of AI that enables computers to self-
learn from training data.

That allows software applications to become more accurate at
predicting outcomes.

 Machine learning is totally an automation:

 It automatically takes the input.
 It automatically processes the input.
 It automatically gives the output.
 Types of ML algorithms:
 Supervised learning algorithms – [trained data set /
labeled input & output data]
 Unsupervised learning algorithms – [unlabeled data set]
 Semi-supervised algorithms – [combination of both]
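The "labeled input & output data" idea behind supervised learning can be sketched as a tiny 1-nearest-neighbour classifier written from scratch. The data set and category names below are hypothetical, invented only for illustration:

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbour
# classifier. It predicts the label of the closest labeled example.
import math

def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Labeled training data: (height_cm, weight_kg) -> category
train = [((150, 45), "light"), ((160, 55), "light"),
         ((180, 85), "heavy"), ((190, 95), "heavy")]

print(nearest_neighbour(train, (155, 50)))  # point near the "light" group
print(nearest_neighbour(train, (185, 90)))  # point near the "heavy" group
```

An unsupervised algorithm such as K-means would receive only the feature pairs, without the category labels.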

Datafication: It refers to the fact that the daily interactions of living
things can be rendered into a data format, for example by Netflix,
Instagram, Facebook, and other social media platforms.

 "The process of taking all aspects of life and turning them into
quantified data."
 Once we datafy things, we can transform their purposes and turn
the information into new forms of value.
 Applications:
 HRM – human resource management
 CRM – customer relationship management
 Banking
 Insurance agencies
 Commercial real estate.

For example,

o Twitter datafies stray thoughts.

o LinkedIn datafies professional networks.

The Current Landscape:

Skill set needed for Data Science:

Statistics for Data Science:

Terminology & Concepts

Statistics: The branch of mathematics which collects data,
interprets data, organises raw, structured and unstructured
data, and visualises data in the form of graphs, charts, etc.

There are two types of statistics:

 Descriptive:
 It represents the general properties of the data.
 It is used to summarize data in the form of the names of
the variables, the number of variables, and the number of samples.
 It can represent numerical (quantitative),
categorical (nominal), and discrete data.
 Inferential:
 It uses probability distributions to predict properties of
the population from a sample.

Modeling in Statistics: A mathematical model which shows the
relationship between two or more variables.

There are two types of models:

 Supervised models
 Linear regression model
 Logistic regression model
 Polynomial regression
 Classification models
 Decision tree model
 Naive Bayes model
 K-nearest neighbours (KNN) model
 Unsupervised models
 Clustering models.
 Statistical models are used to make predictions or draw
conclusions.
 These predictions are based on how two random variables are
connected.
 The models show a relationship between the variables.
For example:

Age   Weight
10    25
20    35
35    55
40    65

 Weight = A0 + A1(age) + ε (error term)
 Y = mx + c
Y = response (dependent) variable
m = slope
x = predictor (independent) variable
c = intercept
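The Y = mx + c model can be fitted to the age/weight table above by ordinary least squares; a minimal sketch using only the standard library:

```python
# Fitting Weight = m*age + c to the table above by least squares.
ages    = [10, 20, 35, 40]
weights = [25, 35, 55, 65]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(weights) / n

# slope m = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, weights))
     / sum((x - mean_x) ** 2 for x in ages))
# intercept c = mean_y - m * mean_x
c = mean_y - m * mean_x

print(f"Weight = {m:.3f} * age + {c:.3f}")
print(f"Predicted weight at age 30: {m * 30 + c:.1f}")
```

The fitted line then predicts a weight for any age, which is the "prediction" role of the model described above.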

Population: The collection of all observations or objects that share
at least one common characteristic, gathered for the purposes of
data collection and analysis. For example, the number of email
users (≈ 4 billion).
Sample: A subset of observations from the population that,
ideally, is a true representation of the population.

Normal Distribution: If the data is unimodal and symmetric about
the mean, then it is a perfect normal distribution.
 Unimodal: mean = median = mode
 When the data piles up around a single peak, the
distribution is easy to identify.
Measures of statistics:
Mean: The mean, also known as the average, is the central value of a
finite set of numbers.

 Let's assume a random variable X in the data has the
values x1, x2, x3, ..., xn.
 Then the sample mean, denoted by μ, is
μ = (x1 + x2 + ... + xn) / n.

Median: If n is odd, the middle value is the median.

If n is even, the median is the average of the two middle
values.

For class intervals, the median is L + ((n/2 − f)/F) × c, where L is
the lower boundary of the median class, f is the cumulative
frequency of the classes before it, F is the frequency of the median
class, and c is the class width.

Mode: The most frequently repeated value.

 Unimodal: one value is repeated most often.
 Bimodal: two values are repeated equally often.
 Multimodal: more than two values are repeated equally often.
 All of the above can be reported as a range.
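The three measures above can be computed with Python's standard `statistics` module; the sample below is hypothetical:

```python
# Mean, median and mode of a small hypothetical sample,
# using Python's standard `statistics` module.
import statistics

data = [10, 20, 20, 35, 40]

print(statistics.mean(data))    # (10+20+20+35+40)/5 = 25
print(statistics.median(data))  # n is odd, so the middle value: 20
print(statistics.mode(data))    # most frequently repeated value: 20
```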

Measures of Dispersion:

Range: The interval from the minimum to the maximum value
(max − min).

Interquartile Range (IQR): Represents the data in terms of
percentiles/quartiles.

 It is the range of the middle 50% of the data (Q3 − Q1),
centred on the median.

 The IQR is most often used when representing data
in box plots.

Variance: The variance measures how far the data points are spread
out from the average value.

Standard Deviation: The standard deviation is simply the square
root of the variance and measures the extent to which data varies
from its mean.

 The standard deviation, denoted by σ, can be expressed as
σ = √( Σ(xi − μ)² / n ).
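The variance and standard deviation can be computed by hand and checked against the standard `statistics` module (sample is hypothetical):

```python
# Population variance and standard deviation computed by hand,
# then checked against Python's `statistics` module.
import math
import statistics

data = [10, 20, 20, 35, 40]
mu = sum(data) / len(data)                           # mean = 25
var = sum((x - mu) ** 2 for x in data) / len(data)   # population variance
sd = math.sqrt(var)                                  # standard deviation

print(var, sd)
print(statistics.pvariance(data), statistics.pstdev(data))  # same values
```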
Skewness: It describes how much a statistical data distribution is
asymmetrical relative to the normal distribution, in which the data is
divided equally on each side.

 If the skewness is between -0.5 and 0.5, the data are fairly
symmetrical.
 If the skewness is between -1 and -0.5 or between 0.5 and 1,
the data are moderately skewed.
 If the skewness is less than -1 or greater than 1, the data are
highly skewed.
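One common way to compute skewness (the notes do not name a formula, so the Fisher-Pearson moment coefficient is assumed here) together with the rule-of-thumb labels above:

```python
# Fisher-Pearson moment coefficient of skewness (one common
# definition), plus the rule-of-thumb labels from the notes.
import math

def skewness(data):
    n = len(data)
    mu = sum(data) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(((x - mu) / sd) ** 3 for x in data) / n

def label(g):
    if -0.5 <= g <= 0.5:
        return "fairly symmetrical"
    if -1 <= g <= 1:
        return "moderately skewed"
    return "highly skewed"

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical, symmetric sample
print(skewness(sample), label(skewness(sample)))
```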

Kurtosis: It measures how the data of a distribution are spread out
in its graph.

 It is a measure of the combined weight of a distribution's tails
relative to the center of the distribution.
 It describes the "tailedness" of the distribution.
 It also describes the shape of the distribution, and is sometimes
treated as a measure of its "peakedness".
 A high-kurtosis distribution has a sharper peak.
 Kurtosis values are typically reported roughly in the range of
-5 to +5.

Logistic: Data is predicted in binary form.
Polynomial: Data takes a polynomial form:
Y = a0 + a1x + a2x² + a3x³ + ...


Neural network:

Probability distributions: The list of all values of a random variable
together with the probability of each value occurring is called a
probability distribution.

If the random variable has a finite number of possible values, it is a
discrete probability distribution.
If the random variable has an infinite number of possible values, it
is a continuous probability distribution.

 Normal distribution
 Bernoulli distribution
 Uniform distribution
 Binomial distribution
 Poisson distribution
 Exponential distribution
 Standard normal distribution

Normal distribution: The normal distribution is also known as the
Gaussian distribution.

It is a probability distribution that is symmetric about the mean.

The mean, median and mode are exactly the same.

Bernoulli Distribution:

The Bernoulli distribution for a Bernoulli trial has only two possible
outcomes: success or failure.

For example, tossing a coin can only yield two outcomes: heads or
tails.

The probability of getting heads for a single unbiased coin toss is
p = 0.5, as there is an equal chance of either result.

Then (1 − p) = 0.5, so P(X = 1) = p = 1/2.
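A quick simulation of Bernoulli trials shows the observed frequency of heads approaching p (the seed is fixed only so the sketch is reproducible):

```python
# Simulating Bernoulli trials: unbiased coin tosses with p = 0.5.
import random

random.seed(42)          # fixed seed for reproducibility
p = 0.5
n = 10_000
heads = sum(1 for _ in range(n) if random.random() < p)

print(heads / n)         # close to 0.5 for a large number of tosses
```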

Uniform Distribution:

The uniform distribution for a discrete random variable is a
symmetrical probability distribution in which each of a finite
number of values is observed equally often.

For example, when we roll a die or toss an unbiased coin, each
possible outcome is equally likely.

For a random variable X taking k equally likely values, the uniform
distribution function can be defined as f(X = x) = 1/k.

For example, rolling an unbiased die gives 6 possible
values: {1, 2, 3, 4, 5, 6}.

There is an equally likely chance of getting any one of these values,
so f(X = x) = 1/6 (the probability of getting any particular value).

Poisson Distribution:

The Poisson distribution is a discrete distribution derived by the
mathematician Siméon Denis Poisson.

As the random variable is discrete, an event can only be measured
as occurring or not occurring.

A random variable X is said to follow a Poisson distribution when it
assumes only non-negative integer values and its probability
function is given by

P(X = x) = (e^(−λ) λ^x) / x!
λ = Poisson parameter (the average rate of occurrence)
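The Poisson probability function above translates directly into code; the rate λ = 3 below is a hypothetical example:

```python
# The Poisson probability function P(X = x) = e**(-lam) * lam**x / x!,
# evaluated for a few values and checked to sum to (almost) 1.
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.0   # hypothetical rate, e.g. 3 events per minute
for x in range(5):
    print(x, round(poisson_pmf(x, lam), 4))

total = sum(poisson_pmf(x, lam) for x in range(50))
print(total)   # ≈ 1.0, as the probabilities of all outcomes must sum to 1
```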

Binomial distribution:
The probability of getting x successes out of n independent trials,
each with the same probability of success.
Exponential: The exponential distribution is a continuous
probability distribution.
It often concerns the amount of time until some specific event
happens.

Standard Normal: It is a special normal distribution where the mean
is 0 and the standard deviation is 1.
Because any normal variable can be converted to it by
standardizing, it does not depend on a particular mean, variance,
or standard deviation.

Hypothesis Testing: A hypothesis test is a statistical test used to
determine whether there is enough evidence in a sample of data to
support an assumption about the population.

 It tests an assumption regarding a population parameter.
 This assumption is called the null hypothesis and is denoted
by H0.
 The alternative hypothesis (denoted Ha) is the opposite
of what is stated in the null hypothesis.
 The null hypothesis H0 is the exact assumption made about
the population.
 When the null hypothesis is rejected, the alternative hypothesis
is accepted.
 A sample rarely matches the assumption exactly, so the test
quantifies the evidence.
 It is generally used when we want to compare: a single group
with an external standard, or two or more groups with each
other.

There are two types of errors:

 Type 1 error: rejecting the null hypothesis even though it
is true.
For example, the examiner fails the student (H0 = the student
passes) even though the student really would pass.
 Type 2 error: failing to reject the null hypothesis even
though it is false.
For example, the examiner passes the student even though the
student really would fail.

 Take the example of a claim that running 5 miles a day will
lead to a reduction of 10 kg of weight within a month.
 This is the hypothesis or claim which is required to be
proved (or otherwise).
 The alternate hypothesis is formulated first, as the
statement that "running 5 miles a day will lead to a reduction
of 10 kg of weight within a month".
 Hence, the null hypothesis is the opposite of the
alternate hypothesis, stated as the fact that "running 5
miles a day does not lead to a reduction of 10 kg of weight
within a month".

Null hypothesis: Running 5 miles a day does not result in the
reduction of 10 kg of weight within a month.

Alternate hypothesis: Running 5 miles a day results in the
reduction of 10 kg of weight within a month.

Null Hypothesis (H0): A hypothesis that says there is no
statistical significance between the two variables in the
hypothesis.
Alternative Hypothesis (Ha): A hypothesis that says there is a
statistical significance between the two variables in the
hypothesis.
Test Statistic (X): A quantity derived from the sample, used in
hypothesis testing. We will calculate it in our example.
P-value: The probability of obtaining a value of the test
statistic at least as extreme as what we observed, under the
assumption that the null hypothesis is true.
 It is the probability of the observation given that the null
hypothesis is true.
 Using this value we either reject the null hypothesis
or accept it.
Significance level: The probability of rejecting the null
hypothesis when it is true.

If p-value <= alpha: reject the null hypothesis and accept the
alternate hypothesis.

If p-value > alpha: accept (fail to reject) the null hypothesis.

For example,

 Now let us go back to our example and conduct the test of


hypothesis to find out if the claim of the food delivery
company is true or not. To conduct the test, we have taken a
sample of 300 deliveries and found out that the average
delivery time is 35 minutes. It is difficult to conduct the test for
the whole data so in such cases, we take out a sample and
conduct a test on that.
 The Null Hypothesis in our case is that the delivery time on
average is a maximum of 30 minutes i.e. μ ≤ 30.
 Alternative Hypothesis in our case that on average it takes
more than 30 minutes to deliver food i.e. μ > 30.
 P-value: We calculate the p-value here as
P(X ≥ 35 | the null hypothesis is true). X is the test statistic,
which here is the observed average delivery time of 35
minutes.
 We assume that the p-value we got here is 3%. We calculate
the p-value using a method - Resampling and
permutation, which we will explore in detail in the next
article.
 Significance Level: We will choose a significance level of 5%
here. 5% is usually a set standard used in the industry.
 As we observe that our p-value of 3% is less than the
significance level of 5%, we will reject the null hypothesis and
accept the alternative hypothesis.
 So we can conclude that the alternative hypothesis is true and
the research’s finding is true that on average, the company
takes more than 30 minutes to deliver food.

Hypothesis Testing Example


The best way to solve a problem on hypothesis testing is by applying the 5 steps
mentioned in the previous section. Suppose a researcher claims that the mean
weight of men is greater than 100 kg, with a standard deviation of
15 kg. 30 men are chosen, with an average weight of 112.5 kg. Using hypothesis
testing, check if there is enough evidence to support the researcher's claim. The
confidence level is given as 95%.

Step 1: This is an example of a right-tailed test. Set up the null hypothesis
as H0: μ = 100.

Step 2: The alternative hypothesis is given by H1: μ > 100.

Step 3: As this is a one-tailed test, α = 100% − 95% = 5%. This can be used to
determine the critical value.

1 − α = 1 − 0.05 = 0.95

0.95 gives the required area under the curve. Now using a normal distribution
table, the area 0.95 is at z = 1.645. A similar process can be followed for a t-test.
The only additional requirement is to calculate the degrees of freedom given by
n − 1.

Step 4: Calculate the z test statistic. This is because the sample size is 30.
Furthermore, the sample and population means are known along with the
standard deviation.

z = (x̄ − μ) / (σ / √n)

μ = 100, x̄ = 112.5, n = 30, σ = 15

z = (112.5 − 100) / (15 / √30) ≈ 4.56

Step 5: Conclusion. As 4.56 > 1.645, the null hypothesis can be rejected.
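The Step 4 calculation can be checked with a few lines of code:

```python
# Computing the z statistic z = (xbar - mu) / (sigma / sqrt(n))
# for the researcher's sample, and comparing it to the critical value.
import math

mu, xbar, n, sigma = 100, 112.5, 30, 15
z = (xbar - mu) / (sigma / math.sqrt(n))
z_critical = 1.645          # right-tailed test at alpha = 0.05

print(round(z, 2))          # ≈ 4.56
print(z > z_critical)       # True, so reject the null hypothesis
```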
