Unit-4 DS Student
Model Development
Simple and Multiple Regression, Model Evaluation using Visualization, Residual Plot,
Distribution Plot, Polynomial Regression and Pipelines, Measures for In-sample
Evaluation, Prediction and Decision Making.
Model: a mathematical relationship (such as y = mx + c) that predicts a dependent variable from one or more independent variables.
Parameters: the unknown constants of the model (such as the intercept β0 and the slope β1) that are estimated from the data.
Concept
Linear regression models provide a simple approach to supervised learning. They are
simple yet effective.
Linear means arranged in or extending along a straight or nearly straight line. In other
words, linear suggests that the relationship between the dependent and independent
variables can be expressed as a straight line.
Recall the geometry lesson from high school. What is the equation of a line?
y = mx + c
● y is the dependent variable, i.e. the variable that needs to be estimated and
predicted.
● x is the independent variable, i.e. the variable that is controllable. It is the input.
● m is the slope. It determines the angle of the line. It is the parameter, denoted as β.
● c is the intercept, i.e. the value of y when x is 0.
Simple linear regression uses a single independent variable to predict a dependent
variable by fitting the best linear relationship.
⚫ Linear regression is one of the easiest and most popular Machine Learning
algorithms.
⚫ It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc.
⚫ Since linear regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.
⚫ The linear regression model provides a sloped straight line representing the
relationship between the variables.
⚫ Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, so we need to calculate the best values for a0 and a1 to find the
best-fit line. To calculate this we use a cost function.
⚫ Cost function:
⚫ Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, and the cost function is used to estimate the values of the
coefficients for the best-fit line.
⚫ We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable.
⚫ This mapping function is also known as Hypothesis function.
⚫ For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and the
actual values. It can be written as:
MSE = (1/N) Σ (Yi − (a1xi + a0))²
⚫ Where,
⚫ N = total number of observations
⚫ Yi = actual value
⚫ (a1xi + a0) = predicted value for observation i.
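As a quick illustration, here is a minimal sketch of how the MSE of a candidate line y = a1x + a0 could be computed in Python. The numbers are invented for the example and are not data from this unit.

```python
import numpy as np

# Hypothetical data: actual values y observed at inputs x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

a0, a1 = 1.0, 2.0                    # candidate intercept and slope

y_pred = a1 * x + a0                 # predicted values from the candidate line
mse = np.mean((y - y_pred) ** 2)     # average of the squared errors
print(mse)
```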
⚫ Residuals: The distance between an actual value and the corresponding predicted
value is called the residual.
⚫ If the observed points are far from the regression line, the residuals will be high,
and so the cost function will be high.
⚫ If the scatter points are close to the regression line, the residuals will be small,
and hence so will the cost function.
⚫ Below (later in this unit) are some important assumptions of Linear Regression.
These are formal checks to perform while building a Linear Regression model; they
help ensure the best possible result from the given dataset.
The dependent variable is the data that can’t be controlled, i.e. it needs to be predicted or estimated.
● What is the Best Fit?
● It can be of any shape depending on the number of independent variables (a point on
the axis, a line in two dimensions, a plane in three dimensions, or a hyperplane in
higher dimensions).
Least Squares Method: The best fit is done by making sure that the sum of all the
distances between the shape and the actual observations at each point is as small as
possible. The fit of the shape is “best” in the sense that no other position would
produce less error given the choice of shape.
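For illustration only, a least-squares line can be fitted in one call; the data below is invented for the sketch.

```python
import numpy as np

# Hypothetical observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 6.8, 9.2, 11.1])

# np.polyfit with degree 1 finds the slope and intercept that minimize the
# sum of squared vertical distances between the line and the points
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```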
Linear regression models are not perfect. They try to approximate the relationship between
the dependent and independent variables with a straight line. Approximation leads to errors.
Some errors can be reduced. Some errors are inherent in the nature of the problem and
cannot be eliminated. These are called irreducible error: the noise term in the true
relationship that cannot fundamentally be reduced by any model.
β0 and β1 are two unknown constants that represent the intercept and slope. They are the
parameters.
Formulation
Let us go through an example to explain the terms and workings of a Linear regression
model.
Fernando is a Data Scientist. He wants to buy a car. He wants to estimate or predict the car
price that he will have to pay. He has a friend at a car dealership company. He asks for prices
for various other cars along with a few characteristics of the car. His friend provides him with
some information.
First, Fernando wants to evaluate whether he can indeed predict car price based on engine size.
The first set of analyses seeks answers to questions such as: is engine size related to price,
and can a straight line fit the relationship?
Fernando does a correlation analysis. Correlation is a measure of how much two variables
are related. It is measured by a metric called the correlation coefficient. Its value is
between -1 and 1.
If the correlation coefficient is a large (> 0.7) positive number, it implies that as one variable
increases, the other variable increases as well. A large negative number indicates that as one
variable increases, the other variable decreases.
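As a hedged sketch (the numbers below are invented, not Fernando's actual dealership data), the correlation coefficient between engine size and price could be computed with pandas:

```python
import pandas as pd

# Hypothetical car data: engine size vs. price
df = pd.DataFrame({
    "engine_size": [97, 109, 130, 152, 181, 209],
    "price":       [7500, 9000, 12500, 15000, 19500, 23000],
})

# Pearson correlation coefficient; values near +1 or -1 indicate a strong
# linear relationship, values near 0 indicate a weak one
print(df["engine_size"].corr(df["price"]))
```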
He does a correlation analysis. He plots the relationship between price and engine size.
Following are the answers to the questions:
● A straight line can fit the data => a decent prediction of price can be made using
engine size.
Fernando now wants to build a linear regression model that will estimate the price of the car
based on engine size. Superimposing the line equation onto the car price problem, Fernando
formulates the following equation for price prediction:
price = β0 + β1 × engine size
Model
The data needs to be split into training and testing sets. The training data is used to learn
about the data and to create the model. The testing data is used to evaluate the model's
performance.
Fernando splits the data into a training and a test set: 75% of the data is used for training and
the remaining 25% for testing. He builds a linear regression model using a statistical package.
The model produces a linear equation that expresses the price of the car as a function of
engine size.
The model estimates the parameters:
● β0 is estimated as -6870.1
● β1 is estimated as 156.9
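A minimal sketch of how such a model could be fitted with scikit-learn is shown below. The file name car_prices.csv and the column names engine_size and price are assumptions for illustration; the original dataset is not provided here, so the printed estimates will not reproduce -6870.1 and 156.9 exactly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset with columns "engine_size" and "price"
df = pd.read_csv("car_prices.csv")          # assumed file name
X = df[["engine_size"]]
y = df["price"]

# 75% of the data for training, the rest for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.intercept_, model.coef_[0])     # estimates of beta0 and beta1
```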
Interpretation
The model provides the equation for predicting the average car price given a specific
engine size:
price = -6870.1 + 156.9 × engine size
This equation means the following: a one-unit increase in engine size increases the average
price of the car by 156.9 units.
Evaluation
The model is built. The robustness of the model needs to be evaluated. How can we be sure
that the model will be able to predict the price satisfactorily? This evaluation is done in two
parts: first, tests to establish the robustness of the model; second, tests to evaluate its
accuracy.
Fernando first evaluates the model on the training data and gets a set of summary statistics.
There are a lot of statistics there; let us focus on the key ones. Recall the discussion on
hypothesis testing: the robustness of the model is evaluated using hypothesis testing.
β1: The value of β1 determines the relationship between price and engine size. If β1 = 0 then
there is no relationship. In this case, β1 is positive, which implies that there is some
relationship between price and engine size.
t-stat: The t-stat value is how many standard deviations the coefficient estimate (β1) is away
from zero. The further it is from zero, the stronger the relationship between price and engine
size and the more significant the coefficient. In this case, the t-stat is 21.09, which is far
enough from zero.
p-value: The p-value is a probability. It indicates the chance of seeing the given t-statistic
under the assumption that the NULL hypothesis is true. If the p-value is small, e.g. < 0.0001,
it implies that the probability that this result is by chance and there is no relation is very low.
In this case, the p-value is small. It means that the relationship between price and engine size
is not by chance.
With these metrics, we can safely reject the NULL hypothesis and accept the alternate
hypothesis. There is a robust relationship between price and engine size.
The relationship is established. How about accuracy? How accurate is the model? To get a
feel for the accuracy of the model, a metric named R-squared or coefficient of
determination is important.
● Error (e) is the difference between the actual y and the predicted y. The
predicted y is denoted as ŷ. This error is evaluated for each observation. These
errors are also called residuals.
● Then all the residual values are squared and added. This term is called
the Residual Sum of Squares (RSS). The lower the RSS, the better.
● There is another part of the R-squared equation. To get the other part, first
the mean value of the actual target is computed, i.e. the average price of the
car is estimated. Then the differences between the mean value and the actual
values are calculated, squared, and added. This is the Total Sum of Squares
(TSS). R-squared is then computed as R² = 1 − RSS/TSS.
In the example above, RSS is computed based on the predicted price for three cars. RSS
value is 41450201.63. The mean value of the actual price is 11,021. TSS is calculated as
44,444,546. R-squared is computed as 6.737%. For these three specific data points, the model
is only able to explain 6.73% of the variation. Not good enough!!
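Using the RSS and TSS values quoted above, the R-squared figure can be reproduced directly from the formula R² = 1 − RSS/TSS:

```python
# R-squared from the residual sum of squares and total sum of squares
rss = 41450201.63          # values taken from the three-car example above
tss = 44444546.0

r_squared = 1 - rss / tss
print(round(r_squared * 100, 3))   # roughly 6.737 (percent)
```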
However, for Fernando’s model, it is a different story. The R-squared for the training set
is 0.7503, i.e. 75.03%. It means that the model can explain more than 75% of the variation.
⚫ No multicollinearity: the model assumes little or no multicollinearity between the
features or independent variables.
⚫ Homoscedasticity assumption:
⚫ Homoscedasticity is a situation when the error term is the same for all values of
the independent variables.
⚫ Normal distribution of error terms:
⚫ If the error terms are not normally distributed, confidence intervals will become
either too wide or too narrow, which may cause difficulties in finding coefficients.
⚫ This can be checked using the Q-Q plot. If the plot shows a straight line without
any deviation, it means the errors are normally distributed (a small Q-Q plot
sketch appears after these assumption bullets).
⚫ No autocorrelation:
⚫ The linear regression model assumes no autocorrelation in the error terms.
⚫ If there is any correlation in the error terms, it will drastically reduce the
accuracy of the model.
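As a small, hedged sketch of the Q-Q check mentioned above, using synthetic residuals rather than residuals from any model in this unit:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals from a fitted regression model
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# A Q-Q plot compares the residual quantiles with normal quantiles;
# points lying on a straight line suggest normally distributed errors
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```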
Multiple Linear Regression
This method uses more than one independent variable to predict a dependent variable by
fitting the best linear relationship.
⚫ Each feature variable must model a linear relationship with the dependent
variable.
In the case of Multiple Regression, the general equation is Y = b0 + b1x1 + b2x2 + ... + bnxn,
and the parameters can be found in the same way as in simple linear regression, by minimising
the cost function (for example, with the gradient descent procedure described later in this unit).
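A minimal scikit-learn sketch of multiple linear regression follows; the file name and the predictor columns (engine_size, horsepower, curb_weight) are assumptions for illustration only:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset with several predictors of price
df = pd.read_csv("car_prices.csv")                      # assumed file name
X = df[["engine_size", "horsepower", "curb_weight"]]    # assumed columns
y = df["price"]

# Ordinary least squares fit with multiple independent variables
model = LinearRegression().fit(X, y)
print(model.intercept_)     # b0
print(model.coef_)          # b1, b2, b3 for the three predictors
```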
4. Stepwise Regression
This regression model is used when we have more than one independent variable. It uses an
automatic procedure to select the important independent variables, with no human
intervention.
● Forward selection: begin with no variables in the model. Fit p simple
regressions, one for each of the p candidate variables, and choose the variable
whose model is best.
● Similarly, search through the remaining p-1 variables one by one, but this time
with the variable selected in the previous step already in the model. Now
choose the best among these p-1 models.
● Continue until some stopping rule is satisfied, such as a threshold on the
number of variables to be selected.
● Backward elimination: start with all p variables in the model, then remove the
variable with the largest p-value, i.e. the least significant predictor.
● The new model will have (p-1) variables. Remove the variable with the largest
p-value again.
● Continue until some stopping rule is satisfied, such as all remaining variables
having a p-value smaller than a threshold value.
As we can see, Stepwise Linear Regression applies Multiple Linear Regression multiple
times, selecting the important variables or removing the least significant predictors each
time (a small code sketch of the backward procedure appears after the notes below).
Note 1: For Backward Stepwise Linear Regression or Multiple Linear Regression to work
fine, the number of observations (n) should be more than the number of variables (p). This is
because we can do least squares regression only when n is greater than p; for p greater than n,
the least squares model is not even defined.
Note 2: Automatic procedures may not choose the right significant variables from a practical
point of view, as they don’t have the special knowledge the analyst might have.
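Below is a rough sketch of backward elimination using p-values from statsmodels. The helper function backward_eliminate and the 0.05 threshold are illustrative choices, not a procedure prescribed in this unit.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y: pd.Series, threshold: float = 0.05):
    """Repeatedly drop the predictor with the largest p-value above the threshold."""
    X = sm.add_constant(X)                      # add the intercept column
    while True:
        results = sm.OLS(y, X).fit()
        pvals = results.pvalues.drop("const")   # ignore the intercept's p-value
        if pvals.empty or pvals.max() <= threshold:
            return results                      # all remaining predictors significant
        X = X.drop(columns=[pvals.idxmax()])    # remove the least significant one
```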
3. Logistic Regression
⚫ Logistic regression is another supervised learning algorithm which is used to solve
classification problems. In classification problems, we have dependent variables in a
binary or discrete format, such as 0 or 1.
⚫ The logistic regression algorithm works with categorical variables such as 0 or 1, Yes
or No, True or False, Spam or Not Spam, etc.
⚫ Logistic regression is a type of regression, but it differs from the linear regression
algorithm in terms of how it is used.
⚫ The sigmoid function is used to model the data in logistic regression. The function
can be represented as:
f(x) = 1 / (1 + e^(-x))
⚫ When we provide the input values (data) to the function, it gives an S-shaped curve.
⚫ It uses the concept of a threshold level: values above the threshold level are rounded
up to 1, and values below the threshold level are rounded down to 0.
Polynomial Regression
⚫ It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and the corresponding conditional values of y.
⚫ This means the data points are best fitted using a polynomial line.
⚫ The equation for polynomial regression is also derived from the linear regression
equation: the linear regression equation Y = b0 + b1x is transformed into the
polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
⚫ The model is still linear, because the coefficients remain linear even though the
features include quadratic (and higher-degree) terms.
⚫ This is different from Multiple Linear Regression in that, in Polynomial Regression,
a single variable appears with different degrees instead of multiple variables with
the same degree.
⚫ It is also called a special case of Multiple Linear Regression in ML, because we
add some polynomial terms to the Multiple Linear Regression equation to convert it
into Polynomial Regression.
⚫ A simple linear model works well on a linear dataset, but if we apply the same model
without any modification to a non-linear dataset, it will produce very poor output.
⚫ The loss function will increase, the error rate will be high, and accuracy will
decrease.
⚫ So for such cases, where data points are arranged in a non-linear fashion, we need
the Polynomial Regression model.
⚫ We can understand this better by comparing a linear dataset and a non-linear dataset.
⚫ If we try to cover a non-linear dataset with a linear model, we can clearly see that it
hardly covers any data points.
⚫ On the other hand, a curve is suitable to cover most of the data points, which is what
the Polynomial model does.
⚫ Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.
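Since the syllabus pairs Polynomial Regression with Pipelines, here is a minimal sketch of the usual scikit-learn approach: generate polynomial features, then fit an ordinary linear model inside a pipeline. The synthetic quadratic data is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows a quadratic in x
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

# Pipeline: generate polynomial terms, then fit an ordinary linear model;
# the model is still linear in the coefficients b0, b1, b2
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict([[1.5]]))
```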
Gradient Descent
When there are one or more inputs you can use a process of optimizing the values of the
coefficients by iteratively minimizing the error of the model on your training data.
This operation is called Gradient Descent and works by starting with random values for each
coefficient. The sum of the squared errors is calculated for each pair of input and output
values. A learning rate is used as a scale factor, and the coefficients are updated in the
direction that minimizes the error. The process is repeated until a minimum sum of squared
errors is achieved or no further improvement is possible.
When using this method, you must select a learning rate (alpha) parameter that determines
the size of the improvement step to take on each iteration of the procedure.
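A bare-bones sketch of this procedure for a single-input line y = a1x + a0 is shown below; the data, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

# Hypothetical data roughly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

a0, a1 = 0.0, 0.0          # starting (here zero) coefficient values
alpha = 0.01               # learning rate: size of each improvement step

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # gradients of the mean squared error with respect to a0 and a1
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= alpha * grad_a0
    a1 -= alpha * grad_a1

print(a0, a1)              # should approach the least-squares estimates
```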
Regularization
There are extensions of the training of the linear model called regularization methods. These
seek both to minimize the sum of the squared error of the model on the training data (using
ordinary least squares) and to reduce the complexity of the model (such as the number or
absolute size of the coefficients in the model). Two popular examples are:
● Lasso Regression: where Ordinary Least Squares is modified to also minimize the
absolute sum of the coefficients (called L1 regularization).
● Ridge Regression: where Ordinary Least Squares is modified to also minimize the
sum of the squared coefficients (called L2 regularization).
These methods are effective to use when there is collinearity in your input values and
ordinary least squares would overfit the training data.
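A short hedged sketch contrasting the two penalties on a deliberately collinear, synthetic dataset (the alpha values are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data with two strongly correlated inputs
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# alpha controls how strongly large coefficients are penalized
print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2 penalty shrinks coefficients
print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1 penalty can drive some to zero
```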
Model evaluation through Visualization
One of the most important parts of any Data Science/ML project is model validation. For
regression, there are numerous methods to evaluate the goodness of fit, i.e. how well the
model fits the data. R² values are just one such measure, but they are not always the best at
making us feel confident about our model.
Residuals
A residual is a measure of how far away a point is vertically from the regression line. Simply,
it is the error between a predicted value and the observed actual value.
Residual equation: e = y − ŷ (observed value minus predicted value).
Figure 1 is an example of how to visualize residuals against the line of best fit. The vertical
lines are the residuals.
Residual Plots
A typical residual plot has the residual values on the y-axis and the independent variable on
the x-axis. Figure 2 below is a good example of what a typical residual plot looks like.
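For illustration, seaborn's residplot can draw such a plot directly; the data below is synthetic, not taken from this unit.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data with a linear trend plus random noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 5.0 + rng.normal(scale=1.5, size=100)

# residplot fits a regression internally and plots residuals (y-axis)
# against the independent variable (x-axis)
sns.residplot(x=x, y=y)
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.show()
```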
The most important assumption of a linear regression model is that the errors are
independent and normally distributed.
Every regression model inherently has some degree of error, since you can never predict
something 100% accurately. More importantly, randomness and unpredictability are always
a part of the regression model. Hence, a regression model can be expressed as:
observed value = deterministic component (model prediction) + random error (residual)
Ideally, our linear equation model should accurately capture the predictive information.
Essentially, what this means is that if we capture all of the predictive information, all that is
left behind (residuals) should be completely random & unpredictable i.e stochastic. Hence,
we want our residuals to follow a normal distribution. And that is exactly what we look for in
a residual plot.
A good residual plot has two main characteristics:
1. It has a high density of points close to the origin and a low density of points
away from the origin.
2. It is symmetric about the origin.
To explain why Fig. 3 is a good residual plot based on the characteristics above, we project
all the residuals onto the y-axis. As seen in Figure 3b, we end up with a normally distributed
curve; satisfying the assumption of the normality of the residuals.
Finally, one other reason this is a good residual plot is that, independent of the value of the
independent variable (x-axis), the residual errors are approximately distributed in the same
manner. In other words, we do not see any patterns in the value of the residuals as we move
along the x-axis.
Hence, this satisfies our earlier assumption that regression model residuals are independent
and normally distributed.
Using the characteristics described above, we can see why Figure 4 is a bad residual plot.
This plot has high density far away from the origin and low density close to the origin. Also,
when we project the residuals on the y-axis, we can see the distribution curve is not normal.
Example of Bad Residual plot
It is important to understand here that these plots signify that we have not completely
captured the predictive information of the data in our model, which is why it is “seeping” into
our residuals. A good model should always have only random error left after using the
predictive information.
Data distributions
You may have noticed that numerical data is often summarized with the average value. For
example, the quality of a high school is sometimes summarized with one number: the average
score on a standardized test. Occasionally, a second number is reported: the standard
deviation. For example, you might read a report stating that scores were 680 plus or minus 50
(the standard deviation). The report has summarized an entire vector of scores with just two
numbers. Is this appropriate? Is there any important piece of information that we are missing
by only looking at this summary rather than the entire list?
Our first data visualization building block is learning to summarize lists of factors or numeric
vectors. More often than not, the best way to share or explore this summary is through data
visualization. The most basic statistical summary of a list of objects or numbers is its
distribution. Once a vector has been summarized as a distribution, there are several data
visualization techniques to effectively relay this information.
In this unit, we first discuss properties of a variety of distributions and how to visualize
distributions using a motivating example of student heights. We then discuss
the ggplot2 geometries for these visualizations.
Variable types
We will be working with two types of variables: categorical and numeric. Each can be
divided into two other groups: categorical can be ordinal or not, whereas numerical variables
can be discrete or continuous.
When each entry in a vector comes from one of a small number of groups, we refer to the
data as categorical data. Two simple examples are sex (male or female) and regions
(Northeast, South, North Central, West). Some categorical data can be ordered even if they
are not numbers per se, such as spiciness (mild, medium, hot). In statistics textbooks, ordered
categorical data are referred to as ordinal data.
Examples of numerical data are population sizes, murder rates, and heights. Some numerical
data can be treated as ordered categorical. We can further divide numerical data into
continuous and discrete. Continuous variables are those that can take any value, such as
heights, if measured with enough precision. For example, a pair of twins may be 68.12 and
68.11 inches, respectively. Counts, such as population sizes, are discrete because they have to
be round numbers.
Keep in mind that discrete numeric data can be considered ordinal. Although this is
technically true, we usually reserve the term ordinal data for variables belonging to a small
number of different groups, with each group having many members. In contrast, when we
have many groups with few cases in each group, we typically refer to them as discrete
numerical variables. So, for example, the number of packs of cigarettes a person smokes a
day, rounded to the closest pack, would be considered ordinal, while the actual number of
cigarettes would be considered a numerical variable. But, indeed, there are examples that can
be considered both numerical and ordinal when it comes to visualizing data.
Let’s start by understanding what exactly a distribution means. The term “distribution” in data
science or statistics usually means a probability distribution. A distribution is nothing but a
function which provides the possible values of a variable and how often they occur. A
probability distribution is a mathematical function which provides the probabilities of
occurrence of the various possible outcomes of an experiment.
There are many types of probability distributions, but the following are the ones mainly considered in regression:
1. Normal distribution
2. Binomial distribution
3. Bernoulli distribution
4. Uniform distribution
5. Poisson distribution
Normal distribution:
● Let’s consider a random variable X that belongs to a normal distribution with mean μ
and standard deviation σ. If we plot the histogram or PDF (probability density
function) of the random variable, it looks like a bell curve.
● Following are three important properties of the normal distribution. These properties
are also called the Empirical rule.
1. The probability that the variable falls within a range of 1 standard deviation of the mean
is equal to 68%. It means 68% of the data points belonging to X fall within 1 standard deviation.
2. The probability that the variable falls within a range of 2 standard deviations is equal to 95%.
95% of the data points belonging to X fall within 2 standard deviations.
3. The probability that the variable falls within a range of 3 standard deviations is equal to 99.7%.
99.7% of the data points belonging to X fall within 3 standard deviations.
During Exploratory Data Analysis, we try to plot the features. If a feature makes a bell curve,
then all the above properties apply to it, because it follows a normal (Gaussian) distribution.
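A small sketch (synthetic data) that draws a distribution plot and checks the empirical rule numerically:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic feature drawn from a normal distribution
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=100_000)

# Distribution plot: histogram with a smoothed density curve (bell shape)
sns.histplot(X, kde=True)
plt.show()

# Empirical rule check: fraction of points within 1, 2 and 3 standard deviations
mean, sd = X.mean(), X.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(X - mean) <= k * sd)
    print(f"within {k} SD: {within:.3f}")   # roughly 0.68, 0.95, 0.997
```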
OSEMN Pipeline
● O — Obtaining our data
● S — Scrubbing / cleaning our data
● E — Exploring / visualizing our data will allow us to find patterns and trends
● M — Modeling our data to gain predictive power
● N — iNterpreting our data and communicating the results
Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets
that are typically huge in amount. The field encompasses analysis, preparing data for
analysis, and presenting findings to inform high-level decisions in an organization. As such, it
incorporates skills from computer science, mathematics, statistics, information visualization,
graphic design, and business.
In simple words, a pipeline in data science is “a set of actions which changes the raw (and
confusing) data from various sources (surveys, feedbacks, list of purchases, votes, etc.), to
an understandable format so that we can store it and use it for analysis.”
But besides storage and analysis, it is important to formulate the questions that we will solve
using our data. And these questions would yield the hidden information which will give us
the power to predict results, just like a wizard. For instance:
After getting hold of our questions, now we are ready to see what lies inside the data science
pipeline. When the raw data enters a pipeline, it’s unsure of how much potential it holds
within. It is we data scientists, waiting eagerly inside the pipeline, who bring out its worth by
cleaning it, exploring it, and finally utilizing it in the best way possible. So, to understand its
journey let’s jump into the pipeline.
The raw data undergoes different stages within a pipeline which are:
1. Obtaining data
This stage involves identifying data from the internet or internal/external databases
and extracting it into useful formats. Prerequisite skills:
2. Scrubbing / cleaning data
This is the most time-consuming stage and requires more effort. It is further divided into two
stages:
● Examining Data:
● identifying errors
● Cleaning of data:
● Coding language: Python, R.
3. Exploring / visualizing data
When data reaches this stage of the pipeline, it is free from errors and missing values, and
hence is suitable for finding patterns using visualizations and charts.
Prerequisite skills:
● R: GGplot2, Dplyr.
4. Modeling data
This is the stage of the data science pipeline where machine learning comes into play. With the
help of machine learning, we create data models. Data models are nothing but general rules in
a statistical sense, which are used as a predictive tool to enhance our business decision-making.
Prerequisite skills:
● Evaluation methods.
5. Interpreting data
This stage is similar to paraphrasing your data science model. Always remember: if you can’t
explain it to a six-year-old, you don’t understand it yourself. So communication becomes the
key! This is the most crucial stage of the pipeline, where, with the use of psychological
techniques, correct business domain knowledge, and your immense storytelling abilities, you
can explain your model to a non-technical audience.
Prerequisite skills:
Measures for In-sample Evaluation
Video -- https://www.coursera.org/lecture/data-analysis-with-python/measures-for-in-sample-evaluation-h6K6H
Prediction and Decision Making
Video -- https://www.coursera.org/lecture/data-analysis-with-python/prediction-and-decision-making-4Li1D
Here’s how you can use Data Science for better decision making:
1. Making a Personal Movie Recommender
Netflix recommends new movies to you based on your viewing history. But there’s a
limitation: Netflix can only recommend the movies that it has in its library. Let’s create a
bigger list of the movies we have watched and want to watch.
By making a personal movie recommender system, you can avoid all of this hassle.
This may be a bit complicated for a beginner.
But, if you have an idea of basic ML models you can try this out in a different way.
Create a dataset by feeding-in all the movies you have watched along with their
IMDB scores, genre, major actors, language, director, etc. and give them a personal
rating out of 10. Use this personal rating as the target variable, pick a validation
approach, and then use the appropriate modeling technique on it.
In a similar manner, gather a list of all the movies that you want to watch and using
the above-fitted model, try to get the predicted ratings on each of them!
What’s next? Start watching these movies in descending order of predicted rating and enjoy!
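A very rough sketch of this idea follows. The file names my_movies.csv and watchlist.csv, their columns, and the choice of a random forest are all assumptions for illustration, not a prescribed recipe.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical file of movies you have watched, with your rating out of 10
movies = pd.read_csv("my_movies.csv")               # assumed file and columns
X = pd.get_dummies(movies[["imdb_score", "genre", "language"]])
y = movies["my_rating"]                             # personal rating = target

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("validation score:", model.score(X_val, y_val))

# Then prepare the unwatched list ("watchlist.csv") with the same columns
# and call model.predict(...) to get predicted ratings for each movie.
```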
2. Doing a Self-Analysis
I have always been this guy who would avoid his emotions and would distract
himself if something serious would pop into his head. The pandemic made me realize
how important it is to face your emotions, scrutinize them and then let them go.
So, I am thinking about creating a journal wherein I would write the majority of
things I did throughout the day followed by a review of the day. You can do this too!
Learn some basic techniques for analysing textual data, also known as Natural
Language Processing (NLP), and let’s try to put them into action. A sentiment
analysis can be done to understand how you were feeling when you wrote an
entry and how often you feel like that.
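As a tiny hedged sketch, NLTK's VADER analyzer can score the sentiment of a journal entry; the example text is invented.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")          # one-time download of the lexicon
sia = SentimentIntensityAnalyzer()

entry = "Felt anxious in the morning but the evening walk really helped."
scores = sia.polarity_scores(entry)     # negative, neutral, positive, compound
print(scores["compound"])               # > 0 leans positive, < 0 leans negative
```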
3. Making a Chatbot
When the COVID pandemic broke out, everything was affected: offices had to be
closed because we were all supposed to stay inside our homes, and of course call
centres and service centres were closed as well. With a limited number of people
at their disposal, companies had to look for alternatives which could make
their jobs easier. Many companies used chatbots to talk to us and record our
problems. If the solution was easy and could be provided without any human assistance,
the chatbot itself would recommend some solutions.
You can also make a chatbot that answers for you when you’re busy. Let’s say that
you’re at the office and someone texts you asking where you are; you don’t have to leave
your work and sit back to reply to that person. The chatbot that you program can
handle that effectively for you.
You can check this out to know more about Chatbots and how to implement them.
4. Choosing a Smartphone
You are thinking about buying a smartphone and you google “Best phones under
35k”. You open the first link and the blog lists 5 choices for you, and the confusion
game starts. Also, it doesn’t talk about the features that matter to you.
To be honest, only you know what matters to you and those blogs are written from a
generic point of view. So how do you choose the best (for yourself) among thousands
of different options?
Let me tell you how an analyst, i.e. you, could do that job. Prepare a list of all kinds of
smartphones available in the market along with their latest prices. You could do this
activity efficiently if you are aware of web scraping techniques.
Now you have a structured dataset in front of you! What are you waiting for? Start
digging! Apply all sorts of filters on battery, storage, price, etc. and voilà! You have
the ‘Best smartphones under 35k.’
For these shortlisted smartphones, you can gather the top 500 or 1000 reviews and
then analyse them using NLP techniques.
5. Investment
You have a great job, but saving money is a task for you. And to be very honest, your
money will not multiply if you keep the majority of it in your savings account; you will
not even be able to catch up with the inflation rate.
Stock Investing sounds very fancy right? By the way, it’s not just for investment
bankers. Anyone can do it! If you choose the right set of stocks, in just one month,
you could earn an amount of money equivalent to what you would get had you put
this money in a Savings Bank account.
Pick up various stocks, diversify your portfolio and make efficient use of your Data
skills.
It should be mentioned, however, that having data science alone is not always enough. As
Irina Peregud, of InData Labs, explains: “Data scientists analyze data to find insights but it’s
the job of product managers and business leaders to tell them what to look for.” Essentially,
business leaders and heads of governmental institutions need to know what the problem is
before they send in their data science troops to try and solve it. Data scientists are able to dig
up masses of information but it’s worth nothing unless they are led by someone who
understands the setting in which they are working — a leader with industry experience. Goals
need to be clearly set before data scientists are able to theorize ways to reach them.
Automatization
Building automated response systems is often seen as an end goal for many business leaders
seeking to invest in data science. Many small decisions can be automated with ease when the
right data is collected and utilized. For example, many banks that grant loans have for many
years been using credit scoring systems to predict their clients’ ‘credit-worthiness’.
However, now, with the aid of data science, they are able to do this with a much higher
degree of accuracy, which has relieved their employees of some of the decision-making
process, lowered the possibility of not getting a return on their loans if the customer was not
‘worthy’, and also sped up the process.
On top of that, data science is also able to help automate much more complicated
decision-making processes, with the ability to provide numerous solid directions to choose
from with data as evidence for those possibilities. Using data science, it is possible to forecast
the impacts of decisions that are yet to even be made. There are many examples of this, but
perhaps some of the best known are of companies on the brink of collapse that placed their
trust in data science and were saved by remodeling their company based on what the data told
them would work, such as Dunkin’ Donuts and Timberland. The former invested in a loyalty
system and the latter invested in identifying its ideal customer. Having the data to back big
decisions such as these, allows decision makers to feel more confident in what they are doing
and invest more in the idea financially as well as psychologically.
Healthcare is another area where data science has shown itself to be highly beneficial for
decision-making in a variety of sectors. Obviously, providing adequate treatment is the
number one priority. Many healthcare providers are now moving towards evidence-based
medicine, which when used in conjunction with data science, enables physicians to provide
patients with a more personalized experience by accessing a larger pool of sources before
making a decision on treatment.
Health and life insurance are other areas of healthcare that benefit significantly from data
science. Similar to how banks grant loans based on a ‘credit-worthiness’ score as mentioned
above, health and life insurers are able to develop ‘well-being’ scores. To develop such a
score can involve the collection of data from a large number of places, including social
media, financial transactions, and even body sensors. This can also be seen throughout the
insurance industry as a whole. In fact, as Datafloq explains: “Insurance companies rely on
growing their number of customers by adapting insurance policies for each individual
customer”. By using data science, insurers are able to grant more personalized insurance
schemes that work for both the customer and the insurer. This makes the decision-making
process easier as it is no longer a question of insurers saying ‘yes’ or ‘no’ to customers, but
about questioning the terms that will work for both parties involved.
Smoother Operations
Data science is also well known to those who aim to improve operational standards. By
applying data science to operational procedures, decision makers are able to implement
changes much more efficiently and monitor if they are successful or not much more closely
through trial and error. Such methods can be applied to hiring and firing employees by
collecting and measuring data to see who fits the job best, as well as measuring performance
targets to see who really deserves to get promoted. On top of this, it helps employers see
where work is really needed and where it can be cut.
William Edwards Deming once said: “In God we trust. All others must bring data.” Though
he passed away more than 24 years ago, his words now hold more truth today than when they
were spoken. With the aid of data science, decision makers — whatever industry they may be
in — can make much more precise choices than ever before. Or, in some situations, wipe out
the whole decision-making process by automating it.
By harnessing data science to its full potential, top-ranking decision makers in all industries,
not only make better-informed decisions but make them with clearer predictions of the future.
With that advantage on their side, they are able to stabilize businesses that have not always
had a clear vision and save businesses that are on the brink of collapse.
However, it should be mentioned again that data science is only an advantage when decision
makers know there is a problem to be solved and can give the data scientists under their
leadership goals to aim for. Once goals have been established, data scientists can work their
magic and theorize how to fix it. Data science alone is not an advantage for decision-making,
data science combined with good leadership is.
Prediction
Predictive analytics uses historical data to predict future events. Typically, historical data is
used to build a mathematical model that captures important trends. That predictive model is
then used on current data to predict what will happen next, or to suggest actions to take for
optimal outcomes.
Predictive analytics has received a lot of attention in recent years due to advances in
supporting technology, particularly in the areas of big data and machine learning.
Your aggregated data tells a complex story. To extract the insights it holds, you need an
accurate predictive model.
Prescriptive analytics is a branch of data analytics that uses predictive models to suggest
actions to take for optimal outcomes. Prescriptive analytics relies on optimization and
rules-based techniques for decision making. Forecasting the load on the electric grid over the
next 24 hours is an example of predictive analytics, whereas deciding how to operate power
plants based on this forecast represents prescriptive analytics.