Sales and Advertising


Introduction

In this research, I build a simple linear regression model to examine the linear
relationship between a data-set of sales and advertising for a dietary weight control
product. I cover the fundamentals of linear regression and its implementation in
the Python programming language using Scikit-Learn. Scikit-Learn is a
popular machine learning library for Python.

Python libraries
I have PyCharm installed on my system, along with a standard Python distribution. It
provides most of the standard Python libraries I need for this research. The basic
libraries used in this research are:

 NumPy
 pandas
 Scikit-Learn
 Matplotlib
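
These can be imported with their conventional aliases; a minimal sketch (the aliases np, pd, and plt are common conventions, not requirements):

# Standard imports used throughout this research
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression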

The problem statement


The goal of building this machine learning model is to solve the problem stated
below and to compute performance metrics that assess the model's quality. The first
thing to do is introduce the problem to be solved in this research. As mentioned
above, the task is to model and investigate the linear relationship between the
sales and advertising data set for a dietary weight control product. I use a
couple of performance metrics, the Root Mean Square Error and the R2 Score, to
measure how the model is performing.
So here we build a Linear Regression model to examine the relationship between
the Sales and Advertising data for a dietary weight control product.

Solution
Khatri et al. (2020) offered a machine learning-based strategy for identifying the
relation between sales and advertising. The authors used a dataset regarding sales
and advertising, which includes recent sales data from European customers. To
resolve the dataset's class imbalance, the researchers applied the Synthetic Minority
Oversampling Technique (SMOTE). The following machine learning techniques were
employed to assess the effectiveness of the proposed approach: Random Forest (RF),
Naive Bayes (NB), and multilayer perceptron (MLP). The testing results demonstrated
that the RF algorithm performed admirably, with a 99.96% accuracy rate for relation
finding. The NB and MLP techniques scored 99.23% and 99.93% accuracy,
respectively. The authors acknowledge that more research is required to create a
feature selection technique that could improve the precision of current ML systems
(Khatri et al., 2020).

Linear Regression
Linear Regression is a statistical technique used to discover the linear
relationship between a dependent variable and one or more independent variables.
This approach is relevant for supervised learning regression problems, where we try
to predict a continuous variable. Linear Regression may be further categorized into
two types: Simple and Multiple Linear Regression. In this research, I am using the
Simple Linear Regression (SLR) technique, in which I have one independent and one
dependent variable. It is the simplest form of Linear Regression, where we fit a
straight line to the data.

Independent and Dependent Variables


In this research, I refer to the independent variable as the feature variable and the
dependent variable as the target variable. These variables are also known by other
names, as follows:-

Independent variable
The independent variable is also called the explanatory variable and is denoted by X. In
practical applications, the independent variable is also called the feature variable or
predictor variable. We can denote it as:
Independent or explanatory variable (X) = Feature variable = Predictor variable

Dependent variable
The dependent variable is also called the outcome variable and is denoted by y. The
dependent variable is also called the target variable or response variable. It can be
denoted as follows:-

Dependent or outcome variable (y) = Target variable = Response variable

Simple Linear Regression (SLR)


SLR is the simplest model in machine learning. It models the linear relationship
between the independent and dependent variables.

In this research, there is one independent or explanatory variable, which represents
the Sales data and is denoted by X. Similarly, there is one dependent or outcome
variable, which represents the Advertising data and is denoted by y. We want to build
a linear relationship between these variables. This linear relationship can be modeled
by a mathematical equation of the form:-

Y = β0 + β1*X ------------- (1)


In this equation, X and Y are called the independent and dependent
variables respectively, β1 is the coefficient of the independent variable and β0 is the
constant term. β0 and β1 are called the parameters of the model.

For simplicity, we can compare the above equation with the basic
equation of a straight line of the form:-

y = ax + b ----------------- (2)

We can see that
the slope of the line is given by a = β1, and
the intercept of the line by b = β0.
In this SLR model, we want to fit a line which approximates the linear relationship
between X and Y. So, the question of fitting reduces to estimating the parameters of
the model, β0 and β1.

Ordinary Least Squares Method


As I have stated earlier, the Sales and Advertising data are given by X and y
respectively. If we draw a scatter plot of X and y, it produces a diagram which
looks as follows:-

Now, our task is to find the line which best fits the above scatter plot. This line will help
us to predict the value of the target variable for any given feature variable.
This line is called the regression line. We can define an error function for any line. Then,
the regression line is the one which minimizes the error function. Such an error
function is also called a Cost function.

Cost Function
We want the above line to represent the data-set as closely as possible. In other
words, we want the line to be as close to the actual data points as possible. This can
be achieved by minimizing the vertical distance between each actual data point and
the fitted line. This vertical distance between a data point and the line is called the
residual. So, in a regression model, we try to minimize the residuals by finding the
best-fit line.

A diagrammatic representation of this analysis is given below.


In this plot, the residuals are represented by the vertical dotted lines from actual data
points to the line.
We can try to minimize the sum of the residuals, but then a large positive residual
would cancel out a large negative residual. For this reason, we minimize the sum of
the squares of the residuals.
Mathematically, we denote the actual data points by yi and the predicted data points
by ŷi. So the residual for data point i is given by

di = yi - ŷi

The sum of the squares of the residuals is given as:

D = Σ di² for all data points

This is the Cost function. It indicates the total error present in the model, which is the
sum of the errors of each individual data point. We can visualize it as follows:-
We can estimate the model parameters β0 and β1 by minimizing the error in the
model, that is, by minimizing D. Thus, we can find the regression line given by equation (1).
This method of finding the parameters of the model, and thus the regression line, is
called the Ordinary Least Squares (OLS) method.
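
Minimizing D yields the well-known closed-form OLS estimates, β1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and β0 = ȳ - β1*x̄. As a worked illustration, here is a minimal NumPy sketch; the toy arrays x and y are illustrative assumptions, not the research data:

import numpy as np

# Toy data for illustration only (not the Sales/Advertising data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates obtained by minimizing D = Σ di²
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)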

About the dataset


The data set has been imported from the econometrics website at the following link:
http://www.econometrics.com/intro/SALES.txt
This data set contains the Sales and Advertising expenditures for a dietary weight
control product. It contains monthly data for 36 months. The variables in this data set
are Sales and Advertising.

# Import the data
import pandas as pd

url = "http://www.econometrics.com/intro/SALES.txt"
df = pd.read_csv(url, sep='\t', header=None)

Methodology

Data analysis
First, I import the data-set into a data-frame using the standard read_csv() function
of the pandas library and assign it to the df variable. I then performed exploratory
data analysis to get a sense of the data. I checked the dimensions of the data-frame
using its shape attribute and looked at the top 5 rows using the pandas head()
method. I viewed the data-frame summary using the pandas info() method and the
descriptive statistics using the describe() method.
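
A minimal sketch of these exploratory steps; the column names 'Sales' and 'Advertising' are an assumption about the file's column order:

# Exploratory data analysis on the imported data-frame
df.columns = ['Sales', 'Advertising']  # assumed column order

print(df.shape)       # dimensions of the data-frame
print(df.head())      # top 5 rows
df.info()             # summary of the data-frame
print(df.describe())  # descriptive statistics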

Visualization:
I visualized the relationship between X and y by plotting a scatter-plot between X and
y, as sketched below.
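
A minimal Matplotlib sketch of this step, assuming X holds the Sales values and y the Advertising values:

import matplotlib.pyplot as plt

# Scatter-plot of the feature variable against the target variable
X = df['Sales'].values
y = df['Advertising'].values

plt.scatter(X, y, color='blue')
plt.xlabel('Sales')
plt.ylabel('Advertising')
plt.title('Relationship between Sales and Advertising')
plt.show()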
Mechanics of SLR
The mechanics of a simple linear regression model begin by dividing the data set
into two sets: training and testing. We instantiate the lm regressor and fit it to
the training set using the fit() method. In this step, the model learns the relationship
between the training data (X_train, y_train). Now the model is ready to predict on the
test data (X_test). Therefore, I predict on the test data using the predict() method.
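
A minimal Scikit-Learn sketch of these mechanics; the test fraction of 0.33 and the random_state are illustrative assumptions:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Scikit-Learn expects a 2-D feature array
X = X.reshape(-1, 1)

# Divide the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Instantiate the lm regressor and fit it to the training set
lm = LinearRegression()
lm.fit(X_train, y_train)

# Predict on the test data
y_pred = lm.predict(X_test)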

Slope and intercept


The model slope is given by lm.coef_ and the model intercept by lm.intercept_. The
estimated model slope and intercept values are 1.60509347 and -11.16003616.

So, the equation of the regression line is

y = 1.60509347 * x - 11.16003616

Predictions
I have predicted the Advertising values for the first five Sales observations by writing

lm.predict(X)[0:5]

If I remove [0:5], then I get the predicted Advertising values for the whole Sales data-
set.
To make a prediction on an individual Sales value, I write

lm.predict(Xi)

where Xi is the Sales data value of the ith observation. Note that Scikit-Learn expects
a two-dimensional array here, so a single value must be wrapped accordingly, as
sketched below.
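
A minimal sketch of these prediction calls; the single Sales value 25.0 is an illustrative assumption:

import numpy as np

# Predicted Advertising values for the first five Sales observations
print(lm.predict(X)[0:5])

# Prediction for a single Sales value: wrap it as a 2-D array
Xi = np.array([[25.0]])  # hypothetical individual Sales value
print(lm.predict(Xi))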

Regression metrics for performance


Now it is time to evaluate model performance. For regression problems, there
are two common ways to measure model performance: RMSE (Root Mean
Square Error) and the R-Squared value. These are explained below:-

RMSE
RMSE is the standard deviation of the residuals. So, RMSE gives us the standard
deviation of the variance left unexplained by the model. It can be calculated by taking
the square root of the Mean Squared Error. RMSE is an absolute measure of fit. It
tells us how spread out the residuals are, as given by their standard deviation. The
more concentrated the data is around the regression line, the lower the residuals and
hence the lower the standard deviation of the residuals, which results in lower values
of RMSE. So, lower values of RMSE indicate a better fit of the model to the data.
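
A minimal sketch of the RMSE computation, assuming y_test and y_pred from the steps above:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the Mean Squared Error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)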

R2 Score
The R2 Score is another metric to evaluate the performance of a regression model. It
is also called the coefficient of determination. It gives us an idea of the goodness of fit
of a linear regression model. It indicates the percentage of the variance in the
dependent variable that is explained by the model. Mathematically,

R2 Score = Explained Variation / Total Variation

In general, the higher the R2 Score, the better the model fits the data. Usually,
its value ranges from 0 to 1, so we want it to be as close to 1 as possible. Its value can
become negative if the model fits the data worse than a horizontal line at the mean.
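
A minimal sketch of the R2 Score computation, again assuming y_test and y_pred:

from sklearn.metrics import r2_score

# Coefficient of determination on the test data
print(r2_score(y_test, y_pred))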

Suggestion & Conclusion


The RMSE value has been found to be 11.2273. It means the standard deviation of
our prediction errors is 11.2273. So, sometimes we expect the predictions to be off by
more than 11.2273 and at other times by less. So, the model is not a good fit to the
data.

In business decisions, a common benchmark for the R2 score is 0.7: if the R2
score is >= 0.7, then the model is good enough to deploy on unseen data,
whereas if the R2 score is < 0.7, then the model is not good enough to deploy. Our
R2 score has been found to be 0.5789, which means that this model explains 57.89%
of the variance in our dependent variable. So, the R2 score confirms that the model is
not good enough to deploy, because it does not provide a good fit to the data.

Analysis
A linear regression model may not represent the data appropriately. The model may
be a poor fit to the data. So, we should validate our model by defining and examining
residual plots.

The difference between the observed value of the dependent variable (yi) and the
predicted value (ŷi) is called the residual, denoted di above. The scatter-plot of
these residuals is called a residual plot.

If the data points in a residual plot are randomly dispersed around the horizontal axis
with an approximately zero residual mean, a linear regression model may be
appropriate for the data. Otherwise, a non-linear model may be more appropriate.

If we take a look at the generated 'Residual errors' plot, we can clearly see that the
train data plot pattern is non-random, and the same is the case with the test data plot
pattern. So, it suggests that a non-linear model would be a better fit.
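
A minimal sketch of such a residual plot, assuming the fitted lm and the train/test split from above:

import matplotlib.pyplot as plt

# Residual errors: actual minus predicted, for train and test data
plt.scatter(lm.predict(X_train), y_train - lm.predict(X_train), color='green', label='Train data')
plt.scatter(lm.predict(X_test), y_test - lm.predict(X_test), color='blue', label='Test data')
plt.axhline(y=0, linewidth=2, color='red')  # zero-residual reference line
plt.title('Residual errors')
plt.legend()
plt.show()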

Over-fitting and Under-fitting Testing


I calculated the training set score as 0.2861 and the test set score as 0.5789. The
training set score is very poor, so the model does not learn the relationship
appropriately from the training data and performs poorly on it. This is a clear sign of
under-fitting. Hence, I validated my finding that the linear regression model does not
provide a good fit to the data.
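
These scores come from the regressor's score() method, which returns the R2 value on the given data; a minimal sketch:

# R2 scores on the training and test sets
print(lm.score(X_train, y_train))  # training set score
print(lm.score(X_test, y_test))    # test set score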

Under-fitting means our model performs poorly on the training data: it does not
capture the relationships within the training data. This problem can be addressed by
increasing model complexity. We can use a more powerful model, such as polynomial
regression, to increase model complexity, as sketched below.
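
A minimal sketch of one such polynomial regression; the degree of 2 is an illustrative assumption:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Expand the single feature into polynomial terms (1, x, x^2)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit a linear model on the polynomial features
lm_poly = LinearRegression()
lm_poly.fit(X_train_poly, y_train)
print(lm_poly.score(X_test_poly, y_test))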

Model Assumptions
The Linear Regression Model is based on several assumptions which are listed
below:-

i. Linear relationship
ii. Multivariate normality
iii. No or little multicollinearity
iv. No auto-correlation
v. Homoscedasticity

Linear relationship
The relationship between response and feature variables should be linear. This
linear relationship assumption can be tested by plotting a scatter-plot between
response and feature variables.

Multivariate normality
The linear regression model requires all variables to be multivariate normal. A
multivariate normal distribution means a vector of multiple normally distributed
variables, where any linear combination of the variables is also normally distributed.

No or little multicollinearity
It is assumed that there is little or no multicollinearity in the data. Multicollinearity
occurs when the features (or independent variables) are highly correlated.

No auto-correlation
Also, it is assumed that there is little or no auto-correlation in the data.
Autocorrelation occurs when the residual errors are not independent of each
other.

Homoscedasticity
Homoscedasticity describes a situation in which the error term (that is, the noise in
the model) is the same across all values of the independent variables. It means the
residuals have constant variance along the regression line. It can be checked by
looking at a scatter plot of the residuals.

References
The concepts and ideas in this research have been taken from the following:
1. Machine Learning notes by Andrew Ng
2. Python Data Science Handbook by Jake VanderPlas
3. Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
4. Introduction to Machine Learning with Python by Andreas C. Müller and Sarah
Guido
