Sales and Advertising
In this research, I build a simple linear regression model to examine the linear
relation between sales and advertising data for a dietary weight control product.
I cover the fundamentals of linear regression and its implementation in the
Python programming language using Scikit-Learn. Scikit-Learn is a popular
machine learning library for Python.
Python libraries
I have PyCharm installed on my system, which gives me access to the standard
Python libraries I need for this research. The basic libraries used in this
research are listed below, followed by a sketch of the corresponding imports:
Numpy
pandas
Scikit-Learn
Matplotlib
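A minimal sketch of the corresponding imports, using the standard aliases (the
specific Scikit-Learn modules are imported where they are used later in the text):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression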
Solution
Khatri et al. (2020) offered a machine learning-based strategy for identifying
the relation between sales and advertising. The authors used a dataset of
recent sales from European customers. To resolve the dataset's class imbalance,
the researchers applied the Synthetic Minority Oversampling Technique (SMOTE).
The following machine learning techniques were employed to assess the
effectiveness of the proposed approach: Random Forest (RF), Naive Bayes (NB),
and multilayer perceptron (MLP). The testing results demonstrated that the RF
algorithm performed admirably, with a 99.96% accuracy rate for relation
finding. The NB and MLP techniques scored 99.23% and 99.93% accuracy,
respectively. The authors acknowledge that more research is required to create
a feature selection technique that could improve the precision of current ML
systems (Khatri et al., 2020).
Linear Regression
Linear Regression is a statistical technique used to discover the linear
relation between a dependent variable and one or more independent variables.
This approach is relevant for supervised regression problems, where we try to
predict a continuous variable. Linear Regression may be further categorized
into two types: Simple Linear Regression and Multiple Linear Regression. In
this research, I am using the Simple Linear Regression (SLR) technique, in
which I have one independent and one dependent variable. It is the simplest
form of Linear Regression, where we fit a straight line to the data.
Independent variable
The independent variable is also called the Explanatory variable and is denoted
by X. In practical applications, the independent variable is also called the
Feature variable or Predictor variable. We can denote it as:
Independent or Explanatory variable (X) = Feature variable = Predictor variable
Dependent variable
The dependent variable is also called the Output variable and is denoted by y.
The dependent variable is also called the Target variable or Response variable.
We can denote it as follows:
Dependent or Output variable (y) = Target variable = Response variable
The Simple Linear Regression model gives the relation between X and y as:
y = β0 + β1X ----------------- (1)
For simplicity, we can compare the above equation with the basic equation of a
straight line, of the form:
y = ax + b ----------------- (2)
We can see that:
the slope of the line is given by a = β1, and
the intercept of the line is given by b = β0.
In this SLR model, we want to fit a line which approximates the linear relation
between X and y. So, the question of fitting reduces to estimating the
coefficients of the model, β0 and β1.
Now, our work is to find the line which best fits the scatter plot of the data.
This line will help us to predict the value of the Target variable for any
given Feature variable. This line is called the regression line. We can define
an error function for any line; the regression line is then the one which
minimizes the error function. Such an error function is also called a Cost
function.
Cost Function
We want the above line to represent the data-set as closely as possible. In
other words, we want the line to be as close to the actual data points as
possible. This can be achieved by minimizing the vertical distance between each
actual data point and the fitted line. This distance is called the residual.
So, in a regression model, we try to minimize the residuals by finding the line
of best fit. The residual for the i-th data point is:
di = yi - ŷi
Summing the squares of the residuals over all the data points gives the Cost
function:
D = Σ di² = Σ (yi - ŷi)²
It indicates the total error present in the model, which is the sum of the
squared errors of the individual data points.
We can estimate the model parameters β0 and β1 by minimizing D, the total error
in the model. This gives us the regression line of equation (1). This method of
finding the model parameters, and thus the regression line, is called the
Ordinary Least Squares (OLS) method.
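To illustrate the OLS method, the sketch below computes β0 and β1 directly from
their closed-form solution using NumPy. The toy data is made up for
illustration and is not the sales and advertising data-set:

import numpy as np

def ols_fit(x, y):
    # Minimize D = Σ (yi - ŷi)² : closed-form OLS estimates for SLR
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta0, beta1 = ols_fit(x, y)
print('intercept =', beta0, 'slope =', beta1)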
Methodology
Data analysis
First, I imported the data-set into a data-frame using the standard read_csv()
function of the pandas library and assigned it to the df variable. I then
performed exploratory data analysis to get a sense of the data: I checked the
dimensions of the data-frame using its shape attribute, looked at the top 5
rows with the pandas head() method, viewed the data-frame summary with the
pandas info() method, and examined the descriptive statistics with the
describe() method.
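A sketch of these exploratory steps is shown below. The file name
'Sales_and_Advertising.csv' is an assumption, since the text does not name the
actual data file:

import pandas as pd

# File name is assumed; replace with the actual data-set file
df = pd.read_csv('Sales_and_Advertising.csv')

print(df.shape)       # dimensions of the data-frame
print(df.head())      # top 5 rows
df.info()             # summary of the data-frame
print(df.describe())  # descriptive statistics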
Visualization:
I visualized the relationship between X and y by plotting a scatter-plot of the
two variables, as sketched below.
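A possible sketch of this scatter-plot, assuming the data-frame columns are
named 'Sales' and 'Advertising' (the actual column names are not given in the
text):

import matplotlib.pyplot as plt

# Column names 'Sales' and 'Advertising' are assumptions
X = df['Sales'].values
y = df['Advertising'].values

plt.scatter(X, y, color='blue')
plt.title('Relationship between Sales and Advertising')
plt.xlabel('Sales')
plt.ylabel('Advertising')
plt.show()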
Mechanics of SLR
The mechanics of a simple linear regression model begin by dividing the
data-set into two sets: training and testing. We instantiate the lm regressor
and fit it to the training set using the fit() method. In this step, the model
learns the relation in the training data (X_train, y_train). Now the model is
ready to predict on the test data (X_test), which I do using the predict()
method. The fitted regression line is:
y = 1.60509347 * x - 11.16003616
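A sketch of these steps with Scikit-Learn is shown below. The test-set fraction
and random seed are assumptions, since the text does not state the split that
was used:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Scikit-Learn expects a 2-D feature matrix of shape (n_samples, n_features)
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

# Split the data-set into training and testing sets (33% test is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Instantiate the regressor and fit it to the training set
lm = LinearRegression()
lm.fit(X_train, y_train)

# Predict on the test data
y_pred = lm.predict(X_test)

print(lm.coef_, lm.intercept_)  # slope and intercept of the fitted line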
Predictions
I predicted the Advertising values for the first five Sales values by writing:
lm.predict(X)[0:5]
If I remove [0:5], then I get the predicted Advertising values for the whole
Sales data-set.
To make a prediction on an individual Sales value Xi, I write:
lm.predict(Xi)
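For example, where 24.0 is an arbitrary illustrative Sales value (Scikit-Learn
expects a 2-D input of shape (1, 1)):

# Predicted Advertising values for the first five Sales values
print(lm.predict(X)[0:5])

# Prediction for one individual Sales value; 24.0 is an arbitrary example
Xi = [[24.0]]
print(lm.predict(Xi))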
RMSE
RMSE is the standard deviation of the residuals. So, RMSE gives us the standard
deviation of the variance left unexplained by the model. It is calculated by
taking the square root of the Mean Squared Error (MSE). RMSE is an absolute
measure of fit: it tells us how spread out the residuals are. The more
concentrated the data is around the regression line, the smaller the residuals,
and hence the smaller their standard deviation, which results in lower values
of RMSE. So, lower values of RMSE indicate a better fit to the data.
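A sketch of the RMSE calculation on the test-set predictions, using
Scikit-Learn's mean_squared_error:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print('RMSE: {:.4f}'.format(rmse))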
R2 Score
R2 Score is another metric to evaluate the performance of a regression model.
It is also called the coefficient of determination. It gives us an idea of the
goodness of fit of a linear regression model: it indicates the percentage of
the variance in the dependent variable that is explained by the model.
Mathematically,
R² = 1 - Σ (yi - ŷi)² / Σ (yi - ȳ)²
In general, the higher the R2 Score, the better the model fits the data. Its
value usually ranges from 0 to 1, so we want it to be as close to 1 as
possible. Its value can become negative if the model fits the data worse than a
horizontal line at the mean of y.
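A sketch of the R2 calculation on the test-set predictions, using Scikit-Learn's
r2_score (equivalently, lm.score(X_test, y_test)):

from sklearn.metrics import r2_score

# Fraction of the variance in y explained by the model
r2 = r2_score(y_test, y_pred)
print('R2 score: {:.4f}'.format(r2))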
In business decisions, a common benchmark for the R2 score is 0.7: if the R2
score is >= 0.7, then the model is considered good enough to deploy on unseen
data, whereas if the R2 score is < 0.7, then the model is not good enough to
deploy. Our R2 score has been found to be 0.5789. It means that this model
explains 57.89% of the variance in our dependent variable. So, the R2 score
confirms that the model is not good enough to deploy, because it does not
provide a good fit to the data.
Analysis
A linear regression model may not represent the data appropriately. The model may
be a poor fit to the data. So, we should validate our model by defining and examining
residual plots.
The difference between the observed value of the dependent variable (yi) and
the predicted value (ŷi) is called the residual, which we denoted above by di.
The scatter-plot of these residuals is called a residual plot.
If the data points in a residual plot are randomly dispersed around the
horizontal axis with an approximately zero mean, a linear regression model may
be appropriate for the data. Otherwise, a non-linear model may be more
appropriate.
If we take a look at the generated 'Residual errors' plot, we can clearly see
that the train data plot pattern is non-random, and the same is true of the
test data plot pattern. This suggests that a non-linear model would fit the
data better.
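A sketch of how such a 'Residual errors' plot can be generated with Matplotlib:

import matplotlib.pyplot as plt

y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)

# Residuals (predicted minus actual) against the predicted values
plt.scatter(y_train_pred, y_train_pred - y_train, color='green', label='Train data')
plt.scatter(y_test_pred, y_test_pred - y_test, color='blue', label='Test data')
plt.axhline(y=0, color='red', linewidth=2)  # zero-residual reference line
plt.title('Residual errors')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend()
plt.show()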
Under-fitting means our model performs poorly on the training data: the model
fails to capture the relationships within the training data. This problem can
be addressed by increasing model complexity, for example by using a more
flexible model such as Polynomial Regression, as sketched below.
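A minimal sketch of a Polynomial Regression model built with a Scikit-Learn
pipeline. The polynomial degree of 2 is an arbitrary starting point, not a
value taken from the text:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Polynomial feature expansion followed by a linear fit; degree 2 is assumed
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
print('Test R2:', poly_model.score(X_test, y_test))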
Model Assumptions
The Linear Regression Model is based on several assumptions which are listed
below:-
i. Linear relationship
ii. Multivariate normality
iii. No or little multicollinearity
iv. No auto-correlation
v. Homoscedasticity
Linear relationship
The relationship between response and feature variables should be linear. This
linear relationship assumption can be tested by plotting a scatter-plot between
response and feature variables.
Multivariate normality
The linear regression model requires all variables to be multivariate normal. A
multivariate normal distribution is a distribution over a vector of multiple
normally distributed variables in which any linear combination of the variables
is also normally distributed.
No or little multicollinearity
It is assumed that there is little or no multicollinearity in the data. Multicollinearity
occurs when the features (or independent variables) are highly correlated.
No auto-correlation
Also, it is assumed that there is little or no auto-correlation in the data.
Autocorrelation occurs when the residual errors are not independent from each
other.
Homoscedasticity
Homoscedasticity describes a situation in which the error term (that is, the
noise in the model) has the same variance across all values of the independent
variables. It means the residuals have a constant spread along the regression
line. It can be checked by looking at a scatter plot of the residuals.
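A sketch of two simple assumption checks. The Durbin-Watson statistic requires
statsmodels, which is not among the libraries listed earlier, so this is an
additional assumption:

from statsmodels.stats.stattools import durbin_watson

# Multicollinearity: pairwise correlations between variables (trivial with a
# single feature, but useful in the multiple-regression case)
print(df.corr())

# Auto-correlation: Durbin-Watson statistic on the residuals; values near 2
# suggest little or no auto-correlation
residuals = (y_test - y_test_pred).ravel()
print(durbin_watson(residuals))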
References
The concepts and ideas in this research have been taken from the following:
1. Machine learning notes by Andrew Ng
2. Python Data Science Handbook by Jake VanderPlas
3. Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
4. Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido