Data Science Interview Questions #Week1
Interview Questions
(30 days of Interview Preparation)
INEURON.AI
Q1. What is the difference between AI, Data Science, ML, and DL?
Ans 1 :
Artificial Intelligence: AI started as a purely mathematical and scientific exercise, but once it became computational it began to solve human problems formalized as a subset of computer science. Artificial intelligence has changed the original computational-statistics paradigm to the modern idea that machines can mimic actual human capabilities, such as decision making and performing more "human" tasks. Modern AI falls into two categories:
1. General AI - planning, decision making, identifying objects, recognizing sounds, social and business transactions.
2. Applied AI - driverless/autonomous cars, or machines that trade stocks smartly.
Machine Learning: Instead of engineers "teaching" or programming computers with everything they need to carry out tasks, the idea is that computers can teach themselves - learn something without being explicitly programmed to do so. ML is a form of AI in which, given more data, systems can adapt their actions and responses, making them more efficient, adaptable, and scalable, e.g., navigation apps and recommendation engines. It is classified into:
1. Supervised
2. Unsupervised
3. Reinforcement learning
Data Science: Data science borrows many tools, techniques, and algorithms from these fields, plus others, to handle big data.
The goal of data science, somewhat similar to machine learning, is to make accurate predictions and to automate and perform transactions in real time, such as purchasing internet traffic or automatically generating content.
Data science relies less on math and coding and more on data and building new systems to process the
data. Relying on the fields of data integration, distributed architecture, automated machine learning, data
visualization, data engineering, and automated data-driven decisions, data science can cover an entire
spectrum of data processing, not only the algorithms or statistics related to data.
Q2. What is the difference between Supervised learning, Unsupervised learning and
Reinforcement learning?
Ans 2:
Machine Learning
Machine learning is the scientific study of algorithms and statistical models that computer systems use to
effectively perform a specific task without using explicit instructions, relying on patterns and inference
instead.
It builds a model by learning the patterns of, and relationships within, historical data in order to make data-driven predictions.
Supervised learning
In a supervised learning model, the algorithm learns on a labeled dataset, to generate reasonable
predictions for the response to new data. (Forecasting outcome of new data)
• Regression
• Classification
Unsupervised learning
An unsupervised model, in contrast, is given unlabelled data, which the algorithm tries to make sense of by extracting features, co-occurrences, and underlying patterns on its own. We use unsupervised learning for
• Clustering
• Anomaly detection
• Association
• Autoencoders
Reinforcement Learning
Reinforcement learning is less supervised: it depends on a learning agent that determines the output by exploring different possible ways to arrive at the best possible solution.
Business understanding: understand the given use case; it is also good to know more about the domain for which the use case is built.
Data Acquisition and Understanding: gathering data from different sources and understanding the data. Cleaning the data, handling missing data if any, data wrangling, and EDA (exploratory data analysis).
Modeling: Feature engineering - scaling the data; feature selection - not all features are important, so we use backward elimination, correlation factors, PCA, and domain knowledge to select features.
Model training: based on trial and error or on experience, we select an algorithm and train it with the selected features.
Model evaluation: accuracy of the model, confusion matrix, and cross-validation.
If accuracy is not high enough, we tune the model to achieve higher accuracy, either by changing the algorithm used, by feature selection, or by gathering more data, etc.
Deployment: once the model has good accuracy, we deploy it either in the cloud, on a Raspberry Pi, or any other target. After deployment we monitor the model's performance; if it is good, we go live, otherwise we reiterate the whole process until the performance is good.
It's not done yet! What if, after a few days, the model performs badly because of new data? In that case, we go through the whole process again with the new data and redeploy the model.
Ans 4:
Linear regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) by finding the best-fitting straight line.
The equation for the linear model is Y = mX + c, where m is the slope and c is the intercept.
In a scatter plot of such data, the points show the distribution of 'y' with respect to 'x', and no single straight line runs through all the data points. So the objective is to fit the straight line that minimizes the error between the expected and actual values.
Ans 5:
OLS (Ordinary Least Squares) is a statsmodels model that helps us identify the more significant features, i.e., those that have an influence on the output. An OLS model in Python is executed as:
import statsmodels.formula.api as smf
lm = smf.ols(formula='Sales ~ am+constant', data=data).fit()
lm.conf_int()
lm.summary()
The summary output reports, among other things, a coefficient, t-value, and p-value for each feature.
The higher the t-value for a feature, the more significant the feature is to the output variable. The p-value also plays a role in rejecting the null hypothesis (the null hypothesis states that the feature has zero significance on the target variable). If the p-value is less than 0.05 (at the 95% confidence level) for a feature, then we can consider that feature significant.
Ans 6:
The main objective of training a model is to make sure it fits the data properly and reduces the loss. Sometimes a model fits the training data well but still fails and gives poor performance when analyzing other data (test data). This is overfitting. Regularization was introduced to overcome overfitting.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds “Absolute value of
magnitude” of coefficient, as penalty term to the loss function.
Lasso shrinks the less important feature’s coefficient to zero; thus, removing some feature altogether. So,
this works well for feature selection in case we have a huge number of features.
Methods like cross-validation and stepwise regression can also handle overfitting and perform feature selection, but they work well only with a small set of features; regularization techniques such as the lasso are the better choice when we are dealing with a large set of features.
Along with shrinking coefficients, the lasso performs feature selection, as well. (Remember the
‘selection‘ in the lasso full-form?) Because some of the coefficients become exactly zero, which is
equivalent to the particular feature being excluded from the model.
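A minimal scikit-learn sketch of this behaviour, using hypothetical random data (the feature count, alpha value, and variable names are illustrative, not from the original text):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                  # 5 features; only the first two actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100) * 0.1

lasso = Lasso(alpha=0.1)               # alpha controls the strength of the L1 penalty
lasso.fit(X, y)
print(lasso.coef_)                     # coefficients of the irrelevant features shrink to (near) zero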
Ans 7:
Overfitting happens when the model learns the noise in the training data along with the signal and therefore doesn't perform well on new/unseen data on which it wasn't trained.
To avoid overfitting your model on the training data, use techniques like cross-validation sampling, reducing the number of features, pruning, regularization, etc.
Regularization adds the penalty as model complexity increases. The regularization parameter
(lambda) penalizes all the parameters except intercept so that the model generalizes the data and
won’t overfit.
Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function; this squared term is the L2 regularization element. Lambda is a hyperparameter.
If lambda is zero, then it is equivalent to OLS. But if lambda is very large, then it will add too much
weight, and it will lead to under-fitting.
Ridge regularization forces the weights to be small but does not make them zero and does not give
the sparse solution.
Ridge is not robust to outliers, as the squared terms blow up the error differences of the outliers, and the regularization term tries to fix this by penalizing the weights.
Ridge regression performs better when all the input features influence the output and all the weights are of roughly equal size.
L2 regularization can learn complex data patterns.
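For comparison, a small hedged sketch on hypothetical data showing that ridge shrinks weights without making them exactly zero (data and alpha value are illustrative):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100) * 0.1

ridge = Ridge(alpha=1.0)               # alpha plays the role of lambda; 0 reduces to OLS
ridge.fit(X, y)
print(ridge.coef_)                     # all weights are shrunk, but none are exactly zero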
Ans 8.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
The definition of R-squared is the percentage of the response variable variation that is explained by a
linear model.
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
There is a problem with R-squared. The problem arises when we ask ourselves: is it good to add as many independent variables as possible?
The answer is no, because we understood that each independent variable should have a meaningful impact. But even if we add independent variables which are not meaningful, will the R-squared value improve?
Yes, and this is the basic problem with R-squared: no matter how many junk, unimportant, or impactful independent variables you add to your model, the R-squared value will always increase. It never decreases with the addition of a new independent variable, whether that variable is impactful, non-impactful, or bad. So we need another measure, equivalent to R-squared, which penalizes our model for any junk independent variable.
So, we calculate the Adjusted R-Square with a better adjustment in the formula of generic R-square.
The mean squared error tells you how close a regression line is to a set of points. It does this by
taking the distances from the points to the regression line (these distances are the “errors”) and
squaring them.
Giving an intuition: the line equation is y = Mx + B. We want to find the M (slope) and B (y-intercept) that minimize the squared error.
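A short sketch (synthetic data, illustrative variable names) computing R-squared, adjusted R-squared, and MSE for a fitted line:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.RandomState(42)
X = rng.rand(50, 1)
y = 2 * X[:, 0] + 1 + rng.randn(50) * 0.1

model = LinearRegression().fit(X, y)
pred = model.predict(X)
n, p = X.shape                                      # samples, predictors
r2 = r2_score(y, pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)       # adjusted R-squared penalizes extra predictors
mse = mean_squared_error(y, pred)
print(r2, adj_r2, mse)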
Q10. Why Support Vector Regression? Difference between SVR and a simple regression
model?
Ans 10:
In simple linear regression, we try to minimize the error; in SVR, we try to fit the error within a certain threshold.
Main Concepts:-
1. Boundary
2. Kernel
3. Support Vector
4. Hyper Plane
Our best-fit line is the hyperplane that has the maximum number of points within the margin.
What we are trying to do here is decide a decision boundary at distance 'e' from the original hyperplane such that the data points closest to the hyperplane, the support vectors, lie within that boundary line.
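A minimal hedged sketch of SVR on synthetic data; epsilon corresponds to the error threshold 'e' discussed above (data and parameter values are illustrative):

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.rand(40, 1) * 5, axis=0)
y = np.sin(X).ravel()

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)   # errors within the epsilon tube are ignored
svr.fit(X, y)
print(svr.predict(X[:5]))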
DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview
Preparation)
#DAY 02
Q1. What is Logistic Regression?
Answer:
Logistic regression involves a dependent variable that is represented in binary (0 or 1, true or false, yes or no) values, meaning the outcome can take only one of two forms. For example, it can be used when we need to find the probability of a success or failure event.
Model
Output = 0 or 1
Z = WX + B
hΘ(x) = sigmoid(Z) = 1 / (1 + e^(-Z))
log(P(X) / (1 - P(X))) = WX + B (the log-odds, or logit)
If Z goes to infinity, Y(predicted) will become 1, and if Z goes to negative infinity, Y(predicted) will become 0.
The output from the hypothesis is the estimated probability. It is used to infer how confident we can be that the predicted value is the actual value for a given input X.
Cost Function
The cost used is the logistic (cross-entropy) loss: J(θ) = -(1/m) Σ [ y·log(hΘ(x)) + (1 - y)·log(1 - hΘ(x)) ].
This formulation is for binary logistic regression. For data with more than 2 classes, softmax regression has to be used.
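A tiny hedged sketch (toy 1-D data, illustrative values) showing the sigmoid output used as a probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [8], [9], [10]])   # a single feature
y = np.array([0, 0, 0, 1, 1, 1])                # binary outcome

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.5]]))   # [P(class 0), P(class 1)] from the sigmoid
print(clf.predict([[2.5]]))         # thresholded at 0.5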
Answer:
Linear and logistic regression are the most basic and commonly used forms of regression. The essential difference between the two is that logistic regression is used when the dependent variable is binary, whereas linear regression is used when the dependent variable is continuous and the nature of the regression line is linear.
Linear regression models the data using continuous numeric values; logistic regression, by contrast, models the data in binary values.
Linear regression requires a linear relationship between the dependent and independent variables, whereas this is not necessary for logistic regression.
In linear regression, the independent variables may be correlated with each other; in logistic regression, the variables should not be correlated with each other.
Answer:-
With linear regression, you fit a polynomial through the data - say, as in the example below, where we fit a straight line through a {tumor size, tumor type} sample set:
Above, malignant tumors get 1 and non-malignant (benign) ones get 0, and the green line is our hypothesis h(x). To make predictions, we may say that for any given tumor size x, if h(x) gets bigger than 0.5, we predict a malignant tumor; otherwise, we predict benign.
It looks like this way we could correctly predict every single training-set sample, but now let's change the task a bit.
Intuitively it's clear that all tumors larger than a certain threshold are malignant. So let's add another sample with a huge tumor size and run linear regression again:
Now our h(x) > 0.5 → malignant rule doesn't work anymore. To keep making correct predictions, we would need to change it to h(x) > 0.2 or something - but that's not how the algorithm should work.
We cannot change the hypothesis each time a new sample arrives. Instead, we should learn it off the
training set data, and then (using the hypothesis we've learned) make correct predictions for the data
we haven't seen before.
Linear regression is unbounded.
A decision tree is a type of supervised learning algorithm that can be used in classification as well as
regressor problems. The input to a decision tree can be both continuous as well as categorical. The
decision tree works on an if-then statement. Decision tree tries to solve a problem by using tree
representation (Node and Leaf)
Assumptions while creating a decision tree:
1) Initially, the whole training set is considered as the root.
2) Feature values are preferred to be categorical; if continuous, they are discretized.
3) Records are distributed recursively on the basis of attribute values.
4) Which attributes to place in the root or internal nodes is decided using a statistical approach.
1) ID3(Iterative Dichotomiser 3): This solution uses Entropy and Information gain as metrics
to form a better decision tree. The attribute with the highest information gain is used as a root
node, and a similar approach is followed after that. Entropy is the measure that characterizes
the impurity of an arbitrary collection of examples.
Entropy varies from 0 to 1. 0 if all the data belong to a single class and 1 if the class distribution is
equal. In this way, entropy will give a measure of impurity in the dataset.
Steps to decide which attribute to split on:
1. Compute the entropy for the dataset.
2. For every attribute, compute the entropy of each of its splits, take the weighted average information, and calculate the attribute's information gain.
3. Pick the attribute with the highest information gain and repeat until the tree is complete.
2) CART Algorithm (Classification and Regression trees): In CART, we use the GINI index as
a metric. Gini index is used as a cost function to evaluate split in a dataset
Steps to calculate Gini for a split:
1. Calculate Gini for the sub-nodes, using the formula: sum of the squares of the probabilities of success and failure (p² + q²).
2. Calculate Gini for split using weighted Gini score of each node of that split.
Split on Gender:
Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68
Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55
Weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
Here Weighted Gini is high for gender, so we consider splitting based on gender
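The arithmetic above can be reproduced with a few lines of Python (the probabilities and the 10/30, 20/30 weights come from the example itself):

def gini_score(p_success):
    q = 1 - p_success
    return p_success ** 2 + q ** 2          # the p² + q² formula used above

female = gini_score(0.2)                    # 0.68
male = gini_score(0.65)                     # ~0.55
weighted = (10 / 30) * female + (20 / 30) * male
print(round(weighted, 2))                   # ~0.59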
Answer:
To control the leaf size, we can set the parameters:-
1. Maximum depth :
Maximum tree depth is a limit to stop the further splitting of nodes when the specified tree depth has
been reached during the building of the initial decision tree.
A common recommendation, however, is not to rely on maximum depth to limit the further splitting of nodes - in other words, to leave it at the largest possible value and control tree growth with pruning instead.
1. Pre-pruning is also known as the early stopping criteria. As the name suggests, the criteria
are set as parameter values while building the model. The tree stops growing when it meets
any of these pre-pruning criteria, or it discovers the pure classes.
2. In Post-pruning, the idea is to allow the decision tree to grow fully and observe the CP value.
Next, we prune/cut the tree with the optimal CP(Complexity Parameter) value as the
parameter.
The CP (complexity parameter) is used to control tree growth: if the cost of adding a variable is higher than the value of CP, tree growth stops.
1. If the feature is categorical, the split is done with the elements belonging to a particular
class.
2. If the feature is continuous, the split is done with the elements higher than a threshold.
At every split, the decision tree will take the best variable at that moment. This will be done according
to an impurity measure with the split branches. And the fact that the variable used to do split is
categorical or continuous is irrelevant (in fact, decision trees categorize continuous variables by
creating binary regions with the threshold).
Finally, a good approach is to always encode your categorical variables numerically, using LabelEncoder or one-hot encoding.
Q8. What is the Random Forest Algorithm?
Answer:
Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The
base estimators in the random forest are decision trees. Random forest randomly selects a set of
features that are used to decide the best split at each node of the decision tree.
Looking at it step-by-step, this is what a random forest model does:
1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random set of features are considered to decide the
best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by averaging the predictions from all decision trees.
To sum up, the Random forest randomly selects data points and features and builds multiple trees
(Forest).
Random Forest is used for feature importance selection. The attribute (.feature_importances_) is
used to find feature importance.
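A minimal hedged sketch (using the built-in iris dataset purely for illustration) of fitting a random forest and reading feature importances:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)   # one importance score per feature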
Answer:
In predictive models, the prediction error is composed of two different errors:
1. Bias
2. Variance
It is important to understand the bias-variance trade-off, which is about minimizing bias and variance in the prediction and avoiding overfitting and underfitting of the model.
Bias: it is the difference between the expected or average prediction of the model and the correct value which we are trying to predict. Imagine we build more than one model by collecting different data sets and later evaluate the predictions; we may end up with different predictions from each model. Bias measures how far these model predictions are from the correct prediction. It always leads to a high error on training and test data.
Variance: Variability of a model prediction for a given data point. We can build the model multiple
times, so the variance is how much the predictions for a given point vary between different
realizations of the model.
Q10. What are Ensemble Methods?
Answer:
Bagging and Boosting
Decision trees have been around for a long time and are known to suffer from bias and variance: you will have a large bias with simple trees and a large variance with complex trees.
Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner.
Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree.
Here the idea is to create several subsets of data from the training sample chosen randomly with
replacement. Now, each collection of subset data is used to train their decision trees. As a result, we
end up with an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree.
Boosting is another ensemble technique to create a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data, and the data is then analyzed for errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to correct the net error from the prior trees.
When a hypothesis misclassifies an input, its weight is increased, so that the next hypothesis is more
likely to classify it correctly. By combining the whole set at the end converts weak learners into a
better performing model.
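A short hedged sketch comparing a bagged ensemble of decision trees with a boosted ensemble (the breast-cancer dataset and parameter values are illustrative choices, not from the original text):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)   # sequential, error-driven

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())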
Answer:
SVM (Support Vector Machine), or large-margin classifier, is a supervised learning algorithm that classifies data by finding the separating hyperplane with the maximum margin.
1) Linear SVM: In Linear SVM, the data points are expected to be separated by some apparent
gap. Therefore, the SVM algorithm predicts a straight hyperplane dividing the two classes. The
hyperplane is also called as maximum margin hyperplane
2) Non-Linear SVM: It is possible that our data points are not linearly separable in a p-
dimensional space, but can be linearly separable in a higher dimension. Kernel tricks make it
possible to draw nonlinear hyperplanes. Some standard kernels are a) Polynomial Kernel b) RBF
kernel(mostly used).
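A hedged sketch contrasting a linear and an RBF kernel on data that is not linearly separable (make_moons and the parameters are illustrative):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # not linearly separable
linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)
print(linear_svm.score(X, y), rbf_svm.score(X, y))            # RBF handles the nonlinearity better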
Answer:
Bayes’ Theorem finds the probability of an event occurring given the probability of another event
that has already occurred. Bayes’ theorem is stated mathematically as the following equation:
Now, with regards to our dataset, we can apply Bayes’ theorem in following way:
P(y|X) = P(X|y) · P(y) / P(X)
where, y is class variable and X is a dependent feature vector (of size n) where:
X = (x_1,x_2,x_3,.....,x_n)
To be clear, an example of a feature vector and corresponding class variable is (referring to the 1st row of the dataset):
X = (Rainy, Hot, High, False) y = No So basically, P(X|y) here means, the probability of “Not
playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high
humidity” and “no wind”.
This is as simple as calculating the mean and standard deviation values of each input variable (x) for
each class value.
Mean (x) = 1/n * sum(x)
Where n is the number of instances, and x is the values for an input variable in your training data.
We can calculate the standard deviation using the following equation:
Standard deviation(x) = sqrt( (1/n) * sum( (xi - mean(x))^2 ) )
When to use what? Standard Naive Bayes only supports categorical features, while Gaussian Naive
Bayes only supports continuously valued features.
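A minimal hedged sketch of Gaussian Naive Bayes on continuous features (iris is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)      # continuous features, so Gaussian NB applies
gnb = GaussianNB().fit(X, y)
print(gnb.theta_[0])                   # per-feature means learned for the first class
print(gnb.predict(X[:3]))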
Answer:
A confusion matrix is a table that is often used to describe the performance of a classification model
(or “classifier”) on a set of test data for which the true values are known. It allows the visualization
of the performance of an algorithm.
A confusion matrix is a summary of prediction results on a classification problem. The number of
correct and incorrect predictions are summarized with count values and broken down by each class.
Here,
Class 1: Positive
Class 2: Negative
Definition of the Terms:
1. Positive (P): Observation is positive (for example: is an apple).
2. Negative (N): Observation is not positive (for example: is not an apple).
3. True Positive (TP): Observation is positive, and is predicted to be positive.
4. False Negative (FN): Observation is positive, but is predicted negative.
5. True Negative (TN): Observation is negative, and is predicted to be negative.
6. False Positive (FP): Observation is negative, but is predicted positive.
Answer:
Accuracy
Accuracy is defined as the ratio of the sum of True Positives and True Negatives to the total: Accuracy = (TP + TN) / (TP + TN + FP + FN).
However, there are problems with accuracy. It assumes equal costs for both kinds of
errors. A 99% accuracy can be excellent, good, mediocre, poor, or terrible depending upon
the problem.
Misclassification Rate
Misclassification Rate is defined as the ratio of the sum of False Positives and False Negatives to the total: Misclassification Rate = (FP + FN) / (TP + TN + FP + FN).
Misclassification Rate is also called Error Rate.
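A small hedged sketch (toy labels, illustrative only) that builds the confusion matrix and derives accuracy and misclassification rate from it:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = (fp + fn) / (tp + tn + fp + fn)
print(tp, fn, tn, fp)
print(accuracy, error_rate)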
Answer:
True Negative Rate
Specificity (SP) is calculated as the number of correct negative predictions divided by the
total number of negatives. It is also called a true negative rate (TNR). The best specificity is
1.0, whereas the worst is 0.0.
False positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the
total number of negatives. The best false positive rate is 0.0, whereas the worst is 1.0. It can also be
calculated as 1 – specificity.
Q16. What are F1 Score, precision and recall?
Recall:-
Recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of positive examples.
1. High Recall indicates the class is correctly recognized (small number of FN).
2. Low Recall indicates the class is incorrectly recognized (large number of FN).
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples
by the total number of predicted positive examples.
1. High Precision indicates an example labeled as positive is indeed positive (a small number
of FP).
2. Low Precision indicates that many examples labeled as positive are actually negative (a large number of FP).
Remember:-
High recall, low precision: This means that most of the positive examples are correctly recognized
(low FN), but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those
we predict as positive are indeed positive (low FP).
F-measure/F1-Score:
Since we have two measures (Precision and Recall), it helps to have a measurement that represents
both of them. We calculate an F-measure, which uses Harmonic Mean in place of Arithmetic
Mean as it punishes the extreme values more.
The F-Measure will always be nearer to the smaller value of Precision or Recall.
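The same toy labels as in the earlier sketch can be reused (illustrative only) to compute precision, recall, and F1 with scikit-learn:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))   # harmonic mean of precision and recall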
Answer:
Randomized search CV is used to perform a random search over hyperparameters. RandomizedSearchCV implements fit and score methods, and also exposes predict_proba, decision_function, transform, etc., if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated
search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of
parameter settings is sampled from the specified distributions. The number of parameter settings that
are tried is given by n_iter.
Code Example :
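The original code example was not preserved; a hedged sketch of what it might look like (the estimator, parameter distributions, and dataset are illustrative assumptions):

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_dist = {'n_estimators': randint(10, 200), 'max_depth': randint(2, 10)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=10, cv=3, random_state=0)   # n_iter settings are sampled
search.fit(X, y)
print(search.best_params_, search.best_score_)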
Answer:
Grid search is the process of performing hyperparameter tuning to determine the optimal values for
a given model.
Grid search runs the model over all possible combinations of the hyperparameter values in the grid and outputs the best model.
Code Example:-
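The original code example was not preserved; a hedged sketch of what a grid search might look like (SVC, the grid values, and iris are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X, y)                       # tries every combination in the grid
print(grid.best_params_, grid.best_score_)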
Answer:
Bayesian search, in contrast to the grid and random search, keeps track of past evaluation results,
which they use to form a probabilistic model mapping hyperparameters to a probability of a score on
the objective function.
Code:
from skopt import BayesSearchCV
from sklearn.svm import SVC

opt = BayesSearchCV(
    SVC(),
    {
        'C': (1e-6, 1e+6, 'log-uniform'),
        'gamma': (1e-6, 1e+1, 'log-uniform'),
        'degree': (1, 8),  # integer valued parameter
        'kernel': ['linear', 'poly', 'rbf']
    },
    n_iter=32,
    cv=3)
# opt.fit(X_train, y_train) would then run the search (X_train, y_train assumed to exist)
Answer:
Zero Component Analysis:
Transforming the data so that its covariance matrix becomes the identity matrix is called whitening. This removes the first- and second-order statistical structure.
ZCA transforms the data to zero mean and makes the features linearly independent of each other.
In some image analysis applications, especially when working with tiny colour images, it is frequently useful to apply some whitening to the data before, e.g., training a classifier.
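A hedged NumPy sketch of ZCA whitening (the epsilon value and the synthetic data are illustrative assumptions):

import numpy as np

def zca_whiten(X, eps=1e-5):
    # Zero-mean the data, then rotate/scale so the covariance becomes ~identity.
    X = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA whitening matrix
    return X @ W

rng = np.random.RandomState(0)
X = rng.randn(100, 3) @ np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]])
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))        # approximately the identity matrix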
DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview
Preparation)
# DAY 03
Q1. How do you treat heteroscedasticity in regression?
Heteroscedasticity means unequal scattered distribution. In regression analysis, we generally talk about
the heteroscedasticity in the context of the error term. Heteroscedasticity is the systematic change in the
spread of the residuals or errors over the range of measured values. Heteroscedasticity is the problem
because Ordinary least squares (OLS) regression assumes that all residuals are drawn from a random
population that has a constant variance.
Pure heteroscedasticity:- It refers to cases where we specify the correct model and yet observe non-constant variance in the residual plots.
Impure heteroscedasticity:- It refers to cases where you incorrectly specify the model, and that causes
the non-constant variance. When you leave an important variable out of a model, the omitted effect is
absorbed into the error term. If the effect of the omitted variable varies throughout the observed range of
data, it can produce the telltale signs of heteroscedasticity in the residual plots.
Redefining the variables: if the model is cross-sectional and includes large differences between the sizes of the observations, you can respecify it to reduce the impact of the size differential. To do this, change the model from using the raw measure to using rates and per-capita values. Of course, this type of model answers a slightly different kind of question. You'll need to determine whether this approach is suitable for both your data and what you need to learn.
Weighted regression:
It is a method that assigns each data point to a weight based on the variance of its fitted value. The idea
is to give small weights to observations associated with higher variances to shrink their squared residuals.
Weighted regression minimizes the sum of the weighted squared residuals. When you use the correct
weights, heteroscedasticity is replaced by homoscedasticity.
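A hedged statsmodels sketch of weighted regression on synthetic heteroscedastic data (the 1/x² weighting scheme is an illustrative assumption):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
x = rng.rand(100) * 10
y = 2 * x + rng.randn(100) * x          # noise grows with x (heteroscedasticity)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
weights = 1.0 / (x ** 2 + 1e-6)         # down-weight the high-variance observations
wls_fit = sm.WLS(y, X, weights=weights).fit()
print(ols_fit.params, wls_fit.params)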
Multicollinearity means independent variables are highly correlated to each other. In regression
analysis, it's an important assumption that the regression model should not be faced with a problem of
multicollinearity.
If two explanatory variables are highly correlated, it's hard to tell, which affects the dependent variable.
Let's say Y is regressed against X1 and X2 and where X1 and X2 are highly correlated. Then the effect
of X1 on Y is hard to distinguish from the effect of X2 on Y because any increase in X1 tends to be
associated with an increase in X2.
Another way to look at the multicollinearity problem is that individual t-test p-values can be misleading: a p-value can be high, suggesting the variable is not important, even though the variable is important.
Correcting Multicollinearity:
1) Remove one of the highly correlated independent variables from the model. If you have two or more
factors with a high VIF, remove one from the model.
2) Principal Component Analysis (PCA) - it cuts the number of interdependent variables down to a smaller set of uncorrelated components. Instead of using the highly correlated variables, use components in the model that have an eigenvalue greater than 1.
3) Run PROC VARCLUS and choose the variable that has a minimum (1-R2) ratio within a cluster.
4) Ridge Regression - It is a technique for analyzing multiple regression data that suffer from
multicollinearity.
5) If you include an interaction term (the product of two independent variables), you can also reduce
multicollinearity by "centering" the variables. By "centering," it means subtracting the mean from the
values of the independent variable before creating the products.
When is multicollinearity not a problem?
1) If your goal is to predict Y from a set of X variables, then multicollinearity is not a problem. The
predictions will still be accurate, and the overall R2 (or adjusted R2) quantifies how well the model
predicts the Y values.
2) Multiple dummy (binary) variables that represent a categorical variable with three or more categories.
Market basket analysis is the study of items that are purchased or grouped in a single transaction or
multiple, sequential transactions. Understanding the relationships and the strength of those relationships
is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons,
etc.
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations
between items. It works by looking for combinations of items that occur together frequently in
transactions. To put it another way, it allows retailers to identify relationships between the items that
people buy.
Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an
item based on the occurrences of other items in the transaction.
The technique of association rules is widely used for retail basket analysis. It can also be used for
classification by using rules with class labels on the right-hand side. It is even used for outlier detection
with rules indicating infrequent/abnormal association.
Association analysis also helps us to identify cross-selling opportunities, for example, we can use the
rules resulting from the analysis to place associated products together in a catalog, in the supermarket,
or the Webshop, or apply them when targeting a marketing campaign for product B at customers who
have already purchased product A.
Association rules are given in the form as below:
A=>B[Support,Confidence] The part before => is referred to as if (Antecedent) and the part after => is
referred to as then (Consequent).
where A and B are sets of items in the transaction data, and A and B are disjoint sets.
Computer => Anti-virus Software [Support = 20%, Confidence = 60%]. The above rule says:
1. 20% of transactions show anti-virus software bought together with a computer.
2. 60% of customers who purchase a computer also buy anti-virus software.
An example of association rules. Assume there are 100 customers:
1. 10 of them bought milk, 8 bought butter, and 6 bought both. Consider the rule: bought milk => bought butter.
2. support = P(Milk & Butter) = 6/100 = 0.06
3. confidence = support / P(Milk) = 0.06 / 0.10 = 0.6
4. lift = confidence / P(Butter) = 0.6 / 0.08 = 7.5
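The arithmetic above, reproduced in a few lines (the counts come from the example itself):

n = 100
milk, butter, both = 10, 8, 6            # customers buying milk, butter, and both

support = both / n                       # P(Milk & Butter) = 0.06
confidence = support / (milk / n)        # P(Butter | Milk) = 0.6
lift = confidence / (butter / n)         # 0.6 / 0.08 = 7.5
print(support, confidence, lift)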
KNN means K-Nearest Neighbour Algorithm. It can be used for both classification and regression.
It is one of the simplest machine learning algorithms. It is also known as lazy learning, because it does not create a generalized model during training; the testing phase does the actual work, which makes testing costly in terms of time and resources. It is also called instance-based or memory-based learning.
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its
neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is
a positive integer, typically small). If k = 1, then the object is assigned to the class of that single nearest
neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the
values of k nearest neighbors.
Common distance measures such as Euclidean, Manhattan, and Minkowski are only valid for continuous variables. In the case of categorical variables, the Hamming distance must be used.
How to choose the value of K: K value is a hyperparameter which needs to choose during the time of
model building
Also, a small number of neighbours gives the most flexible fit, which has low bias but high variance, while a large number of neighbours gives a smoother decision boundary, which means lower variance but higher bias.
We should choose an odd number if the number of classes is even. It is said the most common values are
to be 3 & 5.
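A minimal hedged sketch of KNN classification (iris and K = 5 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K is the hyperparameter discussed above
knn.fit(X_train, y_train)                   # lazy learning: fit mostly just stores the data
print(knn.score(X_test, y_test))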
A pipeline is what chains several steps together, once the initial exploration is done. For example, some
codes are meant to transform features — normalize numerically, or turn text into vectors, or fill up
missing data, and they are transformers; other codes are meant to predict variables by fitting an algorithm,
such as random forest or support vector machine, they are estimators. Pipeline chains all these together,
which can then be applied to training data in block.
Example of a pipeline that imputes data with the most frequent value of each column, and then fits a decision tree classifier:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer   # in newer scikit-learn, use SimpleImputer from sklearn.impute
from sklearn.tree import DecisionTreeClassifier

steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
         ('clf', DecisionTreeClassifier())]
pipeline = Pipeline(steps)
clf = pipeline.fit(X_train, y_train)
Instead of fitting to one model, it can be looped over several models to find the best one.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

classifiers = [KNeighborsClassifier(5), RandomForestClassifier(), GradientBoostingClassifier()]
for clf in classifiers:
    steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
             ('clf', clf)]
    pipeline = Pipeline(steps)
The pipeline itself can also be used as an estimator and passed to cross-validation or grid search:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

seed = 42  # any fixed seed for reproducibility
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)
print(results.mean())
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set
consisting of many variables correlated with each other, either heavily or lightly, while retaining the
variation present in the dataset, up to the maximum extent. The same is done by transforming the
variables to a new set of variables, which are known as the principal components (or simply, the PCs)
and are orthogonal, ordered such that the retention of variation present in the original variables decreases
as we move down in the order. So, in this way, the 1st principal component retains maximum variation
that was present in the original components. The principal components are the eigenvectors of a
covariance matrix, and hence they are orthogonal.
Main important points to be considered:
1. Normalize the data
2. Calculate the covariance matrix
3. Calculate the eigenvalues and eigenvectors
4. Choosing components and forming a feature vector
5. Forming Principal Components
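A hedged sketch of those steps with scikit-learn (synthetic correlated data; the component count is illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
X[:, 1] = 2 * X[:, 0] + rng.randn(200) * 0.1     # make two features correlated

X_std = StandardScaler().fit_transform(X)        # step 1: normalize the data
pca = PCA(n_components=3)                        # covariance and eigen-decomposition handled internally
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)             # variance retained by each component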
Understanding VIF
If the variance inflation factor of a predictor variable is 5 this means that variance for the coefficient of
that predictor variable is 5 times as large as it would be if that predictor variable were uncorrelated with
the other predictor variables.
In other words, if the variance inflation factor of a predictor variable is 5, the standard error for the coefficient of that predictor variable is about 2.24 times (√5 ≈ 2.24) as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
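A hedged statsmodels sketch computing VIFs on synthetic data with one deliberately collinear feature (column names and coefficients are illustrative):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.randn(100), 'x3': rng.randn(100)})
df['x2'] = 0.9 * df['x1'] + rng.randn(100) * 0.1         # highly correlated with x1

X = sm.add_constant(df)
vif = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vif)   # x1 and x2 show inflated VIFs; the constant's value can be ignored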
Weight of evidence (WOE) and information value (IV) are simple, yet powerful techniques to
perform variable transformation and selection.
Q10: How to evaluate that data does not have any outliers ?
In statistics, outliers are data points that don’t belong to a certain population. It is an abnormal observation
that lies far away from other values. An outlier is an observation that diverges from otherwise well-
structured data.
Detection:
Method 1 - Standard deviation: in a normal distribution, roughly 99.7% of values lie within three standard deviations of the mean. Therefore, any data point that is more than 3 standard deviations away from the mean is very likely to be anomalous or an outlier.
Method 2 — Boxplots: Box plots are a graphical depiction of numerical data through their quantiles. It
is a very simple but effective way to visualize outliers. Think about the lower and upper whiskers as the
boundaries of the data distribution. Any data points that show above or below the whiskers can be
considered outliers or anomalous.
Method 3 - Violin Plots: Violin plots are similar to box plots, except that they also show the probability
density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin
plot will include all the data that is in a box plot: a marker for the median of the data, a box or marker
indicating the interquartile range, and possibly all sample points if the number of samples is not too high.
Method 4 - Scatter Plots: A scatter plot is a type of plot or mathematical diagram using Cartesian
coordinates to display values for typically two variables for a set of data. The data are displayed as a
collection of points, each having the value of one variable determining the position on the horizontal axis
and the value of the other variable determining the position on the vertical axis.
The points which are very far away from the general spread of data and have a very few neighbors are
considered to be outliers
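A short hedged sketch of the standard-deviation and box-plot (IQR) rules on synthetic data with two injected outliers:

import numpy as np

rng = np.random.RandomState(0)
data = np.append(rng.randn(100), [8.0, -9.0])     # two obvious outliers added

# Method 1: 3-standard-deviation rule
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])

# Method 2: IQR rule used by box-plot whiskers
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])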
Q11: What do you do if there are outliers?
Q12: What are the encoding techniques you have applied with
Examples ?
In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values. Since machine learning is based on mathematical equations, it would cause a problem if we kept categorical variables as they are.
Let's consider the following dataset of fruit names and their weights.
Some of the common encoding techniques are:
Label encoding: In label encoding, we map each category to a number or a label. The labels chosen for
the categories have no relationship. So categories that have some ties or are close to each other lose such
information after encoding.
One - hot encoding: In this method, we map each category to a vector that contains 1 and 0 denoting
the presence of the feature or not. The number of vectors depends on the categories which we want to
keep. For high cardinality features, this method produces a lot of columns that slows down the learning
significantly.
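A hedged sketch of both techniques on a tiny fruit/weight table like the one described above (the values are made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'fruit': ['apple', 'orange', 'apple', 'banana'],
                   'weight': [150, 120, 160, 110]})

df['fruit_label'] = LabelEncoder().fit_transform(df['fruit'])   # label encoding
one_hot = pd.get_dummies(df['fruit'], prefix='fruit')           # one-hot encoding
print(pd.concat([df, one_hot], axis=1))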
Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance).
The prediction error for any machine learning algorithm can be broken down into three parts:
Bias Error
Variance Error
Irreducible Error
The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced
from the chosen framing of the problem and may be caused by factors like unknown variables that
influence the mapping of the input variables to the output variable.
Bias: Bias means that the model favors one result more than the others. Bias is the simplifying
assumptions made by a model to make the target function easier to learn. The model with high bias pays
very little attention to the training data and oversimplifies the model. It always leads to a high error in
training and test data.
Variance: Variance is the amount that the estimate of the target function will change if different training
data was used. The model with high variance pays a lot of attention to training data and does not
generalize on the data which it hasn’t seen before. As a result, such models perform very well on training
data but have high error rates on test data.
So, the end goal is to come up with a model that balances both Bias and Variance. This is called Bias
Variance Trade-off. To build a good model, we need to find a good balance between bias and variance
such that it minimizes the total error.
Q14: What is the difference between Type 1 and Type 2 error and
severity of the error?
Type I Error
A Type I error is often referred to as a “false positive" and is the incorrect rejection of the true null
hypothesis in favor of the alternative.
Consider an HIV test as an example. The null hypothesis refers to the natural state of things or the absence of the tested effect or phenomenon, i.e., stating that the patient is HIV negative. The alternative hypothesis states that
the patient is HIV positive. Many medical tests will have the disease they are testing for as the alternative
hypothesis and the lack of that disease as the null hypothesis.
A Type I error would thus occur when the patient doesn’t have the virus, but the test shows that they do.
In other words, the test incorrectly rejects the true null hypothesis that the patient is HIV negative.
Type II Error
A Type II error is the inverse of a Type I error and is the false acceptance of a null hypothesis that is not
true, i.e., a false negative. A Type II error would entail the test telling the patient they are free of HIV
when they are not.
Considering this HIV example, which error type do you think is more acceptable? In other words, would
you rather have a test that was more prone to Type I or Types II error? With HIV, the momentary stress
of a false positive is likely better than feeling relieved at a false negative and then failing to take steps to
treat the disease. Pregnancy tests, blood tests, and any diagnostic tool that has serious consequences for
the health of a patient are usually overly sensitive for this reason – they should err on the side of a false
positive.
But in most fields of science, Type II errors are seen as less serious than Type I errors. With the Type II
error, a chance to reject the null hypothesis was lost, and no conclusion is inferred from a non-rejected
null. But the Type I error is more serious because you have wrongly rejected the null hypothesis and
ultimately made a claim that is not true. In science, finding a phenomenon where there is none is more
egregious than failing to find a phenomenon where there is.
Q16: What is the Mean Median Mode standard deviation for the
sample and population?
Mean: it is an important technique in statistics. The arithmetic mean can also be called the average. It is the quantity obtained by summing two or more numbers/variables and then dividing the sum by the count of those numbers/variables.
Mode The mode is also one of the types for finding the average. A mode is a number that occurs most
frequently in a group of numbers. Some series might not have any mode; some might have two modes,
which is called a bimodal series.
In the study of statistics, the three most common ‘averages’ in statistics are mean, median, and mode.
Median is also a way of finding the average of a group of data points. It’s the middle number of a set of
numbers. There are two possibilities, the data points can be an odd number group, or it can be an even
number group.
If the group is odd, arrange the numbers in the group from smallest to largest. The median will be the
one which is exactly sitting in the middle, with an equal number on either side of it. If the group is even,
arrange the numbers in order and pick the two middle numbers and add them then divide by 2. It will be
the median number of that set.
Standard Deviation (Sigma) Standard Deviation is a measure of how much your data is spread out in
statistics.
What is Absolute Error? Absolute error is the amount of error in your measurements: the difference between the measured value and the "true" value. For example, if a scale states 90 pounds but you know your true weight is 89 pounds, then the scale has an absolute error of 90 lbs - 89 lbs = 1 lb.
This can be caused by your scale not measuring the exact amount you are trying to measure. For example, your scale may be accurate to the nearest pound. If you weigh 89.6 lbs, the scale may "round up" and give you 90 lbs. In this case the absolute error is 90 lbs - 89.6 lbs = 0.4 lbs.
Mean Absolute Error The Mean Absolute Error(MAE) is the average of all absolute errors. The formula
is: mean absolute error
Where:
n = the number of errors, Σ = the summation symbol (which means "add them all up"), and |xi - x| = the absolute errors. The formula may look a little daunting, but the steps are easy:
find all of your absolute errors |xi - x|, add them all up, and divide by the number of errors. For example, if you had 10 measurements, divide by 10.
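The same steps in a couple of lines of NumPy (the measured and true values are made up for illustration):

import numpy as np

measured = np.array([90.0, 88.5, 70.2])
true = np.array([89.6, 89.0, 70.0])
mae = np.mean(np.abs(measured - true))   # average of the absolute errors
print(mae)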
Q18: What is the difference between long data and wide data?
There are many different ways that you can present the same dataset to the world. Let's take a look at
one of the most important and fundamental distinctions, whether a dataset is wide or long.
The difference between wide and long datasets boils down to whether we prefer to have more columns
in our dataset or more rows.
Wide Data A dataset that emphasizes putting additional data about a single subject in columns is called
a wide dataset because, as we add more columns, the dataset becomes wider.
Long Data Similarly, a dataset that emphasizes including additional data about a subject in rows is called
a long dataset because, as we add more rows, the dataset becomes longer. It's important to point out that
there's nothing inherently good or bad about wide or long data.
In the world of data wrangling, we sometimes need to make a long dataset wider, and we sometimes need
to make a wide dataset longer. However, it is true that, as a general rule, data scientists who embrace the
concept of tidy data usually prefer longer datasets over wider ones.
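A hedged pandas sketch converting a tiny made-up table between the two shapes:

import pandas as pd

wide = pd.DataFrame({'student': ['A', 'B'], 'math': [90, 75], 'science': [85, 80]})

long = wide.melt(id_vars='student', var_name='subject', value_name='score')   # wide -> long
back = long.pivot(index='student', columns='subject', values='score')         # long -> wide
print(long)
print(back)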
Q19: What are the data normalization method you have applied, and
why?
Normalization is a technique often applied as part of data preparation for machine learning. The goal of
normalization is to change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values. For machine learning, every dataset does not require
normalization. It is required only when features have different ranges.
In simple words, when multiple attributes are there, but attributes have values on different scales, this
may lead to poor data models while performing data mining operations. So they are normalized to bring
all the attributes on the same scale, usually something between (0,1).
It is not always a good idea to normalize the data since we might lose information about maximum and
minimum values. Sometimes it is a good idea to do so.
For example, ML algorithms such as Linear Regression or Support Vector Machines typically converge
faster on normalized data. But on algorithms like K-means or K Nearest Neighbours, normalization could
be a good choice or a bad depending on the use case since the distance between the points plays a key
role here.
Types of Normalisation:
1. Min-Max normalization: rescales each feature to a fixed range, usually [0, 1]. It is typically applied feature-wise:
v' = (v - min_A) / (max_A - min_A)
where v' and v are the new and old values of an entry, and min_A and max_A are the minimum and maximum of attribute A.
2. Z-score normalization: in this technique, values are normalized based on the mean and standard deviation of the data:
v' = (v - Ā) / σ_A
where v' and v are the new and old values of each entry in the data, and Ā and σ_A are the mean and standard deviation of attribute A.
Standardization (or Z-score normalization) means the features are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean; standard scores (also called z-scores) of the samples are calculated as follows:
z = (x - μ) / σ
Q20: What is the difference between normalization and
Standardization with example?
In ML, every practitioner knows that feature scaling is an important issue. The two most discussed
scaling methods are Normalization and Standardization. Normalization typically means it rescales the
values into a range of [0,1].
An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also called "normalization" - a common cause of ambiguity). In this approach, the data is scaled to a fixed range - usually 0 to 1. Scikit-Learn provides a transformer called MinMaxScaler for this. Min-Max scaling is typically done via the following equation:
X_norm = (X - X_min) / (X_max - X_min)
(An example table with sample data - attributes Price in Dollars, Storage Space, and Camera - before and after normalization is omitted here.)
Standardization (or Z-score normalization) typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance). Formula: Z or X_new = (x - μ) / σ, where μ is the mean (average) and σ is the standard deviation from the mean; the results are standard scores (also called z-scores). Scikit-Learn provides a transformer called StandardScaler for standardization.
Example: let's take an approximately normally distributed set of numbers: 1, 2, 2, 3, 3, 3, 4, 4, and 5. Its mean is 3 and its (sample) standard deviation is 1.22. Now, subtract the mean from all data points; we get a new data set of: -2, -1, -1, 0, 0, 0, 1, 1, and 2. Now, divide each data point by 1.22; we get: -1.63, -0.82, -0.82, 0, 0, 0, 0.82, 0.82, and 1.63.
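A hedged scikit-learn sketch on the same set of numbers (note that StandardScaler uses the population standard deviation, so its output differs slightly from the hand calculation above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5], dtype=float).reshape(-1, 1)

print(MinMaxScaler().fit_transform(data).ravel())    # rescaled to the range [0, 1]
print(StandardScaler().fit_transform(data).ravel())  # mean 0, unit variance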
DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview
Preparation)
# DAY 04
Q1. What is upsampling and downsampling with examples?
What is Hypothesis Testing?
Null: Two samples mean are equal. Alternate: Two samples mean are not
equal.
For rejecting the null hypothesis, a test is calculated. Then the test statistic
is compared with a critical value, and if found to be greater than the critical
value, the hypothesis will be rejected.
Critical Value:-
Critical values are the point beyond which we reject the null hypothesis.
The critical value tells us the probability of N samples belonging to the same distribution: the higher the critical value, the lower the probability that the N samples belong to the same distribution.
Critical values can be used to do hypothesis testing in the following way.
IMP - If the test statistic is lower than the critical value, we accept (fail to reject) the null hypothesis; otherwise, we reject it.
Chi-Square Test:-
A chi-square test is used to check whether there is a relationship between two categorical variables.
The chi-square test is used to determine whether there is a significant difference between the expected frequency and the observed frequency in one or more categories. Chi-square is also called a non-parametric test, as it does not rely on population parameters.
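A hedged SciPy sketch on a made-up 2x2 contingency table of two categorical variables:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 40]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)   # a small p-value suggests the two variables are related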
2-Anova test:-
ANOVA, also called an analysis of variance, is used to compare multiples
(three or more) samples with a single test.
It is useful when there are three or more populations. ANOVA compares the variance within and between the groups of the population. If the between-group variation is much larger than the within-group variation, the means of the different samples will not be equal. If the between and within variations are approximately the same size, there will be no significant difference between the sample means.
Assumptions of ANOVA:
1. All populations involved follow a normal distribution.
2. All populations have the same variance (or standard deviation).
3. The samples are randomly selected and independent of one another.
ANOVA uses the mean of the samples or the population to reject or
support the null hypothesis. Hence it is called parametric testing.
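A hedged SciPy sketch of a one-way ANOVA on three made-up groups:

import numpy as np
from scipy.stats import f_oneway

rng = np.random.RandomState(0)
g1, g2, g3 = rng.normal(5.0, 1, 20), rng.normal(5.5, 1, 20), rng.normal(7.0, 1, 20)
stat, p = f_oneway(g1, g2, g3)
print(stat, p)   # a small p-value suggests at least one group mean differs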
3 - Z-statistic:-
In a z-test, the samples are assumed to be normally distributed. A z-score is calculated with the population parameters "population mean" and "population standard deviation", and it is used to validate the hypothesis that the sample drawn belongs to the same population.
The statistic used for this hypothesis test is called the z-statistic, and its score is calculated as
z = (x̄ - μ) / (σ / √n)
where x̄ = the sample mean, μ = the population mean, and σ / √n = the population standard deviation divided by the square root of the sample size (the standard error). If the test statistic is lower than the critical value, accept the hypothesis; otherwise, reject it.
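The z-statistic formula above, applied to a made-up sample (the population mean and standard deviation are assumed known):

import numpy as np

sample = np.array([52.1, 49.8, 50.5, 51.2, 48.9, 50.7])
mu, sigma = 50.0, 1.5                                  # assumed population parameters
z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))
print(z)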
4 - T-statistic:-
A t-test is used to compare the means of the given samples. Like the z-test, the t-test also assumes a normal distribution of the samples. A t-test is used when the population parameters (mean and standard deviation) are unknown.
There are three versions of the t-test: the one-sample t-test, the independent (two-sample) t-test, and the paired t-test.
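A hedged SciPy sketch of a two-sample t-test on made-up data (population parameters are treated as unknown):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.RandomState(0)
a = rng.normal(5.0, 1.0, 25)
b = rng.normal(5.8, 1.0, 25)
stat, p = ttest_ind(a, b)
print(stat, p)   # a small p-value suggests the two sample means differ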
5 - F-statistic:-
The F-test is designed to test if the two population variances are equal. It
compares the ratio of the two variances. Therefore, if the variances are
equal, then the ratio of the variances will be 1.
The F-distribution is the ratio of two independent chi-square variables
divided by their respective degrees of freedom.
F = s1^2 / s2^2 and where s1^2 > s2^2.
If the null hypothesis is true, the F test statistic given above simplifies, and this ratio of sample variances is the test statistic used. If the observed ratio is far from 1, we reject the null hypothesis that the ratio is equal to 1, and with it our assumption that the variances were equal.
μ_X̄ = μ
where μ_X̄ = the mean of the sample means and μ = the population mean. The standard deviation of the sample mean is denoted as:
σ_X̄ = σ / sqrt(n)
where σ_X̄ = the standard deviation of the sample mean, σ = the population standard deviation, and n = the sample size.
A sufficiently large sample size can predict the characteristics of a population accurately. For example, if we take uniformly distributed data, the plot of the sample means is normally distributed; even for randomly (exponentially) distributed data, the plot of the means is normally distributed.
The advantage of CLT is that we need not worry about the actual data
since the means of it will always be normally distributed. With this, we can
create component intervals, perform T-tests and ANOVA tests from the
given samples.
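A small numpy simulation of this idea: we draw many samples from an exponential (clearly non-normal) distribution and look at the distribution of their means, which comes out roughly normal. All sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Exponential population (skewed, not normal)
population = rng.exponential(scale=2.0, size=100_000)

# Draw 5,000 samples of size 50 and record each sample mean
sample_means = np.array([
    rng.choice(population, size=50).mean()
    for _ in range(5000)
])

# The sample means cluster around the population mean (2.0) and look normal
print("mean of sample means :", sample_means.mean())
print("std of sample means  :", sample_means.std())
print("theoretical std error:", population.std() / np.sqrt(50))
```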
Q: What is the difference between Machine Learning and Deep Learning?
Machine Learning | Deep Learning
> Machine learning is a technique to learn from data and then apply what has been learnt to make an
informed decision. Machine-learning models become better progressively, but the model still needs
some guidance: if it returns an inaccurate prediction, the programmer needs to fix that problem
explicitly. | A deep-learning model corrects the problem by itself.
> Machine learning can perform well with small datasets. | Deep learning does not perform as well
with smaller datasets.
> It is generally recommended to break the problem into smaller chunks, solve them and then combine
the results. | Deep learning generally focuses on solving the problem end to end.
> Results are more interpretable. | Results may be more accurate but less interpretable.
> Solves comparatively less complex problems. | Solves more complex problems.
Though traditional ML algorithms solve a lot of our cases, they are not
useful while working with high-dimensional data, that is, where we have a
large number of inputs and outputs. For example, in the case of
handwriting recognition, we have a large amount of input, with different
types of inputs associated with different types of handwriting.
The second major challenge is to tell the computer which features it
should look for that will play an important role in predicting the outcome, as
well as to achieve better accuracy while doing so. Deep learning learns such
features on its own, which is why it is preferred for tasks such as image
recognition and object detection.
Sigmoid or Logistic activation: its mathematical expression is σ(x) = 1 / (1 + e^(-x)), which squashes
any input into the range (0, 1).
Gradient descent is an optimisation algorithm used to minimise some
function by iteratively moving in the direction of steepest descent, as
defined by the negative of the gradient. In machine learning, we use gradient
descent to update the parameters of our model. Parameters refer to
coefficients in linear regression and weights in neural networks.
The size of these steps is called the learning rate. With a high learning rate,
we can cover more ground with each step, but we risk overshooting the lowest
point since the slope of the hill is constantly changing. With a very low
learning rate, we can confidently move in the direction of the negative
gradient because we are recalculating it so frequently. A lower learning
rate is more precise, but calculating the gradient is time-consuming, so it
will take a very long time to get to the bottom.
Math: for linear regression, the cost function is the mean squared error,
MSE(m, b) = (1/N) Σ (y_i - (m·x_i + b))².
Now let's run gradient descent using this cost function. There are two
parameters in the cost function we can control: m (weight) and b (bias). Since
we need to consider the impact each one has on the final prediction,
we need to use partial derivatives. We calculate the partial derivative of the
cost function with respect to each parameter and store the results in a gradient:
∂MSE/∂m = (1/N) Σ -2·x_i·(y_i - (m·x_i + b))
∂MSE/∂b = (1/N) Σ -2·(y_i - (m·x_i + b))
An advantage of gradient descent is that it does not need any special knowledge of the form of the
function to be learned; only the gradient is required.
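A minimal numpy sketch of these updates for the linear-regression case above, with made-up data and an arbitrary learning rate:

```python
import numpy as np

# Toy data roughly following y = 3x + 2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

m, b = 0.0, 0.0          # initial weight and bias
learning_rate = 0.01
n = len(x)

for _ in range(2000):
    y_pred = m * x + b
    # Partial derivatives of MSE with respect to m and b
    dm = (-2.0 / n) * np.sum(x * (y - y_pred))
    db = (-2.0 / n) * np.sum(y - y_pred)
    # Step in the direction of the negative gradient
    m -= learning_rate * dm
    b -= learning_rate * db

print(f"learned m = {m:.3f}, b = {b:.3f}")  # should be close to 3 and 2
```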
2. Multiply the sampled values by the square root of (2/ni), where ni is the
number of input units for that layer (this gives He initialisation, commonly used with ReLU).
2. Multiply the sampled values by the square root of (1/ni), where ni is the number of
input units for that layer (this gives Xavier initialisation).
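These two fragments appear to describe He and Xavier weight initialisation; a short numpy sketch of both, with hypothetical layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128   # hypothetical layer sizes

# Step 1: sample from a standard normal distribution
base = rng.standard_normal((n_out, n_in))

# Step 2 (He init, suited to ReLU): scale by sqrt(2 / n_in)
w_he = base * np.sqrt(2.0 / n_in)

# Step 2 (Xavier init, suited to tanh/sigmoid): scale by sqrt(1 / n_in)
w_xavier = base * np.sqrt(1.0 / n_in)

print(w_he.std(), w_xavier.std())
```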
Q13: What is an optimiser in deep learning, and which one is the best?
Gradient Descent
The basic update rule is θ = θ - η·∇J(θ; x, y),
where θ is the weight parameter, η is the learning rate, and ∇J(θ; x, y) is the
gradient of the loss J with respect to the weight parameter θ.
In batch gradient descent, we use the entire dataset to compute the gradient of
the cost function for each iteration of gradient descent and then update the
weights.
Mini-batch gradient descent is widely used; it converges faster and is
more stable.
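A short numpy sketch of a mini-batch update loop on a toy linear model; the batch size, learning rate and data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

theta = np.zeros(2)          # [m, b]
eta, batch_size = 0.01, 32

for epoch in range(50):
    idx = rng.permutation(len(x))             # shuffle each epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        y_pred = theta[0] * xb + theta[1]
        grad = np.array([
            (-2.0 / len(xb)) * np.sum(xb * (yb - y_pred)),   # dJ/dm
            (-2.0 / len(xb)) * np.sum(yb - y_pred),          # dJ/db
        ])
        theta -= eta * grad                    # theta = theta - eta * gradient

print("m, b =", theta)
```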
An autoencoder has an input layer, a hidden layer (also known as the encoding layer) and a
decoding layer. The network is trained to reconstruct its inputs, which
forces the hidden layer to try to learn good representations of the inputs.
An autoencoder neural network is an unsupervised machine-learning
algorithm that applies backpropagation, setting the target values to be
equal to the inputs. An autoencoder is trained to attempt to copy its input
to its output. Internally, it has a hidden layer that describes a code used
to represent the input.
Autoencoder Components:
1- Encoder: In this, the model learns how to reduce the input dimensions
and compress the input data into an encoded representation.
2- Bottleneck (code): the layer that contains the compressed, encoded representation of the input.
3- Decoder: In this, the model learns how to reconstruct the data from the
encoded representation so that it is as close to the original input as possible.
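A minimal Keras sketch of such an encoder-bottleneck-decoder network; the 784-dimensional input and 32-dimensional code are arbitrary choices (e.g. flattened 28x28 images), and random data stands in for a real dataset:

```python
import numpy as np
from tensorflow.keras import layers, Model

input_dim, code_dim = 784, 32            # e.g. flattened 28x28 images, 32-d code

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(code_dim, activation="relu")(inputs)        # encoder
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Targets equal the inputs, so the network learns to reconstruct them
x = np.random.rand(1000, input_dim).astype("float32")
autoencoder.fit(x, x, epochs=5, batch_size=64, verbose=0)
```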
Types of Autoencoders: common variants include denoising, sparse, convolutional and variational autoencoders.
Convolutional layers are the major building blocks used in
convolutional neural networks.
1. Input Layer: It holds the raw input of the image with width 32, height 32
and depth 3.
2. Convolution Layer: It computes the output volume by computing dot products between the
filters and local patches of the input; with 12 filters the output volume has dimension 32x32x12.
3. Activation Function Layer: It applies an element-wise activation function (such as ReLU) to
the output of the convolution layer; the volume remains 32x32x12.
4. Pool Layer: This layer is periodically inserted within the covnets, and
its main function is to reduce the size of the volume, which makes the
computation fast, reduces memory and also prevents overfitting. Two
common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16x16x12.
Pooling Layer
The most common form is a pooling layer with filters of size 2x2 applied
with a stride of 2, which downsamples every depth slice in the input by two along
both width and height, discarding 75% of the activations. Every MAX
operation would, in this case, take a max over four numbers (a little 2x2
region in some depth slice). The depth dimension remains unchanged.
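A quick Keras check of the 32x32x12 to 16x16x12 example above, on random data just to confirm the shapes:

```python
import numpy as np
from tensorflow.keras import layers

volume = np.random.rand(1, 32, 32, 12).astype("float32")   # batch of one 32x32x12 volume

max_pool = layers.MaxPooling2D(pool_size=(2, 2), strides=2)
pooled = max_pool(volume)

print(pooled.shape)   # (1, 16, 16, 12): width and height halved, depth unchanged
```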
LeNet in 1998
AlexNet in 2012
VGG in 2014
VGG was submitted in the year 2014 and became the runner-up in that year's
ImageNet contest. It is widely used, being a simpler architecture
compared to AlexNet.
GoogleNet in 2014
In 2014, several great models were developed like VGG, but the winner of
the ImageNet contest was GoogleNet.
ResNet in 2015
There are 152 layers in the Microsoft ResNet. The authors showed
empirically that if you keep on adding layers, the error rate keeps on
decreasing, in contrast to "plain nets", where adding more layers resulted in
higher training and test errors.
Learning Rate
The learning rate controls how much we adjust the weights
with respect to the loss gradient. The learning rate is a hyperparameter set before training,
often initialised somewhat arbitrarily and then tuned.
The lower the learning rate, the slower the convergence to the global minimum.
Very high values of the learning rate may cause gradient descent to overshoot and fail to converge.
Since our goal is to minimise the cost function to find the optimised values
for the weights, we run multiple iterations with different weights and calculate
the cost to arrive at the minimum cost.
----------------------------------------------------------------------------------------------------
# Day-5
Since one epoch is too large to feed to the computer at once, we divide it into several smaller batches.
We always use more than one epoch because a single epoch leads to underfitting.
As the number of epochs increases, the weights are updated more times in the neural network, and the
curve goes from underfitting to optimal to overfitting.
Unlike the learning rate hyperparameter, whose value doesn't affect computation time, the batch
size must be examined in conjunction with the execution time of training. The batch size is limited by
the hardware's memory, while the learning rate is not. Leslie recommends using a batch size that fits in the
hardware's memory and enables using a larger learning rate.
If our server has multiple GPUs, the total batch size is the batch size on one GPU multiplied by the number
of GPUs. If the architectures are small or your hardware permits very large batch sizes, then you might
compare the performance of different batch sizes. Also, recall that small batch sizes add regularization
while large batch sizes add less, so utilize this while balancing the proper amount of regularization. It is
often better to use large batch sizes so a larger learning rate can be used.
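A small sketch of how batch size, steps per epoch and total batch size across GPUs relate; the numbers are made up:

```python
import math

n_samples  = 50_000     # size of the training set
batch_size = 128        # examples per gradient update on one GPU
n_gpus     = 4          # hypothetical multi-GPU server
epochs     = 10

steps_per_epoch  = math.ceil(n_samples / batch_size)
total_batch_size = batch_size * n_gpus   # effective batch size across all GPUs
total_updates    = steps_per_epoch * epochs

print(steps_per_epoch, total_batch_size, total_updates)
```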
More technically, at each training stage, individual nodes are either dropped out of the net with
probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges
to a dropped-out node are also removed.
Where to use
It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and
recurrent layers such as the long short-term memory network layer.
Dropout may be implemented on any or all hidden layers in the network as well as the visible or input
layer. It is not used on the output layer.
Benefits:-
1. Dropout forces a neural network to learn more robust features that are very useful in conjunction
with different random subsets of the other neurons.
2. Dropout generally doubles the number of iterations required to converge. However, the training
time for each epoch is less.
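A minimal Keras sketch placing dropout between hidden layers; the rates and layer sizes are arbitrary, and as noted above dropout is applied to hidden or input layers but not the output layer:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                   # randomly drops 50% of these activations during training
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid")  # no dropout on the output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```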
• Grid Search
• Random Search
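A brief scikit-learn sketch of both search strategies on a hypothetical random-forest model and parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: tries every combination in the grid
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
grid.fit(X, y)
print("grid search best params:", grid.best_params_)

# Random search: samples a fixed number of combinations
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=4, cv=3, random_state=0)
rand.fit(X, y)
print("random search best params:", rand.best_params_)
```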
1. Observe and understand the clues available during training by monitoring validation/test
loss early in training, tune your architecture and hyper-parameters with short runs of a few
epochs.
2. Signs of underfitting or overfitting of the test or validation loss early in the training process are
useful for tuning the hyper-parameters.
• Sage Maker
• Comet.ml
• Weights & Biases
• Deep Cognition
• Azure ML
In most learning networks, an error is calculated as the difference between the predicted output and the
actual output.
The function that is used to compute this error is known as the Loss Function, J(.). Different loss functions
will give different errors for the same prediction, and thus have a considerable effect on the performance
of the model. One of the most widely used loss functions is the mean squared error, which calculates the
square of the difference between the actual and predicted values. Different loss functions are used to deal
with different types of tasks, i.e. regression and classification.
Regression losses:
1. Mean Squared Error
2. Absolute Error
Classification losses:
1. Binary Cross-Entropy
2. Negative Log-Likelihood
3. Margin Classifier
4. Soft Margin Classifier
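A small numpy sketch computing mean squared error, absolute error and binary cross-entropy on made-up predictions:

```python
import numpy as np

# Regression: mean squared error and mean absolute error
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.4, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

# Classification: binary cross-entropy on predicted probabilities
labels = np.array([1, 0, 1, 1])
probs  = np.array([0.9, 0.2, 0.7, 0.6])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(f"MSE={mse:.3f}, MAE={mae:.3f}, BCE={bce:.3f}")
```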
Activation functions decide whether a neuron should be activated or not by calculating a weighted sum
and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output
of a neuron.
In a neural network, we would update the weights and biases of the neurons based on the error at the
outputs. This process is known as back-propagation. Activation function makes the back-propagation
possible since the gradients are supplied along with the errors to update the weights and biases.
A neural network without activation functions is essentially a linear regression model. The activation
functions do the non-linear transformation to the input, making it capable of learning and performing
more complex tasks.
1. Identity
2. Binary Step
3. Sigmoid
4. Tanh
5. ReLU
6. Leaky ReLU
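A compact numpy sketch of several of these activations applied to the same inputs:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

identity    = x
binary_step = (x >= 0).astype(float)
sigmoid     = 1.0 / (1.0 + np.exp(-x))
tanh        = np.tanh(x)
relu        = np.maximum(0.0, x)
leaky_relu  = np.where(x > 0, x, 0.01 * x)

for name, val in [("sigmoid", sigmoid), ("tanh", tanh),
                  ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, np.round(val, 3))
```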
Q7: What do you understand by the vanishing gradient problem, and how
do we solve it?
The problem:
As more layers using certain activation function are added to neural networks, the gradients of the loss
function approach zero, making the networks tougher to train.
Why:
Certain activation functions, like the sigmoid function, squash a large input space into a small output
space between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small
change in the output. Hence, the derivative becomes small.
For shallow networks with only a few layers that use these activations, this isn't a big problem.
However, when n hidden layers use an activation like the sigmoid function, n small derivatives are
multiplied together. Thus, the gradient decreases exponentially as we propagate back to the initial layers,
and it can become too small for training to work effectively.
The simplest solution is to use other activation functions, such as ReLU, which doesn’t cause a small
derivative.
Residual networks are another solution, as they provide residual connections straight to earlier layers.
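A tiny numpy illustration of why this happens: the sigmoid derivative is at most 0.25, so a chain of them shrinks exponentially, while ReLU's derivative on positive inputs stays at 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) is largest at z = 0, where it equals 0.25
max_sigmoid_grad = sigmoid(0.0) * (1 - sigmoid(0.0))

for n_layers in [5, 10, 20, 50]:
    chained = max_sigmoid_grad ** n_layers   # product of n such derivatives
    print(f"{n_layers:2d} sigmoid layers -> gradient factor {chained:.2e}")
# Compare: ReLU's derivative is 1 for positive inputs, so the chain does not shrink.
```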
Transfer learning is a popular approach in deep learning where pre-trained models are used as the
starting point for computer vision and natural language processing tasks, given the vast compute and
time resources required to develop neural network models for these problems.
Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a
second related task.
Transfer learning is an optimization that allows rapid progress or improved performance when modelling
the second task.
Transfer learning only works in deep learning if the model features learned from the first task are general.
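A Keras sketch of the idea, reusing a pretrained VGG16 as a frozen feature extractor for a hypothetical 5-class task; the class count and head layers are arbitrary choices:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pretrained convolutional base, without the original ImageNet classifier head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False            # freeze the general features learned on ImageNet

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax")   # new head for the 5-class target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```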
This architecture is from the VGG group, Oxford. It improves on AlexNet by replacing the large kernel-
sized filters with multiple 3x3 kernel-sized filters one after another. For a given receptive field (the
effective area of the input image on which the output depends), multiple stacked smaller-size kernels are
better than a single larger-size kernel, because multiple non-linear layers increase the depth of the
network, which enables it to learn more complex features, and at a lower cost.
Three fully connected layers follow the VGG convolutional layers. The width of the network starts at
the small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves
a top-5 accuracy of 92.3% on ImageNet.
The experiments in the paper show the power of the residual network. The plain 34-layer network had
higher validation error than the 18-layer plain network; this is where we observe the degradation problem.
The same 34-layer network, when converted to a residual network, has much lower training error
than the 18-layer residual network.
When we hear about "ImageNet" in the context of deep learning and Convolutional Neural Networks,
we are referring to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
The main aim of this image classification challenge is to train a model that can correctly classify an
input image into one of 1,000 separate object categories.
These 1,000 image categories represent object classes that we encounter in our day-to-day lives, such as
species of dogs, cats, various household objects, vehicle types, and much more.
When it comes to image classification, the ImageNet challenge is the "de facto" benchmark for
computer vision classification algorithms, and the leaderboard for this challenge has
been dominated by Convolutional Neural Networks and deep learning techniques since 2012.
Clone the repo locally, and you have it. To compile it, run make. But first, if you intend to use the GPU
capability, you need to edit the Makefile in the first two lines, where you tell it to compile for GPU usage
with CUDA drivers.
Q13: What is YOLO, and explain the architecture of YOLO (you only
look once)?
The first YOLO (You Only Look Once) version came out around May 2016 and sets the core of the
algorithm; the following versions are improvements that fix some drawbacks.
Core Concept:-
The algorithm works by dividing the image into a grid of cells; for each cell, bounding boxes
and their confidence scores are predicted, alongside class probabilities. The confidence is given in terms
of the IOU (intersection over union) metric, which measures how much the detected object overlaps with
the ground truth as a fraction of the total area spanned by the two together (the union).
YOLO v2-
This improves on some of the shortcomings of the first version, namely the fact that it is not very good
at detecting objects that are very near to each other and that it tends to make some mistakes on localization.
It introduces a few new things: anchor boxes (pre-determined sets of boxes, so that the
network moves from predicting the bounding boxes directly to predicting offsets from these) and the use
of more fine-grained features, so that smaller objects can be predicted better.
YOLO v3-
YOLOv3 came out in April 2018, and it adds small improvements, including the fact that bounding boxes
are predicted at different scales. The underlying meaty part of the YOLO network, Darknet, is
expanded in this version to have 53 convolutional layers.
# DAY 06
Q1. What is NLP?
Natural language processing (NLP): It is the branch of artificial intelligence that helps computers
understand, interpret and manipulate human language. NLP draws from many disciplines, including
computer science and computational linguistics, in its pursuit to fill the gap between human
communication and computer understanding.
• Counting the number of times each word appears in the document.
• Calculating the frequency with which each word appears in a document, out of all the words in
the document.
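These bullets describe word counting as in a bag-of-words representation; a brief scikit-learn sketch on a made-up mini-corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(counts.toarray())                     # raw word counts per document
```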
Q7. What do you understand by TF-IDF?
TF-IDF: It stands for term frequency-inverse document frequency.
TF-IDF weight: It is a statistical measure used to evaluate how important a word is to a document in
a collection or corpus. The importance increases proportionally to the number of times a word appears
in the document but is offset by the frequency of the word in the corpus.
• Term Frequency (TF): It is a scoring of the frequency of the word in the current document.
Since every document is different in length, it is possible that a term would appear many
more times in long documents than in shorter ones. The term frequency is often divided by the
document length to normalise it.
• Inverse Document Frequency (IDF): It is a scoring of how rare the word is across the
documents. The rarer the term, the higher the IDF score.
Thus, TF-IDF weight = TF × IDF, where IDF is commonly computed as log(total number of documents /
number of documents containing the term).
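A minimal scikit-learn sketch of TF-IDF on a toy corpus (note that sklearn uses a smoothed variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))   # higher weight = frequent in this doc, rare elsewhere
```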
In the Skip-gram model, we take a centre word and a window of context (neighbour) words, and
we try to predict the context words out to some window size for each centre word. So, our
model is going to define a probability distribution, i.e. the probability of a word appearing in the
context given a centre word, and we are going to choose our vector representations to maximise
the probability.
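A short gensim sketch of a skip-gram Word2Vec model on a toy corpus; the parameter names assume gensim 4.x, and sg=1 selects skip-gram:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"],
             ["cats", "and", "dogs", "are", "pets"]]

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)   # sg=1 -> skip-gram

print(model.wv["cat"][:5])                   # first few dimensions of a word vector
print(model.wv.most_similar("cat", topn=3))  # nearest words in the learned space
```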
The basic idea behind PV-DM is inspired by Word2Vec. In the CBOW model of Word2Vec, the
model learns to predict a centre word based on the context. For example, given the sentence
"The cat sat on the table", the CBOW model would learn to predict the word "sat" given the
context words: the, cat, on and table. Similarly, in PV-DM the main idea is: randomly sample
consecutive words from the paragraph and predict a centre word from the randomly sampled
set of words by taking as input the context words and the paragraph id.
Let's have a look at the model diagram for some more clarity. In this model, we see the
Paragraph matrix, the Average/Concatenate step and the classifier sections.
Paragraph matrix: It is the matrix where each column represents the vector of a paragraph.
Average/Concatenate: It indicates whether the word vectors and the paragraph vector are
averaged or concatenated.
Classifier: It takes the hidden-layer vector (the one that was concatenated/averaged) as
input and predicts the centre word.
The matrix D holds the embeddings for "seen" paragraphs (i.e. arbitrary-length
documents), in the same way the Word2Vec model learns embeddings for words. For unseen
paragraphs, the model is again run through gradient descent (5 or so iterations) to infer a
document vector.
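A brief gensim Doc2Vec sketch of PV-DM (dm=1) on a toy corpus, including inference of a vector for an unseen paragraph; parameter names assume gensim 4.x:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
          TaggedDocument(words=["the", "dog", "sat", "on", "the", "log"], tags=["doc1"])]

model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1,
                dm=1, epochs=50)        # dm=1 -> PV-DM (distributed memory)

print(model.dv["doc0"][:5])             # learned vector for a "seen" paragraph

# For an unseen paragraph, a few gradient-descent steps infer its vector
unseen = model.infer_vector(["a", "cat", "on", "a", "mat"])
print(unseen[:5])
```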
Time series forecasting is a technique for the prediction of events through a sequence of
time. The technique is used across many fields of study, from geology to behaviour to
economics. The techniques predict future events by analysing the trends of the past, on the
assumption that future trends will be similar to historical trends.
Time-series:
1. Data is recorded at regular intervals of time.
2. Time-series forecasting is extrapolation.
3. Time-series refers to an ordered series of data.
Regression:
1. In regression, we can apply the model whether data is recorded at regular or irregular
intervals of time.
2. Regression is interpolation.
3. Regression refers to both ordered and unordered series of data.
Q11. What is the difference between stationary and non-
stationary data?
Stationary: A series is said to be "STRICTLY STATIONARY" if the mean, variance and
covariance are constant over time (time-invariant).
Non-Stationary:
o Most models assume stationarity of the data. In other words, standard techniques are
invalid if the data is "NON-STATIONARY".
o Autocorrelation may result from "NON-STATIONARITY".
o Non-stationary processes include a random walk with or without a drift (a slow, steady
change).
o Deterministic trends (trends that are constant, positive or negative, independent of
time for the whole life of the series).
-------------------------------------------------------------------------------------------------------------------
# DAY 07
Q1. What is the process to make data stationary from non-
stationary in a time series?
Ans:
The two most common ways to make a non-stationary time series stationary are:
Differencing
Transforming
Differencing:
To make your series stationary, you take the difference between consecutive data points: for an
original series y(t), the first-order differenced series is y'(t) = y(t) - y(t-1).
Once you take the difference, plot the series and see if there is any improvement in the ACF curve.
If not, you can try a second- or even a third-order differencing. Remember, the more you difference,
the more complicated your analysis becomes.
Transforming:
If differencing alone cannot make the time series stationary, you can try transforming the variables.
The log transform is probably the most commonly used transformation if we see a diverging time series.
However, it is suggested that you use transformation only in case differencing is not working.
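A small pandas sketch of both operations on a made-up trending series:

```python
import numpy as np
import pandas as pd

# Hypothetical non-stationary series: upward trend plus noise
rng = np.random.default_rng(0)
ts = pd.Series(np.arange(100) * 0.5 + rng.normal(0, 1, 100))

diff1 = ts.diff().dropna()         # first-order differencing: y(t) - y(t-1)
diff2 = ts.diff().diff().dropna()  # second-order differencing, if needed

log_ts = np.log(ts + 1)            # log transform (shifted to keep values positive)
log_diff = log_ts.diff().dropna()  # often combined with differencing

print(diff1.head())
```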
Ans:
Stationary series: It is one in which the properties (mean, variance and covariance) do not vary
with time.
In the first plot, we can see that the mean varies (increases) with time, which results in an
upward trend. This is a non-stationary series.
For a series to be classified as stationary, it should not exhibit a trend.
Moving on to the second plot, we do not see a trend in the series, but the variance of the series
is a function of time. As mentioned previously, a stationary series must have a constant
variance.
If we look at the third plot, the spread becomes closer as the time increases, which implies that
the covariance is a function of time.
These three plots refer to non-stationary time series. Now give your attention to the fourth:
in this case, mean, variance and covariance are constant with time. This is how a stationary
time series looks.
Most statistical models require the series to be stationary to make effective and precise
predictions.
The various ways you can use to find out whether your data is stationary or not are the following:
1. Visual Test
2. Statistical Test
3. ADF (Augmented Dickey-Fuller) Test
4. KPSS (Kwiatkowski-Phillips-Schmidt-Shin) Test
For the visual test, plot the series: if it looks like there is a trend, add a linear regression line to the
graph to build up confidence; the trend then becomes clear.
Q5. What is the Augmented Dickey-Fuller Test?
Ans:
The Dickey-Fuller test: It is one of the most popular statistical tests. It is used to determine the
presence of unit root in a series, and hence help us to understand if the series is stationary or not.
The null and alternate hypothesis for this test is:
Null Hypothesis: The series has a unit root (value of a =1)
Alternate Hypothesis: The series has no unit root.
If we fail to reject the null hypothesis, we can say that the series is non-stationary. This means that
the series can be linear or difference stationary.
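A minimal statsmodels sketch of the ADF test on a made-up series:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
trend_series = np.cumsum(rng.normal(0.5, 1, 200))   # random walk with drift (non-stationary)

adf_stat, p_value, *_ = adfuller(trend_series)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")

# p-value > 0.05: fail to reject the null hypothesis of a unit root,
# i.e. the series is treated as non-stationary.
```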
BIC = k log(n) - 2 log(L(θ̂)),
where n is the sample size,
k is the number of parameters which your model estimates, and
L(θ̂) represents the likelihood of the model tested, when evaluated at the maximum likelihood values
of θ.
The quality of a descriptive model is determined by how well it describes all available data and by the
interpretation it provides to better inform the problem domain.
Q9. Give some examples of the Time-Series forecast?
Ans:
There is an almost endless supply of time series forecasting problems. Below are ten examples
from a range of industries to make the notions of time series analysis and forecasting more
concrete.
1. Forecasting the corn yield in tons by the state each year.
2. Forecasting whether an EEG trace in seconds indicates a patient is having a seizure or not.
3. Forecasting the closing price of stocks every day.
4. Forecasting the birth rates at all hospitals in the city every year.
5. Forecasting product sales in the units sold each day for the store.
6. Forecasting the number of passengers through the train station each day.
7. Forecasting unemployment for a state each quarter.
8. Forecasting the utilisation demand on the server every hour.
9. Forecasting the size of the rabbit populations in the state each breeding season.
10. Forecasting the average price of gasoline in a city each day.
Although simple, the moving average model might be surprisingly good, and it represents a good
starting point. The moving average can also be used to identify interesting trends in the data. We can
define a window over which to apply the moving average model to smooth the time series and highlight
different trends.
Example of a moving average on a 24h window
In the plot above, we applied the moving average model to a 24h window. The green
line smoothed the time series, and we can see that there are two peaks in the 24h period.
The longer the window, the smoother the trend will be.
In exponential smoothing, the smoothed value is ŷ(t) = α·y(t) + (1 - α)·ŷ(t-1). Here, alpha is the
smoothing factor, which takes values between 0 and 1. It determines how fast the weights decrease
for previous observations.
In the plot described above, the dark blue line represents the exponential smoothing of the time series
using a smoothing factor of 0.3, and the orange line uses a smoothing factor of 0.05. As we can see,
the smaller the smoothing factor, the smoother the time series will be, because as the smoothing factor
approaches 0, we approach the moving average model.
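A short pandas sketch of both smoothers on a made-up hourly series, using the 24h window and the two alpha values mentioned above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
hours = pd.date_range("2021-01-01", periods=24 * 14, freq="H")
ts = pd.Series(10 + np.sin(np.arange(len(hours)) * 2 * np.pi / 24)
               + rng.normal(0, 0.5, len(hours)), index=hours)

moving_avg = ts.rolling(window=24).mean()   # moving average over a 24h window
smooth_03  = ts.ewm(alpha=0.3).mean()       # exponential smoothing, alpha = 0.3
smooth_005 = ts.ewm(alpha=0.05).mean()      # smaller alpha -> smoother curve

print(moving_avg.tail(3))
print(smooth_005.tail(3))
```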
------------------------------------------------------------------------------------------------------------------------