Interview Questions ML
Algorithms/Theory
These algorithms questions will test your grasp of the theory behind machine learning.
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm
you’re using. This can lead to the model underfitting your data, making it hard for it to
have high predictive accuracy and for you to generalize your knowledge from the
training set to the test set.
Variance is error due to too much complexity in the learning algorithm you’re using. This
leads to the algorithm being highly sensitive to high degrees of variation in your training
data, which can lead your model to overfit the data. You’ll be carrying too much noise
from your training data for your model to be very useful for your test data.
The bias-variance decomposition essentially decomposes the learning error from any
algorithm into the squared bias, the variance and a bit of irreducible error due to noise in
the underlying dataset. Essentially, if you make the model more complex and add more
variables, you’ll lose bias but gain some variance. To get the optimally
reduced amount of error, you’ll have to trade off bias and variance. You don’t want either
high bias or high variance in your model.
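For reference, the decomposition described above is commonly written as follows for
squared-error loss. The symbols are assumptions introduced here for illustration: f is the
true function, \hat{f} is the fitted model, and \sigma^2 is the variance of the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```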
More reading: What is the difference between supervised and unsupervised machine
learning? (Quora)
More reading: How is the k-nearest neighbor algorithm different from k-means
clustering? (Quora)
The critical difference here is that KNN needs labeled points and is thus supervised
learning, while k-means doesn’t — and is thus unsupervised learning.
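The difference can be sketched in a few lines. This is a minimal illustration on made-up
1-D data, not a production implementation; the function names knn_predict and kmeans_1d
are hypothetical. Note that only the KNN function ever looks at the labels.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def kmeans_1d(points, centers, iterations=10):
    """Cluster unlabeled 1-D points by alternating assignment and mean update."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# Made-up data for illustration.
xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["a", "a", "a", "b", "b", "b"]  # used by KNN only

knn_label = knn_predict(xs, labels, query=4.9, k=3)
final_centers = kmeans_1d(xs, centers=[0.0, 6.0])
print(knn_label, final_centers)
```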
The ROC curve is a graphical representation of the contrast between true positive rates
and the false positive rate at various thresholds. It’s often used as a proxy for the trade-
off between the sensitivity of the model (true positives) vs the fall-out or the probability it
will trigger a false alarm (false positives).
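As a rough sketch of how the points on a ROC curve come about, the snippet below sweeps a
few thresholds over made-up scores and records the false positive rate and true positive
rate at each one. The helper name roc_points and the data are assumptions for illustration.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up labels (1 = positive) and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.0, 0.3, 0.5, 1.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point toward the top right (more true positives, but more
false alarms), which is the trade-off the curve visualizes.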
Recall is also known as the true positive rate: the number of positives your model correctly
identifies compared to the actual number of positives there are throughout the data. Precision is
also known as the positive predictive value, and it is a measure of the number of
correct positives your model claims compared to the total number of positives it
claims. It can be easier to think of recall and precision in the context of a case where
you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples. You’d
have perfect recall (there are actually 10 apples, and you predicted there would be 10)
but 66.7% precision because out of the 15 events you predicted, only 10 (the apples)
are correct.
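The apple example can be checked with a couple of lines. This is a small sketch; the
counts (10 true positives, 5 false positives, 0 false negatives) are read off the example
above, and the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 10 apples correctly predicted, 5 predictions that were wrong, no apples missed.
precision, recall = precision_recall(tp=10, fp=5, fn=0)
print(f"precision={precision:.3f} recall={recall:.3f}")
```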
Bayes’ Theorem gives you the posterior probability of an event given what is known as
prior knowledge.
Mathematically, it’s expressed as the true positive rate of a condition sample divided by
the sum of the false positive rate of the population and the true positive rate of a
condition. Say a flu test comes back positive for 60% of the people who actually have the
flu, but it also comes back positive for 50% of the people who don’t have the flu, and the
overall population only has a 5% chance of having the flu. Would you actually have a 60%
chance of having the flu after having a positive test?
Bayes’ Theorem says no. It says that you have a
(0.6 * 0.05) / ((0.6 * 0.05) + (0.5 * 0.95)) = 0.0594
or 5.94% chance of having the flu, where 0.6 * 0.05 is the true positive rate of a condition
sample and 0.5 * 0.95 is the false positive rate of the population.
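The arithmetic can be reproduced with a short sketch. It assumes the 60% figure is the
chance of a positive test given flu, the 50% figure is the chance of a positive test given
no flu, and 5% is the prior; the function and parameter names are hypothetical.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' Theorem."""
    true_pos = sensitivity * prior                    # P(positive and condition)
    false_pos = false_positive_rate * (1.0 - prior)   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)

p_flu_given_positive = posterior(prior=0.05, sensitivity=0.6, false_positive_rate=0.5)
print(f"{p_flu_given_positive:.4f}")
```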
Bayes’ Theorem is the basis behind a branch of machine learning that most notably
includes the Naive Bayes classifier. That’s something important to consider when you’re
faced with machine learning interview questions.
Despite its practical applications, especially in text mining, Naive Bayes is considered
“Naive” because it makes an assumption that is virtually impossible to see in real-life
data: the conditional probability is calculated as the pure product of the individual
probabilities of components. This implies the absolute independence of features — a
condition probably never met in real life.
As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that
you liked pickles and ice cream would probably naively recommend you a pickle ice
cream.
Q9- What’s your favorite algorithm, and can you explain it to me in less than a
minute?
This type of question tests your understanding of how to communicate complex and
technical nuances with poise and the ability to summarize quickly and efficiently. Make
sure you have a choice and make sure you can explain different algorithms so simply
and effectively that a five-year-old could grasp the basics!
Type I error is a false positive, while Type II error is a false negative. Briefly stated,
Type I error means claiming something has happened when it hasn’t, while Type II error
means that you claim nothing is happening when in fact something is.
A clever way to think about this is to think of Type I error as telling a man he is
pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a
baby.
More reading: What is the difference between “likelihood” and “probability”? (Cross
Validated)
Q13- What is deep learning, and how does it contrast with other machine learning
algorithms?
Deep learning is a subset of machine learning that is concerned with neural networks:
how to use backpropagation and certain principles from neuroscience to more
accurately model large sets of unlabelled or semi-structured data. In that sense, deep
learning learns representations of data through the use of neural nets, and it can be
applied in supervised as well as unsupervised settings.
Q15- What cross-validation technique would you use on a time series dataset?
Instead of using standard k-folds cross-validation, you have to pay attention to the fact
that a time series is not randomly distributed data — it is inherently ordered by
chronological order. If a pattern emerges in later time periods for example, your model
may still pick up on it even if that effect doesn’t hold in earlier years!
You’ll want to do something like forward chaining where you’ll be able to model on past
data then look at forward-facing data.
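Forward chaining can be sketched as an expanding training window followed by the next
block of observations as the test set. The helper below is a minimal illustration with a
hypothetical name; libraries such as scikit-learn offer a comparable splitter, but nothing
here depends on one.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Return (train_indices, test_indices) pairs where training data always precedes test data."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    splits = []
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        # The last fold absorbs any leftover observations.
        test_end = n_samples if i == n_folds else train_end + fold_size
        splits.append((list(range(train_end)), list(range(train_end, test_end))))
    return splits

splits = forward_chaining_splits(n_samples=10, n_folds=4)
for train, test in splits:
    print("train:", train, "test:", test)
```

Each fold trains only on observations that come before the ones it is tested on, so no
information from the future leaks into the model.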
Pruning is what happens in decision trees when branches that have weak predictive
power are removed in order to reduce the complexity of the model and increase the
predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-
down, with approaches such as reduced error pruning and cost complexity pruning.
Reduced error pruning is perhaps the simplest version: starting at the leaves, replace each
node with its most popular class. If doing so doesn’t decrease predictive accuracy, keep it
pruned. While simple, this heuristic actually
comes pretty close to an approach that would optimize for maximum accuracy.
This question tests your grasp of the nuances of machine learning model performance!
Machine learning interview questions often look towards the details. There are models
with higher accuracy that can perform worse in predictive power — how does that make
sense?
Well, it has everything to do with how model accuracy is only a subset of model
performance, and at that, a sometimes misleading one. For example, if you wanted to
detect fraud in a massive dataset with a sample of millions, a more accurate model
would most likely predict no fraud at all if only a tiny minority of cases were fraud.
However, this would be useless for a predictive model — a model designed to find fraud
that asserted there was no fraud at all! Questions like this help you demonstrate that
you understand model accuracy isn’t the be-all and end-all of model performance.
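A small sketch with made-up numbers shows the effect: a model that never flags fraud
scores higher on accuracy than one that catches most of it. The data and helper names are
assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    actual_pos = sum(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / actual_pos if actual_pos else 0.0

# 1,000 made-up transactions, 10 of which are fraud (label 1).
y_true = [1] * 10 + [0] * 990
never_flags = [0] * 1000                                # predicts "no fraud" every time
flags_some = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960   # catches 8 frauds, 30 false alarms

print(accuracy(y_true, never_flags), recall(y_true, never_flags))
print(accuracy(y_true, flags_some), recall(y_true, flags_some))
```

The first model wins on accuracy while finding no fraud at all, which is why recall (or a
similar class-wise metric) matters here.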
An imbalanced dataset is when you have, for example, a classification test and 90% of
the data is in one class. That leads to problems: an accuracy of 90% can be skewed if
you have no predictive power on the other category of data! Here are a few tactics to
get over the hump:
What’s important here is that you have a keen sense for what damage an unbalanced
dataset can cause, and how to balance that.
Classification produces discrete values and maps the dataset to strict categories, while regression
gives you continuous results that allow you to better distinguish differences between
individual points. You would use classification over regression if you wanted your results
to reflect the belongingness of data points in your dataset to certain explicit categories
(ex: if you wanted to know whether a name was male or female rather than just how
correlated it was with male and female names).
You could list some examples of ensemble methods, from bagging to boosting to a
“bucket of models” method and demonstrate how they could increase predictive power.
1- Keep the model simpler: reduce variance by taking into account fewer variables and
parameters, thereby removing some of the noise in the training data.
Q23- What evaluation approaches would you use to gauge the effectiveness of a
machine learning model?
You would first split the dataset into training and test sets, or perhaps use
cross-validation techniques to further segment the dataset into composite sets of training and
test sets within the data. You should then choose a selection of performance
metrics. You could use measures such as the F1
score, the accuracy, and the confusion matrix. What’s important here is to demonstrate
that you understand the nuances of how a model is measured and how to choose the
right performance measures for the right situations.
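The cross-validation idea mentioned above can be sketched as an index splitter. This is a
minimal illustration with a hypothetical function name; in practice the indices are usually
shuffled first (except for ordered data), which is omitted here to keep the output easy to
follow.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal, non-overlapping folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```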
Q24- How would you evaluate a logistic regression model?
The Kernel trick involves kernel functions that can enable operating in higher-dimension spaces
without explicitly calculating the coordinates of points within that dimension: instead,
kernel functions compute the inner products between the images of all pairs of data in a
feature space. This gives them the very useful attribute of working in higher dimensions
while being computationally cheaper than the explicit calculation of
said coordinates. Many algorithms can be expressed in terms of inner products. Using
the kernel trick enables us to effectively run algorithms in a high-dimensional space with
lower-dimensional data.
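A quick way to see this is with a degree-2 polynomial kernel on 2-D points, chosen here
purely as an illustration. The kernel value computed in the original space matches the
inner product of an explicit 3-D feature map, without ever building that map.

```python
import math

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, z)                        # computed in 2-D
via_features = dot(feature_map(x), feature_map(z))    # computed in 3-D
print(via_kernel, via_features)
```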
You could find missing/corrupted data in a dataset and either drop those rows or
columns, or decide to replace them with another value.
In Pandas, there are two very useful methods: isnull() and dropna() that will help you
find columns of data with missing or corrupted data and drop those values. If you want
to fill the invalid values with a placeholder value (for example, 0), you could use the
fillna() method.
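A minimal sketch of those three methods on a made-up DataFrame (the column names and
values are assumptions for illustration):

```python
import pandas as pd

# Made-up data with a missing value in two of the three columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [25.0, None, 40.0],
    "income": [50000.0, 62000.0, None],
})

missing_per_column = df.isnull().sum()   # count of missing values in each column
dropped_rows = df.dropna()               # drop rows containing any missing value
dropped_cols = df.dropna(axis=1)         # drop columns containing any missing value
filled = df.fillna(0)                    # replace missing values with a placeholder

print(missing_per_column)
print(dropped_rows)
print(filled)
```

Whether to drop or fill depends on how much data you can afford to lose and on whether the
placeholder would distort the feature.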
Q27- Do you have experience with Spark or big data tools for machine learning?
More reading: 50 Top Open Source Tools for Big Data (Datamation)
You’ll want to get familiar with the meaning of big data for different companies and the
different tools they’ll want. Spark is the big data tool most in demand now, able to
handle immense datasets with speed. Be honest if you don’t have experience with the
tools demanded, but also take a look at job descriptions and see what tools pop up:
you’ll want to invest in familiarizing yourself with them.
This kind of question demonstrates your ability to think in parallelism and how you could
handle concurrency in programming implementations dealing with big data. Take a look
at pseudocode frameworks such as Peril-L and visualization tools such as Web
Sequence Diagrams to help you demonstrate your ability to write code that reflects
parallelism.
Q29- What are some differences between a linked list and an array?
A hash table is a data structure that implements an associative array. A key is mapped to
certain values through the use of a hash function. They are often used for tasks such as
database indexing.
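To make the idea concrete, here is a toy hash table that resolves collisions by chaining.
It is a sketch for illustration only (Python’s built-in dict is what you would use in
practice); the class and method names are hypothetical.

```python
class HashTable:
    """Toy associative array: keys are hashed into buckets, collisions are chained."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function decides which bucket a key lives in.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("alice", 1)
table.put("bob", 2)
table.put("alice", 3)  # updates the earlier entry
print(table.get("alice"), table.get("bob"), table.get("carol"))
```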
Q31- Which data visualization libraries do you use? What are your thoughts on
the best data visualization tools?
What’s important here is to define your views on how to properly visualize data and your
personal preferences when it comes to tools. Popular tools include R’s ggplot, Python’s
seaborn and matplotlib, and tools such as Plot.ly and Tableau.
Q32- How would you implement a recommendation system for our company’s
users?
A lot of machine learning interview questions of this type will involve implementation of
machine learning models to a company’s problems. You’ll have to research the
company and its industry in-depth, especially the revenue drivers the company has, and
the types of users the company takes on in the context of the industry it’s in.
Q33- How can we use your machine learning skills to generate revenue?
More reading: Startup Metrics for Startups (500 Startups)
This is a tricky question. The ideal answer would demonstrate knowledge of what drives
the business and how your skills could relate. For example, if you were interviewing for
music-streaming startup Spotify, you could remark that your skills at developing a better
recommendation model would increase user retention, which would then increase
revenue in the long run.
The startup metrics Slideshare linked above will help you understand exactly what
performance indicators are important for startups and tech companies as they think
about revenue and growth.
Q35- What are the last machine learning papers you’ve read?
More reading: What are some of the best research papers/books for machine learning?
Keeping up with the latest scientific literature on machine learning is a must if you want
to demonstrate interest in a machine learning position. This overview of deep learning in
Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can
be a good reference paper and an overview of what’s happening in deep learning —
and the kind of paper you might want to cite.
Related to the last point, most organizations hiring for machine learning positions will
look for your formal experience in the field. Research papers, co-authored or supervised
by leaders in the field, can make the difference between you being hired and not. Make
sure you have a summary of your research experience and papers ready — and an
explanation for your background and lack of formal research experience if you don’t.
Q37- What are your favorite use cases of machine learning models?
More reading: What are the typical use cases for different machine learning algorithms?
(Quora)
The Quora thread above contains some examples, such as decision trees that
categorize people into different tiers of intelligence based on IQ scores. Make sure that
you have a few examples in mind and describe what resonated with you. It’s important
that you demonstrate an interest in how machine learning is implemented.
The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a better
collaborative filtering algorithm. The winning team, BellKor, achieved a 10%
improvement and used an ensemble of different methods to win. Some familiarity with
the case and its solution will help demonstrate you’ve paid attention to machine learning
for a while.
More reading: 19 Free Public Data Sets For Your First Data Science Project
(Springboard)
Machine learning interview questions like these try to get at the heart of your machine
learning interest. Somebody who is truly passionate about machine learning will have
gone off and done side projects on their own, and have a good idea of what great
datasets are out there. If you’re missing any, check out Quandl for economic and
financial data, and Kaggle’s Datasets collection for another great list.
Q40- How do you think Google is training data for self-driving cars?
Machine learning interview questions like this one really test your knowledge of different
machine learning methods, and your inventiveness if you don’t know the answer.
Google is currently using reCAPTCHA to source labeled data on storefronts and traffic
signs. They are also building on training data collected by Sebastian Thrun at GoogleX,
some of which was obtained by his grad students driving buggies on desert dunes!
Q41- How would you simulate the approach AlphaGo took to beat Lee Sedol at
Go?
More reading: Mastering the game of Go with deep neural networks and tree search
(Nature)
AlphaGo beating Lee Sedol, one of the best human players at Go, in a best-of-five series was a
truly seminal event in the history of machine learning and deep learning. The Nature
paper above describes how this was accomplished with “Monte-Carlo tree search with
deep neural networks that have been trained by supervised learning, from human
expert games, and by reinforcement learning from games of self-play.”
What is the difference between supervised and unsupervised machine learning?
Supervised learning requires training using labelled data. For example, in order to
do classification, which is a supervised learning task, you’ll first need to label the
data you’ll use to train the model to classify data into your labelled groups.
Unsupervised learning, in contrast, does not require labeling data explicitly.
Bias is error due to overly simplistic assumptions in the learning algorithm that you
are using, which can lead to the model underfitting your data and make it hard for the
model to give accurate predictions.
Variance, on the other hand, is error due to way too much complexity in your
learning algorithm. Due to this complexity, the algorithm is highly sensitive to high
degrees of variation, which can lead your model to overfit the data. In addition,
you will be carrying too much noise from your training data for your model to be
useful.
Bayes’ Theorem gives you the posterior probability of an event given what is
known as prior knowledge. It is also the basis behind the Naive Bayes classifier. That’s
something important to know when you’re faced with machine learning interview
questions and answers.
First, regularization is a technique that helps to address the overfitting problem in
Machine Learning. L2 regularization tends to spread error among all the terms,
while L1 is more binary (sparse), with many variables having their weights driven to
exactly 0. L1 corresponds to setting a Laplacian prior on the terms, while L2
corresponds to a Gaussian prior. I would say as a rule of thumb, L2 is usually the
safer default in practice.
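The difference in behavior can be seen in a one-weight sketch. Assuming a simplified
objective of 0.5*(w - z)^2 plus the penalty, where z is the unregularized weight (an
assumption made here for illustration), the L1 solution is a soft threshold that snaps
small weights to exactly zero, while the L2 solution only shrinks them.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

# Made-up unregularized weights.
unregularized = [3.0, 0.4, -0.2, -2.0]
l1_weights = [l1_shrink(z, 0.5) for z in unregularized]
l2_weights = [l2_shrink(z, 0.5) for z in unregularized]
print("L1:", l1_weights)
print("L2:", l2_weights)
```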
What is the difference between Type I and Type II error?
Don’t think of this as high-level stuff; interviewers ask questions in such
terms just to check that you have all the bases covered and are on top of things.
Type I error is a false positive, while Type II is a false negative. Type I error is claiming
something has happened when it hasn’t. For instance, telling a man he is pregnant.
On the other hand, Type II error means you claim nothing has happened when in fact
something has. To exemplify, you tell a pregnant lady she isn’t carrying a baby.
Most people don’t know this, but Machine Learning and Deep Learning are not two
different things: Deep Learning is a subset of Machine Learning. It mostly deals
with neural networks: how to use backpropagation and certain other principles
from neuroscience to more accurately model large sets of unlabeled data. In a
nutshell, Deep Learning learns data representations mainly through neural networks,
in supervised as well as unsupervised settings. Explore a little about neural
nets to answer deep learning interview questions effectively.
Observation 1
Observation 2
Observation 3
But a time series dataset is different. A time series adds an explicit order
dependence between observations: a time dimension. This additional dimension is
both a constraint and a structure that provides a source of additional information.
Time 1, Observation
Time 2, Observation
Time 3, Observation
Imbalanced data means that, for example, you have 90% of the data in one class and 10%
in the other. This leads to problems such as having no predictive power on the other
category of data. Here are a few techniques to get over it:
Pruning means removing branches that have weak predictive power in order to
reduce the complexity of the model and, in addition, increase the predictive
accuracy of a decision tree model. There are several flavors, including
bottom-up and top-down pruning, with approaches such as reduced error pruning
and cost complexity pruning.
In your opinion, which one is more important: model accuracy or model performance?
Questions like these test your grasp of Machine Learning model performance
and often look towards details. There are models with higher accuracy that can
perform worse in predictive power. How does that make sense?
Well, it has everything to do with how model accuracy is only a subset of model
performance, and at that, a sometimes misleading one. For example, if you wanted
to detect fraud in a massive dataset with a sample of millions, a more accurate
model would most likely predict no fraud at all if only a tiny minority of cases were
fraud. However, this would be useless for a predictive model: a model designed
to find fraud that asserted there was no fraud at all! Questions like this help you
demonstrate that you understand model accuracy isn’t the be-all and end-all of
model performance.
The convex hull represents the outer boundary of each of the two groups of data points. Once
the convex hulls are created, we get the maximum margin hyperplane (MMH), which
attempts to create the greatest separation between the two groups, as the perpendicular
bisector of the shortest line between the two convex hulls.
General Questions
Machine Learning is an emerging field and no one wants novice players on their teams.
Most employers hiring for Machine Learning positions will look for your experience
in the field. Research papers, co-authored or supervised by leaders in the field,
can set you apart from the herd. Make sure you are ready with a summary
and justification of the work you have done in the past years.
What are the last Machine Learning papers you read? Why do you think they were important?
As this field is evolving day by day, it is crucial to keep up with the latest scientific
literature to show that you are really into Machine Learning and not here just
because it is the latest buzzword. A good book to start with is Deep
Learning by Ian Goodfellow.
The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a
better collaborative filtering algorithm. The winning team, BellKor, achieved a
10% improvement and used an ensemble of different methods to win. Some
familiarity with the case and its solution will help demonstrate you’ve paid attention
to machine learning for a while.
What’s your favorite algorithm, and can you explain it to me in less than a minute?
This type of question mainly tests your ability to communicate complex and
technical nuances with poise and to summarize quickly and efficiently.
Make sure you have a choice of algorithm that you can explain easily. Try to
explain different algorithms so simply and effectively that a five-year-old could
grasp the basics.
Questions of this type are the real tie-breakers. If you are going for an
interview, you must know the drill for related questions. It is questions like
this that purely illustrate your interest in Machine Learning. See my post for a
detailed answer on where to find machine learning datasets.
Questions like this check your understanding of current affairs in the industry and
how things work at a certain level. Google is currently using reCAPTCHA to source
labelled data on storefronts and traffic signs. They are also building on training
data collected by Sebastian Thrun at GoogleX.
Industry Specific Questions
How would you implement a recommendation system for our company’s users?
There will be a lot of questions like this which involve applying
machine learning models to the company’s problems. You should definitely study the
company’s profile and its products before going in. In addition, factors such as
the financials of the company, the industry in which the company operates, and who its
users are will help you get a clearer picture.
This is a tricky question. The ideal answer would demonstrate
knowledge of what drives the business and how your skills could relate. To
exemplify, if you were interviewing for Spotify, you could remark that your skills
at developing a better recommendation model would increase user
retention, which would then increase revenue in the long run.
Practical/Programming Questions
One can find missing data in a dataset and either drop those rows or columns, or
decide to replace them with another value. In the Python library Pandas, there are two
useful methods that will be helpful: isnull() and dropna().
A hash table is a data structure that implements an associative array. A key is mapped
to certain values through the use of a hash function. They are often used for tasks
such as database indexing.
Which data visualization libraries do you use, and why are they useful?
What’s important here is to define your views on how to properly visualize data
and your personal preferences when it comes to tools. Popular tools include R’s
ggplot, Python’s seaborn and matplotlib, and tools such as Plot.ly and Tableau.
Do you have experience with Spark or big data tools for machine learning?
Spark is the big data tool most in demand now, able to handle immense data-sets
with speed. Be honest if you don’t have experience with the tools demanded, but
also take a look at job descriptions and see what tools pop up: you’ll want to invest
in familiarizing yourself with them.
Considering the long list of machine learning algorithms, given a data set, how do you decide which
one to use?
This is one more tricky question. Your answer should depend on the type of data you
are given, such as discrete, time series, or continuous.
Explain machine learning to me like a 5-year-old. (To check the ability of explaining complex
concepts in simple terms)
There are a lot of definitions of Machine Learning out there. Which one do you
think is the easiest and at the same time covers the entire meaning of Machine
Learning? Tell us in the comments and let other readers gain insight into how you
think.
Q1. You are given a train data set having 1000 columns and 1 million rows. The data
set is based on a classification problem. Your manager has asked you to reduce the
dimension of this data so that model computation time can be reduced. Your machine
has memory constraints. What would you do? (You are free to make practical
assumptions.)
1. Since we have lower RAM, we should close all other applications in our machine,
including the web browser, so that most of the memory can be put to use.
2. We can randomly sample the data set. This means, we can create a smaller data set,
let’s say, having 1000 variables and 300000 rows and do the computations.
3. To reduce dimensionality, we can separate the numerical and categorical variables
and remove the correlated variables. For numerical variables, we’ll use correlation.
For categorical variables, we’ll use chi-square test.
4. Also, we can use PCA and pick the components which can explain the maximum
variance in the data set.
5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible
option.
6. Building a linear model using Stochastic Gradient Descent is also helpful.
7. We can also apply our business understanding to estimate which predictors can
impact the response variable. But this is an intuitive approach, and failing to identify useful
predictors might result in significant loss of information.
Note: For points 5 & 6, make sure you read about online learning algorithms & Stochastic
Gradient Descent. These are advanced methods.
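As a rough sketch of point 3 for numerical variables, the snippet below keeps a column only
if it is not highly correlated with a column already kept. The 0.9 threshold, the helper
names, and the tiny data set are assumptions for illustration; on the real data you would
run this on a random sample of rows (point 2) to stay within memory.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)

# Made-up columns: height in inches is redundant given height in centimeters.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40.0, 22.0, 35.0, 58.0, 19.0],
}
kept = drop_correlated(columns)
print(kept)
```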
Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the
components?
If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select
a larger number of components to explain variance in the data set.
Answer: This question has enough hints for you to start thinking! Since the data is spread
across the median, let’s assume it’s a normal distribution. We know that in a normal distribution,
~68% of the data lies within 1 standard deviation of the mean (or mode, median), which leaves
~32% of the data outside that range. Therefore, ~32% of the data would remain unaffected by
missing values.
Q4. You are given a data set on cancer detection. You’ve built a classification model
and achieved an accuracy of 96%. Why shouldn’t you be happy with your model
performance? What can you do about it?
Answer: If you have worked on enough data sets, you should deduce that cancer detection
results in imbalanced data. In an imbalanced data set, accuracy should not be used as a
measure of performance because 96% (as given) might only be predicting the majority class
correctly, but our class of interest is the minority class (4%), which is the people who actually got
diagnosed with cancer. Hence, in order to evaluate model performance, we should use
Sensitivity (True Positive Rate), Specificity (True Negative Rate), and F measure to determine
class-wise performance of the classifier. If the minority class performance is found to be
poor, we can undertake the following steps:
Answer: Naive Bayes is so ‘naive’ because it assumes that all of the features in a data set
are equally important and independent. As we know, these assumptions are rarely true in
real-world scenarios.
Q7. You are working on a time series data set. Your manager has asked you to build a
high accuracy model. You start with the decision tree algorithm, since you know it
works fairly well on all kinds of data. Later, you tried a time series regression model
and got higher accuracy than the decision tree model. Can this happen? Why?
Answer: Time series data often possesses linearity. On the other hand, a decision tree
algorithm is known to work best at detecting non-linear interactions. The decision
tree failed to provide robust predictions because it couldn’t map the linear relationship as
well as a regression model did. Therefore, we learned that a linear regression model can
provide robust predictions given that the data set satisfies its linearity assumptions.
Q8. You are assigned a new project which involves helping a food delivery company
save more money. The problem is, the company’s delivery team isn’t able to deliver food
on time. As a result, their customers get unhappy. And, to keep them happy, they end
up delivering food for free. Which machine learning algorithm can save them?
Answer: You might have started hopping through the list of ML algorithms in your mind. But
wait! Such questions are asked to test your machine learning fundamentals.
This is not a machine learning problem. This is a route optimization problem. A machine
learning problem consists of three things:
Always look for these three factors to decide if machine learning is a tool to solve a particular
problem.
Q9. You came to know that your model is suffering from low bias and high variance.
Which algorithm should you use to tackle it? Why?
Answer: Low bias occurs when the model’s predicted values are near to actual values. In
other words, the model becomes flexible enough to mimic the training data distribution. While
that sounds like a great achievement, don’t forget that an overly flexible model generalizes
poorly. It means that when this model is tested on unseen data, it gives disappointing
results.
In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance
problem. Bagging algorithms divide a data set into subsets made with repeated randomized
sampling. Then, these samples are used to generate a set of models using a single learning
algorithm. Later, the model predictions are combined using voting (classification) or averaging
(regression).
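The bagging procedure described above can be sketched as follows. This is a minimal
illustration, not random forest itself: the base learner is a hypothetical one-feature
decision stump, the data is made up, and the function names are assumptions.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold that minimizes training errors for the rule: predict 1 if x > threshold."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(xs, ys, n_models, rng):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the stumps by majority vote (classification)."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys, n_models=25, rng=rng)
low, high = bagging_predict(models, 0.05), bagging_predict(models, 0.95)
print(low, high)
```

For regression, the vote in bagging_predict would be replaced by an average of the
individual predictions.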
Q10. You are given a data set. The data set contains many variables, some of which
are highly correlated and you know about it. Your manager has asked you to run PCA.
Would you remove correlated variables first? Why?
Answer: Chances are, you might be tempted to say No, but that would be incorrect.
Discarding correlated variables has a substantial effect on PCA because, in the presence of
correlated variables, the variance explained by a particular component gets inflated.
For example: You have 3 variables in a data set, of which 2 are correlated. If you run PCA on
this data set, the first principal component would exhibit up to twice the variance that it
would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more
importance on those variables, which is misleading.
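The three-variable example can be checked with a short worked calculation. As an assumption for illustration, take three standardized variables where the first two have correlation \rho \ge 0 and the third is uncorrelated with both. PCA on the correlation matrix then gives:

```latex
R = \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
\det(R - \lambda I) = (1 - \lambda)\left[(1 - \lambda)^2 - \rho^2\right] = 0
\;\Longrightarrow\;
\lambda \in \{\, 1 + \rho,\; 1,\; 1 - \rho \,\}
```

With \rho = 0 every component has variance 1. As \rho approaches 1 the largest eigenvalue approaches 2, so the first component carries about twice the variance it would have with uncorrelated variables.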
Q11. After spending several hours, you are now anxious to build a high accuracy
model. As a result, you build 5 GBM models, thinking a boosting algorithm would do
the magic. Unfortunately, none of the models performs better than the benchmark
score. Finally, you decide to combine those models. Although ensembled models are
known to return high accuracy, you are out of luck. What did you miss?
Answer: As we know, ensemble learners are based on the idea of combining weak learners
to create strong learners. But these learners provide superior results only when the combined
models are uncorrelated. The fact that we used 5 GBM models and got no accuracy
improvement suggests that the models are correlated. The problem with correlated models
is that all the models provide the same information.
For example: If model 1 has classified User1122 as 1, there are high chances model 2 and
model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners
are built on the premise of combining weak uncorrelated models to obtain better predictions.
Answer: Don’t get misled by the ‘k’ in their names. The fundamental
difference between these algorithms is that kmeans is unsupervised in nature and kNN is
supervised in nature. kmeans is a clustering algorithm. kNN is a classification (or regression)
algorithm.
The kmeans algorithm partitions a data set into clusters such that each cluster is
homogeneous and the points in each cluster are close to each other. The algorithm tries to
maintain enough separability between these clusters. Due to its unsupervised nature, the
clusters have no labels.
The kNN algorithm tries to classify an unlabeled observation based on its k (can be any number)
surrounding neighbors. It is also known as a lazy learner because it involves minimal training
of the model: it simply stores the training data and defers generalization until an unseen
observation has to be classified.
Q13. How are True Positive Rate and Recall related? Write the equation.
Answer: True Positive Rate = Recall. Yes, they are equal, with the formula TP / (TP + FN).
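As a small sketch of the formula, the snippet below counts the confusion-matrix cells for binary labels coded as 0/1 and computes recall from them. The function names and the example labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def recall(y_true, y_pred):
    """Recall (true positive rate) = TP / (TP + FN)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

# Hypothetical labels: three actual positives, two of them found.
actual = [1, 1, 1, 0, 0]
predicted = [1, 0, 1, 1, 0]
print(recall(actual, predicted))
```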
Q14. You have built a multiple regression model. Your model R² isn’t as good as you
wanted. For improvement, you remove the intercept term, and your model R² becomes 0.8
from 0.3. Is it possible? How?
Q15. After analyzing the model, your manager has informed you that your regression
model is suffering from multicollinearity. How would you check if he’s right? Without
losing any information, can you still build a better model?
Answer: To check multicollinearity, we can create a correlation matrix to identify & remove
variables having correlation above 75% (deciding a threshold is subjective). In addition, we
can calculate VIF (variance inflation factor) to check the presence of multicollinearity. A VIF
value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious
multicollinearity. Also, we can use tolerance (the reciprocal of VIF) as an indicator of
multicollinearity.
But removing correlated variables might lead to loss of information. In order to retain those
variables, we can use penalized regression models like ridge or lasso regression. Also, we
can add some random noise to the correlated variables so that the variables become different
from each other. But adding noise might affect the prediction accuracy, hence this approach
should be used carefully.
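To make the VIF check concrete, here is a minimal sketch for the special case of exactly two predictors, where VIF reduces to 1 / (1 - r²) with r the Pearson correlation between them. With more predictors, r² is replaced by the R² from regressing one predictor on all the others, which this sketch does not cover. The function names and data are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has only two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: nearly collinear vs. unrelated.
print(vif_two_predictors([1, 2, 3, 4, 5], [1, 2, 3, 4, 6]))
print(vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1]))
```

The first pair lands far above the threshold of 10 quoted in the answer, while the second pair has a VIF of 1, the lowest possible value.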
Answer: You can quote ISLR’s authors Hastie and Tibshirani, who asserted that, in the presence
of a few variables with medium / large sized effects, use lasso regression. In the presence of
many variables with small / medium sized effects, use ridge regression.
Conceptually, we can say that lasso regression (L1) does both variable selection and parameter
shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including
all the coefficients in the model. In the presence of correlated variables, ridge regression might
be the preferred choice. Also, ridge regression works best in situations where the least squares
estimates have higher variance. Therefore, it depends on our model objective.
Q17. A rise in global average temperature coincided with a decrease in the number of pirates
around the world. Does that mean that the decrease in the number of pirates caused the climate
change?
Answer: After reading this question, you should have understood that this is a classic case
of “causation and correlation”. No, we can’t conclude that the decrease in the number of pirates
caused the climate change because there might be other factors (lurking or confounding
variables) influencing this phenomenon.
Therefore, there might be a correlation between global average temperature and number of
pirates, but based on this information we can’t say that pirates died out because of the rise in
global average temperature.
Q18. While working on a data set, how do you select important variables? Explain
your methods.
Answer: Following are the methods of variable selection you can use:
Covariances are difficult to compare. For example: if we calculate the covariances of salary
($) and age (years), we’ll get different covariances which can’t be compared because the
variables have unequal scales. To combat this, we calculate the correlation to get a value
between -1 and 1, irrespective of their respective scales.
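The scale dependence can be seen in a short sketch: changing the unit of salary from dollars to thousands of dollars rescales the covariance by 1000 but leaves the correlation unchanged. The salary and age figures below are made up for illustration.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: the same salaries in dollars and in thousands of dollars.
salary_usd = [30000, 45000, 60000, 80000]
salary_k = [s / 1000 for s in salary_usd]
age = [25, 32, 41, 50]

print(covariance(salary_usd, age), covariance(salary_k, age))
print(correlation(salary_usd, age), correlation(salary_k, age))
```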
Answer: Yes, we can use ANCOVA (analysis of covariance) technique to capture association
between continuous and categorical variables.
Q21. Both being tree based algorithms, how is random forest different from the gradient
boosting algorithm (GBM)?
Answer: The fundamental difference is that random forest uses the bagging technique to make
predictions, while GBM uses boosting techniques.
In the bagging technique, n samples are drawn from the data set using randomized sampling.
Then, using a single learning algorithm, a model is built on each sample. Later, the resultant
predictions are combined using voting or averaging. Bagging can be done in parallel. In boosting,
after the first round of predictions, the algorithm weighs misclassified predictions higher, such
that they can be corrected in the succeeding round. This sequential process of giving higher
weights to misclassified predictions continues until a stopping criterion is reached.
Random forest improves model accuracy mainly by reducing variance. The trees grown are
decorrelated to maximize the decrease in variance. On the other hand, GBM improves
accuracy by reducing both bias and variance in a model.
Q22. Running a binary classification tree algorithm is the easy part. Do you know how
tree splitting takes place, i.e. how does the tree decide which variable to split at
the root node and succeeding nodes?
Answer: A classification tree makes decisions based on the Gini index and node entropy. In
simple words, the tree algorithm finds the best possible feature which can divide the data
set into the purest possible child nodes.
The Gini index says that if we select two items from a population at random, then they must be
of the same class, and the probability of this is 1 if the population is pure. We can calculate
Gini as follows:
1. Calculate Gini for sub-nodes, using the formula sum of squares of the probabilities of
success and failure (p^2+q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Here p and q are the probabilities of success and failure respectively in that node. With this
definition, a higher Gini score means a purer node. Entropy is zero
when a node is homogeneous. It is maximum when both classes are present in a node
at 50% – 50%. Lower entropy is desirable.
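The two steps above, plus the entropy measure, can be sketched as follows. The code uses the p^2 + q^2 score exactly as the text defines it (1.0 for a pure node), which is the complement of the Gini impurity 1 - (p^2 + q^2) that many libraries report. Function names and the example split are assumptions for illustration.

```python
from math import log2

def gini_score(p):
    """Gini score of a binary node as defined above: p^2 + q^2 (1.0 is pure)."""
    q = 1.0 - p
    return p * p + q * q

def entropy(p):
    """Binary entropy in bits: 0 for a pure node, 1 for a 50/50 node."""
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -(p * log2(p) + q * log2(q))

def weighted_gini(children):
    """Gini for a split; children is a list of (n_items, p_success) pairs."""
    total = sum(n for n, _ in children)
    return sum(n / total * gini_score(p) for n, p in children)

# Hypothetical split: 10 items with 20% successes, 20 items with 50% successes.
print(weighted_gini([(10, 0.2), (20, 0.5)]))
print(entropy(0.5), entropy(1.0))
```

A tree builder would evaluate this weighted score for every candidate split and keep the one with the highest Gini score (or, equivalently, the lowest impurity or entropy).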
Q23. You’ve built a random forest model with 10000 trees. You were delighted after
getting a training error of 0.00. But the validation error is 34.23. What is going on?
Haven’t you trained your model perfectly?
Answer: The model has overfitted. A training error of 0.00 means the classifier has mimicked the
training data patterns to such an extent that they are not available in the unseen data. Hence,
when this classifier was run on an unseen sample, it couldn’t find those patterns and returned
predictions with higher error. In random forest, this typically happens when the individual trees
are grown deeper than necessary; adding more trees mainly adds computation. Hence, to avoid
this situation, we should tune hyperparameters such as tree depth and the number of trees using
cross validation.
Q24. You’ve got a data set to work with having p (no. of variables) > n (no. of observations).
Why is OLS a bad option to work with? Which techniques would be best to use? Why?
Answer: In such high dimensional data sets, we can’t use classical regression techniques,
since their assumptions tend to fail. When p > n, we can no longer calculate a unique least
squares coefficient estimate, the variances become infinite, so OLS cannot be used at all.
To combat this situation, we can use penalized regression methods like lasso, LARS and ridge,
which can shrink the coefficients to reduce variance. In particular, ridge regression works best
in situations where the least squares estimates have higher variance.
Answer: In the case of linearly separable data, the convex hulls represent the outer boundaries
of the two groups of data points. Once the convex hulls are created, we get the maximum margin
hyperplane (MMH) as the perpendicular bisector of the shortest line between the two convex
hulls. MMH is the line which attempts to create the greatest separation between the two groups.
Q26. We know that one hot encoding increases the dimensionality of a data set. But
label encoding doesn’t. How?
Answer: Don’t get baffled by this question. It’s a simple question asking the difference
between the two.
Using one hot encoding, the dimensionality (i.e. the number of features) of a data set increases
because it creates a new variable for each level present in a categorical variable. For example:
let’s say we have a variable ‘color’. The variable has 3 levels, namely Red, Blue and Green.
One hot encoding the ‘color’ variable will generate three new variables,
Color.Red, Color.Blue and Color.Green, containing 0 and 1 values.
In label encoding, the levels of a categorical variable get encoded as integers (0, 1, 2, ...)
within the same column, so no new variable is created. Label encoding is mostly used for binary
variables.
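The ‘color’ example can be reproduced with a small sketch in plain Python. The helper names are assumptions for illustration; the alphabetical ordering of the integer codes is also just a choice made here, not something the answer prescribes.

```python
def label_encode(values):
    """Map each level to an integer code; the result is still one column."""
    levels = sorted(set(values))
    codes = {level: i for i, level in enumerate(levels)}
    return [codes[v] for v in values], codes

def one_hot_encode(values, prefix):
    """Create one 0/1 column per level, named '<prefix>.<level>'."""
    levels = sorted(set(values))
    return {f"{prefix}.{level}": [1 if v == level else 0 for v in values]
            for level in levels}

colors = ["Red", "Blue", "Green", "Red"]
encoded, mapping = label_encode(colors)
columns = one_hot_encode(colors, "Color")
print(encoded, mapping)   # one column of integer codes
print(list(columns))      # three new column names
```

One column goes in and one column comes out of `label_encode`, while `one_hot_encode` returns as many columns as there are levels, which is the dimensionality increase the question asks about.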
Q27. What cross validation technique would you use on a time series data set? Is it
k-fold or LOOCV?
Answer: Neither.
In a time series problem, k-fold can be troublesome because there might be some pattern in
year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we
might end up validating on past years, which is incorrect. Instead, we can use a forward
chaining strategy with 5 folds, where each fold trains on all earlier periods and tests on the
next one.
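A forward chaining scheme can be sketched as a generator of (training periods, test period) pairs. The example assumes six yearly periods numbered 1 to 6, which yields five folds; the function name is hypothetical.

```python
def forward_chaining_splits(n_periods):
    """Yield (train_periods, test_period) pairs in time order."""
    for test_period in range(2, n_periods + 1):
        yield list(range(1, test_period)), test_period

# Assumed setup: 6 yearly periods, giving 5 folds.
for train, test in forward_chaining_splits(6):
    print("train:", train, "test:", test)
```

Every fold tests on a period that comes strictly after all of its training periods, so the model is never validated on the past.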
Q28. You are given a data set consisting of variables having more than 30% missing
values? Let’s say, out of 50 variables, 8 variables have missing values higher than
30%. How will you deal with them?
Q29. ‘People who bought this, also bought…’ recommendations seen on Amazon are a
result of which algorithm?
Answer: The basic idea for this kind of recommendation engine comes from collaborative
filtering.
Collaborative filtering algorithms consider “User Behavior” for recommending items. They
exploit the behavior of other users and items in terms of transaction history, ratings, selection
and purchase information. Other users’ behaviour and preferences over the items are used to
recommend items to new users. In this case, features of the items are not known.
Answer: Type I error is committed when the null hypothesis is true and we reject it, also
known as a ‘False Positive’. Type II error is committed when the null hypothesis is false and
we fail to reject it, also known as a ‘False Negative’.
In the context of a confusion matrix, we can say Type I error occurs when we classify a value
as positive (1) when it is actually negative (0). Type II error occurs when we classify a value
as negative (0) when it is actually positive (1).
Q31. You are working on a classification problem. For validation purposes, you’ve
randomly sampled the training data set into train and validation. You are confident that
your model will work incredibly well on unseen data since your validation accuracy is
high. However, you get shocked after getting poor test accuracy. What went wrong?
Answer: In the case of a classification problem, we should prefer stratified sampling over
simple random sampling. Random sampling doesn’t take into consideration the proportion of
target classes. In contrast, stratified sampling maintains the distribution of the target
variable in the resulting samples.
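A stratified train/validation split can be sketched by sampling within each class separately. This is a minimal standard-library illustration; the function name, the 25% validation fraction and the 80/20 class mix are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx) keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize within the class only
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical imbalanced target: 80 negatives and 20 positives.
target = [0] * 80 + [1] * 20
train_rows, valid_rows = stratified_split(target, valid_fraction=0.25)
print(len(train_rows), len(valid_rows))
print(sum(target[i] for i in valid_rows) / len(valid_rows))
```

The share of positives in the validation part matches the 20% in the full data, which a purely random split does not guarantee.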
Q32. You have been asked to evaluate a regression model based on R², adjusted R²
and tolerance. What will be your criteria?
Q33. In k-means or kNN, we use euclidean distance to calculate the distance between
nearest neighbors. Why not manhattan distance ?
Example: Think of a chess board: the movement made by a rook is measured by
manhattan distance because of its vertical & horizontal movements.
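For reference, the two distance measures named in the question can be written in a few lines. The point coordinates below are arbitrary and only serve as an illustration.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance: square root of summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid distance: sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

origin, point = (0, 0), (3, 4)
print(euclidean(origin, point), manhattan(origin, point))
```

Manhattan distance only adds up movement along the axes, whereas Euclidean distance measures the direct path in any direction.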
Answer: It’s simple. It’s just like how babies learn to walk. Every time they fall down, they
learn (unconsciously) & realize that their legs should be straight and not in a bent position.
The next time they fall down, they feel pain. They cry. But they learn ‘not to stand like that
again’. In order to avoid that pain, they try harder. To succeed, they even seek support from
the door or wall or anything near them, which helps them stand firm.
This is how a machine works & develops intuition from its environment.
Note: The interviewer is only trying to test whether you have the ability to explain complex
concepts in simple terms.
Q35. I know that a linear regression model is generally evaluated using Adjusted R² or
F value. How would you evaluate a logistic regression model?
1. Since logistic regression is used to predict probabilities, we can use AUC-ROC curve
along with confusion matrix to determine its performance.
2. Also, the analogous metric of adjusted R² in logistic regression is AIC. AIC is the
measure of fit which penalizes model for the number of model coefficients. Therefore,
we always prefer model with minimum AIC value.
3. Null Deviance indicates the response predicted by a model with nothing but an
intercept. Lower the value, better the model. Residual deviance indicates the response
predicted by a model on adding independent variables. Lower the value, better the
model.
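The AIC and deviance figures in points 2 and 3 can be computed directly from predicted probabilities. The sketch below assumes ungrouped 0/1 outcomes, for which the deviance equals -2 times the log-likelihood, and uses AIC = 2k - 2 log-likelihood with k fitted coefficients. The outcomes and predicted probabilities are invented for illustration, not the output of a fitted model.

```python
from math import log

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under probabilities p."""
    return sum(log(pi) if yi == 1 else log(1.0 - pi) for yi, pi in zip(y, p))

def deviance(y, p):
    """Deviance for ungrouped binary data: -2 * log-likelihood."""
    return -2.0 * log_likelihood(y, p)

def null_deviance(y):
    """Deviance of the intercept-only model, which predicts the mean of y."""
    p0 = sum(y) / len(y)
    return deviance(y, [p0] * len(y))

def aic(y, p, n_coefficients):
    """AIC = 2k - 2 * log-likelihood; lower is better."""
    return 2 * n_coefficients - 2.0 * log_likelihood(y, p)

# Hypothetical outcomes and predicted probabilities from a 2-coefficient model.
outcomes = [1, 0, 1, 1]
predicted = [0.9, 0.2, 0.8, 0.7]
print(null_deviance(outcomes), deviance(outcomes, predicted))
print(aic(outcomes, predicted, n_coefficients=2))
```

A residual deviance well below the null deviance indicates that the predictors add information beyond the intercept.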
Q36. Considering the long list of machine learning algorithm, given a data set, how do
you decide which one to use?
Answer: You should say that the choice of machine learning algorithm depends mainly on the
type of data. If you are given a data set which exhibits linearity, then linear regression would
be the best algorithm to use. If you are given images or audio to work on, then a neural network
would help you to build a robust model.
If the data comprises non linear interactions, then a boosting or bagging algorithm should
be the choice. If the business requirement is to build a model which can be deployed and
explained, then we’ll use regression or a decision tree model (easy to interpret and explain)
instead of black box algorithms like SVM, GBM etc.
In short, there is no one master algorithm for all situations. We must be scrupulous enough
to understand which algorithm to use.
Q37. Do you suggest that treating a categorical variable as continuous variable would
result in a better predictive model?
Answer: The error emerging from any model can be broken down into three components
mathematically. Following are these components:
Bias error quantifies how much, on average, the predicted values differ
from the actual values. A high bias error means we have an under-performing model which
keeps on missing important trends. Variance, on the other hand, quantifies how much the
predictions made on the same observation differ from each other. A high variance model will
over-fit on your training population and perform badly on any observation beyond training.
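The three components are usually written as the following decomposition of the expected squared error at a point x. The notation is an assumption made here: y = f(x) + \varepsilon with noise variance \sigma^2, and \hat{f} is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
```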
Answer: OLS and maximum likelihood are the methods used by the respective regression
methods to estimate the unknown parameter (coefficient) values. In simple words,
ordinary least squares (OLS) is a method used in linear regression which chooses the
parameters that minimize the squared distance between actual and predicted values. Maximum
likelihood chooses the parameter values under which the observed data are most likely
to have been produced.
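Assuming the two regression methods being compared are linear and logistic regression, the two estimation criteria can be written side by side. Here x_i is the feature vector and y_i the response of observation i; this notation is introduced for the illustration.

```latex
\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \sum_{i=1}^{n} \big(y_i - x_i^{\top}\beta\big)^2
\qquad
\hat{\beta}_{\text{MLE}} = \arg\max_{\beta} \sum_{i=1}^{n}
\Big[ y_i \log p_i(\beta) + (1 - y_i) \log\big(1 - p_i(\beta)\big) \Big],
\quad p_i(\beta) = \frac{1}{1 + e^{-x_i^{\top}\beta}}
```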
Machine learning is a branch of computer science which deals with programming systems so
that they automatically learn and improve with experience. For example: robots are
programmed so that they can perform tasks based on data they gather from sensors. They
automatically learn programs from data.
Machine learning relates to the study, design and development of the algorithms that
give computers the capability to learn without being explicitly programmed. Data
mining, in contrast, can be defined as the process of extracting knowledge or unknown
interesting patterns from unstructured data. During this process, machine learning
algorithms are used.
3) What is ‘Overfitting’ in Machine learning?
In machine learning, ‘overfitting’ occurs when a statistical model describes random error or
noise instead of the underlying relationship. Overfitting is normally observed when a model is
excessively complex, because it has too many parameters relative to
the number of training observations. A model which has been overfit exhibits poor
performance.
The possibility of overfitting exists as the criteria used for training the model is not the
same as the criteria used to judge the efficacy of a model.
In this technique, a model is usually given a dataset of a known data on which training
(training data set) is run and a dataset of unknown data against which the model is tested.
The idea of cross validation is to define a dataset to “test” the model in the training phase.
The inductive machine learning involves the process of learning by examples, where a
system, from a set of observed instances tries to induce a general rule.
a) Decision Trees
c) Probabilistic networks
d) Nearest Neighbor
a) Supervised Learning
b) Unsupervised Learning
c) Semi-supervised Learning
d) Reinforcement Learning
e) Transduction
f) Learning to Learn
9) What are the three stages to build the hypotheses or model in machine learning?
a) Model building
b) Model testing
The standard approach to supervised learning is to split the set of examples into the training
set and the test set.
In various areas of information science like machine learning, the set of data used to
discover the potentially predictive relationship is known as the ‘Training Set’. The training set
is the set of examples given to the learner, while the test set is used to test the accuracy of
the hypotheses generated by the learner, and it is the set of examples held back from the
learner. The training set is distinct from the test set.
a) Artificial Intelligence
a) Classifications
b) Speech recognition
c) Regression
e) Annotate strings
17) What is the difference between artificial intelligence and machine learning?
Designing and developing algorithms according to the behaviours based on empirical data
is known as Machine Learning. Artificial intelligence, in addition to machine learning,
also covers other aspects like knowledge representation, natural language processing,
planning, robotics etc.
A Naïve Bayes classifier will converge quicker than discriminative models like logistic
regression, so you need less training data. The main disadvantage is that it can’t learn
interactions between features.
a) Computer Vision
b) Speech Recognition
c) Data Mining
d) Statistics
e) Information Retrieval
f) Bio-Informatics
21) What is Genetic Programming?
Genetic programming is one of the two techniques used in machine learning. The model is
based on the testing and selecting the best choice among a set of results.
Inductive Logic Programming (ILP) is a subfield of machine learning which uses logical
programming representing background knowledge and examples.
The process of selecting models among different mathematical models, which are used to
describe the same data set is known as Model Selection. Model selection is applied to the
fields of statistics, machine learning and data mining.
24) What are the two methods used for the calibration in Supervised Learning?
The two methods used for predicting good probabilities in Supervised Learning are
a) Platt Calibration
b) Isotonic Regression
These methods are designed for binary classification, and it is not trivial.
When there is sufficient data ‘Isotonic Regression’ is used to prevent an overfitting issue.
26) What is the difference between heuristic for rule learning and heuristics for
decision trees?
The difference is that the heuristics for decision trees evaluate the average quality of a
number of disjointed sets while rule learners only evaluate the quality of the set of
instances that is covered with the candidate rule.
Bayesian logic program consists of two components. The first component is a logical one ;
it consists of a set of Bayesian Clauses, which captures the qualitative structure of the
domain. The second component is a quantitative one, it encodes the quantitative
information about the domain.
30) Why are instance based learning algorithms sometimes referred to as lazy learning
algorithms?
Instance based learning algorithms are also referred to as lazy learning algorithms as they
delay the induction or generalization process until classification is performed.
31) What are the two classification methods that SVM ( Support Vector Machine) can
handle?
Ensemble learning is used when you build component classifiers that are more accurate and
independent from each other.
36) What is the general principle of an ensemble method and what is bagging and
boosting in ensemble method?
The expected error of a learning algorithm can be decomposed into bias and variance. A
bias term measures how closely the average classifier produced by the learning algorithm
matches the target function. The variance term measures how much the learning
algorithm’s prediction fluctuates for different training sets.
Incremental learning method is the ability of an algorithm to learn from new data that may
be available after classifier has already been generated from already available dataset.
PCA (Principal Components Analysis), KPCA ( Kernel based Principal Component Analysis)
and ICA ( Independent Component Analysis) are important feature extraction techniques
used for dimensionality reduction.
In Machine Learning and statistics, dimension reduction is the process of reducing the
number of random variables under considerations and can be divided into feature selection
and feature extraction
Support vector machines are supervised learning algorithms used for classification and
regression analysis.
a) Data Acquisition
d) Query Type
e) Scoring Metric
f) Significance Test
43) What are the different methods for Sequential Supervised Learning?
a) Sliding-window methods
44) What are the areas in robotics and information processing where sequential
prediction problem arises?
The areas in robotics and information processing where sequential prediction problem
arises are
a) Imitation Learning
b) Structured prediction
Statistical learning techniques allow learning a function or predictor from a set of observed
data that can make predictions about unseen or future data. These techniques provide
guarantees on the performance of the learned predictor on the future unseen data based
on a statistical assumption on the data generating process.
PAC (Probably Approximately Correct) learning is a learning framework that has been
introduced to analyze learning algorithms and their statistical efficiency.
47) What are the different categories into which you can categorize the sequence learning
process?
a) Sequence prediction
b) Sequence generation
c) Sequence recognition
d) Sequential decision
a) Genetic Programming
b) Inductive Learning
50) Give a popular application of machine learning that you see on a day to day basis?
The recommendation engine implemented by major ecommerce websites uses Machine
Learning.
1. What is Data Science? Also, list the differences between supervised and
unsupervised learning.
Data Science is a blend of various tools, algorithms, and machine learning principles with
the goal to discover hidden patterns from the raw data. How is this different from what
statisticians have been doing for years?
2. What are the important skills to have in Python with regard to data analysis?
The following are some of the important skills to possess which will come handy when
performing data analysis using Python.
Good understanding of the built-in data types especially lists, dictionaries, tuples, and sets.
Knowing that you should use the Anaconda distribution and the conda package manager.
Ability to write small, clean functions (important for any developer), preferably pure
Knowing how to profile the performance of a Python script and how to optimize
bottlenecks.
The following will help to tackle any problem in data analytics and machine learning.
In the wide format, a subject’s repeated responses will be in a single row, and each
response is in a separate column. In the long format, each row is a one-time point per
subject. You can recognize data in wide format by the fact that columns generally
represent groups.
5. What do you understand by the term Normal Distribution?
Data is usually distributed in different ways with a bias to the left or to the right or it can
all be jumbled up.
However, there are chances that data is distributed around a central value without any
bias to the left or right and reaches normal distribution in the form of a bell-shaped curve.
The random variables are distributed in the form of a symmetrical bell-shaped curve.
Properties of Normal Distribution:
It is statistical hypothesis testing for a randomized experiment with two variants, A and
B.
The goal of A/B Testing is to identify any changes to the web page that maximize or
increase the outcome of interest. A/B testing is a fantastic method for figuring out the
best online promotional and marketing strategies for your business. It can be used to test
everything from website copy to sales emails to search ads.
An example of this could be identifying the click-through rate for a banner ad.
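For the banner ad example, one common way to run the hypothesis test is a two-proportion z-test on the click-through rates of variants A and B. The text does not name a specific test, so this choice is an assumption, as are the function name and the click counts below.

```python
from math import sqrt, erf

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical experiment: 1000 views per variant.
z_score, p_value = two_proportion_z_test(50, 1000, 80, 1000)
print(z_score, p_value)
```

A small p-value suggests that the difference in click-through rate between the two banners is unlikely to be due to chance alone.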
Sensitivity is nothing but “Predicted True events / Total events”. Predicted true events here
are the events which were true and which the model also predicted as true, so that
sensitivity = true positives / (true positives + false negatives),
where true positives are positive events which are correctly classified as positives.
In statistics and machine learning, one of the most common tasks is to fit a model to a set
of training data, so as to be able to make reliable predictions on general untrained data.
In overfitting, a statistical model describes random error or noise instead of the underlying
relationship. Overfitting occurs when a model is excessively complex, such as having too
many parameters relative to the number of observations. A model that has been overfit
has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture
the underlying trend of the data. Underfitting would occur, for example, when fitting a
linear model to non-linear data. Such a model too would have poor
predictive performance.
Python would be the best option because it has Pandas library that provides easy to use
Cleaning data from multiple sources helps to transform it into a format that data analysts
Data Cleaning helps to increase the accuracy of the model in machine learning.
It is a cumbersome process because as the number of data sources increases, the time
taken to clean the data increases exponentially due to the number of sources and the
It might take up to 80% of the time for just cleaning data making it a critical part of analysis
task.
The bivariate analysis attempts to understand the difference between two variables at a
time as in a scatterplot. For example, analyzing the volume of sale and spending can be
considered as an example of bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the
effect of variables on the responses.
Cluster sampling is a technique used when it becomes difficult to study the target
population spread across a wide area and simple random sampling cannot be
applied. Cluster Sample is a probability sample where each sampling unit is a collection
or cluster of elements.
For example, a researcher wants to survey the academic performance of high school
students in Japan. He can divide the entire population of Japan into different clusters
(cities). Then the researcher selects a number of clusters depending on his research
through simple or systematic random sampling.
15. Can you cite some examples where a false positive is more important than a false
negative?
Let us first understand what false positives and false negatives are.
False Positives are the cases where you wrongly classified a non-event as an event a.k.a
Type I error.
False Negatives are the cases where you wrongly classify events as non-events, a.k.a
Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients.
Assume a patient comes to that hospital and he is tested positive for cancer, based on
the lab prediction but he actually doesn’t have cancer. This is a case of false positive.
Here it is of utmost danger to start chemotherapy on this patient when he actually does
not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage
to his normal healthy cells and might lead to severe diseases, even cancer.
Example 2: Let’s say an e-commerce company decided to give a $1000 gift voucher to the
customers whom they expect to purchase at least $10,000 worth of items. They send
the free voucher mail directly to 100 customers without any minimum purchase condition
because they expect to make at least 20% profit on sold items above $10,000. Now the
issue arises if we send the $1000 gift vouchers to customers who have not actually purchased
anything but are marked as having made $10,000 worth of purchases.
16. Can you cite some examples where a false negative is more important than a false
positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats and
based on certain characteristics they identify whether a particular passenger can be a
threat or not. Due to a shortage of staff, they decide to scan only passengers predicted
as risk positives by their predictive model. What will happen if a true threat passenger is
flagged as non-threat by the airport model?
Example 3: What if you refused to marry a very good person based on your predictive
model and you happen to meet him/her after a few years and realize that you had a
false negative?
17. Can you cite some examples where both false positives and false negatives are
equally important?
In the Banking industry giving loans is the primary source of making money but at the
same time if your repayment rate is not good you will not make any profit, rather you will
risk huge losses.
Banks don’t want to lose good customers and at the same point in time, they don’t want
to acquire bad customers. In this scenario, both the false positives and false negatives
become very important to measure.
18. Can you explain the difference between a Validation Set and a Test Set?
A validation set can be considered a part of the training set, as it is used for parameter
selection and to avoid overfitting of the model being built.
On the other hand, a test set is used for testing or evaluating the performance of a
trained machine learning model.
In simple terms, the differences can be summarized as follows: the training set is used to
fit the parameters (i.e. weights), and the test set is used to assess the performance of the
model (i.e. its predictive power and generalization).
The goal of cross-validation is to define a data set to test the model in the training phase
(i.e. a validation data set) in order to limit problems like overfitting and gain insight into
how the model will generalize to an independent data set.
Machine learning explores the study and construction of algorithms that can learn from
and make predictions on data. It is closely related to computational statistics and is used
to devise complex models and algorithms that lend themselves to prediction, which in
commercial use is known as predictive analytics.
Figure: Applications of Machine Learning
Supervised learning is the machine learning task of inferring a function from labeled
training data. The training data consist of a set of training examples.
E.g. If you built a fruit classifier, the labels will be “this is an orange, this is an apple and
this is a banana”, based on showing the classifier examples of apples, oranges and
bananas.
Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models
E.g. In the same example, a fruit clustering will categorize the fruits as “fruits with soft
skin and lots of dimples”, “fruits with shiny hard skin” and “elongated yellow fruits”.
23. What are the various classification algorithms?
24. What is logistic regression? State an example when you have used logistic
regression recently.
Logistic regression, often referred to as the logit model, is a technique to predict a binary
outcome from a linear combination of predictor variables.
For example, suppose you want to predict whether a particular political leader will win the
election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (Lose/Win).
The predictor variables here would be the amount of money spent on election campaigning
for a particular candidate, the amount of time spent campaigning, etc.
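As a rough illustration of the election example, the sketch below turns a linear combination
of two predictors into a win probability with the logistic (sigmoid) function. The coefficient
values and the function name predict_win_probability are made up for illustration and are
not fitted to any data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_win_probability(money_spent: float, days_campaigning: float) -> float:
    # Hypothetical coefficients, chosen only for illustration (not learned from data).
    intercept = -4.0
    weight_money = 0.5   # effect per unit of money spent
    weight_days = 0.02   # effect per day of campaigning
    linear_score = intercept + weight_money * money_spent + weight_days * days_campaigning
    return sigmoid(linear_score)


probability = predict_win_probability(money_spent=6.0, days_campaigning=100.0)
predicted_class = 1 if probability >= 0.5 else 0  # 1 = win, 0 = lose
```

In practice the coefficients would be estimated from historical data rather than set by hand.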
Recommender Systems are a subclass of information filtering systems that are meant
to predict the preferences or ratings that a user would give to a product. Recommender
systems are widely used in movies, news, research articles, products, social tags, music,
etc.
Collaborative filtering is the process of filtering used by most recommender systems to
find patterns or information through collaboration among viewpoints, various data sources
and multiple agents.
An example of collaborative filtering can be to predict the rating of a particular user based
on his/her ratings for other movies and others’ ratings for all movies. This concept is
widely used in recommending movies in IMDB, Netflix & BookMyShow, product
recommenders in e-commerce sites like Amazon, eBay & Flipkart, YouTube video
recommendations and game recommendations in Xbox.
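A minimal sketch of user-based collaborative filtering, assuming a tiny made-up ratings
table: a missing rating is predicted as a similarity-weighted average of other users' ratings
for that movie. The user names, movie titles and the choice of cosine similarity are
illustrative assumptions, not a description of how any of the sites above work.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 3, "Movie C": 4},
    "bob": {"Movie A": 4, "Movie B": 2, "Movie C": 5, "Movie D": 4},
    "carol": {"Movie A": 1, "Movie B": 5, "Movie D": 2},
}


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def predict_rating(user: str, movie: str):
    """Similarity-weighted average of other users' ratings for the movie."""
    weighted_sum = 0.0
    total_similarity = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[movie]
        total_similarity += similarity
    return weighted_sum / total_similarity if total_similarity else None


predicted = predict_rating("alice", "Movie D")
```

Because alice's tastes are closer to bob's than to carol's, the prediction is pulled toward
bob's rating for the movie.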
28. How can outlier values be treated?
Outlier values can be identified by using univariate or any other graphical analysis
method. If there are only a few outlier values, they can be assessed individually, but
for a large number of outliers, the values can be substituted with either the 99th or the 1st
percentile values.
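The percentile substitution described above can be sketched as follows, using only the
standard library. The function name cap_outliers and the sample data are illustrative, and
the "inclusive" quantile method is an assumption made so that the cut points stay within
the observed range.

```python
import statistics


def cap_outliers(values):
    """Clamp every value to the [1st percentile, 99th percentile] range."""
    # quantiles(n=100) returns 99 cut points: index 0 is the 1st percentile,
    # index -1 is the 99th percentile.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    lower, upper = cut_points[0], cut_points[-1]
    return [min(max(value, lower), upper) for value in values]


# Made-up data: 99 ordinary values and one extreme value.
values = list(range(1, 100)) + [10_000]
capped = cap_outliers(values)
```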
Not all extreme values are outlier values. The most common ways to treat outlier values
The extent of the missing values is identified after identifying the variables with missing
values. If any patterns are identified, the analyst has to concentrate on them, as they could
lead to interesting and meaningful business insights.
If there are no patterns identified, then the missing values can be substituted with mean
or median values (imputation) or they can simply be ignored. Another option is to assign a
default value, which can be the mean, minimum or maximum value. Getting into the data is
important.
If it is a categorical variable, the missing value is assigned a default value. If the data
follows a normal distribution, assign the mean value.
If 80% of the values for a variable are missing, then you can answer that you would
drop the variable instead of treating the missing values.
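A minimal sketch of these rules for a single numeric column, where None marks a missing
value. The function name impute_or_drop, its parameters and the convention of returning
None to mean "drop the variable" are assumptions made for illustration.

```python
import statistics


def impute_or_drop(column, drop_threshold=0.8, strategy="median"):
    """Fill missing values (None) with the mean or median, or signal a drop.

    Returns None when the share of missing values reaches drop_threshold,
    meaning the variable should be dropped rather than imputed.
    """
    observed = [value for value in column if value is not None]
    missing_share = (len(column) - len(observed)) / len(column)
    if missing_share >= drop_threshold:
        return None
    if strategy == "mean":
        fill_value = statistics.mean(observed)
    else:
        fill_value = statistics.median(observed)
    return [fill_value if value is None else value for value in column]


filled = impute_or_drop([1, None, 3])
dropped = impute_or_drop([None] * 9 + [5])
```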
31. How will you define the number of clusters in a clustering algorithm?
Though the Clustering Algorithm is not specified, this question is mostly in reference to
K-Means clustering where “K” defines the number of clusters. The objective of clustering
is to group similar entities in a way that the entities within a group are similar to each other
but the groups are different from each other.
Red circled point in above graph i.e. Number of Cluster =6 is the point after which you
This is the most widely used approach, but a few data scientists also use hierarchical
clustering first to create dendrograms and identify the distinct groups from there.
Now that we have seen the Machine Learning questions, let’s continue our Data
Science Interview Questions blog with some Probability questions.
PROBABILITY INTERVIEW QUESTIONS
32. In any 15-minute interval, there is a 20% probability that you will see at least
one shooting star. What is the probability that you see at least one shooting star in
the period of an hour?
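Probability of not seeing any shooting star in 15 minutes is
= 1 - P(seeing at least one shooting star) = 1 - 0.2 = 0.8
Probability of not seeing any shooting star in the period of one hour (four independent
15-minute intervals)
= (0.8) ^ 4 = 0.4096
Probability of seeing at least one shooting star in the period of one hour
= 1 - 0.4096 = 0.5904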
33. How can you generate a random number between 1 – 7 with only a die?
Any die has six sides, numbered 1-6. There is no way to get seven equal outcomes from a
single roll of a die. If we roll the die twice and consider the ordered pair of rolls, we now
have 36 different outcomes.
Since 36 is not divisible by 7, we can consider only 35 of these outcomes and exclude the
remaining one.
A simple scenario can be to exclude the combination (6,6), i.e., to roll the die again if 6
appears twice.
All the remaining combinations from (1,1) till (6,5) can be divided into 7 parts of 5 each.
This way all the seven sets of outcomes are equally likely.
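The scheme above can be sketched as follows. The function names are illustrative, and
Python's random module stands in for the physical die.

```python
import random


def outcome_from_rolls(first: int, second: int):
    """Map an ordered pair of die rolls to 1..7, or None for the excluded pair (6, 6)."""
    index = (first - 1) * 6 + (second - 1)  # 0..35; index 35 is the pair (6, 6)
    if index == 35:
        return None
    return index // 5 + 1  # 35 remaining pairs split into 7 groups of 5


def random_1_to_7(rng=random) -> int:
    """Roll two dice repeatedly until the pair is not (6, 6), then map it to 1..7."""
    while True:
        result = outcome_from_rolls(rng.randint(1, 6), rng.randint(1, 6))
        if result is not None:
            return result
```

Each value from 1 to 7 corresponds to exactly 5 of the 35 accepted pairs, so the seven
results are equally likely.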
34. A certain couple tells you that they have two children, at least one of which is a
girl. What is the probability that they have two girls?
With two children, there are 4 equally likely possibilities: BB, BG, GB and GG,
where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus from the remaining 3
possibilities of BG, GB & GG, we have to find the probability of the case with two girls.
Thus, P(Having two girls given one girl) = 1/3
35. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin
at random, and toss it 10 times. Given that you see 10 heads, what is the probability
that the next toss of that coin is also a head?
There are two ways of choosing the coin. One is to pick a fair coin and the other is to pick
the one with two heads.
P(10 heads in a row) = P(selecting a fair coin) * P(getting 10 heads with it) + P(selecting
the unfair coin, which always shows heads)
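One way to carry this calculation through is with Bayes' rule, sketched below with exact
fractions. The variable names are illustrative; the numbers come from the question.

```python
from fractions import Fraction

# Prior probabilities of picking each kind of coin from the jar.
p_fair = Fraction(999, 1000)
p_double_headed = Fraction(1, 1000)

# Likelihood of observing 10 heads in a row with each kind of coin.
p_heads10_given_fair = Fraction(1, 2) ** 10
p_heads10_given_double = Fraction(1)

# Total probability of observing 10 heads in a row.
p_heads10 = p_fair * p_heads10_given_fair + p_double_headed * p_heads10_given_double

# Posterior probability of each coin, given the 10 heads (Bayes' rule).
posterior_fair = p_fair * p_heads10_given_fair / p_heads10
posterior_double = 1 - posterior_fair

# Probability that the next toss is also a head.
p_next_head = posterior_fair * Fraction(1, 2) + posterior_double * 1
```

Under these assumptions the probability that the next toss is a head comes out to
3047/4046, roughly 0.753.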
36. What do you mean by Deep Learning and Why has it become popular now?
Deep learning is a paradigm of machine learning which has shown incredible
promise in recent years. This is because deep learning shows a great
analogy with the functioning of the human brain.
Now although Deep Learning has been around for many years, the major breakthroughs
from these techniques came just in recent years. This is because of two main reasons:
The increase in the amount of data generated through various sources
GPUs are multiple times faster and they help us build bigger and deeper deep learning
models in comparatively less time than we required previously
Artificial Neural networks are a specific set of algorithms that have revolutionized machine
learning. They are inspired by biological neural networks. Neural Networks can adapt to
changing input so the network generates the best possible result without needing to
redesign the output criteria.
Artificial Neural Networks work on the same principle as a biological neural network. They
consist of inputs which get processed with weighted sums and a bias, with the help of
activation functions.
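A single artificial neuron of this kind can be sketched as below. The sigmoid is used as the
activation function purely as an example, and the function name and sample numbers are
illustrative.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


# Two inputs with made-up weights; here the weighted sum is 0.5 - 0.5 + 0.0 = 0.
activation = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```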
Gradient descent can be thought of as climbing down to the bottom of a valley, instead of
climbing up a hill. This is because it is a minimization algorithm that minimizes a given
function (the cost function).
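As a minimal sketch, the loop below minimizes the one-dimensional function
f(w) = (w - 3)^2 by repeatedly stepping against its gradient. The function being minimized,
the learning rate and the step count are arbitrary choices made for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly move against the gradient to approach a minimum."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
minimum = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

With a learning rate that is too large the steps can overshoot the valley floor and diverge,
which is why the rate is usually kept small.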
Stochastic Gradient Descent: We use only a single training example for calculation of
the gradient and the parameter update.
Batch Gradient Descent: We calculate the gradient for the whole dataset and perform
the update at each iteration.
Mini-batch Gradient Descent: It’s one of the most popular optimization algorithms. It’s a
variant of Stochastic Gradient Descent, and here instead of a single training example, a
mini-batch of samples is used.
PyTorch
TensorFlow
Keras
Caffe
Chainer
The activation function is used to introduce non-linearity into the neural network, helping
it to learn more complex functions. Without it, the neural network would only be able
to learn a linear function, i.e. a linear combination of its input data. An activation
function is a function in an artificial neuron that delivers an output based on inputs.
Autoencoders are simple learning networks that aim to transform inputs into outputs
with the minimum possible error. This means that we want the output to be as close to
input as possible. We add a couple of layers between the input and the output, and the
sizes of these layers are smaller than the input layer. The autoencoder receives
unlabeled input which is then encoded to reconstruct the input.
Boltzmann machines have a simple learning algorithm that allows them to discover
interesting features that represent complex regularities in the training data. The
Boltzmann machine is basically used to optimize the weights and the quantity for the
given problem. The learning algorithm is very slow in networks with many layers of feature
detectors. “Restricted Boltzmann Machines” algorithm has a single layer of feature
detectors which makes it faster than the rest.
1. Question 1. Explain How We Can Capture The Correlation Between Continuous
And Categorical Variable?
Answer :
It is possible by using the ANCOVA technique, which stands for Analysis of Covariance.
It is used to calculate the association between continuous and categorical variables.
2. Question 2. How To Handle Missing Or Corrupted Data In A Dataset?
Answer :
You can handle missing or corrupted data in a data set either by dropping the
affected rows or columns, or by replacing the data with another value.
In Pandas, two methods are very useful here: isnull() identifies the missing data and
dropna() drops it. A third method, fillna(), replaces missing values with another
value.
3. Question 3. Define What Is Fourier Transform In A Single Sentence?
Answer :
A Fourier transform is a process of decomposing generic functions into a
superposition of symmetric functions.
4. Question 4. What Is Deep Learning?
Answer :
Deep learning is a subset of machine learning.
5. Question 5. What Is The Difference Between An Array And Linked List?
Answer :
An array is an ordered collection of objects. A linked list is a series of
objects that are processed in sequential order.
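The contrast can be sketched in Python, with a built-in list standing in for the array and a
small hand-written node class for the linked list. The class and function names are
illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One element of a singly linked list: a value plus a reference to the next node."""
    value: int
    next: Optional["Node"] = None


def build_linked_list(values):
    """Build a linked list that holds the values in the given order."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def to_list(head):
    """Walk the linked list node by node, which is the only way to reach later items."""
    items = []
    node = head
    while node is not None:
        items.append(node.value)
        node = node.next
    return items


array = [10, 20, 30]
third = array[2]               # an array supports direct access by index

head = build_linked_list(array)
head = Node(5, head)           # a linked list supports cheap insertion at the front
```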
6. Question 6. Define A Hash Table?
Answer :
A hash table is a data structure that produces an associative array.
Hash tables are generally used for database indexing.
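A minimal sketch of a hash table with separate chaining is shown below. The class name,
the fixed bucket count and the method names are illustrative choices; Python's built-in
dict is the production-quality equivalent.

```python
class HashTable:
    """A tiny associative array: keys are hashed into a fixed number of buckets."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # The hash of the key decides which bucket holds it.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = HashTable()
table.put("customer_42", "row 7")
table.put("customer_42", "row 9")  # same key: the value is replaced
```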
7. Question 7. Mention Any One Of The Data Visualization Tools That You Are
Familiar With?
Answer :
This is another question where one has to be completely honest; sharing your
personal experience with these types of tools is really important. Some of the
data visualization tools are Tableau, Plot.ly, and matplotlib.
8. Question 8. Is Rotation Necessary In PCA?
Answer :
Yes, the rotation is definitely necessary because it maximizes the differences
between the variance captured by the components.
9. Question 9. How Is The F1 Score Used?
Answer :
The F1 score is the harmonic mean of the precision and recall of a model.
An F1 score of 1 is the best possible result and 0 is the worst.
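As a small worked sketch, the F1 score is computed as the harmonic mean of precision and
recall, F1 = 2 * P * R / (P + R). The function name and the zero-division convention
(returning 0.0 when both inputs are 0) are assumptions made for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the range 0 (worst) to 1 (best)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(precision=1.0, recall=1.0)
lopsided = f1_score(precision=0.5, recall=1.0)
```

Unlike a plain arithmetic average, the harmonic mean is pulled toward the smaller of the
two values, so a model cannot score well on F1 by excelling at only one of them.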
10. Question 10. How Recall And True Positive Rate Are Related?
Answer :
The relation is
True Positive Rate = Recall.
11. Question 11. Assume That You Are Working On A Data Set, Explain How Would
You Select Important Variables?
Answer :
The following are a few methods that can be used to select important variables:
o Use of Lasso Regression method.
o Using Random Forest, plot variable importance chart.
o Using Linear regression.
12. Question 12. Explain The Concept Of Machine Learning And Assume That You
Are Explaining This To A 5-year-old Baby?
Answer :
Machine learning works much the same way babies learn their day-to-day
activities, such as how to walk or sleep. Babies cannot walk straight away: they
fall, get up again and then try again. Machine learning is the same: the algorithm
keeps working and refining itself every time to make sure the end result is as
accurate as possible.
13. Question 13. What Is The Difference Between Machine Learning And Data
Mining?
Answer :
Data mining is about working on unstructured data and extracting interesting and
previously unknown patterns from it.
Machine learning is the study, design and development of algorithms that give
machines the capacity to learn.
14. Question 14. Please State A Few Popular Machine Learning Algorithms?
Answer :
o Nearest Neighbour
o Neural Networks
o Decision Trees etc
o Support vector machines
15. Question 15. What Are The Three Stages To Build The Model In Machine
Learning?
Answer :
o Model building
o Model testing
o Applying the model
16. Question 16. What Is The Difference Between Supervised And Unsupervised
Machine Learning?
Answer :
Supervised learning is a process that requires labeled training data.
Unsupervised learning doesn’t require data labeling.
17. Question 17. What Is The Difference Between Bias And Variance?
Answer :
Bias: Bias is an error that occurs due to erroneous or overly simplistic
assumptions in the learning algorithm.
Variance: Variance is an error caused by the complexity of the algorithm
that is being used to analyze the data.