50 Deep Learning Technical Interview Questions With Answers
Deep learning job interviews. A necessary evil. Most beginners in the industry break out in a
cold sweat at the mere thought of a machine learning or a deep learning job interview. How
do I prepare for my upcoming deep learning job interview? What kind of deep learning
interview questions are they going to ask me? What questions should I ask them? These are
just a few thoughts that run through the mind of any interviewee. The problem with most
machine learning or deep learning interviews is that you never know whether you have to bring
your lucky whiteboard marker or your lucky keyboard. Not to mention that the deep learning
questions you will be asked in your next job interview are hardly predictable.
The good news? We’ve collated 100 deep learning interview questions from the insights of
our industry experts on what kind of questions they ask most often. So, keep calm and read
on to see what kind of questions you can expect in the hot seat in your next deep learning job
interview. Ready to dive in? Then let’s get started!
1) What kind of a neural network will you use in deep learning regression via Keras-
TensorFlow? Or How will you decide the best neural network model for a given
problem?
The first step in choosing a neural network model is to understand the data well and then
pick the model that suits it. Whether the problem is linearly separable or not also matters
when deciding on a neural network model. So, the task at hand and the data play a vital role
in choosing the best neural network model for a given problem. However, it is usually better
to start with a simple model such as a multi-layer perceptron (MLP) with just one hidden
layer before moving to a CNN, LSTM, or other RNN, which require more configuration of nodes
and layers. For regression in Keras-TensorFlow, a small MLP is a reasonable baseline because
it has few structural decisions to make and is comparatively insensitive to weight
initialization.
2) Why do we need autoencoders when there are already powerful dimensionality
reduction techniques like Principal Component Analysis?
The curse of dimensionality (the problems that arise when working with high-dimensional
data) is a common problem when working on machine learning or deep learning projects.
Curse of Dimensionality causes lots of difficulties while training a model because it requires
training a lot of parameters on a scarce dataset leading to issues like overfitting, large training
times and poor generalization. PCA and autoencoders are used to tackle these issues.
PCA is an unsupervised technique in which the data is projected onto the directions of
highest variance, while autoencoders are neural networks that compress the data into a
low-dimensional latent space and then try to reconstruct the original high-dimensional data.
PCA and autoencoders are effective only when the features have some relationship with each
other. A general rule of thumb for choosing between PCA and autoencoders is the size of the
data: autoencoders work well for larger datasets and PCA works well for smaller datasets.
Autoencoders are usually preferred when there is a need to model non-linearities and
relatively complex relationships. When the low-dimensional structure of the data is curved
or non-linear, autoencoders can encode more information in fewer dimensions, making them a
better choice than PCA in such scenarios.
Autoencoders are also widely used for identifying data anomalies, not just for reducing
data. Anomalous data points can be identified by their high reconstruction error. PCA is less
suited to this because it reconstructs data poorly when there are non-linear relationships.
3) Say you have to build a neural network architecture; how will you decide how many
neurons and hidden layers are needed for the network?
Given a business problem, there is no hard and fast rule to determine the exact number of
neurons and hidden layers required to build a neural network architecture. A common rule of
thumb is that the size of a hidden layer lies between the size of the input layer and the
size of the output layer. Beyond that, here are some common approaches that give a good
start to building a neural network architecture –
• To address any specific real-world predictive modelling problem, the best way is to
start with a rough systematic experimentation and find out what would work best for
any given dataset based on prior experience working with neural networks on similar
real-world problems. Based on the understanding of any given problem domain and
one’s experience working with neural networks, one can choose the network
configuration. The number of layers and neurons used on similar problems is always a
great way to start testing the configuration of a neural network.
• It is always advisable to begin with a simple neural network architecture and then go
on to enhance the complexity of the neural network.
• Try working with varying depth of networks and configure deep neural networks only
for challenging predictive modelling problems where depth can be beneficial.
4) Why is CNN preferred over ANN for Image Classification tasks even though it is
possible to solve image classification using ANN?
One common problem with using ANNs for image classification is that ANNs react
differently to input images and their shifted versions. Let’s consider a simple example where
you have the picture of a dog in the top left of one image and, in another image, a
picture of a dog at the bottom right. An ANN trained on the first image tends to assume that
a dog will always appear in that section of an image; however, that’s not the case.
ANNs also typically work on concrete data points: if you are building a deep learning model
to distinguish between cats and dogs, features such as the length of the ears and the width
of the nose are provided as inputs, whereas a CNN extracts spatial features directly from the
input images. When there are thousands of features to be extracted, CNN is a better choice
because it gathers features on its own, unlike an ANN where each individual feature needs to
be measured.
5) Why is Sigmoid or Tanh not preferred as the activation function in the
hidden layers of a neural network?
A common problem with Tanh or Sigmoid functions is that they saturate: for large positive
or negative inputs their gradient is close to zero. Once saturated, the learning algorithm
can no longer adapt the weights and improve the performance of the model.
Thus, Sigmoid or Tanh activation functions can prevent the neural network from learning
effectively, leading to the vanishing gradient problem. The vanishing gradient problem can be
addressed by using the Rectified Linear Unit (ReLU) activation function instead of sigmoid
and by using Xavier initialization.
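The saturation argument above can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes every layer contributes the same local derivative and ignores the weights, and the helper name `chained_gradient` is made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its largest possible value is 0.25 (at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(local_grad, pre_activation, n_layers):
    # Simplifying assumption: the same local derivative at every layer, weights ignored.
    return local_grad(pre_activation) ** n_layers

sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 10)  # 0.25 ** 10, best case for sigmoid
relu_chain = chained_gradient(relu_grad, 1.0, 10)        # 1.0 ** 10 for an active ReLU unit
saturated = sigmoid_grad(10.0)                           # nearly zero once the unit saturates

print(sigmoid_chain, relu_chain, saturated)
```

Even in the best case for the sigmoid, ten layers shrink the gradient to under one millionth, while an active ReLU unit passes it through unchanged.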
The exploding gradient problem happens when the gradients, and with them the model weights,
grow exponentially and become unexpectedly large during training.
In a neural network with n hidden layers, n derivatives are multiplied together during
backpropagation. If the factors being multiplied are greater than 1, the gradient grows
exponentially and eventually explodes as it is propagated through the model. The weight
updates then become excessively large, which hinders model training and hurts the overall
accuracy of the model. Exploding gradients are a serious problem because the model
cannot learn from its training data, resulting in a poor loss. One can deal with the
exploding gradient problem by gradient clipping, weight regularization, or the use of LSTMs.
• It is always advisable to divide the dataset into training, validation, and test set.
• When working with little data, this problem can be solved by changing the parameters
of the neural network by trial and error.
• Increasing the size of the training dataset.
• Use batch normalization.
• Regularization
• Reduce the network complexity
8) What do you understand by learning rate in a neural network model? What happens if
the learning rate is too high or too low?
Learning rate is one of the most important configurable hyperparameters used in the training
of a neural network. It is a small positive value, typically between 0 and 1. Choosing the
learning rate is one of the most challenging aspects of training a neural network because it
controls how quickly or slowly a neural network model adapts to a given problem and learns.
A higher learning rate means that the model needs fewer training epochs but makes rapid
changes that can overshoot the minimum, settle on a suboptimal solution, or diverge, while a
smaller learning rate means that the model will take a long time to converge or might get
stuck before reaching a good solution.
Thus, it is advisable not to use a learning rate that is too low or too high; instead, a good
learning rate value should be discovered through trial and error.
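A toy example makes the effect of the learning rate visible. The sketch below runs plain gradient descent on f(w) = w², a function chosen only for illustration; the three learning rates are arbitrary values picked to show the three regimes.

```python
def gradient_descent(lr, steps=20, w=1.0):
    # Minimise f(w) = w**2, whose gradient is 2*w. The minimum is at w = 0.
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

too_low = gradient_descent(lr=0.001)    # barely moves away from the starting point
reasonable = gradient_descent(lr=0.1)   # gets close to the minimum at 0
too_high = gradient_descent(lr=1.5)     # overshoots on every step and diverges

print(too_low, reasonable, too_high)
```

After the same 20 steps, the low rate has hardly moved, the moderate rate is near the minimum, and the high rate has blown up.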
9) What kind of a network would you prefer – shallow network or a deep network for
voice recognition?
A typical neural network has one or more hidden layers between its input and output layers.
Neural networks that use a single hidden layer are known as shallow neural networks while
those that use multiple hidden layers are referred to as deep neural networks. Both shallow
and deep networks are capable of fitting any function, but shallow networks require a lot of
parameters, unlike deep networks that can fit functions even with a limited number of
parameters because of their several layers. Deep networks are preferred today over shallow
networks, including for voice recognition, because at every layer the model learns a novel
and more abstract representation of the input. Also, they are much more efficient in terms
of number of parameters and computations compared to shallow networks.
10) Can you train a neural network model by initializing all biases as 0?
Yes, the neural network model can learn even if all the biases are initialized to 0, as long
as the weights are initialized to non-identical (for example, random) values that break the
symmetry between neurons.
11) Can you train a neural network model by initializing all the weights to 0?
No, it is not possible to train a model by initializing all the weights to 0 because the neural
network will never learn to perform a given task. Initializing all weights to zero causes
the derivatives to be the same for every w in W[1] (the weight matrix of the first layer),
so all neurons in the layer learn the same features in every iteration. Not just 0, but any
kind of constant initialization of weights is likely to produce a poor result.
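The symmetry problem can be checked on a tiny network. The sketch below is a hand-rolled illustration, assuming a 2-input, 2-hidden-unit tanh network with one linear output and a squared-error loss; the function name `hidden_weight_grads` and all numbers are invented for the example.

```python
import math

def hidden_weight_grads(W1, w2, x, y):
    """Gradients of 0.5 * (pred - y)**2 with respect to each row of W1."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    pred = sum(w * hi for w, hi in zip(w2, h))
    err = pred - y
    grads = []
    for j in range(len(W1)):
        dh = err * w2[j] * (1.0 - h[j] ** 2)   # backpropagate through tanh
        grads.append([dh * xi for xi in x])
    return grads

x, y = [1.0, 2.0], 1.0

# Constant initialization: both hidden units receive exactly the same gradient.
const_grads = hidden_weight_grads([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, y)

# All-zero initialization: the hidden-layer gradients vanish entirely.
zero_grads = hidden_weight_grads([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], x, y)

print(const_grads, zero_grads)
```

With identical gradients, the two hidden units stay identical after every update, so the layer behaves like a single unit no matter how long it trains.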
Without non-linearities, a neural network behaves like a single linear model regardless of
how many layers it has, making the output a linear function of the input. In other words, a
neural network with n layers and m hidden units with linear activation functions is just like
a linear neural network without hidden layers, which can only find linear separation
boundaries. A neural network without non-linearities cannot find appropriate solutions and
classify the data correctly for complex problems.
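The collapse of stacked linear layers into one can be verified directly. The sketch below uses plain Python lists and made-up weight matrices, with biases left out for brevity.

```python
def matmul(A, B):
    # Matrix product of two lists-of-rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    # Matrix-vector product.
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units, linear activation
W2 = [[1.0, 0.0, 2.0]]                       # 3 hidden units -> 1 output
x = [4.0, -2.0]

two_layer = matvec(W2, matvec(W1, x))        # run the "deep" linear network
collapsed = matvec(matmul(W2, W1), x)        # one layer with weights W2 @ W1

print(two_layer, collapsed)
```

Both paths give the same output, so the hidden layer adds no expressive power when its activation is linear.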
14) A deep learning model produces close to 12 million face vectors. How will you find a
new face quickly?
You will need to know about One Shot Learning for Face Recognition, which is a
classification task where one or a few examples (faces in this case) are used for classifying
new examples (faces) in future. You also need to know how to index data so that a new face
can be retrieved faster. A new face can be recognized by finding the vectors that are closest
(most similar) to the input face, but the system would become very slow if we were to
calculate the distance to all 12 million vectors. A convenient way is to index the data in
the real vector space by dividing it into structures that are easy to query (almost like a
tree data structure). With such an index, the vectors closest to a new face can be found very
quickly whenever new data is available. Approximate nearest neighbour techniques such as
Annoy indexing and Locality Sensitive Hashing can be used for this purpose.
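As a rough illustration of the indexing idea, here is a minimal random-hyperplane Locality Sensitive Hashing sketch in plain Python. It is a toy under stated assumptions, not how Annoy or any production library is implemented: the vectors are random 8-dimensional stand-ins for face embeddings, and all function names are invented for this example.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_planes, dim, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector lies on.
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes)

def build_index(vectors, planes):
    index = defaultdict(list)
    for i, vec in enumerate(vectors):
        index[signature(vec, planes)].append(i)
    return index

def query(vec, vectors, index, planes):
    # Compare only against vectors in the same bucket; true neighbours can be missed.
    candidates = index.get(signature(vec, planes), [])
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], vec))
    return min(candidates, key=sq_dist) if candidates else None

rng = random.Random(1)
faces = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]  # stand-in embeddings
planes = make_hyperplanes(n_planes=6, dim=8)
index = build_index(faces, planes)

match = query(faces[42], faces, index, planes)
bucket_size = len(index[signature(faces[42], planes)])
print(match, bucket_size)
```

The query only scans one bucket instead of all vectors, which is the trade the text describes: much less work per lookup in exchange for approximate results.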
15) What has fostered the implementation and experimentation of powerful neural
network architectures in the industry?
Flexibility makes deep learning powerful. Neural networks are universal function
approximators, so even when the problem at hand is complex (where the formula between
input and output is not known), a neural network can approximate it. Also, transfer
learning (where the trained weights of an existing neural network are used to initialize the
weights of another network that performs a similar task) makes the application of deep
learning much easier in situations where training a neural network from scratch is
costly or almost impossible because of data scarcity.
Faster and more powerful computational resources are also a prime reason for the adoption of
neural network architectures. One cannot deny that with GPU acceleration a neural network
can be trained in minutes where it would otherwise take days.
16) Can you build deep learning models based solely on linear regression?
Yes, it is definitely possible to build deep networks using a linear function as the activation
function for each layer if the problem is represented by a linear equation. However, a
composition of linear functions is itself a linear function, so nothing extraordinary can be
achieved with a deep network of this kind: adding more layers or nodes to the network will
not increase the predictive power of the machine learning model.
17) When training a deep learning model you observe that after a few epochs
the accuracy of the model decreases. How will you address this problem?
A decrease in the accuracy of a deep learning model after a few epochs implies that the
model is memorizing the peculiarities of the training dataset rather than learning features
that generalize. This is referred to as overfitting of the deep learning model. You can use
either dropout regularization or early stopping to fix this issue. Early stopping, as the
phrase implies, means you stop training the deep learning model the moment you notice a drop
in its accuracy. Dropout regularization is a technique wherein a random subset of nodes is
dropped during training so that the remaining nodes cannot rely on any single node.
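The early stopping rule described above can be sketched in a few lines. This is a minimal illustration that assumes validation accuracy is recorded once per epoch; the `patience` parameter (how many non-improving epochs to tolerate) and the accuracy values are assumptions added for the example, not something the answer prescribes.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Accuracy never stalled for `patience` epochs: train to the end.
    return len(val_accuracies) - 1

history = [0.70, 0.78, 0.83, 0.82, 0.81, 0.79]   # made-up validation accuracies
stop_at = early_stopping_epoch(history, patience=2)
print(stop_at)
```

Using a patience larger than one avoids stopping on a single noisy epoch; in practice you would also keep the weights from the best epoch.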
18) What is the impact of an improperly set learning rate on the weights of a model?
With images as inputs, an improperly set learning rate can cause noisy features. An
ill-chosen learning rate degrades the prediction quality of a model and can result in a
neural network that does not converge.
19) What do you understand by the terms Batch, Iterations, and Epoch in training a neural
network model?
• Epoch refers to one complete pass in which the entire dataset is passed forward and
backward through the neural network exactly once.
• It is usually not possible to pass the complete dataset to the network in one go, so the
dataset is divided into parts. Each part is referred to as a Batch.
• The number of batches needed to complete one epoch is referred to as the number of
iterations. For example, if you have 60,000 data rows and the batch size is 1000 then each
epoch will run 60 iterations.
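The arithmetic in the example above is easy to check. The helper names below are made up for illustration, and the rounding up assumes the last, smaller batch is still processed when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    # Round up so a final partial batch still counts as one iteration.
    return math.ceil(n_samples / batch_size)

def total_iterations(n_samples, batch_size, epochs):
    return iterations_per_epoch(n_samples, batch_size) * epochs

print(iterations_per_epoch(60_000, 1000))        # the example from the text
print(total_iterations(60_000, 1000, epochs=10))
```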
For simple models, it could be possible to set the best learning rate value a priori. However,
for complex models it is not possible to derive the best learning rate, one that actually
yields accurate predictions, through theoretical deduction. Observation and experience play
a vital role in defining the optimal learning rate.
To answer this question one needs to explain the universal approximation theorem, which forms
the basis of why neural networks work.
According to the Universal Approximation Theorem, a neural network with a single hidden
layer containing a finite number of neurons can approximate any continuous function to a
reasonable accuracy for inputs in a specific range. However, the guarantee does not extend to
functions with large gaps or to inputs outside that range. Meaning, if a neural network is
trained with inputs between 20 and 30, we cannot be assured that it will work well for
inputs between 60 and 70.
22) What are the commonly used approaches to set the learning rate ?
• Using a fixed learning rate value for the complete learning process.
Fundamentally, there is no significant difference between deep learning networks and neural
networks. Deep learning networks are neural networks, only with more complex
architectures than those used in the 1990s. It is the availability of hardware and
computational resources that has made it feasible to implement them now.
24) You want to train a deep learning model on 10GB dataset but your machine has 4GB
RAM. How will you go about implementing a solution to this deep learning problem?
One possible way to answer this question is to say that the neural network can be trained by
accessing the data through a memory-mapped NumPy array (numpy.memmap) and using a small
batch size. A memory-mapped array does not load the complete dataset into memory; it creates
a mapping to the dataset on disk and reads only the parts that are requested. NumPy also
offers tools for saving arrays in compressed form, and its arrays can be fed batch by batch
to other NN packages like PyTorch, TensorFlow, or Keras.
25) How will the predictions of a neural network be affected if you use a ReLU activation
function and then a Sigmoid function in the final layer of the network?
If the Sigmoid is applied directly to the ReLU output, the neural network will predict only
one class for all types of inputs: the output of a ReLU activation function is always
non-negative, so the Sigmoid never returns a value below 0.5.
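This claim is quick to check numerically. The sketch below assumes the sigmoid is applied directly to the ReLU output with no layer in between, and that the class is chosen with the usual 0.5 threshold; the input values are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs = [-5.0, -0.1, 0.0, 0.3, 4.0]
probs = [sigmoid(relu(x)) for x in inputs]   # sigmoid of a non-negative number
preds = [int(p >= 0.5) for p in probs]       # threshold at 0.5

print(probs, preds)
```

Every probability is at least 0.5, so every input is assigned to the positive class regardless of its sign.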
A major drawback of perceptrons is that they can only learn linearly separable functions and
cannot handle non-linear inputs.
27) How will you differentiate between a multi-class and multi-label classification problem?
In a multi-class classification problem the classification task has more than two mutually
exclusive classes, whereas in a multi-label problem each label represents a separate, though
related, classification task. For example, classifying a set of images of
animals which may be cats, dogs, or bears is a multi-class classification problem that assumes
that each sample has only one label, meaning an image can be classified as either a cat or a
dog but not both at the same time. Now imagine an image that shows both a cat and a dog.
That image needs to be classified as both cat and dog because it shows
both animals. In a multi-label classification problem, a set of labels is assigned to each
sample and the classes are not mutually exclusive. So, a pattern can belong to one or more
classes in a multi-label classification problem.
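One common way to realize this distinction in a network's output layer is a softmax for multi-class problems and independent sigmoids for multi-label problems. The sketch below illustrates that convention; the class names follow the example above, while the logits and the 0.5 threshold are made-up assumptions.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

classes = ["cat", "dog", "bear"]
logits = [2.0, 1.5, -3.0]   # hypothetical scores for an image showing a cat and a dog

# Multi-class: probabilities compete and sum to 1, so exactly one class is picked.
probs = softmax(logits)
multi_class = classes[probs.index(max(probs))]

# Multi-label: each class gets its own yes/no decision.
multi_label = [c for c, z in zip(classes, logits) if sigmoid(z) >= 0.5]

print(multi_class, multi_label)
```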
You know how to ride a bicycle, so it will be easy for you to learn to ride a motorbike. This
is transfer learning. You have some skill and you can learn a new skill that relates to it
without having to learn it from scratch. Transfer learning is a process in which the learning
is transferred from one model to another without having to make the new model learn
everything from scratch. The learned features and weights are reused for training the new
model. Transfer learning works well for training a model easily when there is limited data.
29) What is fine tuning and how is it different from transfer learning?
In transfer learning, the feature extraction part remains untouched and only the prediction
layer is retrained by changing the weights based on the application. In contrast, in fine
tuning, the prediction layer along with the feature extraction stage can be retrained, making
the process flexible.
30) Why do we use convolutions for images instead of using fully connected layers?
Each convolution kernel in a CNN acts like its own feature detector and has partially
built-in translation invariance. Using convolutions lets one preserve, encode and make use of
the spatial information from the image, unlike fully connected layers, which do not retain any
relative spatial information.
Gradient Clipping is used to deal with the exploding gradient problem that occurs during the
backpropagation. The gradient values are forced element-wise to a particular minimum or
maximum value if the gradient has crossed the expected range. Gradient clipping provides
numerical stability while training a neural network but does not provide any performance
improvements.
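Element-wise clipping as described above amounts to a clamp on every gradient value. The sketch below is a plain-Python illustration with made-up gradients and a clipping range of [-1, 1]; deep learning frameworks ship their own clipping utilities, which are not shown here.

```python
def clip_by_value(grads, clip_min, clip_max):
    # Force each gradient into [clip_min, clip_max]; values already inside are unchanged.
    return [min(max(g, clip_min), clip_max) for g in grads]

grads = [0.3, -12.0, 250.0, -0.01]
clipped = clip_by_value(grads, -1.0, 1.0)
print(clipped)
```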
End-to-end learning is a deep learning process where a model gets raw data as the input and
all the various parts are trained simultaneously to produce the desired outcome with no
intermediate tasks. The advantage of end-to-end learning is that there is no need for
explicit feature engineering, which usually leads to a lower bias. A good example that you
can quote in the context of end-to-end learning is driverless cars. They use human-provided
input as guidance and are trained to automatically learn and process the information using a
CNN to complete tasks.
34) What is the advantage of using small kernels like 3x3 over using a few large ones?
Stacking smaller kernels lets you use more layers, and therefore a greater number of
activation functions, which lets the CNN learn a more discriminative mapping function. Also,
a stack of small kernels covers the same spatial context as one large kernel while using
fewer computations and parameters, making small kernels a better choice than large ones.
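The parameter saving can be checked with simple arithmetic. The sketch below compares two stacked 3x3 layers with one 5x5 layer, assuming stride-1 convolutions, the same number of input and output channels (64 is an arbitrary choice), and no bias terms; the helper names are invented for the example.

```python
def conv_params(kernel_size, in_channels, out_channels):
    # Weights in one convolutional layer, biases ignored.
    return kernel_size * kernel_size * in_channels * out_channels

def stacked_receptive_field(kernel_size, n_layers):
    # With stride 1, each extra layer widens the receptive field by (kernel_size - 1).
    return 1 + n_layers * (kernel_size - 1)

C = 64
two_3x3 = 2 * conv_params(3, C, C)
one_5x5 = conv_params(5, C, C)

print(two_3x3, one_5x5)
print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))
```

Both options see a 5x5 patch of the input, but the stacked 3x3 layers need fewer weights and apply two non-linearities instead of one.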
35) How can you generate a dataset on multiple cores in real-time that can be fed to the deep
learning model?
One of the major challenges today in computer vision (CV) is the need to load large datasets
of videos and images when there is not enough memory on the machine. In such situations,
data generators act like a magic wand for loading a dataset that is memory consuming. You
can talk about the data generators that Keras provides. When working with big
data, in most cases it is not required to load all the data into RAM, as that would
waste memory, could lead to memory overflow, and would take longer to process.
Making use of generator functions is highly beneficial then, as they generate the data to be
fed directly into the model batch by batch during training.
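The batch-by-batch idea can be sketched with an ordinary Python generator. This is a single-process illustration only: it does not use multiple cores or the Keras generator classes, and `load_sample` and `fake_load` are hypothetical stand-ins for whatever reads one image or video from disk.

```python
def batch_generator(n_samples, batch_size, load_sample):
    """Yield one batch at a time so only `batch_size` samples are held in memory."""
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        yield [load_sample(i) for i in range(start, stop)]

def fake_load(i):
    # Stand-in for reading and preprocessing sample i from disk.
    return [float(i), float(i) ** 2]

batches = list(batch_generator(10, 4, fake_load))
print([len(b) for b in batches])
```

In real training you would iterate over the generator instead of collecting it into a list, so each batch can be discarded once the model has consumed it.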
36) How do you bring balance to the force when handling imbalanced datasets in deep
learning?
It is next to impossible to have a perfectly balanced real-world dataset when working on deep
learning problems so there will be some level of class imbalance within the data that can be
tackled either by –
• Weight Balancing
• Over and Under Sampling
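Weight balancing is often done by giving each class a loss weight inversely proportional to its frequency. The sketch below uses one common heuristic, n_samples / (n_classes * class_count); the label counts are made up and the function name is invented for the example.

```python
from collections import Counter

def balanced_class_weights(labels):
    # Rare classes get large weights, frequent classes get small ones.
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

labels = ["cat"] * 90 + ["dog"] * 10   # a 9:1 imbalanced toy dataset
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary is the kind of mapping that is typically passed to a framework's class-weight option or used to scale the per-sample loss.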
37) What are the benefits of using batch normalization when training a neural network?
• Batch normalization optimizes the network training process making it easier to build
and faster to train a deep neural network.
• Batch normalization regulates the values going into each activation function making
activation functions more viable because non-linearities that don’t seem to work well
become viable with the use of batch normalization.
• Batch normalization makes it easier to initialize weights and also allows the use of
higher learning rates ultimately increasing the speed at which the network trains.
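What "regulating the values going into each activation function" means can be seen in the batch normalization forward pass. The sketch below normalizes one feature across a batch in plain Python; `gamma`, `beta`, and `eps` follow the usual formulation (learned scale, learned shift, and a small constant for numerical stability), and the activation values are made up.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize to zero mean and unit variance over the batch, then scale and shift.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [10.0, 200.0, -50.0, 40.0]   # widely spread pre-activation values
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(normalized, norm_mean, norm_var)
```

This shows the training-time computation only; at inference time running averages of the mean and variance are used instead of batch statistics.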
LSTM works well for problems where accuracy is critical and sequence is large whereas if
you want less memory consumption and faster operations, opt for GRU. Refer here for a
detailed answer:
https://www.dezyre.com/recipes/what-is-difference-between-gru-and-lstm-explain-with-example
39) RMSProp and Adam optimizers adjust gradients. Does this mean that they perform
gradient clipping?
This does not inherently mean that they perform gradient clipping. Gradient clipping
involves setting predetermined values beyond which the gradients cannot go, whereas Adam
and RMSProp make multiplicative adjustments to gradients.
40) Can you name a few hyperparameters used for training a neural network?
When training any neural network there are two types of hyperparameters: those that define
the structure of the neural network and those that determine how the neural network is
trained. Listed are a few hyperparameters that are set before training any neural network –
• Initialization of weights
• Setting the number of hidden layers
• Learning Rate
• Number of epochs
• Activation Functions
• Batch Size
• Momentum
Multi-task learning with deep neural networks is a subfield wherein several tasks are learned
by a shared model. This reduces overfitting, enhances data efficiency, and speeds up the
learning process with the use of auxiliary information. Multi-task learning is useful when
there is only a small amount of data for any given task and we can benefit from training a
deep learning model on a large dataset.
44) To what kind of problems can the cross-entropy loss function be applied?
46) How important is it to shuffle the training data when using batch gradient descent?
Shuffling the training dataset will not make much of a difference because, in batch gradient
descent, the gradient is calculated at every epoch using the complete training dataset.
47) What is the benefit of using max-pooling in classification convolutional neural networks?
The feature maps become smaller after max-pooling in a CNN, which helps reduce the
computation and also gives more translation invariance. Also, we don’t lose much semantic
information because we’re taking the maximum activation.
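A small worked example shows how max-pooling shrinks a feature map while keeping the strongest activations. The sketch below assumes a 2x2 window with stride 2 and a feature map with even dimensions; the values are made up.

```python
def max_pool_2x2(feature_map):
    # Take the maximum of every non-overlapping 2x2 block (stride 2).
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2x2(fmap)
print(pooled)
```

The 4x4 map becomes 2x2, so later layers process a quarter of the values, and a strong activation keeps its pooled value even if it moves by one pixel inside its block.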
48) Can you name a few data structures that are commonly used in deep learning?
You can talk about computational graphs, tensors, matrices, data frames, and lists.
49) Can you add a L2 regularization to a recurrent neural network to overcome the vanishing
gradient problem?
This can actually worsen the vanishing gradient problem because the L2 regularization will
shrink weights towards zero.
It is difficult to use standard batch normalization in an RNN because statistics are computed
per batch and thus batch normalization does not consider the recurrent part of the neural
network. An alternative is layer normalization in the RNN, or reparametrizing the LSTM layer
so that it allows the use of batch normalization.
2) Which deep learning framework do you prefer to work with – PyTorch or TensorFlow
and why?
3) Talk about a deep learning project you’ve worked on and the tools you used.
4) Have you used the ReLU activation function in your neural network? Can you explain
how the ReLU activation function works?
5) How often do you use pre-trained models for your neural network?
6) What does the future of video analysis look like with the use of deep learning solutions?
How effective/good is video analysis currently?
7) Tell us about your passion for deep learning. Do you like to participate in deep
learning/machine learning hackathons, write blogs around novel deep learning tools, or
attend local meetups, etc.?
8) Describe the last time you felt frustrated solving a deep learning challenge, and how
did you overcome it?
9) What is more important to you: the performance of your deep learning model or its
accuracy?
10) Given the dataset, how will you decide which deep learning model to use and how to
implement it?
11) What is the last deep learning research paper you’ve read?
12) What are the most commonly used neural network paradigms? (Hint: Talk about
Encoder-Decoder Structures, LSTM, GAN, and CNN)
13) Is it possible to use a neural network as a tool of dimensionality reduction?
14) How do deep learning models tackle the curse of dimensionality?
15) What are the pros and cons of using neural networks?
16) How is a Capsule Neural Network different from a Convolutional Neural Network?
17) What is a GAN and what are the different types of GAN you’ve worked with?
18) For any given problem, how do you decide if you have to use transfer learning or fine
tuning?
19) Can you share some tricks or techniques that you use to fight overfitting of a deep
learning model and get better generalization?
20) Explain the difference between Gradient Descent and Stochastic Gradient Descent.
21) Which one do you think is more powerful – a two layer NN without any activation
function or a two layer decision tree?
22) Can you name the breakthrough project that garnered the popularity and adoption of
deep learning?
23) Differentiate between bias and variance with respect to deep learning models. How
can you achieve a balance between the two?
24) What are your thoughts about using GPT-3 for our business?
25) Can you train a neural network without using back-propagation? If yes, what
technique will you use to accomplish this?
26) Describe your research experience in the field of deep learning.
27) Explain the working of a perceptron.
28) Differentiate between a feed forward neural network and a recurrent neural network.
29) Why don’t we see the exploding or vanishing gradient problem in feed forward neural
networks?
30) How do you decide the size of the filter when performing a convolution operation in a
CNN?
31) When designing a CNN, can we find out how many convolutional layers we should use?
32) What do you understand by a computational graph?
33) Differentiate between PCA and Autoencoders.
34) Which one is better for reconstruction: a linear autoencoder or PCA?
35) How is deep learning related to representation learning?
36) Explain the Borel Measurable function.
37) How are Gradient Boosting and Gradient Descent different from each other?
38) In a logistic regression model, will all the gradient descent algorithms lead to the
same model if run for a long time?
39) What is the benefit of shuffling a training dataset when using batch gradient descent?
40) Explain the cross-entropy loss function.
41) Why is cross-entropy preferred as the cost function for multi-class classification
problems?
42) What happens if you do not use any activation functions in a neural network?
43) What is the importance of having residual neural networks?
44) There is a neuron in the hidden layer which always results in a large error in
backpropagation. What could be the reason for this?
45) Explain the working of forward and back propagation in deep learning.
46) Is there any difference between feature learning and feature extraction?
47) Do you know the difference between the padding parameters valid and same padding
in a CNN?
48) How does deep learning outperform traditional machine learning models in time
series analysis?
49) Can you explain the parameter sharing concept in deep learning?
50) How many trainable parameters are there in a Gated Recurrent Unit cell and in a Long
Short Term Memory cell?
So that pretty much wraps it up for this post – the most common deep learning engineer
interview questions and answers. Whether you’re a beginner or a seasoned professional,
hopefully these deep learning job interview questions and answers have been useful and have
boosted your confidence.
Congrats! You now know the kind of deep learning interview questions you
can expect in your next job interview. However, there is still a lot to learn to solidify your
deep learning knowledge and get hands-on experience working with diverse deep learning
projects and deep learning frameworks like PyTorch, TensorFlow, and Keras.
ProjectPro helps you move right into practice with 60+ end-to-end solved data science
and machine learning projects where you will learn how to develop machine learning/deep
learning models from scratch and develop a high-level ability to think about productionized
machine learning systems. Get started today to take your deep learning skills to the next level
and build a fantastic job-winning portfolio of projects.
We would love to hear your own machine learning or deep learning interview experiences. If
you have any other interesting deep learning interview questions to share that can be helpful,
please send an email with the questions and answers to khushbu.shah@dezyre.com to make
the learning experience for the community enriching and valuable. All the questions and
answers shared would be posted on the blog with due credit to the author.