Data scientist interview questions ❓
Company: Splunk
Q. How should the weights of a neural network be initialized?
Ans: There are two methods here: we can either initialize the weights to zero or assign them randomly.
Initializing all weights to 0: This makes your model similar to a linear model. All the neurons
and every layer perform the same operation, giving the same output and making the deep
net useless.
Initializing all weights randomly: Here, the weights are assigned randomly by initializing them
very close to 0. It gives better accuracy to the model since every neuron performs different
computations. This is the most commonly used method.
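A minimal NumPy sketch of the symmetry problem (the layer sizes and the tanh activation are illustrative assumptions, not from the original answer):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                  # a toy batch of 4 examples, 3 features

W_zero = np.zeros((3, 5))                    # zero initialization: every hidden unit is identical
h_zero = np.tanh(x @ W_zero)
print(np.allclose(h_zero, h_zero[:, [0]]))   # True: all 5 units give the same output

W_rand = rng.normal(scale=0.01, size=(3, 5)) # small random initialization breaks the symmetry
h_rand = np.tanh(x @ W_rand)
print(np.allclose(h_rand, h_rand[:, [0]]))   # False: units compute different things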
Q. Explain the difference between Stochastic, Batch, and Mini-batch Gradient Descent.
Ans: Stochastic Gradient Descent: We use only a single training example to calculate the gradient and update the parameters.
Batch Gradient Descent: We calculate the gradient for the whole dataset and perform the update at each iteration.
Mini-batch Gradient Descent: One of the most popular optimization algorithms; it is a variant of Stochastic Gradient Descent where, instead of a single training example, a mini-batch of samples is used.
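A rough NumPy sketch of the mini-batch variant on a toy linear-regression problem (the data, learning rate, and batch size are made up for illustration; batch GD would pass the full X, y and SGD a single row):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def gradient(w, Xb, yb):
    # Gradient of mean squared error for a linear model
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr, batch_size = 0.1, 16
for epoch in range(50):
    # Mini-batch: a random slice; SGD would use size=1, batch GD the whole dataset
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w -= lr * gradient(w, X[idx], y[idx])
print(w)   # close to [1.0, -2.0, 0.5]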
3. What are the feature selection methods used to select the right variables?
The best analogy for selecting features is "bad data in, bad answer out." When we're limiting or selecting features, it's all about choosing the useful ones.
Filter Methods
This involves:
• ANOVA
• Chi-Square
Wrapper Methods
This involves:
• Forward Selection: We test one feature at a time and keep adding them until we get a good fit.
• Backward Selection: We test all the features and start removing them to see what works better.
• Recursive Feature Elimination: Recursively looks through all the different features and how they pair together.
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
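As a hedged illustration, here is a small scikit-learn sketch of Recursive Feature Elimination on synthetic data (the dataset sizes and the logistic-regression estimator are arbitrary choices, not from the original answer):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
# Recursively drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected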
· Joint sampling is done when there are equal numbers of events and non-events. It is not appropriate for imbalanced data.
· Separate sampling is done for imbalanced data. For a rare event, all observations are kept when target = 1 and only a few observations are kept when target = 0.
Date - 20-12-2023
Topic: sets & groups, bag of words, nested triggers, TPR vs FPR
Data is grouped using sets and groups according to predefined criteria. The primary distinction between the two is that although a set can have only two options—either in or out—a group can divide the dataset into several groups. A user should decide which sets or groups to apply based on the conditions.
Bag of Words is a text representation that describes the frequency with which words appear in a document.
Triggers may implement DML by using INSERT, UPDATE, and DELETE statements. Triggers that contain DML and fire other triggers for data modification are called Nested Triggers.
True Positive Rate or Recall: It gives us the percentage of true positives captured by the model out of all observations in the actual positive class.
TPR = TP/(TP+FN)
False Positive Rate: It gives us the percentage of false positives produced by the model out of all observations in the actual negative class.
FPR = FP/(FP+TN)
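A tiny Python helper computing both rates from the four confusion-matrix counts (the example counts are made up):

def rates(tp, fn, fp, tn):
    # TPR (recall) and FPR from the confusion-matrix counts
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

print(rates(tp=80, fn=20, fp=10, tn=90))   # (0.8, 0.1)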
Date: 12-12-2023
Ans: List elements can be removed using the pop() or remove() method. The difference between these two functions is that pop() removes the element at a given index and returns it, whereas remove() deletes the first occurrence of a given value and returns nothing.
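A quick demonstration on a Python list:

nums = [10, 20, 30, 20]
value = nums.pop(1)    # removes the element at index 1 and returns it
print(value, nums)     # 20 [10, 30, 20]
nums.remove(20)        # deletes the first occurrence of the value 20, returns None
print(nums)            # [10, 30]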
Advantages of Views:
As there is no physical location where the data in a view is stored, it generates output without wasting resources.
Data access is restricted, as a view does not allow commands like insert, update, and delete.
Disadvantages of Views:
Much memory space is occupied when a view is created for large tables.
4. Describe the Difference Between Window Functions and Aggregate Functions in SQL.
The main difference between window functions and aggregate functions is that aggregate
functions group multiple rows into a single result row; all the individual rows in the group are
collapsed and their individual data is not shown. On the other hand, window functions
produce a result for each individual row. This result is usually shown as a new column value
in every row within the window.
The Ribbon is basically your key interface with Excel and it appears at the top of the Excel
window. It allows users to access many of the most important commands directly. It consists
of many tabs such as File, Home, View, Insert, etc. You can also customize the ribbon to suit
your preferences. To customize the Ribbon, right-click on it and select the “Customize the
Ribbon” option.
———————————————————-
Date: 30-11-2023
Company: Facebook
Question 1 : How would you approach building a recommendation system for personalized
content on Facebook? Consider factors like scalability and user privacy.
Question 2 : Describe a situation where you had to navigate conflicting opinions within your
team. How did you facilitate resolution and maintain team cohesion?
Question 3 : How would you enhance the security of user data on Facebook, considering the
evolving landscape of cybersecurity threats?
- Answer: Enhancing the security of user data on Facebook involves implementing robust
encryption mechanisms, access controls, and regular security audits. Ensuring compliance
with privacy regulations and proactive threat monitoring are essential.
Question 4 : Design a real-time notification system for Facebook, ensuring timely delivery of
notifications to users across various platforms.
- Answer: Designing a real-time notification system for Facebook requires technologies like
WebSocket for real-time communication and push notifications. Ensuring scalability and
reliability through distributed systems is crucial for timely delivery.
Date: 27-11-2023
Company: Twitter
1: How would you preprocess and tokenize text data from tweets for sentiment analysis?
Discuss potential challenges and solutions.
- Answer: Preprocessing and tokenizing text data for sentiment analysis involves tasks like lowercasing, removing stop words, and stemming or lemmatization. Handling challenges such as emojis, slang, and noisy text is crucial. Tools like NLTK or spaCy can assist in these tasks.
3: Write a Python or Scala function to count the frequency of hashtags in a given collection
of tweets.
- Answer (Python):
def count_hashtags(tweet_collection):
    hashtags_count = {}
    for tweet in tweet_collection:
        # Count every whitespace-separated token that starts with '#'
        for token in tweet.split():
            if token.startswith("#"):
                hashtags_count[token] = hashtags_count.get(token, 0) + 1
    return hashtags_count
4: How does graph analysis contribute to understanding user interactions and content
propagation on Twitter? Provide a specific use case.
- Answer: Graph analysis on Twitter involves examining user interactions. For instance,
identifying influential users or detecting communities based on retweet or mention networks.
Algorithms like PageRank or Louvain Modularity can aid in these analyses.
Date: 26-11-2023
Company: Microsoft
1. Write a SQL query to find the top three products with the highest revenue in the last
quarter from a sales database.
Answer: A SQL query to find the top three products by revenue (assuming columns ProductID and Revenue, and that Sales holds last quarter's records):
SELECT TOP 3 ProductID, SUM(Revenue) AS TotalRevenue
FROM Sales
GROUP BY ProductID
ORDER BY TotalRevenue DESC;
2. Explain the concept of ensemble learning and provide examples of scenarios where
ensemble methods could be beneficial.
3. How would you build a language translation system using Microsoft's Cognitive Services
or other NLP tools?
Answer: To build a language translation system, I would use Azure Cognitive Services'
Translator API. By sending text to the API, it provides translated content. Preprocess text,
send requests to the API, and handle responses. Azure Cognitive Services also supports
neural machine translation for improved accuracy.
4. Explain the concept of transfer learning in the context of deep neural networks. How can
transfer learning be applied to tasks like image classification?
Answer: Transfer learning involves using a pre-trained model on a related task to improve
performance on a new task. For image classification, a pre-trained Convolutional Neural
Network (CNN) can be fine-tuned on a specific dataset, saving training time and leveraging
learned features.
Date: 23-11-2023
Company: Google
1. Explain the concept of transfer learning in the context of deep learning models. How can it
be beneficial in practical applications?
Ans- Transfer learning involves leveraging pre-trained models on large datasets and
adapting them to new, related tasks with smaller datasets. In deep learning, this is achieved
by reusing the knowledge gained during the training of one model on a different, but related,
task. This is particularly beneficial when the new task has limited labeled data.
Practical applications include image recognition, where a model pre-trained on a dataset like
ImageNet can be fine-tuned for a specific domain. Transfer learning accelerates model
convergence, requires less labeled data, and helps overcome the challenges of training
deep neural networks from scratch.
2. Given a large dataset, how would you efficiently sample a representative subset for model
training? Discuss the trade-offs involved.
Answer- To efficiently sample a representative subset, one can use techniques like random
sampling or stratified sampling. For random sampling, simple random sampling or
systematic sampling methods can be employed. For stratified sampling, data is divided into
strata, and samples are randomly selected from each stratum.
Trade-offs involve the choice between biased and unbiased sampling. Random sampling
may not capture rare events, while stratified sampling might introduce complexity but
ensures representation. The size of the sample is also crucial; a too-small sample may not
be representative, while a too-large sample may incur unnecessary computational costs.
3. How would you approach analyzing A/B test results to determine the effectiveness of a
new feature on a platform like Google Search?
Answer: A/B testing involves comparing the performance of two versions (A and B) to
determine the impact of a change. To analyze A/B test results:
- Define Metrics: Clearly define key metrics (e.g., click-through rate, user engagement)
before the test.
- Random Assignment: Ensure random assignment of users to control (A) and experimental
(B) groups.
- Statistical Significance: Use statistical tests (e.g., t-test) to determine if differences between groups are statistically significant (see the sketch after this list).
- Segmentation: Analyze results across different user segments for nuanced insights.
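A minimal sketch of the statistical-significance step with SciPy's independent-samples t-test (the metric values below are simulated for illustration, not real experiment data):

from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(loc=0.10, scale=0.05, size=1000)      # e.g. click-through rates, group A
treatment = rng.normal(loc=0.11, scale=0.05, size=1000)    # group B

t_stat, p_value = stats.ttest_ind(treatment, control)
print(p_value < 0.05)   # True here: the difference is statistically significant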
4. You have access to search query logs. How would you identify and address potential
biases in the search results?
Answer: To identify and address biases in search results:
- Query Intent: Understand user query intent and ensure diverse queries are
well-represented.
- Evaluate Results: Assess the diversity of results to avoid favoring specific perspectives.
- User Feedback: Gather feedback from users to identify biased or inappropriate results.
Date: 20-11-2023
1. How would you handle imbalanced datasets when building a predictive model, and what
techniques would you use to ensure model performance?
Answer: When dealing with imbalanced datasets, techniques like oversampling the minority
class, undersampling the majority class, or using advanced methods like SMOTE can be
employed. Additionally, adjusting class weights in the model or using ensemble techniques
like RandomForest can address imbalanced data challenges.
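A small sketch of the SMOTE technique mentioned above, using the imbalanced-learn library (assumes imblearn is installed; the synthetic dataset is illustrative):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print(Counter(y))                        # heavily imbalanced, e.g. roughly {0: 950, 1: 50}
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # classes balanced after oversampling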
2. Explain the K-means clustering algorithm and its applications. How would you determine
the optimal number of clusters?
Answer: The K-means clustering algorithm partitions data into 'K' clusters based on
similarity. The optimal 'K' can be determined using methods like the Elbow Method or
Silhouette Score. Applications include customer segmentation, anomaly detection, and
image compression.
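A short scikit-learn sketch of the Elbow Method: print the inertia (cost) for each K and look for the bend (the synthetic blobs are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia drops sharply until k=4, then flattens: the "elbow"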
3.Describe a scenario where you successfully applied time series forecasting to solve a
business problem. What methods did you use?
Answer: In time series forecasting, one would start with data exploration, identify seasonality
and trends, and use techniques like ARIMA, Exponential Smoothing, or LSTM for modeling.
Evaluation metrics like MAE, RMSE, or MAPE help assess forecasting accuracy.
4. Discuss the challenges and considerations involved in deploying machine learning models
to a production environment.
Answer: Model deployment involves converting a trained model into a format suitable for
production, using frameworks like Flask or Docker. Deployment considerations include
scalability, monitoring, and version control. Tools like Kubernetes can aid in managing
deployed models.
5. Explain the concept of ensemble learning, and how might ensemble methods improve the
robustness of a predictive model?
————————————————————
Date: 22-10-2023
Company: Accenture
The silhouette coefficient is a measure of how well clustered together a data point is with
respect to the other points in its cluster. It is a measure of how similar a point is to the points
in its own cluster, and how dissimilar it is to the points in other clusters. The silhouette
coefficient ranges from -1 to 1, with 1 being the best possible score and -1 being the worst
possible score.
Trends and seasonality are two characteristics of time series metrics that break many
models. Trends are continuous increases or decreases in a metric’s value. Seasonality, on
the other hand, reflects periodic (cyclical) patterns that occur in a system, usually rising
above a baseline and then decreasing again.
Bag of Words is a commonly used model that depends on word frequencies or occurrences
to train a classifier. This model creates an occurrence matrix for documents or sentences
irrespective of its grammatical structure or word order.
4. What is a Self-Join?
A self-join is a type of join that can be used to connect two tables. As a result, it is a unary
relationship. Each row of the table is attached to itself and all other rows of the same table in
a self-join. As a result, a self-join is mostly used to combine and compare rows from the
same database table.
5. Explain the Law of Large Numbers.
The 'Law of Large Numbers' states that if an experiment is repeated independently a large number of times, the average of the individual results gets close to the expected value. Likewise, the sample variance and sample standard deviation converge toward the population values they estimate.
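A quick simulation of the law (the seed and sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
for n in (100, 100_000):
    flips = rng.integers(0, 2, size=n)   # fair coin: 0 = tails, 1 = heads
    print(n, flips.mean())               # the mean drifts toward the expected value 0.5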
————————————————————-
Date - 15-10-2023
Ans: Decorators are used to add some design patterns to a function without changing its
structure. Decorators generally are defined before the function they are enhancing. To apply
a decorator we first define the decorator function. Then we write the function it is applied to
and simply add the decorator function above the function it has to be applied to. For this, we
use the @ symbol before the decorator.
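A minimal example of defining a decorator and applying it with the @ symbol (the logging behaviour is just an illustration):

def logged(func):
    # The decorator wraps func without changing its structure
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args}")
        return func(*args, **kwargs)
    return wrapper

@logged                 # the @ symbol applies the decorator defined above
def add(a, b):
    return a + b

print(add(2, 3))        # prints the log line, then 5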
4. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of
the given dataset?
————————————————————-
Date: 14-10-2023
1. Explain some cases where k-Means clustering fails to give good results
k-means has trouble clustering data where clusters are of various sizes and densities. Outliers will cause the centroids to be dragged, or the outliers might get their own cluster instead of being ignored; outliers should be clipped or removed before clustering. If the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples, so dimensions should be reduced before clustering.
2. If your Time-Series Dataset is very long, what architecture would you use?
If a time-series dataset is very long, LSTMs are ideal for it because they can process not only single data points but also entire sequences of data. For an even stronger representational capacity, stacking multiple LSTM layers is better. Another method for a long time-series dataset is to use CNNs to extract information.
Power BI is a strong business analytics tool that creates useful insights and reports by collating data from unrelated sources. This data can be extracted from any source like Microsoft Excel or hybrid data warehouses. Power BI delivers a high level of utility and purpose through its interactive graphical interface and visualizations.
When the KNN algorithm gets the training data, it does not learn and make a model; it just stores the data. Instead of finding any discriminative function with the help of the training data, it follows instance-based learning and uses the training data only when it actually needs to make a prediction on unseen datasets. As a result, KNN does not immediately learn a model but rather delays the learning, which is why it is referred to as a Lazy Learner.
5. Explain the difference between drop and truncate.
In SQL, the DROP command is used to remove an entire table or database, including its indexes, data, and more, whereas the TRUNCATE command is used to remove all the rows from a table while keeping the table structure.
————————————————————-
Date - 06-08-2023
Topic: k-means, CNN layers, constraints in SQL, Tableau, NLP, bagging and boosting
1. Explain some cases where k-Means clustering fails to give good results
k-means has trouble clustering data where clusters are of various sizes and densities.
Outliers will cause the centroids to be dragged, or the outliers might get their own cluster
instead of being ignored. Outliers should be clipped or removed before clustering.
1. Input Layer: The input layer in CNN should contain image data. Image data is represented
by a three-dimensional matrix. We have to reshape the image into a single column.
For example, suppose we have the MNIST dataset with images of dimension 28 x 28 = 784; you need to convert each image into a 784 x 1 vector before feeding it into the input. If we have "k" training examples in the dataset, then the dimension of the input will be (784, k).
2. Convolutional Layer: To perform the convolution operation, this layer is used which
creates several smaller picture windows to go over the data.
3. ReLU Layer: This layer introduces the non-linearity to the network and converts all the
negative pixels to zero. The final output is a rectified feature map.
3. What Would You Do If Some Countries/Provinces (Any Geographical Entity) are Missing
and Displaying a Null When You Use Map View in Tableau?
When working with maps and geographical fields, unknown or ambiguous locations are
identified by the indicator in the lower right corner of the view.
Edit Locations - correct the locations by mapping your data to known locations
Filter Data - exclude the unknown locations from the view using a filter. The locations will not
be included in calculations
Show Data at Default Position - show the values at the default position of (0, 0) on the map.
Constraints are used to specify the rules concerning data in the table. They can be applied to single or multiple fields in an SQL table, either when the table is created or afterwards using the ALTER TABLE command. The constraints include:
NOT NULL - Restricts NULL value from being inserted into a column.
DEFAULT - Automatically assigns a default value if no value has been specified for the field.
Bagging is an ensemble of homogeneous weak learners that learn independently from each other in parallel and are combined, for example by averaging, to determine the model output. Boosting is also an ensemble of homogeneous weak learners, but it works differently from Bagging: the learners are trained sequentially and adaptively, each one improving on the predictions made so far.
————————————————————-
Date: 01-08-2023
Time series can be phrased as supervised learning: given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem.
In the sliding window method, the previous time steps can be used as input variables, and the next time steps can be used as the output variable.
In statistics and time series analysis, this is called a lag or lag method. The number of previous time steps is called the window width or size of the lag. This sliding window is the basis for how we can turn any time series dataset into a supervised learning problem.
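A small sketch of the transformation (the window width of 3 is chosen arbitrarily):

import numpy as np

def sliding_window(series, width):
    # Each row of X holds `width` past values; y holds the next value
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = np.array(series[width:])
    return X, y

X, y = sliding_window([1, 2, 3, 4, 5, 6], width=3)
print(X)   # [[1 2 3] [2 3 4] [3 4 5]]
print(y)   # [4 5 6]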
A subquery is a query inside another query, defined to retrieve data or information back from the database. The outer query is called the main query, whereas the inner query is called the subquery. Subqueries are always executed first, and the result of the subquery is passed on to the main query. A subquery can be nested inside a SELECT, UPDATE, or any other query, and it can use comparison operators such as >, < or =.
4. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and Workbook?
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A worksheet contains a single view along with shelves, legends, and the Data pane.
Random forest outperforms decision trees, and it also does not have the habit of overfitting
the data as decision trees do.
A decision tree trained on a specific dataset will become very deep and cause overfitting. To
create a random forest, decision trees can be trained on different subsets of the training
dataset, and then the different decision trees can be averaged with the goal of decreasing
the variance.
Disadvantages of Naive Bayes:
It relies on a very big assumption that the independent variables are not related to each other.
It is generally not suitable for datasets with large numbers of numerical attributes.
If a rare case is not in the training dataset but is in the testing dataset, it will almost certainly be predicted wrongly.
————————————————————-
Date: 30-07-2023
1. You are given a data set consisting of variables with more than 30 percent missing values.
How will you deal with them?
Ans.
If the data set is large, we can simply remove the rows with missing data values; this is the quickest way, and we then use the rest of the data to predict the values.
For smaller data sets, we can substitute missing values with the mean or mode, or use forward or backward fill. There are different ways to do so, such as df.mean() combined with df.fillna(df.mean()).
Ans. Hypothesis testing is defined as the process of choosing hypotheses for a particular probability distribution on the basis of observed data. Hypothesis testing is a core and important topic in statistics.
The Alternative Hypothesis is a statistical hypothesis used in hypothesis testing which states that there is a significant difference between the set of variables. It is often referred to as the hypothesis other than the null hypothesis and is denoted by H1 (H-one). The acceptance of the alternative hypothesis depends on the rejection of the null hypothesis: unless the null hypothesis is rejected, the alternative hypothesis cannot be accepted.
Q3. Why use Decision Trees?
Ans. First, a decision tree is a visual representation of a decision situation (and hence aids
communication). Second, the branches of a tree explicitly show all those factors within the
analysis that are considered relevant to the decision (and implicitly those that are not).
Observational data comes from observational studies which are when you observe certain
variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you control certain
variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control
group lives their lives normally. The test group is told to drink a glass of wine every night for
30 days. Then research can be conducted to see how wine affects sleep.
Ans. The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.
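A quick simulation: sample means drawn from a clearly non-normal (exponential) population still cluster normally around the population mean (the population and sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal population
sample_means = [rng.choice(population, size=50, replace=True).mean() for _ in range(2000)]
print(np.mean(sample_means))   # close to the population mean (about 2.0)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2.0 / sqrt(50), about 0.28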
————————————————————-
Date: 25-07-2023
Company: Adobe
• In overfitting, the model performs well on the training data but fails to generalize to any new data. In underfitting, the model is too simple and is not able to identify the correct relationship. The bias and variance conditions are as follows.
• Overfitting – Low bias and High Variance results in the overfitted model. The decision tree
is more prone to Overfitting.
• Underfitting – High bias and Low Variance. Such a model doesn’t perform well on test data
also. For example – Linear Regression is more prone to Underfitting.
Ans: Complex models, like the Random Forest, Neural Networks, and XGBoost are more
prone to overfitting. Simpler models, like linear regression, can overfit too – this typically
happens when there are more features than the number of instances in the training data.
Ans: We need to perform Feature Scaling when we are dealing with Gradient Descent
Based algorithms (Linear and Logistic Regression, Neural Network) and Distance-based
algorithms (KNN, K-means, SVM) as these are very sensitive to the range of the data points.
The values of a logistic function will range from 0 to 1. The values of Z will vary from -infinity
to +infinity.
A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity.
————————————————————-
Date: 18-07-2023
Q. How will you measure the Euclidean distance between the two arrays in NumPy?
A. The Euclidean distance in mathematics is the magnitude or length of the line segment between two points. We first create two NumPy arrays, then compute the Euclidean distance directly using NumPy's linalg.norm() function.
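A minimal sketch of the two steps (the example coordinates are made up):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(np.linalg.norm(a - b))   # 5.0, the length of the segment between the two points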
A. Batch Gradient Descent is very slow on large training sets, since it performs calculations over the entire training set at each step; as a result, Batch GD becomes extremely computationally expensive. SGD is stochastic in nature: it picks a "random" instance of training data at each step and computes the gradient from it, which is significantly faster than Batch GD because there is much less data to process at once.
Q. What is root cause analysis?
————————————————————-
Date: 04-04-2023
Company: Reliance JIO
Topic: svm kernels, cross validation, relationship types, data types in tableau, NLP
Polynomial kernel - When you have discrete data that has no natural notion of smoothness.
Radial basis kernel - Create a decision boundary able to do a much better job of separating
two classes than the linear kernel.
2. What is Cross-Validation?
Cross-validation is a method of splitting all your data into three parts: training, testing, and validation data. Data is split into k subsets, and the model is trained on k-1 of those subsets.
The last subset is held for testing. This is done for each of the subsets. This is k-fold
cross-validation. Finally, the scores from all the k-folds are averaged to produce the final
score.
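As a sketch, scikit-learn's cross_val_score performs exactly this train-on-k-1, score-on-the-rest rotation (the iris dataset and logistic-regression model are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on 4 subsets, score on the held-out one, rotate, then average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())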
One-to-One - This can be defined as the relationship between two tables where each record
in one table is associated with the maximum of one record in the other table.
One-to-Many & Many-to-One - This is the most commonly used relationship where a record
in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are needed for
defining a relationship.
Data types in Tableau include: text (string) values, date values, date & time values, numerical values, boolean values, and geographic values.
The RNN is a stateful neural network, which means that it not only retains information from
the previous layer but also from the previous pass. Thus, this neuron is said to have
connections between passes, and through time.
For the RNN the order of the input matters due to being stateful. The same words with
different orders will yield different outputs.
RNN can be used for unsegmented, connected applications such as handwriting recognition
or speech recognition.
————————————————————-
Date - 05-07-2023
3. What is the meaning of KPI in statistics?
4. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of
the given dataset?
There are two main data structures supported by the pandas library: Series and DataFrame. Both are built on top of NumPy. A Series is a one-dimensional data structure, and a DataFrame is the two-dimensional data structure in pandas. There was one more axis label known as Panel, a three-dimensional data structure that included items, major_axis, and minor_axis; it has been removed from recent versions of pandas.
————————————————————-
Date: 30-06-2023
1. Can you explain how the memory cell in an LSTM is implemented computationally?
The memory cell in an LSTM is implemented with a forget gate, an input gate, and an output gate. The forget gate controls how much information from the previous cell state is forgotten. The input gate controls how much new information from the current input is allowed into the cell state. The output gate controls how much information from the cell state is allowed to pass out to the next hidden state.
A CTE (Common Table Expression) is a one-time result set that only exists for the duration
of the query. It allows us to refer to data within a single SELECT, INSERT, UPDATE,
DELETE, CREATE VIEW, or MERGE statement's execution scope. It is temporary because
its result cannot be stored anywhere and will be lost as soon as a query's execution is
completed.
3. List the advantages NumPy Arrays have over Python lists?
Python's lists, even though hugely versatile containers capable of a number of functions, have several limitations when compared to NumPy arrays. With lists it is not possible to perform vectorised operations, such as element-wise addition and multiplication. Lists also require that Python store the type information of every element, since they support objects of different types, which means type-dispatching code must be executed each time an operation on an element is performed.
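A small comparison on toy values:

import numpy as np

a_list, b_list = [1, 2, 3], [4, 5, 6]
print([x + y for x, y in zip(a_list, b_list)])   # lists need an explicit loop

a, b = np.array(a_list), np.array(b_list)
print(a + b)   # [5 7 9] element-wise addition in one vectorised call
print(a * b)   # [ 4 10 18] element-wise multiplication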
————————————————————
Date: 28-06-2023
A. The deviation is the difference between an observed value and the arithmetic mean of the group of values. The residual error is the difference between an observed value and the value predicted by the model.
A. The elbow approach is a popular way of determining the ideal value of K when using the K-Means clustering algorithm. The essential concept behind this method is to plot the cost for various values of K: as K grows there are fewer elements in each cluster and the cost falls, and the "elbow" point where the decrease levels off indicates a good choice of K.
A. Deleting rows with missing values; imputing missing values for continuous variables; imputing missing values for categorical variables; using algorithms that support missing values; predicting the missing values; imputation using the deep learning library Datawig; imputation using machine learning utilities like SimpleImputer, KNNImputer, etc.
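A minimal sketch using scikit-learn's SimpleImputer mentioned above (the tiny matrix is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy="mean")   # "median" or "most_frequent" also work
print(imputer.fit_transform(X))            # NaNs replaced by the column means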
————————————————————————-
Date: 25-06-2023
A. Bias is the difference between the average prediction of our model and the correct value
which we are trying to predict. Variance is the variability of model prediction for a given data
point or a value which tells us spread of our data. If our model is too simple and has very few
parameters then it may have high bias and low variance. On the other hand if our model has
large number of parameters then it’s going to have high variance and low bias. So we need
to find the right/good balance without overfitting and underfitting the data. This tradeoff in
complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more
complex and less complex at the same time.
Trimming/removing the outlier: In this technique, we remove the outliers from the dataset.
Quantile-based flooring and capping: In this technique, the outlier is capped at a certain value above the 90th percentile or floored at a value below the 10th percentile (see the sketch after this list).
Mean/Median imputation : As the mean value is highly influenced by the outliers, it is
advised to replace the outliers with the median value.
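A short pandas sketch of the quantile-based flooring and capping technique (the series values are made up; the 10th/90th percentiles follow the text above):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])   # 100 is an outlier
lower, upper = s.quantile(0.10), s.quantile(0.90)
print(s.clip(lower=lower, upper=upper))           # values floored/capped at the quantiles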
————————————————————————-
Date: 16-06-2023
Company - Hitachi
1. What is the difference between supervised learning and unsupervised learning? Give
concrete examples.
Supervised learning uses labeled data to learn a mapping from inputs to known outputs. For example, if I had a dataset with two variables, age (input) and height (output), I could implement a supervised learning model to predict the height of a person based on their age.
Unlike supervised learning, unsupervised learning is used to draw inferences and find
patterns from input data without references to labeled outcomes. A common use of
unsupervised learning is grouping customers by purchasing behavior to find target markets.
Ans: You would perform hypothesis testing to determine statistical significance. First, you
would state the null hypothesis and alternative hypothesis.
Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the significance level (alpha), and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.
Ans: The Law of Large Numbers is a theorem that states that as the number of trials increases, the average of the results will become closer to the expected value.
E.g., the proportion of heads when flipping a fair coin 100,000 times should be closer to 0.5 than when flipping it 100 times.
4.If a Company says that they want to double the number of ads in Newsfeed, how would
you figure out if this is a good idea or not?
Ans: You can perform an A/B test by splitting the users into two groups: a control group with the normal number of ads and a test group with double the number of ads. Then you would choose the metric to define what a "good idea" is. For example, we can say that the null hypothesis is that doubling the number of ads won't have any impact on the time spent on Facebook, and the alternative hypothesis is that doubling the number of ads will reduce the time spent on Facebook. However, you can choose a different metric like the number of active users or the churn rate. Then you would conduct the test and determine the statistical significance of the test to reject or not reject the null.
Date: 13-06-2023
Company: Cloudera
Q1 What are Loss Function and Cost Functions? Explain the key Difference Between them?
When calculating loss we consider only a single data point, then we use the term loss
function.
Whereas, when calculating the sum of error for multiple data then we use the cost function.
There is no major difference.
In other words, the loss function is to capture the difference between the actual and
predicted values for a single record whereas cost functions aggregate the difference for the
entire training dataset.
The most commonly used loss functions are mean squared error and hinge loss.
Mean Squared Error (MSE): In simple words, it measures how far the model's predicted values are from the actual values, on average.
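A one-function NumPy sketch of MSE (the example values are made up):

import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences between actual and predicted values
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))   # about 1.417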
SVM stands for support vector machine. They are used for classification and prediction
tasks. SVM consists of a separating plane that discriminates between the two classes of
variables. This separating plane is known as hyperplane. Some of the kernels used in SVM
are –
Polynomial Kernel
Gaussian Kernel
Sigmoid Kernel
Hyperbolic Kernel
Ans. The following are the four significant subsets of the SQL:
Data definition language (DDL): It defines the data structure that consists of commands like
CREATE, ALTER, DROP, etc.
Data manipulation language (DML): It is used to manipulate existing data in the database.
The commands in this category are SELECT, UPDATE, INSERT, etc.
Data control language (DCL): It controls access to the data stored in the database. The
commands in this category include GRANT and REVOKE.
Transaction Control Language (TCL): It is used to deal with the transaction operations in the
database. The commands in this category are COMMIT, ROLLBACK, SET TRANSACTION,
SAVEPOINT, etc.
Shallow copies are faster to make. However, they handle pointers and references in a "lazy" manner: a shallow copy just copies the pointer value rather than producing a fresh copy of the underlying data the pointer refers to. As a result, the original and the copy can both hold pointers that refer to the same underlying data. A deep copy clones the underlying data completely, so it is not shared between the original and the copy.
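A concrete Python illustration with the standard copy module:

import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)       # copies only the outer list; inner lists are shared
deep = copy.deepcopy(original)      # clones the underlying data completely

original[0].append(99)
print(shallow[0])   # [1, 2, 99]  the shallow copy sees the change
print(deep[0])      # [1, 2]      the deep copy does not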
Date: 30-05-2023
Company: Twitter
The goal in machine learning is to strike a balance between bias and variance to achieve good generalization performance. Ideally, we want a model that can capture the underlying patterns and complexity of the data while avoiding overfitting or underfitting. This balance is known as the bias-variance trade-off.
Increasing the complexity of a model, such as using more features or increasing the model's capacity (e.g., adding more layers in a neural network), typically reduces bias and increases variance. Conversely, reducing the complexity of a model, such as using simpler algorithms or reducing the number of features, generally increases bias and reduces variance.
To find the optimal trade-off, one can employ various techniques such as cross-validation, regularization, or ensemble methods. Cross-validation helps evaluate the model's performance on unseen data and aids in selecting the appropriate complexity level. Regularization techniques add constraints to the model to reduce overfitting. Ensemble methods combine multiple models to leverage the strengths of each model while mitigating their weaknesses. Understanding the bias-variance trade-off is crucial for selecting the appropriate models, optimizing hyperparameters, and ensuring the model's ability to generalize well to unseen data.
2. What are the differences between shallow neural networks and deep neural networks?
Shallow Neural Networks: Shallow neural networks have a small number of hidden layers
(typically one or two). They are relatively simpler and have fewer parameters to learn.
Shallow networks are suitable for simpler tasks or datasets with fewer complex patterns.
Deep Neural Networks: Deep neural networks have a large number of hidden layers (more
than two). They are more complex and have a significantly higher number of parameters.
Deep networks can learn intricate patterns and hierarchies of features from data. They are
more suitable for complex tasks such as image recognition, natural language processing,
and speech recognition.
To handle missing values in SQL queries, you can use the COALESCE or ISNULL function
to replace NULL values with a specified default value, or you can use the WHERE clause to
filter out rows with missing values using the IS NULL or IS NOT NULL condition.
4. How would you create a dashboard in Tableau to visualize sales performance over time?
To create a dashboard in Tableau to visualize sales performance over time, you would:
- Drag a sales measure to the Rows or Columns shelf to represent the sales value.
- Add any additional dimensions or measures to the visualization for more detailed insights.
- Customize the visual appearance, including axis labels, titles, and formatting.
- Create a new dashboard and add the sales performance visualization to it.
- Arrange and resize the visualizations on the dashboard to create an intuitive and visually
appealing layout.
- Save and publish the dashboard for others to view and interact with.
Date: 17-05-2023
2. What is the ACID property in a database?
• Atomicity means that if any part of a transaction fails, the whole transaction fails and the database state remains unchanged.
• Durability ensures that once a transaction is committed, it will occur regardless of what
happens in between such as a power outage, fire, or some other kind of disturbance.
Answer: One common approach is to impute the missing values using statistical techniques
such as mean, median, or mode imputation. Another approach is to use machine learning
algorithms that can handle missing values, such as decision trees or k-nearest neighbors. If
the missing values are related to a specific feature or subset of the data, we can also
consider using a more targeted approach, such as regression imputation or multiple
imputation.
2. Can you explain the difference between supervised and unsupervised learning?
Answer: Supervised learning is a type of machine learning where the algorithm learns from
labeled data, where the target variable or outcome is already known. The algorithm uses this
labeled data to identify patterns and relationships, and then applies this learning to new,
unseen data to make predictions or classifications. In contrast, unsupervised learning is
used when there is no target variable or outcome variable to be predicted, and the algorithm
is tasked with finding patterns and relationships within the data on its own.
3. How do you decide which machine learning algorithm to use for a particular problem?
Answer: The choice of machine learning algorithm depends on several factors, including the
type of problem, the nature of the data, and the available computational resources. Some
common factors to consider when selecting a machine learning algorithm include the size of
the dataset, the number of features, the desired level of interpretability, the complexity of the
relationships between the features and the target variable, and the computational resources
available for training and testing the model.
Date: 02-05-2023
Average pooling: Computes the average value of the patch of the feature map covered by the kernel or filter.
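A small NumPy sketch of non-overlapping average pooling (the kernel size of 2 and the 4 x 4 feature map are illustrative; dimensions are assumed divisible by the kernel size):

import numpy as np

def average_pool(feature_map, k=2):
    # Non-overlapping k x k average pooling with stride k
    h, w = feature_map.shape
    return feature_map.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)
print(average_pool(fm))   # each output value is the mean of a 2 x 2 patch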
The WHERE clause specifies the criteria which individual records must meet to be selected by a query; it can be used without the GROUP BY clause. The HAVING clause cannot be used without the GROUP BY clause. The WHERE clause selects rows before grouping, while the HAVING clause selects rows after grouping. The WHERE clause cannot contain aggregate functions; the HAVING clause can.
In relative referencing, a formula changes when it is copied from one cell to another, with respect to the destination cell's address; this type of referencing is the default, and it doesn't require a dollar sign in the formula. In absolute cell referencing, by contrast, the formula does not change when copied, irrespective of the destination cell.
Date: 26-04-2023
Click the drop down to the right of Dimensions on the Data pane and select “Create >
Calculated Field” to open the calculation editor.
Date - 04-04-2023
Company: Reliance JIO
Topic: svm kernels, cross validation, relationship types, data types in tableau
Polynomial kernel - When you have discrete data that has no natural notion of smoothness.
Radial basis kernel - Create a decision boundary able to do a much better job of separating
two classes than the linear kernel.
2. What is Cross-Validation?
Cross-validation is a method of assessing how well a model generalizes on limited data.
The data is split into k subsets (folds), and the model is trained on k-1 of those subsets.
The last subset is held out for testing. This is done for each of the subsets; this is k-fold
cross-validation. Finally, the scores from all the k folds are averaged to produce the final
score.
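A minimal scikit-learn sketch of k-fold cross-validation (the choice of k = 5 and of the model is an assumption):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # the final score: the average over the 5 folds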
One-to-One - This can be defined as the relationship between two tables where each record
in one table is associated with at most one record in the other table.
One-to-Many & Many-to-One - This is the most commonly used relationship where a record
in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are needed for
defining a relationship.
Data types in Tableau:
Text (string) values
Date values
Date & time values
Numerical values
Boolean values
Geographic values
Date - 01-04-2023
Ans: Decorators are used to add functionality to a function without changing its structure.
Decorators are generally defined before the function they enhance. To apply a decorator,
we first define the decorator function, then write the function it is applied to, and simply add
the decorator above that function using the @ symbol.
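A minimal sketch; the decorator log_calls and the function add are made up for illustration:

def log_calls(func):                       # the decorator function
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")  # behavior added around the call
        return func(*args, **kwargs)
    return wrapper

@log_calls                                 # applied with the @ symbol
def add(a, b):
    return a + b

print(add(2, 3))  # prints "calling add", then 5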
• Atomicity means that if any part of a transaction fails, the whole transaction fails and the
database state is left unchanged.
• Durability ensures that once a transaction is committed, it remains committed regardless
of what happens afterwards, such as a power outage, fire, or some other kind of disturbance.
4. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of
the given dataset?
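In brief: one-hot encoding creates one binary column per category, so it increases the dimensionality of the dataset, whereas label encoding replaces the categories with integers in a single column, leaving the dimensionality unchanged. A minimal pandas/scikit-learn sketch (the column values are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Goa"]})

one_hot = pd.get_dummies(df["city"])               # one-hot: 3 new columns, one per category
labels = LabelEncoder().fit_transform(df["city"])  # label encoding: 1 integer column

print(one_hot.shape)  # (4, 3) - dimensionality grows with the category count
print(labels)         # [2 0 2 1] - dimensionality unchanged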
Date- 24-03-2023
Company name: Kenstar
Autoencoders are artificial neural networks that learn without any supervision. These
networks learn automatically by mapping inputs to the corresponding outputs, and they
consist of two parts:
Encoder: Used to compress the input into an internal computational state (a latent
representation)
Decoder: Used to convert the computational state back into the output
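A compact Keras sketch of this encoder/decoder pairing (assuming TensorFlow; the 784 -> 32 -> 784 layer sizes are illustrative):

from tensorflow.keras import layers, models

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(32, activation="relu"),      # encoder: compress to a latent state
    layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct the input
])
# Trained to reproduce its own input, so no labels are needed.
autoencoder.compile(optimizer="adam", loss="mse")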
K-Means is unsupervised. K-Means is a clustering algorithm: the points in each cluster are
similar to each other, and each cluster is different from its neighboring clusters. KNN is
supervised in nature. KNN is a classification algorithm; it classifies an unlabeled observation
based on its K (can be any number) surrounding neighbors.
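A side-by-side scikit-learn sketch of the two (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)                # unsupervised: ignores y
predictions = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(X)  # supervised: needs y
print(clusters[:5], predictions[:5])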
A stored procedure that calls itself until a boundary condition is reached is called a recursive
stored procedure. This recursion helps programmers deploy the same set of code as many
times as required. Some SQL programming languages limit the recursion depth to prevent
an infinite loop of procedure calls from causing a stack overflow, which slows down the
system and may lead to system crashes.
4. SUM() vs SUMX(): What is the difference between the two DAX functions in Power BI?
The SUM() function aggregates an entire data column, whereas SUMX() is an iterator
function that evaluates an expression row by row, which lets you filter or transform the data
you are adding.
Syntax: SUMX(Table, Expression), where Table contains the rows for the calculation and
Expression is a calculation that will be evaluated on each row of the table.
Date: 20-03-2023
It combines multiple models together to get the final output or, to be more precise, it
combines multiple decision trees together to get the final output. So, decision trees are the
building blocks of the random forest model.
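A minimal scikit-learn sketch (the choice of 100 trees is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# Each of the 100 decision trees votes; the forest returns the majority class.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))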
2. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but are often
misunderstood. Both of them deal with data. Data Science is a broad field that deals with
large volumes of data and allows us to draw insights out of this voluminous data. Machine
Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals
with data, but here, we are solely focused on learning how to convert the processed data
into a functional model, which can be used to map inputs to outputs, e.g., a model that takes
an image as input and tells us whether that image contains a flower.
In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a
kernel function takes data as input and converts it into a required form. This transformation
of the data is based on something called a kernel trick, which is what gives the kernel
function its name. Using the kernel function, we can transform the data that is not linearly
separable (cannot be separated using a straight line) into one that is linearly separable.
Date: 17-03-2023
1. Which of the following machine learning algorithms can be used for inputting missing
values of both categorical and continuous variables?
K-means clustering
Linear regression
K-nearest neighbors (K-NN)
Decision trees
Ans - The K-nearest neighbor algorithm can be used, because it imputes a missing value by
computing the nearest neighbors based on all the other features.
With K-means clustering or linear regression, you need to handle missing values in your
pre-processing, otherwise they will crash. Decision trees have the same problem, although
there is some variance between implementations.
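scikit-learn also ships a KNN-based imputer for numeric features; a minimal sketch:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],  # missing value to be imputed
              [7.0, 6.0]])

# Each missing entry is filled in from the values of its nearest neighbors.
print(KNNImputer(n_neighbors=2).fit_transform(X))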
2. What is an RNN (recurrent neural network)?
Ans - An RNN is a neural network that works on sequential data. RNNs are used in language
translation, voice recognition, image captioning, etc. There are different types of RNN
architectures, such as one-to-one, one-to-many, many-to-one and many-to-many. RNNs are
used in Google's Voice Search and Apple's Siri.
Ans - This is a statistical hypothesis test for randomized experiments with two variants, A
and B. The objective of A/B testing is to detect the effect of any change to a web page so as
to maximize or increase the outcome of a strategy.
Ans - The star schema is the most fundamental and the simplest of the data mart schemas.
It is called a star schema because its physical model resembles a star shape, with a fact
table at its center and the dimension tables at its periphery representing the star's points.
Date: 24-02-2023
Time series can be phrased as supervised learning. Given a sequence of numbers for a time
series dataset, we can restructure the data to look like a supervised learning problem.
In the sliding window method, the previous time steps can be used as input variables, and
the next time steps can be used as the output variable.
In statistics and time series analysis, this is called a lag or lag method. The number of
previous time steps is called the window width or size of the lag. This sliding window is the
basis for how we can turn any time series dataset into a supervised learning problem.
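A minimal pandas sketch of the sliding window method with a window width of 1 (the series values are made up):

import pandas as pd

series = pd.Series([10, 20, 30, 40, 50])
df = pd.DataFrame({
    "X (t-1)": series.shift(1),  # the previous time step becomes the input
    "y (t)": series,             # the current time step becomes the output
}).dropna()                      # the first row has no lag value
print(df)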
A subquery is a query inside another query, defined to retrieve data or information from the
database. The outer query is called the main query, whereas the inner query is called the
subquery. Subqueries are always executed first, and the result of the subquery is passed
on to the main query. A subquery can be nested inside a SELECT, UPDATE or any other
query, and it can use comparison operators such as >, < or =.
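A small sketch using Python's built-in sqlite3 module; the salaries table is made up for illustration. The inner query runs first and its result feeds the outer (main) query:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, salary REAL)")
conn.executemany("INSERT INTO salaries VALUES (?, ?)",
                 [("Asha", 50), ("Ravi", 80), ("Meera", 65)])

rows = conn.execute(
    "SELECT name FROM salaries "
    "WHERE salary > (SELECT AVG(salary) FROM salaries)"  # subquery runs first
).fetchall()
print(rows)  # [('Ravi',)]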
4. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and Workbook?
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A worksheet contains a single view along with shelves, legends, and the Data pane.
Date: 23-02-2023
The idea of stacking is to learn several different weak learners and combine them by training
a meta-model to output predictions based on the multiple predictions returned by these weak
models.
If a stacking ensemble is composed of L weak learners, then to fit the model the following
steps are followed:
Choose L weak learners and fit them to the data of the first fold.
For each of the L weak learners, make predictions for observations in the second fold.
Fit the meta-model on the second fold, using predictions made by the weak learners as
inputs.
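scikit-learn's StackingClassifier automates a cross-validated variant of these steps; a minimal sketch in which the choice of weak learners and meta-model is purely illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],       # the L weak learners
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-model
    cv=2,  # roughly the two-fold scheme described above
)
print(stack.fit(X, y).score(X, y))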
2. Can you provide me examples of when a scatter graph would be more appropriate than a
line chart or vice versa?
A scatter graph would be more appropriate than a line chart when you are looking to show
the relationship between two variables that are not linearly related. For example, if you were
looking to show the relationship between a person’s age and their weight, a scatter graph
would be more appropriate than a line chart. A line chart would be more appropriate than a
scatter graph when you are looking to show a trend over time. For example, if you were
looking at the monthly sales of a company over the course of a year, a line chart would be
more appropriate than a scatter graph.
When data is ingested into Power BI, it is basically stored in Fact and Dimension tables.
Fact tables: The central table in a star schema of a data warehouse, a fact table stores
quantitative information for analysis and is not normalized in most cases.
Dimension tables: It is just another table in the star schema that is used to store attributes
and dimensions that describe objects stored in a fact table.
After any variable declarations, DECLARE the cursor. A SELECT statement must always be
coupled with the cursor definition.
OPEN the cursor to populate the result set. The OPEN statement must be executed before
rows can be fetched from the result set.
Use the FETCH command to retrieve a row and move to the next row in the result set.
Finally, CLOSE the cursor and use the DEALLOCATE command to remove the cursor
definition and free up the resources connected with it.