DS Long Answers

DS Unit 1 Essay Questions
1. Define Data science. What are the traits of Data science? Discuss the applications of Data science with suitable examples -> Mid 1 Question
2. Write a brief note on various measures of data similarity and dissimilarity
3. What is Data matrix? Explain using an example how to find a Dissimilarity matrix
4. Using an example discuss similarity of Binary variables
5. Using an example Table below discuss similarity of any two types of variables what you have identified
6. Define Proximity matrix. Find the similarity matrix for given DS Lab continuous evaluation grades (Ordinal attribute) data set
7. What is data pre-processing and why do we need it? Explain cleaning of data in brief.
8. Write a python code for reading a dataset and removing the NaN values or filling the NaN values.
9. Explain data cleaning and munging with a suitable example
10. Explain various ways of data transformation and data reduction techniques.
11. Explain the process of data discretization with a suitable example
12. Explain the various ways of preparing the data for analysis.
13. What is the need for data visualization? Write on the libraries supported by python for data visualization
14. Write a python code for plotting 5 different graphs with an example
15. Write a brief note on scraping the web using twitter data API
16. a) List the visualization tools in python. b) Discuss the steps needed to perform Web scraping to retrieve the III-B.Tech-I sem
students results from CVR website.
DS Unit 2 Essay Questions
1. Explain the least square methods for estimation of coefficients of linear multiple regression.
2. Estimate R2 and adjusted R2 value for the following data set
3. Define Regression. Design a Multiple Linear regression model to find final score for given dataset of student placement marks
and perform R2 , Adjusted R2 to test the performance of the model
4. Define Regression. Design a Multiple Linear regression model to find final score for given dataset of student placement marks
and Find the Mean Square Error (MSE) of the model. -> Mid Question
5. Explain Logistic regression with example and How to implement in Python?

6. Define Classification. Discuss the procedure of the KNN-Classifier to classify a Person X ( sugar level 190, Age 45) from given
case study of diabetic patients. -> Mid Question
7. Explain random forest with example and How to implement in Python?

8. Explain ridge regression with example and How to implement in Python?
9. What is Multi collinearity issue? How can you address this issue?
10. Explain SVM with example and How to implement in Python?
11. Explain ANOVA with example and How to implement in Python?
12. What is ANOVA. Find the significance of the noise on solving questions from dataset given below
1. Explain the mean vector and correlation with example

2. Explain Mahalanobis distance with example
3. a). Define Data matrix, Euclidean distance, Mahalanobis distance, Precision matrix in Multi variate data analysis. b). Find the
mean vector for the given vectors u1 = (23, 56, 76) u2 = (123, 89, 64) and u3 = (98, 54, 78)
4. Explain PCA with example and How to implement in Python?
5. What is the need for Dimensionality reduction. Reduce 2D to 1D for given dataset with PCA?
6. What is the need of Dimensionality reduction. How do you identify the Principal components from given data set?
7. What are the key differences between PCA and MDS dimensionality reduction approaches. Give the steps in classical Multi-
Dimensional Scaling algorithm
8. Explain in brief the steps involved to implement Multidimensional scaling
9. Define & Explain Stochastic process. Explain different types of stochastic process with suitable examples.
10. Write short notes on i) Stochastic process, ii) Markov chain process, iii) Transition probability matrix
11. Explain Markov Chain and their classification with example and How to implement in Python?
1. What do you mean by well posed learning problem? Explain with example
2. What is a Well-posed problem? Discuss in detail the steps involved in designing a learning task
3. What is Machine learning? Discuss different approaches in Machine Learning with suitable examples
4. Discuss the Issues in machine learning
5. Prove that the LMS weight update rule performs a gradient descent to minimize the squared error
6. Illustrate general-to-specific ordering of hypothesis in concept learning
7. Implement an algorithm demonstrating tic-tac-toe learning approach
8. Write FIND-S algorithm. Apply the FIND-S algorithm on given training samples to find the maximally specific hypothesis.
9. Define Version space? Find the hypothesis for given training examples using Candidate elimination algorithm
10. Consider the given below training example which finds malignant tumours from MRI Scans
11. Explain candidate-elimination learning algorithm using version spaces

12. Explain the concept of Inductive Bias in brief
1. Discuss representational power of two-layer perceptron model versus multilayer perceptron model.
2. How is ANN useful in making a machine intelligent?
3. Explain Supervised and Un-Supervised learning?
4. Do you have any idea about the deep neural network?
5. Explain the advantage of Artificial Neural Network?
6. Explain the feed-forward neural network?
7. What is the convolutional neural network?
8. What are Neural Networks? What are the types of Neural networks?
9. Explain appropriate problem for Neural Network Learning with its characteristics.
10. Explain the concept of a Perceptron with a neat diagram.
11. Explain the single perceptron with its learning algorithm.
12. How a single perceptron can be used to represent the Boolean functions such as AND,
13. Write a note on (i) Perceptron Training Rule (ii) Gradient Descent and Delta Rule
14. Write Gradient Descent algorithm for training a linear unit.
15. Derive the Gradient Descent Rule
16. Write Stochastic Gradient Descent algorithm for training a linear unit.
17. Differentiate between Gradient Descent and Stochastic Gradient Descent
18. Write Stochastic Gradient Descent version of the Back Propagation algorithm for feedforward networks containing two layers
of sigmoid units.
19. Derive the Back Propagation Rule
20. Explain different types of activation functions.
21. Explain the followings w.r.t Back Propagation algorithm a. Convergence and Local Minima b. Representational Power of
Feedforward Networks c. Hypothesis Space Search and Inductive Bias d. Hidden Layer Representations e. Generalization,
Overfitting, and Stopping Criterion
22. What is an Artificial Neural Network? Describe the architecture of ANN and explain the topological units of ANN. Mid 2
23. Write short notes on i) Weights, ii) Bias, iii) Activation function, iv) Inductive Bias, Mid 2
24. What is neuron? How is the ANN motivated with biological concept?
25. Define activation function. Explain different types of activation functions. Justify, ‘ANN without activation function is a linear
regression model’.
26. Explain backpropagation algorithm with an example network
27.
Of all the above 27 questions,
only the questions below are answered. The remaining ones are to be learned on your own.
1. What is an Artificial Neural Network? Describe the architecture of ANN and explain the topological units of ANN.
2. Types of ANN
3. Activation functions
4. Back propagation
DS Unit 1 Essay Answers.
1. Define Data science. What are the traits of Data science? Discuss the applications of Data science with suitable
examples -Mid 1 Question
Answer :
Data science
is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights
from structured and unstructured data
Data science draws on computer science, statistics, machine learning, visualization, and human-computer
interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products
Traits of Data science / Big Data
1. Volume
How much data is there?
refers to data generated from many sources
2. Variety
How diverse are different types of data?
data can be structured, unstructured or semi-structured
3. Velocity
At what speed is new data generated? -> i.e. the speed at which data is generated
4. Veracity
How accurate is the data? -> i.e. how reliable the data is
5. Value
the ability to transform big data into valuable data and store it
Applications of Data science
1. Advanced Image Recognition
Eg : Face Mask detection
2. Recommendation System
Eg : YouTube and Google News recommendations
3. Banking
Eg : Fraud detection, NPA risk modeling
4. Transport
Eg : Self driving cars
2. Write a brief note on various measures of data similarity and dissimilarity
Answer :
Data Similarity
is a numerical measure of how alike two data objects are.
It is higher when objects are more alike.
It often falls in the range [0,1]
Data Dissimilarity
is a numerical measure of how different two data objects are
It is lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Measures of Similarity/Dissimilarity for Simple Attributes
Distances
Minkowski distance
1. h = 1: Manhattan (city block, L1 norm) distance
2. h = 2: Euclidean (L2 norm) distance
3. h → ∞: supremum (Lmax norm, L∞ norm) distance
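The three special cases of the Minkowski distance listed above can be checked with a short NumPy sketch. The function name minkowski and the two sample points are illustrative assumptions, not part of the notes.

```python
import numpy as np

def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors.

    h = 1 gives the Manhattan distance, h = 2 the Euclidean distance,
    and h = np.inf the supremum (L-infinity) distance.
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(h):
        return float(diff.max())
    return float((diff ** h).sum() ** (1.0 / h))

# Illustrative points (assumed, not taken from the notes)
p, q = (1, 2), (4, 6)
manhattan = minkowski(p, q, 1)        # |1-4| + |2-6|
euclidean = minkowski(p, q, 2)        # sqrt(3^2 + 4^2)
supremum = minkowski(p, q, np.inf)    # max(3, 4)
print(manhattan, euclidean, supremum)
```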
3. What is Data matrix? Explain using an example how to find a Dissimilarity matrix
Answer :
Data Matrix
is an n × p matrix representing n data points with p dimensions (attributes)
Dissimilarity matrix
is an n × n triangular matrix that represents the same n data points but registers only the pairwise distances d(i, j)
Example using Euclidean distance
Consider the data matrix below
Solution
1. Calculate the Euclidean distance between every pair of points
2. Answer : Write the distances in the form of a matrix
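The two solution steps can be sketched with NumPy as follows. The data matrix X here is a hypothetical example (4 objects, 2 attributes), not the one from the original question.

```python
import numpy as np

# Hypothetical data matrix: n = 4 objects (rows), p = 2 attributes (columns)
X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [2.0, 0.0],
              [4.0, 5.0]])

# Step 1: Euclidean distance between every pair of rows.
# Broadcasting builds all pairwise differences, shape (n, n, p).
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))   # full symmetric n x n matrix

# Step 2: keep only the lower triangle, the usual dissimilarity-matrix layout
D_lower = np.tril(D)
print(np.round(D_lower, 2))
```

The diagonal is 0 because every object has zero distance to itself, and the upper triangle is left out because d(i, j) = d(j, i).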

4. Using an example discuss similarity of Binary variables
Answer :
Proximity Measure for Binary Attributes

Consider the example
Dissimilarity in Binary Variables
Answer :
Similarity in Binary Variables
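A minimal sketch of the usual proximity measures for binary attributes, based on the 2 × 2 contingency counts q (both 1), r (1 only in the first object), s (1 only in the second object) and t (both 0). The function name and the two sample vectors are assumptions for illustration.

```python
def binary_proximity(a, b):
    """Return (symmetric dissimilarity, asymmetric dissimilarity,
    Jaccard similarity) for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a only
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 in b only
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both 0
    symmetric = (r + s) / (q + r + s + t)
    # Asymmetric measures ignore the 0-0 matches (t)
    asymmetric = (r + s) / (q + r + s) if (q + r + s) else 0.0
    jaccard = q / (q + r + s) if (q + r + s) else 1.0
    return symmetric, asymmetric, jaccard

# Illustrative patients described by six yes/no (1/0) attributes
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
sym, asym, jac = binary_proximity(jack, mary)
print(round(sym, 2), round(asym, 2), round(jac, 2))
```

For asymmetric binary attributes the Jaccard similarity is simply 1 minus the asymmetric dissimilarity.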

5. Using an example Table below discuss similarity of any two types of variables what you have identified
Answer :
Take two variables from the table
1. the first as Half-Yearly - this is an ordinal attribute, so solve it in the same way as the 6th question
2. the second as Final - this is a numeric attribute, so use the distance formula and solve it in the same way as the 3rd question
6. Define Proximity matrix. Find the similarity matrix for given DS Lab continuous evaluation grades (Ordinal attribute)
data set
Answer :
Proximity matrix
is a square matrix in which the entry in cell (j, k) is some measure of the similarity (or distance) between the items to
which row j and column k correspond.
Proximity matrices form the data for multidimensional scaling
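For an ordinal attribute, a common procedure is to replace each value by its rank r in 1..M, normalize it to z = (r - 1) / (M - 1), and then use |z_i - z_j| as the dissimilarity and 1 minus that as the similarity. The sketch below assumes this procedure; the grade scale and the four sample grades are hypothetical, not the DS Lab data from the question.

```python
import numpy as np

# Hypothetical ordered grade scale (lowest to highest) and sample grades
order = ["fair", "good", "excellent"]
grades = ["excellent", "fair", "good", "excellent"]

rank = {g: i + 1 for i, g in enumerate(order)}   # fair=1, good=2, excellent=3
M = len(order)

# Normalize ranks onto [0, 1]
z = np.array([(rank[g] - 1) / (M - 1) for g in grades])

dissim = np.abs(z[:, None] - z[None, :])   # dissimilarity matrix
sim = 1.0 - dissim                          # similarity matrix
print(sim)
```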
7. What is data pre-processing and why do we need it? Explain cleaning of data in brief.
Answer :
Data preprocessing
is a technique that involves transforming raw data into a useful and efficient format so that data mining and analytics can be
applied
Major Tasks in Data Preprocessing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation and data discretization
Why Preprocess the Data?
1. Accuracy
correct or wrong, accurate or not
2. Completeness
not recorded, unavailable, …
3. Consistency
some modified but some not, dangling, …
4. Timeliness
timely update?
5. Believability
how much are the data trusted to be correct?
6. Interpretability
how easily the data can be understood?
Explain cleaning of data in brief
Data Cleaning
Data can have many irrelevant and missing parts, so data cleaning is done. It involves handling missing
data and noisy data and resolving inconsistencies in the data
1. Missing Data:
This situation arises when some values are missing from the dataset. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a
tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or
the most probable value
2. Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data
collection, data entry errors etc.
It can be handled in the following ways :
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size,
and each segment is handled separately. One can replace all data in a segment by its mean or median, or the
nearest boundary value can be used instead.
1. smoothing by bin means
2. smoothing by bin medians
3. smoothing by bin boundaries
Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be
1. linear (having one independent variable) or
2. multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the
clusters.
8. Write a python code for reading a dataset and removing the NaN values of filling the NaN values.
Answer :
# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset
df = pd.read_csv('user_data.csv')

# Checking whether null values are there or not (count of NaN per column)
print(df.isnull().sum())

# Option 1: remove every row that contains a NaN value
df_dropped = df.dropna()

# Option 2: fill the NaN values; use this command for all those attributes which have NaN values.
# Here X is the attribute which has NaN values; they are all filled with one random value (0 or 1).
df['X'] = df['X'].fillna(np.random.randint(0, 2))

# Checking whether the null values are filled
print(df.isnull().sum())
9. Explain data cleaning and munging with a suitable example

Answer :
Data munging, also known as data wrangling, is the data preparation process of manually transforming and
cleansing/cleaning the data for better decision making.
Data Munging includes the following steps:
1. Data exploration: In this process, the data is studied, analyzed and understood by visualizing representations of the data.
2. Dealing with missing values: Most large datasets contain missing (NaN) values. They
need to be taken care of by replacing them with the mean or the mode (the most frequent value of the column), or simply by
dropping the rows that have a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements, where new data can be added or
pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns, which need to be removed or
filtered out
5. Other: After processing the raw dataset with the above steps we get an efficient dataset as per our
requirements, and it can then be used for the required purpose, like data analysis, machine learning, data visualization,
model training etc.
Eg : Refer this link -> https://www.geeksforgeeks.org/data-wrangling-in-python/
10. Explain various ways of data transformation and data reduction techniques
Answer :
Data Transformation
data are transformed into forms appropriate for data analytic processing
Data transformation tasks:
1. Smoothing
Remove the noise from the data.
Techniques include Binning, Regression, and Clustering.
2. Normalization
the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, 0.0 to 1.0
Types
1. Min-max normalization to [new_minA , new_maxA ]
2. Z-score normalization ( μ: mean, σ: standard deviation )
3. Normalization by decimal scaling
3. Attribute construction (or feature construction), Subset selection
new attributes are constructed and added from the given set of attributes to help the mining process
4. Aggregation
aggregation operations are applied to the data
Eg : On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the
annual sales
5. Discretization
Dividing the range of a continuous attribute into intervals
Eg : values for numerical attributes, like age, may be mapped to higher-level concepts, like youth,
middle-aged, and senior
6. Generalization
low-level attribute values are replaced by higher-level concepts using a concept hierarchy
Eg : categorical attributes, like street, can be generalized to higher-level concepts, like city or country
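The three normalization types listed under item 2 above can be sketched with NumPy as follows. The sample values and the function names are assumptions for illustration. The z-score here uses the population standard deviation.

```python
import numpy as np

# Hypothetical attribute values (e.g. incomes)
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by 10^j, where j is the smallest integer with max(|v|) / 10^j < 1."""
    j = 0
    while np.abs(v).max() / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

print(min_max(values))
print(z_score(values))
print(decimal_scaling(values))
```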
Data reduction
is a technique used to obtain a reduced representation of the data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results
Data reduction strategies
1. Data compression
apply transformations to obtain reduced or compressed representation of original data
they are of 2 types
1. Lossless
If the original data can be reconstructed from the compressed data without any loss of
information
2. Lossy
If only an approximation of the original data can be reconstructed from the compressed data (some information is lost), then
the data reduction is called lossy
Eg :
Wavelet transforms
Principal components analysis.
2. Dimensionality reduction - remove unimportant attributes/variables, i.e. eliminate the redundant attributes that are only weakly
important across the data.
Wavelet transforms
is a linear signal processing technique that, when applied to a data vector, transforms it to a numerically
different vector of wavelet coefficients. The two vectors are of the same length. When applying this
technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2,
…, xn), depicting n measurements made on the tuple from n database attributes
Eg : a related technique is the Fourier transform, which can also be used to reduce the data
Principal Components Analysis (PCA)

Feature subset selection
reduces the data set size by removing irrelevant or redundant attributes (or dimensions)
Typical heuristic attribute selection methods:
1. Best single attribute under the attribute independence assumption: choose by significance tests
2. Best step-wise (forward) feature selection:
The best single attribute is picked first
Then the next best attribute conditioned on the first, ...
3. Step-wise (backward) attribute elimination:
Repeatedly eliminate the worst attribute
4. Best combined attribute selection and elimination
5. Decision tree induction
Use attribute elimination and backtracking
Feature creation
Create new attributes (features) that can capture the important information in a data set more effectively
than the original ones
Three general methodologies
1. Attribute extraction
Domain-specific
2. Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
3. Attribute construction
Combining features
Data discretization
3. Numerosity reduction- replace original data volume by smaller forms of data
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
11. Explain the process of data discretization with a suitable example
Answer :
Typical methods of discretisation for numerical data

1. Binning
Top-down split, unsupervised
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size,
and each segment is handled separately. One can replace all data in a segment by its mean or median, or the
nearest boundary value can be used instead.
1. smoothing by bin means
2. smoothing by bin medians
3. smoothing by bin boundaries
Eg :
2. Histogram analysis
Top-down split
is an unsupervised discretization technique because it does not use class information
A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or bins.
Eg :
for the dataset
we do Histogram Analysis in below way
3. Clustering analysis
Either top-down split or bottom-up merge, unsupervised
4. Entropy-based discretization
supervised, top-down split
Eg : if want example then refer this https://natmeurer.com/a-simple-guide-to-entropy-based-discretization/
5. Interval merging by χ² analysis

supervised, bottom-up merge
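The binning method (item 1 above) can be sketched as below, using equal-frequency bins and the three smoothing options. The nine sorted values are illustrative assumptions, and smooth_by_boundaries is a helper name introduced only for this sketch.

```python
import numpy as np

# Illustrative sorted values (e.g. prices), split into 3 equal-frequency bins
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.split(data, 3)   # three bins of three values each

# Smoothing by bin means / medians: every value becomes the bin mean / median
by_means = [np.full(len(b), b.mean()) for b in bins]
by_medians = [np.full(len(b), np.median(b)) for b in bins]

def smooth_by_boundaries(b):
    """Replace each value by the closer of the bin's min and max."""
    lo, hi = b[0], b[-1]
    return np.where(b - lo <= hi - b, lo, hi)

by_bounds = [smooth_by_boundaries(b) for b in bins]

for name, result in [("means", by_means), ("medians", by_medians), ("boundaries", by_bounds)]:
    print(name, [r.tolist() for r in result])
```

Each bin label (or smoothed value) then stands in for the raw numbers, which is what turns a continuous attribute into a small set of intervals.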
12. Explain the various ways of preparing the data for analysis.
Answer :
1. Questionnaire checking:
Questionnaire checking involves eliminating unacceptable questionnaires. A questionnaire may be unacceptable because it is incomplete,
its instructions were not followed, its answers show little variance, pages are missing, it arrived past the cutoff date, or the respondent is not qualified.
2. Editing
Editing looks to correct illegible, incomplete, inconsistent and ambiguous answers.
3. Coding
Coding typically assigns alpha or numeric codes to answers that do not already have them so that statistical techniques
can be applied.
4. Transcribing
Transcribing data involves transferring data so as to make it accessible to people or applications for further processing.
5. Cleaning
Cleaning reviews data for consistency. Inconsistencies may arise from faulty logic, out-of-range or extreme values.
6. Statistical adjustments
Statistical adjustments apply to data that require weighting and scale transformations.
7. Analysis strategy selection
Finally, selection of a data analysis strategy is based on earlier work in designing the research project but is finalized
after consideration of the characteristics of the data that has been gathered.
13. What is the need for data visualization? Write on the libraries supported by python for data visualization
Answer :
Data Visualization
is the graphical representation of information and data by using visual elements like charts, graphs, and maps
data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data
Need for Data Visualisation
1. To make data easier to understand and remember
2. To discover unknown facts, outliers, and trends
3. To visualize relationships and patterns quickly
4. To ask better questions and make better decisions
5. To perform competitive analysis
6. To improve insights
7. Data visualization can identify areas that need improvement or modification
8. Data visualization can clarify which factors influence customer behavior
9. Data visualization helps you understand which products to place where
10. Data visualization can help predict sales volumes
Libraries supported by python for data visualization
1. Matplotlib
used for exploration & data visualisation
can do charts, plots & also can be customised
2. Seaborn
visualisation for large data
has advanced plots
3. Plotly
provides high-quality plots
provides more advanced plots & features than Matplotlib
4. Bokeh
5. Altair
6. ggplot
14. Write a python code for plotting 5 different graphs with an example
Answer :
The plots below are drawn using Matplotlib

Common in all plots
Define the x-axis and corresponding y-axis values as lists.
Plot them on canvas using .plot() function.
Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
Give a title to your plot using .title() function.
Finally, to view your plot, we use .show() function
1. Line Plot
2. Bar Chart
3. Histogram
4. Scatter plot
5. Pie Plot
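A minimal sketch of the five plot types with Matplotlib. All data values and the output file name five_plots.png are assumptions for illustration. The sketch draws the plots in one figure with subplots and saves it to a file; for separate interactive windows, the same calls are available as plt.plot(), plt.bar(), plt.hist(), plt.scatter() and plt.pie() followed by plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
marks = [35, 42, 48, 51, 55, 58, 62, 67, 71, 88]
labels = ["A", "B", "C", "D"]
shares = [40, 30, 20, 10]

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.ravel()

axes[0].plot(x, y)                       # 1. line plot
axes[0].set_title("Line plot")
axes[1].bar(x, y)                        # 2. bar chart
axes[1].set_title("Bar chart")
axes[2].hist(marks, bins=5)              # 3. histogram
axes[2].set_title("Histogram")
axes[3].scatter(x, y)                    # 4. scatter plot
axes[3].set_title("Scatter plot")
axes[4].pie(shares, labels=labels, autopct="%1.0f%%")   # 5. pie plot
axes[4].set_title("Pie plot")
axes[5].axis("off")                      # unused sixth panel

# Name the axes of the four x/y style plots
for ax in axes[:4]:
    ax.set_xlabel("x - axis")
    ax.set_ylabel("y - axis")

fig.tight_layout()
fig.savefig("five_plots.png")
```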
15. Write a brief note on scraping the web using twitter data API
Answer :
Web scraping is extracting data from websites

It is also called web crawling (when done by a bot) or web harvesting
Scraping includes actions like
1. Fetching the page
2. Parsing HTML pages
The browser parses HTML into a DOM tree.
HTML parsing involves tokenization and tree construction.
HTML tokens include start and end tags, as well as attribute names and values.
Beautiful Soup, a Python library for pulling data out of HTML and XML files, is used
It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the
parse tree
3. Extracting data from pages/web sites
Use an API
Use a web scraping tool (walk the pages to extract data)
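The parse and extract steps above can be sketched as follows. To stay self-contained, this sketch uses the standard library's html.parser in place of Beautiful Soup, and it parses a hard-coded HTML string instead of fetching a live page. The class name ResultTableParser and the sample table are hypothetical.

```python
from html.parser import HTMLParser

class ResultTableParser(HTMLParser):
    """Collects the text of every <td> cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:       # skip header rows that contain only <th> cells
                self.rows.append(self._row)
            self._row = None

# Hypothetical page content; a real scraper would first fetch this text from a URL
html = ("<table><tr><th>Roll</th><th>Grade</th></tr>"
        "<tr><td>101</td><td>A</td></tr>"
        "<tr><td>102</td><td>B</td></tr></table>")

parser = ResultTableParser()
parser.feed(html)
print(parser.rows)
```

With Beautiful Soup the same extraction is usually shorter, because the library builds the parse tree and lets you search it by tag name.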
Scraping the web using twitter data API
The Twitter API lets you read and write Twitter data.
You can use it to compose tweets, read profiles, and access your followers' data and a high volume of tweets on particular
subjects in specific locations.
Process
To use the Twitter API you need a Twitter account with developer access.
After completing the setup and creating an app, we get keys and tokens, which help us retrieve
data from Twitter.
They act as login credentials.
Save these credentials for further use.
To extract Twitter data you must stay logged in until the extraction is finished
Types of credentials needed
access_token=""
access_token_secret=""
consumer_key=""
consumer_secret=""
16. a) List the visualization tools in python. b) Discuss the steps needed to perform Web scraping to retrieve the III-B.Tech-I sem
students results from CVR website.
Answer :
Visualisation tools - provide an accessible way to see and understand trends, outliers, and patterns in data
1. Matplotlib
used for exploration & data visualisation
can do charts, plots & also can be customised
2. Seaborn
visualisation for large data
has advanced plots
3. Plotly
provides high-quality plots
provides more advanced plots & features than Matplotlib
Discuss the steps needed to perform Web scraping to retrieve the III-B.Tech-I sem students results from CVR website
Follow the steps in this video, using the CVR website instead of Amazon -> https://www.youtube.com/watch?v=ecAJfHHppVs
DS Unit 2 Essay Answers
1. Explain the least square methods for estimation of coefficients of linear multiple regression.
Answer :
Linear Multiple Regression

Linear Multiple Regression (LMR) model is a statistical model that establishes existence of a linear relationship
(association) between a dependent variable and several independent variables
Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn
Where:
Y = the variable that we are trying to predict (dependent variable),
X1, X2, ..., Xn = the variables that we are using to predict Y (independent variables),
a = the intercept,
b = (b1, b2, b3, ..., bn) = the slopes (regression coefficients).
Eg : Blood Sugar Level vs Weight and Age
Method
To estimate the parameters a and b = (b1, b2, b3, ..., bn) using the principle of least squares, form the sum of squared
deviations of the observed Yi's from the regression line and choose the values of a and b that minimize it
Solution
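The least squares principle described above can be written out as follows. This is a standard sketch, not a reproduction of the original worked solution. The symbol m (number of observations) and the matrix notation are assumptions introduced here: X is the m × (n+1) matrix whose first column is all ones and whose remaining columns hold X1, ..., Xn, and X^T X is assumed to be invertible.

```latex
% Sum of squared deviations to be minimized
S(a, b_1, \dots, b_n) = \sum_{i=1}^{m} \bigl( Y_i - a - b_1 X_{i1} - b_2 X_{i2} - \dots - b_n X_{in} \bigr)^2

% Setting the partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b_j} = 0 \quad (j = 1, \dots, n)

% In matrix form, with \beta = (a, b_1, \dots, b_n)^{\top}
X^{\top} X \, \hat{\beta} = X^{\top} Y
\quad \Longrightarrow \quad
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} Y
```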
2. Estimate R2 and adjusted R2 value for the following data set
Answer :
3. Define Regression. Design a Multiple Linear regression model to find final score for given dataset of student
placement marks and perform R2 , Adjusted R2 to test the performance of the model
Answer :
Regression is a statistical technique that is used to model the relationship of a dependent variable with respect to one or
more independent variables
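A minimal sketch of fitting a multiple linear regression by least squares and then computing R2, adjusted R2 and the MSE. The six student records are hypothetical, not the dataset from the question. The sketch uses R2 = 1 - SSres/SStot and adjusted R2 = 1 - (1 - R2)(N - 1)/(N - k - 1), where N is the number of observations and k the number of predictors.

```python
import numpy as np

# Hypothetical data: two predictor marks per student and the final score
X = np.array([[60, 7], [70, 6], [80, 8], [65, 5], [90, 9], [75, 7]], dtype=float)
y = np.array([62, 66, 80, 58, 92, 74], dtype=float)

# Add a column of ones so that the first coefficient is the intercept a
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares estimate (a, b1, b2)
y_hat = A @ beta                               # fitted values

n_obs, n_pred = X.shape
ss_res = ((y - y_hat) ** 2).sum()              # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()           # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_pred - 1)
mse = ss_res / n_obs

print("coefficients:", beta)
print("R2:", r2, "adjusted R2:", adj_r2, "MSE:", mse)
```

Adjusted R2 is never larger than R2 because it penalizes every extra predictor, which is why it is preferred when comparing models with different numbers of variables.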
4. Define Regression. Design a Multiple Linear regression model to find final score for given dataset of student
placement marks and Find the Mean Square Error (MSE) of the model. -> Mid Question
Answer :
Regression is a statistical technique that is used to model the relationship of a dependent variable with respect to one or
more independent variables
5. Explain Logistic regression with example and How to implement in Python?
Answer :
Logistic Regression
Logistic regression is a Machine Learning algorithm that comes under the Supervised Learning technique. It is used
for predicting a categorical dependent variable from a given set of independent variables
Logistic regression is used for solving classification problems
In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic (sigmoid) function, whose output lies
between 0 and 1 and is interpreted as the probability of the class labelled 1
Code:
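from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
# x_train, x_test, y_train, y_test are assumed to come from an earlier train_test_split
LR = LogisticRegression()
LR.fit(x_train, y_train)      # train the model
pred = LR.predict(x_test)     # predict class labels for the test set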
print("Logistic Regression accuracy::","\n",accuracy_score(y_test,pred))
print("\n")
print(confusion_matrix(y_test,pred))
print("\n")
print(classification_report(y_test,pred))
6. Define Classification. Discuss the procedure of the KNN-Classifier to classify a Person X ( sugar level 190, Age 45) from
given case study of diabetic patients. -> Mid Question
Answer :
Classification
The method of arranging data into homogeneous classes according to the common features present in the data is known as
classification
Types of Classification
KNN Classifier
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a
similarity measure (distance function)
KNN Algorithm
Problem
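The KNN procedure can be sketched as: compute the distance from the new case to every stored case, pick the k nearest, and take a majority vote of their classes. The training cases below are hypothetical, not the case study from the question; only Person X (sugar level 190, age 45) comes from the question.

```python
import numpy as np
from collections import Counter

# Hypothetical training cases: (sugar level, age) and whether the patient is diabetic
X_train = np.array([[150, 50], [160, 40], [200, 55],
                    [110, 30], [210, 48], [120, 35]], dtype=float)
y_train = ["no", "no", "yes", "no", "yes", "no"]

def knn_predict(x, X, y, k=3):
    """Classify x by majority vote among its k nearest neighbours (Euclidean distance)."""
    dists = np.sqrt(((X - x) ** 2).sum(axis=1))   # distance to every stored case
    nearest = np.argsort(dists)[:k]               # indices of the k closest cases
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

person_x = np.array([190, 45], dtype=float)       # Person X from the question
label = knn_predict(person_x, X_train, y_train, k=3)
print("Person X is classified as:", label)
```

In practice the attributes are usually normalized first, since otherwise the attribute with the larger range (here sugar level) dominates the distance.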
7. Explain random forest with example and How to implement in Python?
Answer :
Random forests
is a supervised learning algorithm. It can be used both for classification and regression.
Random forest creates decision trees on randomly selected data samples, gets a prediction from each tree and selects
the best solution by means of voting
RF is an ensemble method used to improve the performance of decision trees (DT)
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction
Example - did not get
Code
from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def train(model, X, y):
    # split the data; test_size=0.7 keeps 30% for training and 70% for testing
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
    # train the model
    model.fit(x_train, y_train)
    # predict on the test set
    pred = model.predict(x_test)
    print("Model Report")
    print("RMSE:", sqrt(mean_squared_error(y_test, pred)))

# X (features) and y (target) are assumed to be already loaded
model3 = RandomForestRegressor()
train(model3, X, y)
8. Explain ridge regression with example and How to implement in Python?
Answer :
Ridge regression
is a model tuning method that is used to analyse any data that suffers from multicollinearity. This method performs L2
regularization
When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, which results in predicted
values being far away from the actual values. By adding a degree of bias to the regression estimates, ridge regression
reduces the standard errors.
Code
from math import sqrt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# indp_data (features) and target are assumed to be already loaded
X_train, X_test, Y_train, Y_test = train_test_split(indp_data, target, test_size=0.25, random_state=3405)
# Model initialization (the normalize argument was removed in scikit-learn 1.2,
# so scale the features beforehand, e.g. with StandardScaler, if needed)
regression_model = Ridge(random_state=100)
# Fit the data (train the model)
model = regression_model.fit(X_train, Y_train)
# Predict on the training data
y_predicted = model.predict(X_train)
# Model evaluation (on the training data)
rmse = sqrt(mean_squared_error(Y_train, y_predicted))
r2 = model.score(X_train, Y_train)
# Printing values
print('Slope:', model.coef_)
print('Intercept:', model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
9. What is Multi collinearity issue? How can you address this issue?
Answer :
Multicollinearity, or collinearity, is the existence of near-linear relationships among the independent variables
It occurs when independent variables in a regression model are correlated with each other
Issue
It can create inaccurate estimates of the regression coefficients, inflate the standard errors of the regression coefficients,
deflate the partial t-tests for the regression coefficients, give false, nonsignificant p-values, and degrade the
predictability of the model
How can it be measured ?
1. Tolerance
is the percentage of variance in an independent variable that is not accounted for by the other independent variables
2. Variance Inflation Factor (VIF)

indicates the degree to which standard errors are inflated due to the levels of collinearity
VIF = reciprocal of tolerance
How to address it
1. Remove highly correlated predictors from the model. If you have two or more factors with a high VIF, remove one
from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't
drastically reduce the R-squared. Consider using stepwise regression, best subsets regression, or specialized
knowledge of the data set to remove these variables. Select the model that has the highest R-squared value.
2. Use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number
of predictors to a smaller set of uncorrelated components.
3. Obtaining more data over an expanded range would cure a multicollinearity problem caused by data collection (i.e.
data collected from a narrow subspace of the independent variables)
4. The situation of an over-defined model (more variables than observations) should be avoided
5. Outlier-induced multicollinearity can be corrected by removing the outliers before ridge regression is applied
6. For structural multicollinearity
centering the variables is an efficient solution
7. For data multicollinearity
remove some highly correlated independent variables
linearly combine correlated independent variables, e.g. add them together
use LASSO or ridge regression
8. First check whether one of the predictor variables is a duplicate
9. Remove a redundant variable
10. Aggregate similar variables
11. Increase the sample size
10. Explain SVM with example and How to implement in Python?
Answer :
Support Vector Machines (SVM)

SVM is a supervised learning algorithm that is used for classification as well as regression problems.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space
into classes so that we can easily put a new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called
support vectors, and hence the algorithm is termed Support Vector Machine.
Code
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
# x_train, x_test, y_train, y_test are assumed to come from an earlier train/test split
svm = SVC()  # default RBF kernel
svm.fit(x_train, y_train)
# Predict on the held-out test set
pred = svm.predict(x_test)
print("svm accuracy::", "\n", accuracy_score(y_test, pred))
print("\n")
print(confusion_matrix(y_test, pred))
print("\n")
print(classification_report(y_test, pred))
11. Explain ANOVA with example and How to implement in Python?
Answer :
ANOVA ( Analysis of variance )

is a procedure for testing the differences among the means of different groups of data for homogeneity
ANOVA helps us decide whether to reject the null hypothesis (all group means are equal) in favour of the alternative hypothesis
ANOVA is of two types
1. One way ANOVA
only one factor is investigated
one independent variable (with two or more levels)
2. Two Way ANOVA
investigates two factors at the same time.
two independent variables (each can have multiple levels).
1. Two way ANOVA without replication
2. Two way ANOVA with replication
Example
refer below question
Code
did not get
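A possible sketch of one-way ANOVA in Python, offered because the code slot above is blank. It computes the F statistic by hand with NumPy. The three groups of scores under different noise levels are hypothetical values made up for illustration, and `one_way_anova_f` is a helper name introduced here. If SciPy is installed, `scipy.stats.f_oneway` reports the same F statistic together with a p-value.

```python
import numpy as np

def one_way_anova_f(*groups):
    """Return (F statistic, df between groups, df within groups)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores (assumption): questions solved under three noise levels
low_noise = [8, 9, 7, 8]
medium_noise = [6, 7, 6, 5]
high_noise = [3, 4, 2, 3]

f_stat, df_between, df_within = one_way_anova_f(low_noise, medium_noise, high_noise)
print("F =", f_stat, "with df =", (df_between, df_within))
```

The computed F is then compared with the critical value of the F distribution for the given degrees of freedom; a larger F leads to rejecting the null hypothesis of equal group means.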
12. What is ANOVA. Find the significance of the noise on solving questions from dataset given below
Answer :
ANOVA ( Analysis of variance )

is a procedure for testing the differences among the means of different groups of data for homogeneity
ANOVA helps us decide whether to reject the null hypothesis (all group means are equal) in favour of the alternative hypothesis
ANOVA is of two types
1. One way ANOVA
only one factor is investigated
one independent variable (with two or more levels)
2. Two Way ANOVA
investigates two factors at the same time.
two independent variables (each can have multiple levels).
1. Two way ANOVA without replication
2. Two way ANOVA with replication
Problem
1. Explain the mean vector and correlation with example
Answer :
Mean vector
consists of the means of each variable
mean vector is often referred to as the centroid
Example
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are related
Example
1. sales might increase when the marketing department spends more on TV advertisements
2. customer's average purchase amount on an ecommerce website might depend on a number of factors related
to that customer
Formula (Relation between Correlation & Covariance)
r(X, Y) = cov(X, Y) / (σ_X · σ_Y), where σ_X and σ_Y are the standard deviations of X and Y
Use of Correlation
understanding the relationships and subsequently building better business and statistical models
The Variance-Covariance matrix consists of the variances of the variables along the main diagonal and the covariances between
each pair of variables in the other matrix positions
It is also called the covariance matrix or dispersion matrix
Eg : Calculation of Mean Vector and Variance-Covariance Matrix for the matrix below
The set of 5 observations, measuring 3 variables, can be described by its mean vector and variance-covariance matrix.
The three variables, from left to right, are the length, width, and height of a certain object, for example. Each row vector Xi is
another observation of the three variables (or components)
The formula for computing the covariance of the variables X and Y is
cov(X, Y) = Σ (X_i - X̄)(Y_i - Ȳ) / (n - 1), where X̄ and Ȳ are the means of X and Y and n is the number of observations
Answer :
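A minimal NumPy sketch of the calculation described above. The 5 x 3 matrix of length, width, and height values is illustrative data assumed for this example (the matrix from the original notes is not reproduced here). The covariance is computed with the (n - 1) denominator and checked against `np.cov`, and the correlation matrix follows from dividing each covariance by the product of the two standard deviations.

```python
import numpy as np

# Illustrative observations (assumption): columns are length, width, height
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
])

# Mean vector (centroid): the mean of each variable
mean_vector = X.mean(axis=0)

# Variance-covariance matrix with the (n - 1) denominator
centered = X - mean_vector
cov_manual = centered.T @ centered / (X.shape[0] - 1)
cov_numpy = np.cov(X, rowvar=False)  # same result from the library routine

# Correlation = covariance / (std_x * std_y)
std = np.sqrt(np.diag(cov_manual))
corr = cov_manual / np.outer(std, std)

print("Mean vector:", mean_vector)
print("Covariance matrix:\n", cov_manual)
print("Correlation matrix:\n", corr)
```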
2. Explain Mahalanobis distance with example
Answer :
Mahalanobis distance
is the distance between two points in multivariate space
Why Mahalanobis?
In a regular Euclidean space, variables (e.g. x, y, z) are represented by axes drawn at right angles to each other; The
distance between any two points can be measured with a ruler.
For uncorrelated variables with unit variance, the Euclidean distance equals the MD. However, if two or more variables are correlated, the
axes are no longer at right angles, and the measurements become impossible with a ruler. In addition, if you have more
than three variables, you can’t plot them in regular 3D space at all. The MD solves this measurement problem, as it
measures distances between points, even correlated points, for multiple variables
Formula
MD(x, y) = sqrt( (x - y)^T S^(-1) (x - y) ), where S is the covariance matrix of the data
Refer this for example
https://www.youtube.com/watch?v=4buOoXp7AyI
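As a small worked illustration, the sketch below computes the Mahalanobis distance as sqrt((x - y)^T S^(-1) (x - y)), with S the covariance matrix estimated from the data. The two-variable data set and the helper name `mahalanobis` are assumptions made for this example.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between points x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff instead of forming the inverse explicitly
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical observations of two variables (assumption)
data = np.array([
    [2.0, 2.0], [2.0, 5.0], [6.0, 5.0], [7.0, 3.0], [4.0, 7.0],
    [6.0, 4.0], [5.0, 3.0], [4.0, 6.0], [2.0, 5.0], [1.0, 3.0],
])
centroid = data.mean(axis=0)
S = np.cov(data, rowvar=False)

point = np.array([6.0, 6.0])
print("Mahalanobis distance to centroid:", mahalanobis(point, centroid, S))
print("Euclidean distance to centroid:  ", np.linalg.norm(point - centroid))

# With an identity covariance matrix the two distances coincide
print(mahalanobis([0, 0], [3, 4], np.eye(2)))
```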
3. a). Define Data matrix, Euclidean distance, Mahalanobis distance, Precision matrix in Multi variate data analysis. b).
Find the mean vector for the given vectors u1 = (23, 56, 76) u2 = (123, 89, 64) and u3 = (98, 54, 78)
Answer :
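For part b), the mean vector is the component-wise average of the three given vectors: ((23 + 123 + 98)/3, (56 + 89 + 54)/3, (76 + 64 + 78)/3) = (244/3, 199/3, 218/3) ≈ (81.33, 66.33, 72.67). A short NumPy check of that arithmetic:

```python
import numpy as np

# The three vectors given in the question
u1 = np.array([23, 56, 76])
u2 = np.array([123, 89, 64])
u3 = np.array([98, 54, 78])

# Stack them as rows and average each column
mean_vector = np.vstack([u1, u2, u3]).mean(axis=0)
print("Mean vector:", mean_vector.round(2))
```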
4. Explain PCA with example and How to implement in Python?
Answer :
PCA is a dimensionality reduction method (algorithm)

Principal Components Analysis (PCA) is a technique that can be used to simplify a dataset
It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any
projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest
variance on the second axis, and so on.
PCA can be used for reducing dimensionality by eliminating the later principal components
It is often used as an exploratory technique to reduce the dimensionality of the data set to 2D or 3D
Can be used to:
Reduce number of dimensions in data
Find patterns in high-dimensional data
Visualize data of high dimensionality
Example applications:
Face recognition
Image compression
Gene expression analysis
Steps
1. Take the whole dataset consisting of d+1 dimensions and ignore the labels such that our new dataset becomes d
dimensional.
2. Compute the mean for every dimension of the whole dataset.
3. Compute the covariance matrix of the whole dataset.
4. Compute eigenvectors and the corresponding eigenvalues.
5. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d ×
k dimensional matrix W.
6. Use this d × k eigenvector matrix to transform the samples onto the new subspace.
Example -Refer question 5
Code -did not get
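One way the six steps above could be sketched with NumPy, here reducing a 2D data set to 1D. The data values and the function name `pca` are assumptions introduced for illustration; the labels are taken as already dropped (step 1).

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                   # step 2: mean of every dimension
    centered = X - mean
    cov = np.cov(centered, rowvar=False)    # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # step 4: eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]       # step 5: sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                      # d x k matrix W
    return centered @ W, W, eigvals         # step 6: transform the samples

# Illustrative 2D data (assumption)
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

projected, W, eigvals = pca(X, k=1)
print("Eigenvalues:", eigvals)
print("First principal component:", W[:, 0])
print("1D projection:", projected[:, 0].round(3))
print("Share of variance kept:", eigvals[0] / eigvals.sum())
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the sign of each eigenvector is arbitrary, so the projection may come out mirrored.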
5. What is the need for Dimensionality reduction. Reduce 2D to 1D for given dataset with PCA?
Answer :
6. What is the need of Dimensionality reduction. How do you identify the Principal components from given data set?
Answer :
Dimensionality reduction
Dimensionality reduction means reducing the dimensionality of high-dimensional data.
This approach reduces the high dimensionality by projecting the data to a lower-dimensional subspace, without losing
the essence of the original data.
When a dimensionality reduction approach is applied, the original features no longer exist and new features are
constructed from the original data using projections.
PCA is a dimensionality reduction method (algorithm)
How to identify the Principal components from given data set -> for example refer question 5
1. Take the whole dataset consisting of d+1 dimensions and ignore the labels such that our new dataset becomes d
dimensional.
2. Compute the mean for every dimension of the whole dataset.
3. Compute the covariance matrix of the whole dataset.
4. Compute eigenvectors and the corresponding eigenvalues.
5. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d ×
k dimensional matrix W.
6. Use this d × k eigenvector matrix to transform the samples onto the new subspace.
7. What are the key differences between PCA and MDS dimensionality reduction approaches. Give the steps in classical
Multi-Dimensional Scaling algorithm
Answer :
Steps
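As commonly described, PCA starts from the data (its covariance matrix), whereas classical MDS starts from a matrix of pairwise distances. The usual classical MDS steps are: square the distances, double-center the squared-distance matrix to get B = -1/2 · J · D² · J with J = I - (1/n)·11^T, take the eigen-decomposition of B, and use the top k eigenvectors scaled by the square roots of their eigenvalues as coordinates. The sketch below follows that outline; the four sample points and the name `classical_mds` are assumptions made for illustration.

```python
import numpy as np

def classical_mds(D, k):
    """Embed n objects in k dimensions from an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical points (assumption), used only to build a distance matrix
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(D, k=2)
D_recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print("Recovered coordinates:\n", coords.round(3))
print("Distances preserved:", np.allclose(D, D_recovered))
```

The recovered coordinates can differ from the original points by a rotation, reflection, or shift, since only the distances are given as input.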
8. Explain in brief the steps involved to implement Multidimensional scaling

Answer :
for more detail refer Unit 3 pdf from Pg 254
9. Define & Explain Stochastic process. Explain different types of stochastic process with suitable examples.
Answer :
10. Write short notes on i) Stochastic process, ii) Markov chain process, iii) Transition probability matrix
Answer :
1. Stochastic process - refer previous question

2. Markov chain process -refer next question
3. Transition probability matrix
11. Explain Markov Chain and their classification with example and How to implement in Python?
Answer :
A Markov chain is a mathematical system that experiences transitions from one state to another according to
certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present
state, the probabilities of the possible future states are fixed. In other words, the probability of transitioning to any particular state depends
solely on the current state and the time elapsed.
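A minimal Python sketch of a two-state Markov chain, assuming a made-up weather model. The state names and the transition probability matrix `P` are hypothetical; each row of `P` holds the probabilities of moving from one state to every state, so each row sums to 1.

```python
import numpy as np

states = ["Sunny", "Rainy"]          # hypothetical states (assumption)
P = np.array([[0.8, 0.2],            # from Sunny: stay Sunny / turn Rainy
              [0.4, 0.6]])           # from Rainy: turn Sunny / stay Rainy

def simulate(P, start, n_steps, rng):
    """Sample a path of state indices; the next state depends only on the current one."""
    path = [start]
    for _ in range(n_steps):
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

def distribution_after(P, initial, n_steps):
    """State distribution after n_steps: initial row vector times P^n."""
    return np.asarray(initial, dtype=float) @ np.linalg.matrix_power(P, n_steps)

rng = np.random.default_rng(42)
path = simulate(P, start=0, n_steps=10, rng=rng)
print("Simulated path:", [states[i] for i in path])

dist_2 = distribution_after(P, [1.0, 0.0], 2)
print("Distribution two steps after a Sunny day:", dist_2)
```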
Markov Chain Classification -> this is given as Classification of States so not 100% sure whether correct or not
1. What do you mean by well posed learning problem? Explain with example
Answer :
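Well Posed Learning Problem
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P
if its performance at tasks in T, as measured by P, improves with experience E.
Example :
Checkers game in detail
Task T: playing checkers
Performance measure P: percent of games won against opponents
Training experience E: playing practice games against itself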

2. What is a Well-posed problem? Discuss in detail the steps involved in designing a learning task
Answer :
Well Posed Learning Problem
Steps involved in designing a learning task

Detail Explanation
1. Choosing the training experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function

4. Choosing a Function Approximation Algorithm
5. The Final Design

Final Answer for Checkers Problem
3. What is Machine learning? Discuss different approaches in Machine Learning with suitable examples
Answer :
Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data.
It is seen as a part of artificial intelligence.
Three Approaches of ML :
1. Supervised Learning :
The supervised learning approach can be adopted when a dataset contains the records of the response variable
values (or labels). Depending on the context, this data with labels is usually referred to as “labeled data” or “training data.”
Eg :
We can mark e-mails as ‘spam’ or ‘not-spam’ based on the differentiating features of the previously seen
spam and not-spam e-mails, such as the lengths of the e-mails and use of particular keywords in the e-mails.
Learning from training data continues until the machine learning model achieves a high level of accuracy
on the training data.
Types : There are two main supervised learning problems:
1. Classification Problems
2. Regression Problems.
2. Unsupervised Learning :
Unsupervised learning is a learning approach used in ML algorithms to draw inferences from datasets, which do
not contain labels.
There are two main unsupervised learning problems:
1. Clustering
2. Dimensionality Reduction
Eg :
Recommender systems, which involve grouping together users with similar viewing patterns in order to
recommend similar content.
3. Reinforcement Learning :
Reinforcement learning is one of the primary approaches to machine learning concerned with finding optimal
agent actions that maximize the reward within a particular environment. The agent learns to perfect its actions
to gain the highest possible cumulative reward.
There are four main elements in reinforcement learning:
1. Agent: The trainable program which exercises the tasks assigned to it
2. Environment: The real or virtual universe where the agent completes its tasks.
3. Action: A move of the agent which results in a change of status in the environment
4. Reward: A negative or positive remuneration based on the action.
Eg :
self-driving cars
4. Discuss the Issues in machine learning

Answer :
5. Prove that the LMS weight update rule performs a gradient descent to minimize the squared error
Answer :
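A sketch of the standard argument, assuming the usual setup: a linear evaluation function \hat{V}(b) = \sum_i w_i x_i over features x_i of training instance b, training values V_{train}(b), squared error E, and a small learning rate \eta. This notation is assumed here because the answer slot is blank.

```latex
\begin{aligned}
E &\equiv \sum_{b} \bigl(V_{train}(b) - \hat{V}(b)\bigr)^2,
\qquad \hat{V}(b) = \sum_i w_i x_i \\[4pt]
\frac{\partial E}{\partial w_i}
  &= \sum_{b} 2\,\bigl(V_{train}(b) - \hat{V}(b)\bigr)\,
     \frac{\partial}{\partial w_i}\bigl(V_{train}(b) - \hat{V}(b)\bigr) \\
  &= -2 \sum_{b} \bigl(V_{train}(b) - \hat{V}(b)\bigr)\, x_i \\[4pt]
\text{LMS rule:}\quad
w_i &\leftarrow w_i + \eta\,\bigl(V_{train}(b) - \hat{V}(b)\bigr)\, x_i
  \;=\; w_i - \frac{\eta}{2}\,\frac{\partial E_b}{\partial w_i},
\qquad E_b = \bigl(V_{train}(b) - \hat{V}(b)\bigr)^2
\end{aligned}
```

Each LMS update therefore changes w_i in proportion to the negative gradient of the squared error for the current example, which is a (stochastic) gradient descent step that reduces E for sufficiently small \eta.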
6. Illustrate general-to-specific ordering of hypothesis in concept learning
Answer :
7. Implement an algorithm demonstrating tic-tac-toe learning approach
Answer :
8. Write FIND-S algorithm. Apply the FIND-S algorithm on given training samples to find the maximally specific
hypothesis.
Answer :
Find - S Algorithm :
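As usually stated, FIND-S starts from the most specific hypothesis, ignores negative examples, and for every positive example generalizes each attribute constraint just enough to cover that example. The sketch below follows that description; the weather-style training samples and the function name `find_s` are illustrative assumptions, since the samples from the question are not reproduced in these notes.

```python
def find_s(examples):
    """Return the maximally specific hypothesis consistent with the positive examples."""
    hypothesis = None                      # most specific hypothesis: covers nothing yet
    for attributes, is_positive in examples:
        if not is_positive:
            continue                       # FIND-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # first positive example is copied as-is
        else:
            # Keep matching values, replace mismatches with '?' (any value)
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis

# Illustrative training samples (assumption): (attribute values, is the label positive?)
training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

print(find_s(training))
```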
The Key Properties of the Find-S Algorithm :
Issues in find-S
9. Define Version space? Find the hypothesis for given training examples using Candidate elimination algorithm
Answer :
Version Space :
Problem :
10. Consider the given below training example which finds malignant tumours from MRI Scans
Show the specific and general boundaries of the version space after applying candidate Elimination algorithm (Note:
Malignant:+Ve and Benign : -Ve)
Answer :
11. Explain candidate-elimination learning algorithm using version spaces
Answer :
The Candidate-Elimination algorithm computes the version space containing all hypotheses from H that are consistent with
an observed sequence of training examples.
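A simplified sketch of the idea for conjunctive hypotheses over discrete attributes, where '?' means any value. It keeps a single specific boundary S and a list G of general hypotheses, and it assumes the first training example is positive so that S can guide how G is specialized. The training data and all function names are illustrative assumptions, not the full textbook algorithm.

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(a == "?" or a == v for a, v in zip(h, x))

def more_general(h1, h2):
    """True if h1 is at least as general as h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = None                      # specific boundary (set by the first positive example)
    G = [("?",) * n]              # general boundary starts with the most general hypothesis
    for x, is_positive in examples:
        x = tuple(x)
        if is_positive:
            G = [g for g in G if covers(g, x)]          # drop g inconsistent with x
            S = x if S is None else tuple(s if s == v else "?" for s, v in zip(S, x))
        else:
            if S is None:
                raise ValueError("this sketch assumes the first example is positive")
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)                     # already excludes the negative
                    continue
                # Minimally specialize g using values from S that exclude x
                for i in range(n):
                    if g[i] == "?" and S[i] != "?" and S[i] != x[i]:
                        new_G.append(g[:i] + (S[i],) + g[i + 1:])
            new_G = list(dict.fromkeys(new_G))          # remove duplicates
            # Keep only the maximally general members
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G

# Illustrative training examples (assumption)
training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(training)
print("S:", S)
print("G:", G)
```

The version space is then every hypothesis lying between S and the members of G in the general-to-specific ordering.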
12. Explain the concept of Inductive Bias in brief
Answer :
1. What is an Artificial Neural Network? Describe the architecture of ANN and explain the topological units of ANN.
Answer :
2. Types of ANN
Answer :
3. Activation functions
Answer :
An activation function is a function added to an artificial neural network to help the network learn complex
patterns in the data. Compared with the neuron-based model in our brains, the activation function sits at the end of a neuron,
deciding what is to be fired to the next neuron.
Activation functions help the network use the important information and suppress the irrelevant data points.
Purpose of Activation Function
Various activation functions(AF):

Why non-linear Activation Function -> Eg : Sigmoid
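A short NumPy sketch of three commonly used activation functions (sigmoid, tanh, ReLU), followed by a small check of why non-linearity matters: without an activation, two stacked linear layers collapse into a single linear layer. The weight matrices `W1`, `W2` and the input `x` are random values assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real input into the range (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Passes positive inputs through and sets negative inputs to 0."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
print("relu:   ", relu(z))

# Two linear layers with no activation are equivalent to one linear layer
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
stacked_linear = W2 @ (W1 @ x)
single_linear = (W2 @ W1) @ x
print("Linear layers collapse:", np.allclose(stacked_linear, single_linear))

# Inserting a non-linear activation between the layers breaks that equivalence
with_activation = W2 @ sigmoid(W1 @ x)
print("With sigmoid in between:", with_activation)
```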
4. Backpropagation -> Check from pdf

There are 3 things below; we don't know which one the question is about
1. BACKPROPAGATION Algorithm
2. Derivation of BACKPROPAGATION Rule
3. Remarks on the BACKPROPAGATION Algorithms
1. Convergence and Local Minima
2. Representational Power of Feedforward Networks
3. Hypothesis Space Search and Inductive Bias
4. Hidden Layer Representations
5. Generalization, Overfitting, and Stopping Criterion
