LDA KNN Logistic
TABLE OF CONTENTS
Executive Summary
Problem Statement 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
Problem Statement 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
Executive Summary
In this report we apply machine learning techniques to our dataset, compare the different models and draw insights for Problem Statement 1; for Problem Statement 2 we remove stopwords and plot a word cloud for each speech.
Problem Statement 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.
Problem 1.1
Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
All the attributes are continuous, with the float data type.
There are 1525 rows and 9 columns, i.e. attributes, in the given dataframe.
Checking the information and missing values in the data
Missing values
Table 2. Statistical Analysis - Problem 1
• From the above table we can see that the age of the people who participated in voting lies in the range of 24 to 93.
• The ratings are in the range of 1 to 5 for Blair, Hague, economic cond national and economic cond household.
We have dropped the 'Unnamed: 0' column as it had no significance in our analysis.
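The read-in, null check and column drop can be sketched as follows. This is a minimal sketch: the tiny inline frame below stands in for the actual 1525-row survey CSV, whose filename is not given in the report.

```python
import pandas as pd

# Tiny stand-in frame so the snippet is self-contained; the real report
# reads the 1525-row survey from its CSV file.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "Blair": [3, 4, 4],
    "Hague": [1, 4, 2],
})

# Drop the serial-number column, which has no analytical significance
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)           # (rows, columns) after the drop
print(df.isnull().sum())  # null value condition check
print(df.describe())      # descriptive statistics for numeric columns
```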
Problem 1.2
Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate Analysis:
Fig 3. Univariate Analysis - Histogram
• Performed univariate analysis to check the distribution of the data using histograms, and also checked for outliers using box plots.
• We can see from the above histograms that the dataset is approximately normally distributed for all our attributes.
• And we can see that there are outliers in two columns, i.e., economic cond national and economic cond household.
Bivariate Analysis
Fig 4. Bivariate Analysis(Pairplot) - Problem 1
Multivariate Analysis
Fig 5. Multivariate Analysis - Correlation plot
• From the above heatmap we can infer that there is no significant correlation among the
attributes given in the dataset.
Outlier Treatment:
• As we could see from the box plots, there were outliers which needed to be treated. Below is the box plot after the treatment.
Fig 6. Box Plot - Outlier Treatment
• So after the treatment, we can see that there are no outliers present now.
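A common way to perform this treatment is IQR-based capping, which clips each value to the 1.5 × IQR whiskers used by the box plot. The sketch below uses synthetic stand-in values, since the report does not show its exact code.

```python
import pandas as pd

# Synthetic stand-in values for one of the rating columns; the column name
# follows the dataset description in the report.
df = pd.DataFrame({"economic.cond.national": [3, 4, 3, 1, 5, 3, 4, 1, 3, 3]})

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside the 1.5*IQR whiskers of the box plot."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df["economic.cond.national"] = cap_outliers(df["economic.cond.national"])
print(df["economic.cond.national"].min(), df["economic.cond.national"].max())
```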
Problem 1.3
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).
We have encoded the data using the one-hot encoding method for the categorical columns we have, i.e., vote and gender.
Dataset after Encoding
After encoding the data, we split it into a training dataset and a test dataset in the ratio of 70:30 for our further analysis.
Scaling is not necessary for this dataset, except for the KNN model, which is distance-based and therefore sensitive to feature scale.
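The encoding and split can be sketched as below. The rows are synthetic stand-ins, and `drop_first` and `random_state` are illustrative choices, not necessarily those used in the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in rows; column names follow the report (vote, gender)
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 5,
    "gender": ["male", "female"] * 5,
    "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
})

# One-hot encode the predictor; drop_first avoids the dummy-variable trap
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)

X = encoded.drop(columns=["vote"])
y = encoded["vote"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)  # 70:30 split
```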
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
LDA Model:
Confusion Matrix for Training and Test Data:
Fig 8. AUC and ROC Curve - LDA
Fig 10. Classification Report - LR
• From both models we can see that the AUC score is almost the same, i.e., approximately 89% for both the train and test datasets.
• Accuracy is also 84% and 82% respectively.
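The LDA and logistic regression comparison above can be sketched as follows. Synthetic data from `make_classification` stands in for the voter survey, so the printed AUC values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey: 8 numeric predictors, binary target
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

aucs = {}
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("LR", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {aucs[name]:.3f}")
```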
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Naive Bayes:
Fig 11. Confusion Matrix & Classification Report - Naive Bayes - Train
Fig 12. Confusion Matrix & Classification Report - Naive Bayes - Test
KNN:
• We have used the plot of misclassification error vs. k (with the k value on the X-axis) to find the optimum k value for our analysis.
• Based on the plot, we can see that the misclassification error is least for K=15; hence we proceed to build the model with K=15.
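The k-search and the Naive Bayes fit can be sketched as below, again on synthetic stand-in data; the KNN features are scaled, as noted earlier, and the odd-k grid is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for the survey
X, y = make_classification(n_samples=400, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=2
)

# Scale features for KNN only (distance-based model)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Misclassification error vs. k; odd k avoids ties in binary voting
errors = {}
for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    errors[k] = 1 - knn.score(X_test_s, y_test)
best_k = min(errors, key=errors.get)
print("optimum k:", best_k)

# Naive Bayes needs no scaling
nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes test accuracy:", round(nb.score(X_test, y_test), 3))
```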
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
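The tuning step can be sketched with scikit-learn's `GridSearchCV`. This is a minimal sketch on synthetic stand-in data; the parameter grid shown is an illustrative assumption, not the exact grid used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the survey
X, y = make_classification(n_samples=300, n_features=8, random_state=3)

# Illustrative grid; liblinear supports both l1 and l2 penalties
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV AUC:", round(grid.best_score_, 3))
```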
LDA Tuning:
Fig 16. Classification Report - Tuning LDA
LR Tuning:
Fig 18. Confusion Matrix - Tuning LR
Fig 20. AUC and ROC Curve (Training and Test) - LR Tuning
KNN Tuning:
Fig 19. Confusion Matrix - KNN Tuning
Fig 23. AUC and ROC Curve (Training and Test) - KNN Tuning
Bagging (Random Forest):
Fig 24. Confusion Matrix & Classification Report - Test - Bagging
ADA Boosting:
Fig 26. Confusion Matrix & Classification Report - Train - ADA Boosting
Confusion Matrix and Classification Report for Test Data:
Fig 27. Confusion Matrix & Classification Report - Test - ADA Boosting
Gradient Boosting:
Fig 29. Confusion Matrix & Classification Report - Train - Gradient Boosting
Fig 30. Confusion Matrix & Classification Report - Test - Gradient Boosting
1.8 Based on these predictions, what are the insights?
• The Gradient Boosting model is the best model, with an AUC score of 95% for the training data and 89% for the test data.
• We have used different models, but each model has its own advantages and disadvantages.
• Linear discriminant analysis makes more assumptions about the underlying data; hence logistic regression is considered the more flexible and more robust method when these assumptions are violated.
• A general difference between KNN and the other models is the large amount of real-time computation KNN needs compared to the others. KNN vs. Naive Bayes: Naive Bayes is much faster than KNN because KNN defers all its computation to prediction time.
• Bagging and Boosting are two types of ensemble learning. Both decrease the variance of a single estimate by combining several estimates from different models, so the result may be a model with higher stability. We could see this with the help of the AUC score of each of the models.
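The bagging-vs-boosting comparison by AUC can be sketched as below, on synthetic stand-in data rather than the survey itself, so the printed scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the survey
X, y = make_classification(n_samples=500, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4
)

models = {
    "Random Forest (bagging)": RandomForestClassifier(random_state=4),
    "AdaBoost": AdaBoostClassifier(random_state=4),
    "Gradient Boosting": GradientBoostingClassifier(random_state=4),
}
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {aucs[name]:.3f}")
```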
Insights:
• The age of the voters who participated falls in the range of 24 to 93; however, the 30-70 age group is more likely to participate in voting.
• The voters who gave 4 stars to Blair are the same voters who gave 2 stars to Hague.
• The Labour party received more votes than the Conservative party.
Problem Statement 2
In this particular project, we are going to work on the inaugural corpus from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
We computed char_count (the number of characters), word_count (the number of words) and sents_count (the number of sentences) for each of the three speeches.
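The counting can be sketched as below. The report reads the full texts from nltk's inaugural corpus; here a short stand-in line and simple regex-based sentence splitting keep the snippet runnable without any nltk download.

```python
import re

# Stand-in line, not the actual inaugural text from nltk's corpus
speech = ("Ask not what your country can do for you. "
          "Ask what you can do for your country.")

char_count = len(speech)                            # number of characters
word_count = len(speech.split())                    # number of words
sents_count = len(re.findall(r"[.!?]+", speech))    # simple sentence count
print(char_count, word_count, sents_count)
```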
2.2 Remove all the stopwords from all three speeches.
With the help of the nltk.corpus package, we can use the stopwords list to remove the stopwords from the given speeches.
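The removal step can be sketched as below. A small hand-written stopword set stands in for `nltk.corpus.stopwords.words("english")` so that the snippet runs without an nltk download.

```python
# Stand-in stopword set; the report uses nltk.corpus.stopwords.words("english")
stop_words = {"the", "of", "and", "to", "a", "in", "for",
              "is", "not", "what", "can", "do", "you", "your"}

speech = "Ask not what your country can do for you"
filtered = [w for w in speech.lower().split() if w not in stop_words]
print(filtered)  # → ['ask', 'country']
```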
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention
the top three words. (after removing the stopwords)
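The top-three count after stopword removal can be sketched with `collections.Counter`; the token list below is a stand-in for a cleaned speech, not the actual corpus.

```python
from collections import Counter

# Stand-in tokens for a speech after stopword removal
tokens = ["nation", "people", "nation", "freedom", "people", "nation", "world"]

top_three = Counter(tokens).most_common(3)
print(top_three)  # → [('nation', 3), ('people', 2), ('freedom', 1)]
```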
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
Fig 33. Word cloud for Kennedy's Speech
Fig 34. Word cloud for Nixon's Speech