Vijaya ML

The document discusses analyzing election data to build models to predict which party voters will vote for. It provides details on exploratory data analysis conducted on a dataset with 1525 voters and 9 variables. Various classification models were applied including logistic regression, LDA, KNN, naive Bayes, and ensemble methods like gradient boosting, decision tree and random forest. Performance metrics like accuracy, confusion matrix and ROC curves were calculated and compared to determine the optimized model. The gradient boosting model showed the best performance with 89% accuracy on training data and 84% on test data. The document also discusses analyzing inaugural speeches from 3 US Presidents - Roosevelt, Kennedy and Nixon. Details like character, word and sentence counts are provided. Stopwords

Problem 1:

You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This
survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict which party
a voter will vote for on the basis of the given information, to create an exit poll that will help in predicting
overall win and seats covered by a particular party.

1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it. (4 Marks)

EDA (Exploratory Data Analysis)

The first step of the analysis is to import the necessary libraries and load the given data set.
To inspect the first few entries of the data set, we used head().
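
A minimal sketch of this loading and inspection step, assuming pandas is the library in use. The file name is not given in the report, so a tiny in-memory CSV stands in for it; the rows and the "age" column are illustrative only, not the survey's actual data.

```python
import io

import pandas as pd

# Stand-in for the survey file: the real file name is not stated in the
# report, and these three rows are made up purely for illustration.
csv_text = """Unnamed: 0,vote,age,Economic.cond.National,Blair,Hague,gender
1,Labour,43,3,4,1,female
2,Labour,36,4,4,4,male
3,Conservative,35,4,5,2,male
"""

# In practice this would be pd.read_csv("<path to the survey file>")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows of the data set
print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # which columns are numeric and which hold strings
print(df.isnull().sum())  # null value count per column
print(df.describe())      # descriptive statistics for the numeric columns
```

On the real data, `df.shape` and `df.dtypes` are the calls that would back the column count and data type statements in the inference.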

From the above result we infer that there are 10 columns in total, each with 1525 entries.
All the variables are of integer type except “vote” and “gender”, which are of object type.

To proceed further, we can remove the “unnamed” column, as it appears to be only a row index and
cannot be used in the analysis. After removing the “unnamed” column, our data set looks like this:

Data Description:

Checking for the duplicates:

Total number of duplicate rows = 8

The number of duplicate rows is very small, so we can drop them and proceed.
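
The two clean-up steps (dropping the unnamed column, then counting and dropping duplicate rows) could be sketched as follows. The frame is a made-up four-row example; only the column names "vote", "Blair", "Hague" and "gender" come from the report, and "Unnamed: 0" is an assumed name for the unnamed column.

```python
import pandas as pd

# Illustrative data only: rows 0, 1 and 3 are identical apart from the index column.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "Blair": [4, 4, 5, 4],
    "Hague": [1, 1, 2, 1],
    "gender": ["female", "female", "male", "female"],
})

# Drop the index-like column first; while it is present, no two rows can match.
df = df.drop(columns=["Unnamed: 0"])

# duplicated() flags every repeat of an earlier row.
n_dupes = int(df.duplicated().sum())
print("Total number of duplicate rows =", n_dupes)

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)
```

The order matters: duplicates only become visible once the running index column is gone.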

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Univariate Analysis and Outlier Check


Exploratory Data Analysis is mainly performed using the following methods:

Univariate analysis: provides summary statistics for each field in the raw data set, i.e. a
summary of one variable at a time. Ex: CDF, PDF, box plot.

Bivariate analysis: is performed to find the relationship between each variable in the dataset and
the target variable of interest, i.e. the relationship between two variables. Ex: box plot, violin plot.

Multivariate analysis: is performed to understand interactions between different fields in the
dataset, i.e. interactions among more than two variables. Ex: pair plot and 3D scatter plot.
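
The report does not state which rule was used for the outlier check. As one possible sketch, the helper below applies the 1.5 x IQR rule that a box plot uses for its whiskers; the function name `iqr_outliers` and the sample values (including the artificial value 12) are hypothetical.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up ratings; the 12 is planted so that the rule has something to flag.
ratings = pd.Series([3, 3, 4, 3, 2, 3, 4, 3, 3, 12], name="Economic.cond.National")

print(ratings.describe())     # univariate summary statistics
print(iqr_outliers(ratings))  # values a box plot would draw as outlier points
```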
Univariate Analysis:
Histogram:

1. Economic.cond.National:
Multivariate Analysis:
Heat Map:

The heat map shows no strong correlation between any pair of variables.
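
A heat map of this kind is normally drawn from a pairwise correlation matrix. The sketch below computes that matrix with pandas on made-up values; the plotting step (for example with seaborn's `heatmap`) is left out, and the choice of Pearson correlation is an assumption, since the report does not name the measure.

```python
import pandas as pd

# Illustrative values only; column names are taken from the report.
df = pd.DataFrame({
    "Economic.cond.National": [3, 4, 4, 2, 5, 3],
    "Blair": [4, 4, 5, 2, 1, 3],
    "Hague": [1, 4, 2, 1, 5, 2],
    "gender": ["female", "male", "male", "female", "male", "female"],
})

# Correlation is only defined for numeric columns, so string columns are excluded.
corr = df.select_dtypes(include="number").corr()  # Pearson by default
print(corr.round(2))
```

Values close to 0 off the diagonal are what would support a "no strong correlation" reading.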

Data Preparation:
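1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).

Encoding the dataset.

Scaling is necessary for the KNN model, because KNN is distance-based and variables measured on
larger scales would otherwise dominate the distance.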
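
The encoding, 70:30 split and scaling steps might look like the sketch below. It assumes scikit-learn and integer category codes for the two string columns; the data, the "age" column, `random_state=1` and the stratified split are illustrative choices that the report does not confirm.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data with the two string columns named in the report.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "age": list(range(30, 50)),
    "Blair": [4, 2, 5, 1, 4, 2, 3, 1, 5, 2] * 2,
    "Hague": [2, 4, 1, 5, 2, 4, 2, 4, 1, 5] * 2,
    "gender": ["female", "male", "male", "female", "male"] * 4,
})

# Encode string columns as integer codes (categories are sorted alphabetically,
# so Conservative -> 0 and Labour -> 1).
for col in ["vote", "gender"]:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop(columns=["vote"])
y = df["vote"]

# 70:30 train/test split, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training rows alone keeps information from the test set out of the model inputs.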


1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)

MODEL 1: LOGISTIC REGRESSION

We first fit a logistic regression model to the training data.


We then predict on the training and the testing data.

After predicting, we compute the accuracy on the training and testing data.

Training set Accuracy:

Testing set Accuracy:

Confusion matrix and classification report for training data:


Confusion matrix and classification report for test data:

Based on the training and testing accuracy, the model is good to use.
The precision and recall values are also good.

MODEL 2: LDA
First we fitted the LDA model to the training data. We then predicted on the training and the
testing data.

Train accuracy:

Test Accuracy:
Confusion matrix and classification report for training set:

Confusion matrix and classification report for testing set:

The LDA model also has good accuracy and good precision values.
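
The fit, predict and evaluate cycle described for logistic regression and LDA can be sketched with scikit-learn as below. Synthetic data from `make_classification` stands in for the election data, so the printed numbers are not the report's results; `max_iter=1000` and `random_state=1` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the encoded survey data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # fit on the training data
    train_pred = model.predict(X_train)    # predict on both parts
    test_pred = model.predict(X_test)
    results[name] = (
        accuracy_score(y_train, train_pred),
        accuracy_score(y_test, test_pred),
    )
    print(name, "train/test accuracy:", results[name])
    print(confusion_matrix(y_test, test_pred))       # rows: actual, columns: predicted
    print(classification_report(y_test, test_pred))  # precision, recall, f1 per class
```

Comparing the train and test accuracy side by side is what shows whether a model overfits.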
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)

MODEL 3: KNN

Fitting the KNN model to the training data

Predicting on the training and the testing sets:

Accuracy for training set:

Accuracy for testing set:

Confusion matrix and classification report for training set:


Confusion matrix and classification report for testing set:

Based on our study, the KNN model has good accuracy on both the training and the testing sets,
with a good precision score.

NAÏVE BAYES MODEL:

After fitting the model to the dataset, the prediction results are as follows:

Training set Accuracy:

Testing set Accuracy:


Confusion matrix and classification report for training data:

Confusion matrix and classification report for testing data:
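
A sketch of the KNN and naive Bayes steps, again on synthetic data rather than the survey data. Wrapping KNN in a pipeline with a scaler reflects the scaling requirement noted earlier; `n_neighbors=5` and the Gaussian variant of naive Bayes are assumptions, since the report does not state either.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# KNN is distance-based, so features are standardised inside the pipeline.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# Gaussian naive Bayes models each feature per class and needs no scaling.
nb = GaussianNB()

accuracies = {}
for name, model in {"KNN": knn, "Naive Bayes": nb}.items():
    model.fit(X_train, y_train)
    accuracies[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, accuracies[name])
```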

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting. (7
marks)
ADA BOOSTING:

The predictions for the training set, along with the accuracy, classification report and confusion
matrix of AdaBoost, are as follows:

The predictions for the testing set, along with the accuracy, classification report and confusion
matrix of AdaBoost, are as follows:

GRADIENT BOOSTING:

Performance metrics on train data set:

Performance metrics on test data set:

DECISION TREE:

Performance metrics on train data set:


Performance metrics on test data set:

RANDOM FOREST:

Performance metrics on train data set:

Performance metrics on test data set:


BAGGING:

Performance metrics on train data set:

Performance metrics on test data set:
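
The boosting, tree, random forest and bagging models listed above could be fitted in one loop, as in this sketch on synthetic data. All hyperparameters are assumptions, and the small grid search at the end is only one hypothetical way to do the "model tuning" the question asks for; the report does not show its tuning settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    # Bagging with a random forest as the base estimator, as the question requires.
    "Bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=10, random_state=1),
        n_estimators=5,
        random_state=1,
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test)),
    )
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")

# Hypothetical tuning step: choose the tree depth by 3-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5, 7]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```

A large gap between train and test accuracy for a tree-based model is the usual sign that it needs tuning such as a depth limit.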

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.

LOGISTIC REGRESSION:

Confusion matrix:
AUC on train and test sets and ROC curve:

LDA:

Confusion matrix and classification report:


AUC and ROC curve:

KNN MODEL:

Confusion matrix and classification report:


AUC and ROC curve:

NAÏVE BAYES MODEL:

Confusion matrix and classification report:

AUC and ROC curve:
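
The AUC score and the points of the ROC curve come from the predicted class probabilities rather than the predicted labels. A minimal sketch for one model, on synthetic data (logistic regression is used only as an example; the same calls apply to any model with `predict_proba`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

auc_scores = {}
for label, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    # Probability of the positive class (second column of predict_proba).
    proba = model.predict_proba(X_part)[:, 1]
    auc_scores[label] = roc_auc_score(y_part, proba)
    # fpr and tpr are the x and y coordinates of the ROC curve.
    fpr, tpr, thresholds = roc_curve(y_part, proba)
    print(f"{label}: AUC = {auc_scores[label]:.3f}, {len(fpr)} ROC points")
```

Plotting `fpr` against `tpr` (for instance with matplotlib) gives the ROC curve; an AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.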


Model comparison:

Among all the models, gradient boosting shows the highest accuracy: 89% on the training set and 84% on
the testing set. Its precision and recall are also good.

Inference:

The most important variables are “Hague” and “Blair”. People gave 4 stars to Blair and 2 stars to
Hague.

Problem 2:
In this project, we work on the inaugural corpus from nltk in Python. We look at the following
speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

(Hint: use .words(), .raw(), .sents() for extracting counts)


2.1 Find the number of characters, words, and sentences for the mentioned documents.
Roosevelt:

Number of characters:
Number of words:

Number of Sentences:

Kennedy:

Number of characters:

Number of words:

Number of sentences:

Nixon:

Number of characters:

Number of words:

Number of sentences:
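
Following the hint, the three counts per speech could be obtained as sketched below. `speech_counts` and `inaugural_counts` are hypothetical helper names; the file ids are those used by nltk's inaugural corpus, and the corpus plus a sentence tokenizer model must already have been fetched with `nltk.download(...)` for the second function to work.

```python
def speech_counts(corpus, fileid):
    """Count characters, words and sentences of one document in an nltk-style corpus."""
    return {
        "characters": len(corpus.raw(fileid)),   # raw() returns the text as one string
        "words": len(corpus.words(fileid)),      # tokens, punctuation marks included
        "sentences": len(corpus.sents(fileid)),  # each sentence is a list of tokens
    }


def inaugural_counts():
    """Counts for the three speeches; needs nltk and its 'inaugural' corpus installed."""
    from nltk.corpus import inaugural

    fileids = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]
    return {fileid: speech_counts(inaugural, fileid) for fileid in fileids}
```

Note that `.words()` counts punctuation tokens as words, so the word counts are higher than a count of alphabetic words alone would be.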

2.2 Remove all the stopwords from all three speeches. (3 Marks)
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention
the top three words. (after removing the stopwords)

Roosevelt:

The word “National” occurs most.

Kennedy:
The most frequent words are “world”, “sides”, “new”.

Nixon:

The most frequent words are “America”, “Peace”, “World”.
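
Stopword removal followed by a frequency count can be sketched with the standard library alone. `top_words` is a hypothetical helper, and the word and stopword lists below are toy values; in practice the words would come from `inaugural.words(fileid)` and the stop list presumably from `nltk.corpus.stopwords.words("english")`, which the report does not show.

```python
from collections import Counter


def top_words(words, stop_words, n=3):
    """Return the n most frequent words after lowercasing and removing stopwords."""
    stop = {w.lower() for w in stop_words}
    # Keep alphabetic tokens only, so punctuation is not counted as a word.
    cleaned = [w.lower() for w in words if w.isalpha() and w.lower() not in stop]
    return Counter(cleaned).most_common(n)


# Toy input, not taken from any speech.
sample_words = "the world and the nation and the world".split()
sample_stop = ["the", "and"]

print(top_words(sample_words, sample_stop))  # [('world', 2), ('nation', 1)]
```

Lowercasing before counting merges variants such as "World" and "world", which changes which words end up in the top three.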
