
SCHOOL OF INFORMATION SCIENCE

BACHELOR OF COMPUTER SCIENCE ENGINEERING

Lab Manual

Course Code CSE3036

Course Name Predictive Analytics

Credit Structure 2-0-2-3

Year / Semester III/VI

Specialization B.Tech- Data Science

Prepared by Dr. Harishkumar K S


VISION OF PRESIDENCY SCHOOL OF INFORMATION SCIENCE

To be a value-based, practice-driven School of Information Science, committed to developing globally-competent professionals, dedicated to applying Modern Information Science for Social Benefit.

MISSION OF PRESIDENCY SCHOOL OF INFORMATION SCIENCE

 Cultivate a practice-driven environment with an Information-Technology-based pedagogy, integrating theory and practice.
 Attract and nurture world-class faculty to excel in Teaching and Research, in the Information Science Domain.
 Establish state-of-the-art facilities for effective Teaching and Learning experiences.
 Promote Interdisciplinary Studies to nurture talent for global impact.
 Instil Entrepreneurial and Leadership Skills to address Social, Environmental and Community-needs.

PROGRAM OUTCOMES

PO-1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems. - High
PO-2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems
reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO-3: Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public health and
safety, and the cultural, societal, and environmental considerations.
PO-4: Conduct investigations of complex problems: Use research-based knowledge and research methods including
design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid
conclusions. - High
PO-5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and
IT tools including prediction and modeling to complex engineering activities with an understanding of the
limitations. - High
PO-6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health,
safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO-7: Environment and sustainability: Understand the impact of the professional engineering solutions in societal
and environmental contexts, and demonstrate the knowledge of and need for sustainable development.
PO-8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
PO-9: Individual and teamwork: Function effectively as an individual, and as a member or leader in diverse teams,
and in multidisciplinary settings. - High
PO-10: Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions. - High
PO-11: Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one's own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments. - High
PO-12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent
and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES:

At the end of the B. Tech. Program in Computer Science and Engineering (Data Science) the students shall:
PSO 01: [Problem Analysis]: Identify, formulate, research literature, and analyze complex engineering problems
related to Data Science principles and practices, Programming and Computing technologies reaching substantiated
conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PSO 02: [Design/development of Solutions]: Design solutions for complex engineering problems related to Data
Science principles and practices, Programming and Computing technologies and design system components or
processes that meet the specified needs with appropriate consideration for the public health and safety, and the
cultural, societal, and environmental considerations.
PSO 03: [Modern Tool usage] : Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities related to Data
Science principles and practices, Programming and Computing technologies with an understanding of the
limitations.


COURSE DESCRIPTION:

The Predictive Analytics course is conceptual in nature. Students will benefit from this course by learning modern data-analytics concepts and developing the skills to analyze and synthesize data sets for decision making in firms.
COURSE OBJECTIVES

The objective of the course is to familiarize the learners with the concepts of Predictive Analytics and attain
Employability through Experiential Learning techniques.

COURSE OUTCOMES: On successful completion of the course, the students shall be able to:

TABLE 1: COURSE OUTCOMES

CO Number   CO                                                                                      Expected BLOOMS LEVEL
CO 1        Define the nature of analytics and its applications.                                    Remember
CO 2        Summarize the concepts of predictive analytics and data mining.                         Understand
CO 3        Construct the analytical tools in business scenarios to achieve competitive advantage.  Apply
CO 4        Build the real-world insights in decision trees and time series analysis methods in     Apply
            dynamic business environment.

MAPPING OF C.O. WITH P.O. [H-HIGH , M- MODERATE, L-LOW]

TABLE 2: CO PO Mapping ARTICULATION MATRIX

CO. No   PO1   PO2   PO3   PO4   PO5   PO6   PO7   PO8   PO9   PO10   PO11   PO12

CO1 H -- M M -- -- -- M M M --
CO2 H -- M H -- -- -- M M H --
CO3 M -- M H -- -- -- H H M --
CO4 H -- M H -- -- -- M H H --
CO5 H -- M M -- -- -- M M M --

TABLE 2b: CO PSO Mapping ARTICULATION MATRIX

CO. No PSO1 PSO2 PSO3

CO1 H - -

CO2 H - -

CO3 - - H

CO4 - L -

CO5 L M -

Assessment Component

SL. No   Assessment type        Contents           Course outcome Number   Marks   Weightage   Tentative Dates
1        CA1                    Module 1           CO1, CO4                15      15%         15-02-2025
2        CA2                    Module 2           CO1, CO4                15      15%         17-03-2025
3        Mid-Term               Module 1 & 2       CO1, CO2, CO3           20      20%         24-03-2025 - 28-03-2025
4        Project                Module 1, 2, 3, 4  CO1, CO2, CO3, CO4      20      20%         16-04-2025 - 08-05-2025
5        End-term examination   Module 1, 2, 3, 4  CO1, CO2, CO3, CO4      30      30%         12-05-2025 - 16-05-2025

List of Experiments

S.No Name of the Experiment


1. Predicting house price using ML method.
2. Predicting buying behaviour for people interested in buying a car and whether or not
they bought the car.
3. Predicting if a person would buy life insurance or not
4. Predict survival based on certain parameters decision tree
5. Build decision tree model to predict survival based on certain parameters
6. Identify whether an email is ham or spam
7. Naive Bayes: Predicting survival from titanic crash
8. Prediction of system temperature using Polynomial vs Linear Regression algorithms
9. Prediction of air passengers using ARIMA Model
10. Equipment maintenance Prediction
11. Air quality prediction using linear regression
12. Predicting the direction of stock market prices using random forest.
13. Market Price Prediction
14. Medical Insurance Price Prediction Using Machine learning Algorithms
15. Flight Price Prediction Using Machine learning Algorithms
SPECIFIC GUIDELINES TO STUDENTS:

 Maintain Minimum 75% Attendance: Regular attendance is mandatory to ensure continuity in learning and understanding of the
experiments.
 Missing lab sessions can lead to gaps in understanding and affect your ability to complete experiments and assignments.
 Carefully follow the instructions provided by the course instructor during both lectures and lab sessions.
 Ensure all assignments, reports, and projects are submitted before the deadline. Late submissions may result in penalties.
 Treat all lab equipment, including computers and peripherals, with care to avoid damage.
Experiment No 1

Name of the Experiment : Predicting house price using ML method.
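No worked listing is given here, so the following is a minimal sketch of one way this experiment could be approached, assuming a hypothetical homeprices.csv file with an 'area' column (square feet) and a 'price' column; the file name and column names are assumptions, not part of the original manual.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# assumed file and column names
df = pd.read_csv('homeprices.csv')
X = df[['area']]      # feature: house area in square feet
y = df['price']       # target: sale price

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# fit a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate on the held-out data and predict the price of a 3000 sq. ft. house
y_pred = model.predict(X_test)
print('R^2 on test set:', r2_score(y_test, y_pred))
print('Predicted price for 3000 sq. ft.:', model.predict(pd.DataFrame({'area': [3000]}))[0])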


Experiment No 2

Name of the Experiment : Predicting buying behaviour for people interested in buying a car and whether or not they bought the car

#Random Forest Classification, decision tree classification

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

To buy or not to buy?

This dataset contains information about the age, gender and annual salary of people interested in buying a car and whether or not they bought the car. The
data will be used to train a model to help predict if future customers will buy the car or not.

The trained model


The model trained below can be used to predict whether or not someone will purchase a car based on their age and annual salary. This can be useful in a
number of ways. For example, a car dealership can use this model to predict whether a potential customer is likely to purchase a

car based on their age and salary, and then target their advertising and sales efforts accordingly. This can help the dealership to more efficiently allocate their
resources and increase their sales. Similarly, an auto manufacturer can use this model to gain insights into their target market and tailor their product offerings
to better meet the needs of their customers.

df = pd.read_csv('car_data.csv')

display(df.head())

   User ID  Gender  Age  AnnualSalary  Purchased
0 385 Male 35 20000 0

1 681 Male 40 43500 0

2 353 Male 49 74000 0

3 895 Male 40 107500 1

4 661 Male 25 79000 0

X = df.iloc[:400, [2,3]].values
y = df.iloc[:400, -1].values
print(X)
[[ 35 20000]
[ 40 43500]
[ 49 74000]
[ 40 107500]
[ 25 79000]
[ 47 33500]
[ 46 132500]
[ 42 64000]
[ 30 84500]
[ 41 52000]
[ 42 80000]
[ 47 23000]
[ 32 72500]
[ 27 57000]
[ 42 108000]
[ 33 149000]
[ 35 75000]
[ 35 53000]
[ 46 79000]
[ 39 134000]
[ 39 51500]
[ 49 39000]
[ 54 25500]
[ 41 61500]
[ 31 117500]
[ 24 58000]
[ 40 107000]
[ 40 97500]
[ 48 29000]
[ 38 147500]
[ 45 26000]
[ 32 67500]
[ 37 62000]
[ 41 79500]
[ 44 113500]
[ 47 41500]
[ 38 55000]
[ 39 114500]
[ 42 73000]
[ 26 15000]
[ 21 37500]
[ 59 39500]
[ 39 66500]
[ 43 80500]
[ 49 86000]
[ 37 75000]
[ 49 76500]
[ 28 123000]
[ 59 48500]
[ 40 60500]
[ 38 99500]
[ 51 35500]
[ 55 130000]
[ 23 56500]
[ 49 43500]
[ 49 36000]
[ 48 21500]
[ 49 98500]

sampled_df = df.sample(n=400)
X = sampled_df.iloc[:, [2,3]].values
y = sampled_df.iloc[:, -1].values
print(X)

[[ 41 71000]
[ 20 49000]
[ 30 29500]
[ 39 72500]
[ 19 87500]
[ 31 63500]
[ 48 119000]
[ 53 143000]
[ 33 151500]
[ 42 60500]
[ 18 44000]
[ 55 116500]
[ 36 40500]
[ 51 140500]
[ 26 15000]
[ 32 117000]
[ 48 21500]
[ 23 28500]
[ 33 136500]
[ 29 43000]
[ 30 15000]
[ 39 81500]
[ 18 52000]
[ 37 127500]
[ 38 58500]
[ 35 75000]
[ 39 71000]
[ 38 79500]
[ 40 60000]
[ 26 17000]
[ 59 106500]
[ 56 131500]
[ 25 33500]
[ 43 109500]
[ 30 89000]
[ 42 80000]
[ 21 72000]
[ 54 26000]
[ 43 133000]
[ 31 90500]
[ 27 44500]
[ 27 20000]
[ 24 29500]
[ 31 16500]
[ 40 79500]
[ 28 90500]
[ 37 53500]
[ 36 80500]
[ 45 75500]
[ 40 57000]
[ 28 55500]
[ 33 60000]
[ 33 69000]
[ 19 19000]
[ 26 43000]
[ 51 136500]
[ 41 48500]
[ 60 42000]
# splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# feature scaling
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
classifier = LogisticRegression()
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = y_train.ravel()
classifier.fit(X_train, y_train)

LogisticRegression()

print(X_train)

[[ 1.86185204  2.03821389]
 [ 0.10108128 -0.01879763]
 [ 0.10108128 -0.31265642]
 [ 0.00326069 -0.4385959 ]
 [-0.77930409 -1.53007141]
 [ 0.29672248  1.03069804]
 ...
 [ 1.07928726  0.47096702]]

print(classifier.predict(sc.transform([[49, 74000]])))

[1]

y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),
y_test.reshape(len(y_test),1)),1))
[[1 1]
[0 0]
[0 0]
[0 0]
[0 1]
[1 1]
[0 0]
[0 0]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 1]
[0 0]
[1 1]
[0 0]
[0 0]
[0 1]
[0 1]
[0 1]
[0 0]
[1 1]
[1 1]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 1]
[1 0]
[0 1]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]

The model was trained using a training and test set and the
outcome of the confusion matrix is as follows:

import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(f'Accuracy: {accuracy:.3f}')
[[61 4]
[10 25]]

The result shows 4 false positives, 10 false negatives and an accuracy of 86%, which is a reasonable score, since this problem doesn't require very high precision.
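As a quick check on that reading, the same per-class metrics can be computed directly from the predictions above; a small sketch reusing the y_test and y_pred defined earlier:

from sklearn.metrics import precision_score, recall_score, f1_score

# precision, recall and F1 for the positive class (purchased = 1)
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))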

Random Forest Classification

The colored background shows the predicted class for each point in the graph. The dots in the foreground represent the training data points, and their color shows their true class. The boundary in the middle divides the graph into the regions assigned to each class. (Note that the classifier fitted above is a logistic regression, so the boundary shown here is a straight line.) The picture gives an idea of how well the model separates the data and how it might behave on new data. The legend tells us which color represents which class.

from matplotlib.colors import ListedColormap

X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('slategrey', 'slateblue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], color = ListedColormap(('slategrey', 'slateblue'))(i), label = j)
plt.title('Random Forest Classification (Train set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
This plot shows how well the model can tell the difference between the two classes in the test data. It provides a visual

representation of the decision boundary that separates the two classes in the feature space. The dots represent the test data points, and their color indicates
their true class label. The plot helps us understand how good the model is at correctly predicting the class labels, and where it might make mistakes. The
legend tells us which color represents each class.

from matplotlib.colors import ListedColormap

X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('slategrey', 'slateblue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], color = ListedColormap(('slategrey', 'slateblue'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Probability
We can now use the trained model to predict whether or not someone will purchase a car based on their age and annual salary. The model is first used to predict whether a 35-year-old customer X with an annual salary of 50,000 will buy the car.

# create a new data point to predict
new_data = [[35, 50000]]

# scale the new data using the same scaler used on the training data
new_data_scaled = sc.transform(new_data)

# make a prediction using the trained classifier
prediction = classifier.predict(new_data_scaled)

print(prediction)

[0]

The outcome "0" indicates that it is predicted that this customer X will not buy the car. New

prediction for 6 (potential) new customers:

new_data_2 = [[35, 50000], [42, 15700], [51, 65000], [23, 45600], [36, 42000], [41, 72000]]
df = pd.DataFrame(new_data_2, columns=[ 'Age', 'Annual Salary'])
df.index = df.index + 1
display(df)
   Age   Annual Salary
1 35 50000

2 42 15700

3 51 65000

4 23 45600

5 36 42000
6 41 72000

from tabulate import tabulate

results = []
for i, data in enumerate(new_data_2):
    # scale the new data using the same scaler used on the training data
    data_scaled = sc.transform([data])
    # make a prediction using the trained classifier
    prediction = classifier.predict(data_scaled)
    # add the results to the table
    if prediction == 1:
        prediction_text = "Yes"
    else:
        prediction_text = "No"
    results.append([i+1, data[0], data[1], prediction_text])

# print the table
headers = ["Customer", "Age", "Annual Salary", "Prediction"]
print(tabulate(results, headers=headers))

Customer Age Annual Salary Prediction

1 35 50000 No
2 42 15700 No
3 51 65000 Yes
4 23 45600 No
5 36 42000 No
6 41 72000 No

This table shows that only customer 3 is expected to buy a car.


Experiment No 3

Name of the Experiment : Predicting if a person would buy life insurance or not

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_csv("insurance_data.csv")
df.head()

   age  bought_insurance
0 22 0

1 25 0

2 47 1

3 52 0

4 46 1

plt.scatter(df.age, df.bought_insurance, marker='+', color='red')

<matplotlib.collections.PathCollection at 0x7f98a203fc70>
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, train_size=0.8)  # the split fraction is cut off in the source; 0.8 is an assumed value

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

X_test

age

7 60

21 26

18 19

25 54

26 23

20 21

17 58

5 56

16 25

model.fit(X_train, y_train)

LogisticRegression()

X_test

age

7 60

21 26

18 19

25 54

26 23

20 21

17 58

5 56

16 25

y_predicted = model.predict(X_test)

model.predict_proba(X_test)

array([[0.06491009, 0.93508991],
       [0.90192749, 0.09807251],
       [0.96175932, 0.03824068],
       [0.14120445, 0.85879555],
       [0.93401012, 0.06598988],
       [0.94966579, 0.05033421],
       [0.08469503, 0.91530497],
       [0.10980239, 0.89019761],
       [0.91392635, 0.08607365]])

model.score(X_test, y_test)

0.8888888888888888

y_predicted

array([1, 0, 0, 1, 0, 0, 1, 1, 0])

X_test

age

7 60

21 26

18 19

25 54

26 23

20 21

17 58

5 56

16 25

model.coef_

array([[0.14371961]])

model.intercept_

array([-5.95553684])

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def prediction_function(age):
    z = 0.14371961 * age - 5.95553684   # model.coef_ and model.intercept_ from the fitted model
    y = sigmoid(z)
    return y

age = 35
prediction_function(age)

0.28386895059715234

age = 43
prediction_function(age)

0.5558673456608243

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_predicted)

Experiment No 4

Name of the Experiment : Predict survival based on certain parameters decision tree

Decision Tree Classification


import pandas as pd

df = pd.read_csv("salaries.csv")
df.head()

   company  job                  degree     salary_more_then_100k
0  google   sales executive      bachelors  0
1  google   sales executive      masters    0
2  google   business manager     bachelors  1
3  google   business manager     masters    1
4  google   computer programmer  bachelors  0

inputs = df.drop('salary_more_then_100k',axis='columns')

target = df['salary_more_then_100k']

from sklearn.preprocessing import LabelEncoder

le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])

inputs

company job degree company_n job_n degree_n


0 google sales executive bachelors 2 2 0

1 google sales executive masters 2 2 1

2 google business manager bachelors 2 0 0

3 google business manager masters 2 0 1

4 google computer programmer bachelors 2 1 0

5 google computer programmer masters 2 1 1

6 abc pharma sales executive masters 0 2 1

7 abc pharma computer programmer bachelors 0 1 0

8 abc pharma business manager bachelors 0 0 0

9 abc pharma business manager masters 0 0 1

10 facebook sales executive bachelors 1 2 0

11 facebook sales executive masters 1 2 1

12 facebook business manager bachelors 1 0 0

13 facebook business manager masters 1 0 1

14 facebook computer programmer bachelors 1 1 0

15 facebook computer programmer masters 1 1 1

inputs_n = inputs.drop(['company','job','degree'],axis='columns')

inputs_n

company_n job_n degree_n


0 2 2 0

1 2 2 1

2 2 0 0

3 2 0 1

4 2 1 0

5 2 1 1

6 0 2 1

7 0 1 0

8 0 0 0

9 0 0 1

10 1 2 0

11 1 2 1

12 1 0 0

13 1 0 1

14 1 1 0

15 1 1 1

target

0 0
1 0
2 1
3 1
4 0
5 1
6 0
7 0
8 0
9 1
10 1
11 1
12 1
13 1
14 1
15 1
Name: salary_more_then_100k, dtype: int64

from sklearn import tree

model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)

DecisionTreeClassifier()

model.score(inputs_n,target)
1.0

Is salary of Google, Computer Engineer, Bachelors degree > 100 k ?

model.predict([[2,1,0]])

/usr/local/lib/python3.9/dist-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  warnings.warn(
array([0])

Is salary of Google, Computer Engineer, Masters degree > 100 k ?

model.predict([[2,1,1]])

/usr/local/lib/python3.9/dist-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  warnings.warn(
array([1])
Exercise: Build decision tree model to predict survival based on certain parameters

CSV file is available to download at https://github.com/codebasics/py/blob/master/ML/9_decision_tree/Exercise/titanic.csv

In this file, using the following columns, build a model to predict if a person would survive or not:

1. Pclass
2. Sex
3. Age
4. Fare

Calculate score of your model


Experiment No 5

Name of the Experiment : Build decision tree model to predict survival based on certain parameters

import pandas as pd

df = pd.read_csv("titanic.csv")
df.head()

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)

df.head()

   Survived  Pclass  Sex     Age   Fare
0  0         3       male    22.0  7.2500
1  1         1       female  38.0  71.2833
2  1         3       female  26.0  7.9250
3  1         1       female  35.0  53.1000
4  0         3       male    35.0  8.0500

inputs = df.drop('Survived',axis='columns')
target = df.Survived

inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})

inputs.Age[:10]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64

inputs.Age = inputs.Age.fillna(inputs.Age.mean())

inputs.head()

   Pclass  Sex  Age   Fare
0  3       1    22.0  7.2500
1  1       2    38.0  71.2833
2  3       2    26.0  7.9250
3  1       2    35.0  53.1000
4  3       1    35.0  8.0500

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.2)

len(X_train)

712

len(X_test)

179

from sklearn import tree


model = tree.DecisionTreeClassifier()

model.fit(X_train,y_train)

▾DecisionTreeClassifier DecisionTreeClassifier()

model.score(X_test, y_test)

0.8100558659217877
Experiment No 6

Name of the Experiment : Identify whether an email is ham or spam

import pandas as pd

df = pd.read_csv("spam.csv")
df.head()

  Category  Message
0 ham       Go until jurong point, crazy.. Available only ...
1 ham       Ok lar... Joking wif u oni...
2 spam      Free entry in 2 a wkly comp to win FA Cup fina...
3 ham       U dun say so early hor... U c already then say...
4 ham       Nah I don't think he goes to usf, he lives aro...

df.groupby('Category').describe()

          Message
          count  unique  top                                                 freq
Category
ham       4825   4516    Sorry, I'll call later                              30
spam      747    641     Please call our customer service representativ...   4

df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

Category Message spam


0 ham Go until jurong point, crazy.. Available only ... 0

1 ham Ok lar... Joking wif u oni... 0

2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1

3 ham U dun say so early hor... U c already then say... 0

4 ham Nah I don't think he goes to usf, he lives aro... 0

# note: this cell uses the vectorizer v and the model fitted further below,
# so it should be run after those cells
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(df.Message,df.spam)

from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_count, y_train)

MultinomialNB()

X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9842067480258435
Sklearn Pipeline

from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

clf.score(X_test,y_test)

0.9842067480258435

clf.predict(emails)

array([0, 1])
Experiment No 7

Name of the Experiment : Naive Bayes: Predicting survival from titanic crash

Naive Bayes: Predicting survival from titanic crash

import pandas as pd

df = pd.read_csv("titanic.csv") df.head()

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S

df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'], axis='columns', inplace=True)
df.head()

Pclass Sex Age Fare Survived

0 3 male 22.0 7.2500 0

1 1 female 38.0 71.2833 1

2 3 female 26.0 7.9250 1

3 1 female 35.0 53.1000 1


inputs = df.drop('Survived', axis='columns')
target = df.Survived


dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)

female male

0 0 1

1 1 0

2 1 0

inputs = pd.concat([inputs, dummies], axis='columns')
inputs.head(3)

Pclass Sex Age Fare female male

0 3 male 22.0 7.2500 0 1

1 1 female 38.0 71.2833 1 0

2 3 female 26.0 7.9250 1 0

I am dropping the male column as well because of the dummy variable trap: one column is enough to represent male vs female.

inputs.drop(['Sex','male'], axis='columns', inplace=True)
inputs.head(3)

Pclass Age Fare female

0 3 22.0 7.2500 0

1 1 38.0 71.2833 1

2 3 26.0 7.9250 1
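The same two steps above (create dummies, then drop one of them) can also be collapsed into one call; a small alternative sketch (it keeps a Sex_male column instead of female, but a single column carries the same information):

# equivalent shortcut: let pandas drop the first dummy column itself
inputs_alt = pd.get_dummies(df.drop('Survived', axis='columns'), columns=['Sex'], drop_first=True)
inputs_alt.head(3)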

inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')
inputs.Age[:10]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64

inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

Pclass Age Fare female

0 3 22.0 7.2500 0

1 1 38.0 71.2833 1

2 3 26.0 7.9250 1

3 1 35.0 53.1000 1

4 3 35.0 8.0500 0

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.3)

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(X_train,y_train)

▾GaussianNB

GaussianNB()

model.score(X_test,y_test)

0.7723880597014925

X_test[0:10]

     Pclass        Age     Fare  female
883       2  28.000000  10.5000       0
309       1  30.000000  56.9292       1
120       2  21.000000  73.5000       0
489       3   9.000000  15.9000       0
575       3  19.000000  14.5000       0
403       3  28.000000  15.8500       0
407       2   3.000000  18.7500       0
471       3  38.000000   8.6625       0
477       3  29.000000   7.0458       0

y_test[0:10]

883 0
309 1
120 0
489 1
575 0
403 0
407 1
471 0
477 0
26 0
Name: Survived, dtype:
int64

model.predict(X_test[0:10])

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

model.predict_proba(X_test[:10])

array([[0.92242547, 0.07757453],
       [0.03565211, 0.96434789],
       [0.69856719, 0.30143281],
       [0.9425538 , 0.0574462 ],
       [0.95473038, 0.04526962],
       [0.96153647, 0.03846353],
       [0.86634408, 0.13365592],
       [0.96586774, 0.03413226],
       [0.96152186, 0.03847814],
       [0.96195658, 0.03804342]])
Calculate the score using cross validation

from sklearn.model_selection import cross_val_score


cross_val_score(GaussianNB(),X_train, y_train, cv=5)

array([0.72 , 0.872 , 0.736 , 0.75 , 0.82258065])


Experiment No 8

Name of the Experiment : Prediction of system temperature using Polynomial vs Linear Regression algorithms

# libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# loading the dataset
dataset = pd.read_csv("Dataset.csv")
dataset.head()

   CPU Total  CPU Package Temperature  Memory Temperature  Used Memory  Storage Temperature  Used Space
0   3.528023                        40            48.99466     1.918564                   35   92.118863
1   5.060095                        40            48.97118     1.917644                   35   92.118864
2  10.937500                        40            49.11020     1.923088                   35   92.118860
3   5.078125                        40            49.12500     1.923668                   35   92.118864

# removing cpu temp from dataset
dataset.head()
dataset.dtypes
ds = dataset.drop(columns = 'CPU Package Temperature')

# train data: all columns except cpu temp
X = ds.iloc[::].values
print(X)

[[ 3.5280227 48.9946632  1.9185638 35.        92.11886  ]
 [ 5.060095  48.9711838  1.9176445 35.        92.11886  ]
 [10.9375    49.1102     1.9230887 35.        92.11886  ]
 ...
 [17.96875   54.2728844  2.1252517 42.        92.13377  ]]

# including only one feature
X1 = X[:,[2]]
print(X1)

[[1.91856384]
 [1.9176445 ]
 [1.92308807]
 ...
 [2.12525177]
 [2.12425232]
 [2.12401581]]
data = pd.read_csv("Dataset.csv") data.head()

CPU CPU Package Used Used


Memory Temperatur
Total Temperature Memory Space
e
0 3.528023 40 48.99466 1.918564 35 92.11886
3
1 5.060095 40 48.97118 1.917644 35 92.11886
4
2 10.937500 40 49.11020 1.923088 35 92.11886
0
3 5.078125 40 49.12500 1.923668 35 92.11886
4

# output feature
Y = data.iloc[:,1].values
print(Y)

[40 40 40 ... 48 48 48]

# splitting
from sklearn.model_selection import train_test_split
# from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X1, Y, test_size = 0.2, random_state = 0)  # the random_state value is cut off in the source; 0 is an assumed value

# polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

polyreg = PolynomialFeatures(degree = 4)
# Transform X1 into X_Poly, which contains X, X squared, X cubed and X^4
X_Poly = polyreg.fit_transform(X1)

regressor1 = LinearRegression()
# Fit X_Poly in our Linear Regression model
regressor1.fit(X_Poly, Y)

LinearRegression()

X1.shape

(2190, 1)

X_Poly.shape

(2190, 5)
# Predicting a new result
y_pred = regressor1.predict(X_Poly)
print(y_pred)

[40.54329613 40.51138967 40.69861546 ... 45.18661763 45.17429162 45.17136307]

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(Y, y_pred)
print(mse)

9.203938505726908

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X1, Y)

LinearRegression()

# Visualizing Linear Regression Results
plt.scatter(X1, Y, color = 'red')
plt.plot(X1, regressor.predict(X1), color = 'blue')
plt.title('Linear Regression')
plt.xlabel('STORAGETEMP')
plt.ylabel('CPU TEMP')
plt.show()
Prediction of system temperature using Polynomial Regression algorithms

# Visualising the Polynomial Regression results
plt.scatter(X1, Y, color = 'red')
plt.plot(X1, regressor1.predict(polyreg.fit_transform(X1)), color = 'blue')
plt.title('Polynomial Regression')
plt.xlabel('STORAGETEMP')
plt.ylabel('CPU TEMP')
plt.show()
Experiment No 9

Name of the Experiment : Prediction of air passengers using ARIMA Model
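No worked listing is given here, so below is a minimal sketch of one way this experiment could be carried out with statsmodels, assuming the classic AirPassengers dataset saved as 'AirPassengers.csv' with columns 'Month' and '#Passengers'; the file name, column names, and the ARIMA order are assumptions.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# load monthly passenger counts and use the month as a datetime index
df = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')
series = df['#Passengers']

# hold out the last 24 months for evaluation
train, test = series[:-24], series[-24:]

# fit an ARIMA(2,1,2) model; in practice the (p,d,q) order would be chosen
# from ACF/PACF plots or a grid search
model = ARIMA(train, order=(2, 1, 2))
fitted = model.fit()

# forecast the held-out 24 months and compare with the actual values
forecast = fitted.forecast(steps=24)
print(forecast.head())
print(test.head())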


Experiment No 10

Name of the Experiment : Equipment maintenance prediction

import pandas as pd
import numpy as np

# Generating mock data


np.random.seed(42)

data = pd.DataFrame({
'sensor_1': np.random.normal(0, 1, 1000),
'sensor_2': np.random.normal(0, 1, 1000),
'sensor_3': np.random.normal(0, 1, 1000),
'operational_hours': np.random.randint(100, 5000, 1000),
'maintenance': np.random.choice([0, 1], 1000, p=[0.95, 0.05])
})

# Simulating remaining useful life (RUL) based on operational hours and sensor readings
data['RUL'] = 5000 - data['operational_hours'] - (data['sensor_1'] + data['sensor_2'] +
data['sensor_3']).cumsum()

# Save to CSV
data.to_csv('machinery_data.csv', index=False)

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('machinery_data.csv')

# Handle missing values if any


data.fillna(method='ffill', inplace=True)

# Feature selection
features = ['sensor_1', 'sensor_2', 'sensor_3', 'operational_hours']
target_rul = 'RUL'
target_maintenance = 'maintenance'

# Normalize features
scaler = StandardScaler()
data[features] = scaler.fit_transform(data[features])

# Split data for regression and classification


X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(data[features], data[target_rul], test_size=0.2,
random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(data[features], data[target_maintenance],
test_size=0.2, random_state=42)
In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Train regression model


reg_model = RandomForestRegressor(n_estimators=100, random_state=42)
reg_model.fit(X_train_reg, y_train_reg)

# Predict and evaluate


y_pred_reg = reg_model.predict(X_test_reg)
mse_reg = mean_squared_error(y_test_reg, y_pred_reg)
print(f"Regression Model MSE: {mse_reg}")
Regression Model MSE: 1103.2844396957576
In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train classification model


clf_model = RandomForestClassifier(n_estimators=100, random_state=42)
clf_model.fit(X_train_clf, y_train_clf)
# Predict and evaluate
y_pred_clf = clf_model.predict(X_test_clf)
accuracy_clf = accuracy_score(y_test_clf, y_pred_clf)
print(f"Classification Model Accuracy: {accuracy_clf}")
print(classification_report(y_test_clf, y_pred_clf))
Classification Model Accuracy: 0.955
              precision    recall  f1-score   support

           0       0.95      1.00      0.98       191
           1       0.00      0.00      0.00         9

    accuracy                           0.95       200
   macro avg       0.48      0.50      0.49       200
weighted avg       0.91      0.95      0.93       200
Experiment No 11

Name of the Experiment : Air quality prediction using linear regression

The data was scraped from a weather website; the hourly air-quality readings were aggregated into daily averages for 2013 to 2017, with PM2.5 as the target value. The air quality is then predicted using Linear Regression.
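A minimal sketch of the hourly-to-daily aggregation step described above, assuming a hypothetical raw file 'hourly_air_quality.csv' with a datetime column 'timestamp' and a 'PM 2.5' column (these names are assumptions):

import pandas as pd

# read the hourly readings and parse the timestamp column as dates
hourly = pd.read_csv('hourly_air_quality.csv', parse_dates=['timestamp'])

# average the hourly PM 2.5 readings into one value per day
daily = (hourly.set_index('timestamp')
               .resample('D')['PM 2.5']
               .mean()
               .reset_index())
print(daily.head())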

Columns

 T Average annual temperature


 TM Annual average maximum temperature
 Tm Average annual minimum temperature
 PP Rain or snow precipitation total annual
 V Annual average wind speed
 RA Number of days with rain
 SN Number of days with snow
 TS Number of days with storm
 FG Number of foggy days
 TN Number of days with tornado
 GR Number of days with hail

Lets import required libraries to the project


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Lets extract the data from the file
In [4]:

df = pd.read_csv('Real_Combine.csv')
In [5]:

df.head()
Out[5]:

Let's check whether the dataset has null values using a heatmap
In [6]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:

<AxesSubplot:>

In [10]:

df.isnull().sum()
Out[10]:

T 0
TM 0
Tm 0
SLP 0
H 0
VV 0
V 0
VM 0
PM 2.5 1
dtype: int64
As we can see, PM 2.5 has one null value. Let's drop that row, because dropping a single value will not affect the model.
In [16]:

df = df.dropna()
Lets split the dataset into independent and dependent features
In [17]:

X = df.iloc[:,:-1]  ## Independent features:
                    ## drop the last column and treat the remaining columns as independent features

Y = df.iloc[:,-1]   ## Dependent feature:
                    ## keep only the last column as the dependent feature
Let's check the correlation of all the features with a multivariate pairplot
In [18]:

sns.pairplot(df)
Out[18]:

<seaborn.axisgrid.PairGrid at 0x7f8c98da28e0>
Check the all Features Correlations

df.corr()
Correlation Matrix with Heatmap
Correlation states how the features are related to each other or the target variable.

Correlation can be positive (increase in one value of feature increases the value of the target variable) or
negative (increase in one value of feature decreases the value of the target variable)

Heatmap makes it easy to identify which features are most related to the target variable, we will plot heatmap of
correlated features using the seaborn library.
In [25]:

#Lets get Correlations of each features in dataset


corrmat = df.corr()
In [26]:

corrmat.index
Out[26]:
Index(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'], dtype='object')
In [23]:

top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(df[top_corr_features].corr(),annot=True,cmap='RdYlGn')

Feature Importance
You can get the feature importance of each feature of your dataset by using the feature importance property of
the model.

Feature importance gives you a score for each feature of your data, the higher the score more important or
relevant is the feature towards your output variable.

Feature importance is an inbuilt class that comes with Tree Based Regressor, we will be using Extra
Tree Regressor for extracting the top 10 features for the dataset.

In [31]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,Y)
ExtraTreesRegressor()
In [33]:
X.head()
Out[33]:

Let's check which features contribute most using an ensemble-based feature-selection technique

print(model.feature_importances_)
[0.165886 0.09391187 0.17356502 0.1292003 0.08139158 0.25912898
0.05701879 0.03989745]
Let's visualize the top 5 features; these top 5 features can then be applied to the model

feat_importances = pd.Series(model.feature_importances_, index=X.columns)


feat_importances.nlargest(5).plot(kind='barh')
plt.show()

Applying Linear Regression Model


In [45]:
sns.distplot(Y)
/Users/anil/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot`
is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot`
(a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[45]:

<AxesSubplot:xlabel='PM 2.5', ylabel='Density'>

It looks like a right-skewed distribution.
Train Test Split
In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test , Y_train, Y_test = train_test_split(X,Y, test_size=0.3,random_state=0)
In [51]:
from sklearn.linear_model import LinearRegression
In [53]:
Regressor = LinearRegression()
In [54]:
Regressor.fit(X_train,Y_train)
Out[54]:
LinearRegression()
In [55]:

Regressor.coef_
Out[55]:

array([ 2.63949039e+00, 5.19978529e-01, -7.59811846e+00, 4.93219944e-01,


-8.37064222e-01, -5.04301355e+01, -2.75417778e+00, -3.92662839e-02])
In [56]:

Regressor.intercept_
Out[56]:

-157.37425475061315
Lets find the R Squared value for train data

In [92]:
print("Coefficient of determination R^2 <-- on train set: {}".format(Regressor.score(X_train,
Y_train))) Coefficient of determination R^2 <-- on train set: 0.6007706404750855
R square Value of train data is near to 1 so its prety much good
Lets find the R Squared Value for Test Data

In [61]:
print("Coefficient of determination R^2 <-- on train set: {}".format(Regressor.score(X_test, Y_test)))
Coefficient of determination R^2 <-- on train set: 0.5316188612878152
R square Value of test data is also near to 1 so its prety much
good Lets check cross validation score

In [65]:
from sklearn.model_selection import cross_val_score
score=cross_val_score(Regressor,X,Y,cv=5)
In [68]:
score.mean()
Out[68]:
0.46724362258523333

Model Evaluation
In [74]:
coeff_df = pd.DataFrame(Regressor.coef_,X.columns,columns=['Coefficient'])
In [75]:
coeff_df
Out[75]:

Coefficient

T 2.639490

TM 0.519979

Tm -7.598118

SLP 0.493220

H -0.837064

VV -50.430135

V -2.754178

VM -0.039266

Interpreting the coefficients:

 Holding all other features fixed, a 1 unit increase in T is associated with an *increase of 2.639490
in AQI PM2.5 *.
 Holding all other features fixed, a 1 unit increase in TM is associated with an *increase of 0.519979
in AQI PM 2.5 *.
 Holding all other features fixed, a 1 unit increase in Tm is associated with an *decrease of -7.598118
in AQI PM 2.5 *.
 Holding all other features fixed, a 1 unit increase in SLP is associated with an *increase of 0.493220
in AQI PM2.5 *.
 Holding all other features fixed, a 1 unit increase in H is associated with an *increase of -0.837064 in
AQI PM 2.5 *.
 Holding all other features fixed, a 1 unit increase in VV is associated with an *decrease of -
50.430135 in AQI PM 2.5 *.
 Holding all other features fixed, a 1 unit increase in V is associated with an *decrease of -2.754178
in AQI PM 2.5 *.
 Holding all other features fixed, a 1 unit increase in VM is associated with an *decrease of -
0.039266 in AQI PM 2.5 *.

Lets predict the test data


In [76]:
prediction = Regressor.predict(X_test)
In [77]:
sns.distplot(Y_test - prediction)

Out[77]:
<AxesSubplot:xlabel='PM 2.5', ylabel='Density'>

The residuals on the test data follow a roughly Gaussian distribution, which is a good sign for the predictions.

In [79]:
plt.scatter(Y_test,prediction)

Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}$$

Comparing these metrics:

 MAE is the easiest to understand, because it's the average error.


 MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful
in the real world.
 RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.


In [86]:

from sklearn import metrics


In [88]:

print('MAE:', metrics.mean_absolute_error(Y_test, prediction))


print('MSE:', metrics.mean_squared_error(Y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test, prediction)))
MAE: 40.28335537132943
MSE: 3057.6641286741387
RMSE: 55.296149311449696
For the deployment process, save the trained model using the pickle package.
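A small sketch of that step, saving the trained Regressor from above and loading it back (the file name is an assumption):

import pickle

# save the fitted linear regression model to disk
with open('air_quality_lr_model.pkl', 'wb') as f:
    pickle.dump(Regressor, f)

# later, load it back and use it exactly like the original object
with open('air_quality_lr_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X_test[:5]))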
Experiment No 12

Name of the Experiment : Predicting the direction of stock market prices using random forest.

Goal
Reproduce the results from the paper "Predicting the direction of stock market prices using random
forest."

Import Libraries
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (7, 4.5)  # Make the default figures a bit bigger

import numpy as np
import random
# Let's make this notebook reproducible
np.random.seed(42)
random.seed(42)

import pandas_techinal_indicators as ta  # https://github.com/Crypto-toolbox/pandas-technical-indicators/blob/master/technical_indicators.py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, confusion_matrix, recall_score, accuracy_score
from sklearn.model_selection import train_test_split

Data
In order to reproduce the same results as the authors, we try to use the same data they used in page
18, table 15.

Since the authors state that they compare these results with other authors using the data from yahoo
finance in the period [2010-01-04 to 2014-12-10] we will use the same data. It is not clear which
periods they used as training set and testing set.

aapl = pd.read_csv('AAPL.csv')
del(aapl['Date'])
del(aapl['Adj Close'])
aapl.head()
        Open       High        Low      Close     Volume
0  33.641430  33.801430  33.494286  33.571430  107664900
1  33.915714  34.104286  33.250000  33.709999  150786300
2  33.568573  34.072857  33.538570  34.070000  171126900
3  34.028572  34.320000  33.857143  34.220001  111754300
4  34.221428  34.560001  34.094284  34.371429  157125500

Exponential smoothing
The authors don't give any guideline for alpha, so let's assume it is 0.9
def get_exp_preprocessing(df, alpha=0.9):
    edata = df.ewm(alpha=alpha).mean()
    return edata

saapl = get_exp_preprocessing(aapl)
saapl.head()  # saapl stands for smoothed aapl
        Open       High        Low      Close        Volume
0  33.641430  33.801430  33.494286  33.571430  1.076649e+08
1  33.890779  34.076754  33.272208  33.697402  1.468662e+08
2  33.600503  34.073243  33.512174  34.033076  1.687227e+08
3  33.985804  34.295347  33.822677  34.201325  1.174460e+08
4  34.197868  34.533538  34.067126  34.354420  1.531579e+08

Feature Extraction - Technical Indicators


It's not very clear what 'n' should be in most of the indicators, so, we are using several values of 'n'

note: the Williams %R indicator does not seem to be available in this library yet
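Since Williams %R is not available in the library, a small hand-rolled version could be added to the feature set if desired; a sketch assuming the usual 'High', 'Low' and 'Close' columns:

# Williams %R over an n-day window: -100 * (highest high - close) / (highest high - lowest low)
# (would need to be called inside feature_extraction before the High/Low columns are dropped)
def williams_r(data, n=14):
    highest_high = data['High'].rolling(n).max()
    lowest_low = data['Low'].rolling(n).min()
    data['Williams_%R_' + str(n)] = -100 * (highest_high - data['Close']) / (highest_high - lowest_low)
    return data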
def feature_extraction(data):
    for x in [5, 14, 26, 44, 66]:
        data = ta.relative_strength_index(data, n=x)
        data = ta.stochastic_oscillator_d(data, n=x)
        data = ta.accumulation_distribution(data, n=x)
        data = ta.average_true_range(data, n=x)
        data = ta.momentum(data, n=x)
        data = ta.money_flow_index(data, n=x)
        data = ta.rate_of_change(data, n=x)
        data = ta.on_balance_volume(data, n=x)
        data = ta.commodity_channel_index(data, n=x)
        data = ta.ease_of_movement(data, n=x)
        data = ta.trix(data, n=x)
        data = ta.vortex_indicator(data, n=x)

    data['ema50'] = data['Close'] / data['Close'].ewm(50).mean()
    data['ema21'] = data['Close'] / data['Close'].ewm(21).mean()
    data['ema14'] = data['Close'] / data['Close'].ewm(14).mean()
    data['ema5'] = data['Close'] / data['Close'].ewm(5).mean()
    # Williams %R is missing

    data = ta.macd(data, n_fast=12, n_slow=26)

    del(data['Open'])
    del(data['High'])
    del(data['Low'])
    del(data['Volume'])

    return data

def compute_prediction_int(df, n):
    pred = (df.shift(-n)['Close'] >= df['Close'])
    pred = pred.iloc[:-n]
    return pred.astype(int)

def prepare_data(df, horizon):
    data = feature_extraction(df).dropna().iloc[:-horizon]
    data['pred'] = compute_prediction_int(data, n=horizon)
    del(data['Close'])
    return data.dropna()

Prepare the data with a prediction horizon of 10 days


data = prepare_data(saapl, 10)
y = data['pred']

# remove the output from the input
features = [x for x in data.columns if x not in ['gain', 'pred']]
X = data[features]

Make sure that future data is not used by splitting the data in first 2/3 for training and the last 1/3 for testing

train_size = 2*len(X) // 3
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]

print('len X_train', len(X_train))
print('len y_train', len(y_train))
print('len X_test', len(X_test))
print('len y_test', len(y_test))
len X_train 644
len y_train 644
len X_test 323
len y_test 323

Random Forests
rf = RandomForestClassifier(n_jobs=-1, n_estimators=65, random_state=42)
rf.fit(X_train, y_train.values.ravel());
The expected results for a 10 days prediction according to the paper in table 15 for Apple stock
should be around 92%
pred = rf.predict(X_test)
precision = precision_score(y_pred=pred, y_true=y_test)
recall = recall_score(y_pred=pred, y_true=y_test)
f1 = f1_score(y_pred=pred, y_true=y_test)
accuracy = accuracy_score(y_pred=pred, y_true=y_test)
confusion = confusion_matrix(y_pred=pred, y_true=y_test)
print('precision: {0:1.2f}, recall: {1:1.2f}, f1: {2:1.2f}, accuracy: {3:1.2f}'.format(precision, recall, f1, accuracy))
print('Confusion Matrix')
print(confusion)
precision: 0.66, recall: 0.68, f1: 0.67, accuracy: 0.58
Confusion Matrix
[[ 47 71]
[ 66 139]]

However, the resulting accuracy is 58% !


Some plots for intuition of what is going on
plt.figure(figsize=(20,7))
plt.plot(np.arange(len(pred)), pred, label='pred')
plt.plot(np.arange(len(y_test)), y_test, label='real')
plt.title('Prediction versus reality in the test set')
plt.legend();

plt.figure(figsize=(20,7))
proba = rf.predict_proba(X_test)[:,1]
plt.figure(figsize=(20,7))
plt.plot(np.arange(len(proba)), proba, label='pred_probability')
plt.plot(np.arange(len(y_test)), y_test, label='real')
plt.title('Prediction probability versus reality in the test set')
plt.legend()
plt.show()
<matplotlib.figure.Figure at 0x1fec5d7a908>

Let's now duplicate the analysis for the case where the test
set is shuffled
This means that there is data leakage in the training set, as the future and the past are together in
the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 2*len(X) // 3)

print('len X_train', len(X_train))
print('len y_train', len(y_train))
print('len X_test', len(X_test))
print('len y_test', len(y_test))
len X_train 644
len y_train 644
len X_test 323
len y_test 323

Let's use Random Forests with data leaked data set


rf = RandomForestClassifier(n_jobs=-1, n_estimators=65, random_state=42)
rf.fit(X_train, y_train.values.ravel());
The expected results for a 10 days prediction according to the paper in table 15 for Apple stock
should be around 92%
pred = rf.predict(X_test)
precision = precision_score(y_pred=pred, y_true=y_test)
recall = recall_score(y_pred=pred, y_true=y_test)
f1 = f1_score(y_pred=pred, y_true=y_test)
accuracy = accuracy_score(y_pred=pred, y_true=y_test)
confusion = confusion_matrix(y_pred=pred, y_true=y_test)
print('precision: {0:1.2f}, recall: {1:1.2f}, f1: {2:1.2f}, accuracy: {3:1.2f}'.format(precision, recall, f1, accuracy))
print('Confusion Matrix')
print(confusion)
precision: 0.87, recall: 0.91, f1: 0.89, accuracy: 0.87
Confusion Matrix
[[117 25]
[ 16 165]]

The accuracy results almost match those expected from the paper 87% vs the expected 92%
plt.figure(figsize=(20,7))
plt.plot(np.arange(len(pred)), pred, alpha=0.7, label='pred')
plt.plot(np.arange(len(y_test)), y_test, alpha=0.7, label='real')
plt.title('Prediction versus reality in the test set - Using Leaked data')
plt.legend();

plt.figure(figsize=(20,7))
proba = rf.predict_proba(X_test)[:,1]
plt.figure(figsize=(20,7))
plt.plot(np.arange(len(proba)), proba, alpha = 0.7, label='pred_probability')
plt.plot(np.arange(len(y_test)), y_test, alpha = 0.7, label='real')
plt.title('Prediction probability versus reality in the test set - Using Leaked data')
plt.legend()
plt.show()
<matplotlib.figure.Figure at 0x1fec591be10>
Experiment No 13

Name of the Experiment : Stock Price Prediction using ML Algorithms

import pandas as pd

# Load the stock data
data = pd.read_csv('AAPL_short_volume.csv')
close_prices_AAPL = data['Close']

# Reverse the order of the data
close_prices_AAPL_reverse = close_prices_AAPL.iloc[::-1]
# Reset index to maintain the correct time series order in the plot
close_prices_AAPL_reverse.reset_index(drop=True, inplace=True)

# Plot the line chart
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(close_prices_AAPL_reverse)
plt.xlabel('Time')
plt.ylabel('Close Prices')
plt.title('AAPL Stock Close Prices')
plt.grid(True)
plt.show()

import numpy as np

# Data preprocessing
data = close_prices_AAPL_reverse.values.reshape(-1, 1)  # Reshape the data
data_normalized = data / np.max(data)                   # Normalize the data

# Split the data into training and testing sets
train_size = int(len(data_normalized) * 0.8)
train_data = data_normalized[:train_size]
test_data = data_normalized[train_size:]

from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import LSTM, Dense,
Dropout from tensorflow.keras.optimizers import
Adam

# Function to create LSTM model
def create_lstm_model(units, activation, learning_rate):
    model = Sequential()
    model.add(LSTM(units=units, activation=activation, input_shape=(1, 1)))
    model.add(Dense(units=1))
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model
# Define hyperparameters for tuning
lstm_units = [50, 100, 200]
lstm_activations = ['relu', 'tanh']
learning_rates = [0.001, 0.01, 0.1]
epochs = 100
batch_size = 32

# Perform hyperparameter tuning for the LSTM model
best_rmse = float('inf')
best_lstm_model = None

from sklearn.metrics import mean_squared_error

for units in lstm_units:
    for activation in lstm_activations:
        for learning_rate in learning_rates:
            # Create and train LSTM model
            model = create_lstm_model(units=units, activation=activation, learning_rate=learning_rate)
            model.fit(train_data[:-1].reshape(-1, 1, 1), train_data[1:],
                      epochs=epochs, batch_size=batch_size, verbose=0)

            # Predict on test data
            test_predictions = model.predict(test_data[:-1].reshape(-1, 1, 1)).flatten()

            # Calculate RMSE
            rmse = np.sqrt(mean_squared_error(test_data[1:], test_predictions))

            # Check if current model has lower RMSE
            if rmse < best_rmse:
                best_rmse = rmse
                best_lstm_model = model

# Predict on the entire dataset using the best LSTM model
all_lstm_predictions = best_lstm_model.predict(data_normalized[:-1].reshape(-1, 1, 1)).flatten()

# Inverse normalize the LSTM predictions
all_lstm_predictions = all_lstm_predictions * np.max(data)

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Support Vector Machine (SVM) Model
svm_model = SVR()

svm_params = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1]
}

svm_grid_search = GridSearchCV(svm_model, svm_params, scoring='neg_mean_squared_error')
svm_grid_search.fit(np.arange(len(close_prices_AAPL_reverse)).reshape(-1, 1), close_prices_AAPL_reverse)
svm_best_model = svm_grid_search.best_estimator_
svm_predictions = svm_best_model.predict(np.arange(len(close_prices_AAPL_reverse)).reshape(-1, 1))

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor

# Random Forest Model
rf_model = RandomForestRegressor()
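A hyperparameter search for the Random Forest can be wired up the same way as for the SVR above. The grid below is only an illustrative assumption (the parameter values are hypothetical, not taken from the experiment):

# Hypothetical Random Forest grid search on the same time index / price series used for the SVR.
rf_params = {
    'n_estimators': [100, 200],      # assumed values, for illustration only
    'max_depth': [5, 10, None]       # assumed values, for illustration only
}
rf_grid_search = GridSearchCV(rf_model, rf_params, scoring='neg_mean_squared_error')
rf_grid_search.fit(np.arange(len(close_prices_AAPL_reverse)).reshape(-1, 1),
                   close_prices_AAPL_reverse)
rf_best_model = rf_grid_search.best_estimator_
rf_predictions = rf_best_model.predict(np.arange(len(close_prices_AAPL_reverse)).reshape(-1, 1))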

# Calculate the scaling factor based on the maximum value of the original data
scaling_factor = np.max(close_prices_AAPL_reverse)

# Function to predict future stock prices using the LSTM model
def predict_future_lstm(model, data, num_predictions, scaling_factor):
    predictions = []

    # Get the last data point from the input data
    last_data_point = data[-1]

    for _ in range(num_predictions):
        # Predict the next time step
        prediction = model.predict(last_data_point.reshape(1, 1, 1))
        predictions.append(prediction[0, 0])

        # Update last_data_point to include the predicted value for the next iteration
        last_data_point = np.append(last_data_point[1:], prediction)

    # Inverse normalize the predictions
    predictions = np.array(predictions) * scaling_factor

    return predictions

# Predict the next 10 days using the LSTM model
num_predictions = 10
lstm_predictions = predict_future_lstm(best_lstm_model, data_normalized, num_predictions, scaling_factor)

# Plot the LSTM predictions for the next 10 days
plt.figure(figsize=(10, 6))
plt.plot(close_prices_AAPL_reverse, label='Actual')
plt.plot(np.arange(len(close_prices_AAPL_reverse), len(close_prices_AAPL_reverse) + num_predictions),
         lstm_predictions, label='LSTM Predicted')
plt.title(f"LSTM Model - RMSE: {best_rmse:.2f}")
plt.xlabel('Time')
plt.ylabel('Stock Price')
plt.legend()
plt.show()

# Print the predicted stock prices for the next 10 days using LSTM
print("Predicted stock prices for the next 10 days:")
for i, prediction in enumerate(lstm_predictions, start=1):
    print(f"Day {i}: {prediction:.2f}")

Experiment No 14
Name of the Experiment : Medical Insurance Price Prediction Using Machine Learning Algorithms

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data = pd.read_csv('./insurance.csv')
data.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

There are no missing values as such


data['region'].value_counts().sort_values()
northeast 324
northwest 325
southwest 325
southeast 364
Name: region, dtype: int64
data['children'].value_counts().sort_values()
5 18
4 25
3 157
2 240
1 324
0 574
Name: children, dtype: int64

Converting Categorical Features to Numerical


clean_data = {'sex': {'male': 0, 'female': 1},
              'smoker': {'no': 0, 'yes': 1},
              'region': {'northwest': 0, 'northeast': 1, 'southeast': 2, 'southwest': 3}
             }
data_copy = data.copy()
data_copy.replace(clean_data, inplace=True)
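The mapping above assigns arbitrary integer codes to region, which imposes an ordering that does not really exist in the data. An alternative sketch using one-hot encoding with pd.get_dummies; this is optional and is not used by the later cells of the experiment:

# One-hot encode the nominal columns instead of mapping them to integers.
# Produces indicator columns such as region_southwest; purely an alternative to the cell above.
data_onehot = pd.get_dummies(data, columns=['sex', 'smoker', 'region'], drop_first=True)
data_onehot.head()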
data_copy.describe()
               age          sex          bmi     children       smoker       region       charges
count  1338.000000  1338.000000  1338.000000  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025     0.494768    30.663397     1.094918     0.204783     1.514948  13270.422265
std      14.049960     0.500160     6.098187     1.205493     0.403694     1.105572  12110.011237
min      18.000000     0.000000    15.960000     0.000000     0.000000     0.000000   1121.873900
25%      27.000000     0.000000    26.296250     0.000000     0.000000     1.000000   4740.287150
50%      39.000000     0.000000    30.400000     1.000000     0.000000     2.000000   9382.033000
75%      51.000000     1.000000    34.693750     2.000000     0.000000     2.000000  16639.912515
max      64.000000     1.000000    53.130000     5.000000     1.000000     3.000000  63770.428010

corr = data_copy.corr()
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(corr, cmap='BuPu', annot=True, fmt=".2f", ax=ax)
plt.title("Dependencies of Medical Charges")
plt.savefig('./sampleImages/Cor')
plt.show()
Smoker, BMI and Age are the most important factors determining Charges.
Sex, Children and Region have comparatively little effect on Charges; we might drop these three columns since their correlation with the target is low.
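If we did decide to drop them, a minimal sketch would be (the rest of the experiment keeps all columns, so this is only an illustration):

# Drop the weakly correlated columns; kept in a separate frame so data_copy is unchanged.
data_reduced = data_copy.drop(['sex', 'children', 'region'], axis=1)
data_reduced.head()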
print(data['sex'].value_counts().sort_values())
print(data['smoker'].value_counts().sort_values())
print(data['region'].value_counts().sort_values())
female 662
male 676
Name: sex, dtype: int64
yes 274
no 1064
Name: smoker, dtype: int64
northeast 324
northwest 325
southwest 325
southeast 364
Name: region, dtype: int64

Now that we have confirmed there are no unexpected values in the pre-processed columns, we can proceed with EDA.
plt.figure(figsize=(12,9))
plt.title('Age vs Charge')
sns.barplot(x='age', y='charges', data=data_copy, palette='husl')
plt.savefig('./sampleImages/AgevsCharges')
plt.figure(figsize=(10,7))
plt.title('Region vs Charge')
sns.barplot(x='region', y='charges', data=data_copy, palette='Set3')
<matplotlib.axes._subplots.AxesSubplot at 0x2e69a038b38>

plt.figure(figsize=(7,5))
sns.scatterplot(x='bmi', y='charges', hue='sex', data=data_copy, palette='Reds')
plt.title('BMI VS Charge')
Text(0.5, 1.0, 'BMI VS Charge')
plt.figure(figsize=(10,7))
plt.title('Smoker vs Charge')
sns.barplot(x='smoker', y='charges', data=data_copy, palette='Blues', hue='sex')
<matplotlib.axes._subplots.AxesSubplot at 0x2e69a19b2e8>

plt.figure(figsize=(10,7))
plt.title('Sex vs Charges')
sns.barplot(x='sex', y='charges', data=data_copy, palette='Set1')
<matplotlib.axes._subplots.AxesSubplot at 0x2e69a291a20>
Plotting Skew and Kurtosis
print('Printing Skewness and Kurtosis for all columns')
print()
for col in list(data_copy.columns):
    print('{0} : Skewness {1:.3f} and Kurtosis {2:.3f}'.format(col, data_copy[col].skew(), data_copy[col].kurt()))
Printing Skewness and Kurtosis for all columns

age : Skewness 0.056 and Kurtosis -1.245


sex : Skewness 0.021 and Kurtosis -2.003
bmi : Skewness 0.284 and Kurtosis -0.051
children : Skewness 0.938 and Kurtosis 0.202
smoker : Skewness 1.465 and Kurtosis 0.146
region : Skewness -0.038 and Kurtosis -1.329
charges : Skewness 1.516 and Kurtosis 1.606
plt.figure(figsize=(10,7))
sns.distplot(data_copy['age'])
plt.title('Plot for Age')
plt.xlabel('Age')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
plt.figure(figsize=(10,7))
sns.distplot(data_copy['bmi'])
plt.title('Plot for BMI')
plt.xlabel('BMI')
plt.ylabel('Count')
Text(0, 0.5, 'Count')

plt.figure(figsize=(10,7))
sns.distplot(data_copy['charges'])
plt.title('Plot for charges')
plt.xlabel('charges')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
There might be a few outliers in Charges, but we cannot simply treat the extreme values as errors, since there are genuine cases in which the medical charge really was that low or that high.
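A quick, purely diagnostic way to see how many points an interquartile-range rule would flag, without removing anything (a minimal sketch using the charges column):

# Count how many charges fall outside the 1.5*IQR whiskers; nothing is dropped.
q1, q3 = data_copy['charges'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (data_copy['charges'] < q1 - 1.5 * iqr) | (data_copy['charges'] > q3 + 1.5 * iqr)
print('Potential outliers in charges:', outlier_mask.sum())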
Preparing the data - We can scale the BMI and Charges columns before proceeding with prediction.
from sklearn.preprocessing import StandardScaler
data_pre = data_copy.copy()

tempBmi = data_pre.bmi
tempBmi = tempBmi.values.reshape(-1,1)
data_pre['bmi'] = StandardScaler().fit_transform(tempBmi)

tempAge = data_pre.age
tempAge = tempAge.values.reshape(-1,1)
data_pre['age'] = StandardScaler().fit_transform(tempAge)

tempCharges = data_pre.charges
tempCharges = tempCharges.values.reshape(-1,1)
data_pre['charges'] = StandardScaler().fit_transform(tempCharges)
data_pre.head()
        age  sex       bmi  children  smoker  region   charges
0 -1.438764    1 -0.453320         0       1       3  0.298584
1 -1.509965    0  0.509621         1       0       2 -0.953689
2 -0.797954    0  0.383307         3       0       2 -0.728675
3 -0.441948    0 -1.305531         0       0       0  0.719843
4 -0.513149    0 -0.292556         0       0       0 -0.776802

X = data_pre.drop('charges', axis=1).values
y = data_pre['charges'].values.reshape(-1,1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Size of X_train : ', X_train.shape)
print('Size of y_train : ', y_train.shape)
print('Size of X_test : ', X_test.shape)
print('Size of Y_test : ', y_test.shape)
Size of X_train : (1070, 6)
Size of y_train : (1070, 1)
Size of X_test : (268, 6)
Size of Y_test : (268, 1)

Importing Libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
import xgboost as xgb

from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV

Linear Regression
%%time
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
Wall time: 32 ms
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
cv_linear_reg = cross_val_score(estimator = linear_reg, X = X, y = y, cv = 10)
y_pred_linear_reg_train = linear_reg.predict(X_train)
r2_score_linear_reg_train = r2_score(y_train, y_pred_linear_reg_train)

y_pred_linear_reg_test = linear_reg.predict(X_test)
r2_score_linear_reg_test = r2_score(y_test, y_pred_linear_reg_test)

rmse_linear = (np.sqrt(mean_squared_error(y_test, y_pred_linear_reg_test)))

print('CV Linear Regression : {0:.3f}'.format(cv_linear_reg.mean()))
print('R2_score (train) : {0:.3f}'.format(r2_score_linear_reg_train))
print('R2_score (test) : {0:.3f}'.format(r2_score_linear_reg_test))
print('RMSE : {0:.3f}'.format(rmse_linear))
CV Linear Regression : 0.745
R2_score (train) : 0.741
R2_score (test) : 0.783
RMSE : 0.480

Support Vector Machine (Regression)


X_c = data_copy.drop('charges', axis=1).values
y_c = data_copy['charges'].values.reshape(-1,1)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size=0.2, random_state=42)

X_train_scaled = StandardScaler().fit_transform(X_train_c)
y_train_scaled = StandardScaler().fit_transform(y_train_c)
X_test_scaled = StandardScaler().fit_transform(X_test_c)
y_test_scaled = StandardScaler().fit_transform(y_test_c)
svr = SVR()
# svr.fit(X_train_scaled, y_train_scaled.ravel())

parameters = {'kernel': ['rbf', 'sigmoid'],
              'gamma': [0.001, 0.01, 0.1, 1, 'scale'],
              'tol': [0.0001],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}

svr_grid = GridSearchCV(estimator=svr, param_grid=parameters, cv=10, verbose=4, n_jobs=-1)
svr_grid.fit(X_train_scaled, y_train_scaled.ravel())
Fitting 10 folds for each of 60 candidates, totalling 600 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 7.2s
[Parallel(n_jobs=-1)]: Done 82 tasks | elapsed: 9.0s
[Parallel(n_jobs=-1)]: Done 205 tasks | elapsed: 11.7s
[Parallel(n_jobs=-1)]: Done 376 tasks | elapsed: 15.6s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed: 40.6s finished

GridSearchCV(cv=10, error_score='raise-deprecating',
estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
epsilon=0.1, gamma='auto_deprecated', kernel='rbf',
max_iter=-1, shrinking=True, tol=0.001,
verbose=False),
iid='warn', n_jobs=-1,
param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1, 'scale'],
'kernel': ['rbf', 'sigmoid'], 'tol': [0.0001]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=4)
svr = SVR(C=10, gamma=0.1, tol=0.0001)
svr.fit(X_train_scaled, y_train_scaled.ravel())
print(svr_grid.best_estimator_)
print(svr_grid.best_score_)
SVR(C=10, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.0001, verbose=False)
0.8311303137187737

cv_svr = svr_grid.best_score_
y_pred_svr_train = svr.predict(X_train_scaled)
r2_score_svr_train = r2_score(y_train_scaled, y_pred_svr_train)

y_pred_svr_test = svr.predict(X_test_scaled)
r2_score_svr_test = r2_score(y_test_scaled, y_pred_svr_test)

rmse_svr = (np.sqrt(mean_squared_error(y_test_scaled, y_pred_svr_test)))

print('CV : {0:.3f}'.format(cv_svr.mean()))
print('R2_score (train) : {0:.3f}'.format(r2_score_svr_train))
print('R2 score (test) : {0:.3f}'.format(r2_score_svr_test))
print('RMSE : {0:.3f}'.format(rmse_svr))
CV : 0.831
R2_score (train) : 0.857
R2 score (test) : 0.871
RMSE : 0.359

Ridge Regressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

steps = [('scalar', StandardScaler()),
         ('poly', PolynomialFeatures(degree=2)),
         ('model', Ridge())]

ridge_pipe = Pipeline(steps)

parameters = {'model__alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 2, 5, 10, 20, 25, 35, 43, 55, 100],
              'model__random_state': [42]}
reg_ridge = GridSearchCV(ridge_pipe, parameters, cv=10)
reg_ridge = reg_ridge.fit(X_train, y_train.ravel())
C:\Users\sahil\Anaconda3\lib\site-packages\sklearn\linear_model\ridge.py:147: LinAlgWarning: Ill-conditioned
matrix (rcond=2.25803e-19): result may not be accurate.
overwrite_a=True).T
C:\Users\sahil\Anaconda3\lib\site-packages\sklearn\linear_model\ridge.py:147: LinAlgWarning: Ill-conditioned
matrix (rcond=2.14414e-19): result may not be accurate.
overwrite_a=True).T

reg_ridge.best_estimator_, reg_ridge.best_score_
(Pipeline(memory=None,
steps=[('scalar',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('poly',
PolynomialFeatures(degree=2, include_bias=True,
interaction_only=False, order='C')),
('model',
Ridge(alpha=20, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=42, solver='auto',
tol=0.001))],
verbose=False),
0.8259990140429396)

ridge = Ridge(alpha=20, random_state=42)
ridge.fit(X_train_scaled, y_train_scaled.ravel())
cv_ridge = reg_ridge.best_score_

y_pred_ridge_train = ridge.predict(X_train_scaled)
r2_score_ridge_train = r2_score(y_train_scaled, y_pred_ridge_train)

y_pred_ridge_test = ridge.predict(X_test_scaled)
r2_score_ridge_test = r2_score(y_test_scaled, y_pred_ridge_test)

rmse_ridge = (np.sqrt(mean_squared_error(y_test_scaled, y_pred_linear_reg_test)))

print('CV : {0:.3f}'.format(cv_ridge.mean()))
print('R2 score (train) : {0:.3f}'.format(r2_score_ridge_train))
print('R2 score (test) : {0:.3f}'.format(r2_score_ridge_test))
print('RMSE : {0:.3f}'.format(rmse_ridge))
CV : 0.826
R2 score (train) : 0.741
R2 score (test) : 0.784
RMSE : 0.465

RandomForest Regressor
%%time
reg_rf = RandomForestRegressor()
parameters = {'n_estimators': [600, 1000, 1200],
              'max_features': ["auto"],
              'max_depth': [40, 50, 60],
              'min_samples_split': [5, 7, 9],
              'min_samples_leaf': [7, 10, 12],
              'criterion': ['mse']}

reg_rf_gscv = GridSearchCV(estimator=reg_rf, param_grid=parameters, cv=10, n_jobs=-1)
reg_rf_gscv = reg_rf_gscv.fit(X_train_scaled, y_train_scaled.ravel())
Wall time: 9min 47s
reg_rf_gscv.best_score_, reg_rf_gscv.best_estimator_
(0.8483687880955955,
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_split=7,
min_weight_fraction_leaf=0.0, n_estimators=1200,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False))
rf_reg = RandomForestRegressor(max_depth=50, min_samples_leaf=12, min_samples_split=7, n_estimators=1200)
rf_reg.fit(X_train_scaled, y_train_scaled.ravel())
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_split=7,
min_weight_fraction_leaf=0.0, n_estimators=1200,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
cv_rf = reg_rf_gscv.best_score_
y_pred_rf_train = rf_reg.predict(X_train_scaled)
r2_score_rf_train = r2_score(y_train, y_pred_rf_train)

y_pred_rf_test = rf_reg.predict(X_test_scaled)
r2_score_rf_test = r2_score(y_test_scaled, y_pred_rf_test)

rmse_rf = np.sqrt(mean_squared_error(y_test_scaled, y_pred_rf_test))

print('CV : {0:.3f}'.format(cv_rf.mean()))
print('R2 score (train) : {0:.3f}'.format(r2_score_rf_train))
print('R2 score (test) : {0:.3f}'.format(r2_score_rf_test))
print('RMSE : {0:.3f}'.format(rmse_rf))
CV : 0.848
R2 score (train) : 0.884
R2 score (test) : 0.879
RMSE : 0.348
models = [('Linear Regression', rmse_linear, r2_score_linear_reg_train, r2_score_linear_reg_test, cv_linear_reg.mean()),
          ('Ridge Regression', rmse_ridge, r2_score_ridge_train, r2_score_ridge_test, cv_ridge.mean()),
          ('Support Vector Regression', rmse_svr, r2_score_svr_train, r2_score_svr_test, cv_svr.mean()),
          ('Random Forest Regression', rmse_rf, r2_score_rf_train, r2_score_rf_test, cv_rf.mean())
         ]

predict = pd.DataFrame(data=models, columns=['Model', 'RMSE', 'R2_Score(training)', 'R2_Score(test)', 'Cross-Validation'])
predict
                        Model      RMSE  R2_Score(training)  R2_Score(test)  Cross-Validation
0           Linear Regression  0.479808            0.741410        0.782694          0.744528
1            Ridge Regression  0.465206            0.741150        0.783800          0.825999
2   Support Vector Regression  0.358771            0.857234        0.871283          0.831130
3    Random Forest Regression  0.347522            0.884422        0.879228          0.848369

plt.figure(figsize=(12,7))
predict.sort_values(by=['Cross-Validation'], ascending=False, inplace=True)

sns.barplot(x='Cross-Validation', y='Model', data=predict, palette='Reds')
plt.xlabel('Cross Validation Score')
plt.ylabel('Model')
plt.show()
Experiment No 15
Name of the Experiment : Flight Price Prediction Using Machine Learning Algorithms

Instructions:-
1. You will have the dataset.
2. Find the cheapest and most expensive flight at a specific time.
3. You will have to go through EDA.
4. Train ML Model.
5. Find a sweet spot for a cheap ticket.

Dataset X includes the following features


 f1: Ticket Purchase Date Time
 f2: Origin
 f3: Destination
 f4: Departure Date Time
 f5: Arrival Date Time
 f6: Airline
 f7: Refundable Ticket
 f8: Baggage Weight
 f9: Baggage Pieces
 f10: Flight Number

Dataset Y will have the following variables


 Target

1) Exploratory Data Analysis:-


We will extract information from our data
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date, datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# Load your data
train = pd.read_csv("sastaticket_train.csv")
test = pd.read_csv("sastaticket_test.csv")
df = train
df.head()
   Unnamed: 0.2  Unnamed: 0                                f1 f2 f3                         f4                         f5     f6    f7    f8  f9   f10  Unnamed: 0.1   target
0        276919      276919  2021-01-08 12:43:27.828728+00:00  x  y  2021-01-23 05:00:00+00:00  2021-01-23 07:00:00+00:00  gamma  True   0.0   0   c-2        276919   7400.0
1      12092463    12092463  2021-07-01 04:45:11.397541+00:00  x  y  2021-07-01 13:00:00+00:00  2021-07-01 15:00:00+00:00  alpha  True  35.0   1   a-9      12092463  15377.0
2      11061788    11061788  2021-06-24 11:28:47.565115+00:00  x  y  2021-07-29 14:00:00+00:00  2021-07-29 16:00:00+00:00  gamma  True  20.0   1   c-4      11061788   6900.0
3       8799808     8799808  2021-06-05 11:09:48.655927+00:00  x  y  2021-06-09 16:00:00+00:00  2021-06-09 18:00:00+00:00  alpha  True  15.0   1  a-23       8799808   9707.0
4      16391150    16391150  2021-07-29 09:53:51.065306+00:00  x  y  2021-08-23 05:00:00+00:00  2021-08-23 06:55:00+00:00   beta  True  20.0   0   b-1      16391150   6500.0

df_test = test
df_test.head()

   Unnamed: 0                                f1 f2 f3                         f4                         f5     f6    f7    f8  f9  f10
0     2694449  2021-09-16 12:20:01.578279+00:00  x  y  2021-10-03 04:40:00+00:00  2021-10-03 06:40:00+00:00  omega  True  20.0   1  d-1
1     3088556  2021-09-18 20:13:13.612131+00:00  x  y  2021-09-23 17:05:00+00:00  2021-09-23 19:05:00+00:00  omega  True  20.0   1  d-5
2     3914899  2021-09-24 17:53:41.424953+00:00  x  y  2021-11-10 13:00:00+00:00  2021-11-10 15:00:00+00:00  alpha  True  20.0   1  a-9
3     1139859  2021-09-07 19:39:07.182848+00:00  x  y  2021-09-13 05:00:00+00:00  2021-09-13 06:55:00+00:00   beta  True  40.0   0  b-1
4      594648  2021-09-05 03:48:20.099555+00:00  x  y  2021-09-22 04:00:00+00:00  2021-09-22 06:00:00+00:00  alpha  True  20.0   1  a-1

X_train_df = train
y_train_df = train

X_train_df = X_train_df.drop(["Unnamed: 0.2", "Unnamed: 0", "Unnamed: 0.1", "target"], axis=1)
X_train_df.head()
                                 f1 f2 f3                         f4                         f5     f6    f7    f8  f9   f10
0  2021-01-08 12:43:27.828728+00:00  x  y  2021-01-23 05:00:00+00:00  2021-01-23 07:00:00+00:00  gamma  True   0.0   0   c-2
1  2021-07-01 04:45:11.397541+00:00  x  y  2021-07-01 13:00:00+00:00  2021-07-01 15:00:00+00:00  alpha  True  35.0   1   a-9
2  2021-06-24 11:28:47.565115+00:00  x  y  2021-07-29 14:00:00+00:00  2021-07-29 16:00:00+00:00  gamma  True  20.0   1   c-4
3  2021-06-05 11:09:48.655927+00:00  x  y  2021-06-09 16:00:00+00:00  2021-06-09 18:00:00+00:00  alpha  True  15.0   1  a-23
4  2021-07-29 09:53:51.065306+00:00  x  y  2021-08-23 05:00:00+00:00  2021-08-23 06:55:00+00:00   beta  True  20.0   0   b-1

y_train_df = y_train_df.drop(["Unnamed: 0.2", "Unnamed: 0", "Unnamed: 0.1", "target"], axis=1)
y_train_df.head()
                                 f1 f2 f3                         f4                         f5     f6    f7    f8  f9   f10
0  2021-01-08 12:43:27.828728+00:00  x  y  2021-01-23 05:00:00+00:00  2021-01-23 07:00:00+00:00  gamma  True   0.0   0   c-2
1  2021-07-01 04:45:11.397541+00:00  x  y  2021-07-01 13:00:00+00:00  2021-07-01 15:00:00+00:00  alpha  True  35.0   1   a-9
2  2021-06-24 11:28:47.565115+00:00  x  y  2021-07-29 14:00:00+00:00  2021-07-29 16:00:00+00:00  gamma  True  20.0   1   c-4
3  2021-06-05 11:09:48.655927+00:00  x  y  2021-06-09 16:00:00+00:00  2021-06-09 18:00:00+00:00  alpha  True  15.0   1  a-23
4  2021-07-29 09:53:51.065306+00:00  x  y  2021-08-23 05:00:00+00:00  2021-08-23 06:55:00+00:00   beta  True  20.0   0   b-1

# Structure
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0.2 5000 non-null int64
1 Unnamed: 0 5000 non-null int64
2 f1 5000 non-null object
3 f2 5000 non-null object
4 f3 5000 non-null object
5 f4 5000 non-null object
6 f5 5000 non-null object
7 f6 5000 non-null object
8 f7 5000 non-null bool
9 f8 5000 non-null float64
10 f9 5000 non-null int64
11 f10 5000 non-null object
12 Unnamed: 0.1 5000 non-null int64
13 target 5000 non-null float64
dtypes: bool(1), float64(2), int64(4), object(7)
memory usage: 512.8+ KB
# Check null values
df.isnull().sum()
Unnamed: 0.2 0
Unnamed: 0 0
f1 0
f2 0
f3 0
f4 0
f5 0
f6 0
f7 0
f8 0
f9 0
f10 0
Unnamed: 0.1 0
target 0
dtype: int64
# Get summary statistics of the data
df.describe()
       Unnamed: 0.2    Unnamed: 0           f8           f9  Unnamed: 0.1        target
count  5.000000e+03  5.000000e+03  5000.000000  5000.000000  5.000000e+03   5000.000000
mean   1.086293e+07  1.086293e+07    22.494400     0.944600  1.086293e+07  10104.351800
std    6.275456e+06  6.275456e+06     8.887101     0.607951  6.275456e+06   3359.936118
min    2.499000e+03  2.499000e+03     0.000000     0.000000  2.499000e+03   4990.000000
25%    5.417290e+06  5.417290e+06    20.000000     1.000000  5.417290e+06   7796.000000
50%    1.093803e+07  1.093803e+07    20.000000     1.000000  1.093803e+07   9403.000000
75%    1.621582e+07  1.621582e+07    32.000000     1.000000  1.621582e+07  11245.000000
max    2.177443e+07  2.177443e+07    45.000000     2.000000  2.177443e+07  33720.000000

# Understanding the data


df.head()
   Unnamed: 0.2  Unnamed: 0                                f1 f2 f3                         f4                         f5     f6    f7    f8  f9   f10  Unnamed: 0.1   target
0        276919      276919  2021-01-08 12:43:27.828728+00:00  x  y  2021-01-23 05:00:00+00:00  2021-01-23 07:00:00+00:00  gamma  True   0.0   0   c-2        276919   7400.0
1      12092463    12092463  2021-07-01 04:45:11.397541+00:00  x  y  2021-07-01 13:00:00+00:00  2021-07-01 15:00:00+00:00  alpha  True  35.0   1   a-9      12092463  15377.0
2      11061788    11061788  2021-06-24 11:28:47.565115+00:00  x  y  2021-07-29 14:00:00+00:00  2021-07-29 16:00:00+00:00  gamma  True  20.0   1   c-4      11061788   6900.0
3       8799808     8799808  2021-06-05 11:09:48.655927+00:00  x  y  2021-06-09 16:00:00+00:00  2021-06-09 18:00:00+00:00  alpha  True  15.0   1  a-23       8799808   9707.0
4      16391150    16391150  2021-07-29 09:53:51.065306+00:00  x  y  2021-08-23 05:00:00+00:00  2021-08-23 06:55:00+00:00   beta  True  20.0   0   b-1      16391150   6500.0

#Getting columns names


print(df.columns)
Index(['Unnamed: 0.2', 'Unnamed: 0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7',
'f8', 'f9', 'f10', 'Unnamed: 0.1', 'target'],
dtype='object')
# Finding unique values in categorical lists
cat_list = ["f2" , "f3" , "f6" , "f8" , "f9" , "f10"]
for i in cat_list:
    print(i, df[i].unique())
f2 ['x']
f3 ['y']
f6 ['gamma' 'alpha' 'beta' 'omega']
f8 [ 0. 35. 20. 15. 32. 40. 45.]
f9 [0 1 2]
f10 ['c-2' 'a-9' 'c-4' 'a-23' 'b-1' 'a-5' 'b-9' 'a-7' 'd-1' 'c-6' 'a-1' 'd-5'
'b-69' 'b-19' 'd-3' 'b-319' 'b-369' 'b-67' 'b-73']
# We will find the ticket price based on f6 (Airline Name), f8 (Baggage Weight) and f9 (Baggage Pieces)
df.drop(['Unnamed: 0.2', 'Unnamed: 0', 'f2', 'f3', 'f10', 'Unnamed: 0.1'], axis=1, inplace=True)
df_test.drop(['Unnamed: 0', 'f2', 'f3', 'f10'], axis=1, inplace=True)
df.head()
                                 f1                         f4                         f5     f6    f7    f8  f9   target
0  2021-01-08 12:43:27.828728+00:00  2021-01-23 05:00:00+00:00  2021-01-23 07:00:00+00:00  gamma  True   0.0   0   7400.0
1  2021-07-01 04:45:11.397541+00:00  2021-07-01 13:00:00+00:00  2021-07-01 15:00:00+00:00  alpha  True  35.0   1  15377.0
2  2021-06-24 11:28:47.565115+00:00  2021-07-29 14:00:00+00:00  2021-07-29 16:00:00+00:00  gamma  True  20.0   1   6900.0
3  2021-06-05 11:09:48.655927+00:00  2021-06-09 16:00:00+00:00  2021-06-09 18:00:00+00:00  alpha  True  15.0   1   9707.0
4  2021-07-29 09:53:51.065306+00:00  2021-08-23 05:00:00+00:00  2021-08-23 06:55:00+00:00   beta  True  20.0   0   6500.0

df_test.head()
                                 f1                         f4                         f5     f6    f7    f8  f9
0  2021-09-16 12:20:01.578279+00:00  2021-10-03 04:40:00+00:00  2021-10-03 06:40:00+00:00  omega  True  20.0   1
1  2021-09-18 20:13:13.612131+00:00  2021-09-23 17:05:00+00:00  2021-09-23 19:05:00+00:00  omega  True  20.0   1
2  2021-09-24 17:53:41.424953+00:00  2021-11-10 13:00:00+00:00  2021-11-10 15:00:00+00:00  alpha  True  20.0   1
3  2021-09-07 19:39:07.182848+00:00  2021-09-13 05:00:00+00:00  2021-09-13 06:55:00+00:00   beta  True  40.0   0
4  2021-09-05 03:48:20.099555+00:00  2021-09-22 04:00:00+00:00  2021-09-22 06:00:00+00:00  alpha  True  20.0   1

# Now we will split date and time separately


df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 5000 non-null object
1 f4 5000 non-null object
2 f5 5000 non-null object
3 f6 5000 non-null object
4 f7 5000 non-null bool
5 f8 5000 non-null float64
6 f9 5000 non-null int64
7 target 5000 non-null float64
dtypes: bool(1), float64(2), int64(1), object(4)
memory usage: 278.4+ KB
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 250 non-null object
1 f4 250 non-null object
2 f5 250 non-null object
3 f6 250 non-null object
4 f7 250 non-null bool
5 f8 250 non-null float64
6 f9 250 non-null int64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 12.1+ KB
# Will do typecasting
df["f1"] = pd.to_datetime(df["f1"])
df["f4"] = pd.to_datetime(df["f4"])
df["f5"] = pd.to_datetime(df["f5"])

df_test["f1"] = pd.to_datetime(df_test["f1"])
df_test["f4"] = pd.to_datetime(df_test["f4"])
df_test["f5"] = pd.to_datetime(df_test["f5"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 5000 non-null datetime64[ns, UTC]
1 f4 5000 non-null datetime64[ns, UTC]
2 f5 5000 non-null datetime64[ns, UTC]
3 f6 5000 non-null object
4 f7 5000 non-null bool
5 f8 5000 non-null float64
6 f9 5000 non-null int64
7 target 5000 non-null float64
dtypes: bool(1), datetime64[ns, UTC](3), float64(2), int64(1), object(1)
memory usage: 278.4+ KB
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 250 non-null datetime64[ns, UTC]
1 f4 250 non-null datetime64[ns, UTC]
2 f5 250 non-null datetime64[ns, UTC]
3 f6 250 non-null object
4 f7 250 non-null bool
5 f8 250 non-null float64
6 f9 250 non-null int64
dtypes: bool(1), datetime64[ns, UTC](3), float64(1), int64(1), object(1)
memory usage: 12.1+ KB
# Adding columns after subtraction to find the time before departure and the travel time
df.insert(0, "time_to_depart(s)", ((df["f4"] - df["f1"]).astype("timedelta64[s]")), True)
df.insert(1, "time_travel(s)", ((df["f5"] - df["f4"]).astype("timedelta64[s]")), True)

df_test.insert(0, "time_to_depart(s)", ((df_test["f4"] - df_test["f1"]).astype("timedelta64[s]")), True)
df_test.insert(1, "time_travel(s)", ((df_test["f5"] - df_test["f4"]).astype("timedelta64[s]")), True)
df.head()
   time_to_depart(s)  time_travel(s)                                f1                         f4                         f5     f6    f7    f8  f9   target
0          1268192.0          7200.0  2021-01-08 12:43:27.828728+00:00  2021-01-23 05:00:00+00:00  2021-01-23 07:00:00+00:00  gamma  True   0.0   0   7400.0
1            29688.0          7200.0  2021-07-01 04:45:11.397541+00:00  2021-07-01 13:00:00+00:00  2021-07-01 15:00:00+00:00  alpha  True  35.0   1  15377.0
2          3033072.0          7200.0  2021-06-24 11:28:47.565115+00:00  2021-07-29 14:00:00+00:00  2021-07-29 16:00:00+00:00  gamma  True  20.0   1   6900.0
3           363011.0          7200.0  2021-06-05 11:09:48.655927+00:00  2021-06-09 16:00:00+00:00  2021-06-09 18:00:00+00:00  alpha  True  15.0   1   9707.0
4          2142368.0          6900.0  2021-07-29 09:53:51.065306+00:00  2021-08-23 05:00:00+00:00  2021-08-23 06:55:00+00:00   beta  True  20.0   0   6500.0

df_test.head()
   time_to_depart(s)  time_travel(s)                                f1                         f4                         f5     f6    f7    f8  f9
0          1441198.0          7200.0  2021-09-16 12:20:01.578279+00:00  2021-10-03 04:40:00+00:00  2021-10-03 06:40:00+00:00  omega  True  20.0   1
1           420706.0          7200.0  2021-09-18 20:13:13.612131+00:00  2021-09-23 17:05:00+00:00  2021-09-23 19:05:00+00:00  omega  True  20.0   1
2          4043178.0          7200.0  2021-09-24 17:53:41.424953+00:00  2021-11-10 13:00:00+00:00  2021-11-10 15:00:00+00:00  alpha  True  20.0   1
3           465652.0          6900.0  2021-09-07 19:39:07.182848+00:00  2021-09-13 05:00:00+00:00  2021-09-13 06:55:00+00:00   beta  True  40.0   0
4          1469499.0          7200.0  2021-09-05 03:48:20.099555+00:00  2021-09-22 04:00:00+00:00  2021-09-22 06:00:00+00:00  alpha  True  20.0   1

df.isnull().sum()
time_to_depart(s) 0
time_travel(s) 0
f1 0
f4 0
f5 0
f6 0
f7 0
f8 0
f9 0
target 0
dtype: int64
# Separating categorical and numerical columns
cats_cols = ["f6", "f7", "f8", "f9"]
num_cols = ["time_to_depart(s)", "time_travel(s)"]
# Plotting categorical columns
c = 1
plt.figure(figsize=(15,20))
for i in cats_cols:
    plt.subplot(4, 2, c)
    sns.countplot(x=df[i])
    c = c + 1
plt.show()

# Plotting numerical columns
c = 1
plt.figure(figsize=(15,20))
for i in num_cols:
    plt.subplot(4, 2, c)
    sns.distplot(df[i])
    c = c + 1
plt.show()
C:\Users\muham\AppData\Local\Temp\ipykernel_5280\2961181608.py:8: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.


Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

sns.distplot(df[i])
C:\Users\muham\AppData\Local\Temp\ipykernel_5280\2961181608.py:8: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

sns.distplot(df[i])

# Plotting the target column


sns.displot(df.target)
<seaborn.axisgrid.FacetGrid at 0x1d5b850a830>
sns.boxplot(df["target"])
<Axes: >

# Check Skew
df.skew()
time_to_depart(s) 3.703644
time_travel(s) 1.319374
f7 -11.660949
f8 0.418300
f9 0.027547
target 2.056345
dtype: float64
# Check Kurtosis
df.kurtosis()
time_to_depart(s) 20.796049
time_travel(s) 5.504109
f7 134.031335
f8 0.046374
f9 -0.310364
target 6.344666
dtype: float64

2) Encoding of variables
df.drop(["f1" , "f4" , "f5"] , axis=1 , inplace=True)df_test.drop(["f1" , "f4" , "f5"] ,
axis=1 , inplace=True)
df.head()
   time_to_depart(s)  time_travel(s)     f6    f7    f8  f9   target
0          1268192.0          7200.0  gamma  True   0.0   0   7400.0
1            29688.0          7200.0  alpha  True  35.0   1  15377.0
2          3033072.0          7200.0  gamma  True  20.0   1   6900.0
3           363011.0          7200.0  alpha  True  15.0   1   9707.0
4          2142368.0          6900.0   beta  True  20.0   0   6500.0

# Encoding
le = LabelEncoder()
df["f6"] = le.fit_transform(df["f6"])df["f7"] = le.fit_transform(df["f7"])df["f8"] =
le.fit_transform(df["f8"])
df_test["f6"] = le.fit_transform(df_test["f6"])df_test["f7"] =
le.fit_transform(df_test["f7"])df_test["f8"] = le.fit_transform(df_test["f8"])
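One caveat: calling fit_transform separately on df and df_test lets each frame receive its own category-to-integer mapping, so the same airline can map to different codes in the two frames. A minimal sketch of the safer pattern, written as an alternative to the cell above (assuming the same df and df_test, before they are encoded):

# Fit one encoder per column on the training data and reuse it for the test data,
# so 'alpha', 'beta', ... map to the same integers in both frames.
from sklearn.preprocessing import LabelEncoder

for col in ["f6", "f7", "f8"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    df_test[col] = enc.transform(df_test[col])   # reuse the mapping learned on df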
df.sample(10)
      time_to_depart(s)  time_travel(s)  f6  f7  f8  f9   target
3163          1317988.0          7140.0   2   1   2   2  11900.0
668           1705075.0          7200.0   2   1   2   2  14000.0
1976           408962.0          7200.0   0   1   2   1  10545.0
4597           402468.0          7200.0   0   1   2   1  10545.0
3342           101098.0          7200.0   0   1   4   1  13277.0
2549           332681.0          6900.0   1   1   2   0   7458.0
377            320911.0          7200.0   2   1   2   1   8750.0
3643          2181824.0          7200.0   2   1   3   2  10655.0
1370          7609835.0          7200.0   2   1   2   2  13560.0
4832           676423.0          7200.0   0   1   1   1   6810.0

df_test.sample(10)
     time_to_depart(s)  time_travel(s)  f6  f7  f8  f9
198          1160731.0          7200.0   3   0   1   1
215          4916436.0          6900.0   1   0   2   0
79             80485.0          7200.0   2   0   1   2
109          1327903.0          7200.0   0   0   0   1
177          1092853.0          7200.0   3   0   1   1
204          1243921.0          7140.0   2   0   1   2
24            281144.0          7200.0   2   0   1   1
49           1536631.0          7200.0   0   0   1   1
245           346286.0          7200.0   0   0   0   1
169           589496.0          6900.0   1   0   1   0

df.describe()
       time_to_depart(s)  time_travel(s)           f6           f7           f8           f9        target
count       5.000000e+03     5000.000000  5000.000000  5000.000000  5000.000000  5000.000000   5000.000000
mean        1.349212e+06     7159.836000     0.953400     0.992800     2.292000     0.944600  10104.351800
std         1.679384e+06      169.613345     0.948371     0.084555     1.247817     0.607951   3359.936118
min         2.003000e+03     6900.000000     0.000000     0.000000     0.000000     0.000000   4990.000000
25%         3.606870e+05     7140.000000     0.000000     1.000000     2.000000     1.000000   7796.000000
50%         8.634945e+05     7200.000000     1.000000     1.000000     2.000000     1.000000   9403.000000
75%         1.698816e+06     7200.000000     2.000000     1.000000     3.000000     1.000000  11245.000000
max         1.916464e+07     7800.000000     3.000000     1.000000     6.000000     2.000000  33720.000000

# Splitting our data into x and y


X = df.drop(["target"], axis=1)
y = df["target"]
X.head()
   time_to_depart(s)  time_travel(s)  f6  f7  f8  f9
0          1268192.0          7200.0   2   1   0   0
1            29688.0          7200.0   0   1   4   1
2          3033072.0          7200.0   2   1   2   1
3           363011.0          7200.0   0   1   1   1
4          2142368.0          6900.0   1   1   2   0

y.head()
0 7400.0
1 15377.0
2 6900.0
3 9707.0
4 6500.0
Name: target, dtype: float64

3) ML Modelling
# Regression Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
# Shorten the names
lr = LinearRegression()
dt = DecisionTreeRegressor()
svr = SVR()
knn = KNeighborsRegressor()
# Model Loop
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for i in [lr, dt, svr, knn]:
    i.fit(X_train, y_train)
    predict = i.predict(X_test)
    test_score = r2_score(y_test, predict)
    train_score = r2_score(y_train, i.predict(X_train))

    # Report only models whose train and test R2 are within 0.1 of each other (i.e. not badly overfit)
    if abs(train_score - test_score) <= 0.1:
        print(i)
        print("R2 score is:", r2_score(y_test, predict))
        print("Mean Absolute error is: ", mean_absolute_error(y_test, predict))
        print("Mean Squared error is: ", mean_squared_error(y_test, predict))
        print("Root Mean Squared error is: ", mean_squared_error(y_test, predict, squared=False))
        print("---------------------------------------")

# Saving the predicted values
res = pd.DataFrame(predict)
res.index = X_test.index
res.columns = ["prediction"]
res.to_csv("prediction_results_traintestsplit.csv")
LinearRegression()
R2 score is: 0.0824576968227253
Mean Absolute error is: 2175.6552919422884
Mean Squared error is: 9604880.483617561
Root Mean Squared error is: 3099.1741615497444
---------------------------------------
SVR()
R2 score is: -0.06121290745272212
Mean Absolute error is: 2234.394635530172
Mean Squared error is: 11108831.830924733
Root Mean Squared error is: 3332.991423770054
---------------------------------------
# Final Data Prediction
lr = LinearRegression().fit(X, y)
predict_value = lr.predict(df_test)
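To address the instruction about finding a sweet spot for a cheap ticket, the predicted fares can be attached back to the processed test features and sorted. A minimal sketch, assuming lr, df_test and predict_value from the cell above:

# Attach predicted fares to the processed test features and inspect the extremes.
results = df_test.copy()
results["predicted_price"] = predict_value
print(results.sort_values("predicted_price").head())                      # cheapest predicted tickets
print(results.sort_values("predicted_price", ascending=False).head())     # most expensive predicted tickets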
