CSE3036 Predictive Analytics Final Lab Manual
Lab Manual
PROGRAM OUTCOMES
PO-1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems. - High
PO-2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems
reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO-3: Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public health and
safety, and the cultural, societal, and environmental considerations.
PO-4: Conduct investigations of complex problems: Use research-based knowledge and research methods including
design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid
conclusions. - High
PO-5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and
IT tools including prediction and modeling to complex engineering activities with an understanding of the
limitations. - High
PO-6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health,
safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO-7: Environment and sustainability: Understand the impact of the professional engineering solutions in societal
and environmental contexts, and demonstrate the knowledge of and need for sustainable development.
PO-8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
PO-9: Individual and teamwork: Function effectively as an individual, and as a member or leader in diverse teams,
and in multidisciplinary settings. - High
PO-10: Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions. - High
PO-11: Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one's own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments. - High
PO-12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent
and life-long learning in the broadest context of technological change.
At the end of the B. Tech. Program in Computer Science and Engineering (Data Science) the students shall:
PSO 01: [Problem Analysis]: Identify, formulate, research literature, and analyze complex engineering problems
related to Data Science principles and practices, Programming and Computing technologies reaching substantiated
conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PSO 02: [Design/development of Solutions]: Design solutions for complex engineering problems related to Data
Science principles and practices, Programming and Computing technologies and design system components or
processes that meet the specified needs with appropriate consideration for the public health and safety, and the
cultural, societal, and environmental considerations.
PSO 03: [Modern Tool usage]: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities related to Data
Science principles and practices, Programming and Computing technologies with an understanding of the
limitations.
COURSE DESCRIPTION:
The Predictive Analytics course is conceptual in nature. Students will benefit from this course by learning
modern data analytic concepts and by developing the skills to analyze and synthesize data sets for decision making
in firms.
COURSE OBJECTIVES
The objective of the course is to familiarize the learners with the concepts of Predictive Analytics and attain
Employability through Experiential Learning techniques.
COURSE OUTCOMES: On successful completion of the course the students shall be able to: (The outcomes are to be developed
using the appropriate action verbs from Bloom's Taxonomy; the list of verbs is attached.)
CO.
CO1 H -- M M -- -- -- M M M --
CO2 H -- M H -- -- -- M M H --
CO3 M -- M H -- -- -- H H M --
CO4 H -- M H -- -- -- M H H --
CO5 H -- M M -- -- -- M M M --
CO1 H - -
CO2 H - -
CO3 - - H
CO4 - L -
CO5 L M -
Assessment Component
Sl. No. | Assessment type | Contents | Course outcome number | Marks | Weightage | Tentative dates
List of Experiments
Maintain Minimum 75% Attendance: Regular attendance is mandatory to ensure continuity in learning and understanding of the
experiments.
Missing lab sessions can lead to gaps in understanding and affect your ability to complete experiments and assignments.
Carefully follow the instructions provided by the course instructor during both lectures and lab sessions.
Ensure all assignments, reports, and projects are submitted before the deadline. Late submissions may result in penalties.
Treat all lab equipment, including computers and peripherals, with care to avoid damage.
Experiment No 1
Name of the Experiment : Predicting buying behaviour of people interested in buying a car,
and whether or not they bought the car
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
This dataset contains information about the age, gender and annual salary of people interested in buying a car and whether or not they bought the car. The
data will be used to train a model to help predict if future customers will buy the car or not.
A car dealership can use this model to predict whether a customer is likely to buy a car based on their age and salary, and then target their advertising and sales efforts accordingly. This can help the dealership to more efficiently allocate their
resources and increase their sales. Similarly, an auto manufacturer can use this model to gain insights into their target market and tailor their product offerings
to better meet the needs of their customers.
df = pd.read_csv('car_data.csv')
display(df.head())
X = df.iloc[:400, [2, 3]].values
y = df.iloc[:400, -1].values
print(X)
[[ 35 20000]
[ 40 43500]
[ 49 74000]
[ 40 107500]
[ 25 79000]
[ 47 33500]
[ 46 132500]
[ 42 64000]
[ 30 84500]
[ 41 52000]
[ 42 80000]
[ 47 23000]
[ 32 72500]
[ 27 57000]
[ 42 108000]
[ 33 149000]
[ 35 75000]
[ 35 53000]
[ 46 79000]
[ 39 134000]
[ 39 51500]
[ 49 39000]
[ 54 25500]
[ 41 61500]
[ 31 117500]
[ 24 58000]
[ 40 107000]
[ 40 97500]
[ 48 29000]
[ 38 147500]
[ 45 26000]
[ 32 67500]
[ 37 62000]
[ 41 79500]
[ 44 113500]
[ 47 41500]
[ 38 55000]
[ 39 114500]
[ 42 73000]
[ 26 15000]
[ 21 37500]
[ 59 39500]
[ 39 66500]
[ 43 80500]
[ 49 86000]
[ 37 75000]
[ 49 76500]
[ 28 123000]
[ 59 48500]
[ 40 60500]
[ 38 99500]
[ 51 35500]
[ 55 130000]
[ 23 56500]
[ 49 43500]
[ 49 36000]
[ 48 21500]
[ 49 98500]
sampled_df = df.sample(n=400)
X = sampled_df.iloc[:, [2, 3]].values
y = sampled_df.iloc[:, -1].values
print(X)
[[ 41 71000]
[ 20 49000]
[ 30 29500]
[ 39 72500]
[ 19 87500]
[ 31 63500]
[ 48 119000]
[ 53 143000]
[ 33 151500]
[ 42 60500]
[ 18 44000]
[ 55 116500]
[ 36 40500]
[ 51 140500]
[ 26 15000]
[ 32 117000]
[ 48 21500]
[ 23 28500]
[ 33 136500]
[ 29 43000]
[ 30 15000]
[ 39 81500]
[ 18 52000]
[ 37 127500]
[ 38 58500]
[ 35 75000]
[ 39 71000]
[ 38 79500]
[ 40 60000]
[ 26 17000]
[ 59 106500]
[ 56 131500]
[ 25 33500]
[ 43 109500]
[ 30 89000]
[ 42 80000]
[ 21 72000]
[ 54 26000]
[ 43 133000]
[ 31 90500]
[ 27 44500]
[ 27 20000]
[ 24 29500]
[ 31 16500]
[ 40 79500]
[ 28 90500]
[ 37 53500]
[ 36 80500]
[ 45 75500]
[ 40 57000]
[ 28 55500]
[ 33 60000]
[ 33 69000]
[ 19 19000]
[ 26 43000]
[ 51 136500]
[ 41 48500]
[ 60 42000]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling and fitting the classifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

classifier = LogisticRegression()
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = y_train.ravel()
classifier.fit(X_train, y_train)
LogisticRegression()
print(X_train)
[[ 1.86185204  2.03821389]
 [ 0.10108128 -0.01879763]
 [ 0.10108128 -0.31265642]
 [ 0.00326069 -0.4385959 ]
 [-0.77930409 -1.53007141]
 [ 0.29672248  1.03069804]
 [-0.6814835   0.80680563]
 [ 1.56839025  0.94673839]
 [ 0.88364606 -1.15225296]
 [-0.3880217   0.02318219]
 ...
 [-1.75751007 -1.53007141]
 [ 1.07928726  0.47096702]]
print(classifier.predict(sc.transform([[49, 74000]])))
[1]
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))
[[1 1]
[0 0]
[0 0]
[0 0]
[0 1]
[1 1]
[0 0]
[0 0]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 1]
[0 0]
[1 1]
[0 0]
[0 0]
[0 1]
[0 1]
[0 1]
[0 0]
[1 1]
[1 1]
[0 0]
[1 1]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 1]
[1 0]
[0 1]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
The model was trained on the training set and evaluated on the test set; the confusion matrix of the
predictions is as follows:
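The manual refers to the confusion matrix but does not show the code that produces it; a minimal sketch using scikit-learn, assuming the y_pred computed above:
from sklearn.metrics import confusion_matrix, accuracy_score

# confusion matrix of true vs. predicted labels on the test set
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy:", accuracy_score(y_test, y_pred))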
The colored background shows the predicted class for each point in the graph. The dots in the foreground represent the training data points and their color
shows their true class. The line in the middle divides the graph into two parts for each class. The picture gives us an idea about how good the tool is at
predicting data and how it might work on new data. The legend tells us which color represents which class.
The second plot is a representation of the decision boundary that separates the two classes in the feature space. The dots represent the test data points, and their color indicates
their true class label. The plot helps us understand how well the model predicts the class labels and where it might make mistakes. The
legend tells us which color represents each class.
Probability
We can now use the trained model to predict whether or not someone will purchase a car based on their age and annual salary. The model is used to predict whether a
43 year old customer X with an annual salary of 43000 will buy the car.
new_data = [[43, 43000]]
# scale the new data using the same scaler used on the training data
new_data_scaled = sc.transform(new_data)
prediction = classifier.predict(new_data_scaled)
print(prediction)
[0]
The outcome "0" indicates that it is predicted that this customer X will not buy the car. New
new_data_2 = [[35, 50000], [42, 15700], [51, 65000], [23, 45600], [36, 42000], [41, 72000]]
df = pd.DataFrame(new_data_2, columns=[ 'Age', 'Annual Salary'])
df.index = df.index + 1
display(df)
Age  Annual Salary
1 35 50000
2 42 15700
3 51 65000
4 23 45600
5 36 42000
6 41 72000
   Age  Annual Salary  Prediction
1   35          50000  No
2   42          15700  No
3   51          65000  Yes
4   23          45600  No
5   36          42000  No
6   41          72000  No
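The code that produced the Yes/No predictions above is not shown; a sketch, assuming the same scaler and classifier and a hypothetical column name 'Prediction':
new_scaled = sc.transform(new_data_2)                      # reuse the scaler fitted on the training data
preds = classifier.predict(new_scaled)
df['Prediction'] = ['Yes' if p == 1 else 'No' for p in preds]  # column name is an assumption
display(df)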
Name of the Experiment : Predicting if a person would buy life insurance or not
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("insurance_data.csv")
df.head()
age  bought_insurance
0 22 0
1 25 0
2 47 1
3 52 0
4 46 1
plt.scatter(df.age, df.bought_insurance, marker='+', color='red')
<matplotlib.collections.PathCollection at 0x7f98a203fc70>
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, test_size=0.3)  # test fraction assumed
model = LogisticRegression()
X_test
age
7 60
21 26
18 19
25 54
26 23
20 21
17 58
5 56
16 25
model.fit(X_train, y_train)
LogisticRegression()
X_test
age
7 60
21 26
18 19
25 54
26 23
20 21
17 58
5 56
16 25
y_predicted = model.predict(X_test)
model.predict_proba(X_test)
array([[0.06491009, 0.93508991],
       [0.90192749, 0.09807251],
       [0.96175932, 0.03824068],
       [0.14120445, 0.85879555],
       [0.93401012, 0.06598988],
       [0.94966579, 0.05033421],
       [0.08469503, 0.91530497],
       [0.10980239, 0.89019761],
       [0.91392635, 0.08607365]])
model.score(X_test, y_test)
0.8888888888888888
y_predicted
array([1, 0, 0, 1, 0, 0, 1, 1, 0])
X_test
age
7 60
21 26
18 19
25 54
26 23
20 21
17 58
5 56
16 25
model.coef_
array([[0.14371961]])
model.intercept_
array([-5.95553684])
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def prediction_function(age):
    # linear combination using the fitted coefficient and intercept printed above
    z = 0.14371961 * age - 5.95553684
    y = sigmoid(z)
    return y

age = 35
prediction_function(age)
0.28386895059715234
age = 43
prediction_function(age)
0.5558673456608243
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predicted)
Experiment No 4
Name of the Experiment : Predict survival based on certain parameters using a decision tree
import pandas as pd
df = pd.read_csv("salaries.csv")
df.head()

inputs = df.drop('salary_more_then_100k', axis='columns')
target = df['salary_more_then_100k']

from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs
9 abc pharma business manager masters 0 0 1
inputs_n = inputs.drop(['company','job','degree'],axis='columns')
inputs_n
   company_n  job_n  degree_n
1  2  2  1
2 2 0 0
3 2 0 1
4 2 1 0
5 2 1 1
6 0 2 1
7 0 1 0
8 0 0 0
9 0 0 1
10 1 2 0
11 1 2 1
12 1 0 0
13 1 0 1
14 1 1 0
15 1 1 1
target
0 0
1 0
2 1
3 1
4 0
5 1
6 0
7 0
8 0
9 1
10 1
11 1
12 1
13 1
14 1
15 1
Name: salary_more_then_100k, dtype: int64
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)
DecisionTreeClassifier()
model.score(inputs_n,target)
1.0
model.predict([[2,1,0]])
model.predict([[2,1,1]])
Using the following columns in this file, build a model to predict whether a person would survive or not:
1. Pclass
2. Sex
3. Age
4. Fare
Name of the Experiment : Build decision tree model to predict survival based on certain parameters
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
inputs = df.drop('Survived',axis='columns')
target = df.Survived
inputs.Age[:10]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()
   Pclass  Sex   Age     Fare
0       3    1  22.0   7.2500
1       1    2  38.0  71.2833
2       3    2  26.0   7.9250
3       1    2  35.0  53.1000
4       3    1  35.0   8.0500
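The split and model that produce the sizes and score below are not shown; a sketch, assuming inputs has been reduced to the encoded Pclass, Sex, Age and Fare columns shown above and a test fraction of 0.2 (inferred from the reported sizes 712/179):
from sklearn.model_selection import train_test_split
from sklearn import tree

X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)
model = tree.DecisionTreeClassifier()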
len(X_train)
712
len(X_test)
179
model.fit(X_train, y_train)
DecisionTreeClassifier()
model.score(X_test, y_test)
0.8100558659217877
Experiment No 6
Name of the Experiment : Email spam detection using Naive Bayes
import pandas as pd
df = pd.read_csv("spam.csv")
df.head()
Category Message
0 ham Go until jurong point, crazy.. Available only ...
df.groupby('Category').describe()
Message
Category
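The later cells use a vectorizer v and a fitted model whose creation is not shown; a sketch of the usual preprocessing, assuming the standard CountVectorizer + MultinomialNB workflow:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# map the Category label to 1 (spam) / 0 (ham) and split the messages
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.25)

# bag-of-words features and the Naive Bayes classifier
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
model = MultinomialNB()
model.fit(X_train_count, y_train)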
emails = [
'Hey mohan, can we get together to watch footbal game tomorrow?',
'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
emails_count = v.transform(emails)
model.predict(emails_count)
array([0, 1])
MultinomialNB()
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)
0.9842067480258435
Sklearn Pipeline
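The clf object used below is not defined in the shown cells; a sketch of how such a pipeline might be built, combining the vectorizer and the Naive Bayes model in one estimator:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vectorizer', CountVectorizer()),   # text to bag-of-words counts
    ('nb', MultinomialNB())              # Naive Bayes classifier
])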
clf.fit(X_train, y_train)
clf.score(X_test,y_test)
0.9842067480258435
clf.predict(emails)
array([0, 1])
Experiment No 7
Name of the Experiment : Naive Bayes: Predicting survival from titanic crash
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()
(df.head() preview: columns PassengerId, Name, Pclass, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin; the first rows show passengers such as Braund, Mr. Owen Harris; Cumings, Mrs. John Bradley; Heikkinen, Miss. Laina; Futrelle, Mrs.; and Allen, Mr. William Henry.)
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'], axis='columns', inplace=True)
df.head()
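The female/male columns shown next come from one-hot encoding the Sex column; a sketch of the assumed steps (separate inputs/target, then get_dummies):
target = df.Survived
inputs = df.drop('Survived', axis='columns')

dummies = pd.get_dummies(inputs.Sex)          # creates the female/male indicator columns
inputs = pd.concat([inputs, dummies], axis='columns')
dummies.head(3)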
female male
0 0 1
1 1 0
2 1 0
inputs.drop(['Sex','male'], axis='columns', inplace=True)
inputs.head(3)
   Pclass   Age     Fare  female
0       3  22.0   7.2500       0
1       1  38.0  71.2833       1
2       3  26.0   7.9250       1
inputs.columns[inputs.isna().any()]
Index(['Age'], dtype='object')
inputs.Age[:10]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
   Pclass   Age     Fare  female
0       3  22.0   7.2500       0
1       1  38.0  71.2833       1
2       3  26.0   7.9250       1
3       1  35.0  53.1000       1
4       3  35.0   8.0500       0
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3)  # test fraction inferred from the reported score denominator
model = GaussianNB()
model.fit(X_train, y_train)
GaussianNB()
model.score(X_test,y_test)
0.7723880597014925
X_test[0:10]
Pclass Age Fare female
y_test[0:10]
883 0
309 1
120 0
489 1
575 0
403 0
407 1
471 0
477 0
26 0
Name: Survived, dtype:
int64
model.predict(X_test[0:10])
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
model.predict_proba(X_test[:10])
array([[0.92242547, 0.07757453],
       [0.03565211, 0.96434789],
       [0.69856719, 0.30143281],
       [0.9425538 , 0.0574462 ],
       [0.95473038, 0.04526962],
       [0.96153647, 0.03846353],
       [0.86634408, 0.13365592],
       [0.96586774, 0.03413226],
       [0.96152186, 0.03847814],
       [0.96195658, 0.03804342]])
Calculate the score using cross validation
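One possible answer to this exercise, assuming the same GaussianNB model and the inputs/target frames used above:
from sklearn.model_selection import cross_val_score

# 5-fold cross validation of the Naive Bayes model
scores = cross_val_score(GaussianNB(), inputs, target, cv=5)
print(scores)
print(scores.mean())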
Name of the Experiment : Prediction of system temperature using Linear Regression algorithms
# libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv("Dataset.csv")
dataset.dtypes
ds = dataset.drop(columns='CPU Package Temperature')
X = ds.iloc[:, :].values
print(X)
print(X1)
[[1.91856384]
[1.9176445 ]
[1.92308807]
...
[2.12525177]
[2.12425232]
[2.12401581]]
data = pd.read_csv("Dataset.csv")
data.head()
#output feature
Y = data.iloc[:,1].values
# splitting
# polynomial regression
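The polynomial-regression step itself is not shown; a sketch of what could produce the X_Poly and regressor1 objects used below, assuming X1 is the single storage-temperature feature of shape (2190, 1) and degree=4 (inferred from the reported shape (2190, 5)); the train/test split hinted at by the comment above is also omitted here:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=4)      # degree assumed from the (2190, 5) shape
X_Poly = poly.fit_transform(X1)
regressor1 = LinearRegression()
regressor1.fit(X_Poly, Y)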
LinearRegression()
X1.shape
(2190, 1)
X_Poly.shape
(2190, 5)
# Predicting with the polynomial model
y_pred = regressor1.predict(X_Poly)
print(y_pred)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y, y_pred)
print(mse)
9.203938505726908
regressor = LinearRegression()
regressor.fit(X1, Y)
LinearRegression()
plt.scatter(X1, Y, color='red')                     # assumed scatter of the raw data
plt.plot(X1, regressor.predict(X1), color='blue')   # assumed fitted regression line
plt.xlabel('STORAGE TEMP')
plt.ylabel('CPU TEMP')
plt.show()
Prediction of system temperature using Polynomial Regression algorithms
plt.xlabel('STORAGE TEMP')
plt.ylabel('CPU TEMP')
plt.show()
Experiment No 9
Name of the Experiment : Predictive maintenance using simulated machinery sensor data (remaining useful life)
import pandas as pd
import numpy as np
data = pd.DataFrame({
'sensor_1': np.random.normal(0, 1, 1000),
'sensor_2': np.random.normal(0, 1, 1000),
'sensor_3': np.random.normal(0, 1, 1000),
'operational_hours': np.random.randint(100, 5000, 1000),
'maintenance': np.random.choice([0, 1], 1000, p=[0.95, 0.05])
})
# Simulating remaining useful life (RUL) based on operational hours and sensor readings
data['RUL'] = 5000 - data['operational_hours'] - (data['sensor_1'] + data['sensor_2'] +
data['sensor_3']).cumsum()
# Save to CSV
data.to_csv('machinery_data.csv', index=False)
In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load data
data = pd.read_csv('machinery_data.csv')
# Feature selection
features = ['sensor_1', 'sensor_2', 'sensor_3', 'operational_hours']
target_rul = 'RUL'
target_maintenance = 'maintenance'
# Normalize features
scaler = StandardScaler()
data[features] = scaler.fit_transform(data[features])
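The shown cells stop after normalization; a sketch of one possible continuation that trains a model for the RUL target (the choice of a RandomForestRegressor here is an assumption, not part of the original manual):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# split the simulated data and fit a regressor for remaining useful life
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target_rul], test_size=0.2, random_state=42)
rul_model = RandomForestRegressor(n_estimators=100, random_state=42)
rul_model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, rul_model.predict(X_test)))
print("RUL RMSE:", rmse)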
Name of the Experiment : Predicting Air Quality (PM 2.5) using Linear Regression
The data was scraped from a website, and the hourly Air Quality readings were aggregated into daily averages for
2013 to 2017, with PM 2.5 as the target value. We then predict the Air Quality using Linear
Regression.
Columns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Let's extract the data from the file
In [4]:
df = pd.read_csv('Real_Combine.csv')
In [5]:
df.head()
Out[5]:
Let's check whether the dataset has null values using a heatmap
In [6]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<AxesSubplot:>
In [10]:
df.isnull().sum()
Out[10]:
T 0
TM 0
Tm 0
SLP 0
H 0
VV 0
V 0
VM 0
PM 2.5 1
dtype: int64
As we can see, PM 2.5 has one null value; let's drop that row, because dropping one value will not affect
the model.
In [16]:
df = df.dropna()
Let's split the dataset into independent and dependent features (see the sketch below).
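The split itself is not shown in the extracted cells; a sketch, assuming PM 2.5 is the last column and the target:
X = df.iloc[:, :-1]   # independent features (T, TM, Tm, SLP, H, VV, V, VM)
Y = df.iloc[:, -1]    # dependent feature (PM 2.5)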
In [17]:
sns.pairplot(df)
Out[18]:
<seaborn.axisgrid.PairGrid at 0x7f8c98da28e0>
Check the correlations of all features
df.corr()
Correlation Matrix with Heatmap
Correlation states how the features are related to each other or to the target variable.
Correlation can be positive (an increase in one feature increases the value of the target variable) or
negative (an increase in one feature decreases the value of the target variable).
A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of
correlated features using the seaborn library.
In [25]:
corrmat = df.corr()
corrmat.index
Out[26]:
Index(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'], dtype='object')
In [23]:
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(df[top_corr_features].corr(),annot=True,cmap='RdYlGn')
Feature Importance
You can get the feature importance of each feature of your dataset by using the feature importance property of
the model.
Feature importance gives you a score for each feature of your data; the higher the score, the more important or
relevant the feature is to your output variable.
Feature importance is an inbuilt attribute of tree-based regressors; we will be using an Extra
Trees Regressor to extract the top 10 features for the dataset.
In [31]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,Y)
ExtraTreesRegressor()
In [33]:
X.head()
Out[33]:
Let's check which features matter most using an ensemble-based feature selection technique
print(model.feature_importances_)
[0.165886 0.09391187 0.17356502 0.1292003 0.08139158 0.25912898
0.05701879 0.03989745]
Let's visualize the top 5 features; these top 5 features can then be applied to the model (a plotting sketch follows).
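The plotting call for this visualization is not shown; a common sketch, assuming the fitted ExtraTreesRegressor model and the X frame above:
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(5).plot(kind='barh')   # top 5 features by importance
plt.show()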
<AxesSubplot:xlabel='PM 2.5'>
Distribution
Train Test Split
In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test , Y_train, Y_test = train_test_split(X,Y, test_size=0.3,random_state=0)
In [51]:
from sklearn.linear_model import LinearRegression
In [53]:
Regressor = LinearRegression()
In [54]:
Regressor.fit(X_train,Y_train)
Out[54]:
LinearRegression()
In [55]:
Regressor.coef_
Out[55]:
Regressor.intercept_
Out[56]:
-157.37425475061315
Lets find the R Squared value for train data
In [92]:
print("Coefficient of determination R^2 <-- on train set: {}".format(Regressor.score(X_train, Y_train)))
Coefficient of determination R^2 <-- on train set: 0.6007706404750855
The R squared value on the training data is reasonably high, so the fit is fairly good.
Lets find the R Squared Value for Test Data
In [61]:
print("Coefficient of determination R^2 <-- on test set: {}".format(Regressor.score(X_test, Y_test)))
Coefficient of determination R^2 <-- on test set: 0.5316188612878152
The R squared value of the test data is also reasonably good.
Let's check the cross validation score
In [65]:
from sklearn.model_selection import cross_val_score
score=cross_val_score(Regressor,X,Y,cv=5)
In [68]:
score.mean()
Out[68]:
0.46724362258523333
Model Evaluation
In [74]:
coeff_df = pd.DataFrame(Regressor.coef_,X.columns,columns=['Coefficient'])
In [75]:
coeff_df
Out[75]:
Coefficient
T 2.639490
TM 0.519979
Tm -7.598118
SLP 0.493220
H -0.837064
VV -50.430135
V -2.754178
VM -0.039266
Holding all other features fixed, a 1 unit increase in T is associated with an increase of 2.639490 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in TM is associated with an increase of 0.519979 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in Tm is associated with a decrease of 7.598118 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in SLP is associated with an increase of 0.493220 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in H is associated with a decrease of 0.837064 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in VV is associated with a decrease of 50.430135 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in V is associated with a decrease of 2.754178 in AQI PM 2.5.
Holding all other features fixed, a 1 unit increase in VM is associated with a decrease of 0.039266 in AQI PM 2.5.
Out[77]:
<AxesSubplot:xlabel='PM 2.5', ylabel='Density'>
The test data follows an approximately Gaussian distribution, which helps the model predict well.
In [79]:
prediction = Regressor.predict(X_test)
plt.scatter(Y_test, prediction)
Mean Absolute Error (MAE) is the mean of the absolute value of the errors: $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$
Mean Squared Error (MSE) is the mean of the squared errors: $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors: $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
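These three metrics can be computed with scikit-learn; a sketch using the Y_test values and the prediction array from above:
from sklearn import metrics

print('MAE :', metrics.mean_absolute_error(Y_test, prediction))
print('MSE :', metrics.mean_squared_error(Y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test, prediction)))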
Name of the Experiment : Predicting the direction of stock market prices using random forest.
Goal
Reproduce the results from the paper "Predicting the direction of stock market prices using random
forest."
Import Libraries
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (7, 4.5)  # Make the default figures a bit bigger

import numpy as np
import random

# Let's make this notebook reproducible
np.random.seed(42)
random.seed(42)

import pandas_techinal_indicators as ta  # https://github.com/Crypto-toolbox/pandas-technical-indicators/blob/master/technical_indicators.py

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, confusion_matrix, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
Data
In order to reproduce the same results as the authors, we try to use the same data they used in page
18, table 15.
Since the authors state that they compare these results with other authors using the data from yahoo
finance in the period [2010-01-04 to 2014-12-10] we will use the same data. It is not clear which
periods they used as traning set and testing set.
aapl = pd.read_csv('AAPL.csv')
del(aapl['Date'])
del(aapl['Adj Close'])
aapl.head()
Open High Low Close Volume
Exponential smoothing
The authors don't give any guideline for alpha, so let's assume it is 0.9
def get_exp_preprocessing(df, alpha=0.9):
edata = df.ewm(alpha=alpha).mean()
return edata
saapl = get_exp_preprocessing(aapl)
saapl.head()  # saapl stands for smoothed aapl
Open High Low Close Volume
note: the Williams %R indicator does not seem to be available in this library yet
def feature_extraction(data):
    for x in [5, 14, 26, 44, 66]:
        data = ta.relative_strength_index(data, n=x)
        data = ta.stochastic_oscillator_d(data, n=x)
        data = ta.accumulation_distribution(data, n=x)
        data = ta.average_true_range(data, n=x)
        data = ta.momentum(data, n=x)
        data = ta.money_flow_index(data, n=x)
        data = ta.rate_of_change(data, n=x)
        data = ta.on_balance_volume(data, n=x)
        data = ta.commodity_channel_index(data, n=x)
        data = ta.ease_of_movement(data, n=x)
        data = ta.trix(data, n=x)
        data = ta.vortex_indicator(data, n=x)
    del(data['Open'])
    del(data['High'])
    del(data['Low'])
    del(data['Volume'])
    return data
def compute_prediction_int(df, n):
    pred = (df.shift(-n)['Close'] >= df['Close'])
    pred = pred.iloc[:-n]
    return pred.astype(int)

def prepare_data(df, horizon):
    data = feature_extraction(df).dropna().iloc[:-horizon]
    data['pred'] = compute_prediction_int(data, n=horizon)
    del(data['Close'])
    return data.dropna()
Make sure that future data is not used by splitting the data in first 2/3 for training and the
last 1/3 for testing
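The construction of X and y is not shown; a sketch, assuming the smoothed prices and a 10-day horizon (10 days matches the paper's prediction window referenced below):
data = prepare_data(saapl, horizon=10)   # horizon assumed
y = data['pred']                         # 1 if the close rises within the horizon, else 0
X = data.drop('pred', axis=1)            # technical-indicator features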
train_size = 2 * len(X) // 3

X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]

print('len X_train', len(X_train))
print('len y_train', len(y_train))
print('len X_test', len(X_test))
print('len y_test', len(y_test))
len X_train 644
len y_train 644
len X_test 323
len y_test 323
Random Forests
rf = RandomForestClassifier(n_jobs=-1, n_estimators=65, random_state=42)
rf.fit(X_train, y_train.values.ravel());
The expected results for a 10 days prediction according to the paper in table 15 for Apple stock
should be around 92%
pred = rf.predict(X_test)
precision = precision_score(y_pred=pred, y_true=y_test)
recall = recall_score(y_pred=pred, y_true=y_test)
f1 = f1_score(y_pred=pred, y_true=y_test)
accuracy = accuracy_score(y_pred=pred, y_true=y_test)
confusion = confusion_matrix(y_pred=pred, y_true=y_test)
print('precision: {0:1.2f}, recall: {1:1.2f}, f1: {2:1.2f}, accuracy: {3:1.2f}'.format(precision, recall, f1, accuracy))
print('Confusion Matrix')
print(confusion)
precision: 0.66, recall: 0.68, f1: 0.67, accuracy: 0.58
Confusion Matrix
[[ 47 71]
[ 66 139]]
plt.figure(figsize=(20,7))
proba = rf.predict_proba(X_test)[:, 1]
plt.plot(np.arange(len(proba)), proba, label='pred_probability')
plt.plot(np.arange(len(y_test)), y_test, label='real')
plt.title('Prediction probability versus reality in the test set')
plt.legend()
plt.show()
<matplotlib.figure.Figure at 0x1fec5d7a908>
Let's now duplicate the analysis for the case where the test
set is shuffled
This means that there is data leakage in the training set, as the future and the past are together in
the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 2*len(X) // 3)
print('len X_train', len(X_train))
print('len y_train', len(y_train))
print('len X_test', len(X_test))
print('len y_test', len(y_test))
len X_train 644
len y_train 644
len X_test 323
len y_test 323
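The re-training step on the shuffled (leaky) split is not shown before the accuracy claim below; a sketch of what it might look like, reusing the same random forest configuration:
rf = RandomForestClassifier(n_jobs=-1, n_estimators=65, random_state=42)
rf.fit(X_train, y_train.values.ravel())
pred = rf.predict(X_test)
print('accuracy: {0:1.2f}'.format(accuracy_score(y_pred=pred, y_true=y_test)))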
The accuracy results almost match those expected from the paper 87% vs the expected 92%
plt.figure(figsize=(20,7))
plt.plot(np.arange(len(pred)), pred, alpha=0.7, label='pred')
plt.plot(np.arange(len(y_test)), y_test, alpha=0.7, label='real')
plt.title('Prediction versus reality in the test set - Using Leaked data')
plt.legend()
plt.figure(figsize=(20,7))
proba = rf.predict_proba(X_test)[:, 1]
plt.plot(np.arange(len(proba)), proba, alpha=0.7, label='pred_probability')
plt.plot(np.arange(len(y_test)), y_test, alpha=0.7, label='real')
plt.title('Prediction probability versus reality in the test set - Using Leaked data')
plt.legend()
plt.show()
<matplotlib.figure.Figure at 0x1fec591be10>
Experiment No 13
Name of the Experiment : Stock price prediction and forecasting using machine learning models
import pandas as pd
import numpy as np

# Load the stock data
data = pd.read_csv('AAPL_short_volume.csv')
close_prices_AAPL = data['Close']
# Data preprocessing
close_prices_AAPL_reverse = close_prices_AAPL[::-1]       # oldest prices first (assumed)
data = close_prices_AAPL_reverse.values.reshape(-1, 1)    # Reshape the data
data_normalized = data / np.max(data)                     # Normalize the data
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test_data[1:], test_predictions))
svm_params = {
'C': [0.1, 1, 10],
'gamma': [0.01, 0.1, 1]
}
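How these parameter grids are used is not shown in the extracted cells; a hedged sketch, assuming a grid search over a support vector regressor (the training arrays are prepared elsewhere in the original notebook):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

svm_grid = GridSearchCV(SVR(), svm_params, cv=3)
# svm_grid.fit(X_train, y_train)   # X_train/y_train assumed from the earlier preprocessing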
rf_params = {
lstm_activations = ['relu', 'tanh']
learning_rates = [0.001, 0.01, 0.1]
epochs = 100
batch_size = 32
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test_data[1:], test_predictions))
# Hypothetical wrapper (the original def line is missing): roll the model forward num_predictions steps
def predict_future(model, last_data_point, num_predictions):
    predictions = []
    for _ in range(num_predictions):
        # Predict the next time step
        prediction = model.predict(last_data_point.reshape(1, 1, 1))
        predictions.append(prediction[0, 0])
        # Update last_data_point to include the predicted value for the next iteration
        last_data_point = np.append(last_data_point[1:], prediction)
    return predictions
Experiment No 14
Name of the Experiment : Medical Insurance Price Prediction Using Machine learning Algorithms
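The imports and the data loading are not shown before data.info(); a sketch of the assumed setup (the file name insurance.csv and the label-encoded working copy data_copy used by the later cells are assumptions):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('insurance.csv')

# encoded copy used by the describe()/corr() cells below (encoding scheme assumed)
data_copy = data.copy()
for col in ['sex', 'smoker', 'region']:
    data_copy[col] = LabelEncoder().fit_transform(data_copy[col])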
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
data_copy.describe()
            age       sex        bmi  children    smoker    region       charges
mean  39.207025  0.494768  30.663397  1.094918  0.204783  1.514948  13270.422265
std   14.049960  0.500160   6.098187  1.205493  0.403694  1.105572  12110.011237
min   18.000000  0.000000  15.960000  0.000000  0.000000  0.000000   1121.873900
25%   27.000000  0.000000  26.296250  0.000000  0.000000  1.000000   4740.287150
50%   39.000000  0.000000  30.400000  1.000000  0.000000  2.000000   9382.033000
75%   51.000000  1.000000  34.693750  2.000000  0.000000  2.000000  16639.912515
max   64.000000  1.000000  53.130000  5.000000  1.000000  3.000000  63770.428010
corr = data_copy.corr()
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(corr, cmap='BuPu', annot=True, fmt=".2f", ax=ax)
plt.title("Dependencies of Medical Charges")
plt.savefig('./sampleImages/Cor')
plt.show()
Smoker, BMI and Age are the most important factors that determine Charges.
We also see that Sex, Children and Region do not affect the Charges much. We might drop these 3 columns
as they have low correlation.
print(data['sex'].value_counts().sort_values())
print(data['smoker'].value_counts().sort_values())
print(data['region'].value_counts().sort_values())
female 662
male 676
Name: sex, dtype: int64
yes 274
no 1064
Name: smoker, dtype: int64
northeast 324
northwest 325
southwest 325
southeast 364
Name: region, dtype: int64
Now that we have confirmed there are no unexpected values in the pre-processed columns,
we can proceed with EDA.
plt.figure(figsize=(12,9))
plt.title('Age vs Charge')
sns.barplot(x='age', y='charges', data=data_copy, palette='husl')
plt.savefig('./sampleImages/AgevsCharges')
plt.figure(figsize=(10,7))
plt.title('Region vs Charge')
sns.barplot(x='region', y='charges', data=data_copy, palette='Set3')
<matplotlib.axes._subplots.AxesSubplot at 0x2e69a038b38>
plt.figure(figsize=(7,5))
sns.scatterplot(x='bmi', y='charges', hue='sex', data=data_copy, palette='Reds')
plt.title('BMI VS Charge')
Text(0.5, 1.0, 'BMI VS Charge')
plt.figure(figsize=(10,7))
plt.title('Smoker vs Charge')
sns.barplot(x='smoker', y='charges', data=data_copy, palette='Blues', hue='sex')
<matplotlib.axes._subplots.AxesSubplot at 0x2e69a19b2e8>
plt.figure(figsize=(10,7))
plt.title('Sex vs Charges')
sns.barplot(x='sex', y='charges', data=data_copy, palette='Set1')
<matplotlib.axes._subplots.AxesSubplot at 0x2e69a291a20>
Plotting Skew and Kurtosis
print('Printing Skewness and Kurtosis for all columns')
print()
for col in list(data_copy.columns):
    print('{0} : Skewness {1:.3f} and Kurtosis {2:.3f}'.format(col, data_copy[col].skew(), data_copy[col].kurt()))
Printing Skewness and Kurtosis for all columns
plt.figure(figsize=(10,7))
sns.distplot(data_copy['charges'])
plt.title('Plot for charges')
plt.xlabel('charges')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
There might be a few outliers in Charges, but we cannot be sure that a value is an outlier,
as there might be cases in which the medical charge was genuinely very low.
Preparing data - We can scale the BMI and Charges columns before proceeding with
prediction.
from sklearn.preprocessing import StandardScaler
data_pre = data_copy.copy()

tempBmi = data_pre.bmi.values.reshape(-1,1)
data_pre['bmi'] = StandardScaler().fit_transform(tempBmi)

tempAge = data_pre.age.values.reshape(-1,1)
data_pre['age'] = StandardScaler().fit_transform(tempAge)

tempCharges = data_pre.charges.values.reshape(-1,1)
data_pre['charges'] = StandardScaler().fit_transform(tempCharges)

data_pre.head()
(data_pre.head() shows the first five rows with the standardized age, bmi and charges columns alongside the encoded sex, children, smoker and region columns.)
X = data_pre.drop('charges', axis=1).values
y = data_pre['charges'].values.reshape(-1,1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Size of X_train : ', X_train.shape)
print('Size of y_train : ', y_train.shape)
print('Size of X_test : ', X_test.shape)
print('Size of Y_test : ', y_test.shape)
Size of X_train : (1070, 6)
Size of y_train : (1070, 1)
Size of X_test : (268, 6)
Size of Y_test : (268, 1)
Importing Libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
import xgboost as xgb

from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
Linear Regression
%%time
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
Wall time: 32 ms
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
cv_linear_reg = cross_val_score(estimator=linear_reg, X=X, y=y, cv=10)

y_pred_linear_reg_train = linear_reg.predict(X_train)
r2_score_linear_reg_train = r2_score(y_train, y_pred_linear_reg_train)

y_pred_linear_reg_test = linear_reg.predict(X_test)
r2_score_linear_reg_test = r2_score(y_test, y_pred_linear_reg_test)

rmse_linear = (np.sqrt(mean_squared_error(y_test, y_pred_linear_reg_test)))

print('CV Linear Regression : {0:.3f}'.format(cv_linear_reg.mean()))
print('R2_score (train) : {0:.3f}'.format(r2_score_linear_reg_train))
print('R2_score (test) : {0:.3f}'.format(r2_score_linear_reg_test))
print('RMSE : {0:.3f}'.format(rmse_linear))
CV Linear Regression : 0.745
R2_score (train) : 0.741
R2_score (test) : 0.783
RMSE : 0.480
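The scaled arrays and the SVR grid search whose result is printed below are not shown; a sketch of how they might be prepared, with the parameter grid taken from the printed GridSearchCV repr (the scaling scheme is an assumption):
from sklearn.preprocessing import StandardScaler

s_X = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = s_X.transform(X_train), s_X.transform(X_test)
s_y = StandardScaler().fit(y_train)
y_train_scaled, y_test_scaled = s_y.transform(y_train), s_y.transform(y_test)

svr_params = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 'scale'],
              'kernel': ['rbf', 'sigmoid'],
              'tol': [0.0001]}
svr_grid = GridSearchCV(SVR(), svr_params, cv=10, n_jobs=-1, verbose=4)
svr_grid.fit(X_train_scaled, y_train_scaled.ravel())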
GridSearchCV(cv=10, error_score='raise-deprecating',
estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
epsilon=0.1, gamma='auto_deprecated', kernel='rbf',
max_iter=-1, shrinking=True, tol=0.001,
verbose=False),
iid='warn', n_jobs=-1,
param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1, 'scale'],
'kernel': ['rbf', 'sigmoid'], 'tol': [0.0001]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=4)
svr = SVR(C=10, gamma=0.1, tol=0.0001)
svr.fit(X_train_scaled, y_train_scaled.ravel())
print(svr_grid.best_estimator_)
print(svr_grid.best_score_)
SVR(C=10, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.0001, verbose=False)
0.8311303137187737
cv_svr = svr_grid.best_score_
y_pred_svr_train = svr.predict(X_train_scaled)
r2_score_svr_train = r2_score(y_train_scaled, y_pred_svr_train)

y_pred_svr_test = svr.predict(X_test_scaled)
r2_score_svr_test = r2_score(y_test_scaled, y_pred_svr_test)

rmse_svr = (np.sqrt(mean_squared_error(y_test_scaled, y_pred_svr_test)))

print('CV : {0:.3f}'.format(cv_svr.mean()))
print('R2_score (train) : {0:.3f}'.format(r2_score_svr_train))
print('R2 score (test) : {0:.3f}'.format(r2_score_svr_test))
print('RMSE : {0:.3f}'.format(rmse_svr))
CV : 0.831
R2_score (train) : 0.857
R2 score (test) : 0.871
RMSE : 0.359
Ridge Regressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

steps = [('scalar', StandardScaler()),
         ('poly', PolynomialFeatures(degree=2)),
         ('model', Ridge())]
ridge_pipe = Pipeline(steps)
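The grid search producing reg_ridge below is not shown; a sketch, where the alpha grid is an assumption chosen to include the reported best value of 20:
ridge_params = {'model__alpha': [1, 5, 10, 20, 30, 40]}
reg_ridge = GridSearchCV(ridge_pipe, ridge_params, cv=10)
reg_ridge = reg_ridge.fit(X_train, y_train.ravel())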
reg_ridge.best_estimator_, reg_ridge.best_score_
(Pipeline(memory=None,
steps=[('scalar',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('poly',
PolynomialFeatures(degree=2, include_bias=True,
interaction_only=False, order='C')),
('model',
Ridge(alpha=20, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=42, solver='auto',
tol=0.001))],
verbose=False),
0.8259990140429396)
RandomForest Regressor
%%time
reg_rf = RandomForestRegressor()
parameters = {'n_estimators': [600, 1000, 1200],
              'max_features': ["auto"],
              'max_depth': [40, 50, 60],
              'min_samples_split': [5, 7, 9],
              'min_samples_leaf': [7, 10, 12],
              'criterion': ['mse']}

reg_rf_gscv = GridSearchCV(estimator=reg_rf, param_grid=parameters, cv=10, n_jobs=-1)
reg_rf_gscv = reg_rf_gscv.fit(X_train_scaled, y_train_scaled.ravel())
Wall time: 9min 47s
reg_rf_gscv.best_score_, reg_rf_gscv.best_estimator_
(0.8483687880955955,
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_split=7,
min_weight_fraction_leaf=0.0, n_estimators=1200,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False))
rf_reg = RandomForestRegressor(max_depth=50, min_samples_leaf=12, min_samples_split=7, n_estimators=1200)
rf_reg.fit(X_train_scaled, y_train_scaled.ravel())
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_split=7,
min_weight_fraction_leaf=0.0, n_estimators=1200,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
cv_rf = reg_rf_gscv.best_score_
y_pred_rf_train = rf_reg.predict(X_train_scaled)
r2_score_rf_train = r2_score(y_train, y_pred_rf_train)

y_pred_rf_test = rf_reg.predict(X_test_scaled)
r2_score_rf_test = r2_score(y_test_scaled, y_pred_rf_test)

rmse_rf = np.sqrt(mean_squared_error(y_test_scaled, y_pred_rf_test))

print('CV : {0:.3f}'.format(cv_rf.mean()))
print('R2 score (train) : {0:.3f}'.format(r2_score_rf_train))
print('R2 score (test) : {0:.3f}'.format(r2_score_rf_test))
print('RMSE : {0:.3f}'.format(rmse_rf))
CV : 0.848
R2 score (train) : 0.884
R2 score (test) : 0.879
RMSE : 0.348
models = [('Linear Regression', rmse_linear, r2_score_linear_reg_train, r2_score_linear_reg_test, cv_linear_reg.mean()),
          ('Ridge Regression', rmse_ridge, r2_score_ridge_train, r2_score_ridge_test, cv_ridge.mean()),
          ('Support Vector Regression', rmse_svr, r2_score_svr_train, r2_score_svr_test, cv_svr.mean()),
          ('Random Forest Regression', rmse_rf, r2_score_rf_train, r2_score_rf_test, cv_rf.mean())
         ]
predict = pd.DataFrame(data=models, columns=['Model', 'RMSE', 'R2_Score(training)', 'R2_Score(test)', 'Cross-Validation'])
predict
   Model              RMSE      R2_Score(training)  R2_Score(test)  Cross-Validation
0  Linear Regression  0.479808  0.741410            0.782694        0.744528
1  Ridge Regression   0.465206  0.741150            0.783800        0.825999
plt.figure(figsize=(12,7))
predict.sort_values(by=['Cross-Validation'], ascending=False, inplace=True)

sns.barplot(x='Cross-Validation', y='Model', data=predict, palette='Reds')
plt.xlabel('Cross Validation Score')
plt.ylabel('Model')
plt.show()
Experiment No 15
Name of the Experiment : Flight Price Prediction Using Machine Learning Algorithms
Instructions:-
1. You will have the dataset.
2. Find the cheapest and most expensive flight at a specific time.
3. You will have to go through EDA.
4. Train ML Model.
5. Find a sweet spot for a cheap ticket.
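The imports and the loading of the training and test data are not shown; a sketch of the assumed setup (the file names are assumptions):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('flight_train.csv')    # training data, includes the target price column
test = pd.read_csv('flight_test.csv')   # unlabeled test data
df.head()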
(df.head(): each row contains an id, timestamps f1, f2 and f3, x/y flags f4 and f5, a category f6 such as alpha, beta or gamma, a boolean f7, numeric columns f8 and f9, a code f10 such as a-9, c-4, a-23 or b-1, and the target price, e.g. 15377.0, 6900.0, 9707.0 and 6500.0.)
df_test = test
df_test.head()
(df_test.head(): the same columns as the training data but without the target; f6 values include omega, alpha and beta, and f10 codes include d-1, d-5, a-9, b-1 and a-1.)
# Structure
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0.2 5000 non-null int64
1 Unnamed: 0 5000 non-null int64
2 f1 5000 non-null object
3 f2 5000 non-null object
4 f3 5000 non-null object
5 f4 5000 non-null object
6 f5 5000 non-null object
7 f6 5000 non-null object
8 f7 5000 non-null bool
9 f8 5000 non-null float64
10 f9 5000 non-null int64
11 f10 5000 non-null object
12 Unnamed: 0.1 5000 non-null int64
13 target 5000 non-null float64
dtypes: bool(1), float64(2), int64(4), object(7)
memory usage: 512.8+ KB
# Check null values
df.isnull().sum()
Unnamed: 0.2 0
Unnamed: 0 0
f1 0
f2 0
f3 0
f4 0
f5 0
f6 0
f7 0
f8 0
f9 0
f10 0
Unnamed: 0.1 0
target 0
dtype: int64
# Get summary statistics of the data
df.describe()
(df.describe(): summary statistics of the numeric columns Unnamed: 0.2, Unnamed: 0, f8, f9, Unnamed: 0.1 and target.)
df_test.head()
(preview of columns f1, f4, f5, f6, f7, f8, f9)
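The engineered columns time_to_depart(s) and time_travel(s) appear in the next preview but the feature-engineering step is not shown; a sketch, assuming f1 is the booking timestamp and f2/f3 the departure and arrival timestamps:
for frame in (df, df_test):
    for col in ['f1', 'f2', 'f3']:
        frame[col] = pd.to_datetime(frame[col])
    frame['time_to_depart(s)'] = (frame['f2'] - frame['f1']).dt.total_seconds()   # seconds until departure
    frame['time_travel(s)'] = (frame['f3'] - frame['f2']).dt.total_seconds()      # flight duration in seconds
    # drop the raw timestamps, the code column and the leftover index columns (assumed from the later isnull() output)
    frame.drop(columns=[c for c in frame.columns if c.startswith('Unnamed')] + ['f2', 'f3', 'f10'], inplace=True)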
df_test.head()
(preview of columns time_to_depart(s), time_travel(s), f1, f4, f5, f6, f7, f8, f9)
df.isnull().sum()
time_to_depart(s) 0
time_travel(s) 0
f1 0
f4 0
f5 0
f6 0
f7 0
f8 0
f9 0
target 0
dtype: int64
# Separating categorical columns
cats_cols = ["f6", "f7", "f8", "f9"]
num_cols = ["time_to_depart(s)", "time_travel(s)"]
# Plotting categorical columns
c = 1
plt.figure(figsize=(15,20))

for i in cats_cols:
    plt.subplot(4, 2, c)
    sns.countplot(x=df[i])
    c = c + 1
plt.show()
C:\Users\muham\AppData\Local\Temp\ipykernel_5280\2961181608.py:8: UserWarning:
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
sns.distplot(df[i])
# Check Skew
df.skew()
df.skew()
time_to_depart(s) 3.703644
time_travel(s) 1.319374
f7 -11.660949
f8 0.418300
f9 0.027547
target 2.056345
dtype: float64
# Check Kurtosis
df.kurtosis()
df.kurtosis()
time_to_depart(s) 20.796049
time_travel(s) 5.504109
f7 134.031335
f8 0.046374
f9 -0.310364
target 6.344666
dtype: float64
2) Encoding of variables
df.drop(["f1", "f4", "f5"], axis=1, inplace=True)
df_test.drop(["f1", "f4", "f5"], axis=1, inplace=True)
df.head()
(df.head(): columns time_to_depart(s), time_travel(s), f6, f7, f8, f9 and target; for example, row 0 has time_to_depart 1268192.0, time_travel 7200.0, f6 gamma, f7 True and target 7400.0, and row 4 has 2142368.0, 6900.0, beta, True, 20.0, 0 and 6500.0.)
# Encoding
le = LabelEncoder()

df["f6"] = le.fit_transform(df["f6"])
df["f7"] = le.fit_transform(df["f7"])
df["f8"] = le.fit_transform(df["f8"])

df_test["f6"] = le.fit_transform(df_test["f6"])
df_test["f7"] = le.fit_transform(df_test["f7"])
df_test["f8"] = le.fit_transform(df_test["f8"])
df.sample(10)
      time_to_depart(s)  time_travel(s)  f6  f7  f8  f9   target
3163          1317988.0          7140.0   2   1   2   2  11900.0
668           1705075.0          7200.0   2   1   2   2  14000.0
1976           408962.0          7200.0   0   1   2   1  10545.0
4597           402468.0          7200.0   0   1   2   1  10545.0
3342           101098.0          7200.0   0   1   4   1  13277.0
2549           332681.0          6900.0   1   1   2   0   7458.0
377            320911.0          7200.0   2   1   2   1   8750.0
3643          2181824.0          7200.0   2   1   3   2  10655.0
1370          7609835.0          7200.0   2   1   2   2  13560.0
4832           676423.0          7200.0   0   1   1   1   6810.0
df_test.sample(10)
     time_to_depart(s)  time_travel(s)  f6  f7  f8  f9
198          1160731.0          7200.0   3   0   1   1
215          4916436.0          6900.0   1   0   2   0
79             80485.0          7200.0   2   0   1   2
109          1327903.0          7200.0   0   0   0   1
177          1092853.0          7200.0   3   0   1   1
204          1243921.0          7140.0   2   0   1   2
24             281144.0         7200.0   2   0   1   1
49            1536631.0         7200.0   0   0   1   1
245            346286.0         7200.0   0   0   0   1
169            589496.0         6900.0   1   0   1   0
df.describe()
      time_to_depart(s)  time_travel(s)        f6        f7        f8        f9        target
mean       1.349212e+06     7159.836000  0.953400  0.992800  2.292000  0.944600  10104.351800
std        1.679384e+06      169.613345  0.948371  0.084555  1.247817  0.607951   3359.936118
min        2.003000e+03     6900.000000  0.000000  0.000000  0.000000  0.000000   4990.000000
25%        3.606870e+05     7140.000000  0.000000  1.000000  2.000000  1.000000   7796.000000
50%        8.634945e+05     7200.000000  1.000000  1.000000  2.000000  1.000000   9403.000000
75%        1.698816e+06     7200.000000  2.000000  1.000000  3.000000  1.000000  11245.000000
max        1.916464e+07     7800.000000  3.000000  1.000000  6.000000  2.000000  33720.000000
X.head()
   time_to_depart(s)  time_travel(s)  f6  f7  f8  f9
0          1268192.0          7200.0   2   1   0   0
1            29688.0          7200.0   0   1   4   1
2          3033072.0          7200.0   2   1   2   1
3           363011.0          7200.0   0   1   1   1
4          2142368.0          6900.0   1   1   2   0
y.head()
0 7400.0
1 15377.0
2 6900.0
3 9707.0
4 6500.0
Name: target, dtype: float64
3) ML Modelling
# Regression Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Shorten the names
lr = LinearRegression()
dt = DecisionTreeRegressor()
svr = SVR()
knn = KNeighborsRegressor()
# Model Loop
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for i in [lr, dt, svr, knn]:
    i.fit(X_train, y_train)
    predict = i.predict(X_test)
    test_score = r2_score(y_test, predict)
    train_score = r2_score(y_train, i.predict(X_train))
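    # Sketch (assumption): report each model's fit, since the original loop stops without printing
    print(type(i).__name__, '- train R2: {:.3f}, test R2: {:.3f}'.format(train_score, test_score))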