Flight Fare Prediction: Project Report
PROJECT REPORT
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
submitted by
Mr. PRATYUSH KUMAR
[Registration No. RA1811003030370]
April 2022
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Assistant Professor
Dept. of Computer Science & Engineering
ABSTRACT
Flight fares in India were hiked in 2019, and comprehensive statistics are now compiled annually by
the Ministry of Civil Aviation.
By examining the peaks, troughs, and turning points, this study examines the similarities and
differences in long-term trends between different airlines. The data for our study are
drawn from Kaggle. The results suggest that both international and domestic flight fares
have been rising since 2019. The airline industry suffered a great loss due to the pandemic (around
$314 billion).
In this project we analyze the dataset across categories such as Duration, Total Stops, Airline,
Source, and various other factors. Finally, we predict the flight fares for different destinations
and different airlines.
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my guide, Mr. Rahul Pandey, for his valuable
guidance, consistent encouragement, personal care, and timely help, and for providing me with
an excellent atmosphere for doing research. Throughout the work, in spite of his busy
schedule, he extended cheerful and cordial support to me in completing this research
work.
Pratyush Kumar
ABBREVIATIONS
ML - Machine Learning
CHAPTER 1
INTRODUCTION
This project investigates why flight prices vary with factors such as the destination, date,
journey timing, length of the flight, and various others. As we go deeper into the project, we
will see how it evaluates the accuracy of our results on the test data set and predicts flight
fares from the given data. Flight ticket prices can be hard to guess: today we might see one
price, and when we check the same flight tomorrow it will be a different story. Travelers often
say that flight ticket prices are unpredictable. As data scientists, we set out to show that,
given the right data, fares can be predicted. The data set provides
prices of flight tickets for various airlines between the months of March and
1. To construct and implement this aggregator function, we used flight fare data of
leading air carriers from the year 2019.
2. To measure the accuracy of the results against the training data and to compute
error metrics such as the MSE and RMSE scores.
3. To collect and process flight fares in order to predict future charges.
CHAPTER 2
LITERATURE SURVEY
Authors | Title | Journal
Dadoun, A., Defoin-Platel, M., Fiig, T., Landra, C., & Troncy, R. | How recommender systems can transform airline offer construction | Journal of Revenue and Pricing Management
Dadoun, A., Defoin-Platel, M., Fiig, T., Landra, C., & Troncy, R. | Flight Fare Recommendation System | Journal of Revenue and Pricing Management
Fig 1
CHAPTER 3
OVERALL DESIGN FOR PROPOSED SYSTEM
Hardware Interfaces:
1. Processor: 5th-generation Intel Core i5 with a minimum speed of 2.9 GHz.
2. RAM: The minimum requirement is 4 GB.
3. Hard Disk: The minimum requirement is 250 GB.
Software Interfaces:
i. MS Word (2016)
ii. Database storage: MS Excel
iii. Operating System: Windows 10
3.2 PLANNING
The various steps required when making this project are below:
12. The code should be written in advanced Python to enhance accuracy.
CHAPTER 4
DATASET AND DATABASE SPECIFICATION
We used a flight fare dataset from Kaggle covering the past two years for fare
prediction, with attributes such as Date, Route, Day, and Airline as predictors, along with
Fare and others. The dataset's open and closing values are selected from the year 2019. The
significant attributes in the dataset include Close; for the current model, the Open
attribute is treated as the predictand while the other attributes are taken as predictors for
the model through the price fare dataset. The flight_fare.csv dataset had null or
missing values, which were replaced by the mean values of their respective columns.
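The mean replacement described above can be sketched with pandas as follows. This is a minimal illustration, not the report's actual code: the small frame stands in for flight_fare.csv, and its column names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for flight_fare.csv; the values are invented.
fares = pd.DataFrame({
    "Duration_mins": [170.0, np.nan, 445.0, 325.0],
    "Price": [3897.0, 7662.0, np.nan, 6218.0],
})

# Replace each missing numeric value with the mean of its own column.
numeric_cols = fares.select_dtypes(include="number").columns
filled = fares.copy()
filled[numeric_cols] = fares[numeric_cols].fillna(fares[numeric_cols].mean())
```

Only numeric columns are touched here; text columns such as an airline name would need a different strategy, for example the most frequent value.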
CHAPTER 5
METHODOLOGY
DATA ANALYSIS
Data Analysis is a process of collecting, transforming, cleaning, and modelling data with
the goal of discovering the required information. The results so obtained are
communicated, suggesting conclusions, and supporting decision-making.
Data analytics is performed on the flight database so that the data can be cleaned and data
that is not required can be removed. It reduces the complex flight data to a simpler
form that can be used as input for the prediction process.
DATA CATEGORIZATION
Data is categorized and extracted to identify and analyze similar behavioral data and
patterns; the techniques vary according to organizational requirements.
Data analysis is linked to data visualization. It is used to establish relationships between
different columns of the database so that they can be used for prediction and visualization.
DATA VISUALIZATION
Linear Regression?
Linear regression is a supervised machine learning method that fits the best linear model to the
relationship between variables. In its simplest form it relates a dependent variable (vertical
axis) to an independent variable (horizontal axis).
Figure 1.1: Simple Linear Regression
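As a minimal sketch of simple linear regression, the slope and intercept of the best-fit line can be computed directly with the closed-form least-squares formulas. The sample points below are invented and chosen to lie exactly on a line so the result is easy to check; `predict` is a hypothetical helper name.

```python
import numpy as np

# Invented sample: x is the independent variable, y the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # lies exactly on y = 2x + 1

# Closed-form least-squares estimates for slope and intercept.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()


def predict(x_new):
    """Return the fitted line's value at x_new."""
    return intercept + slope * x_new
```

With real fare data the points would scatter around the line, and the same formulas would give the line that minimizes the sum of squared vertical distances.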
Null Values and Outliers in ML?
● NULL VALUES: Datasets used in machine learning models often contain values that are
missing or left as blank space in particular rows of a particular attribute; these
are represented as null values. They can be handled either by removing the entire
row that contains the missing value or by filling the value with the respective
mean, mode, or median.
● OUTLIERS: Some values in a dataset have the same data type as the rest of the
data but are either much larger or much smaller, so they stand apart from the
other values when the overall dataset is plotted. Outliers are usually discarded
because they make the dataset linearly inseparable. To handle them, we usually
drop them or replace them with the overall average of that attribute's values
in the dataset.
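Both ways of handling outliers mentioned above can be sketched as follows. The report does not say how outliers were detected, so the 1.5 x IQR rule used here is an assumption, and the fares are invented. The replacement value is the mean of the non-outlying values rather than the overall mean, so that the extreme value does not distort it.

```python
import pandas as pd

prices = pd.Series([3900, 4100, 4300, 4500, 4700, 52000], name="Price")  # invented fares

# Flag values outside 1.5 * IQR of the quartiles (a common rule of thumb, assumed here).
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
is_outlier = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)

# Option 1: drop the outlying rows.
dropped = prices[~is_outlier]

# Option 2: replace each outlier with the mean of the remaining values.
replaced = prices.astype(float).mask(is_outlier, dropped.mean())
```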
An integrated development environment (IDE) is a platform for building applications that
combines the tools developers use into a single graphical user
interface (GUI).
Types of IDEs:
● GOOGLE COLAB
● PYCHARM
Machine Learning Libraries and their types?
● Scikit-learn[10]
● TensorFlow
● Pandas
● NumPy[11]
● Matplotlib[12]
CHAPTER 6
Fig 2
Importing Libraries
Fig 3
Columns in Data Train
Fig 4
Figure 5
Data Pre-Processing
Showing relation between Dependent and Independent Variables
Fig 8
Result with RMSE ERROR and Accuracy
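For reference, the error and accuracy figures named in this caption can be computed as sketched below. The fares are invented, and the sketch assumes that "accuracy" refers to the R² score, which the report does not state explicitly.

```python
import numpy as np

# Invented actual and predicted fares, for illustration only.
y_true = np.array([4000.0, 6000.0, 8000.0, 10000.0])
y_pred = np.array([4500.0, 5500.0, 8500.0, 9500.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error, in fare units

# R^2: share of the variance in y_true explained by the predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

RMSE is expressed in the same unit as the fare, which makes it easier to interpret than MSE.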
CHAPTER 7
COMPARISON WITH THE EXISTING SYSTEM
Existing System | Current System
The existing system, which uses the SVM (Support Vector Machine)[10] algorithm, turns out to be less efficient on sequence prediction problems. | The current system, using a regression model, is far more efficient on sequence prediction problems because it is able to store past information.
The existing system using SVM fails to give better accuracy when the dataset contains a large number of outliers. | The current system performs well on a large dataset irrespective of the number of outliers in it.
Models using SVM do not work well with moving averages: because SVM does not remember past data, any change in the averages lowers the accuracy of the model. | The LSTM model with moving averages was applied, through which past data were stored over the overall dataset, and it was eventually evaluated to be the most effective model in estimating the stock prices for the future.
Algorithm for the existing system: removal of null values and outliers; splitting the dataset into 70% training data. | Algorithm for the current system: removal of null values and outliers; splitting the dataset into 70% training data.
Algorithm
Datasets
The test data set is similar to the training data set, minus the ‘Price’ column (to be predicted
using the model).
Python Coding
Step 2: Import Train and Test data sets and append them
The train and test data sets are appended so that both can be transformed together and changes
do not have to be made separately. After the transformations are applied, the two sets can be
separated again.
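A minimal sketch of this append-then-separate pattern with pandas is shown below. The two small frames are invented stand-ins for the real train and test files, and the variable names are hypothetical.

```python
import pandas as pd

# Invented rows standing in for the real train and test files.
train = pd.DataFrame({"Airline": ["IndiGo", "Air India"], "Price": [3897, 7662]})
test = pd.DataFrame({"Airline": ["SpiceJet"]})  # the test set has no Price column

# Stack the two sets; the test rows get NaN in the Price column.
combined = pd.concat([train, test], ignore_index=True)

# ... shared transformations on `combined` would go here ...

# Separate the sets again by position once the transformations are done.
train_part = combined.iloc[: len(train)].copy()
test_part = combined.iloc[len(train):].drop(columns="Price")
```

Separating by position works because `pd.concat` preserves row order; an explicit marker column would be an alternative if rows were going to be shuffled or dropped in between.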
In this step we mainly work on the data set and apply transformations such as creating bins for
particular columns and cleaning the messy data so that it can be used in our ML model. This step
is very important because a high prediction score requires continuous refinement of these changes.
Feature Selection
Feature selection means finding the features that contribute most and have a good relation with
the target variable. Following are some of the feature selection methods:
1. heatmap
2. feature_importances_
3. SelectKBest
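The three methods listed above can be sketched on synthetic data as follows. The feature names and the way the target is generated are assumptions made purely for illustration. The correlation values are the numbers a heatmap would display, and `ExtraTreesRegressor` is one possible model exposing `feature_importances_`; the report does not say which estimator was used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
# Synthetic data: Price depends on Total_Stops and Duration but not on Noise.
X = pd.DataFrame({
    "Total_Stops": rng.integers(0, 4, n),
    "Duration": rng.uniform(60, 600, n),
    "Noise": rng.normal(size=n),
})
y = 3000 + 1500 * X["Total_Stops"] + 5 * X["Duration"] + rng.normal(0, 100, n)

# 1. Correlation of each feature with the target (the numbers behind a heatmap).
correlations = X.corrwith(y)

# 2. Importances from a tree ensemble; they sum to 1 across features.
forest = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)

# 3. Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])
```

On this synthetic data all three views agree that the noise column carries little information about the target.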
Date_of_Journey:
In the column ‘Date_of_Journey’, the date format is dd/mm/yyyy and the datatype is object.
There are two ways to tackle this column: either convert the column into a Timestamp or
divide it into Date, Month, and Year. Here, I am splitting the
column.
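A minimal sketch of this split is given below, using invented dates. The new column names (`Journey_Day`, `Journey_Month`, `Journey_Year`) are hypothetical and not taken from the report.

```python
import pandas as pd

df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"]})  # invented rows

# Parse the dd/mm/yyyy strings, then split them into separate integer columns.
journey = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["Journey_Day"] = journey.dt.day
df["Journey_Month"] = journey.dt.month
df["Journey_Year"] = journey.dt.year
df = df.drop(columns="Date_of_Journey")
```

Passing an explicit `format` avoids the day and month being swapped for ambiguous dates such as 01/05/2019.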
Arrival_Time:
In the column ‘Arrival_Time’, the values combine both time and month, but we need only
the time details, so we split the time into ‘Hours’ and ‘Minute’.
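The time extraction can be sketched as follows. The sample strings assume that the clock time comes first and any date text follows after a space; both the sample values and the new column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})  # invented rows

# Keep only the leading HH:MM token, discarding any trailing date text.
clock = df["Arrival_Time"].str.split(" ").str[0]

# Split the clock time into integer hour and minute columns.
df["Arrival_Hour"] = clock.str.split(":").str[0].astype(int)
df["Arrival_Minute"] = clock.str.split(":").str[1].astype(int)
df = df.drop(columns="Arrival_Time")
```

The same steps apply unchanged to any other HH:MM column.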
Total_Stops:
This column combines a number with a categorical label, such as ‘1 stop’. We need only
the number, so we split the value and keep the numeric part; we also
change ‘non stop’ into ‘0 stop’ and convert the column to integer type.
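A minimal sketch of this conversion is shown below. The sample values are invented, and handling both the "non-stop" and "non stop" spellings is an assumption about how the label may appear in the data.

```python
import pandas as pd

df = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops", "non stop"]})  # invented rows

# Normalise both spellings of "non stop" to "0 stop", then keep the leading number.
stops = df["Total_Stops"].str.replace(r"non[- ]stop", "0 stop", regex=True)
df["Total_Stops"] = stops.str.split(" ").str[0].astype(int)
```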
Dep_Time:
As with ‘Arrival_Time’, we split this column into hour and minute and convert both to
integer.
Dep_Time split into 2 variables (Hour, Minute)
Route:
The ‘Route’ column tells us how many cities the flight passes through to get from source to
destination. This column is very important because the route taken directly
affects the price of the flight, so we split the Route column to extract this information.
We replace the ‘NaN’ values with ‘None’.
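The route split can be sketched as follows. The arrow separator, the sample routes, and the `Route_1` ... `Route_n` column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Route": ["BLR → DEL", "CCU → IXR → BBI → BLR", np.nan]})  # invented rows

# One column per city on the route; the arrow separator is an assumption.
parts = df["Route"].str.split("→", expand=True)
parts = parts.apply(lambda col: col.str.strip())
parts.columns = [f"Route_{i + 1}" for i in range(parts.shape[1])]

# Shorter routes and missing routes leave gaps, which are filled with "None".
parts = parts.fillna("None")
df = pd.concat([df.drop(columns="Route"), parts], axis=1)
```

The number of new columns equals the number of cities on the longest route in the data.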
Before splitting
After splitting
To convert categorical text data into model-understandable numerical data, we use the
LabelEncoder class. To label encode a column, we import the LabelEncoder class
from the sklearn library, fit and transform the column of the data, and then replace the existing
text data with the encoded values.
1. Nominal data --> data are not in any order --> OneHotEncoder is used in this case
2. Ordinal data --> data are in order --> LabelEncoder is used in this case
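The two encodings can be compared on a small invented frame, as sketched below. `pd.get_dummies` is used here as a simple stand-in for scikit-learn's OneHotEncoder; it produces the same kind of indicator columns. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"],
    "Source": ["Delhi", "Kolkata", "Delhi", "Mumbai"],
})  # invented rows

# Label encoding: each distinct string becomes one integer code (sorted order).
encoder = LabelEncoder()
df["Airline_Code"] = encoder.fit_transform(df["Airline"])

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(df["Source"], prefix="Source")
```

Note that LabelEncoder assigns codes in sorted order of the labels, which may not match a meaningful ordinal ranking, and that the scikit-learn documentation describes it as intended for target values; OrdinalEncoder is its counterpart for input features.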
Now that all our data is numerical after label encoding, we split the data into test and train
sets and drop the Price column from the test set, because the price is what we have to predict
for the test data.
X — independent variables; y — dependent variable
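A minimal sketch of this separation is given below, assuming the combined frame marks test rows by a missing Price. The frame and its values are invented.

```python
import pandas as pd

# Invented, already-numeric rows; Price is missing for the test rows.
data = pd.DataFrame({
    "Total_Stops": [0, 1, 2, 1, 0],
    "Duration": [170, 445, 1140, 325, 150],
    "Price": [3897.0, 7662.0, 13882.0, None, None],
})

# Rows with a known Price form the training set; the rest form the test set.
train = data[data["Price"].notna()]
test = data[data["Price"].isna()].drop(columns="Price")

X = train.drop(columns="Price")  # independent variables
y = train["Price"]               # dependent variable
```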
The goal in this step is to develop a benchmark model that serves as a baseline against which we
measure the performance of better-tuned algorithms. We use different regression techniques and
compare them to see which algorithm performs better than the others. At the end we combine all
of them using stacking and see how the combined model
predicts.
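The baseline-then-stacking workflow can be sketched with scikit-learn as follows. The report does not list which regressors were compared, so the models chosen here (linear regression, random forest, and a Ridge final estimator) are assumptions, as are the synthetic data and the 70/30 split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 3))  # synthetic features
y = 2000 + 500 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 50, 300)

# Hold out 30% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: plain linear regression.
baseline = LinearRegression().fit(X_train, y_train)

# Stacked model: base regressors whose predictions feed a final Ridge estimator.
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),
).fit(X_train, y_train)

# R^2 on the held-out rows for each model.
scores = {
    "baseline": baseline.score(X_test, y_test),
    "stack": stack.score(X_test, y_test),
}
```

Comparing the entries of `scores` shows whether the stacked model improves on the baseline for a given data set; that is not guaranteed in general.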
CHAPTER 8
CONCLUSION
Key Findings
The objective of the analysis was to provide an overview of the patterns and conditions on
which flight fares are based. Efficient and clear knowledge was achieved, indicating that the
data is reliable enough to be used for predicting fares. Using machine learning, the accuracy of
the proposed model was measured and verified. The accuracy achieved on the training data set was
95%, and on the test set it was 79.7%.
Significance
Flight fares in India are increasing day by day due to many factors such as the cost of aviation
fuel, frequent travel, etc. This analysis was done using datasets from 2019. Several modules
were kept in mind while performing the data analysis. The data received was not in a usable
form, and thus data cleaning and preprocessing were done extensively to make the data fit
for use.
Limitations
There are various other models with different algorithms that could support the current
system in order to provide various offers to the customer.
These include the relation between other dependent and independent variables, including
holiday dates and the festive season.
Future Scope
As future work, newer datasets could be analyzed, so that new fares can be framed. This will
lead to improved flight fares and will help the general public. Technology will help increase
the accuracy and efficiency of predicting fares. Analytics can also help discover and identify
trends to improve operational effectiveness.
CHAPTER 9
REFERENCES
● Kaggle Datasets :
● https://www.kaggle.com/nikhilmittal/flight-fare-prediction-mh
● https://github.com/Mandal-21/Flight-Price-Prediction/find/master