Flight Fare Prediction: Project Report

FLIGHT FARE PREDICTION
PROJECT REPORT
submitted in partial fulfillment of the requirements

for the Major Project and
for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
submitted by
Mr. PRATYUSH KUMAR
[Registration No. RA1811003030370]
Under the guidance of

Mr. Rahul Pandey
(Assistant Professor, Department of Computer Science and Engineering)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Deemed to be University under Section 3 of the UGC Act, 1956)
April 2022
SRM INSTITUTE OF SCIENCE & TECHNOLOGY

(Under Section 3 of UGC Act, 1956)
BONAFIDE CERTIFICATE
Certified that this project report titled “FLIGHT FARE PREDICTION SYSTEM” is the
bonafide work of PRATYUSH KUMAR [Reg No: RA1811003030370], who carried out the
project work under my supervision. Certified further, that to the best of my
knowledge the work reported herein does not form part of any other project report
or dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
SIGNATURE SIGNATURE
HEAD OF THE DEPARTMENT

Dept. of Computer Science & Engineering
Signature of the External Examiner
Assistant Professor
Dept. of Computer Science & Engineering
ABSTRACT
Flight fares in India rose in 2019, as reflected in the comprehensive statistics now compiled
annually by the Ministry of Civil Aviation.
By examining the peaks, troughs, and turning points, this study examines the similarities and
differences in long-term fare trends between different airlines. The data for our study are
drawn from Kaggle. The results suggest that both international and domestic flight fares have
been rising since 2019. The airline industry suffered a great loss due to the pandemic (around
$314 billion).
In this project we analyze the dataset across categories such as Duration, Total Stops, Airline,
Source, and various other factors. Finally, we predict the flight fares for different
destinations and different airlines.
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my guide, Mr. Rahul Pandey, for his valuable
guidance, consistent encouragement, personal care, and timely help, and for providing me with
an excellent atmosphere for doing research. Throughout the work, in spite of his busy
schedule, he extended cheerful and cordial support to me in completing this research
work.
Pratyush Kumar
ABBREVIATIONS
RMS - Root Mean Square
ML - Machine Learning
SVM - Support Vector Machine
CHAPTER 1
INTRODUCTION
This project examines why flight prices vary with factors such as the destination, date,
journey timing, and length of the flight. As we go deeper into the project, we will see how it
measures the accuracy of our results on the test data set and predicts flight fares from the
given data. Flight ticket prices can be hard to guess: we might see one price today and find a
different price for the same flight tomorrow. We have often heard travelers say that flight
ticket prices are unpredictable. As data scientists, we aim to show that, given the right data,
ticket prices can be predicted. The dataset provides prices of flight tickets for various
airlines between March and June of 2019 and between various cities.
1.1 SIGNIFICANT TARGETS INCLUDE:
1. To construct and implement this aggregator function using flight fare data of
leading air carriers from the year 2019.
2. To measure the accuracy of the predictions against the training data and to
compute evaluation scores such as MSE and RMSE.
3. To collect and process flight fares in order to predict future charges.
4. To analyze increases and decreases in flight fares.
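The MSE and RMSE scores named in the second target can be computed as in the sketch below. It is only an illustration: the function names and the three sample fares are hypothetical and are not taken from the project code.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE, in fare units."""
    return float(np.sqrt(mse(y_true, y_pred)))


# Hypothetical actual and predicted fares, for illustration only.
actual = [4000, 5500, 7200]
predicted = [4200, 5300, 7000]
print("MSE:", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because RMSE is expressed in the same unit as the fare, it is usually the easier of the two scores to interpret.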
CHAPTER 2
LITERATURE SURVEY
Author Name | Title | Source
Dadoun, Amine | "How recommender systems can transform airline offer construction and retailing." | Journal of Revenue and Pricing Management
Dadoun, A., Defoin-Platel, M., Fiig, T., Landra, C., & Troncy, R. | How recommender systems can transform airline offer construction | Journal of Revenue and Pricing Management
Dadoun, A., Defoin-Platel, M., Fiig, T., Landra, C. and Troncy | Flight Fare Recommendation System | Journal of Revenue and Pricing Management
Fig 1
CHAPTER 3
OVERALL DESIGN FOR PROPOSED SYSTEM
3.1 SOFTWARE REQUIREMENTS

For the development of this project, the version of Python used is 3.6.5.
Hardware Interfaces:
1. Processor: 5th-generation Intel Core i5 with a minimum speed of 2.9 GHz.
2. RAM: The minimum requirement is 4 GB.
3. Hard Disk: The minimum requirement is 250 GB.
4. Software Interfaces:
i. MS Word (2016)
ii. Database storage: MS Excel
iii. Operating System: Windows 10
3.2 PLANNING
The steps followed in making this project are listed below:
1. Research of the problem statement.
2. Collection of the set requirements.
3. Project feasibility analysis AMAR MONDAL (2012)
4. Development of a standard structure.
5. Review of journals on past work related to this area.
6. Choosing the best method for improving the algorithm.
7. Review of the various good and bad analyses.
8. Initiation of project development.
9. Installation of software and pip packages.
10. Improving the algorithm.
11. Manual algorithm analysis.
12. Writing the code in advanced Python to enhance accuracy.
CHAPTER 4
DATASET AND DATABASE SPECIFICATION
We used a flight fare dataset from Kaggle for the fare prediction, with attributes
such as Date, Route, Day, and Airline (predictors) and Fare. The records in the
dataset are from 2019. For the current model, the fare (the Price column) is the
predicted variable, while the other attributes are taken as predictors. The
flight_fare.csv dataset had null or missing values, which were replaced by the mean
values of their respective columns.
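The mean replacement described above can be sketched with pandas as follows. The small DataFrame is a made-up stand-in for flight_fare.csv, and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for flight_fare.csv; the column names are hypothetical.
df = pd.DataFrame({
    "Duration_mins": [170.0, np.nan, 445.0, 325.0],
    "Price": [3897.0, 7662.0, np.nan, 6218.0],
})

# Replace missing values in each numeric column with that column's mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(df.isnull().sum())  # every count should now be zero
```

Mean replacement only makes sense for numeric columns; a text column such as Route would need a different fill value.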
CHAPTER 5
METHODOLOGY
DATA ANALYSIS
● Data analysis is the process of collecting, transforming, cleaning, and modelling data with
the goal of discovering the required information. The results so obtained are
communicated, suggesting conclusions and supporting decision-making.
● Data analytics is performed on the flight database so that the data can be cleaned and data
that is not required can be removed. It is used to reduce the complex flight data to a simpler
form that can be used as input for the prediction process.
DATA CATEGORIZATION
● Data is categorized and extracted to identify and analyze similar behavioral data and
patterns; the techniques vary according to organizational requirements.
● Data analysis is linked to data visualization. It is used to establish relationships between
different columns of the database so that they can be used for prediction and visualization.
DATA VISUALIZATION
● Data visualization is the presentation of data in a pictorial or graphical format. It
enables decision makers to see analytics presented visually, to understand difficult
concepts, or to identify new patterns.
● Visualization helps to build charts and graphs that show more detail, thereby changing what
data you see and how it is processed. This helps us to understand which features of the data
are strongly related to each other.
Linear Regression?
Linear Regression is a supervised Machine Learning technique that fits the best linear model
to the relationship between variables. It describes how the dependent variable (vertical axis)
changes with the independent variable (horizontal axis).
There are two types of Linear Regression:
● Simple Linear Regression: a Linear Regression approach which uses a single
independent variable to estimate the value of a quantitative dependent variable.
Figure 1.1: Simple Linear Regression
● Multiple Linear Regression: a Linear Regression approach which uses two or more
independent variables to estimate the value of a quantitative dependent variable.
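A minimal sketch of simple linear regression with scikit-learn is given below. The duration and fare values are invented so that the fare is exactly 2000 + 1000 × hours; they are not taken from the project dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one independent variable (flight duration in hours)
# and one dependent variable (fare), constructed as fare = 2000 + 1000 * hours.
duration_hours = np.array([[1.0], [2.0], [3.0], [4.0]])
fare = np.array([3000.0, 4000.0, 5000.0, 6000.0])

model = LinearRegression().fit(duration_hours, fare)
slope = float(model.coef_[0])
intercept = float(model.intercept_)
predicted = float(model.predict(np.array([[5.0]]))[0])

print("slope:", slope, "intercept:", intercept, "fare for 5 hours:", predicted)
```

Since the toy data lie exactly on a line, the fitted slope and intercept recover the values used to build it; real fare data would leave residual error.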
Null Values and Outliers in ML?
● NULL VALUES: In the datasets used for Machine Learning models, some values
are missing or appear as blank space in particular rows of a particular
attribute; these are represented as null values. They can be handled either by
removing the entire row that contains the missing value or by filling it with
the respective mean, median, or mode.
● OUTLIERS: Some values in a dataset have the same data type as the other data
but are much larger or much smaller, so they look different from the other
values when the overall dataset is plotted. Outliers are usually removed
because they can make the dataset linearly inseparable. To handle outliers, we
usually drop them or replace them with the overall average of that
attribute's values.
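The two outlier treatments mentioned above (dropping or replacing with an average) can be sketched as below. The text does not say how outliers were detected, so the 1.5 × IQR rule used here is an assumption, and the prices are made up. The replacement value is the mean of the remaining values rather than of all values, so that the outlier does not distort its own replacement.

```python
import pandas as pd

# Hypothetical fares; the last value is deliberately extreme.
prices = pd.Series([3900.0, 4100.0, 4200.0, 4300.0, 4500.0, 52000.0], name="Price")

# Assumed detection rule: values beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (prices < lower) | (prices > upper)

# Option 1: drop the outlying rows.
dropped = prices[~is_outlier]

# Option 2: replace outliers with the mean of the non-outlying values.
replaced = prices.where(~is_outlier, dropped.mean())

print(replaced.tolist())
```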
IDE and its types?
An integrated development environment (IDE) is a platform for building
applications that combines the tools developers use into a single graphical user
interface (GUI).
Types of IDEs:
● GOOGLE COLAB
● PYCHARM
Machine Learning Libraries and their types?
A set of routines, commands, and functions written in a programming language is
called a framework or a library; libraries of this kind that are built for
Machine Learning are called Machine Learning libraries.
Types of Machine Learning Libraries:
● Scikit-learn[10]
● TensorFlow
● Pandas
● NumPy[11]
● Matplotlib[12]
CHAPTER 6
CODING AND TESTING
Fig 2
Importing Libraries
Fig 3
Columns in Data Train
Fig 4
Checking for Non-Null Values
Figure 5
Data Pre-Processing
Showing relation between Dependent and Independent Variables
Fig 8
Result with RMSE ERROR and Accuracy
CHAPTER 7
COMPARISON WITH THE EXISTING SYSTEM
EXISTING SYSTEM | CURRENT PROPOSED SYSTEM
Existing systems that use the SVM (Support Vector Machine)[10] algorithm turn out to be less efficient on sequence prediction problems. | The current system, which uses a regression model, is far more efficient on sequence prediction problems, as it is able to store past information.
Existing systems that use SVM fail to give good accuracy when the dataset contains a large number of outliers. | The current system performs better on a large dataset irrespective of the number of outliers in the dataset.
Models that use SVM do not work well with moving averages: because SVM does not remember past data, any change in the averages lowers the accuracy of the model. | The LSTM model with moving averages, which stores past data over the whole dataset, was applied and was eventually evaluated to be the most effective model for estimating future prices.
Algorithm for the Existing System | Algorithm for the Current System
- Importing the dataset. | - Importing the dataset.
- Removal of null values and outliers. | - Removal of null values and outliers.
- Feature selection. | - Feature selection.
- Splitting the dataset into 70% training data and 30% testing data. | - Splitting the dataset into 70% training data and 30% testing data.
- Preprocessing the dataset. | - Preprocessing the dataset.
- Using SVM as an SVC classifier. | - Using the extra trees regressor.
- Testing the dataset. | - Testing the dataset.
- Checking the accuracy score. | - Checking the accuracy score.
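The steps listed for the current system can be sketched as below, using scikit-learn's ExtraTreesRegressor and a 70%/30% split. The data are synthetic and the hyperparameters are assumptions; the project's own settings are not stated in this report. The "accuracy score" is taken here to be the R² score, which is what a scikit-learn regressor's score method reports.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix and fare column.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3000 + 500 * X[:, 0] + 200 * X[:, 1] + rng.normal(0, 50, size=200)

# 70% training data and 30% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the extra trees regressor (n_estimators is an assumed value).
model = ExtraTreesRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# R^2 on the training and testing portions.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print("train R^2:", round(train_r2, 3), "test R^2:", round(test_r2, 3))
```

A training score that is much higher than the test score is a sign of overfitting, so both are worth reporting.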
Algorithm
Datasets
We use two datasets: train data and test data.
The training data combines categorical and numerical columns, and some columns also contain
special characters, so we have to transform the data before applying it to our model.
The test data is similar to the training data set, minus the ‘Price’ column (to be predicted
using the model).
Python Coding
Step 1: Import the relevant libraries in Python.
Step 2: Import Train and Test data sets and append them
The data sets are appended so that train and test can be processed together and the same
changes do not have to be made separately. After we apply the transformations, we separate
them again into test and train.
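A minimal sketch of this append-then-separate pattern is shown below. Two small in-memory DataFrames stand in for the train and test files, whose names are not given in this report; the column values are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the train and test files.
train = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Total_Stops": ["non-stop", "2 stops"],
    "Price": [3897, 7662],
})
test = pd.DataFrame({
    "Airline": ["Jet Airways"],
    "Total_Stops": ["1 stop"],
})

# Append test below train; the test rows get NaN in the Price column.
combined = pd.concat([train, test], ignore_index=True, sort=False)
n_train = len(train)

# ... shared transformations on `combined` would go here ...

# Separate the two sets again by position.
train_again = combined.iloc[:n_train]
test_again = combined.iloc[n_train:].drop(columns=["Price"])
print(combined)
```

Remembering the number of training rows before appending keeps the later separation unambiguous.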
Step 3: Feature Generation
In this step we work on the data set and apply transformations such as binning particular
columns and cleaning messy data so that it can be used in our ML model. This step is very
important because achieving a high prediction score requires continuously refining these
features.
Feature Selection
Feature selection means finding the features that contribute most to, and are most strongly
related to, the target variable. Some of the feature selection methods are:
1. heatmap
2. feature_importances_
3. SelectKBest
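The three methods listed above can be sketched on synthetic data as follows. The feature names and the way the fare is generated are assumptions made for illustration. The correlation values are what a heatmap would display, so no plotting library is needed here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic features: two that drive the fare and one that is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Total_Stops": rng.integers(0, 4, size=300),
    "Duration_mins": rng.uniform(60, 900, size=300),
    "Noise": rng.normal(size=300),
})
y = (3000 + 2500 * X["Total_Stops"] + 6 * X["Duration_mins"]
     + rng.normal(0, 100, size=300))

# 1. Correlation with the target (the numbers behind a heatmap).
corr_with_price = X.assign(Price=y).corr()["Price"].drop("Price")

# 2. feature_importances_ from a tree ensemble.
forest = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)

# 3. SelectKBest keeps the k features with the highest univariate score.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

print(corr_with_price.round(2), importances.round(2), selected, sep="\n")
```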
Date_of_Journey:
In the column ‘Date_of_Journey’, the date format is dd/mm/yyyy and the datatype is object.
There are two ways to handle this column: either convert it into a Timestamp or divide it
into Date, Month, and Year. Here, I split the column.
Date_of_Journey split into 3 variables (Date, Month, Year)
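A sketch of this split with pandas string methods is given below. The sample dates are invented but follow the dd/mm/yyyy format described above; the new column names follow the caption.

```python
import pandas as pd

# Hypothetical sample values in dd/mm/yyyy format.
df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Split on "/" into three columns and convert each part to an integer.
parts = df["Date_of_Journey"].str.split("/", expand=True).astype(int)
df["Date"], df["Month"], df["Year"] = parts[0], parts[1], parts[2]

# The original text column is no longer needed.
df = df.drop(columns=["Date_of_Journey"])
print(df)
```

The alternative mentioned above, converting to a Timestamp, could use pd.to_datetime with dayfirst=True and then read the day, month, and year from the result.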

Arrival_Time:
In the column ‘Arrival_Time’, some values combine the time with the day and month, but we need
only the time, so we split the time into ‘Hour’ and ‘Minute’.
Arrival_Time split into 2 variables (Hour, Minute)
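The split can be sketched as below. The sample values, including the assumed "HH:MM DD Mon" layout for rows that carry a date, and the new column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical values: some rows carry a day and month after the time.
df = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})

# Keep only the leading "HH:MM" part, then split it into hour and minute.
time_only = df["Arrival_Time"].str.split(" ").str[0]
hour_minute = time_only.str.split(":", expand=True).astype(int)
df["Arrival_Hour"] = hour_minute[0]
df["Arrival_Minute"] = hour_minute[1]

print(df)
```

The same pattern would apply to any other column that stores a time as "HH:MM" text.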
Total_Stops:
This column combines a number with a categorical word, such as ‘1 stop’. We need only the
number from this column, so we split the value and keep the number; we also change
‘non stop’ into ‘0 stop’ and convert the column to integer type.
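A sketch of this conversion is shown below. The sample values are made up, and both the hyphenated and the spaced spelling of "non stop" are handled because the exact spelling in the dataset is an assumption here.

```python
import pandas as pd

# Hypothetical sample values for the Total_Stops column.
df = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops", "non stop"]})

# Rewrite "non-stop" / "non stop" as "0 stop", then keep the leading number.
stops = df["Total_Stops"].str.replace(r"non[- ]stop", "0 stop", regex=True)
df["Total_Stops"] = stops.str.split(" ").str[0].astype(int)

print(df["Total_Stops"].tolist())
```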
Dep_Time:
As with ‘Arrival_Time’, we split this column into hour and minute and convert them to
integers.
Dep_Time split into 2 variables (Hour, Minute)
Route:
The ‘Route’ column tells us which cities the flight passes through on its way from source to
destination. This column is very important because the route taken directly affects the
price of the flight, so we split the Route column to extract this information. We replace the
‘NaN’ values with ‘None’.
Route split into 5 variables
Replacing the NaN values with ‘None’
Before splitting
After splitting
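The Route split can be sketched as below. The " → " separator, the airport codes, and the Route_1 to Route_5 column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

SEPARATOR = " → "  # assumed separator between the cities of a route

# Hypothetical routes, including one missing value.
df = pd.DataFrame({"Route": ["BLR → DEL", "CCU → IXR → BBI → BLR", np.nan]})

# Split into one column per city and pad out to exactly five columns.
parts = df["Route"].str.split(SEPARATOR, expand=True)
parts = parts.reindex(columns=range(5)).astype(object)
parts.columns = [f"Route_{i + 1}" for i in range(5)]

# Missing legs (and entirely missing routes) become the string "None".
parts = parts.fillna("None")

df = pd.concat([df.drop(columns=["Route"]), parts], axis=1)
print(df)
```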
Step 4: Prepare categorical variables for model using label encoder
To convert categorical text data into numerical data that the model can understand, we use
the LabelEncoder class. To label encode a column, we import the LabelEncoder class from the
sklearn library, fit and transform the column, and then replace the existing text data with
the new encoded data.
Label encoding of Categorical variables
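A minimal sketch of label encoding one column is shown below; the airline names are sample values chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values for a categorical text column.
df = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"]})

# Fit the encoder on the column and replace the text with integer codes.
encoder = LabelEncoder()
df["Airline"] = encoder.fit_transform(df["Airline"])

print(df["Airline"].tolist())
print(list(encoder.classes_))  # position in this list is the integer code
```

LabelEncoder assigns codes in sorted order of the class names, and inverse_transform recovers the original text from a code.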
Handling Categorical Data

There are many ways to handle categorical data. Two kinds of categorical data are:
1. Nominal data: the categories have no inherent order; OneHotEncoder is used in this case.
2. Ordinal data: the categories have an order; LabelEncoder is used in this case.
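The difference between the two cases can be sketched as below. For brevity this sketch uses pandas.get_dummies as a stand-in for OneHotEncoder and an explicit mapping for the ordinal column so that the order is stated outright; both choices, and the sample values, are assumptions rather than the project's code.

```python
import pandas as pd

# Hypothetical sample: Source is nominal, Total_Stops is ordinal.
df = pd.DataFrame({
    "Source": ["Delhi", "Kolkata", "Delhi"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})

# Nominal: one 0/1 column per category, with no order implied.
one_hot = pd.get_dummies(df["Source"], prefix="Source", dtype=int)

# Ordinal: map each category to an integer that respects its order.
stop_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2}
df["Total_Stops"] = df["Total_Stops"].map(stop_order)

encoded = pd.concat([df.drop(columns=["Source"]), one_hot], axis=1)
print(encoded)
```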
Step 5: Divide the data set into test and train
Now that all our data is numerical after label encoding, we split the data back into test and
train and drop the Price column from the test set, because the price is what we have to
predict for the test data.
X: independent variables; y: dependent variable
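A sketch of this separation is given below. It assumes that the appended test rows can be recognised by their missing Price value; the encoded values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded data; the last row came from the test set (no Price).
combined = pd.DataFrame({
    "Airline": [1, 0, 2, 1],
    "Total_Stops": [0, 2, 1, 0],
    "Price": [3897.0, 7662.0, 6218.0, np.nan],
})

# Rows with a known Price form the training set; the rest form the test set.
is_train = combined["Price"].notna()
train = combined[is_train]
test = combined[~is_train].drop(columns=["Price"])

X = train.drop(columns=["Price"])  # independent variables
y = train["Price"]                 # dependent variable
print(X.shape, y.shape, test.shape)
```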
Step 6: Build Model
The goal in this step is to develop a benchmark model that serves as a baseline against which
we measure the performance of a better-tuned algorithm. We use different regression
techniques and compare them to see which algorithm performs better than the others. At the
end, we combine all of them using stacking and see how our model predicts.
CHAPTER 8
CONCLUSION
Key Findings
The objective of the analysis was to provide an overview of the patterns and conditions on
which flight fares are based. Clear and useful knowledge was obtained, indicating that the
data is reliable enough to be used for predicting fares. Using Machine Learning, the accuracy
of the proposed model was measured and verified. The accuracy achieved on the training data
set was 95%, and on the test set it was 79.7%.
Significance
Flight fares in India are increasing day by day due to many factors, such as the cost of
aviation fuel and frequent travel. This analysis used datasets from 2019. Several modules
were kept in mind while performing the data analysis. The data received was not clean, so
extensive data cleaning and preprocessing were done to make it usable.
Limitations
● Various other models, with different algorithms, could support the current
system in order to provide various offers to the customer.
● These include the relation between other dependent and independent variables, including
holiday dates and the festive season.
Future Scope
As future work, newer datasets could be analyzed so that new fares can be framed. This will
lead to improved flight fares and will help the general public. Technology will help increase
the accuracy and efficiency of fare prediction. Analytics can also help discover and identify
trends to improve operational effectiveness.
CHAPTER 9
REFERENCES
● Kaggle Datasets :
● https://www.kaggle.com/nikhilmittal/flight-fare-prediction-mh
● https://github.com/Mandal-21/Flight-Price-Prediction/find/master