Flight Fare Prediction: Project Report


FLIGHT FARE PREDICTION

PROJECT REPORT

submitted in partial fulfillment of the requirements
for the Major Project and
for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

submitted by
Mr. PRATYUSH KUMAR
[Registration No. RA1811003030370]

Under the guidance of


Mr. Rahul Pandey
(Assistant Professor, Department of Computer Science and Engineering)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Deemed to be University under Section 3 of the UGC Act, 1956)

April 2022

SRM INSTITUTE OF SCIENCE & TECHNOLOGY


(Under Section 3 of UGC Act, 1956)

BONAFIDE CERTIFICATE

Certified that this project report titled “FLIGHT FARE PREDICTION SYSTEM” is the
bonafide work of PRATYUSH KUMAR [Reg No: RA1811003030370], who carried out the
project work under my supervision. Certified further, that to the best of my
knowledge the work reported herein does not form part of any other project report
or dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.

SIGNATURE SIGNATURE

HEAD OF THE DEPARTMENT


Dept. of Computer Science & Engineering

Signature of the External Examiner

Assistant Professor
Dept. of Computer Science & Engineering

ABSTRACT
Flight fares in India were hiked in 2019, as reflected in the comprehensive statistics now
compiled annually by the Ministry of Civil Aviation.

By examining the peaks, troughs, and turning points, this study examines the similarities and
differences in long-term trends between different airlines. The data for our study are drawn
from Kaggle. The results suggest that both international and domestic flight fares have been
rising since 2019. The airline industry suffered a great loss due to the pandemic (around $314
billion).

In this project we analyze the dataset on the basis of different categories such as Duration,
Total Stops, Airline, Source and various other factors. Finally, we predict the flight fares for
different destinations and different airlines.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my guide, Mr. Rahul Pandey, for his valuable
guidance, consistent encouragement, personal care and timely help, and for providing me with
an excellent atmosphere for doing research. All through the work, in spite of his busy
schedule, he has extended cheerful and cordial support to me for completing this research
work.

Pratyush Kumar

TABLE OF CONTENTS

LIST OF FIGURES

ABBREVIATIONS

RMS - Root Mean Square

ML - Machine Learning

SVM - Support Vector Machine

CHAPTER 1
INTRODUCTION

This project examines the reasons for varying flight prices, which depend on various factors
including the destination, date, journey timing, length of the flight journey and various other
factors. As we go deeper into the project, we will see how it covers the various aspects of
measuring the accuracy of our results on the test data set and predicting flight fares from the
given data. Flight ticket prices can be hard to guess: today we might see one price, and when we
check the price of the same flight tomorrow it will be a different story. We might often have
heard travelers saying that flight ticket prices are so unpredictable. As data scientists, we are
going to show that, given the right data, fares can be predicted. Here we are provided with
prices of flight tickets for various airlines between the months of March and June of 2019 and
between various cities.

1.1 SIGNIFICANT TARGETS INCLUDE:

1. To construct and implement this aggregator function using flight fare data of
leading air carrier companies from the year 2019.
2. To measure the accuracy of the results against the training data, and to compute
evaluation scores such as the MSE and RMSE.
3. To collect and process the various flight fares in order to predict future fares.

4. To analyze the increasing and decreasing flight fares.

CHAPTER 2
LITERATURE SURVEY

Author Name                        Title                         Source

Dadoun, Amine                      "How recommender systems      Journal of Revenue and Pricing
                                   can transform airline offer   Management
                                   construction and retailing."

Dadoun, A., Defoin-Platel, M.,     How recommender systems       Journal of Revenue and Pricing
Fiig, T., Landra, C., & Troncy,    can transform airline offer   Management
R.                                 construction

Dadoun, A., Defoin-Platel, M.,     Flight Fare                   Journal of Revenue and Pricing
Fiig, T., Landra, C. and Troncy,   Recommendation System         Management

Fig 1

CHAPTER 3
OVERALL DESIGN FOR PROPOSED SYSTEM

3.1 SOFTWARE REQUIREMENTS:

For the development of this project, the version of Python used is 3.6.5.

Hardware Interfaces:

1. Processor: 5th-generation Intel Core i5 with a minimum speed of 2.9 GHz.
2. RAM: The minimum requirement is 4 GB.
3. Hard Disk: The minimum requirement is 250 GB.
4. Software Interfaces:
i. MS Word (2016)
ii. Database storage: MS Excel
iii. Operating System: Windows 10

3.2 PLANNING

The various steps required when making this project are listed below:

1. Research of the problem statement.

2. Collection of the set requirements.

3. Project feasibility analysis (Amar Mondal, 2012).

4. Development of a standard structure.

5. Review of journals about past activities related to this area.

6. Choosing the best method for improving the algorithm.

7. Review of the good and bad analyses carried out.

8. Initiation of project development.

9. Installation of software and PIP packages.

10. Improving the algorithm.

11. Manual algorithm analysis.

12. Writing the code in accordance with advanced Python practices to enhance accuracy.

CHAPTER 4
DATASET AND DATABASE SPECIFICATION
We have used a dataset of flight fares from Kaggle for the flight fare prediction, with
attributes such as Date, Route, Day and Airline (predictors) and Fare. The records in the
dataset are from the year 2019. For the current model, the fare (‘Price’) attribute is
treated as the predicted variable, while the other attributes are taken as predictors.
The flight_fare.csv dataset had null or missing values, which were replaced by the mean
values of their respective columns.

CHAPTER 5
METHODOLOGY

DATA ANALYSIS

● Data Analysis is a process of collecting, transforming, cleaning, and modelling data with
the goal of discovering the required information. The results so obtained are
communicated, suggesting conclusions and supporting decision-making.
● Data analytics is performed on the flight database so that the data can be cleaned and data
that is not required can be deleted. It is used to reduce the complex flight data to a simpler
form so that it can be used as input for the prediction process.

DATA CATEGORIZATION

● Data is categorized and extracted to identify and analyze similar behavioral data and
patterns; the techniques vary according to organizational requirements.
● Data analysis is linked to data visualization. It is used to establish relationships between
different columns of the database so that they can be used for prediction and visualization.

DATA VISUALIZATION

● Data visualization is the way of presenting data in a pictorial or graphical format. It
enables decision makers to see analytics presented visually, to understand difficult
concepts or to identify new patterns.
● Visualization helps to make charts and graphs for more detail, thereby changing what
data you see and how it is processed. This helps us to understand which features of the data
have strong relations between them.

Linear Regression?
Linear Regression is a supervised Machine Learning technique that finds the best-fit linear
model for the relationship between variables. It also describes how a dependent variable
(vertical axis) changes with an independent variable (horizontal axis).

There are two types of Linear-Regression:-

● Simple-Linear-Regression: Simple-Linear-Regression is a Linear-Regression
approach which uses a single independent variable to estimate the value of a
quantitative dependent variable.

● Multiple-Linear-Regression: Multiple-Linear-Regression uses two or more
independent variables to estimate the value of a quantitative dependent variable.

Figure:1.1 Simple-Linear-Regression
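
As an illustration of Simple-Linear-Regression, the following is a minimal sketch of a closed-form least-squares fit using only the Python standard library. The function name and the duration/fare values are hypothetical and chosen for illustration; they are not taken from the project dataset, and the report's own model may have been built differently.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: flight duration in minutes vs. fare (illustrative values only).
durations = [60, 120, 180, 240]
fares = [3000, 5000, 7000, 9000]

slope, intercept = fit_simple_linear_regression(durations, fares)
predicted_fare = intercept + slope * 150  # estimate for a 150-minute flight
```

Here the single independent variable is the duration and the dependent variable is the fare.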

Null Values and Outliers in ML?

● NULL VALUES: In the datasets used by Machine Learning models, certain values
may be missing or left as blank space in particular rows of a particular
attribute; these are represented as Null Values. They can be handled either by
removing the entire row containing the missing value or by filling the value
with the respective mean, mode or median.

● OUTLIERS: Outliers are values of the same data type as the other data that are
either much larger or much smaller than the rest, so that they stand apart from
the other values when the overall dataset is plotted. Outliers are usually
discarded, since they can make the dataset linearly inseparable. To deal with
them, we usually either drop them or replace them with the overall average of
that attribute's values in the dataset.
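
The two treatments described above can be sketched as follows. This is an illustrative example only: the report does not say how outliers were detected, so the interquartile-range (IQR) rule used here is an assumption, as are the function names and sample values. The sketch also replaces outliers with the mean of the remaining values rather than the overall mean, so that the outlier itself does not distort the replacement.

```python
from statistics import mean, quantiles


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


def replace_outliers_with_mean(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the mean of the inliers.

    The IQR rule is an assumed detection method, not one stated in the report.
    """
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    inlier_mean = mean(v for v in values if low <= v <= high)
    return [v if low <= v <= high else inlier_mean for v in values]


# Hypothetical columns: one with a missing value, one with an extreme fare.
durations = [120, None, 180, 150]
prices = [4000, 5000, 5000, 6000, 5500, 4500, 90000]

filled = fill_missing_with_mean(durations)
cleaned = replace_outliers_with_mean(prices)
```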

IDE and its types?

An integrated development environment (IDE) is a platform in which applications are
built; it combines the tools used by developers into a single graphical user
interface (GUI).

Types of IDEs:

● GOOGLE COLAB

● PYCHARM

Machine Learning Libraries and their types?

A set of routines, commands and functions written in a programming language is called
a framework or a library; libraries of this kind that support machine learning are
called Machine-Learning-Libraries.

Types Of Machine Learning Libraries:-

● scikit-learn[10]

● TensorFlow

● Pandas

● NumPy[11]

● Matplotlib[12]

CHAPTER 6

CODING AND TESTING

Fig 2
Importing Libraries

Fig 3

Columns in Data Train

Fig 4

Checking for Non-Null Values

Figure 5
Data Pre-Processing

Showing relation between Dependent and Independent Variables

Fig 8
Result with RMSE ERROR and Accuracy


CHAPTER 7
COMPARISON WITH THE EXISTING SYSTEM

EXISTING SYSTEM                                 CURRENT PROPOSED SYSTEM

The existing system, which uses the SVM         The current system, which uses a regression
(Support Vector Machine)[10] algorithm, turns   model, is far more efficient with sequence
out to be less efficient when it comes to       prediction problems, as it is able to store
sequence prediction problems.                   past information.

The existing system using SVM fails to give     The current system performs better on a large
better accuracy when the dataset contains a     dataset irrespective of the number of outliers
large number of outliers.                       in the dataset.

Models using SVM do not work well with moving   The LSTM model with moving averages was
averages: since SVM does not remember past      applied, through which past data were stored
data, any change in the averages lowers the     over the overall dataset, and it was
accuracy of the model.                          eventually evaluated to be the most effective
                                                model in estimating the stock prices for the
                                                future.

Algorithm for the Existing System:

- Importing the dataset.
- Removal of null values and outliers.
- Feature selection.
- Splitting the dataset into 70% training data and 30% testing data.
- Preprocessing the dataset.
- Using SVM as an SVC classifier.
- Testing the dataset.
- Checking the accuracy score.

Algorithm for the Current System:

- Importing the dataset.
- Removal of null values and outliers.
- Feature selection.
- Splitting the dataset into 70% training data and 30% testing data.
- Preprocessing the dataset.
- Using the extra tree regressor.
- Testing the dataset.
- Checking the accuracy score.
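
The final step of both algorithms, checking the accuracy score, can be illustrated with the two measures commonly used for regression models. The sketch below assumes that the "accuracy" of a regressor refers to the coefficient of determination (R²), which is what scikit-learn regressors report from their score method; the report does not state this explicitly. The fare values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical fares: true values and a model's predictions.
actual_fares = [3000, 5000, 7000, 9000]
predicted_fares = [3500, 4500, 7500, 8500]

error = rmse(actual_fares, predicted_fares)
accuracy = r2_score(actual_fares, predicted_fares)
```

A lower RMSE and an R² closer to 1 both indicate predictions that track the actual fares more closely.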

Algorithm

Datasets

We will be using two datasets: Train data and Test data.

The training data is a combination of both categorical and numerical values. It also contains
some special characters, because of which we have to apply data transformation before feeding
it to our model.

The test data is similar to the training data set, minus the ‘Price’ column (to be predicted
using the model).

Python Coding

Step 1: Import the relevant libraries in Python.

Step 2: Import Train and Test data sets and append them

The data sets are appended so that the train and test data can be processed together and the
changes do not have to be made separately. After applying the transformations, we can separate
them into test and train again.
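
A minimal sketch of this append-then-separate pattern is shown below using plain Python dictionaries. The `is_train` flag and the sample rows are hypothetical; with pandas, the same idea is usually expressed by concatenating the two tables and later selecting rows by such a flag.

```python
# Hypothetical rows standing in for the train and test files.
train_rows = [
    {"Airline": " IndiGo", "Total_Stops": "non-stop", "Price": 3897},
    {"Airline": "Air India ", "Total_Stops": "2 stops", "Price": 7662},
]
test_rows = [
    {"Airline": "Jet Airways", "Total_Stops": "1 stop"},  # test data has no Price
]

# Tag each row with its origin so that the combined table can be separated again.
combined = (
    [dict(row, is_train=True) for row in train_rows]
    + [dict(row, is_train=False, Price=None) for row in test_rows]
)

# Apply every transformation once, to train and test rows alike.
for row in combined:
    row["Airline"] = row["Airline"].strip().lower()

# Separate the transformed rows back into train and test.
train_again = [row for row in combined if row["is_train"]]
test_again = [row for row in combined if not row["is_train"]]
```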

Step 3: Feature Generation

In this step we mainly work on the data set and apply transformations, such as binning
particular columns and cleaning messy data, so that it can be used in our ML model. This step
is very important because achieving a high prediction score requires continually refining
these features.

Feature Selection

Feature selection means finding the features that contribute most and have a good relationship
with the target variable. Some feature selection methods are:

1. heatmap
2. feature_importances_
3. SelectKBest
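
As a rough illustration of the idea behind these methods, the sketch below ranks features by the absolute Pearson correlation with the target, which is the quantity a correlation heatmap displays, and keeps the top k. It is a simplified stand-in written with the standard library, not the scikit-learn SelectKBest implementation; the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def top_k_features(features, target, k):
    """Return the k feature names with the largest absolute correlation to the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns and fares.
features = {
    "Total_Stops": [0, 1, 2, 1, 0],
    "Duration_Minutes": [90, 300, 600, 280, 100],
    "Journey_Day": [3, 1, 4, 1, 5],
}
price = [3000, 6000, 9500, 6200, 3100]

selected = top_k_features(features, price, k=2)
```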

Date_of_Journey:

In the column ‘Date_of_Journey’, the date format is dd/mm/yyyy and the datatype is object.
There are two ways to tackle this column: either convert the column into a Timestamp or divide
it into Date, Month and Year. Here, I split the column.

Date_of_Journey split into 3 variables (Date, Month, Year)
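
The split can be sketched as below, assuming the dd/mm/yyyy format described above. The function name and the sample date are illustrative.

```python
from datetime import datetime


def split_journey_date(date_text):
    """Parse a dd/mm/yyyy string and return its day, month and year as integers."""
    parsed = datetime.strptime(date_text, "%d/%m/%Y")
    return {"Date": parsed.day, "Month": parsed.month, "Year": parsed.year}


parts = split_journey_date("24/03/2019")
```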


Arrival_Time:

In the column ‘Arrival_Time’, the values combine the time with date information (such as the
month), but we need only the time details, so we split the time into ‘Hour’ and ‘Minute’.

Arrival_Time split into 2 variables (Hour, Minute)
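
A minimal sketch of this split follows. The sample strings assume that the time comes first in HH:MM form and that any date part follows after a space; this layout is an assumption about the raw data, and the function name is hypothetical. The same helper would apply to the ‘Dep_Time’ column.

```python
def split_time(time_text):
    """Return (hour, minute) from strings such as '01:10 22 Mar' or '22:20'."""
    clock = time_text.split()[0]  # keep only the 'HH:MM' part, drop any trailing date
    hour, minute = clock.split(":")
    return int(hour), int(minute)


arrival_hour, arrival_minute = split_time("01:10 22 Mar")
dep_hour, dep_minute = split_time("22:20")
```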

Total_Stops:

This column is a combination of a number and a categorical label, such as ‘1 stop’. We need
only the number from this column, so we split the value and keep the number. We also change
‘non stop’ into ‘0 stop’ and convert the column into integer type.
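
This conversion might look like the following sketch. Both spellings of the non-stop label are handled because the exact spelling in the raw data is an assumption here; the function name is illustrative.

```python
def stops_to_int(stops_text):
    """Convert labels such as 'non-stop', '1 stop' or '2 stops' to an integer count."""
    text = stops_text.strip().lower()
    if text in ("non-stop", "non stop"):
        return 0
    return int(text.split()[0])  # the leading number of 'N stop(s)'


stop_counts = [stops_to_int(s) for s in ["non-stop", "1 stop", "2 stops", "non stop"]]
```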

Dep_Time:

As with ‘Arrival_Time’, we split this column into hour and minute and convert them into
integers.

Dep_Time split into 2 variables (Hour, Minute)

Route:

The ‘Route’ column tells us which cities the flight passes through to reach the destination
from the source. This column is very important because the route taken directly affects the
price of the flight, so we split the Route column to extract this information. We replace the
‘NaN’ values with ‘None’.

Route split into 5 variables

Replacing the NaN values with ‘None’
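
The route split and the handling of missing values can be sketched together as follows. The ‘→’ separator and the `Route_1` to `Route_5` column names are assumptions made for illustration; the report does not show the raw format of the column.

```python
def split_route(route, max_legs=5, separator="→"):
    """Split a route string into max_legs columns, padding with the string 'None'."""
    is_missing = route is None or (isinstance(route, float) and route != route)  # None or NaN
    cities = [] if is_missing else [city.strip() for city in route.split(separator)]
    padded = (cities + ["None"] * max_legs)[:max_legs]
    return {f"Route_{i + 1}": city for i, city in enumerate(padded)}


legs = split_route("BLR → NAG → DEL")
missing = split_route(float("nan"))
```

Padding with the string ‘None’ keeps every row at the same five columns, whether the route is short or missing altogether.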

Before splitting

After splitting

Step 4: Prepare categorical variables for model using label encoder

To convert categorical text data into model-understandable numerical data, we use the
LabelEncoder class. To label encode a column, all we have to do is import the LabelEncoder
class from the sklearn library, fit and transform the column of the data, and then replace the
existing text data with the new encoded data.

Label encoding of Categorical variables

Handling Categorical Data


There are many ways to handle categorical data. The two main types of categorical data are:

1. Nominal data --> data are not in any order --> OneHotEncoder is used in this case
2. Ordinal data --> data are in order --> LabelEncoder is used in this case
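
The difference between the two encodings can be seen in the following standard-library sketch. It only mimics the idea behind the sklearn LabelEncoder and OneHotEncoder classes and is not their implementation; the function names and the airline values are illustrative.

```python
def label_encode(values):
    """Map each category to an integer code, assigned in sorted order of the categories."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per category, so that no order is implied between categories."""
    classes = sorted(set(values))
    rows = [[1 if v == category else 0 for category in classes] for v in values]
    return rows, classes


airlines = ["IndiGo", "Air India", "IndiGo", "SpiceJet"]

codes, mapping = label_encode(airlines)
one_hot, columns = one_hot_encode(airlines)
```

Label encoding yields a single integer column, which implies an order between categories; one-hot encoding avoids that at the cost of one column per category.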

Step 5: Divide the data set into test and train

Now that all our data is numerical after label encoding, we split the data into test and train
and drop the price column from the test set, because we have to predict the price with our
test data set.

X: independent variables; y: dependent variable
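
To evaluate a model before predicting on the unlabeled test file, part of the labeled data can be held out. The sketch below shows a 70%/30% hold-out split of X and y, matching the proportions given in the algorithm steps of this report. The function name, the fixed seed and the sample values are assumptions; it is a simplified stand-in for a library splitting routine.

```python
import random


def train_test_split_rows(X, y, test_fraction=0.3, seed=42):
    """Shuffle row indices and hold out test_fraction of the rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so that the split is reproducible
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Hypothetical encoded features [Total_Stops, Dep_Hour] and fares.
X = [[stops, hour] for stops, hour in zip([0, 1, 2, 1, 0, 2, 1, 0, 1, 2], range(10))]
y = [3000 + 2500 * row[0] for row in X]

X_train, X_test, y_train, y_test = train_test_split_rows(X, y)
```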

Step 6: Build Model

The goal in this step is to develop a benchmark model that serves as a baseline against which
we will measure the performance of a better and more finely tuned algorithm. We use different
regression techniques and compare them to see which algorithm gives better performance than
the others. At the end, we combine all of them using stacking and see how our model predicts.

CHAPTER 8
CONCLUSION

Key Findings
The objective of the analysis was to provide an overview of the patterns and conditions on
which flight fares are based. Clear knowledge of these was achieved, indicating that the data
is reliable enough to be used for predicting fares. Using Machine Learning, the accuracy of
the proposed model was measured and verified. The accuracy achieved on the training data set
was 95%, and on the test set it was 79.7%.

Significance
Flight fares in India are increasing day by day due to many factors such as the cost of
aviation fuel, frequent travel, etc. This analysis was done using datasets from 2019. Several
modules were kept in mind while performing the data analysis. The data as received was not
ready for use, and thus data cleaning and preprocessing were done extensively to make it
usable.

Limitations

● There are various other models, based on different algorithms, that could support the
current system in order to provide various offers to the customer.

● The current model does not cover the relation between the fare and other variables,
including holiday dates and the festive season.

Future Scope

As future work, newer datasets could be analyzed so that new fares can be framed. This will
lead to improved flight fares and will help the general public. Technology will help increase
the accuracy and efficiency of predicting fares. Analytics can also help discover and identify
trends to improve operational effectiveness.

CHAPTER 9
REFERENCES

● Kaggle dataset: https://www.kaggle.com/nikhilmittal/flight-fare-prediction-mh
● GitHub repository: https://github.com/Mandal-21/Flight-Price-Prediction/find/master
