JOIV : Int. J. Inform. Visualization, 7(3-2): Empowering the Future: The Role of Information Technology in Building Resilience - November 2023 2056-2064
INTERNATIONAL JOURNAL
ON INFORMATICS VISUALIZATION
journal homepage : www.joiv.org/index.php/joiv

No-Show Passenger Prediction for Flights


Wei-Song Chin a,*, Choo-Yee Ting a, Chin-Leei Cham a
a Faculty of Computing and Informatics, Multimedia University, 63000 Cyberjaya, Selangor, Malaysia
Corresponding author: *1191100961@student.mmu.edu.my

Abstract— In aviation, a “no-show” is a customer who booked a reservation but failed to show up. No-shows can result in various resource wastes, such as vacant seats, leading to income loss and flight delays. As a result, no-show passengers can cause considerable problems for airlines, ultimately affecting their bottom line. Recent research has shown the use of machine learning algorithms to reduce the rate of no-shows. For example, researchers in healthcare have used predictive models to identify no-show patients and increase efficiency. Therefore, this study aimed to develop prediction models to predict passenger no-shows. In this work, we used a dataset supplied by a local airline company consisting of 1,046,486 rows and 8 columns. Additional datasets, namely weather data, public holiday data of different countries, aircraft details, and foot traffic data, were used to enrich the original dataset with further features. Because this enrichment produced a large number of columns, feature selection became an important stage in this research to identify and pick the most relevant and useful features. The findings showed that the model built using Random Forest has the highest accuracy of 90.4%, while Decision Tree performed at 90.2%, Gradient Boosting at 86.5%, and Neural Networks at 67.6%. To enhance the accuracy of the models, further research efforts are essential to integrate supplementary passenger information.

Keywords— No-show; aviation; prediction; machine learning; classification; feature enrichment; feature selection; neural networks.

Manuscript received 25 Dec. 2022; revised 14 Jul. 2023; accepted 20 Aug. 2023. Date of publication 30 Nov. 2023.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION

When a person is expected but does not show up or appear where they are supposed to, it is known as a no-show [1]–[11]. No-show behavior, a special kind of absenteeism, is frequently problematic, especially in the service sector. Although such behavior might harm the individual no-show in the long term, today, service providers, and sometimes uninvolved third parties, tend to bear the short-term consequences [3].

No-shows can have a big impact on the service sector, healthcare sector [4], [8], [12], and airline sector [13] through direct financial costs, operational costs, customer service costs, and many other things. Every time a patient or customer chooses to default, there is not only valuable time lost but also a risk that the appointment or booking may never be filled, leading to lower revenue, underutilized personnel, lost commissions, and demoralized employees. Regardless of the size and type of business, the cost associated with cancellations and no-shows is a genuine pain and inconvenience, and it can frequently occur across many specialties in many places. In a replicated facility, researchers found that no-shows caused a daily net loss of 16.4% ($725.42). However, intervention tactics could reduce this loss by half, ranging from 3.8% ($167.38) to 10.5% ($464.27). This means that the estimated gains from interventions could make up 36.0% ($261.15) to 76.9% ($558.04) of the losses caused by no-shows [14].

In the aviation industry, if an aircraft departs with vacant seats that might otherwise have been filled as a result of such a “no-show”, the airline will lose the opportunity to resell the seat to another customer, resulting in a direct loss of revenue. For this reason, airlines often accept reservations that exceed the cabin capacity based on estimates of the number of no-shows. This is also a concern since sufficiently high levels of overbooking could result in problems including customer dissatisfaction, brand damage (especially in the social media era, when passenger complaints increase in scope and reach), and revenue effects [10].

Understanding the causes of the high occurrence of no-show incidents is critical. In our daily lives, numerous no-show incidents might not get much notice. For instance, people who reserved a table at a restaurant did not show up; participants were not present at the jury as scheduled; patients scheduled appointments but did not appear; flight passengers did not show up while boarding. Every no-show event appears
insignificant when seen in the context of the larger world. At a macro level, because thousands of identical events occur each year, these seemingly insignificant absences are not favorable to the long-term development of many industries [12]. Hence, it is also necessary to look into several strategies that may be used to lessen the consequences of no-shows and cut down on their frequency.

Subsequently, air travel is particularly susceptible to social, political, and economic changes, causing passenger purchasing patterns to alter significantly. Thus, it is challenging to determine the variables that influence no-show prediction. Thousands of variables are present in big data sets, making it challenging to handle and manage effectively using conventional methods. Consequently, many studies in other fields have focused on variable selection [15].

Lastly, developing precise prediction models is a significant challenge in various industries, such as business, finance, healthcare, etc. Recently, there has been a growing interest in creating predictive models that can forecast the future based on past data and relevant variables. Well-designed prediction models can help companies and organizations make wise choices, conserve resources, and enhance their financial performance. As a result, different industries must maintain their competitiveness by identifying their top-performing model and delivering insightful guidance for making decisions based on reliable forecasts [16]–[18].

This paper aims to discuss the construction of an analytical dataset that identifies the factors contributing to passengers not showing up for their flights. This will involve identifying the variables that caused passengers to not show up for their flights. Following the dataset's creation, predictive models will be created to anticipate the likelihood of passengers failing to show up for their flights to select the best-performing model.

A. Reasons For No-Show

1) Weather Conditions: Bad weather conditions significantly increase the likelihood of appointment failure or no-show, which is considered an uncontrollable factor [5], [8], [19]. Studies have shown that snowfall and extremely low and high temperatures are particularly associated with higher probabilities of no-shows [20]. In response to severe weather predictions, imaging personnel have options such as adjusting exam scheduling, prioritizing patients on the waitlist for extended appointment times, or implementing dynamic scheduling strategies. Research in the United States studied the association between extreme weather events and HIV-clinic attendance rates [21]. The study discovered that the risk of no-shows increased along with the heat index at 90°F (32.2°C), with a 14% increase on days above 100°F (37.8°C). Similarly, days with over an inch of precipitation had a higher probability of no-shows compared to dry days. The relative risk increased by 16% for 1-2 inches of precipitation and by 13% for more than 2 inches. Days with reported severe natural occurrences had a 10% higher chance of no-show visits compared to non-disaster days.

2) Foot Traffic: The flow of people through airport terminals, encompassing check-in, security checks, and boarding gates, is commonly known as foot traffic. This foot traffic is crucial to flight schedules and customer experience, particularly for no-shows. During peak travel periods, such as holidays or weekends, the high foot traffic can contribute to an increased number of no-shows due to long queues and overcrowding at customs and immigration counters, as well as security checkpoints. This issue is further compounded when airports have limited infrastructure capacity, resulting in insufficient resources like a shortage of open common check-in counters and inadequate personnel availability [22]. These factors significantly impact passengers' efficiency and overall experience and can lead to no-shows.

When airports experience high passenger volumes, factors such as age, gender, and seasonal clothing can slightly impact the security process [23], [24]. Different age groups may require modified screening procedures, potentially leading to longer screening times. Additional screening procedures may be necessary based on gender-specific security concerns, conducted discreetly to ensure passenger privacy and safety. During colder seasons, bulkier clothing worn by passengers may require further inspection, resulting in slightly longer screening times. Despite these considerations, airports and security personnel strive to maintain efficient screening procedures while accommodating high passenger volumes and upholding safety standards for all individuals.

According to a study in manufacturing and service operations management [25], stores that experience higher interday traffic variability face increased uncertainty, leading to significant errors in labor requirement forecasting. This results in mismatches between required and actual store labor, decreasing customer service, and fewer purchases. Similarly, in airports, high foot traffic combined with fewer immigration officers on duty can lead to increased processing time for passengers at customs and immigration. This can cause passengers to struggle to reach their boarding gates on time, mirroring the challenges faced by stores with high traffic variability and inadequate staffing, resulting in diminished customer service. Consequently, passengers may not be able to make it to their boarding gates promptly.

3) Commute Distance: In the healthcare sector, researchers have found a correlation between the distance between a patient's home and the appointment location and the likelihood of a no-show [1], [6], [11], [19], [26]. Longer distances are associated with a higher risk of no-shows. For example, a study showed that patients have a 1.4% greater chance of missing appointments for every 10-mile increase in their journey [20]. The effect is more significant for patients in the low- and medium-income brackets, with a roughly 2% increase in the risk of no-shows for every 10-mile increase in commute distance, compared to the minimal change observed for high-income patients.

4) Transportation Problems: Transportation problems significantly contribute to no-shows, being an important uncontrollable factor [5], [19]. Insufficient transportation options can result in missed appointments [8]. In Saudi Arabia, female patients' reliance on others for transportation directly affects their rate of no-shows [1]. The recent permission for women to drive in the country has the potential to impact the no-show rates among these patients.
Expanding public transportation systems has reduced the no-show rate, especially for patients living near newly established rail lines [3]. Transportation has been identified as a key factor in maintaining patients' appointment attendance in the healthcare industry [20]. Access to transportation and choosing suitable options are important social determinants of health, as patients cannot receive adequate care if they cannot attend appointments. Lower- and middle-income groups are particularly affected by longer commuting times, relying more on public transportation than wealthier patients who often have private vehicles. Addressing transportation issues can involve scheduling patients at local clinics or outpatient imaging facilities within the same healthcare system to mitigate transportation challenges. Table I below lists the authors who discuss each reason for no-shows.

TABLE I
TABLE OF AUTHOR-REASONS FOR NO-SHOW
Author   Transportation Problems   Commute Distance   Weather Condition   Foot Traffic
[1] ✓ ✓
[3] ✓ ✓
[4] ✓ ✓
[5] ✓ ✓ ✓
[6] ✓
[8] ✓ ✓
[9] ✓
[11] ✓
[19] ✓ ✓ ✓
[20] ✓ ✓ ✓
[21] ✓
[22] ✓
[23] ✓
[24] ✓
[25] ✓
[26] ✓
[27] ✓

B. Overcoming No-Shows in Airline

1) Overbooking: The airline management strongly emphasizes an all-out promotion of continuous improvement and uses successful operations strategies to maximize operational efficiency and boost profit. Since they cannot determine whether a passenger will occupy a seat until the flight takes off, airline business management frequently and effectively employs overbooking as a revenue management strategy [28], [29]. Overbooking is a regular phenomenon in the airline business in which carriers sell more tickets than seats available on the plane. With an overbooking system in place, when demand is low during the low season, the cost can be covered by demand from the peak season. The service sector frequently uses overbooking to guard against undesirable events like cancellations and no-show customers that would result in missed opportunities for increased revenue [30], [31].

A “no-show” in the airline industry is when a passenger who has made a reservation for a flight does not show up. As a result, airlines overbook flights to generate revenue because they expect a certain number of passengers will not show up for the flight [10]. Without overbooking, every cancellation and no-show would leave empty seats in the aircraft, denying the airline the chance to earn extra money. For example, if an airline never overbooked a flight with 100 seats and a 10% no-show rate, there would be 10 empty seats. If a one-way flight from Kuala Lumpur International Airport to Bali, Indonesia, costs roughly RM 1100 on an ordinary day, we can determine that the airline company has already passed up the opportunity to earn RM 11,000 on a single flight.

Overbooking is a policy of selling tickets above the seating capacity. This policy has a risk and might potentially cost the airline if the number of customers who show up for departure exceeds seat capacity because the airline must give specified compensation for overbooking penalties [32]. In these situations, the airline usually looks for volunteers to take a later trip in exchange for payment, like a voucher for a future flight. If the airline cannot recruit enough volunteers, they may “bump” some passengers by denying them boarding involuntarily. Although involuntary bumping entitles passengers to compensation from the airline, it can still be inconvenient and frustrating for passengers. Because of the delay, some travelers can miss connections or significant occasions, and altering their travel arrangements could incur additional costs.

Consequently, this makes airline overbooking a problem that could make passengers feel disappointed if the airline cannot conduct flights efficiently. In addition, if passengers inconvenienced by overbooking are not satisfied with how the airline handles them, and the compensation they receive does not alleviate their unhappiness, they will be less likely to utilize the airline’s services again and more likely to fly with other airlines, which could have an impact on the airline’s ticket sales [30]. Other than that, if the airline fails to handle the matter properly, the worst that can happen is that it can cause serious damage to the company’s brand image. In April 2017, a video circulated online showing a customer being forcefully removed from United Express Flight 3411 due to overbooking. The passenger resisted, and a security guard dragged him out of his seat and down the aisle. The incident caused global outrage, and the security officer involved was suspended. The U.S. Department of Transportation investigated the airline's compliance with overbooking regulations [33].

Determining an optimal overbooking limit has become increasingly important to prevent customer dissatisfaction and protect the company's reputation. One study uses an overbooking model to find a closed-form solution simultaneously for both the optimal booking limit and the optimal overbooking limit [34]. In this work, passenger airline data from Thailand is applied to a mathematical model that combines two of the most crucial airline revenue management tactics: overbooking and seat inventory control. The performance of the two-class overbooking model was then evaluated, and three assumptions were tested using actual data in numerical research. Another study examines the voluntary overbooking model in the context of rational expectation equilibrium to encourage consumer cooperation with airlines, preserve customer goodwill, and optimize expected total returns to airlines [29]. The researchers created a decision tree analysis for both customers and airlines. Simulated and real no-show random variables are subjected to sensitivity
analysis for validation. The results indicate that a “voluntary overbooking” policy that promotes cooperation between passengers and commercial airlines provides significant mutual benefits.

2) Revenue Management: Revenue management is an important part of the airline sector that entails adopting demand management strategies to forecast price and capacity control for various demands effectively. Airlines operate on a business model similar to that of perishable goods, which means that if seats or cargo space remain unsold before a flight, the opportunity to generate revenue is lost [32]. With high fixed costs, revenue management is utilized to predict demand uncertainty issues since excess inventory cannot be stored or carried over to the next period due to the limited capacity of seats and cargo space.

The ultimate goal of revenue management in the airline business is to improve revenue or yield. Revenue management determines how to assign undifferentiated capacity units to satisfy available demand to accomplish the goal. This is done by controlling inventory and price in line with micro-market-level forecasts of consumer behavior [13], [35]. Although airlines would prefer to have higher-fare passengers, they confront market demand uncertainties and frequently offer lower-fare tickets to avoid empty seats and associated opportunity costs. However, airline companies must strike an appropriate equilibrium between the number of tickets supplied at lower rates and those at higher prices to maximize income while providing sufficient capacity for higher-paying customers. Table II below lists the authors who discuss each way the airline industry overcomes no-shows.

TABLE II
TABLE OF AUTHOR-WAYS AIRLINE INDUSTRY OVERCOME NO-SHOW
Author   Overbooking   Revenue Management
[10] ✓
[13] ✓
[28] ✓
[29] ✓
[30] ✓
[31] ✓
[32] ✓ ✓
[33] ✓
[35] ✓

C. Machine Learning Techniques in Resolving Other Challenges

In addition to employing overbooking and revenue management to address the “no-show” problem, researchers use machine learning techniques to make predictions to address additional issues that arise in the aviation and healthcare sectors. For instance, a Chinese researcher employed multivariate linear regression to predict aircraft delay problems [16]. The study presents a method to model the arriving flights and a multiple linear regression technique to estimate delay, and compares it with the Naive-Bayes and C4.5 approaches. After carefully analyzing the data, the researcher discovered a strong correlation between arrival and departure delays. As a result, they use departure delays to forecast arrival delays. According to trials with a realistic dataset of domestic airports, the accuracy of the proposed model is roughly 80%, which is an improvement over the Naive-Bayes and C4.5 approaches. Another researcher in China is forecasting passenger distribution in airport terminal main areas based on flight arrangements [36]. In the research, a mathematical optimization (Gamma distribution) predictive model is created that uses the passengers’ dwell time in each area of an airport terminal as input to predict how people would be distributed throughout an airport terminal’s main regions based on the configuration of the flights. The findings show that these areas saw peak passenger dislocation due to the airport’s departure procedure.

When deciding how many seats to allow for overbooking, airlines depend on predictions regarding the expected number of passengers who may not show up for a specific flight. To address this challenge, a paper proposed a decision support system that integrates the Case-Based Reasoning (CBR) method with Interpolative Boolean Algebra (IBA), takes recommendations from both experts and algorithms, and predicts the number of no-show passengers [37]. Aside from that, a study uses the Box-Jenkins model and an artificial neural network model to anticipate the number of passengers flying in Malaysia based on lag variables as input variables [17].

Following that, another Malaysian researcher used geometric Brownian motion (GBM) to forecast the number of passengers over time [38]. GBM is a relatively simple mathematical model usually used to anticipate the future share price for a brief period. The researcher wishes to compare the distributional behavior data from two local airline firms’ passengers. Besides that, a Swedish researcher forecasted the no-show rate of travelers using information from passenger booking data [31]. The researcher used several approaches to perform the prediction to determine whether decision trees, gradient boosting, or neural networks could outperform the simple baseline model. Gradient boosting produced outcomes comparable to but marginally lower than decision trees in the given KPIs (Key Performance Indicators). Out of all the models, the neural network did the worst.

Patients who miss outpatient visits for diagnostic or clinic tests are known in the healthcare sector as “patient no-shows”. Physicians and healthcare facilities must identify these patients to use resources effectively and increase healthcare efficiency. For instance, one researcher created a predictive model to forecast patients’ failure to attend scheduled visits using decision trees and AdaBoost [2]. The models’ analysis indicates that those patients who missed their appointments tend to be younger, male, with morning appointments, and did not receive a text message on the phone or a reminder. Chinese researchers also developed a prediction model for patient no-shows in online outpatient appointments to help hospitals make informed decisions and decrease the likelihood of patient no-show behavior [6].

As we know, the product in the airline industry is the seat, which is a non-stackable product. The seat demand is unpredictable, the capacity is limited and hard to expand, and the variable costs are extremely expensive. As a result, the airline industry places an extremely high priority on demand prediction. One researcher designed and created the best-fit model using a multilayered feed-forward neural network to estimate passenger demand at the flight origin and
destination levels based on historical data [39]. The researchers employed data such as date, territory, origin-destination, and passenger count to anticipate passenger demand to deliver the best outcomes for capacity utilization decisions.

II. MATERIALS AND METHOD

A. Data Preparation

1) Dataset Description: One of Malaysia’s local airline firms provides the primary dataset. This dataset records, for each passenger, whether or not they arrived for the scheduled departure. The dataset is a six-month information collection from July 1, 2022, to December 31, 2022. There are 1,046,486 rows and 8 columns in the dataset. The dataset includes information about each passenger's no-show status, the date and country of issuance, the departure date and time, the departure and arrival airports, the aircraft type, the type of cabin class, and the departure and arrival times. Since the dataset provided does not include nearly enough information, feature enrichment is being applied to the original dataset, resulting in additional columns. For instance, the IATA airport codes for each passenger’s departure airport and arrival airport are checked to determine the kind of flight for each. IATA codes, often known as location identifiers, are made up of three-character alphanumeric geocodes. The International Air Transport Association uses them to designate numerous airports and metropolitan regions all over the world.

In addition to the original dataset, Wikipedia and other websites are crawled for additional datasets and information. “Open-Meteo” is a website where weather information is obtained. Selected hourly and daily meteorological and climatic data items are among the information made available by the website. Temperature, rain, snowfall, wind, and astronomical aspects like sunrise and sunset are all included in the list of weather variables. The goal of obtaining weather data is to determine whether or not the passenger will arrive if it is raining. Following that, the holidays of 56 nations contained inside the primary dataset are obtained by manually checking the countries’ holiday calendars on Google. Since public holidays frequently result in traffic jams, this information can be utilized to forecast whether or not passengers will arrive at the airport during a holiday.

Based on the data provided in the original dataset, the aircraft type for each passenger is included. This allows for extracting additional aircraft details, such as the manufacturer, total number of seats, cargo capacity, fuel capacity, maximum take-off weight, and others. The purpose of analyzing these details is to determine if there is any correlation between the type of aircraft and the number of no-shows among passengers.

The last supplementary dataset being downloaded is data on airport foot traffic from the “BestTime” website. Foot traffic data is the term used to describe the gathering and analysis of information regarding the flow of people through a certain location, such as an airport. A variety of techniques, including sensors and cameras, are used to collect this data. By studying airport foot traffic statistics, we may learn more about how passengers navigate the airport and whether or not the foot traffic patterns have an impact on the passengers arriving at the boarding gate.
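The flight-type enrichment mentioned in the dataset description (checking each passenger's departure and arrival IATA codes) can be illustrated with a minimal pandas sketch. The tiny tables, the airport-code column names “Dep_Airport” and “Arr_Airport”, and the lookup columns “IATA” and “Country” are assumptions made for illustration only; “Dep_country”, “Arr_country”, and “Type_of_flight” follow the names used in this paper. The sketch uses a simple lookup with `Series.map` rather than a full table merge.

```python
import pandas as pd

# Hypothetical miniature of the booking table; the airport-code column
# names are assumptions, not the airline's actual schema.
bookings = pd.DataFrame({
    "Dep_Airport": ["KUL", "KUL", "PEN"],
    "Arr_Airport": ["PEN", "DPS", "SIN"],
})

# Hypothetical IATA lookup table: airport code -> country.
iata = pd.DataFrame({
    "IATA": ["KUL", "PEN", "DPS", "SIN"],
    "Country": ["Malaysia", "Malaysia", "Indonesia", "Singapore"],
})
country_of = iata.set_index("IATA")["Country"]

# Look up the country of each departure and arrival airport.
bookings["Dep_country"] = bookings["Dep_Airport"].map(country_of)
bookings["Arr_country"] = bookings["Arr_Airport"].map(country_of)

# A flight counts as domestic only when both endpoints are in Malaysia.
is_domestic = (bookings["Dep_country"] == "Malaysia") & (
    bookings["Arr_country"] == "Malaysia"
)
bookings["Type_of_flight"] = is_domestic.map(
    {True: "Domestic", False: "International"}
)

print(bookings[["Dep_Airport", "Arr_Airport", "Type_of_flight"]])
```

A lookup keeps every booking row even when a code is missing from the lookup table (the country becomes NaN and the flight is labeled “International”), so in practice unmatched codes would need to be checked separately.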

Fig. 1 Flow of obtaining the Final Version of the dataset.
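As a rough illustration of the flow in Fig. 1 (clean, transform, then merge the supplementary data into the main dataset), the following pandas sketch runs the three stages on a three-row example. The sample rows, the “Dep_Day”, “Dep_Period”, and “Public_Holiday” column names, and the hour boundaries of the four time-of-day categories are assumptions for illustration; the paper does not state the category boundaries.

```python
import pandas as pd

# Hypothetical sample of the main dataset (column names follow the paper
# where they are given; the values are made up).
main = pd.DataFrame({
    "Issuing_Country": ["Malaysia", None, "Indonesia"],
    "Dep_city": ["Kuala Lumpur", "Kuala Lumpur", "Penang"],
    "Dep_Date_Time": [
        "2022-08-31 05:15:00",
        "2022-09-01 13:40:00",
        "2022-08-31 21:05:00",
    ],
})

# Hypothetical public-holiday table keyed by date and departure city.
holidays = pd.DataFrame({
    "Dep_Date": [pd.Timestamp("2022-08-31").date()],
    "Dep_city": ["Kuala Lumpur"],
    "Public_Holiday": ["Yes"],
})

# 1) Cleaning: fill missing issuing countries with the string "None".
main["Issuing_Country"] = main["Issuing_Country"].fillna("None")

# 2) Transformation: parse the timestamp, then derive date, time,
#    day of week, and a time-of-day category.
main["Dep_Date_Time"] = pd.to_datetime(main["Dep_Date_Time"])
main["Dep_Date"] = main["Dep_Date_Time"].dt.date
main["Dep_Time"] = main["Dep_Date_Time"].dt.time
main["Dep_Day"] = main["Dep_Date_Time"].dt.day_name()
main["Dep_Period"] = pd.cut(
    main["Dep_Date_Time"].dt.hour,
    bins=[0, 6, 12, 18, 24],  # assumed boundaries, not from the paper
    right=False,
    labels=["Early morning", "Day", "Afternoon", "Evening"],
).astype(str)

# 3) Merging: a left join keeps every booking; rows without a matching
#    holiday entry are marked "No".
final = main.merge(holidays, on=["Dep_Date", "Dep_city"], how="left")
final["Public_Holiday"] = final["Public_Holiday"].fillna("No")

print(final[["Dep_city", "Dep_Date", "Dep_Day", "Dep_Period", "Public_Holiday"]])
```

The left join is the important design choice here: an inner join would silently drop every booking that does not fall on a listed holiday.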

B. Data Pre-processing

1) Data Cleaning: Extensive data cleaning and checks were conducted on all datasets used in the project to ensure the accuracy and reliability of the predictive model. The main dataset provided by the corporation had missing values, specifically in the “Issuing_Country” column. To handle this, the missing data was filled with the value “None” as a workaround. Aside from this, no further missing data was found in the dataset, and all columns were considered meaningful, requiring no added data-cleaning tasks.

Moving on to the IATA code dataset, unnecessary columns such as “Airport ID”, “ICAO”, and “Altitude” were removed to streamline the dataset. Furthermore, any null values present were filled with the value “None” to ensure data completeness. In the case of the foot traffic and weather dataset, no columns needed to be eliminated since the required information was obtained through an online API, and only the relevant data was retrieved. Moreover, this dataset was found to be free from null values, eliminating the need for additional data-cleaning steps. Finally, for the public holiday dataset, all information was manually acquired by referring to various
sources, including Wikipedia. The only data cleaning step when integrating the data, the name of the airport is then used
performed on this dataset was filling any null values with the to replace the names of these locations. In the weather dataset,
value “No”, indicating that it was not a public holiday. the “Time” column applies the same logic as the
By meticulously addressing missing values, removing “Dep_Date_Time” column in the main dataset. Besides that,
unnecessary columns, and ensuring data completeness, the the values in the “WeatherCode” column are merely numbers,
datasets were prepared for further analysis and the which are quite perplexing to people as to what they signify,
development of a robust predictive model. so the meaning of each number is put into a Python dictionary
to replace all of the numbers with their respective meanings.
Algorithm 1: Data Cleaning (Main dataset)
1. Import panda’s library.
2. Read the Main dataset. Algorithm 5: Data Transformation (Weather dataset)
3. Check null value. 1. Import panda’s library.
4. Fill in null values in “Issuing_Country” with “None” 2. Read the weather dataset.
5. Export the cleaned dataset from the data frame into a CSV 3. Change “Time” from string to DateTime data type
file 4. Create a dictionary which inside contains the respective
meanings of each value in “WeatherCode”
Algorithm 2: Data Cleaning (IATA dataset) 5. Replace the value in “WeatherCode” by using the
1. Import panda’s library. dictionary that has been created
2. Read the IATA dataset. 3) Data Merging: Data merging is a time-consuming
3. Remove unwanted columns.
4. Check null value.
5. Fill in null values in “City” with “None.”
6. Export the cleaned dataset from the data frame into a CSV file

2) Data Transformation: Three datasets undergo data transformation: the main, foot traffic, and weather datasets. To prevent errors in later steps, the first step is to modify the data type of the “Acft_Type”, “Issue_Date” and “Dep_Date_Time” columns in the main dataset. Since each value in the "Acft_Type" column should be the name of an aircraft type rather than a number, the column is converted from float to string. When read, “Issue_Date” and “Dep_Date_Time” are strings; both are then converted to DateTime data types. Following that, the date and time are retrieved from the “Dep_Date_Time” column and placed in new columns called “Dep_Date” and “Dep_Time”. The goal of adding these two new columns is to save time during the data merging process. Based on the “Dep_Date” column, “dt.day_name()” is used to determine the day of the week of the departure date. In addition, departure times are divided into four categories based on the “Dep_Date_Time” column: early in the morning, during the day, during the afternoon, and in the evening.

task since three additional datasets, including the details of each IATA code, weather, foot traffic, and public holiday datasets, as well as some aircraft details, must be merged with the main dataset to conduct data analysis and uncover what the data can genuinely tell us.

The merging process begins by combining the IATA code dataset with the main dataset using an inner merge. Due to website restrictions, some of the data are obtained separately, so additional data merging tasks need to be conducted during or after the data crawling process. As the weather data is gathered from more than 50 nations, an empty data frame is generated before the data is crawled. The weather data is then combined using the “concat” function to create a single CSV file. The same approach is also used for the foot traffic data after it is crawled.

The first step is to perform a left join to merge the public holiday data with the main dataset. This join combines the two data frames using the “Dep_Date” and “Dep_city” columns as the common keys. A left join ensures that all the rows from the main dataset are retained in the merged data frame, and only the matching rows from the public holiday data are included. Before the merging process, a data type check is conducted to ensure that no data type mismatches could lead to errors.
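The left join on “Dep_Date” and “Dep_city” described for the public holiday data can be illustrated with a minimal pandas sketch. The rows, the “Passenger_ID” and “Holiday_Name” columns, and the frame names are hypothetical; only the two key columns and the use of a left join come from the text.

```python
import pandas as pd

# Hypothetical miniature of the main dataset; only the two key columns
# ("Dep_Date", "Dep_city") are taken from the paper.
main = pd.DataFrame({
    "Dep_Date": ["2023-01-01", "2023-01-02", "2023-01-02"],
    "Dep_city": ["Kuala Lumpur", "Kuala Lumpur", "Penang"],
    "Passenger_ID": [1, 2, 3],            # illustrative column
})

# Hypothetical public holiday table keyed on the same two columns.
holidays = pd.DataFrame({
    "Dep_Date": pd.to_datetime(["2023-01-01"]),
    "Dep_city": ["Kuala Lumpur"],
    "Holiday_Name": ["New Year's Day"],   # illustrative column
})

# Data type check before merging: here the date key is a string in one
# frame and a datetime in the other, so it is converted first.
main["Dep_Date"] = pd.to_datetime(main["Dep_Date"])
keys = ["Dep_Date", "Dep_city"]
for key in keys:
    assert main[key].dtype == holidays[key].dtype, f"dtype mismatch on {key}"

# Left join: every row of the main dataset is kept; holiday columns are
# filled only where both keys match and are NaN elsewhere.
merged = main.merge(holidays, how="left", on=keys)
print(merged)
```

Only the first row matches a holiday here, so the other two rows keep their original values and receive a null “Holiday_Name”, which is why a null check follows each merge in the procedure.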

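The transformation steps for the main dataset can be sketched as follows. The sample rows, the “Dep_Day” and “Dep_Period” column names, the category labels, and the hour boundaries (0, 6, 12, 18, 24) are assumptions made for illustration; the paper names the four categories but not their cut-off times.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the paper.
main = pd.DataFrame({
    "Acft_Type": [738.0, 320.0, 738.0, 333.0],
    "Issue_Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
    "Dep_Date_Time": [
        "2023-01-05 04:30:00",
        "2023-01-06 09:15:00",
        "2023-01-07 14:45:00",
        "2023-01-08 20:10:00",
    ],
})

# Aircraft type is a label, not a quantity: float -> string.
main["Acft_Type"] = main["Acft_Type"].astype(str)

# String -> datetime for both date columns.
main["Issue_Date"] = pd.to_datetime(main["Issue_Date"])
main["Dep_Date_Time"] = pd.to_datetime(main["Dep_Date_Time"])

# Split departure timestamp into date and time columns. The date is kept
# as a midnight-normalized datetime so the .dt accessor still works on it.
main["Dep_Date"] = main["Dep_Date_Time"].dt.normalize()
main["Dep_Time"] = main["Dep_Date_Time"].dt.time

# Day of the week of each departure date.
main["Dep_Day"] = main["Dep_Date"].dt.day_name()

# Four departure-time categories; the boundaries are assumed, not from the paper.
main["Dep_Period"] = pd.cut(
    main["Dep_Date_Time"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["Early Morning", "Day", "Afternoon", "Evening"],
).astype(str)

print(main[["Dep_Date", "Dep_Time", "Dep_Day", "Dep_Period"]])
```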
Algorithm 4: Data Transformation (Main dataset)
1. Import the pandas library.
2. Read the main dataset.
3. Change “Acft_Type” from float to string data type
4. Change “Issue_Date” from string to DateTime data type
5. Change “Dep_Date_Time” from string to DateTime data type
6. Retrieve the date and time from “Dep_Date_Time” and place them in new columns called “Dep_Date” and “Dep_Time”
7. Determine the day of each departure date.
8. Divide departure time in “Dep_Date_Time” into four categories: early in the morning, during the day, during the afternoon, and in the evening

Since the online API does not offer the necessary data for some airports, the foot traffic data of a site close to the airport is utilized for that particular airport. To minimize confusion

Algorithm 6: Data Merging (IATA code dataset)
1. Import the pandas library.
2. Read the IATA dataset.
3. Merge the IATA dataset with the main dataset using an inner merge.
4. Rename the columns after the merging process.
5. Iterate through the dataset.
6. Determine the type of flight of each passenger by checking both values in “Dep_country” and “Arr_country”. If both equal “Malaysia”, assign the string “Domestic” to “Type_of_flight”; otherwise assign “International”.

Similarly, the other additional datasets, such as weather, foot traffic, and aircraft data, are merged with the main dataset using left-join operations. After the merging process, some columns are renamed to improve readability. To ensure that the final dataset is accurate, complete, and reliable, cleaning tasks such as checking for and removing null values and

unnecessary columns are performed. These steps are taken to enhance the quality of the data for analysis and decision-making purposes. After the merging and cleaning processes, a comprehensive version of the main dataset is generated.

Algorithm 7: Data Merging (Weather dataset)
1. Import the pandas library.
2. Create an empty data frame.
3. Crawl weather data through API
4. Store weather data in a new data frame
5. Merge the empty data frame created at the beginning with the data frame containing crawled weather data.
6. Export the crawled weather data from the data frame into a CSV file.

Algorithm 8: Data Merging (Main dataset with Public Holiday data, Weather data, Foot Traffic data, and aircraft data)
1. Import the pandas library.
2. Read the main dataset.
3. Read the public holiday dataset.
4. Check the data type of each column.
5. Rename the columns.
6. Merge the data with the main dataset using the left join.
7. Check null value.
8. Remove unwanted columns.
9. Repeat steps 3 to 8 by replacing the public holiday dataset with the next dataset.
10. Export the final version of the dataset from the data frame into a CSV file.

C. Feature Selection
The final version of the main dataset consists of 52 columns, which is quite extensive. To optimize the performance and efficiency of a machine learning model, it is crucial to perform feature selection on the dataset. We also aim to identify the most relevant and informative features of this dataset. Before feature selection, the data is split into predictor and target variables. The target variable is the column that indicates the presence of passengers. The predictor variables are stored in a variable called X, while the target variable is stored in a variable called y. This study employed the Correlation-based Feature Selection (CFS) technique for feature selection. CFS evaluates feature subsets solely based on their intrinsic properties within the data. The top 20 features were selected using this technique, as shown in Table III.

TABLE III
TABLE OF TOP 20 FEATURES
No.  Feature              Score
1    Dep_Date             580371.66
2    Issuing_Country      58433.70
3    Type_of_flight       56373.87
4    Temperature          49009.35
5    SeaLevelPressure     37920.54
6    Business_Class       33773.07
7    Cargo_Cap(kg)        30938.80
8    Max_TOWeight(kg)     29982.75
9    Max_LandWeight(kg)   29908.23
10   WindDirection        29069.74
11   Max_Speed(km/h)      28625.15
12   Fuel_Cap(kg)         28344.75
13   Aircraft_Manu        28233.18
14   Total_Seats          28141.63
15   Acft_Type            27580.18
16   Econ_Class           26896.28
17   EconELR_Class        17740.58
18   Aircraft_Name        17346.74
19   Dep_Date_Time        13775.21
20   BusinessSuite        12574.52

D. Model Construction
This study involves the construction of four classification models using different machine learning algorithms: Neural Networks, Decision Tree, Gradient Boosting, and Random Forest. The Neural Network classifier is implemented using Keras. It begins by specifying the input shape as (20,), indicating that only the top 20 features are fed into the model. The architecture comprises two hidden layers with 32 and 20 units respectively, utilizing the ReLU activation function. The output layer consists of a single unit with sigmoid activation. The model is compiled with the Adam optimizer, binary cross-entropy loss, and accuracy as the metric. The model is then trained on the training set for 10 epochs using a batch size of 32.

For the Decision Tree classifier, the scikit-learn library is employed. The classifier is initialized as a decision tree object, labeled as clf, and subsequently trained on the training data. The Gradient Boosting Classifier is also implemented with scikit-learn. The classifier is initialized with 100 estimators (decision trees) and a learning rate of 0.1. It is assigned a maximum depth of 3, which controls the complexity of the individual decision trees within the ensemble. The Random Forest Classifier is implemented using scikit-learn as well. An instance of the Random Forest Classifier is created with 100 estimators (decision trees) and a random state of 42. The random state ensures the reproducibility of the results across different runs.

All four models are then fitted to the training data, utilizing the top 20 selected features based on CFS (Correlation-based Feature Selection). Subsequently, predictions are made on the testing data, and various evaluation metrics, including accuracy, F1 score, precision, and recall, are calculated. Finally, these metrics are printed to the console to assess each model's performance.

III. RESULTS AND DISCUSSION
The results obtained from evaluating the different classification models provide valuable insights into their performance. The Neural Networks model achieved an accuracy of 67.6%, which is the lowest among the models. Additionally, its precision value of 45.6% indicates that it struggles to accurately identify positive instances, while the recall value of 67.6% implies that it captures only a moderate proportion of the actual positive instances. Overall, the F1 score of 54.5% suggests a mediocre performance for this model, as shown in Table IV.

On the other hand, the Decision Tree model demonstrates superior performance. With an accuracy of 90.2%, precision of 80.8%, recall of 91.7%, and an impressive F1 score of 85.9%, it outperforms the Neural Networks model in all evaluation metrics. These results indicate that the Decision

Tree model accurately identifies positive instances and captures a significant proportion of the actual positives, resulting in a balanced and robust performance. Similarly, the Gradient Boosting model exhibits favorable results. It achieves an accuracy of 86.5%, a precision value of 75.4%, a recall value of 86.8%, and an F1 score of 80.7%. Although slightly lower than the Decision Tree model, the Gradient Boosting model still demonstrates a strong ability to identify positive instances correctly and captures a good proportion of the actual positives.

TABLE IV
TABLE OF MODEL RESULTS
Model              Accuracy (%)  F1 Score (%)  Precision (%)  Recall (%)
Neural Network     67.6          54.5          45.6           67.6
Decision Tree      90.2          85.9          80.8           91.7
Gradient Boosting  86.5          80.7          75.4           86.8
Random Forest      90.4          86.2          80.7           92.4

Lastly, the Random Forest model showcases the highest accuracy among the models, with a value of 90.4%. It also attains a precision of 80.7%, a recall of 92.4%, and an F1 score of 86.2%, reflecting a balanced and commendable performance. These results highlight the Random Forest model's capability to accurately identify positive instances while capturing a substantial proportion of the actual positives.

In summary, the evaluation of the classification models reveals that the Decision Tree, Gradient Boosting, and Random Forest models consistently outperform the Neural Networks model in terms of accuracy, precision, recall, and F1 score. Among them, the Random Forest model has the highest accuracy and recall. These results suggest that the Random Forest and Decision Tree models are particularly well-suited for the given classification task, while the Neural Networks model may benefit from further improvements or adjustments.

IV. CONCLUSION
In conclusion, this study intended to thoroughly investigate the issues brought on by no-shows across several industries, focusing on the aviation sector. The study aimed to uncover the factors influencing the occurrence of no-shows and their ramifications for the industry by analyzing a real-life dataset provided by a local airline company. The findings of this study contribute to the existing body of knowledge regarding the detrimental effects of no-shows. Numerous insights into the underlying reasons and their effects on operational effectiveness, revenue management, and customer satisfaction have been gained by examining the causes and effects of no-shows in numerous industries. The study emphasizes the importance of resolving the issue as soon as possible to offset its negative consequences on many industries, particularly the airline industry, which relies largely on efficient scheduling and capacity utilization.

Furthermore, by examining the approaches taken by researchers in other sectors to overcome the challenges posed by no-shows, this study has identified potential strategies and techniques that can be adapted and implemented within the airline industry. Understanding and applying successful practices from other sectors can contribute to developing effective countermeasures, such as proactive communication, dynamic pricing, and overbooking management, to minimize the impact of no-shows on operational planning and resource allocation.

In addition to investigating the causes and consequences of no-shows, this research has constructed and evaluated four classification models: Neural Networks, Decision Trees, Gradient Boosting, and Random Forest. The accuracy rates achieved by each model provide insights into their effectiveness in predicting no-show incidents. Notably, the Decision Tree and Random Forest models exhibited high accuracy rates of 90.2% and 90.4%, respectively, suggesting their potential applicability in developing predictive models for identifying passengers at a higher risk of no-shows.

However, it is important to acknowledge that the accuracy of the models could be further improved by incorporating additional passenger information. This study recommends obtaining more comprehensive data, including variables such as age, home address, gender, travel history, and booking patterns, to enhance the predictive power of the models. By considering a broader range of factors that influence no-show incidents, the models can better capture the complexity of passenger behavior and provide more accurate predictions, enabling airlines to implement targeted measures to minimize the occurrence of no-shows.

In conclusion, this research paper has comprehensively examined the problems posed by no-shows in different industries, particularly the airline sector. By exploring the causes and consequences of no-show incidents and analyzing existing strategies from various sectors, valuable insights have been gained. Additionally, the construction and evaluation of classification models has demonstrated promising results, indicating the potential for accurate prediction of no-show incidents. However, further research is needed to incorporate additional passenger information and improve the models' accuracy. By doing so, the airline industry can better understand and address the challenges of no-shows, ultimately enhancing operational efficiency, revenue management, and customer satisfaction.
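As an illustration of the Model Construction subsection, the three scikit-learn classifiers can be set up with the hyperparameters stated there. This is a sketch under stated assumptions rather than the authors' code: the data is synthetic (20 features standing in for the selected top 20), the 80/20 split and its seed are assumed, the metrics use scikit-learn's default binary averaging because the paper does not state how they were averaged, and the Keras network is left out to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the airline data: 20 features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
# The split ratio is an assumption; it is not given in this section.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hyperparameters as described in the text.
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

The scores printed for synthetic data say nothing about the airline dataset; the sketch only shows how the four reported metrics are obtained from a fitted model's predictions on the test split.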

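As a quick consistency check on Table IV, the F1 score can be recomputed as the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The sketch below uses only the values from the table; how the authors averaged the metrics across classes is not stated, so agreement here is a sanity check rather than a reproduction.

```python
# (precision %, recall %, reported F1 %) taken from Table IV.
reported = {
    "Neural Network": (45.6, 67.6, 54.5),
    "Decision Tree": (80.8, 91.7, 85.9),
    "Gradient Boosting": (75.4, 86.8, 80.7),
    "Random Forest": (80.7, 92.4, 86.2),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


recomputed = {}
for model, (precision, recall, f1) in reported.items():
    recomputed[model] = f1_from(precision, recall)
    print(f"{model}: reported {f1:.1f}, recomputed {recomputed[model]:.1f}")
```

For all four models the recomputed value rounds to the reported F1 score, so the table is internally consistent under this formula.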