Machine Learning

Uploaded by

Nehemiah Ganta
Copyright
© All Rights Reserved

PRICE PREDICTION AND RECOMMENDATION OF

AIRBNB PROPERTY LISTINGS


Project report submitted in partial fulfillment of the requirements for the
award of the degree of

BACHELOR OF TECHNOLOGY
In

COMPUTER SCIENCE AND ENGINEERING


By
G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552
Under the Guidance of

V.SHARIF

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SIR C R REDDY COLLEGE OF ENGINEERING
Approved by AICTE & Accredited by NBA
Affiliated to Jawaharlal Nehru Technological University,
Kakinada ELURU-5340007
A.Y.2022-23

SIR C R REDDY COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
This is to certify that the project report entitled “PRICE PREDICTION AND RECOMMENDATION
OF AIRBNB PROPERTY LISTINGS” being submitted by

G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552

in partial fulfillment for the award of the Degree of Bachelor of Technology in Computer Science
and Engineering to the Jawaharlal Nehru Technological University, Kakinada is a record of
bonafide work carried out under my guidance and supervision.

Dr. M. Krishna, M.Tech, Ph.D                Dr. A. YESUBABU, M.Tech, Ph.D
Professor                                   Head of the Department

External Examiner

DECLARATION

I hereby declare that the Project entitled “PRICE PREDICTION AND RECOMMENDATION OF
AIRBNB PROPERTY LISTINGS” submitted for the B.Tech Degree is my original work and the
Project has not formed the basis for the award of any degree, associateship, fellowship or
any other similar titles.

Place: ELURU                                PROJECT TEAM MEMBERS
Date:
G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552

ACKNOWLEDGEMENT

I express my sincere thanks to the Principal, Dr. K. VENKATESWARA RAO, for providing the
necessary infrastructure required for the project.

I would like to thank Dr. A. YESU BABU, Head of the Department of CSE, for providing the
necessary facilities and for his guidance towards completing the project in the specified time.

I am grateful to V. Sharif, Assistant Professor, Department of CSE, and project guide, for
providing the necessary facilities and for his guidance in the efficient completion of the
project in the specified time.

I express my deep-felt gratitude to Dr. N. Deepak, Associate Professor, Department of CSE,
whose valuable guidance and unstinting encouragement enabled us to accomplish our project in
time.

I am extremely grateful to my department staff members and teammates who helped me in the
successful completion of this project.

PROJECT TEAM MEMBERS

G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552

ABSTRACT

In today's rapidly evolving world of science, technology, and global connectivity, travel has reached
unprecedented levels, with people exploring various destinations for business and personal needs.
Finding suitable lodging away from home is a crucial aspect of any trip. While hotels and motels have
long been the go-to choice for travelers, the accommodation landscape has been transformed by
Airbnb, initially known as AirBed & Breakfast. The company has become a favored alternative to
traditional hotels, offering a unique experience for travelers. Over time, Airbnb has gained immense
popularity, often surpassing conventional hotel options as the preferred choice for accommodation.
The Airbnb business model is a two-sided marketplace that serves both property owners and guests.
Property owners offer their homes or rental properties on the platform, while guests book these
properties for a specified period. Airbnb charges a service fee to both the guest and the property
owner for each booking. In this dataset, each row represents a listing with details such as
coordinates, neighborhood, host id, price per night, number of reviews, and so on. Purpose of using
this dataset: the dataset primarily focuses on providing insights into the Airbnb model. Various data
analysis models can be applied to it to derive more meaningful outcomes from Airbnb listings.

By applying machine learning algorithms and data visualization techniques, the business strategy can
be analyzed. For instance, demand and pricing depend on a listing's rating, customer reviews, holiday
seasons, etc. Analyzing these aspects will in turn help the stakeholders and customers of the Airbnb
ecosystem make efficient decisions and achieve consistent long-term profitability. Method to be
employed: implementing a machine learning method is a crucial step towards harnessing the potential
of data-driven decision making. On this dataset we plan to implement the following machine learning
methodologies:

1. Prediction of pricing using regression methods such as Linear Regression, Decision Tree
Regression, and Random Forest Regression, with input features such as property, size, location,
neighborhood, and many more.

2. Clustering will be used for dimension reduction of the dataset, and Cosine Similarity will be
used to provide personalized recommendations for Airbnb listings based on user preferences.
Additionally, we integrate hypothesis testing into our methodology as a statistical tool for
decision making, helping both property owners and customers optimize their listings and enhance
the overall Airbnb experience.

Data sourcing: The data was downloaded from Kaggle using the following link:
https://www.kaggle.com/datasets/deeplearner09/airbnb-listings/data
It is used for an Exploratory Data Analysis study, since we use the data to provide analysis and
recommendations.

Keywords: Clustering, Cosine Similarity, Linear Regression, Machine learning algorithms, Data
visualization techniques

TABLE OF CONTENTS

S.NO    TITLE

1       INTRODUCTION
2       LITERATURE SURVEY
3       EXISTING SYSTEM
          DRAWBACKS OF EXISTING SYSTEM
4       PROPOSED SYSTEM
          ADVANTAGES OF PROPOSED SYSTEM
          PROBLEM STATEMENT
5       REQUIREMENT ANALYSIS
          FUNCTIONAL REQUIREMENTS
          NON-FUNCTIONAL REQUIREMENTS
6       IMPLEMENTATION
          DATA CLEANING / PRE-PROCESSING AND IMPUTATION
          VISUALIZATIONS
            ■ GEOGRAPHICAL DISTRIBUTION OF LISTINGS
            ■ RELATIONSHIP BETWEEN PRICES AND RATINGS
          MACHINE LEARNING MODELS
            ■ LINEAR REGRESSION
            ■ DECISION TREE
            ■ RANDOM FOREST
          RECOMMENDATIONS
            ■ CLUSTERING
7       TESTING
8       RESULTS AND DISCUSSION
9       CONCLUSION
10      REFERENCES

LIST OF FIGURES

FIG NO.   NAME

6.2.1     Count of ratings and reviews based on room type
6.2.2     Count of number of reviews by last review year
6.2.3     Count of rating by neighbourhood
6.2.4     Geographical distribution of Airbnb listings
6.2.5     Relationship between Ratings and Price features
6.2.6     Correlation matrix for different price ranges
6.2.7     Histogram of price distribution for entire home/apt
6.3.1     Mean squared error
6.3.2     Scatter plot
6.3.3     Decision tree regression
6.3.4     Actual prices vs predicted prices
6.3.5     Random forest regression: prediction vs actual prices
7.1       Density plot of prices in neighbourhood 78704 and 78702

1. INTRODUCTION

The Airbnb business model is a two-sided marketplace that serves both property owners and
guests. Property owners offer their homes or rental properties on the platform, while guests
book these properties for a specified period. Airbnb charges a service fee to both the guest
and the property owner for each booking. Airbnb has gained immense popularity, often
surpassing conventional hotel options as the preferred choice for accommodation.

In the dataset used here, each row represents a listing with details such as coordinates,
neighborhood, host id, price per night, number of reviews, and so on. The dataset primarily
focuses on providing insights into the Airbnb model, and various data analysis models can be
applied to it to derive more meaningful outcomes from Airbnb listings. Implementing a machine
learning method is a crucial step towards harnessing the potential of data-driven decision
making. By applying machine learning algorithms and data visualization techniques, the
business strategy can be analyzed.

This project revolves around predicting Airbnb listing prices, a critical task in the dynamic
landscape of short-term property rentals. Airbnb, as a leading platform in the travel and
hospitality industry, hosts a multitude of listings with diverse attributes.

The project's primary objective is to develop a robust machine learning model capable of
accurately forecasting listing prices based on features like location, property type, and
amenities. By doing so, the project aims to offer valuable insights to both hosts and potential
guests, enabling hosts to optimize their offerings and aiding guests in making well-informed
accommodation decisions.

Additionally, visualizations will be employed to enhance data exploration and interpretation.
Leveraging a comprehensive dataset of Airbnb listings, the project will follow a systematic
methodology encompassing data preprocessing, exploratory data analysis, and model
development. The anticipated outcomes include an accurate predictive model, actionable
insights for hosts, and enhanced decision-making capabilities for potential guests.

2. LITERATURE SURVEY

author

3. EXISTING SYSTEM
Existing systems apply different machine learning models to large datasets with many columns
and data listings.

4. PROPOSED SYSTEM

5. REQUIREMENT ANALYSIS

Hardware requirements:

Processor : Intel Core
RAM : 4GB

Software requirements:

Programming Languages : Python (3.11 version preferred)
Kernel : python3 (ipykernel)
Operating System : Windows 11
Editor : Jupyter Notebook / VS Code
Frameworks :

DATA PROCESSING

Identifying the data:

Data can be classified as either qualitative (categorical) or quantitative (numerical), and is
further classified as follows.

Categorical: can be nominal, ordinal, or binary. Categorical data is used to classify items or
characteristics into groups based on specific attributes or qualities.

Binary: These columns have only two unique values, typically representing binary categories
such as 0 and 1. The code identifies columns with two unique values and categorizes them as
categorical binary. Although the values 0 and 1 are numerical, arithmetic operations on them
are not meaningful, so these columns are treated as categorical data.

Discrete data: These are distinct or separate values that can be counted. They are whole numbers
or integers and cannot be subdivided into smaller pieces. Examples: total students in a class,
number of products.

Continuous data: These are numeric values that form a continuous range and can be measured.
They take the form of fractions or decimals and can be subdivided into smaller pieces.
Examples: temperature readings, age.
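
The rule described above (two unique values means binary, whole numbers mean discrete, decimals mean continuous) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the helper name classify_columns and the toy DataFrame are assumptions made for this example.

```python
import pandas as pd


def classify_columns(df: pd.DataFrame) -> dict:
    """Group column names into categorical, binary, discrete and continuous."""
    groups = {"categorical": [], "binary": [], "discrete": [], "continuous": []}
    for col in df.columns:
        series = df[col].dropna()
        if series.nunique() == 2:
            # Exactly two unique values (for example 0 and 1): binary.
            groups["binary"].append(col)
        elif pd.api.types.is_float_dtype(series):
            groups["continuous"].append(col)
        elif pd.api.types.is_integer_dtype(series):
            groups["discrete"].append(col)
        else:
            # Strings, dates stored as objects, and so on.
            groups["categorical"].append(col)
    return groups


# Toy data invented for illustration; only the column names come from the report.
sample = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Hotel room"],
    "studio": [0, 1, 0],
    "beds": [2, 1, 3],
    "ratings": [4.9, 4.5, 4.7],
})
groups = classify_columns(sample)
print(groups)
```

A purely dtype-based rule cannot recognise integer identifiers such as id or host_id as categorical, so columns of that kind still have to be assigned by hand.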

Based on the above definitions, we divided the columns in the dataset into the following categories.

Categorical:
inn_name: This column contains non-numeric (string) values stored as the object datatype.
host_name: The name of the listing's host, stored as a string (object datatype).
id: An integer identifier for the listing; although numeric, it is treated as categorical data.
host_id: An integer identifier for the listing's host; although numeric, it is treated as
categorical data.
neighbourhood: The numeric values here represent neighbourhood codes and are considered to be
categorical values.
room_type: Represents the type of room; it is categorical and stored as text descriptions.
last_review: Contains date values in yyyy-mm-dd format, stored as objects.

Binary:
studio: An integer column holding the binary values 0 and 1, treated as a numerical categorical
value meaning False or True respectively.
shared_bath: Holds the binary values 0 and 1, treated as a numerical categorical value meaning
False or True respectively.
private_bath: Holds the binary values 0 and 1, treated as a numerical categorical value meaning
False or True respectively.

Continuous:
ratings: Continuous numerical values representing ratings, which are of float datatype.
latitude and longitude: Continuous numerical values representing the geographical coordinates,
which are of float datatype.
reviews_per_month: Continuous numerical values representing the average number of reviews
per month.

Discrete or Numerical:
bedrooms: Integer values representing numerical discrete data.
beds: Integer values representing numerical discrete data.
baths: Integer values representing numerical discrete data.
minimum_nights: Integer values representing numerical discrete data.
number_of_reviews: Integer values representing numerical discrete data.
calculated_host_listings_count: Integer values representing numerical discrete data.
availability_365: Integer values representing numerical discrete data.
number_of_reviews_ltm: Integer values representing counts or quantities.
price: Integer datatype representing the price, treated as numerical discrete data.

IMPUTATION PROCESS

Missing data is a common issue in real-world datasets and can arise for various reasons, such as
data entry errors or intentional omission. To overcome this issue, imputation is used during
data cleaning and preprocessing.
It involves filling in missing values with the median for numerical columns and the mode for
categorical columns. Imputation preserves data integrity, which in turn supports more robust
analysis and modeling.
Here, the null values in the numerical columns are replaced with the median value.

Overview of the missing value data:

Handling the null and missing values:
1. The 4222 null values in the ‘ratings’ column are imputed with the median value of 4.89.
2. The 3103 null values in the 'reviews per month' column are imputed with the median
value of 0.99.
3. The null values in the ‘last review’ column are not imputed. Some properties may not
have received any reviews because they are relatively new to the platform, are situated
in remote locations, have amenities that are not up to the mark, or for other external
reasons. Additionally, guests may have visited but chosen not to leave a review. We
believe that leaving these values as NULL accurately reflects these practical scenarios.
4. The two null values in the host name column are imputed with ‘Unknown’. A few rows
already use Unknown as the host name, so we took that as a reference and filled the two
nulls in the same way.
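
The four handling rules above could be expressed in pandas along the following lines. This is a sketch on a tiny made-up DataFrame, not the project's code; the values are invented, so the medians here differ from the 4.89 and 0.99 reported for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; column names follow the report.
listings = pd.DataFrame({
    "ratings": [4.9, np.nan, 4.5, 5.0],
    "reviews_per_month": [1.0, 0.5, np.nan, 2.0],
    "host_name": ["Ana", None, "Raj", "Lee"],
    "last_review": ["2023-01-05", None, "2022-11-20", None],
})

# Numerical columns: replace nulls with the column median.
for col in ["ratings", "reviews_per_month"]:
    listings[col] = listings[col].fillna(listings[col].median())

# Host name: reuse the 'Unknown' placeholder for missing names.
listings["host_name"] = listings["host_name"].fillna("Unknown")

# last_review is deliberately left untouched: a missing date means no review yet.
print(listings.isna().sum())
```

Only last_review should still report missing values after this step.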

Data Extraction:
1. A function utilizing regular expressions was implemented to extract the numerical
values associated with each room type. The extracted data was stored in dictionaries
for further processing.
2. Data Enrichment and Column Creation: The extracted data was used to create new
columns in the DataFrame. 'bedrooms', 'beds', 'baths': these columns were initialized
with default values and updated using the extracted numerical data.
3. 'studio', 'shared_bath', 'private_bath': Binary columns indicating the presence or absence
of these room types based on the extracted indices.
4. 'ratings': Extracted numeric ratings were added to a new column after converting
'★' symbols to numeric values.
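
A regular-expression extraction of this kind might look like the sketch below. The exact text format of the listing names is not shown in this report, so the sample strings, the patterns, and the helper name parse_listing_name are all assumptions made for illustration.

```python
import re

import pandas as pd

# Assumed name format (hypothetical):
# "Home in Austin · ★4.84 · 2 bedrooms · 3 beds · 2 baths"
NUMBER = r"(\d+(?:\.\d+)?)"
COUNT_PATTERNS = {
    "bedrooms": re.compile(NUMBER + r"\s+bedrooms?"),
    "beds": re.compile(NUMBER + r"\s+beds?\b"),
    "baths": re.compile(NUMBER + r"\s+(?:shared\s+|private\s+)?baths?"),
}
RATING_PATTERN = re.compile(r"★\s*" + NUMBER)


def parse_listing_name(name: str) -> dict:
    """Extract room counts, binary room flags and the star rating from a name."""
    lower = name.lower()
    parsed = {}
    for key, pattern in COUNT_PATTERNS.items():
        match = pattern.search(lower)
        # Default of 0 when the name does not mention the room type.
        parsed[key] = float(match.group(1)) if match else 0.0
    parsed["studio"] = int("studio" in lower)
    parsed["shared_bath"] = int(re.search(r"shared\s+baths?", lower) is not None)
    parsed["private_bath"] = int(re.search(r"private\s+baths?", lower) is not None)
    rating = RATING_PATTERN.search(name)
    # No numeric rating after the star (for example "★New") stays missing.
    parsed["ratings"] = float(rating.group(1)) if rating else None
    return parsed


names = pd.Series([
    "Home in Austin · ★4.84 · 2 bedrooms · 3 beds · 2 baths",
    "Guest suite in Austin · ★New · Studio · 1 bed · 1 private bath",
])
features = pd.DataFrame([parse_listing_name(n) for n in names])
print(features)
```

Listings without a numeric rating end up as missing values in the new ratings column, which is consistent with the ratings column needing imputation afterwards.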

Data Refinement and Transformation:


The 'ratings' column was further refined by converting symbols to numeric values, while the
'last_review' column was converted to a datetime format for better analysis.

VISUAL REPRESENTATION AND INSIGHTS

In our project, we leverage visualizations as powerful tools to enhance the clarity and
interpretability of our findings. Through charts, graphs, and maps, we aim to simplify complex
patterns and trends and draw meaningful insights from the visualizations. Visual
representations of geographical distribution, property characteristics, and price trends will not
only enhance interpretability but also empower hosts and users to make informed decisions.

In our project, we used Tableau, Python, and Microsoft Fabric (which incorporates Power BI) as
visualization tools to represent the data visually.
Tableau is a versatile data visualization tool that allows users to create interactive and shareable
dashboards. Its user-friendly interface makes it accessible for both technical and non-technical
users, enabling the creation of compelling visualizations without extensive coding.

Python, with libraries like Matplotlib and Seaborn, serves as a robust programming language for
data visualization. Python's flexibility and extensive libraries make it a preferred choice for
customizing visualizations and creating complex plots. Its integration with data analysis and
machine learning tools further enhances its capabilities.

Microsoft Fabric is Microsoft's unified analytics platform, which brings data integration, data
engineering, data warehousing, and business intelligence together in one service. While Fabric
itself is not a standalone visualization tool, Microsoft Power BI is incorporated in it. Power BI
is a business analytics tool that facilitates interactive visualizations and business intelligence
with an intuitive drag-and-drop interface. It integrates with various data sources, making it
convenient for users to transform data into insightful visuals, reports, and dashboards. We
explored it because it is an emerging platform and gave us the opportunity to learn a new
application.

Some of the visualizations we implemented are described below.

Count of ratings and reviews based on Room type:

Entire home/apt listings have more ratings and reviews than any other room type, while hotel
rooms have the fewest ratings and reviews.

Visualizing the count of ratings and the number of reviews based on room types provides
valuable insights into the popularity and satisfaction levels across different accommodation
offerings. We created bar and line charts to depict the distribution of ratings and reviews
among the various room types. This visual representation helps stakeholders quickly grasp the
count of ratings associated with each room type, providing a comprehensive overview of customer
requirements and choices. Additionally, the number of reviews for each room type can be
visualized to understand the level of engagement and feedback received, allowing for strategic
decision-making in the hospitality industry. These visualizations enhance data-driven decision
capabilities, enabling businesses to tailor their offerings based on customer preferences and
experiences.
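
The numbers behind a chart of this kind can be produced with a single pandas aggregation. The sketch below uses a handful of invented rows and assumes that "count of ratings" means the number of non-null ratings and that reviews are summed per room type; the project's charts may have been built differently.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the report.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Hotel room"],
    "ratings": [4.9, 4.7, None, 4.2],
    "number_of_reviews": [120, 45, 10, 3],
})

summary = (
    listings.groupby("room_type")
    .agg(
        ratings_count=("ratings", "count"),          # non-null ratings per room type
        total_reviews=("number_of_reviews", "sum"),  # reviews summed per room type
    )
    .sort_values("total_reviews", ascending=False)
)
print(summary)
```

The resulting table has one row per room type and can be handed to any charting tool to draw the bar and line charts.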

Count of number of reviews by last review year:

Looking at the count of reviews by last review year, the count stays almost level from 2012 to
2021, with only slight increases in the middle years. From 2021 the number of reviews starts
increasing, and from 2022 there is a drastic increase, which suggests that the number of
customers grew substantially from that year. This is a useful starting point for asking what
drew customers to Airbnb listings from that year onwards, and for predicting future trends.

Visualizing the count of reviews based on the last review year offers a perspective on customer
feedback trends over the years. We used a line chart to illustrate how the number of reviews has
evolved over different years. This visualization aids in identifying patterns, seasonality, or shifts
in customer engagement, enabling businesses to make data-driven decisions. It serves as a
powerful tool for hospitality industry professionals to adapt their strategies and offerings in
response to changing customer sentiments over time.
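
A yearly series like the one plotted in the line chart can be derived as sketched below, assuming last_review is parsed as a date and the reviews are summed per year. The rows are invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the report.
listings = pd.DataFrame({
    "last_review": ["2021-06-01", "2022-03-15", "2022-12-30", None, "2023-01-02"],
    "number_of_reviews": [5, 20, 35, 0, 50],
})

# Parse the yyyy-mm-dd strings; missing dates become NaT.
listings["last_review"] = pd.to_datetime(listings["last_review"])

# Listings that were never reviewed have no year and are left out.
reviewed = listings.dropna(subset=["last_review"]).copy()
reviewed["review_year"] = reviewed["last_review"].dt.year

reviews_by_year = reviewed.groupby("review_year")["number_of_reviews"].sum()
print(reviews_by_year)
```

Plotting reviews_by_year against its index gives the year-over-year trend.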

Count of rating by neighborhood:

Neighborhood 78704 has the highest count of ratings with 2233, followed by neighborhood
78702. In this way we can identify which areas' listings are most occupied and have a high
count of ratings, which means many customers are visiting those regions. This helps owners
understand demand: they can increase prices or add listings in those regions, and identify and
improve the quality of listings in regions with a low count of ratings.

Analyzing the count of ratings by neighborhood through visualizations provides a localized
perspective on customer experiences. We used a bar graph to highlight the distribution of ratings
across different neighborhoods. This visualization helps businesses identify areas with
consistently high or low ratings, offering valuable insights into the perceived quality of
accommodation in specific regions. By understanding the variations in ratings by
neighborhood, businesses in the hospitality industry can tailor their services, marketing, and
improvement efforts to meet the specific preferences and expectations of customers in each area.

Geographical distribution of Airbnb Listings: The geographical distribution of Airbnb
listings is shown through Folium maps in Python. Folium is used to create a geographical map
on which each marker represents the location of an Airbnb listing. This visualization
provides an overview of the geographical distribution of listings, and we can zoom in and
out to explore different geographical areas.

Relationship between Ratings and Price features: The scatter plot shows the
relationship between ratings and price. This helps in understanding how customers
perceive the relationship between the price and rating of Airbnb listings. We can see that
higher ratings are concentrated in the price range from 0 to 2500, which suggests that
customers are likely to opt for listings with low prices.

Correlation matrix for different price ranges:

To understand the correlation between the independent variables (features) and the target
variable price, the price is categorized into different ranges based on the minimum, maximum,
and median prices, classified as follows:

■ Very Low Price (0 to 95)
■ Low Price (96 to 150)
■ Medium Price (151 to 300)
■ High Price (301 to 600)
■ Very High Price (601 to 1000)
■ Exorbitantly High Price (>1000)

The correlation matrix is plotted using the Seaborn (sns) heatmap for each price category defined
above, to understand how the parameters correlate with each other.

The matrix also reveals that the target variable price does not depend on a single feature;
rather, multiple features dictate the price of a listing. This leads us to apply regression
techniques.

● Positive correlation: The features bedrooms, baths, beds, number_of_reviews,
reviews_per_month, and number_of_reviews_ltm are positively correlated with each other
in each of the price categories. In the exorbitantly high price category, 'private_bath'
is an additional positively correlated feature, which suggests that customers also
prioritize this feature when looking for very high-end listings. The intensity of red
indicates such correlation.
● Negative correlation: The features studio, shared_bath, and private_bath are negatively
correlated with bedrooms, beds, and baths. The intensity of blue indicates such
correlation.
● No correlation: There are also multiple loosely correlated variables, which are
indicated by neutral colors.
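
The price ranges and per-category correlation matrices described in this section can be reproduced roughly as follows. The data is synthetic, the short labels and the helper name categorize_price are assumptions of this sketch, and the bin edges assume the price ranges do not overlap (so 600 falls in the High category).

```python
import numpy as np
import pandas as pd

# Bin edges chosen so that 0-95, 96-150, 151-300, 301-600, 601-1000 and >1000
# map onto the six categories for whole-number prices.
PRICE_BINS = [0, 95, 150, 300, 600, 1000, np.inf]
PRICE_LABELS = ["Very Low", "Low", "Medium", "High", "Very High", "Exorbitantly High"]


def categorize_price(prices: pd.Series) -> pd.Series:
    """Assign each price to one of the six price categories."""
    return pd.cut(prices, bins=PRICE_BINS, labels=PRICE_LABELS, include_lowest=True)


# Synthetic listings, generated only to make the example runnable.
rng = np.random.default_rng(0)
n = 500
bedrooms = rng.integers(1, 6, size=n)
listings = pd.DataFrame({
    "bedrooms": bedrooms,
    "beds": bedrooms + rng.integers(0, 3, size=n),
    "number_of_reviews": rng.integers(0, 200, size=n),
    "price": (60 * bedrooms + rng.normal(0, 80, size=n)).clip(20, None).round(),
})
listings["price_category"] = categorize_price(listings["price"])

# One Pearson correlation matrix per price category that actually occurs.
correlations = {
    label: group.drop(columns="price_category").corr()
    for label, group in listings.groupby("price_category", observed=True)
    if len(group) > 2
}
print(correlations["Medium"].round(2))
```

Each matrix in correlations can then be drawn as a heatmap, with a diverging colour map giving the red and blue reading described in the bullets.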

Visualizations from Tableau:

Histogram of Price distribution for Entire home/apt:

A significant number of Entire home/apt listings are priced below $300, with the $100 to $200
range being the most common. There is also a smaller number of properties priced above $1000.
The distribution is positively skewed: its tail extends toward higher prices.
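
Positive skewness can also be checked numerically rather than read off the histogram. The sketch below uses ten invented prices, not the project's data, to show the two usual signs: a skewness coefficient above zero and a mean pulled above the median by a few expensive listings.

```python
import numpy as np
import pandas as pd

# Ten invented prices with one very expensive listing in the tail.
prices = pd.Series([80, 110, 120, 150, 160, 180, 190, 250, 400, 1200])

# Histogram-style counts per price range.
bins = [0, 100, 200, 300, 1000, np.inf]
counts = pd.cut(prices, bins=bins).value_counts().sort_index()
print(counts)

skewness = prices.skew()
print("skewness:", round(skewness, 2))
print("mean:", prices.mean(), "median:", prices.median())
```

A long right tail is also one reason why the median, rather than the mean, is a reasonable choice when imputing skewed columns.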

Machine learning Models
1. Linear Regression
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable we want to predict is called the dependent variable. The variable we are
using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables, that best predict the value of the dependent variable. Linear regression fits
a straight line or surface that minimizes the discrepancies between predicted and actual output
values. There are simple linear regression calculators that use a “least squares” method to
discover the best-fit line for a set of paired data. We then estimate the value of Y
(dependent variable) from X (independent variable).
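
For reference, the least squares fit described above can be written as follows. The notation is introduced here for illustration and does not appear elsewhere in this report: y_i is the price of listing i, x_{ij} is the value of feature j for that listing, n is the number of listings, and p is the number of features.

```latex
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
(\hat{\beta}_0, \ldots, \hat{\beta}_p)
  = \arg\min_{\beta_0, \ldots, \beta_p}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```

With a single feature (p = 1) this reduces to the familiar best-fit straight line through the paired data.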

Implementation of Linear Regression for Price Prediction Model

The Python code imports the necessary libraries and loads the Airbnb dataset. It then
calculates the Pearson correlation coefficients between the features and the target variable,
'price', and identifies the three most positively and the three most negatively correlated
features. Because the negative correlation coefficients are close to zero, those features have
little impact on the target variable, so we selected the three positively correlated features
whose coefficients are closest to 1. A linear regression model is then created using these
three features. The model is trained on the training data, and predictions are made on the
test set. The linear regression model assumes a linear relationship between the selected
features and the target variable, 'price'.

Feature importance is derived from the correlation analysis, where the three features most
positively correlated with the target variable, 'price', are identified. In the context of
linear regression, the emphasis is on identifying the features that contribute most to the
prediction model. In this case, the features selected for the linear regression model are
deemed important because they are believed to have a strong linear relationship with the target
variable.

We calculated the Mean Squared Error (MSE), a measure of the average squared difference
between the predicted and actual values on the test data. Lower MSE values indicate better
model performance. Additionally, we computed the R-squared value, which measures the
proportion of the variance in the target variable that is explained by the model. A higher
R-squared value signifies a better fit. The output shows the MSE and R-squared values for the
linear regression model, providing insights into its predictive accuracy and overall goodness
of fit.
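
The workflow described above (Pearson correlations, top three positive features, train/test split, MSE and R-squared) can be sketched with scikit-learn as follows. The data is synthetic and the 80/20 split and random seeds are assumptions of this example, so the printed metrics are not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic listings, generated only to make the example runnable.
rng = np.random.default_rng(42)
n = 400
bedrooms = rng.integers(1, 6, size=n)
listings = pd.DataFrame({
    "bedrooms": bedrooms,
    "beds": bedrooms + rng.integers(0, 3, size=n),
    "baths": np.ceil(bedrooms / 2).astype(int),
    "minimum_nights": rng.integers(1, 30, size=n),
    "number_of_reviews": rng.integers(0, 300, size=n),
})
listings["price"] = (
    50 + 70 * listings["bedrooms"] + 20 * listings["baths"] + rng.normal(0, 25, size=n)
)

# Pearson correlation of every feature with the target.
correlations = listings.corr(method="pearson")["price"].drop("price")
top_features = correlations.sort_values(ascending=False).head(3).index.tolist()

# Assumed 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    listings[top_features], listings["price"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Top features:", top_features)
print("MSE:", round(mse, 2), "R-squared:", round(r2, 3))
```

Selecting features by correlation alone can pick variables that are strongly correlated with each other, as bedrooms and beds are here, which is the multicollinearity issue listed among the disadvantages of linear regression.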

A scatter plot is generated to compare the actual prices in the test data against the predicted
prices from the linear regression model. The reference line (red dashed line) represents a perfect
prediction scenario where actual and predicted values are equal. The scatter plot allows for a
visual assessment of how well the linear regression model aligns with the actual prices,
providing insights into the model's performance and potential areas for improvement.

To summarize, the code conducts linear regression analysis, identifies important features
based on Pearson correlation, evaluates model performance using MSE and R-squared, and
visualizes predictions through a scatter plot with a reference line.

Advantages of Linear Regression
1. Linear Regression is simple to implement, and its output coefficients are easy to interpret.

2. When the relationship between the independent and dependent variables is known to be
linear, this algorithm is the best choice because it is less complex than other algorithms.

3. Linear Regression is susceptible to over-fitting, but this can be avoided using
dimensionality reduction techniques, regularization (L1 and L2) techniques, and
cross-validation.

4. Linear regression gives a quantitative measure of the strength and direction of the
relationship between variables.

Disadvantages of linear regression

1. The linearity assumption of linear regression can be a disadvantage when the true
relationship between variables is non-linear.

2. Outliers can have a large effect on the regression, and the fitted boundaries are always linear.

3. Multicollinearity can make it challenging to determine the individual contributions of
correlated predictors.

4. Linear regression looks only at the relationship between the mean of the dependent variable and
the independent variables. Just as the mean is not a complete description of a single variable,
linear regression is not a complete description of the relationships among variables.

Decision Tree Regression

Decision Tree is one of the most widely used, practical approaches for supervised learning. It
can be used to solve both Regression and Classification tasks, with the latter being more
common in practical applications.
It is a tree-structured model with three types of nodes. The Root Node is the initial node, which
represents the entire sample and may be split further into more nodes. The Interior Nodes
represent the features of a dataset, and the branches represent the decision rules. Finally, the
Leaf Nodes represent the outcome. This algorithm is very useful for solving decision-related
problems.
A particular data point is run through the tree by answering True/False questions until it
reaches a leaf node. The final prediction is the average value of the dependent variable over
the training samples in that leaf node. By repeating these splits during training, the tree
learns to predict an appropriate value for the data point.
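
The statement that a regression tree predicts the average of the training values in a leaf can be checked on a tiny example. The six data points below are invented, and the tree is limited to a single split so that the two leaves are easy to inspect.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Invented data: number of bedrooms (feature) and nightly price (target).
X = np.array([[1], [1], [2], [2], [4], [4]])
y = np.array([90.0, 110.0, 150.0, 170.0, 390.0, 410.0])

# One split only: a root node with a single True/False question and two leaves.
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

small_prediction = tree.predict([[1]])[0]
large_prediction = tree.predict([[4]])[0]
print(small_prediction, large_prediction)
```

The split that most reduces the squared error separates the four cheaper listings from the two expensive ones, so the predictions are the leaf averages: (90 + 110 + 150 + 170) / 4 = 130 and (390 + 410) / 2 = 400.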

Implementation of Decision Tree Regression for Price Prediction Model-

The Python code implements Decision Tree Regression to predict Airbnb prices based on various
features. The Decision Tree Regressor is trained on the dataset, and the model captures non-
linear relationships by recursively partitioning the feature space. This approach is well-suited for
scenarios where the relationship between features and the target variable is intricate and involves
complex interactions. Decision trees are advantageous for their interpretability, as the resulting
tree structure provides insights into the decision-making process, making them valuable for
understanding feature importance and relationships within the data.

For feature importance, the code extracts the importance scores directly from the trained Decision
Tree Regressor. The importance of each feature is listed in descending order. The top five
features are selected based on their importance, showing which features contribute most
significantly to the prediction model. This information is crucial for feature selection and helps
streamline the model to focus on the most influential variables.

After training the Decision Tree Regressor with the top features, the Mean Squared Error (MSE)
is calculated, which quantifies the average squared difference between predicted and actual
values on the test set. Additionally, the R-squared value is computed, and the output values are
shown below.

A scatter plot is generated to compare the actual prices in the test set with the prices predicted
using the top features. The scatter plot visually represents how well the model predictions
align with the actual prices. Ideally, the points on the plot should cluster closely around the
diagonal line, indicating accurate predictions.

To summarize, the code effectively implements Decision Tree Regression, explores feature
importance, evaluates model performance using MSE and R-squared, and visualizes predictions
through a scatter plot, offering a comprehensive analysis of the predictive capabilities of the
model.

Advantages of Decision Tree Regression


1. Compared to other algorithms, decision trees require less effort for data preparation during
pre-processing.
2. A decision tree does not require normalization of data.
3. A decision tree does not require scaling of data either.
4. Missing values in the data also do not affect the process of building a decision tree to any
considerable extent.
5. A Decision tree model is very intuitive and easy to explain to technical teams as well as
stakeholders.

Disadvantages of Decision Tree Regression

1. A small change in the data can cause a large change in the structure of the decision tree,
causing instability.
2. For a decision tree, calculations can sometimes become far more complex than for other
algorithms.
3. A decision tree often takes more time to train.
4. Decision tree training is relatively expensive, as both the complexity and the time taken are
higher.
5. The decision tree algorithm is often less adequate for regression, since it predicts
piecewise-constant values rather than a truly continuous output.

RANDOM FOREST REGRESSION

The Random Forest Regressor is an ensemble learning method that enhances predictive accuracy
and stability by constructing numerous decision trees during training. Each tree is trained on a
random subset of the dataset, making decisions based on features. The aggregation of predictions
from these individual trees results in the final output, typically the mean prediction for regression
tasks. This ensemble approach not only leverages the strength of individual trees but also
minimizes overfitting, providing a robust and effective tool for regression analysis.
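The ensemble idea above (train many trees on random subsets, then average their predictions) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the scikit-learn implementation used in the project: each "tree" here is a one-split stump on a single feature, the helper names are hypothetical, and a real random forest also samples features at each split.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # the bootstrap sample contained a single x value
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def fit_forest(xs, ys, n_trees=100, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Final output: the mean of the individual tree predictions."""
    return mean(predict_stump(s, x) for s in forest)

# Hypothetical toy data: number of bedrooms vs. nightly price
bedrooms = [1, 1, 2, 2, 3, 4]
prices = [80, 100, 120, 140, 300, 340]
forest = fit_forest(bedrooms, prices)
price_small = predict_forest(forest, 1)
price_large = predict_forest(forest, 4)
```

Because every stump sees a slightly different sample, the individual split points vary, and averaging them smooths out the instability of any single tree.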

Feature selection:
The data set consists of various features related to accommodation listings. To rank all of these
features, we used the built-in feature_importances_ attribute that the random forest model in
sklearn.ensemble provides. Using it, we calculated the importance scores for the features, which
are listed below. We then took the top nine features for building the model, as they gave the
best predictions. This feature selection process aids in identifying and prioritizing the key
variables that influence pricing decisions in the context of Airbnb listings.
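The selection step can be sketched as follows: given a mapping from feature name to importance score, sort in descending order and keep the top k. The feature names and scores below are hypothetical placeholders, not the values obtained in this project; with scikit-learn the scores would come from the fitted model's feature_importances_ attribute paired with the column names.

```python
def top_k_features(importances, k):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance scores (illustrative only)
scores = {
    "room_type": 0.30,
    "neighbourhood": 0.25,
    "minimum_nights": 0.15,
    "availability_365": 0.20,
    "number_of_reviews": 0.10,
}
selected = top_k_features(scores, 3)
```

Repeating this for several values of k and comparing the resulting model scores is one way to arrive at a choice such as the top nine features.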

Algorithm:

A subset of relevant features is selected based on the top features from the feature selection
step. After trying different numbers of features, we found that the top nine features give the
best score. The relevant features for predicting the 'price' target variable are selected and
encoded, with categorical columns such as 'room type' and 'neighborhood' transformed using Label
Encoder. The dataset is split into training and testing sets, with 80% used for training the
model and 20% for evaluation. Subsequently, a Random Forest Regressor model is instantiated with
100 estimators for robust predictions and then trained on the selected features and target
variable of the training set. Afterward, predictions are made on the test set, and the model's
performance is evaluated using metrics such as Mean Squared Error (MSE) and R-squared (R2).
Finally, the predicted prices are visually compared to the actual prices through a scatter plot,
providing insights into the model's accuracy in capturing the price variations in the Airbnb
listings.
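The two preprocessing steps named above, label encoding and the 80/20 split, can be sketched without any library. The helper names, the sample room types, and the seed are assumptions for illustration; the project itself uses Label Encoder and a train/test split, for which this is only a simplified stand-in.

```python
import random

def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out test_fraction of them for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical categorical column
room_types = ["Private room", "Entire home/apt", "Shared room",
              "Entire home/apt", "Private room"]
encoded, mapping = label_encode(room_types)

# Hypothetical row indices standing in for listings
rows = list(range(10))
train_rows, test_rows = train_test_split_rows(rows)
```

Label encoding assigns arbitrary integers to categories, which tree-based models can split on; the held-out 20% is never seen during training, so the metrics computed on it estimate performance on new listings.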

In conclusion, the importance score is a quantitative measure indicating the contribution of each
feature to the model's ability to make accurate predictions. Higher importance scores suggest that
the feature is more influential in determining the target variable.

Model Implementation:

The Mean Squared Error (MSE) of 271,710.29 indicates the average squared difference between
the predicted and actual prices. A lower MSE is preferable.

The R-squared (R2) score of 0.50 suggests that approximately 50.40% of the variability in the
target variable ('price') can be explained by the model. R-squared is at most 1, where 1
indicates a perfect fit and 0 corresponds to always predicting the mean price, so 0.50 indicates
a moderate level of predictive power and the best among the machine learning models which we
implemented here.

These metrics provide insights into the performance of the Random Forest Regression model. A
higher R2 would indicate a better-fitting model.
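Both metrics can be computed directly from their definitions, as in the sketch below. The functions are defined locally for illustration (they are not imported from any library), and the actual and predicted prices are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical test-set prices and model predictions
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)

# A baseline that always predicts the mean price scores R2 = 0
baseline_r2 = r_squared(actual, [175] * 4)
```

Because MSE is expressed in squared price units, it grows quickly with a few large errors, whereas R-squared is unitless and compares the model against the mean-price baseline.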

Visualization:

The scatter plot visually represents the model's predictions against the actual prices. Each
point on the plot corresponds to a data point in the test set. The x-axis represents the actual
prices, while the y-axis represents the prices predicted by the Random Forest Regression model.
The distance of a point from the diagonal line indicates the disparity between the predicted and
actual values. A more accurate model would have points closely aligned along the diagonal. Here,
most points follow the direction of the diagonal, which indicates a reasonably good model.

Advantages of Random Forest regression:

High Predictive Accuracy: Random Forest tends to provide high predictive accuracy by
aggregating multiple decision trees, reducing the risk of overfitting.

Handles Non-Linearity: It can effectively model complex non-linear relationships in the data,
making it suitable for a wide range of regression tasks.

Feature Importance: The algorithm provides a feature importance score, helping identify the
most influential features in making predictions.

Robust to Outliers: Random Forest is robust to outliers and noise in the data due to its ensemble
nature, which averages out individual errors.

Disadvantages of Random Forest regression:

Computational Complexity: Training multiple decision trees and combining them can be
computationally expensive, especially for large datasets.

Memory Usage: The algorithm may consume significant memory, particularly when dealing
with many trees and features.

Not Suitable for Small Datasets: Random Forest may not perform well on small datasets, as it
requires enough data to capture complex relationships.

RECOMMENDATIONS

CLUSTERING

K-means clustering is a widely utilized unsupervised machine learning algorithm
designed to partition datasets into distinct, non-overlapping subgroups or clusters. The
primary objective of this algorithm is to aggregate similar data points, categorizing them into
clusters based on specific features or characteristics.

We applied K-means clustering to the Airbnb listing data to identify distinct clusters based on
various features. The feature importance is then calculated from the cluster centers, which are
available as the built-in cluster_centers_ attribute of KMeans in sklearn.cluster, and we used
it to compute the importance scores. Based on the scores, we took the top 7 features for
clustering, as this gave the best solution; we tried different counts as well. Our analysis
varied the number of clusters (2, 3, 10, 15, 18) to comprehensively understand the data
distribution. To assess the quality of the different cluster counts, we employed two key
metrics: the Silhouette Score, which measures how similar an object is to its own cluster
compared to other clusters, and the Davies-Bouldin Index, which evaluates the compactness and
separation of clusters. The evaluation indicated that 15 clusters yielded the most effective
distribution among the alternatives considered, despite some overlapping.
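To make the Silhouette Score concrete, the sketch below computes it from its definition for a given cluster assignment. It is an illustrative re-implementation under simplifying assumptions (Euclidean distance, at least two clusters, toy 2-D points), not the library routine used in the project. A higher silhouette is better, whereas for the Davies-Bouldin Index a lower value is better.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical 2-D points forming two obvious groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # groups match clusters
bad_score = silhouette_score(points, [0, 1, 0, 1])   # groups are mixed up
```

Computing this score for each candidate number of clusters and comparing the values is the kind of evaluation described above.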

The feature importance is then calculated for each cluster, revealing the key attributes that
contribute to the differentiation between clusters. The top features are sorted based on their
importance scores. Additionally, a recommendation function is implemented using cosine
similarity, suggesting listings similar to a given input listing.

In practical terms, the clustering results can be used to categorize Airbnb listings
into groups with shared characteristics. For example, clusters might represent listings with
similar pricing patterns, review frequencies, and other relevant features. If one property
id is given, the function recommends the top 5 property ids with similar characteristics. The
recommendation function allows for personalized suggestions by identifying listings with
similarities to a chosen property. This information can be leveraged for targeted marketing,
pricing strategies, or providing tailored recommendations to users based on their preferences.
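A recommendation function of this kind can be sketched as below. The listing ids, the feature vectors, and the function names are hypothetical; the sketch assumes each listing has already been converted to a numeric, non-zero feature vector, and it ranks the other listings by cosine similarity to the chosen one.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(listings, property_id, top_n=5):
    """Return the ids of the top_n listings most similar to property_id."""
    target = listings[property_id]
    scored = [(cosine_similarity(target, features), other_id)
              for other_id, features in listings.items()
              if other_id != property_id]
    scored.sort(reverse=True)  # highest similarity first
    return [other_id for _, other_id in scored[:top_n]]

# Hypothetical listings: property id -> scaled feature vector
listings = {
    101: [1.0, 0.0, 0.5],
    102: [0.9, 0.1, 0.5],
    103: [0.0, 1.0, 0.0],
    104: [1.0, 0.0, 0.4],
}
similar_to_101 = recommend(listings, 101, top_n=2)
```

Restricting the candidates to listings in the same cluster as the input property is one possible way to combine this function with the clustering results.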

7. TESTING

Suppose an owner is considering purchasing a property with the intention of listing it on
Airbnb. After some initial analysis it becomes evident that the top two neighborhoods are
the most promising options. However, a crucial question emerges: Does investing in a
property in either of these neighborhoods yield equivalent returns? To address
this inquiry, a hypothesis test is undertaken.

The null hypothesis states that there is no significant difference in the mean price
between the top two neighborhoods. The findings show that the null hypothesis cannot be
rejected.
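The report does not state which test was used, so the following is only one possible sketch: Welch's two-sample t statistic for the difference in mean price, with a two-sided p-value taken from a normal approximation. The price samples are hypothetical, the 0.05 significance level is an assumption, and the normal approximation is reasonable only for large samples; in practice a library routine such as scipy.stats.ttest_ind with equal_var=False would give the exact p-value.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Hypothetical nightly prices in the top two neighborhoods
prices_a = [120, 135, 150, 110, 140, 125]
prices_b = [118, 138, 149, 112, 139, 127]

t_stat = welch_t_statistic(prices_a, prices_b)
p_value = approx_two_sided_p(t_stat)
reject_null = p_value < 0.05  # assumed significance level
```

When the p-value is above the chosen significance level, the data give no evidence of a difference in mean price, which corresponds to the outcome reported above.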

With this information in hand, the prospective property buyer can now broaden their
focus beyond property price per night alone. Factors such as the overall locality, unit
pricing and other relevant considerations can be incorporated into the decision-making
process.

8. RESULTS AND SOLUTION EVALUATION

Metric       Linear Regression   Decision Tree Regression   Random Forest
MSE          395,718.39          281,190.71                 271,710.29
R-Squared    0.11                0.36                       0.50

By comparing the Mean Squared Error (MSE) and R-squared for all the machine learning
algorithms implemented, we can say that Random Forest Regression provides the lowest MSE
value and the highest R-squared value, highlighting its effectiveness in explaining and
capturing the variability in the target variable.

Among the models evaluated (Linear Regression, Lasso Regression, Gradient Boosting
Regressor, Decision Tree Regression, and Random Forest), Random Forest stands out as the
most effective. It achieved the lowest MSE of 271,710.29 and the highest R-Squared of 0.50,
indicating superior predictive performance and a better ability to explain variance in the target
variable. These results suggest that Random Forest is the preferred model for predicting
accommodation prices in this dataset, offering a balance of accuracy and explanatory power.

The predictive capability of the Random Forest model is advantageous for various stakeholders
in the real estate and hospitality industry. It helps property owners, hosts, and investors make
informed decisions regarding pricing strategies, optimizing rental income, and maximizing
occupancy rates. Additionally, the insights gained from feature importance analysis contribute to
a deeper understanding of the key factors influencing pricing, enabling data-driven
decision-making and enhancing overall business strategies in the dynamic real estate landscape.

For all the algorithms, we built models with different numbers of features, checked the
performance for each count, and tried different functions and methods to calculate the feature
importance scores. In this report we have presented the best of these results.

9. CONCLUSION

• To conclude, comparing the MSE and R-squared values for the given dataset shows that the
non-linear regression models perform better than the linear regression algorithms.

• This analysis provides valuable insights into the factors influencing property prices,
aiding property owners and customers in strategic decision-making. The application of
machine learning techniques demonstrated robust predictive capabilities for property
prices.

• The insights obtained from this project contribute to the growing body of knowledge in
the field of machine learning for real-world applications and provide a foundation for
further refinement and optimization of pricing prediction models in the dynamic
context of real estate.
