Machine Learning
BACHELOR OF TECHNOLOGY
In
V.SHARIF
SIR C R REDDY COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that the project report entitled “PRICE PREDICTION AND RECOMMENDATION
OF AIRBNB PROPERTY LISTINGS” being submitted by
G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552
in partial fulfillment for the award of the Degree of Bachelor of Technology in Computer Science
and Engineering to the Jawaharlal Nehru Technological University, Kakinada is a record of
bonafide work carried out under my guidance and supervision.
External Examiner
DECLARATION
I hereby declare that the Project entitled “PRICE PREDICTION AND RECOMMENDATION OF
AIRBNB PROPERTY LISTINGS” submitted for the B.Tech Degree is my original work and the
Project has not formed the basis for the award of any degree, associateship, fellowship or
Date
G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552
ACKNOWLEDGEMENT
I would like to thank Dr. A. YESU BABU, Head of the Department of CSE, for providing the necessary
facilities and for his guidance, which enabled the efficient completion of the project in
the specified time.
I am grateful to V. Sharif, Assistant Professor, Department of CSE, Project guide, for providing
the necessary facilities and for his guidance in the efficient completion of the project in the specified
time.
I am extremely grateful to my department staff members and teammates who helped me in the
successful completion of this project.
G.Nehemiah - 20B81A0548
G.Ishwarya - 20B81A0549
G.Pravalika - 20B81A0550
G.Ravali - 20B81A0551
G.Venkata Sai Monika - 20B81A0552
ABSTRACT
In today's rapidly evolving world of science, technology, and global connectivity, travel has reached
unprecedented levels, with people exploring various destinations for business and personal needs.
Finding suitable lodging away from home is a crucial aspect of any trip. While hotels and motels have
long been the go-to choice for travelers, the accommodation landscape has been transformed by
Airbnb, initially known as AirBed & Breakfast. This company has become a favored
alternative to traditional hotels, offering a unique experience for travelers. Over time, Airbnb has
gained immense popularity, often surpassing conventional hotel options as the preferred choice for
accommodation. The Airbnb business model is a two-sided marketplace that serves both property
owners and guests. Property owners offer their homes or rental properties on the platform, while
guests book these properties for a specified period. Airbnb charges a service fee to both the guest
and the property owner for each booking. In this dataset, each row represents a listing with details
such as coordinates, neighborhood, host id, price per night, number of reviews, and so on. Purpose of
using this dataset: The dataset primarily focuses on providing insights about the Airbnb model. Various
data analysis models can be applied to derive more meaningful outcomes from the Airbnb
listings.
By applying machine learning algorithms and data visualization techniques, the business strategy can
be analyzed. For instance, demand and pricing depend on the listing's rating,
customer reviews, holiday seasons, etc. Analyzing these aspects will in turn help the stakeholders
and the customers of the Airbnb ecosystem to make efficient decisions and achieve
consistent long-term profitability. Method to be employed: Implementing a machine learning method
is a crucial step towards harnessing the potential of data-driven decision making. On this dataset we
plan to implement the following machine learning methodologies:
1. Prediction of the price using various regression methods such as Linear Regression, Decision Tree
Regression, and Random Forest Regression, with input features such as property, size, location,
neighborhood and many more.
2. Clustering will be used for dimension reduction of the dataset, and Cosine Similarity will be
utilized to provide personalized recommendations for Airbnb listings based on user preferences.
Additionally, we integrate hypothesis testing into our methodology, serving as a robust statistical
tool for decision-making processes, benefiting both property owners and customers in optimizing
their listings and enhancing the overall Airbnb experience.
Data sourcing: The data has been downloaded from Kaggle using the following link:
https://www.kaggle.com/datasets/deeplearner09/airbnb-listings/data It is used for an Exploratory Data
Analysis study, since we use the data to provide analysis and recommendations.
TABLE OF CONTENTS
1 INTRODUCTION
2 LITERATURE SURVEY
3 EXISTING SYSTEM
4 PROPOSED SYSTEM
PROBLEM STATEMENT
5 REQUIREMENT ANALYSIS
FUNCTIONAL REQUIREMENTS
NON-FUNCTIONAL REQUIREMENTS
6 IMPLEMENTATION
7 TESTING
9 CONCLUSION
10 REFERENCES
LIST OF FIGURES
1. INTRODUCTION
Airbnb has gained immense popularity, often surpassing conventional hotel options as the
preferred choice for accommodation. The Airbnb business model is a two-sided marketplace
that serves both property owners and guests. Property owners offer their homes or rental
properties on the platform, while guests book these properties for a specified period. Airbnb
charges a service fee to both the guest and the property owner for each booking.
In the dataset used here, each row represents a listing with details such as coordinates,
neighborhood, host id, price per night, number of reviews, and so on. The dataset primarily
focuses on providing insights about the Airbnb model. Various data analysis models can be
applied to derive more meaningful outcomes from the Airbnb listings. Implementing a machine
learning method is a crucial step towards harnessing the potential of data-driven decision
making. By applying machine learning algorithms and data visualization techniques, the
business strategy can be analyzed.
This project revolves around predicting Airbnb listing prices, a critical task in the dynamic
landscape of short-term property rentals. Airbnb, as a leading platform in the travel and
hospitality industry, hosts a large number of listings with diverse attributes.
The project's primary objective is to develop a robust machine learning model capable of
accurately forecasting listing prices based on features like location, property type, and
amenities. By doing so, the project aims to offer valuable insights to both hosts and potential
guests, enabling hosts to optimize their offerings and aiding guests in making well-informed
accommodation decisions.
2. LITERATURE SURVEY
author
3. EXISTING SYSTEM
The current existing systems use large datasets with various columns and data listings
together with different machine learning models.
4. PROPOSED SYSTEM
5. REQUIREMENT ANALYSIS
Hardware requirements:
Software requirements:
DATA PROCESSING
Binary: These columns have only two unique values, typically representing binary categories
such as 0 and 1. The code identifies columns with two unique values and categorizes them as
categorical binary. Although binary values such as 1 and 0 are numerical, arithmetic operations
on them are not meaningful, so these columns are defined under Categorical data.
Discrete data: These are distinct or separate values. Discrete data can be counted. They are whole numbers or
integers, and the values cannot be subdivided into smaller pieces. Examples: total students in a
class, number of products.
Continuous data: These are numeric values that form a continuous range and can be measured.
They are in the form of fractions or decimals, and the values can be subdivided into
smaller pieces. Examples: temperature readings, age.
Based on the above definitions, we divided the columns in the dataset into the following categories.
Categorical:
inn_name: Contains non-numeric (string) values, represented as the object datatype.
host_name: The values here represent the identifying information of the listing's host.
id: An integer identifier for each listing; it is treated as numerical categorical data.
host_id: An integer identifier for each host; it is treated as numerical categorical data.
neighbourhood: The numeric values here represent neighborhoods and are considered to be
categorical values.
room_type: Represents the type of room; it is categorical and may contain text
descriptions.
last_review: Contains date values in yyyy-mm-dd format, stored as objects.
Binary:
studio: An integer datatype holding the binary values 0 and 1, which are treated as a
numerical categorical value meaning False or True respectively.
shared_bath: Holds the binary values 0 and 1, a numerical categorical value meaning
False or True respectively.
private_bath: Holds the binary values 0 and 1, a numerical categorical value meaning
False or True respectively.
Continuous:
ratings: Continuous numerical values representing ratings, of float datatype.
latitude and longitude: Continuous numerical values representing the geographical coordinates,
of float datatype.
reviews_per_month: Continuous numerical values representing the average number of reviews
per month.
Discrete or Numerical:
bedrooms: Integer values representing numerical discrete data.
beds: Integer values representing numerical discrete data.
baths: Integer values representing numerical discrete data.
minimum_nights: Integer values representing numerical discrete data.
number_of_reviews: Integer values representing numerical discrete data.
calculated_host_listings_count: Integer values representing numerical discrete data.
availability_365: Integer values representing numerical discrete data.
number_of_reviews_ltm: Integer values representing counts or quantities.
price: Integer datatype representing the price, which is numerical discrete data.
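The categorization rules described above can be sketched roughly as follows. This is a minimal, library-free illustration rather than the project's actual code: the function name classify_column and the sample values are hypothetical, and identifier columns such as id, host_id and neighbourhood would still need to be marked as categorical by hand, because a rule based only on datatype sees them as whole numbers.

```python
def classify_column(values):
    """Return 'binary', 'discrete', 'continuous', or 'categorical' for one column."""
    present = [v for v in values if v is not None]   # ignore missing entries
    distinct = set(present)
    if len(distinct) == 2:
        return "binary"          # exactly two unique values, e.g. 0 and 1
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "discrete"        # whole-number counts
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "continuous"      # measured values that may carry decimals
    return "categorical"         # strings, dates stored as text, and so on


# Hypothetical sample values; the real columns come from the Kaggle dataset.
columns = {
    "room_type": ["Entire home/apt", "Private room", "Hotel room", "Private room"],
    "studio": [0, 1, 0, 1],
    "bedrooms": [1, 2, 3, 2],
    "ratings": [4.89, None, 4.5, 5.0],
}
kinds = {name: classify_column(vals) for name, vals in columns.items()}
print(kinds)
```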
IMPUTATION PROCESS
Missing data is a common issue in real-world datasets and can arise for various reasons, such as
data entry errors, intentional omission, etc. To overcome this issue, imputation is used
during data cleaning and preprocessing.
It involves filling in missing values with median values for the numerical columns and mode
values for the categorical columns. Through imputation, data integrity is preserved, which
in turn supports more robust analysis and modeling.
Here, imputation is performed by replacing the null values in the dataset with the
median value.
Handling the null and missing values:
1. The 4222 null values in the ‘ratings’ column are imputed with the median value of 4.89.
2. The 3103 null values in the 'reviews per month' column are imputed with the median
value of 0.99.
3. The null values in the ‘last review’ column are not imputed, as we understand that some
properties may not have received any reviews, either because they are relatively new to the
platform, because they are situated in remote locations, because the amenities may not be up
to the mark, or for other external reasons. Additionally, guests may have visited but, for various
reasons, chosen not to leave a review. We believe that leaving the blank values
as NULL accurately reflects these practical scenarios.
4. The two null values in the host name column are imputed with ‘Unknown’, as a few
entries in that column are already named Unknown; we took that as a reference and
filled the two nulls with Unknown in the same way.
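The imputation rules listed above could look roughly like this in code. It is a small sketch on made-up values, not the project's notebook: the helper names impute_median and impute_constant are hypothetical, and missing entries are represented as None.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_constant(values, fill):
    """Replace None entries with a fixed placeholder such as 'Unknown'."""
    return [fill if v is None else v for v in values]


# Hypothetical toy columns, not the real data.
ratings = [4.5, None, 4.89, 5.0, None]
host_name = ["Asha", None, "Unknown"]
last_review = ["2023-01-15", None]        # deliberately left untouched

ratings_filled = impute_median(ratings)
host_filled = impute_constant(host_name, "Unknown")
print(ratings_filled, host_filled, last_review)
```

The median is used rather than the mean because it is less affected by a few extreme values.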
Data Extraction:
1. A function utilizing regular expressions was implemented to extract the numerical
values associated with each room type. The extracted data was stored in dictionaries
for further processing.
2. Data Enrichment and Column Creation: The extracted data was used to create
new columns in the Data Frame. 'bedrooms', 'beds', 'baths': These columns were initialized
with default values and updated using the extracted numerical data.
3. 'studio', 'shared_bath', 'private_bath': Binary columns indicating the presence or absence
of these room types, based on the extracted indices.
4. 'ratings': Extracted numeric ratings were added to a new column after converting
the '★' values to numeric values.
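A regular-expression extraction of this kind might be sketched as below. The layout of the description text ("… · ★4.84 · 2 bedrooms · 3 beds · 1 private bath") is an assumption made for illustration, as are the function name parse_listing and the two pattern names; the actual source column and its format may differ.

```python
import re

# Assumed text layout; the real column and its format may differ.
COUNT_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s+(?:shared\s+|private\s+)?(bedroom|bed|bath)s?\b"
)
RATING_PATTERN = re.compile(r"★\s*(\d+(?:\.\d+)?)")


def parse_listing(text):
    """Extract room counts, binary flags, and the rating from one description."""
    lowered = text.lower()
    row = {"bedrooms": 0, "beds": 0, "baths": 0.0}      # default values
    for number, kind in COUNT_PATTERN.findall(lowered):
        value = float(number)
        # Half baths such as 1.5 are kept as floats in this sketch.
        row[kind + "s"] = value if kind == "bath" else int(value)
    row["studio"] = int("studio" in lowered)
    row["shared_bath"] = int("shared bath" in lowered)
    row["private_bath"] = int("private bath" in lowered)
    match = RATING_PATTERN.search(text)
    row["ratings"] = float(match.group(1)) if match else None
    return row


parsed = parse_listing("Home in Austin · ★4.84 · 2 bedrooms · 3 beds · 1 private bath")
print(parsed)
```

A listing without a star rating yields None for 'ratings', which matches the nulls that are later imputed.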
VISUAL REPRESENTATION AND INSIGHTS
In our project, we leverage visualizations as powerful tools to enhance the clarity and
interpretability of our findings. Through charts, graphs, and maps, we aim to simplify complex
patterns and trends and draw meaningful insights from the visualizations. Visual
representations of geographical distribution, property characteristics, and price trends will not
only enhance interpretability but also empower hosts and users to make informed decisions.
In our project, Tableau, Python, and Microsoft Fabric (in which Power BI is incorporated) are
used as visualization tools to represent the data visually.
Tableau is a versatile data visualization tool that allows users to create interactive and shareable
dashboards. Its user-friendly interface makes it accessible for both technical and non-technical
users, enabling the creation of compelling visualizations without extensive coding.
Python, with libraries like Matplotlib and Seaborn, serves as a robust programming language for data
visualization. Python's flexibility and extensive libraries make it a preferred choice for
customizing visualizations and creating complex plots. Its integration with data analysis and
machine learning tools further enhances its capabilities.
Microsoft Fabric is an analytics platform from Microsoft that brings data integration, data
engineering, and business intelligence services together in one environment. It is not a
visualization tool on its own; here, Microsoft Power BI, which is incorporated in Fabric, is the
component used for visualization. Power BI is a business
analytics tool that facilitates interactive visualizations and business intelligence with an
intuitive drag-and-drop interface. It integrates with various data sources, making it
convenient for users to transform data into insightful visuals, reports, and dashboards. We
explored it because it is one of the emerging platforms, and working with it helped us learn
a new application.
Some of the visualizations we implemented are described below.
One insight is that Entire home/apt listings have more ratings and reviews than any other room
type, while hotel rooms have the lowest ratings and reviews.
Visualizing the count of ratings and the number of reviews based on room types provides
valuable insights into the popularity and satisfaction levels across different accommodation
offerings. We created visualizations using a bar chart and a line chart to depict the
distribution of ratings and reviews among the various room types. This visual representation helps
stakeholders quickly grasp the count of ratings associated with each room type, providing a
comprehensive overview of customer requirements and choices. Additionally, the number of
reviews for each room type can be visualized to understand the level of engagement and feedback
received, allowing for strategic decision-making in the hospitality industry. These visualizations
enhance data-driven decision capabilities, enabling businesses to tailor their offerings based on
customer preferences and experiences.
Count of number of reviews by last review year:
When we look at the count of reviews by last review year, the count is almost flat
from 2012 to 2021, with only very slight increases in the
middle years. Starting from 2021, the count of reviews begins to increase for the
listings, and from 2022 there is a drastic increase, which suggests that the number of customers
rose sharply from that year. This information is helpful for further analysis, such as what made
customers choose Airbnb listings from that year onward, and it is useful for drawing insights and
predicting future trends as well.
Visualizing the count of reviews based on the last review year offers a perspective on customer
feedback trends over the years. We used a line chart to illustrate how the number of reviews has
evolved over different years. This visualization aids in identifying patterns, seasonality, or shifts
in customer engagement, enabling businesses to make data-driven decisions. It serves as a
powerful tool for hospitality industry professionals to adapt their strategies and offerings in
response to changing customer sentiments over time.
Count of rating by neighborhood:
Neighborhood 78704 has the highest count of ratings, with 2233,
followed by neighborhood 78702. In this way we can identify which listings are
most occupied and have a high count of ratings, which means many customers are visiting
those regions. This helps owners understand the demand, so that they can raise the price or
increase the number of listings in those regions, and identify and improve the quality
of listings in regions with a low count of ratings. Further useful
information can also be obtained from this visualization.
Geographical distribution of Airbnb Listings- The geographical distribution of Airbnb
listings is shown through Folium maps in Python. Folium is used to create a
geographical map with markers representing the locations of Airbnb listings; each
marker on the map represents the location of one Airbnb listing.
This visualization provides an overview of the geographical distribution of listings. We
can zoom in and out to explore different geographical areas.
Relationship between Ratings and Price features- The scatter plot shows the
relationship between ratings and price. This helps in understanding how customers
perceive the relationship between the price and rating of Airbnb listings. We can see that
higher ratings are concentrated in the price range from 0 to 2500. It can be inferred that
customers are likely to opt for listings with low prices.
Correlation matrix for different price ranges:
To understand the correlation between the independent variables (features) and the target
variable price, the price is categorized into different ranges based on the Min, Max, and
Median prices.
Classifying as follows:
The correlation matrix is plotted using the sns heatmap for each price category as defined
above, to understand how the parameters correlate with each other.
The matrix also reveals that the target variable price does not depend on only one feature;
rather, multiple features dictate the price of the listing. This leads us to apply regression
techniques.
● Positive correlation: We can infer that the positively correlated features are bedrooms,
baths, beds, number_of_reviews, reviews_per_month, and number_of_reviews_ltm, which
correlate with each other in each of the price categories. For the exorbitantly high price categories
there is another positively correlated feature, 'private_bath', which indicates that customers
also prioritize this feature when looking for very high-end listings. The intensity of red
indicates such correlation.
● Negative correlation: We can infer that the negatively correlated features are studio,
shared_bath, and private_bath with respect to bedrooms, beds, and baths. The intensity of blue
indicates such correlation.
● No correlation: We can also observe that there are multiple loosely correlated variables,
which are indicated by neutral colors.
Turning to the visualizations from Tableau,
one insight is that a significant number of Entire home/apt listings are priced below $300,
with the $100 to $200 price range being the most common. It is also observed that a smaller
number of properties are priced above $1000. The distribution exhibits positive skewness,
indicating that the tail of the distribution extends toward higher prices.
Machine Learning Models
1. Linear Regression-
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable we want to predict is called the dependent variable. The variable we are
using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables, that best predict the value of the dependent variable. Linear regression fits
a straight line or surface that minimizes the discrepancies between predicted and actual output
values. Simple linear regression uses a “least squares” method to
find the best-fit line for a set of paired data. We then estimate the value of Y
(dependent variable) from X (independent variable).
The Python code imports the necessary libraries and loads the Airbnb dataset. It then
calculates the Pearson correlation coefficients between the features and the target variable,
'price'. The top three most positively and most negatively correlated features are identified. As
negative coefficients that tend towards zero do not have a significant impact on the target
variable, we took the top 3 positively correlated features, whose coefficients are closest to 1. Subsequently, a
linear regression model is created using the three most positively correlated features. The model
is trained on the training data, and predictions are made on the test set. The linear regression
model assumes a linear relationship between the selected features and the target variable, 'price'.
Feature importance is derived from the correlation analysis, where the three most positively
correlated features with the target variable, 'price', are identified. In the context of
linear regression, the emphasis is on identifying the features that contribute most to the
prediction model. In this case, the features selected for the linear regression model are
deemed important because they are believed to have a strong linear relationship with the target
variable.
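The correlation-based selection step can be illustrated with a short, library-free sketch. The toy numbers below are invented only to make the ranking visible, and the helper names pearson and top_positive_features are hypothetical; the project itself computed the coefficients on the full dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def top_positive_features(features, target, k=3):
    """Return the k feature names with the highest correlation to the target."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy values, chosen only to make the ranking visible.
price = [80, 120, 150, 210, 300]
features = {
    "bedrooms": [1, 1, 2, 3, 4],
    "beds": [1, 2, 2, 3, 5],
    "baths": [1, 1, 1, 2, 3],
    "minimum_nights": [30, 2, 3, 2, 1],
    "studio": [1, 0, 0, 0, 0],
}
selected, scores = top_positive_features(features, price)
print(selected)
```

The selected columns would then be passed to the regression model as its input features.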
We calculated the Mean Squared Error (MSE), a measure of the average squared difference
between the predicted and actual values on the test data. Lower MSE values indicate better
model performance. Additionally, we computed the R-squared value, which measures the proportion of the
variance in the target variable that is explained by the model. A higher R-squared value signifies a
better fit. The provided output shows the MSE and R-squared values for the linear regression
model, providing insights into its predictive accuracy and overall goodness of fit.
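For reference, the two metrics can be written out using their standard definitions. The symbols are introduced here for illustration: $y_i$ is the actual price of test listing $i$, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean actual price, and $n$ the number of test listings.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

A model that always predicts the mean price $\bar{y}$ has $R^2 = 0$, and a model whose predictions match every actual price has $R^2 = 1$.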
A scatter plot is generated to compare the actual prices in the test data against the predicted
prices from the linear regression model. The reference line (red dashed line) represents a perfect
prediction scenario where actual and predicted values are equal. The scatter plot allows for a
visual assessment of how well the linear regression model aligns with the actual prices,
providing insights into the model's performance and potential areas for improvement.
To summarize, the code conducts linear regression analysis, identifies important features
based on Pearson correlation, evaluates model performance using MSE and R-squared, and
visualizes predictions through a scatter plot with a reference line.
Advantages of Linear Regression
1. Linear Regression is simple to implement, and its output coefficients are easy to interpret.
2. When the relationship between the independent and dependent variables is known to be
linear, this algorithm is a good choice because it is less complex than
other algorithms.
4. Linear regression gives a quantitative measure of the strength and direction of the relationship
between variables.
Disadvantages of Linear Regression
1. The linearity assumption of linear regression can be a disadvantage when the true
relationship between variables is non-linear.
2. Outliers can have a large effect on the regression, and the fitted boundaries are
always linear in this technique.
3. Multicollinearity can make it difficult to determine the individual contributions of correlated predictors.
4. Linear regression looks only at the relationship between the mean of the dependent variable and the
independent variables. Just as the mean is not a complete description of a single variable, linear
regression is not a complete description of the relationships among variables.
Decision Tree Regression
Decision Tree is one of the most widely used and practical approaches for supervised learning. It can be
used to solve both Regression and Classification tasks, with the latter being more common in
practice.
It is a tree-structured model with three types of nodes. The Root Node is the initial node, which
represents the entire sample and may be split into further nodes. The Interior Nodes
represent the features of a dataset, and the branches represent the decision rules. Finally, the
Leaf Nodes represent the outcome. This algorithm is very useful for solving decision-related
problems.
A data point is passed through the tree by answering True/False
questions until it reaches a leaf node. The final prediction is the average value of the dependent
variable among the training samples in that leaf node. Through successive splits, the tree narrows
down to a suitable value for the data point.
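The idea of splitting recursively and predicting the leaf average can be shown with a deliberately small sketch. It handles a single numeric feature only and is not the scikit-learn implementation; the function names and the toy bedroom and price values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean, used as the split cost."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(xs, ys, depth=0, max_depth=2, min_samples=2):
    """Recursively split one numeric feature to reduce squared error."""
    if depth == max_depth or len(ys) < min_samples or len(set(xs)) == 1:
        return {"leaf": mean(ys)}                   # leaf predicts the average
    best = None
    for threshold in sorted(set(xs))[:-1]:          # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold)
    threshold = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return {
        "threshold": threshold,
        "left": build_tree([x for x, _ in left], [y for _, y in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([x for x, _ in right], [y for _, y in right],
                            depth + 1, max_depth, min_samples),
    }


def predict(tree, x):
    """Answer the True/False question at each node until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Hypothetical toy data: number of bedrooms against nightly price.
bedrooms = [1, 1, 2, 2, 3, 4]
prices = [90, 110, 150, 170, 300, 340]
tree = build_tree(bedrooms, prices)
print(predict(tree, 2))   # average price of the two-bedroom training listings
```

Because every leaf returns one average, the predictions form a step function of the input rather than a smooth curve.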
Implementation of Decision Tree Regression for Price Prediction Model-
The Python code implements Decision Tree Regression to predict Airbnb prices based on various
features. The Decision Tree Regressor is trained on the dataset, and the model captures
non-linear relationships by recursively partitioning the feature space. This approach is well suited to
scenarios where the relationship between features and the target variable is intricate and involves
complex interactions. Decision trees are advantageous for their interpretability, as the resulting
tree structure provides insights into the decision-making process, making them valuable for
understanding feature importance and relationships within the data.
For feature importance, the code extracts the importance scores directly from the trained Decision
Tree Regressor. The importance of each feature is shown, sorted in descending order. The top
five features are selected based on their importance, showing which
features contribute most significantly to the prediction model. This information is crucial for
feature selection and helps streamline the model to focus on the most influential variables.
After training the Decision Tree Regressor with the top features, the Mean Squared Error (MSE)
is calculated, which quantifies the average squared difference between predicted and actual
values on the test set. Additionally, the R-squared value is computed, and the output values are
shown below-
A scatter plot is generated to compare the actual prices in the test set with the predicted prices
obtained using the top features. The scatter plot visually represents how well the model predictions
align with the actual prices. Ideally, the points on the plot should cluster closely around a diagonal
line, indicating accurate predictions.
To summarize, the code implements Decision Tree Regression, explores feature
importance, evaluates model performance using MSE and R-squared, and visualizes predictions
through a scatter plot, offering a comprehensive analysis of the predictive capabilities of the
model.
Disadvantages of Decision Tree Regression
1. A small change in the data can cause a large change in the structure of the decision tree,
causing instability.
2. For a decision tree, calculations can sometimes become far more complex than for
other algorithms.
3. A decision tree often takes more time to train.
4. Decision tree training is relatively expensive, as both the complexity and the time taken are higher.
5. The Decision Tree algorithm is less well suited to regression and predicting
continuous values, because its predictions are piecewise constant.
RANDOM FOREST REGRESSION
The Random Forest Regressor is an ensemble learning method that enhances predictive accuracy
and stability by constructing numerous decision trees during training. Each tree is trained on a
random subset of the dataset, making decisions based on features. The aggregation of predictions
from these individual trees results in the final output, typically the mean prediction for regression
tasks. This ensemble approach not only leverages the strength of individual trees but also
reduces overfitting, providing a robust and effective tool for regression analysis.
Feature selection:
The dataset consists of various features related to accommodation listings. Across all the
features, we used the built-in feature_importances_ attribute, which is provided by the
random forest in sklearn.ensemble itself. Using it, we calculated the
importance scores for the features, which are mentioned below. We then took the top 9 features
for building the model, as they gave the best predictions. This feature selection process aids in
identifying and prioritizing the key variables that influence pricing decisions in the context of
Airbnb listings.
Algorithm:
A subset of relevant features is selected based on the top features from the feature selection step.
After trying different numbers of features, we found that the top nine features give the best score. The relevant
features for predicting the 'price' target variable are selected and encoded, with categorical
variables like 'room type' and 'neighborhood' transformed using Label Encoder, as they are
categorical columns. The selected features and the target variable are then passed to the
model-building step. The
dataset is split into training and testing sets, with 80% used for training the model and 20% for
evaluation. Subsequently, a Random Forest Regressor model is instantiated with 100 estimators
for robust predictions and then trained on the training set. Afterward, predictions are made on the
test set, and the model's performance is evaluated using metrics such as Mean Squared Error
(MSE) and R-squared (R2). Finally, the predicted prices are visually compared to the actual
prices through a scatter plot, providing insights into the model's accuracy in capturing the price
variations in the Airbnb listings.
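The steps above can be sketched end to end with scikit-learn. This is an illustrative sketch on synthetic stand-in data, not the project's notebook: the three features, the price formula, the noise level and the random seeds are all assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; the real features come from the Airbnb dataset.
room_type = rng.choice(["Entire home/apt", "Private room", "Hotel room"], size=n)
bedrooms = rng.integers(1, 6, size=n)
baths = rng.integers(1, 4, size=n)
price = (60 * bedrooms + 40 * baths
         + 50 * (room_type == "Entire home/apt")
         + rng.normal(0, 20, size=n))

# Label-encode the categorical column, as described for room type.
room_code = LabelEncoder().fit_transform(room_type)
X = np.column_stack([room_code, bedrooms, baths])
feature_names = ["room_type", "bedrooms", "baths"]

# 80% of the rows for training, 20% for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# 100 trees, matching the number of estimators mentioned in the text.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
importances = dict(zip(feature_names, model.feature_importances_))
print(mse, r2, importances)
```

The importance scores sum to 1, so each value can be read as that feature's share of the total impurity reduction across the forest.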
In conclusion the Importance score is a quantitative measure indicating the contribution of each
feature to the model's ability to make accurate predictions. Higher importance scores suggest that
the feature is more influential in determining the target variable.
Model Implementation:
The Mean Squared Error (MSE) of 217710.29 indicates the average squared difference between
the predicted and actual prices. A lower MSE is generally preferable.
The R-squared (R2) score of 0.50 suggests that approximately 50.40% of the variability in the
target variable ('price') can be explained by the model. R-squared values typically range from 0 to 1,
where 1 indicates a perfect fit, so 0.50 indicates a comparatively good level of predictive power
relative to the other machine learning models we implemented here.
These metrics provide insights into the performance of the Random Forest Regression model. A
higher R2 would indicate a better-fitting model.
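Because the MSE is expressed in squared price units, it can be easier to read after taking the square root. Using the reported MSE, the root mean squared error (RMSE, a quantity introduced here for illustration and not reported in the project output) works out as:

```latex
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{217710.29} \approx 466.6
```

In other words, a typical prediction error is on the order of 467 in the same units as the price column, a figure that is strongly influenced by the few very expensive listings because errors are squared before averaging.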
Visualization:
The scatter plot visually represents the model's predictions against the actual prices. Each
point on the plot corresponds to a data point in the test set. The x-axis represents the actual prices,
while the y-axis represents the prices predicted by the Random Forest Regression model. The
points are scattered around the diagonal line, and their distance from it indicates the disparity
between the predicted and actual values. A more accurate model would have points closely aligned along the diagonal.
Here, most points follow the direction of the diagonal, which indicates a reasonably good
model.
Advantages of Random Forest regression:
High Predictive Accuracy: Random Forest tends to provide high predictive accuracy by
aggregating multiple decision trees, reducing the risk of overfitting.
Handles Non-Linearity: It can effectively model complex non-linear relationships in the data,
making it suitable for a wide range of regression tasks.
Feature Importance: The algorithm provides a feature importance score, helping identify the
most influential features in making predictions.
Robust to Outliers: Random Forest is robust to outliers and noise in the data due to its ensemble nature,
which averages out individual errors.
Disadvantages of Random Forest regression:
Computational Complexity: Training multiple decision trees and combining them can be
computationally expensive, especially for large datasets.
Memory Usage: The algorithm may consume significant memory, particularly when dealing
with many trees and features.
Not Suitable for Small Datasets: Random Forest may not perform well on small datasets, as it
requires enough data to capture complex relationships.
RECOMMENDATIONS
CLUSTERING
We applied K-means clustering to the Airbnb listing data to identify distinct clusters based on various
features. Feature importance scores were derived from the cluster centers, which the KMeans
class in sklearn.cluster exposes through its cluster_centers_ attribute after fitting. Based on these
scores, we took the top 7 features for clustering, since this count gave the best result among the
different feature counts we tried. Our analysis varied the number of
clusters (2, 3, 10, 15, 18) to understand the data distribution. To assess the
quality of the different cluster counts, we employed two key metrics: the Silhouette Score, which
measures how similar an object is to its own cluster compared to other clusters (higher is better), and the
Davies-Bouldin Index, which evaluates the compactness and separation of clusters (lower is better). The
evaluation indicated that 15 clusters yielded the best distribution
among the alternatives considered, despite some overlap between clusters.
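The evaluation loop over candidate cluster counts might look like the sketch below. It is an illustration under assumptions, not the project's code: the data comes from `make_blobs` as a synthetic stand-in for the seven selected listing features, and the seeds and `n_init` value are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the 7 selected listing features (NOT the project's data).
X, _ = make_blobs(n_samples=600, n_features=7, centers=5, random_state=0)

results = {}
for k in (2, 3, 10, 15, 18):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    # cluster_centers_ holds one row per cluster and one column per feature.
    results[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

for k, (sil, dbi) in results.items():
    print(f"k={k:2d}  silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")

# Higher silhouette is better; lower Davies-Bouldin is better.
best_by_silhouette = max(results, key=lambda k: results[k][0])
best_by_dbi = min(results, key=lambda k: results[k][1])
print("Best k by silhouette:", best_by_silhouette)
print("Best k by Davies-Bouldin:", best_by_dbi)
```

The two metrics do not always agree on the best cluster count, so it is worth inspecting both before settling on one.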
The feature importance is then calculated for each cluster, revealing the key attributes that
contribute to the differentiation between clusters. The top features are sorted based on their
importance scores. Additionally, a recommendation function is implemented using cosine
similarity, suggesting listings similar to a given input listing.
In practical terms, the clustering results can be used to categorize Airbnb listings
into groups with shared characteristics. For example, clusters might represent listings with
similar pricing patterns, review frequencies, and other relevant features. Given one property
id, the recommendation function returns the top 5 property ids with similar characteristics. The
recommendation function thus allows for personalized suggestions by identifying listings
similar to a chosen property. This information can be leveraged for targeted marketing,
pricing strategies, or providing tailored recommendations to users based on their preferences.
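A recommendation function of this kind can be sketched with the standard library alone. The listing ids, the three features, and the function names below are hypothetical. The report does not say whether candidates are restricted to the input listing's cluster, so this sketch simply ranks all other listings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(listing_id, features, top_n=5):
    """Return the ids of the top_n listings most similar to listing_id."""
    target = features[listing_id]
    scores = [
        (other_id, cosine_similarity(target, vec))
        for other_id, vec in features.items()
        if other_id != listing_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other_id for other_id, _ in scores[:top_n]]


# Hypothetical listings: id -> (scaled price, reviews per month, availability).
features = {
    101: (0.9, 0.1, 0.5),
    102: (0.8, 0.2, 0.5),
    103: (0.1, 0.9, 0.2),
    104: (0.2, 0.8, 0.1),
    105: (0.5, 0.5, 0.5),
    106: (0.9, 0.1, 0.4),
    107: (0.1, 0.1, 0.9),
}

print(recommend(101, features))  # [106, 102, 105, 107, 104]
```

Cosine similarity compares the direction of the feature vectors rather than their magnitude, so features should be on comparable scales before they are passed in.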
7.TESTING
The null hypothesis states that there is no significant difference in the mean price
between the top two neighborhoods. The test found no evidence against it, so we fail to
reject the null hypothesis.
With this information in hand, the prospective property buyer can now broaden their
focus beyond property price per night alone. Factors such as the overall locality, unit
pricing and other relevant considerations can be incorporated into the decision-making
process.
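The report does not state which statistical test was used. As one possible illustration of testing for a difference in mean price between two neighborhoods, the sketch below uses a two-sided permutation test written with the standard library. The prices, the function name, the number of permutations, and the 0.05 significance level are all assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign prices to the two groups at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical nightly prices for two neighborhoods (NOT the project's data).
prices_a = [120, 135, 150, 110, 145, 130, 160, 125]
prices_b = [128, 140, 138, 115, 150, 122, 155, 132]

p_value = permutation_test(prices_a, prices_b)
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")
```

A large p-value means the observed gap in mean price is well within what random reassignment of listings to neighborhoods would produce. It does not prove that the two means are equal.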
8.RESULTS AND SOLUTION EVALUATION
Comparing the Mean Squared Error (MSE) and R squared for all the machine learning
algorithms implemented, Random Forest Regression provides the lowest MSE
value and the highest R squared value, highlighting its effectiveness in explaining and capturing the
variability in the target variable.
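A comparison of this kind can be organized as a loop over candidate models, as in the sketch below. Only Random Forest is taken from the report. Linear Regression stands in for the other algorithms, and the data is synthetic with a deliberately non-linear target, so the printed scores say nothing about the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data with a deliberately non-linear target (NOT the project's data).
X = rng.uniform(-3, 3, size=(800, 2))
y = X[:, 0] ** 2 + np.sin(2 * X[:, 1]) + rng.normal(0, 0.2, size=800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit every model on the same split and record (MSE, R2) on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = (mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))
    print(f"{name}: MSE={scores[name][0]:.3f}, R2={scores[name][1]:.3f}")

# The preferred model has the lowest MSE (equivalently here, the highest R2).
best = min(scores, key=lambda name: scores[name][0])
print("Lowest MSE:", best)
```

Using the same train/test split for every model keeps the comparison fair, since all scores are computed on identical held-out rows.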
The ability of the Random Forest model to predict accommodation prices is
advantageous for various stakeholders in the real estate and hospitality industries. This
predictive capability is crucial for property owners, hosts, and investors to make informed
decisions regarding pricing strategies, optimizing rental income, and maximizing occupancy
rates. Additionally, the insights gained from feature importance analysis contribute to a deeper
understanding of the key factors influencing pricing, enabling data-driven decision-making, and
enhancing overall business strategies in the dynamic real estate landscape.
For all the algorithms, we built models with different numbers of features, compared
their performance, and tried different functions and methods for
calculating the feature importance scores. This report presents the best of
these results.
9.CONCLUSION
To conclude, comparing the MSE and R squared values for the given
dataset shows that non-linear regression models perform better than
linear regression algorithms.
This analysis provides valuable insights into the factors influencing property prices,
aiding property owners and customers in strategic decision-making. The application of
machine learning techniques demonstrated useful predictive capabilities for property
prices.
The insights obtained from this project contribute to the growing body of knowledge in
the field of machine learning for real-world applications and provide a foundation for
further refinement and optimization of pricing prediction models in the dynamic
context of real estate.