House Price Prediction Based On Machine Learning: A Case of King County

Advances in Economics, Business and Management Research, volume 648
Proceedings of the 2022 7th International Conference on Financial Innovation and Economic Development (ICFIED 2022)
Yijia Wang (1, †) and Qiaotong Zhao (2, *, †)
(1) Queen’s University. 39-2250 Rockingham Drive, L6H 6J3, Oakville, ON, Canada. Email: 18yw148@queensu.ca
(2) Civil Aviation University of China. No.2898, Jinbei Highway, Dongli District, Tianjin, China 300300. Email: zhaoqiaotong@163.com
* Corresponding author. Email: zhaoqiaotong@163.com
† These authors contributed equally.
ABSTRACT
This paper formulates a feasible method for house price prediction. It uses a dataset of house features and sale prices from King County in the US. During data preprocessing, extreme values are winsorized and highly correlated features are removed. Eight models, including Catboost, LightGBM and XGBoost, serve as candidates. They are evaluated with several indicators: root mean square error (RMSE), R-squared score, adjusted R-squared score and K-fold cross-validation score. A model is considered better if it has a low RMSE, high R-squared and adjusted R-squared scores (especially on the test set) and a high cross-validation score. Catboost performs best among all models and can be used for house price prediction. Location, living space and the condition of the house are the most important features influencing house price. A comparison with other papers indicates that these findings are consistent with real-world experience. This paper formulates a model that fits better than those in preceding studies and supplements the exploration of features that influence house price from a micro perspective.
Keywords: Catboost, House Price, King County, Prediction.
1. INTRODUCTION

The trend of house prices is always a controversial topic, as its fluctuation has a large effect on the entire economy. A rise in house prices means growth in non-financial assets, which ultimately increases personal wealth, stimulating household consumption and boosting the economy; a decrease in house prices, however, limits an individual’s borrowing capacity, crowding out investment as the value of collateral evaporates [1]. The shock to the global economy caused by the 2008 housing bubble illustrates the importance of a stable and measurable house price. The turmoil in house prices caused an unexpected rise in real long-term interest rates, bankruptcies of financial institutions and a global economic depression [2]. Although it is hard to control the house price, it is possible to predict it.

Many scholars have conducted research on this issue. For instance, Hirata et al. used time-series models to determine that house prices have become more synchronized over time, and the FAVAR model to find that global interest rate shocks have the most considerable influence on global house prices, especially in the US [3]. Shishir Mathur provided insight from a micro perspective in his report, stating that quality and size are two factors contributing to house price [4]. In his opinion, this can be explained through the perceived value of the house. Property assessors evaluate the size and quality of the house during value assessment for resale, which determines the value of the house. Property developers also take quality and size into consideration when they initially design and price a project. A bigger size and better quality bring a higher perceived value for assessors, developers and buyers alike. Mathur’s report also mentions another contributor: the level of maintenance. With increased investment in refurbishing before offering a house for sale, owners expect a higher sale price because of the value added through maintenance. Although prior studies make a valid analysis, they failed to discuss the simultaneous effect of those factors.
Copyright © 2022 The Authors. Published by Atlantis Press International B.V.

This is an open access article distributed under the CC BY-NC 4.0 license - http://creativecommons.org/licenses/by-nc/4.0/.
Some factors may contribute more to the results than others. Besides, their conclusions are based on theoretical knowledge and lack practical proof.

This paper goes beyond previous economic analysis and uses machine learning to explore county-wide house prices. It also assumes that the two factors mentioned by Shishir Mathur, size and quality, affect the house price, but it uses machine learning algorithms to test the relationship. Beyond these two factors, other factors, including location, the overall structure of the house and the grading from the agency, are also evaluated. Machine learning’s most obvious advantages are that it can automatically solve a wide range of problems and efficiently handle big datasets [5]. These two benefits allow us to test the assumptions by analyzing a large amount of historical data and taking multiple factors into consideration, so as to present a comprehensive model efficiently.

The research studies house prices in King County, US, over 2014 and 2015. According to data gathered by the Washington State government, King County had the highest estimated population among all counties in Washington in 2015, at 2,052,800 [6]. With a larger population, King County has more potential house buyers and higher housing demand; thus, its house price data are likely to be more complete and more precise. This paper employs different models, including Catboost, LightGBM, XGBoost, Random Forest and regressions, to identify the important influencers of house price. The best model is selected through training and testing, which allows us to obtain the most accurate result. The results identify important micro features in determining house price, including location, size and grading. This provides a guide for future house price prediction to consider not only macro effects but also micro factors.

The rest of the paper is organized as follows: section 2 and section 3 introduce the source of the data and the methodology used in the evaluation. The fourth section discusses the evaluation and the results. Finally, the conclusions are in section 5.

2. DATA

2.1. DESCRIPTION OF THE DATASET

The dataset used to predict the sale price of houses in King County comes from Kaggle. It includes 21613 observations of 20 house features and one house price column for homes sold between May 2014 and May 2015.

Among the 20 features, eight are continuous numerical variables that describe the area dimensions and the geographical location of the house. These continuous variables provide a basic view of the overall structure of the house. The rest of the attributes are discrete variables, which provide more detailed information on components of the house. Most of them quantify the number of items in the house, for instance, the number of bedrooms, bathrooms, waterfronts and floors. Some others indicate the background of the house, such as the year of building, the year of renovation and the previous selling price and date. Note that the values in the attribute “yr_renovated” are replaced by the difference between the year of renovation and the year of sale. Additionally, there are two evaluation scores: “Grade” and “Condition”. These two attributes grade the overall condition of the house based on different scales and standards [7].

2.2. DATA PREPROCESSING

A classic phrase in computing says “garbage in, garbage out” [8]. In other words, “good” data is the origin of high-quality analysis and project design. Foxwell identified four causes of data error: creation and pre-collection errors, collection errors, post-collection and analysis errors, and recording errors [9]. At this stage, this paper focuses on addressing collection errors in the dataset, especially the two most frequently mentioned ones: missing values and outliers. Since the count of each feature equals the total number of observations, the data does not have any missing values. However, there are outliers in the data. Comparing the price at the 75% quantile ($645000) with the maximum ($7700000), the feature “price” should have some outliers (Figure 1). Further analysis of the distribution of “price” shows a right-skewed distribution with a fat tail, which confirms the outliers. Thus, prices greater than the 99th percentile are replaced with the 99th percentile value. The same method is applied to the other numerical features.
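The upper-tail winsorization described above can be sketched as follows. This is a minimal illustration on synthetic, right-skewed data rather than the Kaggle file; the helper name `winsorize_upper` and the log-normal parameters are assumptions made for the example, not part of the original study.

```python
import numpy as np

def winsorize_upper(values, upper_pct=99.0):
    """Cap every value above the given percentile at that percentile."""
    values = np.asarray(values, dtype=float)
    cap = np.percentile(values, upper_pct)
    # Values at or below the cap are unchanged; larger ones become the cap.
    return np.minimum(values, cap), cap

# Synthetic right-skewed "prices" (illustrative only, not the real dataset).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=13.0, sigma=0.5, size=10_000)

capped, cap = winsorize_upper(prices, upper_pct=99.0)
print(f"99th percentile cap: {cap:,.0f}")
print(f"max before: {prices.max():,.0f}  max after: {capped.max():,.0f}")
print(f"values replaced: {(prices > cap).sum()}")
```

Only the upper tail is capped here, matching the text; a two-sided variant would also clip at a lower percentile.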
Figure 1 Distribution of the price feature.
After adjusting the dataset into a “good” version, the paper further explores the implicit meaning of the raw data. Firstly, the distribution of the features illustrates that most of the house prices collected in the data come from houses with 2 to 5 bedrooms (Figure 2), 1 or 2.5 bathrooms (Figure 3), 1 or 2 floors (Figure 4), and without waterfront and view (Figure 5). Houses with this structure appear to be the most attractive in the real estate market. Secondly, most houses have an overall condition at level 3 on a scale of 1 to 5 and an overall grade, given to the housing unit under the King County grading system, at level 7 on a scale of 1 to 11. This indicates that the properties in the data are mostly mid-level houses.

Figure 2 Number of bedrooms.
Figure 3 Number of bathrooms.
Figure 4 Number of floors.
Figure 5 Number of waterfronts & number of views.

The heatmap in Figure 6 shows that most of the data are concentrated in the west of King County, especially in Seattle. Data are rare in cities located in the east of the county, such as Snoqualmie and Skykomish. This is because most of the area in east King County is covered by forest.
Figure 6 Heatmap of house price: a) location of King County; b) the geographical distribution of the price.

2.3. FEATURE SELECTION

According to the correlation matrix (Figure 7), “sqft_living15” has a high correlation with “bathrooms” (0.57), “sqft_living” (0.76), “grade” (0.71), and “sqft_above” (0.73). Therefore, this feature is dropped to avoid multicollinearity and increase the accuracy of the result. Additionally, the feature “sqft_lot15”, which has a similar meaning to “sqft_living15”, is also dropped. Lastly, because “id” does not have any noticeable relationship with house price, it is deleted.

Figure 7 Correlation matrix among features.

3. METHODOLOGIES

This paper tests eight regression models, as implemented in the scikit-learn, XGBoost, Catboost and LightGBM packages of Python. The models are multiple linear regression, polynomial regression, lasso regression, ridge regression, random forest regression, XGBoost regression, LightGBM regression and Catboost regression.

3.1. MODELS

3.1.1. MULTIPLE LINEAR REGRESSION AND POLYNOMIAL REGRESSION

In multiple linear regression, the output depends on the inputs $x_1, x_2, \dots, x_n$ and is determined once the coefficients $\theta_0, \theta_1, \dots, \theta_n$ are chosen [10]. It can be represented as:

$$f(X) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \qquad (1)$$

where $X = [x_1, x_2, \dots, x_n]$.

Similarly, the polynomial regression model can be written as:

$$f(X) = \theta_0 + \theta_1 x_1^{1} + \theta_2 x_2^{2} + \cdots + \theta_n x_n^{i} \qquad (2)$$

where $i$ denotes the degree of independent variables [11].

3.1.2. RIDGE AND LASSO REGRESSION

In ridge regression, the goal is to minimize the following objective:

$$J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2 \qquad (3)$$

where $\alpha$ is a parameter used to balance the regularization term and the error. The second term is $l_2$ regularization, which helps avoid overfitting [12].

Similarly, in lasso regression, the goal is also to minimize an objective:

$$J(\theta) = MSE(\theta) + \alpha \sum_{i=1}^{n} |\theta_i| \qquad (4)$$
The second term is $l_1$ regularization.

3.1.3. RANDOM FOREST REGRESSION

Random forest searches for the best feature among a random subset of features. It trains the model for T rounds. The best feature in each random subset is used to split the node, and the combination of the resulting trees generates the strong learner $F(x)$.

3.1.4. XGBOOST AND LIGHTGBM REGRESSION

XGBoost is a method that originates from the gradient boosting decision tree (GBDT). Its objective is to find a function $f(x)$ that fits the residual error left by the previous trees so that the loss function is minimized [13]. LightGBM is another method based on GBDT. It adopts the histogram algorithm, which reduces the computational cost. It also adopts leaf-wise tree growth: every time the tree grows, it splits the node that performs the best and iterates this process. In addition, it limits the maximum depth to avoid overfitting [14].

3.1.5. CATBOOST REGRESSION

The model that outperforms the others is Catboost, and it is worth looking deeper into its algorithm. Catboost adopts the gradient boosting procedure. A sequence of approximations $F^t$ is built iteratively in a greedy fashion. $F^t$ is obtained from the following equation:

$$F^t = F^{t-1} + \alpha h^t \qquad (5)$$

where $\alpha$ is the step size, and $h^t$ is chosen to minimize the expected loss:

$$h^t = \arg\min_{h} L(F^{t-1} + h) = \arg\min_{h} E\, L(y, F^{t-1}(x) + h(x)) \qquad (6)$$

The solution to the problem is usually obtained by functional gradient descent. The gradient step $h^t$ is selected so that $h^t(x)$ approximates $-g^t(x, y)$, where $g^t(x, y) = \left.\frac{\partial L(y, s)}{\partial s}\right|_{s = F^{t-1}(x)}$. Often, the method used for the approximation is least squares:

$$h^t = \arg\min_{h} E\left(-g^t(x, y) - h(x)\right)^2 \qquad (7)$$

Catboost adopts a decision tree as its base predictor. The decision tree divides the feature space into disjoint regions according to the values of some splitting attributes $a$. Splitting attributes are usually binary ones that indicate whether some feature $x^k$ exceeds some threshold $t$. This can be written as $a = \mathbb{1}_{\{x^k > t\}}$, where $x^k$ is either a numerical or binary feature. Each final node (leaf) of the tree serves as the estimate of the response $y$. A decision tree, therefore, can be written as:

$$h(x) = \sum_{j=1}^{J} b_j \mathbb{1}_{\{x \in R_j\}} \qquad (8)$$

where $R_j$ are the disjoint regions corresponding to the leaves of the tree and $b_j$ is the value assigned to leaf $j$ [15].

Catboost makes use of a strategy named Ordered TS (Target Statistics) to prevent prediction shift. To realize this strategy, an artificial “time”, i.e., a random permutation $\sigma$ of the training examples, is introduced. Then, for each training example $x_k$, the target statistic is computed from $D_k = \{x_j : \sigma(j) < \sigma(k)\}$, while $D_k = D$ is used for a test example, where $D$ is the full training dataset. This strategy not only uses all the training data for the learning model but also satisfies the following property [15]:

$$E(\hat{x}^i \mid y = v) = E(\hat{x}^i_k \mid y_k = v) \qquad (9)$$

where $(x_k, y_k)$ is the k-th training example and $\hat{x}^i$ denotes the target statistic computed for the i-th categorical feature.

3.2. EVALUATION INDICATORS

After preprocessing the data, we fitted the models to the data and obtained the outcomes. To evaluate the models, we picked several statistical indicators.

The first indicator is the Root Mean Square Error (RMSE). It measures the precision of a regression model and is calculated as:

$$RMSE(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(h(X^{(i)}) - y^{(i)}\right)^2} \qquad (10)$$

where $m$ is the number of instances in the dataset, $X^{(i)}$ is a vector of all feature values of the i-th instance, $y^{(i)}$ is the target value of that instance, $X$ is a matrix containing all feature values and $h$ is the system’s prediction function.

The second indicator is R-squared. The greater the value of R-squared, the better the model fits. The maximum value of R-squared is 1. R-squared is calculated in the following way:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \qquad (11)$$

where $R^2$ is the value of R-squared, $y_i$ is the true value of the target observation, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean value of the target vector.

Third, we selected the adjusted R-squared (adjusted $R^2$) because there is a problem with $R^2$: when the total number of features increases, $R^2$ will also increase, regardless of whether the added variable is indeed closely related to the target variable. Adjusted $R^2$ can be denoted as:

$$Adjusted\ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_i (y_i - \bar{y})^2 / (n - 1)} \qquad (12)$$

where $p$ is the number of variables and $n$ is the number of instances. In addition, we adopted K-fold cross-validation, as demonstrated in Figure 8.
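The three indicators in Eqs. (10) to (12) can be computed directly from predictions. The following is a minimal NumPy sketch; the function names and the four-point toy example are illustrative assumptions and do not come from the paper's code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (10)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, Eq. (11)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def adjusted_r_squared(y_true, y_pred, p):
    """Adjusted R-squared, Eq. (12); p is the number of features."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1)))

# Toy example: residuals are 1, 0, -1, 0 and the target mean is 6.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_true, y_pred))                     # sqrt(2/4), about 0.707
print(r_squared(y_true, y_pred))                # 1 - 2/20 = 0.9
print(adjusted_r_squared(y_true, y_pred, p=1))  # 1 - (2/2)/(20/3) = 0.85
```

In the toy example the adjusted score (0.85) is lower than the plain score (0.9), which is the penalty for model size that motivates Eq. (12).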
Figure 8 K-fold cross-validation.

As the name indicates, K-fold cross-validation first divides the whole dataset into K parts with the same amount of data. One part then serves as the test set, and the remaining parts serve as the training set. The test set is used to evaluate the model and see whether it predicts well, while the training set is used to fit the model to the data. Specifically, the first part serves as the test set and the remaining K-1 parts serve as the training set. Then the second part is used as the test set, then the third, and so on. The procedure iterates K times until the Kth part has served as the test set.

4. EXPERIMENTS AND RESULTS

4.1. MODEL EVALUATION AND SELECTION

After determining the candidate models as well as the evaluation indicators, we processed the data with Python. During the evaluation process, $R^2$ and adjusted $R^2$ are calculated not only for the training set but also for the test set, helping us to see clearly how the models perform on each of the two sets. K-fold cross-validation is conducted on the whole dataset to holistically assess how well each model performs. The results are arranged in descending order by K-fold cross-validation score, where K = 5 in the experiment. Table 1 illustrates the results.
Table 1. Model Evaluation Results.

Model          Details     RMSE      R2 (training)  Adjusted R2 (training)  R2 (test)  Adjusted R2 (test)  5-Fold Cross-Validation
Catboost       -           95163.23  0.954          0.954                   0.912      0.911               0.91
LightGBM       -           101269.9  0.938          0.938                   0.9        0.9                 0.898
XGBoost        -           103746.9  0.969          0.969                   0.895      0.895               0.893
Random forest  -           108767.6  0.984          0.984                   0.885      0.884               0.878
Polynomial     degree=2    141416.5  0.807          0.805                   0.805      0.796               0.8
Polynomial     degree=3    148391.8  0.842          0.829                   0.786      0.69                0.791
Multiple       -           166881.3  0.724          0.723                   0.729      0.728               0.721
Ridge          alpha=1     166877.5  0.724          0.723                   0.729      0.728               0.721
Lasso          alpha=1     166881.1  0.724          0.723                   0.729      0.728               0.721
Lasso          alpha=100   166871.1  0.724          0.723                   0.729      0.728               0.721
Lasso          alpha=1000  167236.7  0.722          0.722                   0.728      0.727               0.719
Ridge          alpha=100   167943.5  0.719          0.719                   0.725      0.724               0.717
Ridge          alpha=1000  177437.5  0.685          0.685                   0.693      0.692               0.683

Notes: alpha stands for the regularization parameter, degree stands for the highest degree of the polynomial regression, and the name in the column ‘Model’ stands for the kind of regression; for example, Catboost stands for Catboost regression.
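The 5-fold protocol behind the last column of Table 1 can be sketched as follows. This is a hedged illustration only: it uses synthetic data, an ordinary least-squares model standing in for the candidate models, shuffled folds (the paper does not say whether the data were shuffled), and the per-fold $R^2$ averaged as the score, which is an assumption about how the reported score was computed. All function names are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the row indices and split them into k nearly equal folds."""
    order = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(order, k)

def fit_linear(X, y):
    """Ordinary least squares with an intercept (stand-in candidate model)."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def predict_linear(theta, X):
    return theta[0] + X @ theta[1:]

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cross_val_r2(X, y, k=5):
    """Each fold serves once as the test set; the other k-1 folds train."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        theta = fit_linear(X[train_idx], y[train_idx])
        scores.append(r_squared(y[test_idx], predict_linear(theta, X[test_idx])))
    return np.array(scores)

# Synthetic regression problem (illustrative only, not the King County data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=500)

scores = cross_val_r2(X, y, k=5)
print("per-fold R2:", scores.round(3))
print("mean R2:", round(float(scores.mean()), 3))
```

Any model exposing a fit and a predict step could replace the least-squares pair, which is how the same loop would apply to each row of Table 1.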
From Table 1, it is not difficult to conclude that the Catboost regressor performs the best among all models. It has an RMSE of 95163.23 and is the only model with an RMSE of less than 100,000. On the test-set $R^2$ score, the test-set adjusted $R^2$ score and the 5-fold cross-validation score, Catboost stands out from the candidate models as well. Catboost demonstrates a strong capability for precise prediction and shows only a modest gap between its training and test $R^2$ scores (0.954 versus 0.912); therefore, Catboost is selected as the final model used to predict house prices.

The hyperparameters in the model are set by default. Here, we discuss some of the most commonly used hyperparameters. In the model, ‘iterations’, which is the maximum number of trees, is set to 1000. ‘Learning rate’ is set to 0.03. ‘Depth’, the maximum depth of the tree, is 6. ‘Class_weights’, which determines the weight of each category and is highly useful in hierarchical training with unbalanced data, is set to None.

It is worth noting that the model that obtains the highest average $R^2$ score on the training sets is the Random Forest regressor, which achieves an $R^2$ score of 0.984. However, when it comes to the average $R^2$ score on the test sets, its performance is less ideal: the average $R^2$ score drops to only 0.885. It is reasonable to suspect a slight problem of overfitting with the Random Forest regressor.

4.2. IMPORTANT FEATURES FOR DETERMINING THE HOUSE PRICE

This section explores which features have the most influence on the outcome of the model. The graph in Figure 9 shows the feature importance generated through Catboost.
Figure 9 Feature importance graph of Catboost regressor model.
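A ranking like the one in Figure 9 could be obtained roughly as follows. This is a sketch under stated assumptions: it relies on the third-party `catboost` package, the hyperparameter values are those reported in the text, and the helper name `fit_and_rank_features` and its arguments are hypothetical. `class_weights` is omitted because it applies to classification rather than regression.

```python
# Hyperparameter values as reported in the text (described there as defaults).
CATBOOST_PARAMS = {
    "iterations": 1000,     # maximum number of trees
    "learning_rate": 0.03,
    "depth": 6,             # maximum tree depth
}

def fit_and_rank_features(X_train, y_train, feature_names):
    """Fit a CatBoostRegressor and return (name, importance) pairs, largest first."""
    # Imported lazily so the configuration above can be inspected even when
    # the third-party `catboost` package is not installed.
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**CATBOOST_PARAMS, verbose=False)
    model.fit(X_train, y_train)
    importances = model.get_feature_importance()
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return ranked
```

Called with the preprocessed training features and prices, the function would return the features ordered by importance, which is the information the bar chart in Figure 9 conveys.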
Some features that get a high score in feature importance are worth discussing. The first one is location. Note that although latitude ranks first among all features, longitude also gets a high score and should be taken into consideration. After all, the combination of latitude and longitude represents the location of the house and thus influences the house price. This conforms to real life: location is commonly a determining factor for house price. For instance, a place with convenient public transportation usually sells for a higher price. Likewise, a place near parks or lakes is priced highly for its surroundings. The second important factor is the area of living space. It is not surprising for ‘sqft_living’ to rank here because houses are typically priced at a certain amount of money per square meter or square foot. Therefore, the larger the house, the more customers need to pay. The third important factor is the grade of the house. It is reasonable that a house in good condition will be more attractive to consumers, resulting in a higher sale price. The feature importance outcome, in general, is highly compatible with common expectations. House sellers ought to pay more attention to these features to gain more revenue and attract customers.

4.3. FURTHER DISCUSSION

This paper mainly focuses on the prediction of house prices from a comparatively micro perspective. Properties of the houses are used to determine how the houses are priced and which factors are important in influencing house prices. A similar approach to searching for the determinants of house prices has been used in a good number of papers. This section compares and contrasts the findings of those papers.

According to Mathur, who also studied house prices in King County in the United States, the size and quality of a house matter the most for determining its price [4]. Bigger size and better quality bring about a higher estimated value for assessors. Such a finding conforms to our outcome, where living space and grade rank among the top three. He also finds that a higher level of maintenance makes the house appreciate. Scholars studying other areas also contribute to this topic. Zakaria and Fatine conducted research on the determinants of real estate prices in Morocco. They find that two factors most significantly determine the house price: the surface area and the location of the real estate [16]. These two factors rank second and first, respectively, in our findings. Selim studied the determinants of house prices in Turkey. Taking even more properties into account, he concludes that the condition of the water system, whether the house has a swimming pool, and the type of the house (what material the house is made of) are the most important factors [17]. These factors seem to contradict the previous findings. However, if inspected carefully, they are, to some extent, related to the grade of a house. Besides, he mentions that the number of rooms and the locational characteristics are also important. These factors are compatible with the findings in this paper.

All the literature mentioned above addresses house price prediction and the determination of important factors from a micro perspective. Extant literature supports the validity of our paper’s findings. Though there are some slight differences, the general outcome is quite similar. House location, living space and the condition of the house are indeed among the most essential micro-level features that determine how a property will be priced.

5. CONCLUSIONS

In this paper, the issue of house price prediction is explored using a case from King County in the United States. In order to eliminate the problems that exist in the original dataset, this paper not only winsorizes the extreme values in numerical features like ‘price’, but also calculates the correlation coefficients and removes highly correlated features, including ‘sqft_living15’ and ‘sqft_lot15’, to ensure a precise prediction. Then, several models are fitted to the data. They are assessed with a variety of evaluation indicators including RMSE, $R^2$ score, adjusted $R^2$ score and cross-validation score. Among the models, Catboost outperforms all the others and becomes the selected model because it achieves the highest $R^2$ score and adjusted $R^2$ score on the test set and ranks first in cross-validation score. The final model and the corresponding essential factors are subsequently derived through Python coding. A comparison with related literature is also conducted to complete a further discussion of the topic.

From the research, we obtain the following conclusions. First, Catboost serves as the best model for our house price prediction. It not only gets the highest scores in the model assessment and makes sensible predictions, but also shows little sign of overfitting. Second, the most important micro-level factors that influence house prices are location, living space and the condition of the house. Such a finding conforms to common sense.

The innovations of this paper are summarized as follows. First and foremost, this paper adopts Catboost to predict house prices. This approach achieves better prediction precision compared to extant research papers on the same issue. In addition, this paper focuses on house price prediction from a micro perspective rather than the macro perspective used by more scholars. This provides an essential supplement to research on house price prediction.

Despite the merits above, this paper still has some drawbacks. First, it does not cover macroeconomic factors. If they were taken into consideration, the results might be closer to real-life situations. Besides, this paper conducts a case study of King County in the US. For other areas that are not similar to King County, additional study is probably needed.

ACKNOWLEDGMENTS

Copyright © 2021 by the authors. This is an open-access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

REFERENCES

[1] C. Daniel, “House price fluctuations: The role of housing wealth as borrowing collateral,” The Review of Economics and Statistics, vol. 95, 2021.
[2] H. Mayer, T. Sinai, “Assessing high house prices: Bubbles, fundamentals and misperceptions,” The Journal of Economic Perspectives, vol. 19, pp. 67-92, 2005.
[3] H. Hirata, M. Kose, C. Otrok, M. Terrones, “Global House Price Fluctuations: Synchronization and Determinants,” NBER International Seminar on Macroeconomics, vol. 9, no. 1, pp. 119-166, 2013.
[4] S. Mathur, “House price impacts of construction quality and level of maintenance on a regional housing market: Evidence from King County,” Housing and Society, vol. 46, no. 2, pp. 57-80, 2019.
[5] T. Van, “Exploring the Advantages and Disadvantages of Machine Learning.”
[6] Washington State Office of Financial Management. April 1, 2021 Population of Cities, Towns, and Counties.
[7] https://www.kaggle.com/harlfoxem/housesalesprediction.
[8] R. Geiger, D. Cope, J. Ip, et al., “‘Garbage in, garbage out’ revisited: What do machine learning application papers report about human-labeled training data,” Quantitative Science Studies, vol. 2, no. 3, pp. 795-827, 2021.
[9] H. Foxwell, Creating good data: a guide to dataset structure and data representation, 1st ed. 2020.
[10] Q. Luu, M. Lau, S. Ng, and T. Chen, “Testing multiple linear regression systems with metamorphic testing,” Journal of Systems and Software, vol. 182, December 2021.
[11] P. Jenny, A. Cifuentes, et al., “Towards a mathematical framework to inform neural network modelling via polynomial regression,” Neural Networks, vol. 142, pp. 57-72, October 2021.
[12] L. Panzonea, A. Ulph, et al., “A ridge regression approach to estimate the relationship between landfill taxation and waste collection and disposal in England,” Waste Management, vol. 129, pp. 95-110, June 2021.
[13] T. Chen, and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in the 22nd ACM SIGKDD International Conference, 2016.
[14] Q. Meng, “Lightgbm: A highly efficient gradient boosting decision tree,” 2017.
[15] L. Prokhorenkova, G. Gusev, et al., “CatBoost: unbiased boosting with categorical features,” 2017.
[16] F. Filali, A. Fatine, “Towards the hedonic modelling and determinants of real estates price in Morocco,” Social Sciences & Humanities Open, vol. 4, no. 1, 2021.
[17] H. Selim, “Determinants of house prices in Turkey: Hedonic regression versus artificial neural network,” Expert Systems with Applications, vol. 36, no. 2, Part 2, March 2021.
