House Price Prediction Based On Machine Learning: A Case of King County
Proceedings of the 2022 7th International Conference on Financial Innovation and Economic Development (ICFIED 2022)
ABSTRACT
This paper focuses on formulating a feasible method for house price prediction. A dataset containing features and house
prices from King County in the US is used. During data preprocessing, extreme values are winsorized and highly
correlated features are removed. Eight models, including Catboost, LightGBM and XGBoost, serve as candidate models.
They are evaluated with several indicators: root mean square error (RMSE), R-squared score, adjusted R-squared score
and K-fold cross-validation score. A model that has a low RMSE, achieves high R-squared and adjusted R-squared
scores, especially on the test set, and attains a high cross-validation score is considered a better model. This paper
finds that Catboost performs the best among all models and can be used for house price prediction. Location, living
space and condition of the house are the most important features influencing house price. Comparison with other
papers confirms that the findings in this paper conform to real life. This paper formulates a model that fits better
than those in preceding studies and supplements the exploration of features that influence house prices from a
micro-level perspective.
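
The evaluation indicators named above can be written out explicitly. The following is a minimal sketch in plain Python, assuming the standard textbook definitions of RMSE, R-squared and adjusted R-squared; the function names and the toy numbers are illustrative and are not taken from the paper's own code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy values (hypothetical, for illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

error = rmse(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
adj = adjusted_r2(r2, n_samples=len(y_true), n_features=1)
```

On these toy values the RMSE is roughly 0.707, the R-squared score is 0.6 and the adjusted R-squared score is roughly 0.4. The adjusted score is never higher than the plain one when at least one feature is used, which is why it is the stricter of the two when models with different numbers of features are compared.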
Advances in Economics, Business and Management Research, volume 648
3.1. MODELS
The second factor is 𝑙1 regularization.

3.1.3. RANDOM FOREST REGRESSION

Random forest searches for the best feature among a random set of features. It trains the model for T rounds. The best
feature in each random subset is used to split the node, and the combination of the resulting trees generates the
strong learner F(x).

Catboost makes use of a strategy named Ordered TS (Target Statistics) to prevent prediction shift. To realize this
strategy, an artificial "time", i.e., a random permutation 𝜎 of the training examples, is introduced. Then, we take
𝐷𝑘 = {𝑥𝑗 : 𝜎(𝑗) < 𝜎(𝑘)} for a training example and 𝐷𝑘 = 𝐷 for a test one, where 𝐷𝑘 is the dataset used to compute
the target statistic for example 𝑥𝑘. This strategy not only uses all the training data for the learning model but also
satisfies the following property:
Notes: alpha stands for regularization parameter, degree stands for the highest degree of polynomial regression, and
the name in the column ‘model’ stands for its kind of regression, for example, Catboost stands for Catboost regression.
From Table 1, it is clear that Catboost Regressor performs the best among all models. It has an RMSE of 95163.23 and
is the only model with an RMSE of less than 100,000. In terms of the R2 score, adjusted R2 score and 5-fold
cross-validation score, Catboost also stands out from the candidate models. Catboost demonstrates a strong capability
for precise prediction and does not show any tendency to overfit; therefore, Catboost is selected as the final model
used to predict house prices.

The hyperparameters in the model are left at their default values. Here, we discuss some of the most commonly used
ones. 'Iterations', the maximum number of trees, is set to 1000. 'Learning rate' is set to 0.03. 'Depth', the maximum
depth of a tree, is 6. 'Class_weights', which determines the weight of each category and is highly useful in
hierarchical training with unbalanced data, is set to None.

It is worth noting that the model that obtains the highest average R2 score on the training sets across iterations is
Random Forest Regressor, which achieves an R2 score of 0.984. However, its average R2 score on the test sets is less
ideal, dropping to only 0.885. It is therefore reasonable to suspect a slight overfitting problem with Random Forest
Regressor.

4.2. IMPORTANT FEATURES FOR DETERMINING THE HOUSE PRICE

This section explores which features have the most influence on the outcome of the model. The graph in Figure 9 shows
the feature importance generated through Catboost.
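
As a rough illustration of how such a model and its feature ranking could be obtained, the sketch below uses the `catboost` Python package with the hyperparameter values quoted above. It is an assumption that the paper's ranking corresponds to the package's default importance type; the helper name `rank_features` and its arguments are hypothetical and are not taken from the paper's code. The `class_weights` setting is simply left at its default.

```python
# Hyperparameter values as quoted in the text (stated there to be the defaults).
CATBOOST_PARAMS = {
    "iterations": 1000,     # maximum number of trees
    "learning_rate": 0.03,  # step size of each boosting round
    "depth": 6,             # maximum depth of each tree
}


def rank_features(X, y, feature_names):
    """Fit a CatBoost regressor and return (name, importance) pairs,
    sorted from most to least important.

    X is a table of numeric features (list of rows, array or DataFrame),
    y holds the house prices, and feature_names labels the columns of X.
    """
    # Third-party dependency, imported here so that the parameter
    # dictionary above can be inspected without catboost installed.
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**CATBOOST_PARAMS, verbose=0)
    model.fit(X, y)

    # Default importance type of the fitted model (an assumption about
    # how the figure in the paper was generated).
    importances = model.get_feature_importance()
    return sorted(zip(feature_names, importances),
                  key=lambda pair: pair[1], reverse=True)
```

Calling `rank_features` on the preprocessed King County features and the winsorized prices would return the columns ordered by importance, which is the kind of information a bar chart such as Figure 9 displays.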
figuring out the determinants of house prices in Turkey. Taking even more properties into account, he concludes that
the condition of the water system, whether the house has a swimming pool, and the type of the house (what material the
house is made of) are the most important factors [17]. These factors seem to contradict the previous findings. However,
on closer inspection, they are, to some extent, related to the grade of a house. Besides, he mentions that the number
of rooms and the locational characteristics are also important. These factors are compatible with our findings in this
paper.

All literature mentioned above addresses house price prediction and the determination of important factors from a
micro-level perspective. Extant literature effectively attests to the validity of our paper's findings. Though there
exist some slight differences, the general outcome is quite similar. House location, living space and the condition of
the house are indeed among the most essential micro-level features that determine how a certain accommodation will be
priced.

5. CONCLUSIONS

In this paper, the issue of house price prediction is explored using a case from King County in the United States. To
eliminate the problems that exist in the original dataset, this paper not only winsorizes the extreme values in
numerical features like 'price', but also calculates the correlation coefficients and removes highly correlated
features, including 'sqft_living15' and 'sqft_lot15', to ensure a precise prediction. Then, several models are fitted
to the data. They are assessed with a variety of evaluation indicators including RMSE, R2 score, adjusted R2 score and
cross-validation score. Among the models, Catboost outperforms all the others and becomes the selected model because it
achieves the highest R2 score and adjusted R2 score on the test set and ranks first in cross-validation score. The
final model and the corresponding essential factors are subsequently derived through Python coding. A comparison with
related literature is also conducted to complete the discussion of the topic.

From the research, we obtain the following conclusions. First, Catboost serves as the best model for our house price
prediction. It not only gets the highest score in model assessment and makes a sensible prediction, but also avoids
overfitting. Second, the most important micro-level factors that influence house prices are location, living space and
the condition of the house. Such a finding conforms closely to common sense.

The innovations of this essay are summarized as follows. First and foremost, this essay adopts Catboost to predict
house prices. This approach achieves better prediction precision compared to extant research papers on the same issue.
In addition, this essay focuses on house price prediction from a micro-level perspective rather than the macro-level
perspective used by more scholars. This brings about an essential supplement to research on house price prediction.

Despite the merits above, this essay still bears some slight drawbacks. First, this paper does not cover
macroeconomic factors. If they were taken into consideration, the results might be closer to real-life situations.
Besides, this paper conducts a case study of King County in the US. For other areas that are not similar to King
County, additional study is probably needed.

ACKNOWLEDGMENTS

Copyright © 2021 by the authors. This is an open-access article distributed under the Creative Commons Attribution
License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited (CC BY 4.0).

REFERENCES

[1] C. Daniel, “House price fluctuations: The role of housing wealth as borrowing collateral,” The Review of Economics and Statistics, vol. 95, 2021.
[2] H. Mayer, T. Sinai, “Assessing high house prices: Bubbles, fundamentals and misperceptions,” The Journal of Economic Perspectives, vol. 19, pp. 67-92, 2005.
[3] H. Hirata, M. Kose, C. Otrok, M. Terrones, “Global House Price Fluctuations: Synchronization and Determinants,” NBER International Seminar on Macroeconomics, vol. 9, no. 1, pp. 119-166, 2013.
[4] S. Mathur, “House price impacts of construction quality and level of maintenance on a regional housing market: Evidence from King County,” Housing and Society, vol. 46, no. 2, pp. 57-80, 2019.
[5] T. Van, “Exploring the Advantages and Disadvantages of Machine Learning.”
[6] Washington State Office of Financial Management, “April 1, 2021 Population of Cities, Towns, and Counties.”
[7] https://www.kaggle.com/harlfoxem/housesalesprediction.
[8] R. Geiger, D. Cope, J. Ip, et al., “‘Garbage in, garbage out’ revisited: What do machine learning application papers report about human-labeled training data,” Quantitative Science Studies, vol. 2, no. 3, pp. 795–827, 2021.
[9] H. Foxwell, Creating good data: a guide to dataset structure and data representation, 1st ed., 2020.
[10] Q. Luu, M. Lau, S. Ng, and T. Chen, “Testing multiple linear regression systems with metamorphic testing,” Journal of Systems and Software, vol. 182, December 2021.
[11] P. Jenny, A. Cifuentes, et al., “Towards a mathematical framework to inform neural network modelling via polynomial regression,” Neural Networks, vol. 142, pp. 57-72, October 2021.
[12] L. Panzonea, A. Ulph, et al., “A ridge regression approach to estimate the relationship between landfill taxation and waste collection and disposal in England,” Waste Management, vol. 129, pp. 95-110, June 2021.
[13] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in the 22nd ACM SIGKDD International Conference, 2016.
[14] Q. Meng, “Lightgbm: A highly efficient gradient boosting decision tree,” 2017.
[15] L. Prokhorenkova, G. Gusev, et al., “CatBoost: unbiased boosting with categorical features,” 2017.
[16] F. Filali, A. Fatine, “Towards the hedonic modelling and determinants of real estates price in Morocco,” Social Sciences & Humanities Open, vol. 4, no. 1, 2021.
[17] H. Selim, “Determinants of house prices in Turkey: Hedonic regression versus artificial neural network,” Expert Systems with Applications, vol. 36, no. 2, Part 2, March 2021.