House Price Prediction
INTRODUCTION
Machine learning is a subset of Artificial Intelligence (AI) that attempts to train computers
with new knowledge through the input of data, such as texts, images, numerical values, and so
on, and supports their interaction with other computer networks. According to (Feggella, 2019),
machine learning is about ‘the science of getting computers to learn and act like humans do,
and improve their learning over time in autonomous fashion, by feeding them data and
artificial intelligence (AI) that endows systems with the capability to repeatedly learn and
improve from experience without being explicitly programmed. Machine learning focuses
on the development of computer programs that can access data and use it to learn
for themselves. The process of learning starts with the interpretation of data, such as examples,
direct experience, or instruction, so as to look for patterns in data and make better
judgments in the future based on the examples provided. The primary aim is to
Machine learning is one of the cutting-edge techniques that can be used to identify,
interpret, and analyze hugely complicated data structures and patterns (Ngiam & Khor,
2019). It is one of the most effective methods for prediction (Harrington, 2018).
A house is a single-unit residential building which may range in complexity from
outfitted with plumbing, electrical, and heating, ventilation and air-conditioning systems.
Houses use a range of different roofing systems to keep precipitation such as rain from
getting into the dwelling space (Webster, 2021). A house is one of the most important needs
of man. It provides shelter and wellbeing for people; in some cases, some houses depict the
Accurately estimating the price of a house is an important problem for many stakeholders
including house owners, house buyers, agents, creditors, and investors. It is also a difficult
one. Though it is common knowledge that factors such as the size, number of rooms and
location affect the price, there are many other things at play. Additionally, prices are
sensitive to changes in market demand and the peculiarities of each situation, such as
One possible explanation for the relative increase in house prices is a simple income
effect, or non-homogeneity in preferences. Changes in preference rates also impact housing
There is some evidence that the increase in housing wealth does not stem from an
increase in the value of houses, but rather from the increase in the value of the land
upon which they are built. First, a price index that includes the value of land, the
Conventional Mortgage Home Price Index, has increased approximately 0.75% faster
than indexes that do not, such as the Census's Composite Construction Cost index, on
The current house-buying process is expensive, as the buyer has to pay a commission to an agent in
order for the agent to source properties, bid at auction and generally represent the buyer throughout the
buying process.
The manual process of purchasing a house is time-consuming (Seidor, 2018), as a buyer has
to move from place to place in search of the house he wishes to buy, and the seller or agent may also
have to go in search of an interested buyer; the time taken for both to meet their
House sellers have to estimate the worth of a house based on its characteristics or
features in comparison with the existing market prices of similar houses; this will be hectic as
Manual purchasing of a house is associated with the risk of the interested buyer falling into
the hands of scammers; house scammers pose as house owners or real estate agents,
The aim is to develop a system that uses machine learning algorithms to predict efficient
house prices for house buyers with respect to their budgets and priorities.
1. Design a model that predicts house prices so as to minimize the problems faced
by the customer.
3. Evaluate the functionality of the system using the Mean Absolute Error (MAE), Mean
Squared Error (MSE), and Root Mean Squared Error (RMSE) performance metrics.
Predicting house prices is expected to help people who plan to buy a house to know the price
range in the future, so that they can plan their finances properly. In addition, house
price predictions are also beneficial for property investors who want to know the trend of housing
prices.
The system will help customers to invest in a home without approaching an agent. It will save
time and energy, and it will be easily accessible anytime, anywhere. The system will also
save the real estate agent the stress of travelling from one place to the other in vain.
The system will be implemented using machine learning algorithms that will train the
system. A number of house attributes will be required by the system from the user, which will
The system will be used only within Makurdi metropolis, Benue State, Nigeria. It will be
used by house sellers, buyers and agents. The locations in the system will be those of the
streets in Makurdi.
The house sellers will use the system to predict the price and communicate it back to a
customer who may want to make enquiries directly from them; using the system to predict
the price will save the time spent getting back to the customer.
The buyers will use the system to find out the price of a house they may be interested in,
which will enable them to work properly on their budget before contacting the seller.
1. House: A house is a single unit residential building which may range in complexity
material, outfitted with plumbing, electrical, and heating, ventilation and air-
2. Price: price is the amount of money that has to be paid to acquire a given product. It
3. Prediction: prediction is a forecast. It is a statement about the future, sometimes
the idea that systems can learn from data, identify patterns and make decisions with
finding the correlation between variables and enables prediction of continuous output
CHAPTER TWO
LITERATURE REVIEW
A house is one of the basic needs of human existence. It protects us from the vagaries of nature,
from threats, natural or otherwise. A house provides a sense of security and wellbeing, along
with an economic standing in society. A house is not only a mere physical structure but also a
symbol of power, authority and a host of other things that come along with it. Nowadays a
house is no longer treated as something that is just a shelter but has metamorphosed into a
Houses come in various styles, forms and shapes; from mansions to bungalows to terraces.
These different types of houses all have some uniqueness about them. Over the years, some
As it is everywhere, safe, affordable housing is a basic necessity for every family (Tracy, 2019).
a. Bungalow: A bungalow is typically a one-story home, cottage or cabin. Bungalows are
generally small in size when it comes to square meters. They are inexpensive to build and
easy to maintain compared to other buildings; as a result, they provide an affordable home
apartment building. What differentiates a penthouse from other apartments is its luxury
features or elements.
c. Mansion: A mansion is a very big luxury home, often upwards of 5,000 square feet in size.
However, what actually qualifies as a mansion depends on opinion and location. Calling a
home a mansion indicates a level of grandeur, style, and quality far above the normal in a
given area. This correlates with the housing patterns of the rich and wealthy. The size of a
property, as well as the number of rooms and bathrooms, all play a part in defining what a
mansion is, but there are further defining features, such as entertainment facilities, leisure
space, and luxury finishing. So things like grand staircases, crystal chandeliers, big open
foyers, massive gardens, swimming pools, tennis courts, home automation, and various
d. Apartments or flats: Apartments or flats are among the most popular forms of housing in
the world today. An apartment building is a combination of many separate homes stacked
on top and next to each other. Each apartment acts as its own dwelling or living space. As
each apartment forms just a section of the overall building, it offers less privacy than
alternative types of housing. By owning a flat, a person has access to common areas such
e. Terraced house: A terrace house, townhouse or row house is a single-family home that is
usually set over two or three floors. A terrace house sits side-by-side with other terraces
(joined together), forming a tight row down a road or block or the inside of a gated estate.
They are designed to accommodate as many people as possible in densely populated cities.
Terraces are generally more affordable than detached and semi-detached homes.
family home that shares a single wall with the next house. This style differentiates it from
terraced houses (with shared walls on both sides), and detached houses (with no shared
walls). You can find single-story semi-detached homes (bungalows) and semis that are
spread over two floors. Building costs are typically lower than those of a fully detached
house, thus you tend to get extra space for your money (or the same thing for less).
g. Detached house: Another popular type of house is the detached house. Detached houses
are associated with the rich and famous, and feature some pretty impressive architecture.
Although not as large as mansions, they are often just as luxurious, depending on tastes
and how they are furnished. They are usually spread over two or more floors and feature
their own private gates leading onto a private compound. Detached houses are great for
people who like their privacy and for those with larger families.
h. Duplex: A duplex can be thought of as a house where two different units (on two different
floors) are stacked on one another like apartments. In essence, a duplex is like having two
different houses in the same home. As a rule of thumb, duplexes have at least two floors
and you can find functioning apartments on each floor. Thus, this type of home is perfect
for bigger families. Similar structures with three or four housing units or floors are called
triplex or fourplex.
i. Traditional houses: Traditional houses in Nigeria are generally found in the more rural
parts of the country. These houses reflect the traditional house building techniques and
styles of the various ethnic groups of Nigeria. The primary building materials used in
constructing traditional houses include wood, straw, stones, and mud. Although they are
far cheaper to build, these homes are often outdated and are far less convenient for modern
life. Consequently, they are continuously being replaced (especially in areas close to
When a buyer is looking to acquire an already built house, it is important to know which type
2.2 House Price Prediction
A house is not only the basic need of a man but today it also represents the riches and
property values do not decline rapidly. Changes in the house price can affect various
household investors, bankers, policymakers, and others. The housing sector
seems to be an attractive choice for investment. The relationship between house prices and
the economy is an important motivating factor for predicting house prices. House price
trends are not only the concern of buyers and sellers; they also indicate the current
economic situation. Therefore, it is important to predict house prices without bias to help
both buyers and sellers make their decisions (Wu, 2017). Thus, the predicted house price is an
important economic index. Prediction involves considering some major features of a house and
coming up with an estimate which may not be exactly the actual price of the house but is
close to it.
A house is based on the idea of subjectivity and mutual connection with the person who lives
in it and within a right architectural scope, resulting in a good arrangement of the internal
There are three main factors which determine house prices: features, concept and
location. However, house prices can also be explained as a general income function (Imran et al., 2021).
1. Features: The features of a house are its physical attributes; they
include the number of bathrooms, the roof style, the roof material and so on.
2. Concept: The concept of a house is the house style and can be the purpose of the
house. A house can be used for many purposes, such as a place to live for shelter, for
3. Location: The location of a house is the place or position where the house is situated.
Locations range from developed through developing to underdeveloped areas. Houses
located in developed areas tend to have higher prices compared to houses in less
developed areas.
Regression is a supervised machine learning technique which helps in finding the correlation
between variables and enables prediction of a continuous output variable based on one or
Regression is concerned with specifying the relationship between a single numeric dependent
variable (the value to be predicted) and one or more numeric independent variables (the
predictors). As the name implies the dependent variable depends on the value of the
independent variable or variables. The simplest forms of regression assume that the
relationship between the independent and dependent variables follows a straight line.
It is mainly used for prediction, forecasting, time series modeling, and determining the causal
Regression comprises several algorithms, but for the purpose of this study only six will be
discussed.
1. Linear Regression
Linear Regression is a supervised machine learning model that attempts to model a linear
relationship between dependent variables (Y) and independent variables (X). The
Y = a0 + a1X + ε
Y = Dependent Variable
X = Independent Variable
a0 = Intercept of the line
a1 = Slope of the line (regression coefficient)
ε = Random Error
When the linear regression algorithm is implemented, it starts by finding the best-fit line using a0
and a1, so that the line comes as close as possible to the actual data points. Once the values of a0
and a1 are known, the model can be used for predicting the response (Khushbu & Suniti,
2018).
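As an illustration of how a0 and a1 can be found, the sketch below fits the best-fit line by ordinary least squares for a single independent variable. The function name `fit_simple_linear_regression` and the data are assumptions made for this example and are not taken from the cited work.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a0, a1) of the least-squares line Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1


# Hypothetical data: the price is exactly 1 + 2 * rooms, so the error term is zero.
rooms = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_simple_linear_regression(rooms, prices)
predicted_price = a0 + a1 * 5.0   # prediction for a five-room house
print(a0, a1, predicted_price)
```

With real data the points do not fall exactly on the line, and the residuals play the role of the random error ε.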
2. Lasso Regression
The word “LASSO” denotes Least Absolute Shrinkage and Selection Operator. Lasso
regression uses a regularization technique to make predictions. It is often preferred
over other regression methods because it can give accurate predictions. The lasso regression
model uses a shrinkage technique, in which the data values are shrunk towards a
central point, similar to the concept of the mean. The lasso regression algorithm favours
simple, sparse models (i.e. models with fewer parameters), which makes it well suited to models or
data showing high levels of multicollinearity, or to cases where we would like to automate certain parts
of model selection, such as variable selection or parameter elimination using feature engineering.
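The shrinkage described above can be sketched for the simplest possible case: one centred feature and no intercept. Under these assumptions, and assuming the objective (1/2n) Σ (y − w·x)² + α·|w|, the lasso coefficient has a closed form based on soft-thresholding. The function names and data below are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_single_feature(xs, ys, alpha):
    """Coefficient w minimising (1/2n) * sum((y - w*x)^2) + alpha * |w|."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n   # feature-target correlation term
    z = sum(x * x for x in xs) / n                 # feature scale
    return soft_threshold(rho, alpha) / z


xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]
w_none = lasso_single_feature(xs, ys, alpha=0.0)    # no penalty: ordinary least squares
w_some = lasso_single_feature(xs, ys, alpha=0.5)    # coefficient is shrunk
w_strong = lasso_single_feature(xs, ys, alpha=2.0)  # coefficient becomes exactly zero
print(w_none, w_some, w_strong)
```

A coefficient that is driven exactly to zero removes its feature from the model, which is why lasso produces sparse models.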
The Lasso Regression algorithm utilizes the L1 regularization technique. It is taken into consideration
when there is a large number of features because it automatically performs feature selection,
3. Ridge Regression
Ridge Regression is another type of regression algorithm and is usually considered when
there is a high correlation between the independent variables or model parameters. As the
value of correlation increases, the least squares estimates evaluate to unbiased values. But if the
collinearity in the dataset is very high, there can be some bias value. Therefore, a bias matrix
which the model is less susceptible to overfitting and hence the model works well even if the
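A minimal sketch of the ridge idea, assuming a single centred feature, no intercept, and the objective Σ (y − w·x)² + λ·w²: the penalty λ is added to the denominator of the least-squares solution, which shrinks the coefficient and trades a little bias for lower variance. The function name `ridge_slope` and the data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Coefficient w minimising sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]
w_ols = ridge_slope(xs, ys, lam=0.0)     # lam = 0 gives ordinary least squares
w_mild = ridge_slope(xs, ys, lam=2.0)    # coefficient is shrunk
w_strong = ridge_slope(xs, ys, lam=6.0)  # shrunk further, but never exactly zero
print(w_ols, w_mild, w_strong)
```

Unlike lasso, ridge shrinks coefficients towards zero without setting them exactly to zero, so it does not perform feature selection.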
4. Decision Tree Regression
A decision tree builds regression or classification models in the form of a tree structure. It
breaks down a dataset into smaller and smaller subsets while at the same time an associated
nodes and leaf nodes. A decision node has two or more branches, each representing values
for the attribute tested. A leaf node represents a decision on the numerical target. The topmost
decision node in a tree, which corresponds to the best predictor, is called the root node. Decision
trees can handle both categorical and numerical data (Burcu & Ipek, 2020).
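How a single decision node might be chosen can be sketched as follows. The split criterion is an assumption, since the text above does not name one: the example uses the common choice for regression trees, namely the threshold that minimises the sum of squared errors of the two resulting subsets, with each leaf predicting the mean of its subset. The function names and data are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) of the best single split on xs."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sum_squared_error(left) + sum_squared_error(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("xs must contain at least two distinct values")
    return best[1:]


# Hypothetical data: small houses are cheap, large houses are expensive.
rooms = [1.0, 2.0, 3.0, 4.0]
prices = [10.0, 12.0, 30.0, 32.0]
threshold, left_mean, right_mean = best_split(rooms, prices)
print(threshold, left_mean, right_mean)  # 2.5 11.0 31.0
```

A full tree repeats this search recursively on each subset until a stopping rule is met; the first split found corresponds to the root node.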
5. Extra Trees Regression
Extremely Randomized Trees, or Extra Trees for short, is an ensemble machine learning
algorithm. The Extra Trees algorithm works by creating a large number of unpruned decision
trees from the training dataset. Predictions are made by averaging the prediction of the
decision trees in the case of regression or using majority voting in the case of classification.
The predictions of the trees are aggregated to yield the final prediction, by majority vote in
The random selection of split points makes the decision trees in the ensemble less correlated,
although this increases the variance of the algorithm. This increase in variance can be
countered by increasing the number of trees used in the ensemble (Ernest et al., 2020).
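The two ideas above, random split points and averaging over many trees, can be sketched in a deliberately simplified form. The example below is not the full Extra Trees algorithm: it assumes a single feature and uses depth-one trees (stumps), each with a threshold drawn uniformly at random between the smallest and largest feature value. All names and data are hypothetical.

```python
import random


def random_stump(xs, ys, rng):
    """Build a depth-one tree whose split threshold is chosen at random."""
    threshold = rng.uniform(min(xs), max(xs))
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    overall_mean = sum(ys) / len(ys)
    # An empty side falls back to the overall mean of the training targets.
    left_mean = sum(left) / len(left) if left else overall_mean
    right_mean = sum(right) / len(right) if right else overall_mean
    return threshold, left_mean, right_mean


def ensemble_predict(stumps, x):
    """Average the predictions of all stumps (the regression case)."""
    predictions = [left if x <= t else right for t, left, right in stumps]
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rooms = [1.0, 2.0, 3.0, 4.0]
prices = [10.0, 12.0, 30.0, 32.0]
stumps = [random_stump(rooms, prices, rng) for _ in range(50)]
low = ensemble_predict(stumps, 1.0)
high = ensemble_predict(stumps, 4.0)
print(low, high)
```

Each individual stump is a crude predictor, but averaging 50 of them smooths out the randomness of the individual split points, which mirrors the variance argument made above.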
6. KNeighbors Regression
The algorithm uses a weighted average of the k nearest neighbors, weighted by the inverse of
approximates the association between independent variables and the continuous outcome
needs to be set by the analyst or can be chosen using cross-validation to select the size that
minimizes the mean-squared error. The distance to the kth nearest neighbor can also be seen
as a local density estimate and thus is also a popular outlier score in anomaly detection. The
larger the distance to the k-NN, the lower the local density and the more likely the query point is
an outlier. This outlier model, along with another classic data mining method, local outlier
factor, also works well in comparison to more recent and more complex approaches,
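The weighted average used by k-nearest-neighbours regression can be sketched as below, assuming Euclidean distance and weights equal to the inverse of the distance. The function name `knn_predict` and the data are hypothetical; a query that coincides with a training point simply returns that point's target, which avoids division by zero.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict a target as the inverse-distance weighted mean of the k nearest neighbours."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    nearest = sorted((distance(x, query), y) for x, y in zip(train_x, train_y))[:k]
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)  # query matches a training point exactly
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Hypothetical one-feature data: number of rooms and price.
train_x = [(1.0,), (2.0,), (4.0,), (10.0,)]
train_y = [10.0, 20.0, 40.0, 100.0]
between = knn_predict(train_x, train_y, (3.0,), k=2)  # neighbours at 2 and 4 rooms
on_point = knn_predict(train_x, train_y, (1.0,), k=2)  # exact match with a training house
print(between, on_point)  # 30.0 10.0
```

In practice k would be chosen by cross-validation, as noted above, by repeating such predictions on held-out data for several values of k and keeping the value with the lowest mean squared error.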
Houses have become a necessity in this present age, not only for people looking to buy
a house but also for the people that sell these houses. According to (Shinde & Gawande, 2018),
there are different machine learning algorithms for predicting house prices, which many
researchers have compared in their research works, arriving at important results.
(Wu, 2017) used 16 principal components as inputs to Support Vector Regression. For the
feature selection experiment, fifteen features were selected. The experiment result showed
that there is no difference between the performance of feature selection and feature
extraction. Both achieve R-square scores of 0.86 after log transformation of the house price. The
best combination of parameters, which achieved the highest R-square, was obtained with Support
Vector Regression.
(Alfiyatin et al., 2017) present a prediction model based on regression analysis and particle
predict the NJOP price (Dependent Variable) in the city of Malang, based on factors such as
land area, NJOP land price, NJOP building price. PSO is a stochastic optimization technique
used for the selection of the influential variables. The results obtained show that the Regression
(Lu et al., 2018) examined creative feature engineering and proposed a hybrid Lasso and
Gradient boosting regression model that promises better prediction. They did many
iterations of feature engineering to find the optimal number
of features that will improve the prediction performance. Furthermore, they used Lasso for
feature selection to remove the unused features and found that fewer features provide the best
(Babu & Chandran, 2019) expressed that there is a need to use a mix of models; a linear
model gives a high bias (underfit) whereas a high model complexity-based model gives a
high variance (overfit). The outcome of this study can be used in the annual revision of the
guideline value of land which may add more revenue to the State Government while this
transaction is made.
(Satish et al., 2019) observed that their data set took more than one day to prepare. As
opposed to performing the computations sequentially, various processors could be utilized to
parallelize the computations involved, which might decrease the preparation time as well as the
prediction period. More functionalities could also be included in the model. (Chouthai et al., 2019) used a
data set of 100 houses with several parameters. They used 50 percent of the data set to train
the machine and 50 percent to test it. According to them, the results were truly
accurate, and they also tested the model with different parameters. Not using PSO makes it easier to
A study was carried out by (Ahmad & Nawar, 2020), in which they compared an
artificial neural network and multiple linear regression for prediction. In their study, the
impact of different morphological measures on live weight has been modelled by artificial
neural networks and multiple linear regression analyses. They used three different back-
and Scaled conjugate. They showed that ANN is more successful than multiple linear
(Kuvalekar et al., 2020) suggests that every single organization in today’s selling business is
need to simplify the process for a normal human being while providing the best results. In the
process of developing their model, various regression techniques were studied. SVM,
Random Forest, Linear regression, Multiple linear regression, Decision Tree Regressor and
KNN were all tested on the training data. However, the decision tree regressor provided high
accuracy in predicting house prices. The decision to choose an algorithm depends largely on
the size and type of data in the dataset used. The decision tree algorithm was well suited for
their database.
(Truong et al., 2020) investigated different models for housing price prediction. Three
different types of machine learning methods (Random Forest, XGBoost, and
LightGBM) and two machine learning techniques (Hybrid Regression and Stacked
Generalization Regression) were compared and analyzed for optimal solutions. Even though
all of those methods achieved desirable results, they found that each model has its
advantages and limitations. The Random Forest method has the lowest error on the training
set but is prone to overfitting. Its time complexity is high since the dataset has to be fit
multiple times. XGBoost and LightGBM are decent methods when comparing accuracy,
but their time complexities are the best, especially LightGBM. The Hybrid Regression
method is simple but performs a lot better than the three previous methods due to the
architecture, but it is the best choice when accuracy is the top priority. Even though Hybrid
complexity must be taken into consideration since both of them contain Random Forest, a
high time complexity model. Stacked Generalization Regression also has K-fold cross-
selection for machine learning models in this study does improve performances
forecasting models in forecasting accuracy. Thus, the least squares support vector
(Levantesi & Piscopo, 2020) used random forest to predict house prices in London, and
discovered that despite the dataset size being small, the numerical results show a better
Generalized Linear Models. (Warnia & Muhammed, 2020) proposed to use machine learning
and artificial intelligence techniques to develop an algorithm that can predict housing prices
based on certain input features. The business application of this algorithm is that classified
websites can directly use this algorithm to predict prices of new properties that are going to
be listed by taking some input variables and predicting the correct and justified price, i.e.,
avoiding price inputs from customers and thus not letting any error creep into the
system.
(Sivasankar et al., 2020) compared Random Forest Regression, Decision Tree Regression,
Algorithms, using Scores and Root Mean Square Error (RMSE), and it was found that the
Decision Tree Regression algorithm has the highest RMSE; since a higher RMSE indicates larger
prediction errors, this shows that the Decision Tree algorithm predicts less accurately than the
other algorithms that were compared. (Mohd et al., 2020) compared Linear Regression, Decision Tree, Random
Forest, Ridge and Lasso algorithms and found that the best accuracy was provided by the
Random Forest Regressor, followed by the Decision Tree Regressor. A similar result is
generated by Ridge and Linear Regression, with a very slight reduction for Lasso. Across
all groups of feature selections, there is no extreme difference between the algorithms, regardless of
strong or weak groups. This is a good sign that the buying prices can be solely used for
predicting the selling prices without considering other features, which reduces model
overfitting. Additionally, a reduction in accuracy is apparent in the very weak features group. The
same pattern of results is visible in the Root Mean Square Error (RMSE) for all feature
selections. (Thamarai et al., 2020) experimented with the most fundamental machine learning
algorithms like decision tree classifier, decision tree regression, and multiple linear
(Priya et al., 2021) considered the main macroeconomic parameters that affect house
price variation. They used a back propagation neural network (BPN) and a radial basis
function neural network (RBF) to establish a nonlinear model for real estate price
variation prediction. The dataset was taken from Taipei, Taiwan based on leading and
obtained from them are compared to public Cathay House Price Index or the Sinyi Home
Price Index. The two error metrics used were Mean Absolute Error (MAE) and Root Mean
Squared Error (RMSE). When the prediction results were compared to Cathay House Price
Index, RBF Neural Network showed better prediction results than BPN Neural Network.
Similarly, for Sinyi Home Price Index BPN Neural Network showed better prediction results
than RBF Neural Network. Some research articles describe the in-depth methods and
procedures to collect the real estate data and their pre-processing techniques.
(Peng et al., 2021) used Support vector regression, Decision tree, Regression-Particle Swarm
Optimization and LUCE algorithms for prediction. LUCE addresses two critical issues of
property valuation: the lack of recent sold prices and the sparsity of house data. Experimental
results show that LUCE consistently outperforms prior automated house valuation methods.
(Ho et al., 2021) used 18 years of housing property data to train models utilising
stochastic gradient descent based support vector regression, random forest and gradient
boosting machine. They demonstrated that advanced machine learning algorithms can
metrics. Given the dataset used in the paper, the main conclusion was that Random Forest
and Gradient Boosting Machine are able to generate comparably accurate price estimations
(Dabreo et al., 2021), also stated the importance of an automated system of purchasing a
house. They mentioned how advantageous it would be if buyers do not have to roam around
simply for the purpose of buying a house. Comparing XGBoost, Random Forest, Decision
tree and Linear Regression Algorithms, it was concluded that XGBoost will be the best
Table 1: Summary of Related Works
| SN | Author | Title | Algorithm | Result |
| 1 | (Wu, 2017) | Housing Price Prediction Using Support Vector Regression | Support Vector Regression | The experiment result showed that there is no difference between the performance of feature selection and feature extraction. |
| 2 | (Alfiyatin et al., 2017) | Modeling House Price Prediction using Regression Analysis and Particle Swarm Optimization | Regression analysis and Particle Swarm Optimization | Accuracy was tested using Mean Absolute Percentage Error. Regression: 4.84552; PSO: 0.73255. The result shows that the Regression algorithm was more accurate. |
| 3 | (Lu et al., 2018) | A Hybrid Regression Technique for House Prices Prediction | Lasso Regression and Ridge algorithms | RMSE was used for evaluating accuracy. Ridge: 0.112276; Lasso: 0.113838, which shows that there is no significant difference between the two algorithms. |
| 4 | (Babu & Chandran, 2019) | Literature Review on Real Estate Value Prediction Using Machine Learning | Multiple Regression, Neural Network, Linear Regression, Support Vector Regression, k-Nearest Neighbours, Random Forest Regression | It is found that four factors, viz. GLV (84%), silver price per gram (92%), population (86%) and cost of crude oil (88%), have more positive effect on land price. |
| 5 | (Satish et al., 2019) | House Price Prediction Using Machine Learning | Linear Regression, Lasso Regression, Gradient Boosting Regression | The algorithms were tested using accuracy score. Lasso: 76.14994569; Gradient Boosting: 91.27202689; Linear Regression: 76.15709644. The result shows that the Gradient Boosting algorithm has the highest accuracy. |
| 6 | (Ahmad & Nawar, 2020) | House Price Prediction | Multiple linear regression, Lasso Regression, Ridge Regression, Random Forest Regression, Artificial Neural Network | Accuracy was determined using R-square, where the closer the score is to 1, the better the data is fitted by the model. Multiple Linear: 0.6971; Lasso: 0.6953; Ridge: 0.6966; Random Forest: 0.8555; ANN: 0.6593. |
| 7 | (Kuvalekar et al., 2020) | House Price Forecasting Using Machine Learning | SVM, Random Forest, Linear regression, Multiple linear regression, Decision Tree Regressor, KNN | The Decision tree regressor provided high accuracy in predicting house prices. The decision to choose an algorithm depends largely on the size and type of data in the dataset used. The decision tree algorithm was well suited for the database. |
| 8 | (Truong et al., 2020) | Housing Price Prediction via Improved Machine Learning Techniques | Random Forest, XGBoost, LightGBM, Hybrid Regression and Stacked Generalization Regression | RMSE was used to evaluate the performance. Random Forest: 0.12980; Extreme Gradient Boosting: 0.16118; LightGBM: 0.16687; Hybrid Regression: 0.149690; Stacked Generalization Regression: 0.16404. Random forest is found to perform best. |
| 9 | (Pai & Wang, 2020) | Using Machine Learning Models and Actual Transaction Data for Predicting Real Estate Prices | Least Squares Support Vector Regression, Classification and Regression Trees, General Regression Neural Networks, Backpropagation Neural Networks | The algorithms were evaluated using the Mean Absolute Percentage Error. Least Squares Support Vector Regression: 1.676; Classification and Regression Trees: 2.2944; General Regression Neural Networks: 22.8936; Backpropagation Neural Networks: 15.0357. BNN is good but LSSVR and CRT |
| 10 | (Levantesi & Piscopo, 2020) | The Importance of Economic Variables on London Real Estate Market: A Random Forest Approach | Random Forest | The Mean Absolute Percentage Error was used to evaluate the performance of the Random Forest model; it was 1.68%, which shows that the model is good for the system. |
| 11 | (Sivasankar et al., 2020) | House Price Prediction | Random Forest Regression, Decision Tree Regression, Ridge Regression, LASSO Regression, Ada-Boost, XGBoost | The threshold value of RMSE was set as 0.12. Random Forest: 0.1356; Decision Tree: 0.2048; Ridge Regression: 0.1179; LASSO Regression: 0.118; Ada-Boost: 0.1707; XGBoost: 0.1135. The algorithms with RMSE less than 0.12 were integrated (Ridge, Lasso, XGBoost regression). |
| 12 | (Mohd et al., 2020) | Machine learning building price prediction with green building determinant | Linear Regression, Decision Tree, Random Forest, Ridge and Lasso algorithms | Random Forest: 0.027; Decision Tree: 0.053; Ridge: 0.048; Linear Regression: 0.048; Lasso: 0.045. Based on RMSE, the Random Forest has the best performance. |
| 13 | (Thamarai et al., 2020) | House Price Prediction Modeling Using Machine Learning | Decision tree classifier, decision tree regression and multiple linear regression | RMSE for Multiple Linear Regression: 2.462792680479472; RMSE for Decision Tree: 2.57390753524675; RMSE for Decision tree classifier: 0.7071067811865476. From the result, the multiple linear regression has the lower value of RMSE, which makes it better than others. |
| 14 | (Priya et al., 2021) | Prediction of Property Price and Possibility Prediction Using Machine Learning | Back propagation neural network (BPN) and Radial basis function neural network (RBF) | Machine learning algorithms have different performance when used in different datasets. |
| 15 | (Peng et al., 2021) | Lifelong Property Price Prediction: A Case Study for the Toronto Real Estate Market | Support vector regression, Decision tree, Regression-Particle Swarm Optimization and LUCE | Experimental results show that LUCE consistently outperforms prior automated house valuation methods. |
| 16 | (Ho et al., 2021) | Predicting property prices with machine learning algorithms | Support Vector Regression, Random Forest and Gradient Boosting Machine | The R squared value was used as the performance metric. Support Vector Machine: 0.82715; Random Forest: 0.90333; Gradient Boosting Machine: 0.90365. The result shows that the GBM is better than the other algorithms. |
| 17 | (Dabreo et al., 2021) | Real Estate Price Prediction | XGBoost, Random Forest, Decision tree, Linear Regression | XGBoost: 3.06; Random Forest: 3.38; Decision tree: 4.189; Linear Regression: 4.22. In general, the best accuracy was provided by XGBoost. |
When people first think of buying a house, they tend to go online to study price trends.
They do this to find a house that has everything they need, noting the prices attached to
such houses along the way. Automating this part of the house-buying process is therefore of
great importance. Many researchers have used machine learning algorithms to predict house
prices, but most of them repeatedly used the same algorithms. This system makes use of six (6)
machine learning algorithms which, from the research reviewed, have not been compared
simultaneously before: Linear Regression, Lasso, Ridge, Extra Trees Regressor, Decision Tree
Regressor, and KNeighbors Regressor. The intention is to select the most accurate algorithm
with the aid of evaluation metrics.
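As a rough illustration of such a comparison, the sketch below fits the six regressors named above with scikit-learn and ranks them by R-squared. It assumes synthetic data from make_regression in place of the study's dataset, default hyperparameters and an 80/20 split; none of these settings are taken from the study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the study's Kaggle dataset is not used here.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The six algorithms named in the text, all with default settings (assumption).
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "Extra Trees Regressor": ExtraTreesRegressor(random_state=0),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "KNeighbors Regressor": KNeighborsRegressor(),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
    }

# Pick the model with the highest R-squared on the test split.
best = max(scores, key=lambda n: scores[n]["r2"])
for name, s in scores.items():
    print(f"{name}: R2={s['r2']:.3f}, MAE={s['mae']:.2f}")
print("Best by R2:", best)
```

The ranking produced on synthetic data says nothing about the ranking on real house data; the point is only the shape of the comparison loop.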
CHAPTER THREE
System analysis is the study of a business problem domain in order to recommend improvements
and specify the business requirements and priorities for the solution. It involves analyzing
and understanding a problem, identifying alternative solutions, choosing the best course of
action and then designing the chosen solution. It also involves determining how existing
systems work and the problems associated with them. Before a new system can be designed, it
is necessary to study the system that is to be improved upon or replaced, if there is any.
System analysis is conducted to study a system or its parts in order to identify its
objectives. It specifies what the system should do, and it involves the collection of data,
the examination of an already existing solution and the building of the logical model of the
system.
The research method adopted in this study is the System Development Life Cycle (SDLC). The
Waterfall Model is the SDLC approach that was used for the software development. In the
waterfall approach, the whole process of development is divided into separate phases, and
the outcome of one phase serves as the input for the next phase.
Fact finding is the formal process of collecting data and information about the system. Facts
included in any information system can be tested based on three steps: the data set used to
create useful information, the process functions that perform the objectives and the interface
designs used to interact with users. This study gathers facts through the following methods to
analyze the system expectations.
The fact-finding technique employed in this study is observation. This study adopts the
secondary method of data collection. The dataset used was obtained from the online machine
learning repository of Kaggle. The dataset has eighteen (18) attributes, seventeen (17) of
which are feature attributes and one (1) is the predicted value. Samples of 13,321 houses
were collected; after cleaning, 6,877 samples were left to train the model.
The data is available in a CSV (Comma Separated Values) file that can easily be loaded into
calculation of house prices, and it also involves the house buyer meeting with the owner or
agent one on one. Either the buyer or the seller may have to travel in order to meet with the
other to make a decision.
While searching for a house, the buyer has to contact various estate agents. The problem with
this is that it is time consuming: the seller has to calculate the price of the house
manually, which may take a considerable amount of time because of the factors that have to be
put into consideration.
Agents need to be paid a fraction of the amount just for finding a house and setting a price
tag for it. In most cases, this price tag is blindly accepted by people because they have no
other options. There may also be cases in which the agent and the seller have a secret
dealing, and the buyer is sold an overpriced house without his/her knowledge.
3.1.4 Advantages of proposed system
The purpose of this system is to determine the price of houses by looking at the various
features which are given as input by the user. These features are passed to the machine
learning model, which gives out a prediction based on how the features affect the label.
This was done by first searching for an appropriate dataset that suits the needs of the
developer as well as the user. House prices increase every year, so there is a need for a
system to predict house prices. This system can help the seller determine the selling
price of a house and can help the customer to choose the right time to purchase a house.
System modelling is the process of developing abstract models of a system, with each model
presenting a different perspective of the system. It is all about representing a system using
graphical notation. Models help the analyst to understand the functionality of the system.
language in the field of software engineering, which is intended to provide a standard way to
visualize the design of a system. It can be used to model the structures of an application,
behaviors and even business processes. The central idea behind the usage of UML in this
research is to capture the significant details about the system, such that the problem will be
System architecture is the conceptual model that defines the structure, behaviour and
representation of a system. The architecture of the system is shown in the figure below:
Figure 1: Architecture of the system
The architecture shows that the dataset is first preprocessed, which involves transforming raw
data into an understandable format and checking for missing values. Preprocessing removes
rows or columns that have missing values due to mistakes that might have occurred when the
data was entered into the CSV file. This is important because it helps prevent runtime
errors, such as Not a Number (NaN) errors, that could prevent the system from working
effectively. The dataset is then normalized, which involves rescaling real-valued numeric
attributes into the range 0 to 1. The dataset is then divided into training and testing
datasets.
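The preprocessing steps described above can be sketched as follows. This is a minimal illustration, assuming a small hypothetical table in place of the real dataset and column names (total_sqft, bhk, price) chosen for the example. It also assumes that the scaler is fitted on the training rows only, a common precaution against information leaking from the test rows that the text itself does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy rows standing in for the house dataset (not the study's data).
raw = pd.DataFrame({
    "total_sqft": [1200.0, 1500.0, np.nan, 2400.0, 900.0, 1800.0, 2000.0],
    "bhk": [2, 3, 3, 4, 2, 3, 4],
    "price": [60.0, 85.0, 90.0, 150.0, 45.0, np.nan, 120.0],
})

# 1. Preprocess: drop every row that contains a missing value (NaN).
clean = raw.dropna().reset_index(drop=True)

# 2. Separate the input features from the value to be predicted.
X = clean[["total_sqft", "bhk"]]
y = clean["price"]

# 3. Divide the rows into training (80%) and testing (20%) subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4. Normalize: rescale each feature of the training rows to the range 0 to 1.
#    Assumption: the scaler learns min/max from the training rows only and is
#    then reused for the test rows, whose scaled values may fall outside 0 to 1.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(len(raw), "rows before cleaning,", len(clean), "rows after")
print("train rows:", len(X_train), "test rows:", len(X_test))
```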
A use case diagram is a representation of a user’s interaction with the system that shows the
relationship between the user and the different use cases in which the user is involved. Use
case diagrams are a way to capture the system’s functionality and requirements in UML
diagrams. It captures the dynamic behavior of a live system. A use case diagram consists of
use cases and actors. The system's use case diagram is shown below.
3.3 System Design
the business requirements identified in a system analysis. It gives the overall plan or model of
a system consisting of all specifications that give the system its form and structure i.e. the
The selected architectural design defines all the components that need to be developed,
communications with third party services, user flows and database communications as well as
front-end representations and behavior of each component. The design is usually kept in the
Input design is the process of converting a user-oriented description of the input into a
computer-based system. Input design facilitates the entry of data into the computer system.
In order for the proposed system to perform predictions, it requires specific features of the
house to be
9 Watersrc object The source of water in the house
A system must have one or more outputs. The output from a system is the justification for its
design. The objectives of a system are said to be achieved when the output of the system is
efficient and accurate. The output of the system is the predicted outcome of the house
Program design is the process of translating system requirements into a program that can be
executed on a computer system. The program was designed using Linear regression machine
learning algorithm;
diagram is a structure diagram that describes the structure of a system by showing the
systems classes, their attributes, methods and the relationships among the objects. The class
certain order to get the desired output. The flowchart below represents the algorithm used in
the study.
Figure 4: Flowchart for the proposed System
The dataset used to train the model is loaded into the system. The dataset is divided into a
training dataset, which is 80% of the total dataset, and a testing dataset, which is the
remaining 20%. Training the model involves identifying patterns in the data and creating a
function that attempts to map the input features to the output feature. The testing data is
used to determine how well the model has learnt from the training dataset and can carry out
prediction. The inputs supplied by the user first have to be arranged into a NumPy array
before they are passed to the model. The model then uses the function created during learning
to map the inputs to an output in the solution space, and sometimes even outside it,
depending on the values of the inputs supplied. This is what is
A module is a software component or part of a program that contains one or more routines.
The modules used in this study are explained in the table below.
SN  Module            Description
1   Pickle            It is used for saving a program's state data on disk so that the
                      program can resume where it left off when restarted. It is used
                      for minimizing the execution time.
2   Pandas            It is used to load and save the dataset.
3   Numpy             It is the core library for scientific computing in Python.
4   Train_Test_Split  It splits arrays or matrices into random train and test subsets.
5   Accuracy_Score    It is used to determine how accurate the classification is.
6   MinMax Scaler     It scales the values in the dataset down to the range 0 to 1. It
                      is used to prevent one feature from dominating other features.
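To show how Pickle and NumPy fit together at prediction time, the sketch below trains a small model, serializes it, restores it and predicts from a user input arranged as a NumPy array. The training rows, the feature order (total_sqft, bedrooms) and the choice of LinearRegression are assumptions made for the example, not the study's data or final model.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training rows: [total_sqft, bedrooms] -> price (made-up numbers).
X_train = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]], dtype=float)
y_train = np.array([50.0, 75.0, 100.0, 125.0])

model = LinearRegression().fit(X_train, y_train)

# Serialize the trained model so it does not have to be retrained on every start.
# pickle.dumps keeps the bytes in memory; pickle.dump(model, file) writes to disk.
saved_bytes = pickle.dumps(model)

# Later (for example when the web interface starts), restore the model.
restored = pickle.loads(saved_bytes)

# User inputs are arranged into a 2-D NumPy array: one row per house.
user_input = np.array([[1800, 3]], dtype=float)
predicted_price = restored.predict(user_input)[0]
print(round(float(predicted_price), 2))
```

Pickle files can execute arbitrary code when loaded, so only files produced by the system itself should be unpickled.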
The programming tools that were used in implementing this project are HTML, CSS,
documents designed to be displayed in a web browser. HTML was used to ensure the
proper formatting of text and images so that the web browser will display them as
b) Cascading Style Sheets: CSS is a style sheet language used for describing the
presentation of a document written in a markup language like HTML. CSS was used
to control how the web pages look. CSS was used to control the fonts, texts, colors,
programming language. The syntax in Python helps programmers code in fewer
study because it supports web applications, it has many built-in functions, it has
providing a wide range of essential tools for Python developers, integrated to create a
convenient environment for productive Python, web and data science development. It
e) Jupyter Notebook: Jupyter Notebook is an open source web application that allows
data scientists to create and share documents that integrate live code, equations and
explanatory text in a single document. In this study Jupyter was used to train the
model.
CHAPTER FOUR
4.1 Implementation
Implementation involves testing the system to verify if it meets the stated aim and
objectives. It also involves training users to handle the system and plan for a smooth
conversion.
Once the data was cleaned and insights were gained about the dataset, appropriate machine
learning models that fit the dataset were applied. Six regression algorithms were selected
to predict the dependent variable in the dataset. The six algorithms are Linear
Regression, Ridge Regression, Lasso Regression, Extra Trees
implemented with the help of Python's scikit-learn library. The predicted outputs
obtained from these algorithms were saved in a comma-separated values (CSV) file. This file was
After the design and coding of an application, it is imperative to run a test to ascertain that the
actual results match the expected results. There are different types of testing, but for the
purpose of this study, system testing was adopted. The purpose of system testing is to identify
When the code is executed and the model is trained, an interface is created and
connected to the trained model, such that a user can input the features of the desired house,
Test cases were carried out to verify some functionalities of the software. Each test case has
its objective and expected outcome which is represented in the table below.
4.3 Results
The software was executed as specified in Table 5 and Table 6. The outputs were evaluated
to determine the performance of the software. The results for the first test, HPPS1, are
shown in Table 7.
Four performance metrics, namely the R-squared value, Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE) and Mean Squared Error (MSE), were taken into consideration to
The R-squared value determines the proportion of variance in the dependent variable that
can be explained by the independent variables. The higher the value of the R-squared the
Root Mean Squared Error determines how well a regression model fits a dataset; the smaller
The Mean Absolute Error is the average of the absolute differences between the predicted
values and the actual values. A lower MAE is preferred to a higher one.
The Mean Squared Error is also used to determine how close predictions are to actual
values; the lower the MSE, the better the model (Umar, 2020).
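For concreteness, the four metrics can be computed by hand as in the sketch below. It uses the standard textbook definitions and four made-up actual and predicted prices; the function name regression_metrics is introduced here for illustration only.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MSE: mean of the squared errors; RMSE: its square root.
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # MAE: mean of the absolute errors.
    mae = sum(abs(e) for e in errors) / n

    # R-squared: 1 minus (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}


# Made-up actual and predicted prices, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Every prediction here is off by 10, so MSE is 100, RMSE and MAE are both 10, and R-squared is 1 - 400/12500 = 0.968. Note that RMSE is always the square root of MSE, which gives a quick consistency check on any reported results table.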
Decision Tree Regression 21083.89 136.44 57.53 0.52
From the above table it is clear that the Extra Trees Regression gives a higher accuracy and
In Table 6 the test was to ascertain the workability of the software: to determine if the
software can actually predict a house price and also to determine if a change in a feature can
Figure 5: Prediction of House Price
House prices according to change in Locations
Figure 9: Change in location to Mechanic Village affecting house price
Figure 11: Change in price for four bedrooms
House Style affecting House Price
Figure 15: Effects of a Duplex on House Price
Based on the results obtained from the tests carried out, the system was able to come out with
After comparing the models, it was found that Extra Trees Regression works best, with the
highest accuracy: it has the second-lowest MSE, after Linear Regression, and it has the
lowest RMSE and MAE and the highest R-squared value.
Next to the Extra Trees Regression is the Decision Tree Regression, which also has a relatively
low MSE, RMSE and MAE, and a relatively high R-squared value. The Lasso and Ridge Regression
gave almost the same results, which were not good enough compared to Extra Trees and
Decision Tree. Both R-squared values are less than 50% and according to the measurement they are
Linear Regression and KNeighbors Regression had very poor performance. Although the
Linear Regression model has the best MSE, it still does not measure up to standard compared
to the other models. Aside from the MSE, KNeighbors still performed better than the Linear
Regression model. This means that the Linear Regression model is the worst model according
Considering the result obtained from HPPS2, the system is able to predict house prices and
display them for the user to view. This proves that the system is working and has achieved
In HPPS3 the aim was to test whether a change in feature selection alters the price
depending on the features that were selected. This shows that the system does not only
predict a price; it predicts the house price according to the data it was trained with.
In HPPS3b, location as a feature was tested to determine whether the house price changes
with each location. Barracks Rd, G R A and Mechanic Village were tested, and for each
location that was selected all other features were left unchanged in order to be certain that
the change was not the result of another feature. It was found that the house price in G R A
was higher than the house price in Barracks Rd, while the house price in Mechanic Village
was lower than in the other locations. This shows that the same house styles with the same
Considering the results from HPPS3c, the number of bedrooms was tested, and the results
showed that the number of bedrooms also affects the house price. The test was carried out
for 3 bedrooms, 4 bedrooms and 6 bedrooms; for each number of bedrooms that was selected,
all other features were left unchanged in order to be sure that the change was not the result
of other features. It was found that the house price for 6 bedrooms was higher than the house
price for 4 bedrooms, while the house price for 3 bedrooms was lower than for the other
numbers of bedrooms. This shows that the same house with the same features in the same
location can have different prices when the number of bedrooms is increased or decreased.
HPPS3d was a test for the house style: it tested whether the house price changes with each
house style. For the test, a penthouse, a duplex and a bungalow were chosen to test the price
effect of house style. The results show that the price of a penthouse is greater than the
price of a duplex, while the price of a bungalow is lower than the price of a duplex.
The results from all the test cases show not only that the system predicts house prices but
also that the features tested have effects on the house price, and that the system in most
cases predicts prices in the expected order. For the number of bedrooms, when the number of
rooms is increased the price tends to increase, and when the number is reduced the price
tends to reduce too. For location, the system predicts a higher price when the location is
more developed. For house style, a penthouse has the highest price and a bungalow the
lowest, which is also the case in real-life situations.
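The one-feature-at-a-time procedure used in HPPS3b to HPPS3d can be sketched as below. The predict_price function is a hypothetical stand-in for the trained model, and its location premiums and coefficients are invented for the example; only the testing pattern (change one feature, hold the rest fixed, compare predictions) reflects the text.

```python
def predict_price(features):
    """Hypothetical stand-in for the trained model (made-up coefficients)."""
    location_premium = {"G R A": 30.0, "Barracks Rd": 20.0, "Mechanic Village": 10.0}
    return (
        location_premium[features["location"]]
        + 8.0 * features["bedrooms"]
        + 0.01 * features["total_sqft"]
    )


def vary_one_feature(base, feature, values, predictor):
    """Predict once per value of `feature`, leaving all other features unchanged."""
    results = {}
    for value in values:
        trial = dict(base)      # copy so the base house is never modified
        trial[feature] = value
        results[value] = predictor(trial)
    return results


# Base house: every test starts from the same feature values.
base_house = {"location": "Barracks Rd", "bedrooms": 3, "total_sqft": 1500}

by_bedrooms = vary_one_feature(base_house, "bedrooms", [3, 4, 6], predict_price)
by_location = vary_one_feature(
    base_house, "location", ["Barracks Rd", "G R A", "Mechanic Village"], predict_price
)

print(by_bedrooms)
print(by_location)
```

Because each trial copies the base house before changing a single key, any difference between two predictions can be attributed to the feature under test.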
Considering all the test results from HPPS1 to HPPS3d, it is clear that the system is in good
System requirements are the configuration that a system must have in order for the software
to run efficiently. Failure to meet these requirements can result in performance problems.
2. Python Flask
3. PyCharm IDE
4. Python interpreter
4.6 User Documentation
User documentation assists the users by providing them with clear and
comprehensible information about the software. For the house price prediction system, the
steps are:
In this system the appropriate changeover method to use is parallel running. In a parallel
running changeover, the new system is started, but the old system is kept running in
parallel for a while. All of the information that is entered into the old system is also
entered into the new system. Eventually the old system is phased out, but only when the new
system has been proven to work. This is advantageous because if the new system breaks down
at any time, the old system will act as a backup; the outputs from the old system can also be
compared with the outputs of the new system to ascertain its accuracy.
It will be much easier to start by running both the manual price prediction and the automated
price prediction because many house buyers or real estate agents may at first find it
challenging to wholly give in to the new system. And other house buyers who are computer
illiterates may
CHAPTER FIVE
5.1 CONCLUSION
machine learning can be used. This study is an exploratory attempt to use six machine
learning algorithms in predicting house prices, and then compare their results.
The study shows that machine learning algorithms can achieve accurate prediction of house
prices, as evaluated by the performance metrics. Given the dataset used in this study, the
conclusion is that Extra Trees Regression and Decision Tree are able to generate
accurate price predictions with lower prediction errors, compared with the Linear
The study has shown that machine learning algorithms are important tools for property
researchers to use in housing price predictions. However, these machine learning tools also
have limitations.
The choice of algorithm depends on a number of factors such as the size of the data set,
the computing power of the equipment, and the time available to wait for the results.
To conclude, machine learning is very useful for finding the relations between the attributes
and building a model according to those relations. House price prediction can be done using
regression algorithms, which are part of machine learning. House price prediction helps
customers to buy their dream house among the different price variations, attributes and
needs. The algorithm finds relations in the training data, and the result is applied to test
data, which will be the user's input. According to the attributes specified, the plans are
provided.
5.2 RECOMMENDATION
Owning a house is what every person wishes for. Therefore, using this system, people
can buy houses and real estate at their rightful prices and ensure that they do not get
tricked by dishonest agents.
This system is recommended for use in real estate because it will aid in giving accurate
predictions for setting prices, saving users a lot of hassle and time. The
system is apt enough in training itself and in predicting the prices from the raw data provided
to it.
Whenever a large dataset is involved and there is much categorical data, it is recommended that
REFERENCES
Aditya, A., (2017). Housing wealth and consumption, Vol. 107, No. 17. pp. 3415-3446.
Zakaria, I.A (2020). How to redefine the Concept of House within the Architectural
Modernity Framework. Housing Architecture Research.
APPENDICES
Code for training the model
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv('C:/ML/mkdhouses.csv')
data.drop(columns=['availability', 'society', 'balcony','Utilities'],inplace=True)
data['location'].value_counts()
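def convertRange(x):
    # Convert a value such as '1000-1200' to the midpoint of the range,
    # a plain number string to a float, and anything else to None.
    temp = x.split('-')
    if len(temp) == 2:
        return (float(temp[0]) + float(temp[1]))/2
    try:
        return float(x)
    except ValueError:
        return None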
data['location'] = data['location'].apply(lambda x: x.strip())
location_count= data['location'].value_counts()
data['location']=data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)
data = data[((data['total_sqft']/data['bhk']) >= 300)]
data.describe()
def remove_outliers_sqft(df):
    df_output = pd.DataFrame()
    for key,subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices,
                    bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error,mean_absolute_error
import math
print("LinearRegression: ", mean_absolute_error(y_test, y_pred_lr))
print("Lasso: ", mean_absolute_error(y_test, y_pred_lasso))
print("Ridge: ", mean_absolute_error(y_test, y_pred_ridge))
print("ExtraTreesRegressor: ", mean_absolute_error(y_test, y_pred_extratreesregressor))
print("DecisionTreeRegressor: ", mean_absolute_error(y_test, y_pred_decisiontreeregressor))
print("KNeighborsRegressor: ", mean_absolute_error(y_test, y_pred_kneighborsregressor))
HTML Code