Real Estate Price Prediction With Regression and Classification
1. Introduction

Housing prices are an important reflection of the economy, and housing price ranges are of great interest to both buyers and sellers. In this project, house prices are predicted from explanatory variables that cover many aspects of residential houses. As continuous values, house prices are predicted with various regression techniques, including Lasso, Ridge, SVM regression, and Random Forest regression; as individual price ranges, they are predicted with classification methods, including Naive Bayes, logistic regression, SVM classification, and Random Forest classification. We also perform PCA to improve prediction accuracy. The goal of this project is to create a regression model and a classification model that can accurately estimate the price of a house given its features.
2. Data and Preprocessing

The dataset contains the prices and features of residential houses sold from 2006 to 2010 in Ames, Iowa, obtained from the Ames Assessor's Office [1]. It consists of 1460 houses with sold prices and 79 house features. Although the dataset is relatively small, with only 1460 examples, it contains 79 features such as the areas of the houses, the types of the floors, and the numbers of bathrooms. This large number of features enables us to explore various techniques for predicting house prices.

The dataset contains features in various formats. It has numerical data, such as prices and the numbers of bathrooms/bedrooms/living rooms, as well as categorical features, such as the zoning classification of the sale, which can be 'Agricultural', 'Residential High Density', 'Residential Low Density', 'Residential Low Density Park', etc. To make data in these different formats usable by our algorithms, the categorical features were converted into separate indicator variables, which expands the number of features in the dataset; the final dataset has 288 features. We split the dataset into training and test sets with a roughly 70/30 split, giving 1000 training examples and 460 test examples. In addition, some features had N/A values; we replaced these with the mean of their columns so that they do not distort the feature distributions.
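The preprocessing described above can be sketched as follows, assuming a pandas/scikit-learn implementation; the file name, random seed, and variable names such as X_train are illustrative rather than taken from our actual code.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the Ames data (file name is illustrative).
    raw = pd.read_csv("ames_housing.csv")

    # One-hot encode the categorical columns into indicator variables;
    # this is what expands the 79 raw features to 288 columns.
    X = pd.get_dummies(raw.drop(columns=["SalePrice"]))

    # Replace missing values with the column mean.
    X = X.fillna(X.mean())
    y = raw["SalePrice"]

    # Roughly 70/30 split: 1000 training and 460 test examples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=1000, random_state=0)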
3. Models

We perform two types of supervised learning: classification and regression. While regression seems more natural, since house prices are continuous, classifying house prices into individual price ranges also provides helpful insight for users; in addition, it lets us explore techniques that are specific to regression or to classification. Since there are 288 features in the dataset, regularization is needed to prevent overfitting. To determine the regularization parameter, throughout the project, in both the classification and the regression parts, we first perform K-fold cross validation with k = 5 over a wide range of regularization parameters; this selects the best regularization parameter in the training phase. To further improve our models, we also run a principal component analysis (PCA) pipeline for all models and cross-validate the number of components fed into each model to obtain the best results.
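The tuning strategy above can be sketched as follows, assuming scikit-learn; the grids of regularization strengths and component counts are illustrative, and Ridge stands in for whichever model is being tuned.

    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # PCA followed by a regularized model; both the number of components
    # and the regularization strength are chosen by 5-fold cross validation.
    pipe = Pipeline([("pca", PCA()), ("model", Ridge())])
    param_grid = {
        "pca__n_components": [50, 100, 150, 200],
        "model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0],
    }
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)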
3.1 Classification

Data Preprocessing

The house prices were grouped into price buckets. Based on the distribution of housing prices in the dataset, the buckets were [0, 100K), [100K, 150K), [150K, 200K), [200K, 250K), [250K, 300K), [300K, 350K), and [350K, ∞), and we perform multi-class classification to predict house prices into these seven buckets. The performance of each model is characterized by the accuracy rate, i.e., the number of test examples classified correctly divided by the total number of test examples.
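The bucketing can be sketched as below, continuing the illustrative variables from Section 2 and assuming numpy; the resulting integer labels 0-6 index the seven buckets.

    import numpy as np

    # Bucket boundaries in dollars; np.digitize maps each sale price to the
    # index of its bucket (0 for < 100K, ..., 6 for >= 350K).
    edges = [100_000, 150_000, 200_000, 250_000, 300_000, 350_000]
    y_train_cls = np.digitize(y_train, edges)
    y_test_cls = np.digitize(y_test, edges)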
Models and Results

Our baseline model for classification is Naive Bayes. We implemented two variants: Gaussian Naive Bayes and Multinomial Naive Bayes. Our initial expectation was that the Multinomial variant might perform better than the Gaussian one, since most of the features are binary indicator values and only a minority are continuous. On the test set, Gaussian Naive Bayes achieved 21%
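A sketch of this baseline, assuming scikit-learn and the illustrative variables above (note that MultinomialNB requires non-negative feature values, which holds for the indicator and count/area features here):

    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import GaussianNB, MultinomialNB

    gnb = GaussianNB().fit(X_train, y_train_cls)
    mnb = MultinomialNB().fit(X_train, y_train_cls)
    print("Gaussian NB accuracy:", accuracy_score(y_test_cls, gnb.predict(X_test)))
    print("Multinomial NB accuracy:", accuracy_score(y_test_cls, mnb.predict(X_test)))

    # Average absolute difference between predicted and true bucket indexes.
    print("mean |bucket index error|:",
          np.mean(np.abs(mnb.predict(X_test) - y_test_cls)))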
accuracy while Multinomial Naive Bayes achieved 51%. Optimistically speaking, even the Gaussian Naive Bayes model performed better than random guessing (14%, or 1/7, with 7 price buckets). In addition, to better characterize how badly Multinomial Naive Bayes misclassifies, we assign each price bucket an index according to its order and compute the average absolute difference between the true and predicted indexes over all test examples, which can be viewed as a mean absolute error on the bucket indexes. Multinomial Naive Bayes had an average absolute difference of 0.689, which means that on average the predicted bucket index is off by less than one.

To improve our classification, we turned to multinomial logistic regression on the same dataset. We tuned the L2 regularization parameter using 5-fold cross validation (addressed in more detail later), and we also fit an intercept. Nevertheless, its performance was similar to Multinomial Naive Bayes: it achieved an accuracy of 50%, compared with 51% for Multinomial Naive Bayes. After tuning the parameters, the performance of both Naive Bayes and multinomial logistic regression appeared to be capped at around 50%.
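One way this step could look with scikit-learn; the grid of C values (inverse regularization strengths) is illustrative.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # L2-penalized multinomial logistic regression with an intercept;
    # C (inverse regularization strength) is tuned by 5-fold cross validation.
    logreg = GridSearchCV(
        LogisticRegression(penalty="l2", fit_intercept=True, max_iter=5000),
        {"C": [0.001, 0.01, 0.1, 1, 10]},
        cv=5,
    )
    logreg.fit(X_train, y_train_cls)
    print("accuracy:", logreg.score(X_test, y_test_cls))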
We continued to explore other models for our multiclass classification. One choice was Support Vector Machine classification (SVC), for which we tried both a linear kernel and a Gaussian kernel. As with multinomial logistic regression, we added an L2 regularization parameter and tuned it using cross validation. We found that the SVC with a linear kernel outperformed our previous models with an accuracy of 63%, while the SVC with a Gaussian kernel only achieved 41%.
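A sketch of the two SVC variants, again assuming scikit-learn with an illustrative grid over the regularization parameter C:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    for kernel in ["linear", "rbf"]:  # "rbf" is the Gaussian kernel
        clf = GridSearchCV(SVC(kernel=kernel), {"C": [0.1, 1, 10, 100]}, cv=5)
        clf.fit(X_train, y_train_cls)
        print(kernel, "accuracy:", clf.score(X_test, y_test_cls))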
At last, our final choice of classification model is random forest classification. One important parameter for controlling overfitting is the maximum depth that we allow the trees to grow; as with the L2 regularization parameters of multinomial logistic regression and SVC, we performed cross validation to tune this maximum depth parameter. After tuning, we obtained an accuracy of 67%, which is close to the SVC with a linear kernel.

So far, the SVC with a linear kernel and random forest classification have the best performance, with accuracies of 63% and 67%, respectively.
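A sketch of the random forest classifier with its maximum depth cross-validated; the number of trees and the depth grid are illustrative.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rf_clf = GridSearchCV(
        RandomForestClassifier(n_estimators=100, random_state=0),
        {"max_depth": [5, 10, 20, 50, None]},  # None lets trees grow fully
        cv=5,
    )
    rf_clf.fit(X_train, y_train_cls)
    print("accuracy:", rf_clf.score(X_test, y_test_cls))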
3.2 Regression

Data Preprocessing

Before fitting the regression models, we preprocessed the data by applying a log-transform to the skewed features, including the target variable SalePrice, so that they are closer to normally distributed.
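One way to carry out this transform, assuming numpy, scipy, and pandas; the skewness threshold of 0.75 is an illustrative choice, not a value taken from our setup.

    import numpy as np
    from scipy.stats import skew

    # Log-transform the target; log1p also handles zero values safely.
    y_train_log = np.log1p(y_train)
    y_test_log = np.log1p(y_test)

    # Log-transform numeric features whose skewness exceeds a threshold.
    numeric_cols = X_train.select_dtypes(include=[np.number]).columns
    skewed = [c for c in numeric_cols if abs(skew(X_train[c])) > 0.75]
    X_train[skewed] = np.log1p(X_train[skewed])
    X_test[skewed] = np.log1p(X_test[skewed])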
Models and Results

For the regression models, we try to solve the following problem: given the processed list of features for a house, predict its potential sale price. Linear regression is a natural baseline for regression problems, so we first ran linear regression on all 288 features using our 1000 training samples. The model is then used to predict the sale prices of the houses in the test data, and the predictions are compared to the actual sale prices in the test set. Performance is measured by the root mean square error (rmse) between the predicted and actual values. Our baseline model produced an rmse of 0.5501. Note that since the target variable SalePrice is log-transformed before model fitting, the rmse is computed on the log-transformed sale prices, which accounts for the small rmse values of the regression models.
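A sketch of the baseline and its evaluation metric, assuming scikit-learn and the log-transformed targets from the preprocessing step above:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    lin = LinearRegression().fit(X_train, y_train_log)
    pred = lin.predict(X_test)

    # rmse on the log-transformed sale prices, as described in the text.
    rmse = np.sqrt(mean_squared_error(y_test_log, pred))
    print("baseline rmse:", rmse)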
After using the linear regression model as the baseline, we added regularization to the linear regression models to reduce overfitting. Linear regression with a lasso penalty, tuned by 5-fold cross validation, produced an rmse of 0.5418, which is better than our baseline model. In addition, the lasso automatically selected 110 variables and eliminated the other 178 variables from the model. The plot of the selected features and their weights in the lasso-regularized model is attached in part 6 (Figure 4).

Other than the lasso regularizer, we also applied a ridge regularizer with cross validation to our linear regression model, which produced an rmse of 0.5448. This rmse is also better than our baseline model, indicating that regularized linear regression helps with overfitting.
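Both penalties could be fitted as below, assuming scikit-learn's cross-validated estimators; the ridge alpha grid is illustrative.

    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV

    # Regularization strengths chosen by 5-fold cross validation.
    lasso = LassoCV(cv=5).fit(X_train, y_train_log)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X_train, y_train_log)

    # The lasso drives many coefficients exactly to zero;
    # counting the non-zero ones gives the number of selected variables.
    print("features kept by lasso:", np.sum(lasso.coef_ != 0))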
Support vector regression (SVR) with Gaussian and linear kernels was also fitted to the features. The C parameter of each model was cross-validated to pick the best-performing value. SVR with a Gaussian kernel produced an rmse of 0.5271, about 4% better than our baseline model, while SVR with a linear kernel produced a much higher rmse of 5.503, since the linear kernel is a poor fit for this dataset.
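A sketch of the two SVR variants, again with an illustrative grid over C:

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    for kernel in ["rbf", "linear"]:  # "rbf" is the Gaussian kernel
        svr = GridSearchCV(SVR(kernel=kernel), {"C": [0.1, 1, 10, 100]}, cv=5)
        svr.fit(X_train, y_train_log)
        rmse = np.sqrt(mean_squared_error(y_test_log, svr.predict(X_test)))
        print(kernel, "rmse:", rmse)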
Lastly, we fitted a random forest regression model on our training dataset, with the max_depth parameter cross-validated to 150. Our random forest regression model produced an rmse of 0.5397, which is also better than our baseline model.
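A sketch of this last model, with an illustrative depth grid around the cross-validated value of 150:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rf_reg = GridSearchCV(
        RandomForestRegressor(n_estimators=100, random_state=0),
        {"max_depth": [50, 100, 150, 200]},
        cv=5,
    )
    rf_reg.fit(X_train, y_train_log)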
4. Performance Optimization
[Table: rmse by model; only the Lasso row (rmse 0.5418) was recovered.]
Figure 4. Coefficients of Covariates Selected in the Lasso Model

Figure 4 presents the variables selected by the lasso model and the value of the fitted coefficient for each covariate. The variable with the greatest coefficient is GrLivArea (continuous): above-grade (ground) living area in square feet. This makes intuitive sense, since the sale price of a real estate property is strongly correlated with its living area. The variable with the greatest negative impact on housing prices is RoofMatl (nominal): roof material: ClyTile, which indicates the material of the roof. This offers another perspective on housing prices, since the cost of roofing materials (which can sometimes be very expensive) can have a significant impact on housing prices.
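The ranking shown in Figure 4 could be reproduced from the fitted lasso model sketched in Section 3.2, assuming the features are kept in a pandas DataFrame so that coefficients can be paired with column names.

    import pandas as pd

    # Pair each lasso coefficient with its (one-hot expanded) feature name and
    # keep only the non-zero ones, sorted from most negative to most positive.
    coefs = pd.Series(lasso.coef_, index=X_train.columns)
    coefs = coefs[coefs != 0].sort_values()
    print(coefs.tail(10))  # largest positive effects, e.g. GrLivArea
    print(coefs.head(10))  # largest negative effects, e.g. the ClyTile roof indicator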
7. Conclusion

8. Future Work

9. References

[1] De Cock, Dean. "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project." Journal of Statistics Education, vol. 19, no. 3, 2011.