The document provides details about a project to predict house prices using machine learning methods. It includes the title, motivation, introduction discussing the data and variables, timeline, and references. It also summarizes two relevant research papers on house price prediction that discuss important factors and different machine learning techniques. The papers explore feature definition, feature selection using information value and the variance inflation factor, and the application of support vector machines, random forests, and neural networks to classify whether prices will increase or decrease.
Contents:
• 1. Title of the project
• 2. Motivation
• 3. Introduction (problem and statement)
• 4. Timeline
• 5. Base papers (research papers)
• 6. References
Title
House Prices Prediction
Motivation
• House prices are a perennially hot topic, and many factors have to be taken into consideration when studying them.
• Even though we seldom think about the type and material of the roof or the height of the basement ceiling, they do affect the house price.
• That is why we find house price prediction a genuinely complicated problem that is worth pursuing further. In this project, we would like to treat it with methods from machine learning.
Introduction
• Data Source and Variables
Kaggle competition:
-- “House Prices: Advanced Regression Techniques”
-- Dataset prepared by Dean De Cock
Variables:
-- 79 variables present in the dataset
Variable named “SalePrice”
- Dependent variable
- Represents the final price at which the house was sold
Remaining 78 variables
- Represent different attributes of the house, such as area, car parking, number of fireplaces, etc.
• Data Processing
Normalizing the response variable
Training vs. validation split
– Train data – 75%
– Validation data – 25%
Data cleaning
Variable treatments
■ Missing value treatment:
– Continuous variables
– Character variables
■ Outlier treatment
Variable creation:
– Character variables were converted to indicators
– Based on the train data, character variables were grouped further and new indicators were created
Base papers
Publisher: IEEE
• SECTION I.
• Introduction
• The development of civilization drives a demand for houses that increases day by day. Accurate prediction of house prices has always fascinated buyers, sellers and bankers alike. Many researchers have already worked to unravel the mysteries of house price prediction, and their work has given rise to many theories. Some of these theories hold that the geographical location and culture of a particular area determine how home prices will increase or decrease, whereas other schools of thought emphasize the socio-economic conditions that largely drive these price rises. A house price is a number drawn from some defined range, so predicting house prices is naturally a regression task. To forecast a house price, a person usually tries to locate similar properties in his or her neighborhood and, based on the collected data, tries to predict the price.
All this indicates that house price prediction is an emerging research area of regression that requires knowledge of machine learning. This has motivated the work in this domain.
• SECTION II.
• Related Work
• There are two major challenges that researchers have to face. The biggest challenge is to identify the optimum number of features that will help to accurately predict the direction of house prices. Kahn [7] mentions that productivity growth in various residential construction sectors does affect the growth of housing prices. The model that Kahn worked with shows how housing prices can appear to follow a trend in which housing wealth rises faster than income for an extended period, then collapses and experiences an extended decline.
• Lowrance [2] mentions in his doctoral thesis that he found the interior living space to be the most influential factor determining housing prices. He also cites the median income of the census tract that contains the house as a very influential factor.
• Pardoe [1] uses floor size, lot size category, number of bathrooms, number of bedrooms, standardized age and garage size as features and applies linear regression techniques to predict house prices.
• The second major challenge faced by researchers is to find the machine learning technique that is most effective at accurately predicting house prices. Ng and Deisenroth [4] construct a cell phone based application using Gaussian processes for regression. Hu et al. [5] use the maximum information coefficient (MIC) to build accurate mathematical models for predicting house prices. Limsombunchao [6] builds a model using features such as house size, house age, house type, number of bedrooms, number of bathrooms, number of garages, amenities around the house and geographical location.
Limsombunchao's work on house prices in New Zealand compared the accuracy of hedonic and artificial neural network models and observed that neural networks predict house prices more accurately than hedonic models. Bork and Moller [3] use time series based models for predicting house prices.
• The present work differs from all of these in that, instead of treating the problem as a regression task that predicts a price for the house, it frames the problem as a classification task, i.e. predicting whether the price of the house will increase or decrease.
• The complete work process can be divided into the following four segments:
A. Feature definition
B. Feature selection
C. Application of machine learning techniques
D. Performance measurement procedures
• A. Feature Definition
The current work uses data from the web resource Kaggle.com; the dataset comes from a competition hosted on that site.
• B. Feature Selection
This work uses feature selection techniques such as the variance inflation factor, information value and principal component analysis, together with data transformation techniques such as outlier and missing value treatment and the Box-Cox transformation, for the feature selection and subsequent transformation process. These techniques are used in the following way:
• Information Value Computation
The information value of a predictor variable is a nonparametric measure of the level of information that the predictor variable contains about the target variable. In our work the target variable takes two values, 0 and 1, where 0 indicates a price decrease and 1 indicates a price increase.
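The slides do not give the formula used for the information value, so the following is only a minimal sketch under a common convention: for a binned feature, IV is the sum over bins of (share of 1s minus share of 0s) times the log of their ratio. The function name `information_value`, the `smoothing` parameter and the toy data are assumptions made for illustration; they are not taken from the paper or from Canopy.

```python
import math
from collections import Counter


def information_value(feature_bins, target, smoothing=0.5):
    """Information value of a binned feature against a 0/1 target.

    IV = sum over bins of (p1 - p0) * ln(p1 / p0), where p1 is the share
    of target == 1 rows falling in the bin and p0 the share of target == 0
    rows. `smoothing` (an assumption of this sketch) avoids log(0) when a
    bin contains only one class.
    """
    ones = Counter()
    zeros = Counter()
    for bin_label, y in zip(feature_bins, target):
        if y == 1:
            ones[bin_label] += 1
        else:
            zeros[bin_label] += 1

    bins = set(ones) | set(zeros)
    total_ones = sum(ones.values()) + smoothing * len(bins)
    total_zeros = sum(zeros.values()) + smoothing * len(bins)

    iv = 0.0
    for bin_label in bins:
        p1 = (ones[bin_label] + smoothing) / total_ones
        p0 = (zeros[bin_label] + smoothing) / total_zeros
        iv += (p1 - p0) * math.log(p1 / p0)
    return iv


# Hypothetical toy data: 1 = price increased, 0 = price decreased.
target = [1, 1, 1, 1, 0, 0, 0, 0]
living_area = ["large", "large", "large", "small",
               "small", "small", "small", "large"]
unrelated = ["a", "b", "a", "b", "a", "b", "a", "b"]

iv_area = information_value(living_area, target)
iv_unrelated = information_value(unrelated, target)
print(iv_area > iv_unrelated)  # the informative feature ranks higher
```

Ranking features by this score and keeping the highest ones corresponds to the selection step described in the slides; where to place the cut-off is a modelling choice the slides do not state.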
The information value is computed for all the features, and the features with the largest information values are selected as the most important features for further improvement. The Canopy Python tool is used in this process.
• Data Transformation
The data transformation techniques are applied to the features selected by the information value computation. The transformation process consists of outlier and null value removal, followed by the Box-Cox transformation. In the Box-Cox transformation process the original value is transformed into square, inverse and exponential values.
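As a hedged illustration of the transformation step, the sketch below applies the standard Box-Cox formula with a fixed parameter λ. In this family λ = 2 gives a (shifted, scaled) square, λ = -1 an inverse, and λ = 0 is defined as the natural logarithm. The function name `box_cox` and the sample values are assumptions for illustration; the slides do not say which λ values or which library the paper used.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda.

    y = (x**lam - 1) / lam   if lam != 0
    y = ln(x)                if lam == 0
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# Hypothetical right-skewed feature, e.g. lot area.
lot_area = [1.0, 2.0, 4.0, 8.0, 100.0]

squared = box_cox(lot_area, 2)    # square-type transform
inverse = box_cox(lot_area, -1)   # inverse-type transform
logged = box_cox(lot_area, 0)     # log transform, compresses the long tail
```

In practice λ is usually chosen per feature, for example by maximizing a likelihood criterion, rather than fixed by hand as it is here.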
• Principal Component Analysis
Principal component analysis (PCA) is applied to the features after the data transformation process. It is performed through the “pca” package in Python. This is done to ensure that there is no multicollinearity in the feature set.
• Variance Inflation Factor
The Variance Inflation Factor (VIF) of a variable is a measure of the correlation of that variable with the other variables. If the correlation between the variables is high, the VIF will also be high. As a rule of thumb, we try to keep a set of variables such that the VIFs of all the variables are less than 1.5-2.0. We use the “statsmodels” package in Python to compute the variance inflation factor. Now we find the following table that contains the most important features with their respective information values.
C. Machine Learning Technique
Once the features have been selected, we use three techniques: Support Vector Machines (SVM), Random Forest and Artificial Neural Network (ANN).
• Support Vector Machines
Support vector machines rest on the idea that the linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width by which the boundary could be increased before hitting a data point.
• Random Forest
Random Forests are ensemble classifiers constructed from a set of Decision Trees, with the output of the classifier being the mode of the outputs of the Decision Trees. Random Forests combine the “bagging” idea of Breiman with the idea of random selection of features. The algorithm for inducing a Random Forest was developed by Leo Breiman and Adele Cutler.
• Artificial Neural Network
Artificial neural networks use neurons, or perceptrons, as their basic units. Each perceptron takes a vector of real-valued inputs and computes a linear combination of these inputs.
The output is 1 if the result is greater than a threshold value, and 0 otherwise.
• D. Performance Measurement
This work uses the following metrics for performance measurement. True positives are the cases in which the classifier predicts 1 when the target value is 1; true negatives are the cases in which the classifier predicts 0 when the target value is 0; false positives are the cases in which the classifier predicts 1 when the target value is 0; and false negatives are the cases in which the classifier predicts 0 when the target value is 1.
i) Accuracy: We define accuracy as
\[ \text{Accuracy} = \frac{(tp + tn) \times 100}{tp + tn + fp + fn} \tag{1} \]
where tp = true positives, tn = true negatives, fp = false positives and fn = false negatives.
ii) Precision: We define precision as
\[ \text{Precision} = \frac{tp \times 100}{tp + fp} \tag{2} \]
where tp = true positives and fp = false positives.
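Equations (1) and (2) can be checked on a small example. The sketch below counts the four outcome types from paired labels and applies both formulas as percentages; the function names and the sample labels are assumptions for illustration, and returning 0.0 when there are no positive predictions is a choice of this sketch, not something the slides specify.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Equation (1): share of correct predictions, in percent."""
    return (tp + tn) * 100 / (tp + tn + fp + fn)


def precision(tp, fp):
    """Equation (2): share of predicted 1s that are correct, in percent."""
    if tp + fp == 0:
        return 0.0  # no positive predictions; convention of this sketch
    return tp * 100 / (tp + fp)


# Hypothetical labels: 1 = price increase, 0 = price decrease.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn))  # 75.0
print(precision(tp, fp))         # 75.0
```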