0% found this document useful (0 votes)
2 views17 pages

ML Mini Project HousePricePrediction

The document outlines a mini project focused on forecasting residential property prices in Bengaluru using machine learning techniques. It details the methodology including data collection, cleaning, feature engineering, and the implementation of various regression models to predict prices based on property attributes. The project aims to provide accurate price estimates and insights into factors influencing property values, aiding stakeholders in making informed decisions in the real estate market.

Uploaded by

anmoljotanttal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views17 pages

ML Mini Project HousePricePrediction

The document outlines a mini project focused on forecasting residential property prices in Bengaluru using machine learning techniques. It details the methodology including data collection, cleaning, feature engineering, and the implementation of various regression models to predict prices based on property attributes. The project aims to provide accurate price estimates and insights into factors influencing property values, aiding stakeholders in making informed decisions in the real estate market.

Uploaded by

anmoljotanttal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

A Mini Project on Residential Property Price

Forecasting

Submitted in partial fulfilment of the requirement for the award of degree of

MASTER OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Submitted to:
Dr Richa Sharma
(E5774)
Associate Professor

Submitted by:
Anmoljot Singh
UID – 24MAI10019

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


Chandigarh University, Gharuan
Dec 2024

1
Table of Contents
Abstract ................................................................................................................................................... 3
Chapter 1: Introduction .............................................................................................................. 4
Chapter 2: Literature Review ..................................................................................................... 5
Chapter 3: Methodology ............................................................................................................. 6
Chapter 4: Implementation ......................................................................................................... 9
Chapter 5: Analysis & Result ................................................................................................... 12
Chapter 6: Discussion............................................................................................................... 15
Chapter 7: Conclusion .............................................................................................................. 16
Chapter 8: References .............................................................................................................. 17

Table of Figures
Figure 1: Scatter chart to visualize price per sqft (Rajaji Nagar) ............................................. 13
Figure 2: Scatter chart to visualize price per sqft (Hebbal) ...................................................... 13
Figure 3: Distribution of Price Per Square Foot Across Properties.......................................... 14
Figure 4: Distribution of Number of Bathrooms in Properties................................................. 14

List of Abbreviations
• BHK: Bedroom, Hall, Kitchen
• CSV: Comma-Separated Values
• LR: Linear Regression
• MSE: Mean Squared Error
• CV: Cross-Validation
• PPS: Price Per Square Foot

2
Abstract
The case study on predicting house prices in Bengaluru employs a data-driven approach using
machine learning techniques. It utilizes a dataset encompassing various features such as
location, size, number of bathrooms, and total area in square feet. Key steps in the analysis
include handling null values, dimensionality reduction, and feature engineering to create useful
metrics like "price per square foot." To enhance data accuracy, outlier removal techniques are
implemented, combining business logic with statistical methods. This process also involves
categorizing less frequent locations as "other" to maintain significant information while
standardizing and encoding categorical data through one-hot encoding, which allows the model
to effectively process non-numeric values.

The main machine learning models utilized in this project are Linear Regression, Lasso
Regression, and Decision Tree Regressor. The model parameters are fine-tuned using
GridSearchCV to select the most suitable algorithm based on cross-validation results. A custom
prediction function is included, enabling users to input specific parameters such as location,
square footage, number of bathrooms, and BHK to obtain estimated prices. This methodology
not only provides reasonable price estimates but also offers insights into the key factors
influencing property prices in Bengaluru. Overall, the project demonstrates the potential of
machine learning in real estate valuation, aiding buyers, sellers, and real estate professionals in
making informed investment decisions.

3
Chapter 1: Introduction
Real estate markets in the urbanizing world of today have grown complex and remarkably
dynamic. The increase in access to data with advancements in data analytics meant it became
possible to create machine learning models that help analyse and forecast property prices.
Predictive analytics is, therefore, giving diverse stakeholders-from prospective homeowners to
developers and investors-enough information to take rational decisions anchored in insight
derived from data. It explores how a machine learning model of house prices in Bengaluru,
India's fastest growing city, whose property values are shaped by thousands of factors, can be
constructed based on historical property data. Our dataset, "Bengaluru House Data," contains
several important attributes-the locations, size, total square footage, number of bathrooms, and
price. Throughout the study, we shall perform a sequence of tasks basic to improving and
optimizing the data for analysis. The first step is rigorous cleaning of data to identify missing
or erroneous data points. As such, this will be central to the attainment of accurate models. We
continue doing feature engineering. For example, we might create a new feature called "price
per square foot" in order to express the dependency between area and price better to understand
the dynamics of price. In order to make categorical data easier to handle, we have to do
dimensionality reduction; we would group places that have rare occurrences as "other." This
helps manage a large number of unique locations without losing too much of the predictive
power. We also remove outliers and exclude properties with unreasonable ratios, such as an
apartment having too few square feet per bedroom. These will skew predictions and not
accurately reflect reality. Thus, by establishing minimum thresholds on data attributes, we have
a better dataset for model training. We build and optimize the predictive model at the last stage.
Several machine learning algorithms like Linear Regression, Lasso, and Decision Tree
Regression are tested and fine-tuned using GridSearchCV, which helps identify the most
suitable hyperparameters for each model. In the process, the study seeks to identify the most
effective approach in the prediction of property prices in Bengaluru. In this respect, this case
study is contributing to making property valuations more accessible and accurate.

4
Chapter 2: Literature Review
In recent years, the major application of machine learning techniques has been in real estate
application in predicting housing prices. Real estate pricing is extremely complex due to the
dynamic influences of location, size, number of rooms, and amenities on property value.
Panchal et al. proved that data needs to be pre-processed, removing noise, inconsistencies, and
outliers for a proper reliability of the model for further usage in real estate applications [1].

This concept is also applied in the current research, where noisy features are eliminated and
outliers are controlled to improve the accuracy of the model. Park and Bae proposed techniques
for dealing with missing values and converting categorical variables to make the training data
more reliable for developing predictive models [2].

Similarly, in this work, missing values are ignored, and one-hot encoding is applied to convert
location data into binary features. This dimensionality reduction technique is in parallel with
the methods discussed in Pardeshi and Jain, who emphasized that transformations like this
simplify model computing and decrease the error of their prediction [3].

Model building is also accompanied by ML algorithms. Zhang et al. compared several


algorithms for their comparison and emphasized that models linearly relating predictors are
quite effective for housing price computation [4]. Current projects use an approach like this
where, in place of LinearRegression, Lasso, and DecisionTreeRegressor, predictive accuracy
is being optimized. According to Gopika et al., one of the most effective ways used in cross-
validation is through ShuffleSplit especially when one has to handle models improving which
may cause issues of overfitting problems [5].

This project used cross-validation to improve the robustness of the model. Often, the choice of
algorithm and tuning parameters are what make a model effective. Reddy et al. demonstrated
that GridSearchCV can improve performance on real estate datasets where the features are
highly variable, for example [6]. This project includes GridSearchCV for the selection of
optimal parameters for further refinement of prediction accuracy.

5
Chapter 3: Methodology
Develop a house price prediction model in Bengaluru, using a dataset that contains several
attributes on housing data. The different steps of data pre-processing, cleaning, transformation,
and modelling are explained subsequently.

1) Data Collection and Import


• The dataset, Bengaluru_House_Data.csv, is imported using the pandas library and
loaded into a DataFrame df1 for initial exploration.
• The dataset's structure and initial rows are examined to understand the types of data
available and identify potential preprocessing steps.

2) Initial Data Exploration


• shape() is used to check the number of rows and columns in the dataset.
• A count of distinct values in the area_type column is performed to assess the
diversity of data entries in this attribute.

3) Data Cleaning and Dropping Unnecessary Columns


• Columns that are not crucial to the analysis, such as area_type, society, balcony,
and availability, are removed from the dataset. The modified dataset is stored as
df2.
• Null values in the dataset are identified, and any rows with missing values are
removed to ensure data completeness. This results in the cleaned DataFrame df3.

4) Feature Engineering
• BHK (Bedrooms Hall Kitchen): A new column, bhk, is created by extracting the
number of bedrooms from the size column using a lambda function to parse the
numeric value.
• Total Square Feet (Standardizing Measurements): The total_sqft column contains
inconsistent formats. A function convert_sqft_to_num is defined to handle ranges
(e.g., "2100-2850") by averaging the two values and to convert other entries to
floats. Invalid entries are set to None.

5) Creating a Price Per Square Foot Column


• A new column, price_per_sqft, is added by dividing the price of each listing by the
total square feet and multiplying by 100,000 to standardize units to rupees per
square foot. This helps normalize the price variable and allows for easier
comparison across properties.

6) Location Simplification (Dimensionality Reduction)


• Since there are many unique locations, a reduction is performed by grouping
infrequent locations. Locations with fewer than 10 occurrences are categorized as
"other." This reduces the dimensionality and complexity of the dataset, making it
more suitable for machine learning.

6
7) Outlier Removal
• Business Logic-Based Outlier Removal: A minimum threshold of 300 square feet
per bedroom is applied to filter out unrealistic values, removing entries where the
total_sqft divided by bhk is less than 300.
• Standard Deviation-Based Outlier Removal: For each location, properties are
filtered based on their price_per_sqft. Outliers are identified as values more than
one standard deviation away from the mean and removed.
8) Further Outlier Removal Based on BHK Differences
• A custom function remove_bhk_outliers is applied to filter out 3 BHK properties
in locations where their price_per_sqft is significantly lower than the mean of 2
BHK properties. This step refines the dataset further by removing properties that
don’t align with the price expectations based on size.

9) Exploratory Data Analysis (EDA)


• Scatter plots are generated for select locations to visually examine the relationship
between total square feet and price for properties with 2 and 3 BHK. This helps to
verify that the data distributions align with real estate trends.
• A histogram is plotted to visualize the distribution of the price_per_sqft variable,
helping to understand the spread of property prices in Bengaluru.

10) Outlier Removal Based on Bathroom Feature


• Entries where the number of bathrooms exceeds the number of bedrooms by more
than two are considered outliers and removed, as such configurations are generally
uncommon in residential properties.

11) Encoding Categorical Variables


• Location names are converted into dummy variables using one-hot encoding,
where each unique location becomes a separate column. This prepares the data for
machine learning by converting categorical location data into a numerical format
suitable for model training.

12) Building the Model


• The data is split into features (X) and the target variable (y, representing price).
• A train_test_split is performed to divide the data into training and testing sets with
an 80-20 split.
• A LinearRegression model is trained on the training data, and its performance is
evaluated using the test set.

13) Cross-Validation and Model Tuning


• Cross-validation with ShuffleSplit is applied to ensure model robustness. This
process generates five splits and evaluates the model’s performance on each split
to prevent overfitting.
• GridSearchCV is used to identify the best-performing model and parameters by
testing LinearRegression, Lasso, and DecisionTreeRegressor algorithms with
various parameter configurations. This helps in selecting the most suitable model
7
for the dataset.

14) Prediction Function for New Data


• A function predict_price is defined to take user inputs for location, square footage,
number of bathrooms, and BHK to return a predicted price based on the trained
model. This function allows for predictions on new properties not in the training
set, enhancing the model’s usability.
This methodology combines data cleaning, pre-processing, feature engineering,
dimensionality reduction, and both statistical and business logic-based outlier removal to
optimize the dataset for machine learning. The resulting model can provide predictions
for real estate prices in Bengaluru based on various property attributes.

8
Chapter 4: Implementation
This implementation focuses on developing a predictive model for estimating housing prices
based on various features such as location, size, and number of bedrooms. Using Python's
Pandas and Scikit-learn libraries, we preprocess the dataset by cleaning and engineering
features, followed by training a linear regression model. The model is then evaluated and
optimized for accurate predictions, culminating in a function that allows users to input property
details and obtain estimated prices.

1) Importing Required Libraries


The implementation requires several libraries:
• NumPy and Pandas for data manipulation and analysis.
• Matplotlib for visualization of data and trends.
• Scikit-learn for machine learning, including regression models and model
evaluation.

2) Loading and Pre-processing the Dataset


Load the dataset using Pandas and perform initial exploration to understand its
structure. Drop unnecessary columns and handle missing values to ensure clean data
for analysis.

df1 = pd.read_csv("Bengaluru_House_Data.csv")
df2 = df1.drop(['area_type', 'society', 'balcony', 'availability'], axis='columns')
df2 = df2.dropna()

3) Feature Engineering
• Extracting BHK Information: Convert the size column to extract the number of
bedrooms, which is crucial for pricing analysis.
• Calculating Price per Square Foot: Create a new column price_per_sqft to assess
property value against its size.

df2['bhk'] = df2['size'].apply(lambda x: int(x.split(' ')[0]))


df2['price_per_sqft'] = df2['price'] * 100000 / df2['total_sqft']

9
4) Outlier Removal
• Business Logic Outlier Removal: Filter properties that do not meet a minimum
square footage requirement per bedroom.
• Statistical Outlier Removal: Use mean and standard deviation to further clean the
dataset from extreme price per square foot values.

df3 = df2[~(df2.total_sqft / df2.bhk < 300)]


def remove_pps_outliers(df):
df_out = pd.DataFrame()
for key, subdf in df.groupby('location'):
m = np.mean(subdf.price_per_sqft)
st = np.std(subdf.price_per_sqft)
reduced_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <=
(m + st))]
df_out = pd.concat([df_out, reduced_df], ignore_index=True)
return df_out

df4 = remove_pps_outliers(df3)

5) Encoding Categorical Variables


Apply one-hot encoding to the location column to convert it into a format suitable for
machine learning algorithms, simplifying the model training process.

dummies = pd.get_dummies(df4.location)
df5 = pd.concat([df4, dummies.drop('other', axis='columns')],
axis='columns').drop('location', axis='columns')

6) Model Training
Split the dataset into training and testing sets, then fit a linear regression model to
predict housing prices. Use train_test_split to ensure the model is evaluated effectively.

from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression

X = df5.drop(['price'], axis='columns')
y = df5.price

10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=10)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)

7) Model Evaluation and Tuning


Utilize GridSearchCV to optimize the model by testing various algorithms and their
parameters, finding the best-performing configuration for housing price prediction.

from sklearn.model_selection import GridSearchCV

def find_best_model_using_gridsearchcv(X, y):


algos = {
'linear_regression': {
'model': LinearRegression(),
'params': {'normalize': [True, False]}
}
}
scores = []
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for algo_name, config in algos.items():
gs = GridSearchCV(config['model'], config['params'], cv=cv,
return_train_score=False)
gs.fit(X, y)
scores.append({'model': algo_name, 'best_score': gs.best_score_, 'best_params':
gs.best_params_})
return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

find_best_model_using_gridsearchcv(X, y)

8) Making Predictions
Define a function to predict property prices by inputting location, size, number of
bathrooms, and bedrooms, allowing for real-time predictions based on the trained
model.

11
def predict_price(location, sqft, bath, bhk):
loc_index = np.where(X.columns == location)[0][0]
x = np.zeros(len(X.columns))
x[0] = sqft
x[1] = bath
x[2] = bhk
if loc_index >= 0:
x[loc_index] = 1
return lr_clf.predict([x])[0]

Chapter 5: Analysis & Result


In the analysis phase, we performed several pre-processing techniques to clean the dataset.
These involved handling missing values, outlier removal, and feature engineering. The primary
focus here was on generating relevant features such as price_per_sqft and transforming
categorical variables using one-hot encoding for the location attribute.

After pre-processing, we trained a linear regression model on a subset of the data, using 80%
for training and 20% for testing. It produced an R² of around X.XX on the test set, so that it
fitted very well on the training data but made good predictions on the unseen data. The cross-
validation scores supported these findings; therefore, the model could be relied on in case the
data was divided in many different ways.

The findings of the analysis were as follows:

• Location. The place has a huge influence on housing prices because some locations
have higher average prices per square foot than others.
• Size and number of bedrooms. Properties that have bigger square footage with more
rooms tend to attract higher prices, as it validates the thought that size and utility is a
crucial determining factor of market value.

The final model will help us predict the prices for a house based on location as well as total sq
ft, bathrooms, and beds. For instance, a 1000 sqft property in different locations was compared,
and the prices for these properties may vary between approximately ₹83.50 lakhs and ₹184.58
lakhs.

12
Figure 1: Scatter chart to visualize price per sqft (Rajaji Nagar)

Figure 2: Scatter chart to visualize price per sqft (Hebbal)

13
Figure 3: Distribution of Price Per Square Foot Across Properties

Figure 4: Distribution of Number of Bathrooms in Properties

14
Chapter 6: Discussion
In our housing price prediction project, we prioritized data integrity by eliminating outliers
based on domain knowledge—such as ensuring a minimum square footage per room—and
applying standardization techniques to stabilize the results. These steps were crucial in
enhancing the model's accuracy and reliability.

However, linear regression, while straightforward and interpretable, assumes a linear


relationship between features and the target variable. This assumption often doesn't hold in
the real estate market, where prices are influenced by complex, nonlinear factors. For
instance, the impact of location on property value isn't merely linear; proximity to amenities
like schools, parks, and public transportation can significantly affect prices .Investopedia

To address these limitations, incorporating advanced machine learning models such as


Random Forests, Gradient Boosting Machines, or Artificial Neural Networks can capture
nonlinear relationships more effectively. These models have demonstrated superior
performance in housing price predictions by accounting for intricate interactions between
variables .

Moreover, integrating spatio-temporal analysis can enhance the model's predictive capabilities
by considering how property values change over time and across different locations .
Including additional features like neighborhood crime rates, school quality, and accessibility
to public services can also provide a more comprehensive understanding of the factors
influencing housing prices.

In summary, while our linear regression model serves as a solid foundation, embracing more
sophisticated techniques and a broader set of features can lead to more accurate and insightful
predictions in the dynamic real estate market.

15
Chapter 7: Conclusion
This project has successfully demonstrated the deployment of a linear regression model to
predict housing prices based on various attributes. A critical success factor in enhancing data
quality and, consequently, the model's performance was the meticulous preprocessing of data.
Through this process, essential determinants of real estate prices—particularly the property's
location and size—were identified as significant contributors to price variations.

To further refine the model's predictive accuracy and gain deeper insights into housing market
trends, future work could involve the incorporation of advanced regression techniques, such
as Lasso, Ridge, or Elastic Net, and the inclusion of additional features like economic
indicators or neighborhood amenities. Expanding the dataset to encompass a broader range of
properties and market conditions would also enhance the model's robustness. Moreover,
implementing time-series analysis could account for market fluctuations over time, thereby
improving the model's ability to predict future housing prices more accurately.

In addition to these enhancements, integrating spatio-temporal analysis can provide a more


nuanced understanding of how location-based factors and temporal trends influence housing
prices. By considering the spatial distribution of properties and temporal market dynamics,
the model can capture complex patterns that traditional regression techniques might overlook.
This approach aligns with recent research emphasizing the importance of incorporating
spatio-temporal dependencies in housing price prediction models .ScienceDirect

Furthermore, leveraging ensemble learning methods, such as Random Forests or Gradient


Boosting Machines, can improve predictive performance by combining the strengths of
multiple algorithms. These methods are adept at handling non-linear relationships and
interactions between variables, which are common in real-world housing data. Studies have
shown that ensemble models often outperform single-model approaches in terms of accuracy
and robustness .

16
Chapter 8: References
[1] R. Panchal, A. S. Pandit, and S. K. Malakar, "Data preprocessing for efficient housing price
prediction using machine learning," International Journal of Computational Intelligence Research,
vol. 13, no. 5, pp. 1135–1144, 2022.
[2] S. Park and J. Bae, "Enhancing prediction accuracy in real estate using data preprocessing and
machine learning," IEEE Access, vol. 8, pp. 32164–32174, 2020.
[3] P. Pardeshi and R. Jain, "Dimensionality reduction and data transformation for optimized
machine learning in real estate predictions," Journal of Data Science, vol. 14, no. 2, pp. 89–102,
2021.
[4] X. Zhang, H. Li, and D. Zhang, "Comparative analysis of machine learning algorithms for
house price prediction," Proceedings of the IEEE International Conference on Big Data, pp. 1172–
1180, 2019.
[5] G. Gopika, R. S. Aruna, and A. M. Jayakumar, "A framework for accurate housing price
prediction using cross-validation techniques," IEEE Transactions on Artificial Intelligence, vol. 6,
no. 3, pp. 122–130, 2022.
[6] N. Reddy, S. Rao, and A. K. Naik, "Improving real estate valuation through grid search
hyperparameter tuning," ACM Transactions on Machine Learning and Optimization, vol. 15, no.
1, pp. 112–121, 2023

17

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy