ML Mini Project HousePricePrediction
ML Mini Project HousePricePrediction
Forecasting
MASTER OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Submitted to:
Dr Richa Sharma
(E5774)
Associate Professor
Submitted by:
Anmoljot Singh
UID – 24MAI10019
1
Table of Contents
Abstract ................................................................................................................................................... 3
Chapter 1: Introduction .............................................................................................................. 4
Chapter 2: Literature Review ..................................................................................................... 5
Chapter 3: Methodology ............................................................................................................. 6
Chapter 4: Implementation ......................................................................................................... 9
Chapter 5: Analysis & Result ................................................................................................... 12
Chapter 6: Discussion............................................................................................................... 15
Chapter 7: Conclusion .............................................................................................................. 16
Chapter 8: References .............................................................................................................. 17
Table of Figures
Figure 1: Scatter chart to visualize price per sqft (Rajaji Nagar) ............................................. 13
Figure 2: Scatter chart to visualize price per sqft (Hebbal) ...................................................... 13
Figure 3: Distribution of Price Per Square Foot Across Properties.......................................... 14
Figure 4: Distribution of Number of Bathrooms in Properties................................................. 14
List of Abbreviations
• BHK: Bedroom, Hall, Kitchen
• CSV: Comma-Separated Values
• LR: Linear Regression
• MSE: Mean Squared Error
• CV: Cross-Validation
• PPS: Price Per Square Foot
2
Abstract
The case study on predicting house prices in Bengaluru employs a data-driven approach using
machine learning techniques. It utilizes a dataset encompassing various features such as
location, size, number of bathrooms, and total area in square feet. Key steps in the analysis
include handling null values, dimensionality reduction, and feature engineering to create useful
metrics like "price per square foot." To enhance data accuracy, outlier removal techniques are
implemented, combining business logic with statistical methods. This process also involves
categorizing less frequent locations as "other" to maintain significant information while
standardizing and encoding categorical data through one-hot encoding, which allows the model
to effectively process non-numeric values.
The main machine learning models utilized in this project are Linear Regression, Lasso
Regression, and Decision Tree Regressor. The model parameters are fine-tuned using
GridSearchCV to select the most suitable algorithm based on cross-validation results. A custom
prediction function is included, enabling users to input specific parameters such as location,
square footage, number of bathrooms, and BHK to obtain estimated prices. This methodology
not only provides reasonable price estimates but also offers insights into the key factors
influencing property prices in Bengaluru. Overall, the project demonstrates the potential of
machine learning in real estate valuation, aiding buyers, sellers, and real estate professionals in
making informed investment decisions.
3
Chapter 1: Introduction
Real estate markets in the urbanizing world of today have grown complex and remarkably
dynamic. The increase in access to data with advancements in data analytics meant it became
possible to create machine learning models that help analyse and forecast property prices.
Predictive analytics is, therefore, giving diverse stakeholders-from prospective homeowners to
developers and investors-enough information to take rational decisions anchored in insight
derived from data. It explores how a machine learning model of house prices in Bengaluru,
India's fastest growing city, whose property values are shaped by thousands of factors, can be
constructed based on historical property data. Our dataset, "Bengaluru House Data," contains
several important attributes-the locations, size, total square footage, number of bathrooms, and
price. Throughout the study, we shall perform a sequence of tasks basic to improving and
optimizing the data for analysis. The first step is rigorous cleaning of data to identify missing
or erroneous data points. As such, this will be central to the attainment of accurate models. We
continue doing feature engineering. For example, we might create a new feature called "price
per square foot" in order to express the dependency between area and price better to understand
the dynamics of price. In order to make categorical data easier to handle, we have to do
dimensionality reduction; we would group places that have rare occurrences as "other." This
helps manage a large number of unique locations without losing too much of the predictive
power. We also remove outliers and exclude properties with unreasonable ratios, such as an
apartment having too few square feet per bedroom. These will skew predictions and not
accurately reflect reality. Thus, by establishing minimum thresholds on data attributes, we have
a better dataset for model training. We build and optimize the predictive model at the last stage.
Several machine learning algorithms like Linear Regression, Lasso, and Decision Tree
Regression are tested and fine-tuned using GridSearchCV, which helps identify the most
suitable hyperparameters for each model. In the process, the study seeks to identify the most
effective approach in the prediction of property prices in Bengaluru. In this respect, this case
study is contributing to making property valuations more accessible and accurate.
4
Chapter 2: Literature Review
In recent years, the major application of machine learning techniques has been in real estate
application in predicting housing prices. Real estate pricing is extremely complex due to the
dynamic influences of location, size, number of rooms, and amenities on property value.
Panchal et al. proved that data needs to be pre-processed, removing noise, inconsistencies, and
outliers for a proper reliability of the model for further usage in real estate applications [1].
This concept is also applied in the current research, where noisy features are eliminated and
outliers are controlled to improve the accuracy of the model. Park and Bae proposed techniques
for dealing with missing values and converting categorical variables to make the training data
more reliable for developing predictive models [2].
Similarly, in this work, missing values are ignored, and one-hot encoding is applied to convert
location data into binary features. This dimensionality reduction technique is in parallel with
the methods discussed in Pardeshi and Jain, who emphasized that transformations like this
simplify model computing and decrease the error of their prediction [3].
This project used cross-validation to improve the robustness of the model. Often, the choice of
algorithm and tuning parameters are what make a model effective. Reddy et al. demonstrated
that GridSearchCV can improve performance on real estate datasets where the features are
highly variable, for example [6]. This project includes GridSearchCV for the selection of
optimal parameters for further refinement of prediction accuracy.
5
Chapter 3: Methodology
Develop a house price prediction model in Bengaluru, using a dataset that contains several
attributes on housing data. The different steps of data pre-processing, cleaning, transformation,
and modelling are explained subsequently.
4) Feature Engineering
• BHK (Bedrooms Hall Kitchen): A new column, bhk, is created by extracting the
number of bedrooms from the size column using a lambda function to parse the
numeric value.
• Total Square Feet (Standardizing Measurements): The total_sqft column contains
inconsistent formats. A function convert_sqft_to_num is defined to handle ranges
(e.g., "2100-2850") by averaging the two values and to convert other entries to
floats. Invalid entries are set to None.
6
7) Outlier Removal
• Business Logic-Based Outlier Removal: A minimum threshold of 300 square feet
per bedroom is applied to filter out unrealistic values, removing entries where the
total_sqft divided by bhk is less than 300.
• Standard Deviation-Based Outlier Removal: For each location, properties are
filtered based on their price_per_sqft. Outliers are identified as values more than
one standard deviation away from the mean and removed.
8) Further Outlier Removal Based on BHK Differences
• A custom function remove_bhk_outliers is applied to filter out 3 BHK properties
in locations where their price_per_sqft is significantly lower than the mean of 2
BHK properties. This step refines the dataset further by removing properties that
don’t align with the price expectations based on size.
8
Chapter 4: Implementation
This implementation focuses on developing a predictive model for estimating housing prices
based on various features such as location, size, and number of bedrooms. Using Python's
Pandas and Scikit-learn libraries, we preprocess the dataset by cleaning and engineering
features, followed by training a linear regression model. The model is then evaluated and
optimized for accurate predictions, culminating in a function that allows users to input property
details and obtain estimated prices.
df1 = pd.read_csv("Bengaluru_House_Data.csv")
df2 = df1.drop(['area_type', 'society', 'balcony', 'availability'], axis='columns')
df2 = df2.dropna()
3) Feature Engineering
• Extracting BHK Information: Convert the size column to extract the number of
bedrooms, which is crucial for pricing analysis.
• Calculating Price per Square Foot: Create a new column price_per_sqft to assess
property value against its size.
9
4) Outlier Removal
• Business Logic Outlier Removal: Filter properties that do not meet a minimum
square footage requirement per bedroom.
• Statistical Outlier Removal: Use mean and standard deviation to further clean the
dataset from extreme price per square foot values.
df4 = remove_pps_outliers(df3)
dummies = pd.get_dummies(df4.location)
df5 = pd.concat([df4, dummies.drop('other', axis='columns')],
axis='columns').drop('location', axis='columns')
6) Model Training
Split the dataset into training and testing sets, then fit a linear regression model to
predict housing prices. Use train_test_split to ensure the model is evaluated effectively.
X = df5.drop(['price'], axis='columns')
y = df5.price
10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=10)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
find_best_model_using_gridsearchcv(X, y)
8) Making Predictions
Define a function to predict property prices by inputting location, size, number of
bathrooms, and bedrooms, allowing for real-time predictions based on the trained
model.
11
def predict_price(location, sqft, bath, bhk):
loc_index = np.where(X.columns == location)[0][0]
x = np.zeros(len(X.columns))
x[0] = sqft
x[1] = bath
x[2] = bhk
if loc_index >= 0:
x[loc_index] = 1
return lr_clf.predict([x])[0]
After pre-processing, we trained a linear regression model on a subset of the data, using 80%
for training and 20% for testing. It produced an R² of around X.XX on the test set, so that it
fitted very well on the training data but made good predictions on the unseen data. The cross-
validation scores supported these findings; therefore, the model could be relied on in case the
data was divided in many different ways.
• Location. The place has a huge influence on housing prices because some locations
have higher average prices per square foot than others.
• Size and number of bedrooms. Properties that have bigger square footage with more
rooms tend to attract higher prices, as it validates the thought that size and utility is a
crucial determining factor of market value.
The final model will help us predict the prices for a house based on location as well as total sq
ft, bathrooms, and beds. For instance, a 1000 sqft property in different locations was compared,
and the prices for these properties may vary between approximately ₹83.50 lakhs and ₹184.58
lakhs.
12
Figure 1: Scatter chart to visualize price per sqft (Rajaji Nagar)
13
Figure 3: Distribution of Price Per Square Foot Across Properties
14
Chapter 6: Discussion
In our housing price prediction project, we prioritized data integrity by eliminating outliers
based on domain knowledge—such as ensuring a minimum square footage per room—and
applying standardization techniques to stabilize the results. These steps were crucial in
enhancing the model's accuracy and reliability.
Moreover, integrating spatio-temporal analysis can enhance the model's predictive capabilities
by considering how property values change over time and across different locations .
Including additional features like neighborhood crime rates, school quality, and accessibility
to public services can also provide a more comprehensive understanding of the factors
influencing housing prices.
In summary, while our linear regression model serves as a solid foundation, embracing more
sophisticated techniques and a broader set of features can lead to more accurate and insightful
predictions in the dynamic real estate market.
15
Chapter 7: Conclusion
This project has successfully demonstrated the deployment of a linear regression model to
predict housing prices based on various attributes. A critical success factor in enhancing data
quality and, consequently, the model's performance was the meticulous preprocessing of data.
Through this process, essential determinants of real estate prices—particularly the property's
location and size—were identified as significant contributors to price variations.
To further refine the model's predictive accuracy and gain deeper insights into housing market
trends, future work could involve the incorporation of advanced regression techniques, such
as Lasso, Ridge, or Elastic Net, and the inclusion of additional features like economic
indicators or neighborhood amenities. Expanding the dataset to encompass a broader range of
properties and market conditions would also enhance the model's robustness. Moreover,
implementing time-series analysis could account for market fluctuations over time, thereby
improving the model's ability to predict future housing prices more accurately.
16
Chapter 8: References
[1] R. Panchal, A. S. Pandit, and S. K. Malakar, "Data preprocessing for efficient housing price
prediction using machine learning," International Journal of Computational Intelligence Research,
vol. 13, no. 5, pp. 1135–1144, 2022.
[2] S. Park and J. Bae, "Enhancing prediction accuracy in real estate using data preprocessing and
machine learning," IEEE Access, vol. 8, pp. 32164–32174, 2020.
[3] P. Pardeshi and R. Jain, "Dimensionality reduction and data transformation for optimized
machine learning in real estate predictions," Journal of Data Science, vol. 14, no. 2, pp. 89–102,
2021.
[4] X. Zhang, H. Li, and D. Zhang, "Comparative analysis of machine learning algorithms for
house price prediction," Proceedings of the IEEE International Conference on Big Data, pp. 1172–
1180, 2019.
[5] G. Gopika, R. S. Aruna, and A. M. Jayakumar, "A framework for accurate housing price
prediction using cross-validation techniques," IEEE Transactions on Artificial Intelligence, vol. 6,
no. 3, pp. 122–130, 2022.
[6] N. Reddy, S. Rao, and A. K. Naik, "Improving real estate valuation through grid search
hyperparameter tuning," ACM Transactions on Machine Learning and Optimization, vol. 15, no.
1, pp. 112–121, 2023
17