Sanke 2024 Ijca 923900
International Journal of Computer Applications (0975 – 8887)
Volume 186 – No.37, August 2024

ABSTRACT
The used car market is a complex ecosystem influenced by various factors such as vehicle make, model, year, mileage, and condition. Predicting the price of a used car accurately requires a comprehensive understanding of these factors.

In this paper, a machine learning-based approach is proposed to develop a chatbot that can predict the prices of used cars in India. To achieve this, three machine learning techniques, namely Gradient Boosting, Random Forest, and CatBoost, have been used. The data for prediction is collected from reputable used car marketplaces such as OLX and Cars24, using web scraping tools like Scrapy and Selenium. The techniques have been applied and their performance compared to find the one that best suits the available dataset. Additionally, the model has been evaluated using test data and an accuracy of over 80% has been achieved. A chatbot interface is provided that allows users to input car details and get real-time price estimates, helping them make informed decisions in the used car market.

Keywords
Machine Learning, used vehicles, catboost, gradient boosting, random forest, regression, prediction, chatbot

1. INTRODUCTION
The market for used cars is a dynamic space where buyers and sellers come together based on their individual needs and preferences. Unlike the new car market, which is more predictable, the used car market involves several factors such as the age of the vehicle, mileage, condition, and ownership history [15]. These factors make it difficult to determine pricing.

Knowing the right price is important when buying or selling a used car. It helps both the buyer and the seller make fair deals and avoid misunderstandings. Predicting an accurate price for a used car helps to deal with uncertainties in the market. Doing so helps ensure that all parties are treated fairly and can trust one another when making these deals.

In recent years, the application of machine learning techniques has greatly impacted the used car market by providing innovative solutions for complex predictive tasks. Machine learning methods utilize large datasets and advanced algorithms to help stakeholders derive predictive insights from historical transactions and patterns. This has led to improved accuracy and efficiency in price prediction [18].

This paper is organized as follows. Section 2 provides an overview of related work. Section 3 discusses the methodology. The implementation and results are explained in Section 4. Section 5 provides a conclusion.

2. RELATED WORK
In recent years, predicting the prices of used cars has become a significant area of interest for researchers and practitioners alike. Various machine learning and data-driven approaches have been proposed to address this problem, leveraging different algorithms and datasets from diverse geographical locations. This section provides an overview of the methodologies, datasets, and findings of multiple research studies on predicting used car prices, which are summarized in Table 1.

The study conducted by Anamika Das Mou et al. [1] focuses on predicting the probability of buying a car based on several features such as price, spare part availability, customer review, cylinder volume, and resale price. The researchers employed machine learning algorithms including Naive Bayes, Support Vector Machine (SVM), Random Forest, and K-nearest neighbor (KNN) to compare their predictive accuracy. SVM emerged as the most accurate model with 87.6% accuracy.

Bukvić et al. [2] proposed a supervised machine learning model to predict used-car prices in the Croatian market. They utilized features like year of car production, motor type, condition, kilometers traveled, horsepower, number of doors, and mass of the car. Models such as Linear Regression, Random Forest, and SVM were employed, with R2 values ranging from 0.24 to 0.95.

Fathalla et al. [3] introduced a deep learning architecture for predicting the price of second-hand items based on image and textual descriptions. Their model combined long short-term memory (LSTM) and convolutional neural networks (CNN) for price prediction, achieving promising results. The dataset consisted of second-hand item attributes collected from various websites.

Longani et al. [4] developed a system using ensemble machine learning techniques to predict prices for used cars in the Mumbai region. They compared the performance of Random Forest and eXtreme gradient boosting (XGBoost) algorithms. Attributes such as year of purchase, mileage, showroom price, engine capacity, seating capacity, and power capacity of the car battery were considered.

Liu et al. [5] proposed the PSO-GRA-BPNN method for predicting used car prices in the online market. They utilized variables like new car price, displacement, mileage, gearbox, fuel consumption, registration time, drive mode, region, engine power, emission standard, body structure, and brand. Their model outperformed traditional BPNN and GRA-BPNN models in terms of accuracy.

Salim and Abu [6] proposed a model for estimating used car prices in the Malaysian market, addressing the limitations of linear regression. They considered variables such as mileage, color, and sale location, and compared the performance of simple linear regression, cubic regression, and an S-curve model.

Pudaruth [7] applied supervised machine learning techniques to predict the price of used cars in Mauritius. He experimented with Multiple Linear Regression, K-Nearest Neighbors, Decision Trees (J48 and Random Forest), and Naïve Bayes algorithms. Factors like make, model, cylinder volume, year, mileage, and price were considered.

Gegic et al. [8] applied Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF) to predict used car prices. Attributes such as brand, model, car condition, fuel, year of manufacturing, power, transmission type, mileage, color, city, state, and number of doors were used. ANN exhibited the highest accuracy among the models tested.

Monburinon et al. [9] conducted a comparative study on regression-based supervised machine learning models for predicting used car prices using data from a German e-commerce website. They considered variables like seller information, offer type, and A/B testing variables, with gradient boosted regression trees performing the best.

Venkatasubbu and Ganesh [10] proposed deep end-to-end learning models for predicting the retail price of used cars. They compared the accuracy of Lasso Regression, Multiple Regression, and Regression Trees using data from the Kelley Blue Book. Their results indicated varying levels of accuracy among the models tested.
[5]
  Data collection: Uses web crawler technology to collect used car data.
  Approach: Predict used car prices accurately by selecting relevant features, constructing and optimizing prediction models, and evaluating their performance.
  Advantages: Grey Relation Analysis (GRA) effectively reduced the training time and improved the model's accuracy.
  Limitations: Use of multiple models and optimization methods increased the complexity of the approach.
  Algorithms: BP Neural Network (BPNN), Grey Relation Analysis (GRA), Particle Swarm Optimization (PSO).
  Results: The PSO-GRA-BPNN model achieves the best accuracy, with a MAPE of 3.936% and a MAE of 0.475.
  Attributes: Brand, drive mode, gearbox, engine power, body structure, mileage, usage time, displacement, fuel consumption, emission standard, region, new car price.

[6]
  Data collection: Data is collected from the Mudah.my website.
  Approach: Provides a more accurate pricing model for used cars, acknowledging the limitations of linear models in capturing real-world price trends.
  Advantages: S-Curve is more realistic, better forecast and dynamic. S-Curve has improved accuracy.
  Limitations: Limited dataset, and the findings may not be universally applicable to all used car markets.
  Algorithms: Linear Regression, Cubic Regression, S-Curve Model, Mean Squared Error (MSE).
  Results: S-curve model shows a slightly higher MSE.
  Attributes: Mileage, colour, and sale location.

[7]
  Data collection: Dataset consisted of 97 records of Toyota, Nissan, and Honda cars collected from daily newspapers.
  Approach: Systematic process of data collection, preprocessing, application of diverse machine learning techniques, and evaluation metrics.
  Advantages: ML techniques, feature insights, model comparisons, data normalization, and advanced algorithms in predicting car prices.
  Limitations: Small dataset, challenges in handling nominal and numeric attributes.
  Algorithms: Multiple Linear Regression, K-Nearest Neighbors (KNN), Decision Trees (J48 and Random Forest), and Naïve Bayes.
  Results: Random Forest exhibited enhanced performance when applied to the entire training dataset.
  Attributes: Model, cylinder volume, year of manufacture, and price.

[8]
  Data collection: Data was collected from autopijaca.ba.
  Approach: Data was pre-processed by removing sparse attributes and converting numeric attributes like mileage and year into categorical values.
  Advantages: SVM was used both as a standalone classifier and to categorize the car samples into price categories.
  Limitations: Consumes much more computational resources than a single machine learning algorithm.
  Algorithms: Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF).
  Results: RF achieved 85.82% accuracy, ANN 83.91%, SVM 86.96%.
  Attributes: Brand, model, condition, fuel type, age, power, mileage, colour, and various features.

[9]
  Data collection: Dataset was collected from www.kaggle.com.
  Approach: Compares the performance of multiple linear regression, random forest regression, and gradient boosted regression trees for predicting used car prices.
  Advantages: High predictive accuracy, robust to nonlinear relationships, ensemble strength, and sequential learning capabilities.
  Limitations: Computational intensity, interpretability challenges, overfitting risk, hyperparameter tuning sensitivity, and maintenance complexity in production.
  Algorithms: Multiple Linear Regression, Random Forest Regression, Gradient Boosted Regression Trees.
  Results: Gradient boosted regression MAE of 0.28, RF regression MAE of 0.35, multiple linear regression MAE of 0.55.
  Attributes: Seller information, offer type, A/B testing variables, and others.

[10]
  Data collection: Data was taken from GM cars.
  Approach: Used Lasso Regression, Multiple Regression, and Regression Trees on car data for price prediction, validated with ANOVA.
  Advantages: Uses ANOVA and Tukey's test to ensure the robustness of the results.
  Limitations: Focused on a narrow dataset of 2005 GM cars.
  Algorithms: Lasso Regression, Multiple Regression, Regression Tree.
  Results: Error rates: Lasso Regression 3.581%, Multiple Regression 3.468%, Regression Tree 3.512%.
  Attributes: Mileage, make, model, trim, type, cylinder, litre, doors, cruise control, sound system, and leather interiors.
The techniques and approaches used in predicting used car prices are diverse. Data collection methods range from web crawling to newspaper ads, and datasets originate from different regions. Various algorithms such as Support Vector Machine, Random Forest, Gradient Boosted Regression Trees, and different regression techniques are employed, each with its own advantages and disadvantages. Although some models achieve high accuracies, concerns remain regarding dataset size, regional specificity, computational resources, and interpretability.

Commonly used attributes for prediction include mileage, brand, year of production, and customer reviews.

Despite varying complexities and accuracies, these studies contribute to the development of more accurate and robust pricing models for the used car market.

3. METHODOLOGY
The purpose of this research is to assess the accuracy of different predictive algorithms in estimating the price of a used car. The main goal is to identify the algorithm that yields the most accurate results and incorporate it into the chatbot. Figure 1 illustrates the system block diagram of the proposed methodology.

Model Training: The models are trained using the preprocessed dataset. Gradient Boosting excels in capturing complex relationships and iteratively improves predictive accuracy. Random Forest provides an ensemble approach for robust predictions by constructing multiple decision trees. CatBoost, specifically designed for categorical features, enhances model performance by handling categorical variables efficiently. The training process involves optimizing model parameters to maximize predictive accuracy and generalization capability.

Dialogflow Integration: Dialogflow, a natural language processing (NLP) [18] tool, is integrated to facilitate user interactions. Intent recognition in Dialogflow identifies user queries related to used car price predictions, ensuring a user-friendly conversational interface.

User Interaction Flow: The chatbot engages users in a natural language conversation, extracting relevant information such as car model, mileage, and other features crucial for price prediction. Dialogflow processes user inputs, mapping them to specific intents and entities.

Prediction Phase: Upon gathering user details, the chatbot triggers the trained machine learning models, Gradient Boosting (GBT), CatBoost (CB), and Random Forest (RF), to predict the used car price. The models run in parallel, each contributing to the final prediction.
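As a rough illustration of the interaction and prediction steps above, the following minimal sketch shows how a webhook might turn a Dialogflow ES request into a price reply. It relies only on the `queryResult.parameters` field of the request and the `fulfillmentText` field of the response. The parameter names (`car_model`, `year`, `mileage_km`) and the `predict_price` callable are assumptions made for illustration; the paper does not list the entities or the serving code it uses.

```python
from typing import Any, Callable, Dict

# Hypothetical entity names; the paper does not specify the ones it defines.
REQUIRED_PARAMS = ("car_model", "year", "mileage_km")


def handle_webhook(request_json: Dict[str, Any],
                   predict_price: Callable[[Dict[str, Any]], float]) -> Dict[str, str]:
    """Build a Dialogflow ES fulfillment response for a price-prediction intent."""
    # Dialogflow ES places the extracted entities under queryResult.parameters.
    params = request_json.get("queryResult", {}).get("parameters", {})

    # Ask again if the conversation has not yet collected every required detail.
    missing = [name for name in REQUIRED_PARAMS if params.get(name) in (None, "")]
    if missing:
        return {"fulfillmentText": "Please provide: " + ", ".join(missing)}

    # predict_price stands in for the trained model(s); it is an assumed callable.
    price = predict_price(params)
    return {"fulfillmentText": f"Estimated price: ₹{price:,.2f}"}
```

In a deployment, a small web server would pass the parsed JSON body of each webhook call to `handle_webhook` and return the resulting dictionary as JSON.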
Result Aggregation: The individual predictions from GBT, CB,
and RF are aggregated using techniques like averaging to
provide a consolidated and more robust prediction.
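The averaging mentioned above can be sketched in a few lines. This is only one possible aggregation (an unweighted mean); the model labels and the example prices are illustrative assumptions, not values from the paper.

```python
from statistics import mean
from typing import Dict


def aggregate_predictions(predictions: Dict[str, float]) -> float:
    """Combine per-model price predictions with an unweighted average."""
    if not predictions:
        raise ValueError("at least one model prediction is required")
    return mean(predictions.values())


# Illustrative values only (prices in rupees).
consolidated = aggregate_predictions({"GBT": 480000.0, "CB": 500000.0, "RF": 490000.0})
```

A weighted average, for example one that favours the model with the best validation score, would be a natural variation on the same idea.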
User Feedback Loop: Optionally, the chatbot incorporates a
feedback loop, allowing users to provide feedback on the
predicted price. This feedback is invaluable for continuous
improvement of both the chatbot's interaction capabilities and
the predictive models.
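To make the Model Training step more concrete, the sketch below shows the core idea behind gradient boosting for squared-error loss: each round fits a small tree (here a one-split stump on a single feature) to the current residuals and adds a damped copy of it to the ensemble. It is a deliberately simplified, pure-Python illustration under these assumptions and is not the implementation used in this work; all function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(x: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Pick the one-split regression stump with the lowest squared error."""
    best_error = float("inf")
    best_stump: Stump = (x[0], 0.0, 0.0)  # fallback when x has one distinct value
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best_error:
            best_error, best_stump = error, (threshold, left_mean, right_mean)
    return best_stump


def stump_value(stump: Stump, xi: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def fit_boosting(x: Sequence[float], y: Sequence[float],
                 n_rounds: int = 20, learning_rate: float = 0.5):
    """Return (base prediction, learning rate, stumps) fitted on residuals."""
    base = sum(y) / len(y)
    current = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is simply the residual.
        residuals = [yi - ci for yi, ci in zip(y, current)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        current = [ci + learning_rate * stump_value(stump, xi)
                   for xi, ci in zip(x, current)]
    return base, learning_rate, stumps


def predict(model, xi: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, xi) for s in stumps)


def mse(model, x: Sequence[float], y: Sequence[float]) -> float:
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy data (hypothetical): car age in years against price.
ages = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prices = [10.0, 9.0, 7.0, 6.0, 4.0, 3.0]
model = fit_boosting(ages, prices)
```

On this toy data the training error shrinks as rounds are added, which is the "iterative improvement" referred to above. Production libraries add deeper trees, multiple features, subsampling, and regularization on top of this loop.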
In regression tasks, the forest's output is typically the average of the predictions of all the trees.

CatBoost: CatBoost is a state-of-the-art gradient boosting algorithm that is particularly effective for categorical data. Developed by Yandex, CatBoost stands out due to its ability to handle categorical features efficiently without requiring prior preprocessing or one-hot encoding. It automatically deals with categorical variables by implementing an efficient computation scheme and feature combination method. Its functionality is similar to other gradient boosting algorithms, where each new tree is trained to minimize the loss function, which includes regularization terms to control model complexity and overfitting.

4. IMPLEMENTATION
To implement the car price prediction algorithm, we used VS Code and Google Colab with several Python libraries. The hardware used was an Intel Core i5-10500H processor with a clock rate of 2.50 GHz and 8 GB of RAM. Development was carried out on a Windows 11 (64-bit) operating system.

Data Extraction:
A parameterization mechanism has been implemented to manage NaN (Not a Number) values within the dataset. This parameter ensures a systematic approach to dealing with missing or undefined data points. During the data refinement process, merged data was separated into distinct values based on their corresponding attributes. This separation allowed a more granular analysis and manipulation of the dataset. The program takes into account several factors, including the car's make and model, year, mileage, ownership status, fuel type, transmission, and location. In the example shown, the program predicts that a 2015 Maruti Suzuki Baleno with 50,000 kilometers driven, owned by its first owner, and located in Punjab, India, would be worth ₹497,323.27.

The collected dataset has been preprocessed to handle missing values and outliers and to ensure a uniform format. A sample of the preprocessed dataset is shown in Figure 2.

Table 2: Simple Statistics of Dataset

Attributes        Count
Data Collected    24,348
Training Data     19,478
Testing Data      4,870

Evaluation Measurement

To evaluate the results, the R2 score and the accuracy of the algorithms have been used, as indicated in Table 3.
i. R2 score: The R2 score [18] represents the proportion of the variance in the dependent variable that is predictable from the independent variables, indicating the goodness of fit of a regression model.
ii. Accuracy of algorithms: Accuracy [18] is the ratio of correct predictions to the total number of input samples. In our proposed method, the dataset is split into 80% for training and 20% for testing.

Table 3: Performance evaluation of model on algorithms.

Algorithm used       R2 score    Accuracy
Gradient Boosting    0.69        78.05%
Random Forest        0.78        78.63%
CatBoost             0.72        80.03%

According to the analysis, among the three algorithms considered, Gradient Boosting is the least accurate, with an accuracy of approximately 78.05%. Random Forest is slightly more accurate than Gradient Boosting, with an accuracy of about 78.63%. CatBoost is the most accurate of the three algorithms, with an accuracy of around 80.03%, as shown in Figure 3.
[Figure 3: Bar chart of Accuracy (%) by algorithm. Gradient Boosting: 78.05, Random Forest: 78.63, CatBoost: 80.03.]
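The 80/20 split and the R2 score used in the evaluation can be sketched as follows. This is a minimal pure-Python illustration; the function names, the fixed seed, and the use of rounding for the test-set size are assumptions, and the R2 score is computed with the standard definition 1 - SS_res / SS_tot.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(rows: Sequence, test_fraction: float = 0.2,
                  seed: int = 42) -> Tuple[List, List]:
    """Shuffle the rows and split them into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# With 24,348 records this reproduces the sizes reported in Table 2.
train_rows, test_rows = split_dataset(range(24348))
```

An R2 score of 1.0 means the predictions match the observed prices exactly, while a score of 0.0 means the model does no better than always predicting the mean price.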
IJCATM : www.ijcaonline.org 13