TEM Journal. Volume 8, Issue 1, Pages 113-118, ISSN 2217-8309, DOI: 10.18421/TEM81-16, February 2019.

Car Price Prediction using Machine Learning Techniques
Enis Gegic, Becir Isakovic, Dino Keco, Zerina Masetic, Jasmin Kevric
International Burch University, Sarajevo, Bosnia and Herzegovina

Abstract – Car price prediction has been a high-interest research area, as it requires noticeable effort and the knowledge of a field expert. A considerable number of distinct attributes are examined for reliable and accurate prediction. To build a model for predicting the price of used cars in Bosnia and Herzegovina, we applied three machine learning techniques (Artificial Neural Network, Support Vector Machine and Random Forest). Rather than being used in isolation, these techniques were applied as an ensemble. The data used for the prediction was collected from the web portal autopijaca.ba using a web scraper written in the PHP programming language. The respective performances of the different algorithms were then compared to find the one that best suits the available data set. The final prediction model was integrated into a Java application. Furthermore, the model was evaluated using test data and an accuracy of 87.38% was obtained.

Keywords – car price prediction, support vector machines, classification, machine learning.

DOI: 10.18421/TEM81-16
https://dx.doi.org/10.18421/TEM81-16
Corresponding author: Enis Gegic, International Burch University, Sarajevo, Bosnia and Herzegovina
Email: enis.gegic@ibu.edu.ba
Received: 29 March 2018.
Accepted: 28 January 2019.
Published: 27 February 2019.
© 2019 Enis Gegic et al; published by UIKTEN. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.
The article is published with Open Access at www.temjournal.com

1. Introduction

Car price prediction is an interesting and popular problem. According to the Agency for Statistics of BiH, 921,456 vehicles were registered in 2014, of which 84% are cars for personal use [1]. This number increased by 2.7% compared with 2013, and it is likely that this trend will continue and the number of cars will increase in the future. This adds significance to the problem of car price prediction.

Accurate car price prediction involves expert knowledge, because the price usually depends on many distinctive features and factors. Typically, the most significant ones are brand and model, age, horsepower and mileage. The fuel type used in the car, as well as the fuel consumption per mile, strongly affects the price of a car because fuel prices change frequently. Other features, such as exterior color, number of doors, type of transmission, dimensions, safety, air conditioning, interior, and whether the car has navigation, also influence the price. In this paper, we applied different methods and techniques in order to achieve higher precision in used car price prediction.

This paper is organized as follows. Section 2 covers related work in the field of price prediction of used cars. Section 3 explains the research methodology of our study. Section 4 elaborates on various machine learning algorithms and examines their respective performance in predicting the price of used cars. Finally, Section 5 gives the conclusion of our work, together with plans for future work.

2. Related Work

Predicting the price of used cars has been studied extensively. Listiani, in her Master's thesis [2], showed that a regression model built using Support Vector Machines (SVM) can predict the price of a leased car with better precision than multivariate regression or simple multiple regression. The reason is that SVM deals better with datasets that have more dimensions and is less prone to overfitting and underfitting. The weakness of that research is that the improvement of the more advanced SVM regression over simple regression was not expressed in basic indicators such as mean, variance or standard deviation.

Another approach was given by Richardson in his thesis [3]. His theory was that car producers produce more durable cars. Richardson applied multiple regression analysis and demonstrated that hybrid cars retain their value for a longer time than traditional cars. This has roots in environmental concerns about the climate and in the higher fuel efficiency of hybrid cars.

Wu et al. [4] conducted a car price prediction study using a neuro-fuzzy knowledge-based system. They took into consideration the following attributes: brand, year of production and type of engine. Their prediction model produced results similar to those of a simple regression model. Moreover, an expert system named ODAV (Optimal Distribution of Auction Vehicles) was built [5], because car dealers face a high demand for selling cars at the end of the leasing year. This system gives insights into the best prices for vehicles, as well as the location where the best price can be obtained. A regression model based on the k-nearest neighbor machine learning algorithm was used to predict the price of a car. The system has been exceptionally successful, since more than two million vehicles were exchanged through it [5].

Gongqi et al. [6] proposed a model built using ANN (Artificial Neural Networks) for the price prediction of a used car. They considered several attributes: mileage, estimated car life and brand. The proposed model was built so that it could deal with nonlinear relations in the data, which was not the case with previous models that used simple linear regression techniques. The non-linear model was able to predict car prices with better precision than the linear models.

Furthermore, Pudaruth [7] applied various machine learning algorithms, namely k-nearest neighbors, multiple linear regression analysis, decision trees and Naive Bayes, for car price prediction in Mauritius. The dataset used to create the prediction model was collected manually from local newspapers over a period of less than one month, as time can have a noticeable impact on the price of a car. He studied the following attributes: brand, model, cubic capacity, mileage in kilometers, production year, exterior color, transmission type and price. However, the author found that Naive Bayes and Decision Tree were unable to predict and classify numeric values. Additionally, the limited number of dataset instances could not give high classification performance, i.e. accuracies were less than 70%.

Noor and Jan [8] built a model for car price prediction using multiple linear regression. The dataset was created over a two-month period and included the following features: price, cubic capacity, exterior color, date when the ad was posted, number of ad views, power steering, mileage in kilometers, rims type, type of transmission, engine type, city, registered city, model, version, make and model year. After applying feature selection, the authors considered only engine type, price, model year and model as input features. With this setup, the authors were able to achieve a prediction accuracy of 98%.

In the related work described above, the authors proposed prediction models based on a single machine learning algorithm. However, the single machine learning algorithm approach did not always give remarkable prediction results and could be enhanced by assembling various machine learning methods into an ensemble.

3. Materials and Methods

The approach to car price prediction proposed in this paper consists of several steps, shown in Fig. 1.

Figure 1. Block diagram of the overall classification process

Data was collected from autopijaca.ba [9], a local web portal for selling and buying cars, during the winter season, because the time of year has a high impact on car prices in Bosnia and Herzegovina. The following attributes were captured for each car: brand, model, car condition, fuel, year of manufacturing, power in kilowatts, transmission type, mileage, color, city, state, number of doors, four-wheel drive (yes/no), damaged (yes/no), navigation (yes/no), leather seats (yes/no), alarm (yes/no), aluminum rims (yes/no), digital air conditioning (yes/no), parking sensors (yes/no), xenon lights (yes/no), remote unlock (yes/no), electric rear mirrors (yes/no), seat heating (yes/no), panorama roof (yes/no), cruise control (yes/no), ABS (yes/no), ESP (yes/no), ASR (yes/no) and price expressed in BAM (Bosnian Mark).

Since manual data collection is a time-consuming task, especially when there are numerous records to process, a "web scraper" was created as part of this research to do this job automatically, in a fraction of the time. Web scraping is a well-known technique for extracting information from websites and saving it into a local file or database. Web scrapers are programmed for specific websites and can mimic regular users from the website's point of view.

After the raw data had been collected and stored in a local database, a data preprocessing step was applied. Many of the attributes were sparse and did not contain useful information for prediction. Hence, it was decided to remove them from the dataset. The attributes "state", "city", and "damaged" were completely removed.

Table 1. Processed data set sample in CSV format

brand          model   fuel      power in kilowatts  year of manufacturing  miles  leather  cruise control  price
volkswagen     golf2   Diesel    45-55               l7                     l4     no       no              0-1500
volkswagen     golf2   Gasoline  0-45                l7                     l4     no       no              0-1500
ford           escort  Gasoline  45-55               l7                     l1     no       no              0-1500
ford           fiesta  Gasoline  55-65               l4                     l2     no       no              0-1500
mercedes-benz  190     Gasoline  45-55               l7                     l4     no       no              0-1500
volkswagen     jetta   Diesel    0-45                l7                     l5     no       no              0-1500
ford           focus   Gasoline  55-65               l6                     l4     no       no              0-1500
fiat           punto   Diesel    65-75               l5                     l4     no       no              0-1500
volkswagen     golf2   Gasoline  65-75               l7                     l4     no       no              0-1500

The collected raw data set contains 1105 samples. Since the data was collected using a web scraper, many samples have only a few attributes. To clean these samples, a PHP script reads the scraped data from the database, performs the cleaning and saves the cleaned samples into a CSV file. The CSV file is later used to load the data into WEKA, software for building machine learning models [10].

After the cleanup process, the data set was reduced to 797 samples. In particular, all brands that have fewer than 10 samples, and samples where the price is higher than 60 000 BAM, were removed due to the skewed class problem.

The whole dataset creation process is shown in Fig. 2.

Figure 2. Data gathering and transformation workflow diagram

The color of the cars was normalized into a fixed set of 15 different colors. Continuous attributes such as "mileage", "year of manufacturing", "power in kilowatts" and "price" were converted into categorical values using predefined cluster intervals. The mileage was converted into five distinct categories, the year of manufacturing into seven categories and the power in kilowatts into eleven categories. The price attribute was categorized into 15 distinct categories based on price range. These categories are shown in Table 2, and a similar principle was applied to the other attributes. This data transformation process converted the regression problem into a classification problem.

Table 2. Price classification based on price ranges

From   To     Class
500    2000   500-2000
2000   3500   2000-3500
3500   5000   3500-5000
5000   6500   5000-6500
6500   8000   6500-8000
8000   9500   8000-9500
9500   11000  9500-11000
11000  14000  11000-14000
14000  17000  14000-17000
17000  20000  17000-20000
20000  25000  20000-25000
25000  30000  25000-30000
30000  60000  30000-60000
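As an illustration of the discretization step, the mapping from a numeric price to the classes listed in Table 2 could be sketched in Python as follows. This is not the PHP code used in the study. The names `PRICE_EDGES` and `price_class` are hypothetical, and the handling of boundary values (each interval closed on the left, with the final upper bound of 60000 included) is an assumption, since the paper does not state it.

```python
from bisect import bisect_right

# Price class boundaries in BAM, copied from Table 2.
PRICE_EDGES = [500, 2000, 3500, 5000, 6500, 8000, 9500, 11000,
               14000, 17000, 20000, 25000, 30000, 60000]


def price_class(price):
    """Return the Table 2 class label for a price, or None if out of range.

    Assumption: every interval is closed on the left and open on the
    right, except the last one, which also includes its upper bound.
    """
    if price < PRICE_EDGES[0] or price > PRICE_EDGES[-1]:
        return None
    # Position of the interval whose lower edge is the largest edge <= price;
    # min() keeps the top edge (60000) inside the last interval.
    i = min(bisect_right(PRICE_EDGES, price) - 1, len(PRICE_EDGES) - 2)
    return f"{PRICE_EDGES[i]}-{PRICE_EDGES[i + 1]}"


print(price_class(12500))
```

The same pattern, with a different list of edges, would apply to mileage, year of manufacturing and power in kilowatts, for which the paper does not list the intervals.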

4. Model Implementation and Evaluation

The single machine learning classifier approach that has been used in all previous research was also tested in this research. The whole data set collected in this research was split into training (90%) and testing (10%) subsets, and Artificial Neural Network, Support Vector Machine and Random Forest classifier models were built.

Random forest (RF), also known as random decision forest, belongs to the category of ensemble methods. RF can be used for classification and regression problems. The algorithm was developed by Ho as a remedy for the overfitting of decision tree algorithms [11].

Artificial Neural Network (ANN) is a machine learning model that tries to solve problems in the same way as the human brain does. Instead of neurons, the ANN uses artificial neurons, also known as perceptrons. In the human brain, neurons are connected by axons, while in an ANN weight matrices are used for the connections between artificial neurons. Information travels through these connections: from one neuron, information travels to all the neurons connected to it. By adjusting the weights between neurons, the system can be trained from input examples [12].

Support Vector Machine (SVM) can be used for solving classification and regression problems. For an input data set, the SVM can make a binary decision and decide to which of two categories an input sample belongs. The SVM algorithm is trained to label input data into two categories that are separated by the widest possible margin [12]. When the input data is not labeled, the SVM algorithm cannot be applied. For unlabeled data, it is necessary to apply an unsupervised learning method, and SVM has an implementation for this case called Support Vector Clustering (SVC) [13][14].

Table 3. Single classifier approach accuracy results

Classifier  Accuracy  Error
RF          41.18%    8.04%
ANN         42.35%    7.05%
SVM         48.23%    10.53%

The results shown in Table 3 confirm that the single machine learning classifier approach is not reliable for the prediction of car prices. Therefore, an ensemble method for car price prediction is proposed in this paper. To apply an ensemble of machine learning classifiers, a new attribute "price rank" with the values cheap, moderate and expensive was added to the data set. This attribute divides cars into three price categories: cheap (price < 12 000 BAM), moderate (12 000 BAM <= price < 24 000 BAM) and expensive (24 000 BAM <= price).

The ensemble method combines the three machine learning algorithms that were applied in the first experiment as single classifiers: RF, SVM and ANN.

The Random Forest algorithm was applied to the whole dataset to test how accurately the classifier can categorize samples into the cheap, moderate and expensive car classes. RF is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting [15]. The following features were used to build the model: brand, model, car condition, fuel, age, kilowatts, transmission, miles, color, doors, drive, leather seats, navigation, alarm, aluminum rims, digital AC, manual AC, parking sensors, xenon, remote unlock, seat heat, panorama roof, cruise control, abs, asr, esp and price.

Before the model training step, the numeric attribute price was converted into the nominal classes shown in Table 4.

Table 4. Nominal categories of car price attribute

From   To     Class
0      12000  cheap
12000  24000  moderate
24000  ….     expensive

Then the RF classifier was applied, and the results shown in Table 5 were obtained.

Table 5. Classification results with RF classifier

Type of evaluation              % of correctly classified
Cross validation with 10 folds  85.82
90% percentage split            88.75

Both classifiers, SVM and ANN, were then applied to each price category dataset: the cheap, moderate and expensive car datasets.

4.1 Applying classification on cheap dataset using SVM and ANN algorithms

The Cheap dataset was divided into 2 nominal classes, shown in Table 6.

Table 6. Nominal classes in Cheap dataset

From  To     Class
0     6000   0-6000
6000  12000  6000-12000
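The two-stage structure described in this section (a first classifier assigns the price rank, then a rank-specific classifier assigns the finer price class) can be sketched in Python as below. This is only a structural illustration and not the WEKA/Java implementation used in the study. The thresholds in `price_rank` come from the price rank definition in this section; the class `TwoStagePredictor`, the `age` feature key and the toy rules standing in for trained models are hypothetical and exist only to make the sketch runnable.

```python
def price_rank(price):
    """Derive the 'price rank' training label from a price in BAM."""
    if price < 12000:
        return "cheap"
    if price < 24000:
        return "moderate"
    return "expensive"


class TwoStagePredictor:
    """Route a sample to the classifier trained for its predicted rank.

    rank_model: callable(features) -> "cheap" | "moderate" | "expensive"
    sub_models: dict mapping each rank to callable(features) -> price class
    """

    def __init__(self, rank_model, sub_models):
        self.rank_model = rank_model
        self.sub_models = sub_models

    def predict(self, features):
        rank = self.rank_model(features)              # stage 1: price rank
        return rank, self.sub_models[rank](features)  # stage 2: price class


# Hypothetical stand-ins for trained models. The rules are invented for the
# example and are NOT taken from the paper.
def toy_rank_model(car):
    if car["age"] >= 15:
        return "cheap"
    return "moderate" if car["age"] >= 5 else "expensive"


toy_sub_models = {
    "cheap": lambda car: "0-6000" if car["age"] >= 20 else "6000-12000",
    "moderate": lambda car: "12000-18000",
    "expensive": lambda car: "24000-32000",
}

predictor = TwoStagePredictor(toy_rank_model, toy_sub_models)
result = predictor.predict({"brand": "volkswagen", "model": "golf2", "age": 25})
print(result)
```

In a real system each callable would wrap a trained RF, SVM or ANN model; keeping the routing separate from the models makes it easy to choose a different classifier for each rank.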

In total, 230 samples of the Cheap dataset were input to the SVM and ANN algorithms. After running SVM and ANN on the given dataset, the following results were obtained:

Table 7. Accuracy results for SVM and ANN on Cheap dataset

Type of evaluation              SVM    ANN
Cross validation with 10 folds  86.96  83.91
90% percentage split            86.96  73.91

4.2 Applying Classification on Moderate dataset using SVM and ANN algorithms

The model is further trained on the Moderate dataset. For this purpose, the attribute price is divided into 2 classes, shown in Table 8.

Table 8. Nominal classes in Moderate dataset

From   To     Class
12000  15000  12000-18000
18000  21000  18000-24000

After applying the SVM and ANN (Multilayer Perceptron) algorithms to the dataset, we got the following results.

Table 9. Accuracy results for SVM and ANN on Moderate dataset

Type of evaluation              SVM    ANN
Cross validation with 10 folds  78.65  76.41
90% percentage split            83.33  86.11

4.3 Applying Classification on Expensive dataset using SVM and ANN algorithms

As for the previous datasets, the model is trained on the Expensive dataset. For this purpose, the attribute price is grouped into 2 classes.

Table 10. Nominal classes for Expensive dataset

From   To     Class
24000  28000  24000-32000
32000  36000  32000-...

The SVM and ANN algorithms are then applied to the Expensive dataset, and the results shown in Table 11 are obtained.

Table 11. Accuracy results for SVM and ANN on Expensive dataset

Type of evaluation              SVM    ANN
Cross validation with 10 folds  79.72  75
90% percentage split            90.48  85.71

After the models were built, they were assembled into the final prediction system, shown in Fig. 3. For the 90% dataset split, SVM achieved the highest accuracy on the Cheap and Expensive subsets, while ANN performed better on the Moderate subset.

Figure 3. Prediction model for 90% split case

The final prediction system has been incorporated into a Java Swing GUI application for car price prediction. The simple application GUI, shown in Fig. 4, enables potential car buyers to estimate the price of the desired car.

The proposed prediction model was evaluated on the test subset and achieved an overall accuracy of 87.38%. This indicates that the combination of multiple machine learning classifiers strengthens the overall classification performance.

Figure 4. Graphical user interface of the Java application for car price prediction

5. Conclusion

Car price prediction can be a challenging task due to the high number of attributes that should be considered for accurate prediction. The major step in the prediction process is the collection and preprocessing of the data. In this research, PHP scripts were built to normalize, standardize and clean the data to avoid unnecessary noise for the machine learning algorithms.
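To make the cleaning rules reported in Section 3 concrete, a minimal Python sketch is given below. It is not the PHP script used in the study. The function name `clean`, the dictionary record layout, the rule for what counts as an unusable sample (missing brand or price), and the choice to count brand frequencies before applying the price filter are all assumptions made for the example.

```python
from collections import Counter

DROPPED_ATTRIBUTES = ("state", "city", "damaged")  # removed in Section 3
MIN_SAMPLES_PER_BRAND = 10                         # brands below this are dropped
MAX_PRICE_BAM = 60000                              # samples above this are dropped


def clean(samples):
    """Apply the Section 3 cleaning rules to a list of dict records."""
    # Assumption: a sample is usable only if it has a brand and a price.
    usable = [s for s in samples
              if s.get("brand") and s.get("price") is not None]
    # Assumption: brand frequencies are counted before the price filter.
    brand_counts = Counter(s["brand"] for s in usable)
    cleaned = []
    for s in usable:
        if brand_counts[s["brand"]] < MIN_SAMPLES_PER_BRAND:
            continue  # rare brand
        if s["price"] > MAX_PRICE_BAM:
            continue  # price outlier
        # Keep every attribute except the ones removed as uninformative.
        cleaned.append({k: v for k, v in s.items()
                        if k not in DROPPED_ATTRIBUTES})
    return cleaned
```

Calling `clean` on the scraped records would return only the samples that pass both filters, with the three removed attributes stripped from each record.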

Data cleaning is one of the processes that increases prediction performance, yet it is insufficient for complex data sets such as the one in this research. Applying a single machine learning algorithm to the data set gave an accuracy of less than 50%. Therefore, an ensemble of multiple machine learning algorithms was proposed, and this combination of ML methods achieves an accuracy of 87.38%. This is a significant improvement compared to the single machine learning method approach. However, the drawback of the proposed system is that it consumes much more computational resources than a single machine learning algorithm.

Although this system has achieved strong performance in the car price prediction problem, our aim for future research is to test whether it works successfully with various data sets. We will extend our test data with the eBay [16] and OLX [17] used car data sets and validate the proposed approach.

References

[1] Agencija za statistiku BiH. (n.d.). Retrieved from: http://www.bhas.ba [accessed: July 18, 2018].
[2] Listiani, M. (2009). Support vector regression analysis for price prediction in a car leasing application (Doctoral dissertation, Master thesis, TU Hamburg-Harburg).
[3] Richardson, M. S. (2009). Determinants of used car resale value. Retrieved from: https://digitalcc.coloradocollege.edu/islandora/object/coccc%3A1346 [accessed: August 1, 2018].
[4] Wu, J. D., Hsu, C. C., & Chen, H. C. (2009). An expert system of price forecasting for used cars using adaptive neuro-fuzzy inference. Expert Systems with Applications, 36(4), 7809-7817.
[5] Du, J., Xie, L., & Schroeder, S. (2009). Practice Prize Paper—PIN Optimal Distribution of Auction Vehicles System: Applying Price Forecasting, Elasticity Estimation, and Genetic Algorithms to Used-Vehicle Distribution. Marketing Science, 28(4), 637-644.
[6] Gongqi, S., Yansong, W., & Qiang, Z. (2011, January). New Model for Residual Value Prediction of the Used Car Based on BP Neural Network and Nonlinear Curve Fit. In Measuring Technology and Mechatronics Automation (ICMTMA), 2011 Third International Conference on (Vol. 2, pp. 682-685). IEEE.
[7] Pudaruth, S. (2014). Predicting the price of used cars using machine learning techniques. Int. J. Inf. Comput. Technol, 4(7), 753-764.
[8] Noor, K., & Jan, S. (2017). Vehicle Price Prediction System using Machine Learning Techniques. International Journal of Computer Applications, 167(9), 27-31.
[9] Auto pijaca BiH. (n.d.). Retrieved from: https://www.autopijaca.ba [accessed: August 10, 2018].
[10] Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (n.d.). Retrieved from: https://www.cs.waikato.ac.nz/ml/weka/ [accessed: August 04, 2018].
[11] Ho, T. K. (1995, August). Random decision forests. In Document analysis and recognition, 1995., proceedings of the third international conference on (Vol. 1, pp. 278-282). IEEE.
[12] Russell, S. (2015). Artificial Intelligence: A Modern Approach (3rd edition). PE.
[13] Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of machine learning research, 2(Dec), 125-137.
[14] Aizerman, M. A. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and remote control, 25, 821-837.
[15] 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier — scikit-learn 0.19.2 documentation. (n.d.). Retrieved from: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html [accessed: August 30, 2018].
[16] Used cars database. (n.d.). Retrieved from: https://www.kaggle.com/orgesleka/used-cars-database [accessed: June 04, 2018].
[17] OLX. (n.d.). Retrieved from: https://olx.ba [accessed: August 05, 2018].
