Car Price Prediction
Car Price Prediction
1.Importing Libraries
In [1]:
import warnings # Supress display of warnings
warnings.filterwarnings('ignore')
In [2]:
from sklearn.model_selection import GridSearchCV # Used to search the best paramet
from sklearn.model_selection import train_test_split # Split the data for training and
import statsmodels.api as sm
2.Importing Data
In [3]:
df = pd.read_csv('train.csv')
186005
0 45654403 13328 1399 LEXUS RX 450 2010 Jeep Yes Hybrid 3.5
km
Prod. Leather Fuel Engine
ID Price Levy Manufacturer Model Category Mileage
year interior type volume
192000
1 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep No Petrol 3
km
200000
2 45774419 8467 - HONDA FIT 2006 Hatchback No Petrol 1.3
km
168966
3 45769185 3607 862 FORD Escape 2011 Jeep Yes Hybrid 2.5
km
91901
4 45809263 11726 446 HONDA FIT 2014 Hatchback Yes Petrol 1.3
km
In [5]:
df.tail()
1616
19233 45778856 15681 831 HYUNDAI Sonata 2011 Sedan Yes Petrol 2.4
1163
19234 45804997 26108 836 HYUNDAI Tucson 2010 Jeep Yes Diesel 2
512
19235 45793526 5331 1288 CHEVROLET Captiva 2007 Jeep Yes Diesel 2
1869
19236 45813273 470 753 HYUNDAI Sonata 2012 Sedan Yes Hybrid 2.4
In [6]:
df.shape
In [7]:
y = df["Price"]
y.shape
Out[7]: (19237,)
In [8]:
# Drop Price column from the dataset
df = df.drop(['Price'], axis = 1)
In [9]:
df.shape
info
ID int64 18924
Manufacturer object 65
Category object 11
Cylinders float64 13
Doors object 3
Wheel object 2
Color object 16
Airbags int64 17
In [11]:
df.describe()
count 19237 19237 19237 19237 19237 19237 19237 19237 19237 19237
freq 5819 3769 1083 8736 13954 10150 3916 721 13514 12874
4. Distribution of variables
5. Study correlation
6. Detect outliers
In [13]:
df["Mileage"]=df["Mileage"].str.split(" ",n=1,expand=True) #Separate out km
In [14]:
df["Mileage"]=df["Mileage"].astype("float") #Converting string to float
In [15]:
# Replace all '0' valus in Mileage with mean value
In [16]:
# Checking the unique values of 'Doors' column
df["Doors"].unique()
In [17]:
# Removing string literals from 'Doors' column
In [18]:
#Checking the unique values of 'levy column'
df['Levy'].unique()
Out[18]: array(['1399', '1018', '-', '862', '446', '891', '761', '751', '394',
In [19]:
#Replacing the '-' character with 0
df["Levy"]=pd.to_numeric(df['Levy'].replace('-','0'), downcast='float')
In [20]:
#Replacing the 0 value with mean
df["Levy"]=np.where(df["Levy"]=='0.0',df["Levy"].mean(),df["Levy"])
In [21]:
#Check Unique values of Engine Volume\
df["Engine volume"].unique()
Out[21]: array(['3.5', '3', '1.3', '2.5', '2', '1.8', '2.4', '4', '1.6', '3.3',
'2.3 Turbo', '1.4', '5.5', '2.8 Turbo', '3.2', '3.8', '4.6', '1.2',
'5', '1.7', '2.9', '0.5', '1.8 Turbo', '2.4 Turbo', '3.5 Turbo',
'2.1', '0.7', '5.4', '1.3 Turbo', '3.7', '1', '2.5 Turbo', '2.6',
'1.9 Turbo', '4.4 Turbo', '4.7 Turbo', '0.8', '0.2 Turbo', '5.7',
'1.7 Turbo', '6.3 Turbo', '2.7 Turbo', '4.3', '4.2', '2.9 Turbo',
'0', '4.0 Turbo', '20', '3.6 Turbo', '0.3', '3.7 Turbo', '5.9',
'0.6 Turbo', '6.8', '4.5', '0.6', '7.3', '0.1', '1.0 Turbo', '6.3',
'4.5 Turbo', '0.8 Turbo', '4.2 Turbo', '3.1', '5.0 Turbo', '6.4',
'3.9', '5.7 Turbo', '0.9', '0.4 Turbo', '5.4 Turbo', '0.3 Turbo',
In [22]:
# Removing all the string literals
In [23]:
# Replace 0 values with the mean value
In [24]:
df.head()
0 45654403 1399.0 LEXUS RX 450 2010 Jeep Yes Hybrid 3.5 186005.0
3 45769185 862.0 FORD Escape 2011 Jeep Yes Hybrid 2.5 168966.0
4 45809263 446.0 HONDA FIT 2014 Hatchback Yes Petrol 1.3 91901.0
Feature engineering is the process of deriving new features(indepedent variable) from the existing
features(independent variable), that will help in training the model and in turn would improve the
accuracy of the model or reduce the loss of the model.
In [25]:
#Feature engineering using the 'production year' column
current_time = dt.datetime.now()
In [26]:
df.head()
In [27]:
#Checkinh missing values in the Dataset
df.isnull().sum()
Out[27]: ID 0
Levy 0
Manufacturer 0
Model 0
Prod. year 0
Category 0
Leather interior 0
Fuel type 0
Engine volume 0
Mileage 0
Cylinders 0
Drive wheels 0
Doors 0
Wheel 0
Color 0
Airbags 0
dtype: int64
In [28]:
# After Preprocessing Dataset
df.head()
sns.heatmap(df.isnull(),cbar=False)
plt.show()
Correlation:
1) The function 'corr' tells us about the correlation between two features.
5) We drop the features with high correation to avoid multicollinearity, since it is one of the
assumption of linear regression that the dataset should have no multicollinearity.
In [30]:
#Checking for correlation among the independent variables
sns.heatmap(df.corr(),cbar=True, annot=True)
plt.show()
We can see that 'Engine volume' is having high correlation(0.78) with 'Cylinders' column. We will
drop the 'Cylinders' column to avoid multicollinearity.
In [31]:
# Droping the Cylinders column
df.drop(["Cylinders"],axis=1, inplace=True)
In [32]:
# To identify distribution and skewness by plotting histogram of all numeric variables
df.hist()
plt.tight_layout()
plt.show()
We can see that 'Prod. year','Levy' and 'Engine volume' columns are right skewed.
In [33]: #Check the skewness values using the fuction skew()
In [34]:
#Check the normality distribution of target variable
y.hist()
plt.show()
In [35]:
#normalityt test using shapiro()
# the test returns the the test statistics and the p-value of the test
stat, p=shapiro(y)
Statistics=0.013, p-value=0.000
In [36]:
# set the level of significance to 0.05
alpha = 0.05
if p > alpha:
else:
In [37]:
#Checking the skewness value
In [38]:
# As from the shapiro test and the skewness value we can see that 'Price' column is hig
# Hence,we normalize the column using log transformation
y = np.log(y)
In [39]:
#Rechecking the Skewness after the log transformation
print('The skewness of the dependent variable after log tranformation',y.skew())
Hence, we can see that the skewness of the dependent variable has considerably got reduced.
In [40]:
# seperating the categorical and numerical data
In [41]:
categorical
Left
0 LEXUS RX 450 Jeep Yes Hybrid Automatic 4x4 04 Silver
wheel
Left
1 CHEVROLET Equinox Jeep No Petrol Tiptronic 4x4 04 Black
wheel
Right-
2 HONDA FIT Hatchback No Petrol Variator Front 04 hand Black
drive
Left
3 FORD Escape Jeep Yes Hybrid Automatic 4x4 04 White
wheel
Left
4 HONDA FIT Hatchback Yes Petrol Automatic Front 04 Silver
wheel
Leather Fuel Gear box Drive
Manufacturer Model Category Doors Wheel Color
interior type type wheels
... ... ... ... ... ... ... ... ... ... ...
MERCEDES- Left
19232 CLK 200 Coupe Yes CNG Manual Rear 02 Silver
BENZ wheel
Left
19233 HYUNDAI Sonata Sedan Yes Petrol Tiptronic Front 04 Red
wheel
Left
19234 HYUNDAI Tucson Jeep Yes Diesel Automatic Front 04 Grey
wheel
Left
19235 CHEVROLET Captiva Jeep Yes Diesel Automatic Front 04 Black
wheel
Left
19236 HYUNDAI Sonata Sedan Yes Hybrid Automatic Front 04 White
wheel
In [42]:
# getting dummies for the categorical variables
dummies
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
19232 0 0 0 0
19233 0 0 0 0
19234 0 0 0 0
19235 0 0 0 0
19236 0 0 0 0
In [43]:
# creating the final dataset. Concatinate categorical and numerical dataset
In [44]:
df_final.shape
Split the data into training and test set using train_test_split
In [45]:
#Splitting the data into test and train
X = df_final
Y = y
# Splitting the data before building the model in order to train the model and later ch
# The argument "test_size" tells about the ratio of data, that needs to be kept for tes
In [46]:
print(f'The shape of the X_train dats is : {X_train.shape}')
In [48]:
#Predicting the output of teh test data
predicted = model1.predict(X_test)
In [49]:
# calculating the metrics root mean squared log error
RMSLE = np.sqrt(mean_squared_log_error(predicted,Y_test))
mae = mean_absolute_error(Y_test,predicted)
In [50]:
# create the result table for all accuracy scores
'RMSE' : RMSE,
'RMSLE' : RMSLE,
})
We need to select significant features to train the model.If there are lot of features it results in
increasing the model complexity.
In this model, linear regression from statsmodels has been used to find the probability values of all
the features.All the significant features have been selected i.e. features having p-value less than 0.05.
Ridge regressor has been used to find the best parameters.These parameters have been used in the
ridge regressor to predict the price of cars
In [51]:
#instantiating the ordinary least square model
model2 = sm.OLS(Y_train,X_train).fit()
Df Model: 1350
Model_Aqua g soft leather sele -1.53e-07 5.39e-07 -0.284 0.776 -1.21e-06 9.03e-07
Model_Astra GTC 1.9 turbo dies 0.6202 1.344 0.461 0.644 -2.014 3.255
Model_B 170 Edition One 4.51e-07 5.45e-07 0.828 0.408 -6.17e-07 1.52e-06
Model_C 230 2.0 kompresor -0.5131 1.388 -0.370 0.712 -3.234 2.208
Model_C 250 1,8 turbo -0.0285 1.384 -0.021 0.984 -2.741 2.684
Model_C 250 1.8 ტურბო -2.704e-07 5.41e-07 -0.500 0.617 -1.33e-06 7.9e-07
Model_C 300 6.3 AMG Package 1.7029 1.383 1.231 0.218 -1.008 4.414
Model_CLS 450 CLS 400 2.133e-07 5.52e-07 0.386 0.699 -8.69e-07 1.3e-06
Model_E 350 4 Matic AMG Packag 2.28e-07 5.81e-07 0.393 0.695 -9.1e-07 1.37e-06
Model_Every Landy NISSAN SEREN -0.6713 1.314 -0.511 0.609 -3.246 1.904
Model_G 65 AMG G63 AMG 2.3514 1.386 1.697 0.090 -0.365 5.068
Model_GLC 300 GLC coupe 1.3062 1.004 1.301 0.193 -0.662 3.274
Model_GLE 400 Coupe, AMG Kit 2.5375 1.384 1.833 0.067 -0.176 5.250
Model_GX 470 SUV 4D (4.7L V8 S 0.5621 1.342 0.419 0.675 -2.069 3.193
Model_ML 350 SPECIAL EDITION 0.0255 1.384 0.018 0.985 -2.688 2.739
Model_Pajero Mini 2008 წლიანი 0.3481 1.333 0.261 0.794 -2.264 2.960
Model_Pajero Mini 2010 წლიანი -0.3425 1.334 -0.257 0.797 -2.957 2.272
Model_Range Rover Evoque 2.0 0.0321 1.266 0.025 0.980 -2.450 2.514
Model_Range Rover Evoque რესტა 0.5949 1.266 0.470 0.638 -1.887 3.077
Model_S 350 CDI 320 5.152e-07 6.09e-07 0.846 0.397 -6.78e-07 1.71e-06
Model_Step Wagon RG2 SPADA 0.2527 1.343 0.188 0.851 -2.380 2.886
Model_Tacoma TRD Off Road 2.0919 1.392 1.503 0.133 -0.637 4.820
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.41e-30. This might indicate that there are
In [52]:
#all the independent features having p-values less than 0.05 has been selected
In [53]:
# splitting the dataset into train and test data
X1_train,X1_test,Y1_train,Y1_test=train_test_split(X1,Y,test_size=0.2,random_state=30)
In [54]:
# instantiating the ridge regressor
ridge = Ridge()
parameters={'alpha':[1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50,
'solver' : ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
'normalize':[True,False]
In [ ]:
#using 'GridSearchCV' to find the best parameters
ridge_regressor=GridSearchCV(ridge,parameters,scoring='r2',cv=5)
ridge_regressor.fit(X1_train,Y1_train)
In [ ]:
#instantiating the ridge regressor with best parameters
model2.fit(X1_train,Y1_train)
In [ ]:
#predict the price of cars using the 'predict()'
p=model2.predict(X1_test)
rmse=rmse(p,y1_test)
rmsle=np.sqrt(mean_squared_log_error(p,y1_test))
mae=mean_absolute_error(p,y1_test)
In [ ]:
linreg_model2 = pd.Series({'Model': "Ridge Regression with feature selection using p-va
'RMSE':rmse,
'RMSLE': rmsle,
})
result_tabulation
rf=RandomForestRegressor()
In [ ]:
#Creating a panda series containing the features and their importances
best_features=pd.Series(rf.feature_importances_,index=X_train.columns)
In [ ]:
# Creating a dataframe containing the most significant independent features
In [ ]:
# splitting the data for training and testing the data
X3_train,X3_test,y3_train,y3_test=train_test_split(X3,Y,test_size=0.3,random_state=30)
model3=LinearRegression()
model3.fit(X3_train,y3_train)
In [ ]:
#predict the price of cars using the 'predict()'
p1=model3.predict(X3_test)
rms=rmse(p1,y3_test)
rmsle=np.sqrt(mean_squared_log_error(p1,y3_test))
mae=mean_absolute_error(p1,y3_test)
In [ ]:
linreg_model3 = pd.Series({'Model': "Linear regression with feature selection",
'RMSE':'-',
'RMSLE': rmsle,
})
result_tabulation
ridge=Ridge()
parameters={'alpha':[1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50,
'solver' : ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
'normalize':[True,False]
ridge_regressor=GridSearchCV(ridge,parameters,scoring='r2',cv=5)
ridge_regressor.fit(X3_train,y3_train)
print(ridge_regressor.best_params_)
In [ ]:
model4=Ridge(alpha=10, normalize=False, solver='svd')
model4.fit(X3_train,y3_train)
p2=model4.predict(X3_test)
rmse=rmse(p2,y3_test)
rmsle=np.sqrt(mean_squared_log_error(p2,y3_test))
mae=mean_absolute_error(p2,y3_test)
In [ ]:
linreg_model4 = pd.Series({'Model': "Ridge Regression with hyperparameter tuning and fe
'RMSE':'-',
'RMSLE': rmsle,
})
result_tabulation
11.Conclusion
Out of the four models built, metrics of Linear regression looks better than the other three models
as the losses are less. But as per the test data we should consider the second model i.e. ridge
regression with hyperparameter tuning and feature selection as it will reduce the variance.Hece
model 2 will work better on testing data.
In [ ]: