Data Mining

The document discusses predicting car prices using machine learning. It describes preprocessing a car dataset by handling missing values, encoding categorical variables, and scaling numerical features. Models like linear regression and decision trees are trained on preprocessed data to predict prices, and their performance is evaluated on test data using metrics like MAE, MSE, and R2.

Sanjivani College of Engineering

Kopargaon - 423 603.

(Savitribai Phule Pune University, Pune)

Academic Year 2023-24

CIA ACTIVITY
MINI PROJECT
REPORT

Car Price Prediction

Submitted By

Yuvraj Shejul (50)


Mahesh Salpure (49)
Chetan Devre (14)
Pranav Autade (04)

(B. Tech)

Guide:

Prof. S.S.Kulkarni
1) Problem Statement

The car features and price prediction analysis aims to predict car prices from vehicle features
using machine learning techniques. The objectives are to explore the dataset, preprocess the
data, build predictive models, and evaluate their performance.

2) Solution

To implement the car price prediction analysis, follow these general steps.
1) Data Preparation: Acquire a comprehensive car dataset with relevant
features.
2) Data Cleaning: Address missing values, outliers, and inconsistencies in the dataset to
ensure the quality of the analysis.
3) Exploratory Data Analysis: Conduct exploratory data analysis to understand the
distribution of key variables, identify patterns, and gain insights into the characteristics
that influence car prices.

3) Dataset Structure:

The dataset is organized into rows, each representing a unique car entry, and columns representing
different attributes associated with the cars. The dataset contains the following columns:

ID: Unique identifier for each car entry.


Price: The target column, representing the price of the car.
Levy: Levy or tax applied to the car's price.
Manufacturer: The brand or manufacturer of the car.
Model: The specific model name or number of the car.
Prod. Year: The year in which the car was manufactured.
Category: The category to which the car belongs (e.g., sedan, SUV, hatchback).
Leather Interior: Whether the car features a leather interior (Yes/No).
Fuel Type: The type of fuel the car uses (e.g., gasoline, diesel).
Engine Volume: The volume of the car's engine, stored as text that may include a "Turbo" suffix.
Mileage: The total distance the car has traveled in kilometers.
Cylinders: The number of cylinders in the car's engine.
Gear Box Type: The type of gear box in the car (e.g., automatic, manual).
Drive Wheels: The type of wheels the car uses for driving (e.g., front-wheel drive, all-wheel drive).
Doors: The number of doors on the car.
Wheel: The steering wheel position (e.g., left wheel, right-hand drive).
Color: The color of the car's exterior.
Airbags: The number of airbags installed in the car for safety.

4) Important Libraries:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("car_price_prediction.csv")
5) Understanding The Dataset:

df.describe().T
df.shape
df.info()
df.duplicated().sum()
df.isnull().sum()
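
As a minimal sketch of what these inspection calls report, the snippet below builds a small made-up DataFrame (the values are illustrative assumptions, not rows from car_price_prediction.csv) that contains one duplicated row and one missing value.

```python
import numpy as np
import pandas as pd

# Made-up toy data: row 2 repeats row 0, and one Levy value is missing
toy = pd.DataFrame({
    "Price": [12000, 15000, 12000, 9000],
    "Levy": [500.0, 700.0, 500.0, np.nan],
    "Manufacturer": ["TOYOTA", "HONDA", "TOYOTA", "FORD"],
})

n_rows, n_cols = toy.shape                   # (rows, columns)
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
missing_per_column = toy.isnull().sum()      # count of missing values per column

print(n_rows, n_cols, n_duplicates)
print(missing_per_column)
```

A non-zero duplicate count or missing-value count is the signal that the cleaning steps in the next section are needed.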

6) Data Preprocessing:

Preprocessing Steps

The preprocessing steps for the car price dataset involve handling
missing values, encoding categorical variables, and scaling numerical features. Here's a
suggested set of preprocessing steps.
1) Handling Missing Values:

Check for any missing values in the dataset and decide on an appropriate
strategy for handling them. This might involve imputation, removal of rows or columns
with missing values, or other techniques depending on the nature of the missing data.
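
The two most common strategies mentioned above, removal and imputation, can be sketched on a made-up frame. The column values and the choice of the median as the fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value in each column
toy = pd.DataFrame({
    "Levy": [500.0, np.nan, 700.0, 900.0],
    "Cylinders": [4.0, 6.0, np.nan, 4.0],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill each missing value with the median of its column
imputed = toy.fillna(toy.median())

print(dropped)
print(imputed)
```

Dropping keeps only complete rows (two of four here), while imputation keeps every row at the cost of inventing plausible values.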

2) Encoding Categorical Variables:

Convert categorical variables, such as Manufacturer, Category, and Fuel type,
into numerical representations. This can be achieved through
one-hot encoding or label encoding. One-hot encoding creates binary columns for each
category, while label encoding assigns a unique numerical label to each category.
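
A minimal sketch of the difference between the two encodings, using a made-up "Fuel type" column (the category values are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({"Fuel type": ["Petrol", "Diesel", "Hybrid", "Petrol"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(toy, columns=["Fuel type"])

# Label encoding: one integer code per category
# (pandas assigns codes in sorted category order: Diesel=0, Hybrid=1, Petrol=2)
label_codes = toy["Fuel type"].astype("category").cat.codes

print(one_hot)
print(label_codes.tolist())
```

Label encoding is compact but imposes an artificial order on the categories, which a linear model will treat as a numeric magnitude. One-hot encoding avoids that at the cost of more columns.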

3) Dropping Unnecessary Columns:

Remove any columns that are not relevant to the analysis or do not contribute
meaningful information.
After performing these preprocessing steps, the dataset will be more suitable for
training machine learning models. The choice of specific preprocessing steps may vary
based on the characteristics of the data and the goals of the analysis.
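
As a small sketch of the column-dropping step: the snippet assumes, for illustration, that an identifier column such as ID carries no predictive information. The data values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [101, 102, 103],
    "Price": [12000, 15000, 9000],
    "Mileage": [80000, 120000, 60000],
})

# Assumption: ID only identifies the row, so it is dropped before modeling
reduced = toy.drop(columns=["ID"])

print(list(reduced.columns))
```
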
Code:

1. Remove Duplicates:

df = df.drop_duplicates()

2. Clean Values:

# Clean column names


df.columns = df.columns.str.strip()

# Remove 'km' from mileage and convert to a number
df["Mileage"] = df["Mileage"].str.replace(' km', '', regex=False).astype(float)

# Handle the '-' placeholder in Levy by treating it as 0
df["Levy"] = df["Levy"].replace("-", "0").astype(float)

# Extract 'With Turbo' information, then strip the suffix and convert to a number
df['With Turbo'] = df['Engine volume'].str.contains(' Turbo', case=False).astype(int)
df["Engine volume"] = df["Engine volume"].str.replace(' Turbo', '', regex=False).astype(float)
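
To make the effect of this kind of cleaning concrete, here is a sketch on two made-up rows (the values are assumptions). It also shows why substituting the empty string is not a safe way to fill blanks: an empty pattern matches between every character.

```python
import pandas as pd

toy = pd.DataFrame({
    "Mileage": ["186005 km", "0 km"],
    "Levy": ["1399", "-"],
    "Engine volume": ["3.5", "2.0 Turbo"],
})

# Strip the unit suffix and convert to a number
toy["Mileage"] = toy["Mileage"].str.replace(" km", "", regex=False).astype(float)

# Replace the whole "-" placeholder value with "0", then convert
toy["Levy"] = toy["Levy"].replace("-", "0").astype(float)

# Record the turbo flag before removing the suffix
toy["With Turbo"] = toy["Engine volume"].str.contains("Turbo", case=False).astype(int)
toy["Engine volume"] = toy["Engine volume"].str.replace(" Turbo", "", regex=False).astype(float)

# Pitfall: replacing "" inserts the replacement around every character
broken = "1399".replace("", "0")

print(toy)
print(broken)
```
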

3. Handling Missing Values:

The check with df.isnull().sum() shows no missing values in any feature.

4. Handling Data Types:

# Convert categorical columns to 'category' type
categorical_columns = ['Manufacturer', 'Category', 'Leather interior', 'Fuel type',
                       'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color', 'Model']
for column in categorical_columns:
    df[column] = df[column].astype('category')

5. Scaling:

from sklearn.preprocessing import MinMaxScaler

# Scale Mileage and Levy to the [0, 1] range (each column is scaled independently)
scale = MinMaxScaler()
df[["Mileage", "Levy"]] = scale.fit_transform(df[["Mileage", "Levy"]])
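
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below implements that formula directly in plain Python. The helper name min_max_scale and the mileage values are hypothetical, for illustration only.

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


mileage = [0.0, 50000.0, 200000.0]   # made-up values
scaled = min_max_scale(mileage)
print(scaled)
```

Because the result depends on the column's minimum and maximum, a single extreme value compresses all other values toward 0.
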
6. Encoding:

# Label encoding
# Map the values of selected columns to integer codes
label_maps = {
    'Doors': {'04-May': 1, '02-Mar': 2, '>5': 3},
    'Wheel': {'Left wheel': 1, 'Right-hand drive': 2},
    'Drive wheels': {'Front': 1, '4x4': 2, 'Rear': 3},
    'Gear box type': {'Automatic': 1, 'Tiptronic': 2, 'Manual': 3, 'Variator': 4},
    'Leather interior': {'Yes': 1, 'No': 2},
}
for column, mapping in label_maps.items():
    # Assign the result back instead of using inplace=True on a column selection
    df[column] = df[column].map(mapping).astype(int)

# One-hot encoding for categorical columns
df_encoded = pd.get_dummies(df, columns=["Manufacturer", "Category", "Color", "Fuel type", "Model"])
Exploratory Data Analysis (EDA):

# Bar Plots for Categorical Variables

# Average Price per Manufacturer, sorted from highest to lowest (the sort order is an assumption)
sorted_manufacturer_avg_price = df.groupby('Manufacturer', observed=True)['Price'].mean().sort_values(ascending=False)

# Plot a bar plot of average Price by Manufacturer
plt.figure(figsize=(12, 6))
sorted_manufacturer_avg_price.plot(kind='bar', color='skyblue')
plt.xlabel('Manufacturer')
plt.ylabel('Average Price')
plt.title('Average Car Price by Manufacturer')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Correlation Heatmap

A heatmap to visualize the correlation between different numerical features in the dataset.
Code:

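plt.figure(figsize=(15, 10))
# Restrict the correlation to numeric columns; non-numeric columns cannot be correlated
cor = df.corr(numeric_only=True)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()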

## 5.1 Price Distribution by Category


plt.figure(figsize=(14, 8))
sns.boxplot(x='Category', y='Price', data=df, palette='viridis')
plt.title('Price Distribution by Category')
plt.xlabel('Category')
plt.ylabel('Price')
plt.show()
## 5.2 Manufacturer vs. Price
plt.figure(figsize=(18, 8))
sns.barplot(x='Manufacturer', y='Price', data=df, palette='coolwarm')
plt.title('Manufacturer vs. Price')
plt.xlabel('Manufacturer')
plt.ylabel('Price')
plt.xticks(rotation=45, ha='right')
plt.show()

# 5.5 Pairplot for Numerical Features ('Price' and 'Levy')

numerical_columns_pairplot = ['Price', 'Levy']

# pairplot creates its own figure, so the title is set on that figure
g = sns.pairplot(df[numerical_columns_pairplot])
g.figure.suptitle('Pairplot for Numerical Features (Price and Levy)', y=1.02)
plt.show()

Modeling:

Building a machine learning model involves several key steps. With the data preprocessed and
two algorithms selected (Linear Regression and Decision Tree Regressor), the general process
for building a model is outlined below.

• Train-Test Split:

Split your dataset into training and testing sets. The training set is used to train the model, while the
testing set is used to evaluate its performance on unseen data.

• Feature Scaling:

Depending on the algorithms you're using, it might be necessary to scale your features. Some
algorithms, like SVM, are sensitive to the scale of input features.

• Model Training:

Train your selected algorithms on the training data.

• Model Evaluation:

Evaluate the performance of your models using metrics relevant to your problem.
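
The four steps above can be sketched end to end on synthetic data. Everything about the data here is an assumption made for illustration (the features, the coefficients, and the noise level); it is not the car dataset. Placing the scaler inside a pipeline is one way to make sure it is fitted on the training split only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: price falls with mileage and rises with production year
rng = np.random.default_rng(0)
mileage = rng.uniform(0, 200_000, size=200)
year = rng.integers(2000, 2021, size=200)
price = 30_000 - 0.08 * mileage + 600 * (year - 2000) + rng.normal(0, 500, size=200)

X = np.column_stack([mileage, year])
y = price

# Train-test split: 70% for training, 30% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling and model training in one pipeline
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Model evaluation on the unseen test split
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, R2: {r2:.3f}")
```
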

Code:

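from sklearn.model_selection import train_test_split

# Features and target. The report does not show this step; using every encoded
# column except ID and Price as features is an assumption.
X = df_encoded.drop(columns=["ID", "Price"])
y = df_encoded["Price"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)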

Linear Regression:

from sklearn.linear_model import LinearRegression

linearregression = LinearRegression()
linearregression.fit(X_train, y_train)

y_pred1 = linearregression.predict(X_test)
LinearRegression_score = linearregression.score(X_test, y_test)

Decision Tree Regressor:

from sklearn.tree import DecisionTreeRegressor

DecisiontreeRegressor = DecisionTreeRegressor()
DecisiontreeRegressor.fit(X_train, y_train)

y_pred2 = DecisiontreeRegressor.predict(X_test)
DecisionTreeRegressor_score = DecisiontreeRegressor.score(X_test, y_test)

Model Evaluation:

Linear Regression:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Lists that collect the metrics of each model in turn
MAE, MSE, RMSE, R2 = [], [], [], []

print("LinearRegression score:", LinearRegression_score)

# Calculate evaluation metrics
MAE.append(mean_absolute_error(y_test, y_pred1))
MSE.append(mean_squared_error(y_test, y_pred1))
RMSE.append(np.sqrt(MSE[-1]))
R2.append(r2_score(y_test, y_pred1))

# Print the evaluation metrics
print("\nLinearRegression model evaluation")
print(f'Mean Absolute Error (MAE): {MAE[-1]:.2f}')
print(f'Mean Squared Error (MSE): {MSE[-1]:.2f}')
print(f'Root Mean Squared Error (RMSE): {RMSE[-1]:.2f}')
print(f'R-squared (R²): {R2[-1]:.2f}\n')

# Result DataFrame: actual and predicted prices side by side
result1 = pd.DataFrame({"y_test": y_test, "y_predicted": y_pred1})

Decision Tree Regressor:

print("DecisionTreeRegressor score:", DecisionTreeRegressor_score)

# Calculate evaluation metrics
MAE.append(mean_absolute_error(y_test, y_pred2))
MSE.append(mean_squared_error(y_test, y_pred2))
RMSE.append(np.sqrt(MSE[-1]))
R2.append(r2_score(y_test, y_pred2))

# Print the evaluation metrics
print("\nDecisionTreeRegressor model evaluation")
print(f'Mean Absolute Error (MAE): {MAE[-1]:.2f}')
print(f'Mean Squared Error (MSE): {MSE[-1]:.2f}')
print(f'Root Mean Squared Error (RMSE): {RMSE[-1]:.2f}')
print(f'R-squared (R²): {R2[-1]:.2f}\n')

# Result DataFrame: actual and predicted prices side by side
result2 = pd.DataFrame({"y_test": y_test, "y_predicted": y_pred2})
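
For readers who want to see what the four metrics measure, the sketch below computes them by hand in plain Python from their standard definitions. The three prices and predictions are made up.

```python
import math

y_true = [10000.0, 20000.0, 30000.0]   # made-up actual prices
y_hat = [12000.0, 18000.0, 33000.0]    # made-up predicted prices

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

mae = sum(abs(e) for e in errors) / n   # mean of absolute errors
mse = sum(e * e for e in errors) / n    # mean of squared errors
rmse = math.sqrt(mse)                   # same units as the price

# R-squared: 1 minus (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

An R² near 1 means the predictions explain most of the variation in price. An R² near 0 means they do no better than always predicting the mean price.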

RESULT:

DecisionTreeRegressor score: 0.00019264796448403843


DecisionTreeRegressor model evaluation
Mean Absolute Error (MAE): 10226.77
Mean Squared Error (MSE): 140270221278.28
Root Mean Squared Error (RMSE): 374526.66
R-squared (R²): 0.00
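
As a quick consistency check on the figures above, the reported RMSE should equal the square root of the reported MSE. The sketch below verifies this and also computes the ratio of RMSE to MAE. Reading a large ratio as a sign that a few very large errors dominate is an interpretation offered here, not a finding stated in the report.

```python
import math

# Values copied from the RESULT section
reported_mse = 140270221278.28
reported_rmse = 374526.66
reported_mae = 10226.77

rmse_from_mse = math.sqrt(reported_mse)
print(f"{rmse_from_mse:.2f}")

# RMSE is never smaller than MAE; a ratio far above 1 points to a heavy-tailed error distribution
ratio = reported_rmse / reported_mae
print(f"RMSE / MAE = {ratio:.1f}")
```
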

Conclusion:

The project explored, preprocessed, and modeled the car features and price dataset.
Both Linear Regression and Decision Tree Regressor models were implemented and evaluated.
The reported Decision Tree Regressor result (R² of 0.00, RMSE of 374526.66) shows that the model
explains almost none of the variance in price, so further analysis and optimization are needed
to improve model performance.
