Data Mining
CIA ACTIVITY
MINI PROJECT
REPORT
Submitted By
(B. Tech)
Guide:
Prof. S.S.Kulkarni
1) Problem Statement
The car features and price prediction analysis aim to leverage machine learning techniques to predict
car prices based on various features. The objectives include exploring the dataset, preprocessing the
data, building predictive models, and evaluating their performance.
2) Solution
To implement the car price prediction analysis, follow these general steps.
1) Data Preparation: Acquire a comprehensive car price dataset with relevant
features.
2) Data Cleaning: Address missing values, outliers, and inconsistencies in the dataset to
ensure the quality of the analysis.
3) Exploratory Data Analysis: Conduct exploratory data analysis to understand the
distribution of key variables, identify patterns, and gain insights into the features that
influence car prices.
3) Dataset Structure:
The dataset is organized into rows, each representing a unique car entry, and columns representing
different attributes associated with the cars. The dataset contains the following columns:
4) Important Libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("car_price_prediction.csv")
5) Understanding The Dataset:
df.describe().T
df.shape
df.info()
df.duplicated().sum()
df.isnull().sum()
6) Data Preprocessing:
Preprocessing Steps
The preprocessing steps for the car price dataset involve handling
missing values, encoding categorical variables, and scaling numerical features. Here is a
suggested set of preprocessing steps.
1) Handling Missing Values:
Check for any missing values in the dataset and decide on an appropriate
strategy for handling them. This might involve imputation, removal of rows or columns
with missing values, or other techniques depending on the nature of the missing data.
2) Removing Irrelevant Columns:
Remove any columns that are not relevant to the analysis or do not contribute
meaningful information.
After performing these preprocessing steps, the dataset will be more suitable for
training machine learning models. The choice of specific preprocessing steps may vary
based on the characteristics of the data and the goals of the analysis.
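The two steps above can be sketched on a small toy table. This is a minimal illustration, not the project's actual cleaning code: the column names, the values, and the choice of median and mode imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car dataset; column names and values are illustrative only.
toy = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Levy": [500.0, np.nan, 700.0, 900.0],
    "Color": ["Black", "White", None, "Black"],
    "Price": [12000, 8500, 15000, 9900],
})

# Count missing values per column before deciding on a strategy.
missing_before = toy.isnull().sum()

# Numeric column: impute with the median, which is robust to outliers.
toy["Levy"] = toy["Levy"].fillna(toy["Levy"].median())

# Categorical column: impute with the most frequent value (the mode).
toy["Color"] = toy["Color"].fillna(toy["Color"].mode()[0])

# Drop a column that carries no predictive information (a row identifier).
toy = toy.drop(columns=["ID"])

missing_after = toy.isnull().sum()
print(missing_before, missing_after, sep="\n\n")
```

Whether imputation or row removal is the better choice depends on how much data is missing and why; the median and mode are only one reasonable default.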
Code:
1. Remove Duplicates:
df = df.drop_duplicates()
2. Clean Values:
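5. Scaling:
# Assumption: the report does not define `scale`; a StandardScaler is used here.
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
# Scale the Levy column (reshaped to the 2-D array that fit_transform expects)
X2 = df["Levy"].values.reshape(-1, 1)
scaledX2 = scale.fit_transform(X2)
df["Levy"] = scaledX2.ravel()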
6. Encoding:
# Label encoding
# Replace category labels with integer codes.
# Assigning the result back avoids calling replace(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify df.
df['Doors'] = df['Doors'].replace(['04-May', '02-Mar', ">5"], [1, 2, 3])
df['Wheel'] = df['Wheel'].replace(['Left wheel', 'Right-hand drive'], [1, 2])
df['Drive wheels'] = df['Drive wheels'].replace(['Front', '4x4', "Rear"], [1, 2, 3])
df['Gear box type'] = df['Gear box type'].replace(['Automatic', 'Tiptronic', "Manual", "Variator"], [1, 2, 3, 4])
df['Leather interior'] = df['Leather interior'].replace(['Yes', 'No'], [1, 2])
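A possible alternative to `replace` is an explicit dictionary passed to `Series.map`, which also exposes labels that were not anticipated. The sketch below uses a toy column with the "Drive wheels" labels and codes from the report; the variable names `toy`, `drive_codes`, and `unmapped` are hypothetical.

```python
import pandas as pd

# Toy column using the "Drive wheels" labels that appear in the report.
toy = pd.DataFrame({"Drive wheels": ["Front", "4x4", "Rear", "Front"]})

# Explicit label-to-code mapping (same codes as in the report).
drive_codes = {"Front": 1, "4x4": 2, "Rear": 3}

# map() returns NaN for any label missing from the dictionary,
# which makes unexpected categories easy to detect.
encoded = toy["Drive wheels"].map(drive_codes)
unmapped = toy.loc[encoded.isna(), "Drive wheels"].unique()

toy["Drive wheels"] = encoded
print(toy["Drive wheels"].tolist())
print("unmapped labels:", list(unmapped))
```

Note that integer codes impose an order on the categories. For nominal features such as drive type, one-hot encoding may suit linear models better, while tree-based models are usually less affected.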
# Correlation Heatmap
A heatmap visualizes the correlation between the numerical features in the dataset, and a
pairplot shows the pairwise relationships between Price and Levy.
Code:
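# Correlation heatmap over the numeric columns only
plt.figure(figsize=(15, 10))
cor = df.corr(numeric_only=True)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
# Pairplot of the two numeric columns named in the title
numerical_columns_pairplot = ['Price', 'Levy']
sns.pairplot(df[numerical_columns_pairplot])
# pairplot creates its own figure, so the title is set at figure level
plt.suptitle('Pairplot for Numerical Features (Price and Levy)', y=1.02)
plt.show()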
Modeling:
Building a machine learning model involves several key steps. Since you've already preprocessed
your data and selected the algorithms (Linear Regression and Decision Tree Regressor), let's outline
the general process for building a model.
• Train-Test Split:
Split your dataset into training and testing sets. The training set is used to train the model, while the
testing set is used to evaluate its performance on unseen data.
• Feature Scaling:
Depending on the algorithms you're using, it might be necessary to scale your features. Some
algorithms, like SVM, are sensitive to the scale of input features.
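• Model Training:
Fit each selected model on the training set.
• Model Evaluation:
Evaluate the performance of your models using metrics relevant to your problem. For a regression
task such as price prediction, common choices are R², mean absolute error, and root mean squared error.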
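The four steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the feature matrix, the target, the 80/20 split ratio, and the use of `StandardScaler` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix X and the price target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for evaluation; the ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train each model and record R^2 on the held-out test set.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    scores[name] = model.score(X_test_scaled, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Scaling matters little for these two models (a decision tree is insensitive to feature scale), but fitting the scaler on the training set alone is a habit that carries over to scale-sensitive algorithms.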
Code:
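Linear Regression:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
# Fit on the training set and predict on the test set
linearregression = LinearRegression()
linearregression.fit(X_train, y_train)
y_pred1 = linearregression.predict(X_test)
# score() returns the coefficient of determination (R^2) on the test set
LinearRegression_score = linearregression.score(X_test, y_test)
Decision Tree Regressor:
DecisiontreeRegressor = DecisionTreeRegressor()
DecisiontreeRegressor.fit(X_train, y_train)
y_pred2 = DecisiontreeRegressor.predict(X_test)
DecisionTreeRegressor_score = DecisiontreeRegressor.score(X_test, y_test)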
Model Evaluation:
Linear Regression:
# Result DataFrame: actual vs. predicted prices
result1 = pd.DataFrame()
result1["y_test"] = y_test
result1["y_predicted"] = y_pred1
Decision Tree Regressor:
# Result DataFrame: actual vs. predicted prices
result2 = pd.DataFrame()
result2["y_test"] = y_test
result2["y_predicted"] = y_pred2
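Beyond a side-by-side table of actual and predicted values, summary metrics make the two models easier to compare. The sketch below computes three common regression metrics on made-up numbers; `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and a prediction array, and are not results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices, standing in for y_test and y_pred1.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_hat = np.array([11000.0, 14000.0, 21000.0, 24000.0])

# Same layout as the result DataFrames, plus a per-row absolute error.
result = pd.DataFrame({"y_test": y_true, "y_predicted": y_hat})
result["abs_error"] = (result["y_test"] - result["y_predicted"]).abs()

mae = mean_absolute_error(y_true, y_hat)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # penalizes large errors more
r2 = r2_score(y_true, y_hat)                        # share of variance explained
print(result)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R^2={r2:.3f}")
```

MAE and RMSE are expressed in the same units as the price, which makes them easier to interpret than R² alone.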
RESULT:
Conclusion:
The project successfully explored, preprocessed, and modeled the car features and price dataset.
Both Linear Regression and Decision Tree Regressor models were implemented and evaluated.
Further analysis and optimization could be performed to enhance model performance.