Data Mining Mid Project Report-Sagor
Submitted by
Name ID
Submitted to
DR. AKINUL ISLAM JONY
Associate Professor & Head (UG)
Faculty of Science and technology
Project Description :
E-commerce shipping prediction involves using data and algorithms to estimate the delivery time
or arrival of a package once it has been ordered online. This process typically takes into account
various factors such as the Warehouse Block, Customer Rating, Cost of the Product, historical
shipping data, and potential delays like holidays.
Machine learning models are often employed to analyze past shipping patterns and make
predictions based on the gathered information. A Naive Bayes model can be retrained as more
data becomes available, which may improve its accuracy over time.
ID : ID Number of Customers.
Warehouse block: The company has a large warehouse that is divided into blocks such as
A, B, C, D, E.
Mode of shipment: The company ships products in multiple ways, such as Ship, Flight and
Road.
Customer care calls: The number of calls made to enquire about the shipment.
Customer rating: The rating the company received from each customer. 1 is the lowest (worst), 5 is the
highest (best).
Cost of the product: Cost of the Product in US Dollars.
Prior purchases: The number of prior purchases.
Product importance: The company has categorized the importance of each product as
low, medium or high.
Gender: Male and Female.
Discount offered: Discount offered on that specific product.
Weight in gms: The weight of the product in grams.
Reached on time: It is the target variable, where 1 indicates that the product has NOT reached
on time and 0 indicates it has reached on time.
Read the Data:
The pandas read_csv function reads the CSV file located at "C:/Users/user/Data Mining/Train (2).csv"
and creates a data frame named data, taking the column headers and values from the file. This data
frame is then used for all further analysis and processing.
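As a minimal sketch of this step, the snippet below loads a CSV into a data frame with pandas. Because the original file is not available here, it reads a hypothetical two-row sample from memory; the target column name Reached_on_time is an assumption made for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# "Reached_on_time" is an assumed column name, not taken from the dataset.
sample_csv = io.StringIO(
    "ID,Warehouse_block,Mode_of_Shipment,Gender,Reached_on_time\n"
    "1,D,Flight,F,1\n"
    "2,A,Ship,M,0\n"
)

# For the real file, pass its path instead of sample_csv.
data = pd.read_csv(sample_csv)

print(data.shape)             # (rows, columns)
print(data.columns.tolist())  # column headers taken from the first line
```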
Find Out Missing Value:
There were no missing values among the 12 variables in our dataset, so we do not need to replace or
discard any of the instances.
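One common way to run such a check with pandas is to count null entries per column. The sketch below uses a small hypothetical data frame with one deliberately missing rating, so the counts differ from the real dataset, where every count was zero.

```python
import pandas as pd

# Hypothetical toy frame with one missing rating, for illustration only.
data = pd.DataFrame({
    "Customer_rating": [2, 5, None],
    "Gender": ["F", "M", "M"],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = data.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```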
Histogram :
A histogram is a visual summary of a data distribution. It is a kind of bar graph that shows the
frequency of values in a data set. A histogram groups the data into intervals, or "bins," and then
uses bars to represent how many data points fall into each bin. The height of each bar corresponds to
the frequency of data points in that bin. Here, we can see that around 4500 orders did not reach on time and
around 6500 orders reached on time.
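The binning described above can be sketched without any plotting library. The helper name histogram_counts and the sample values are hypothetical; the function uses equal-width bins, which is one common choice and not necessarily what the plotting library used in this project does by default.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they all land in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical target values: 1 = not on time, 0 = on time.
target = [1, 0, 1, 1, 0, 1]
print(histogram_counts(target, 2))
```

For a binary target such as Reached on time, two bins are enough: one bar for 0 and one for 1.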
Labeling String Data to Categorical :
This is a common preprocessing step in machine learning to convert categorical data into a
format that can be fed into models.
data['Warehouse_block'] = data['Warehouse_block'].astype('category').cat.codes
data['Mode_of_Shipment'] = data['Mode_of_Shipment'].astype('category').cat.codes
data['Product_importance'] = data['Product_importance'].astype('category').cat.codes
data['Gender'] = data['Gender'].astype('category').cat.codes
Each line converts one column to a categorical type and then assigns an integer code to each
category. The first line does this for the 'Warehouse_block' column, and the following lines do the
same for the 'Mode_of_Shipment', 'Product_importance' and 'Gender' columns.
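To see which integer each category receives, the mapping can be inspected directly. This is a small sketch on a hypothetical series of warehouse blocks; pandas assigns codes following the sorted order of the category labels.

```python
import pandas as pd

# Hypothetical sample of warehouse blocks, for illustration only.
blocks = pd.Series(["D", "A", "C", "A", "B"])

as_category = blocks.astype("category")
codes = as_category.cat.codes.tolist()

# Code -> label mapping; categories are stored in sorted order.
mapping = dict(enumerate(as_category.cat.categories))

print(codes)
print(mapping)
```

Because the codes follow alphabetical order, a column such as 'Product_importance' is encoded as high = 0, low = 1, medium = 2, which does not reflect the natural low-to-high ranking. This does not matter when the model treats the codes purely as category labels.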
Correlation Heatmap :
A correlation heatmap is a graphical representation of the correlation matrix, which shows the
correlation coefficients between a set of variables. In simpler terms, it helps visualize how
strongly different variables are related to each other.
Each cell in the heatmap represents the correlation coefficient between two variables. The colors
of the cells typically range from one color extreme to another, indicating the strength and
direction of the correlation.
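Each cell of such a heatmap holds a single coefficient. Assuming the Pearson coefficient is used (the usual default), it can be sketched in plain Python as follows; the function name pearson and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: the second list rises with the first, the third falls.
print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

The result always lies between -1 and 1: values near 1 mean the two variables rise together, values near -1 mean one falls as the other rises, and values near 0 mean little linear relationship.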
Histplot :
A histplot is a type of graphical representation used in data analysis and statistics to visualize the
distribution of a univariate (single variable) dataset. It combines a histogram with an optional
smoothed density curve.
In a histplot, the dataset is divided into bins, and the frequency of data points within each bin is
represented by bars. Additionally, a smooth curve or line is often overlaid on the bars to provide
a sense of the overall shape of the distribution.
Histplots are useful for understanding the central tendency, variability, and skewness of the data.
Calculate_prior :
Prior probability refers to the initial probability assigned to a hypothesis before taking into
consideration new evidence or data. It represents the belief or probability assigned to an event
based on existing knowledge, experience, or information before any new observations are made.
In Bayesian statistics, the prior probability is combined with the likelihood of new evidence to
update the probability of a hypothesis using Bayes' theorem.
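For a Naive Bayes classifier, the prior of each class is usually estimated as the fraction of training instances that belong to it. The following is a minimal sketch under that assumption; the body of calculate_prior is illustrative and not the project's original code, and the labels are hypothetical.

```python
from collections import Counter

def calculate_prior(labels):
    """Estimate P(class) as the relative frequency of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels: 1 = not on time, 0 = on time.
print(calculate_prior([1, 1, 0, 1]))
```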
Calculate_likelihood_categorical :
The likelihood is calculated based on the probability distribution assumed by the model. For
example, if we assume that our data follows a normal distribution with parameters μ and σ, the
likelihood function would involve the probability density function (PDF) of the normal
distribution. For a categorical feature, the likelihood of a value given a class is instead estimated
as the fraction of instances in that class that have that value. The formula for the likelihood
therefore depends on the specific assumptions and model chosen for the data.
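For a categorical feature, the likelihood of a value given a class can be estimated by counting within that class. This is a minimal sketch; the signature of calculate_likelihood_categorical and the sample data are assumptions made for illustration, not the project's original code.

```python
def calculate_likelihood_categorical(feature_values, labels, value, label):
    """Estimate P(feature == value | class == label) by relative frequency."""
    in_class = [f for f, y in zip(feature_values, labels) if y == label]
    if not in_class:
        return 0.0  # the class never occurs in the data
    return in_class.count(value) / len(in_class)

# Hypothetical shipment modes and on-time labels.
modes = ["Ship", "Flight", "Ship", "Road"]
labels = [1, 1, 0, 1]
print(calculate_likelihood_categorical(modes, labels, "Ship", 1))
```

A value that never appears with a class gets a likelihood of zero, which wipes out the whole product for that class. Adding a small constant to every count (Laplace smoothing) is a common remedy.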
Naïve bayes categorical :
Categorical Naïve Bayes is a variant of the Naive Bayes algorithm specifically designed for
categorical data. In the context of machine learning and statistics, the Naïve Bayes algorithm is a
probabilistic classification algorithm based on Bayes' theorem.
For categorical data, which consists of discrete variables, Categorical Naïve Bayes assumes that
each feature follows a categorical distribution within each class. It is called "Naïve" because it
makes a strong assumption of independence between features given the class, meaning that
the value of a particular feature is considered independent of the value of any other feature.
The predicted class is the one with the largest product of the prior and the per-feature likelihoods.
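Putting the prior and the per-feature likelihoods together, a prediction can be sketched as follows. The function naive_bayes_categorical and the toy rows are hypothetical and use no smoothing; this is an illustration of the idea, not the project's original implementation.

```python
from collections import Counter

def naive_bayes_categorical(train_rows, train_labels, new_row):
    """Return the class with the highest prior times product of likelihoods."""
    total = len(train_labels)
    class_counts = Counter(train_labels)
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        rows_in_class = [r for r, y in zip(train_rows, train_labels) if y == label]
        for i, value in enumerate(new_row):
            # Likelihood P(feature i == value | class), by relative frequency.
            matches = sum(1 for r in rows_in_class if r[i] == value)
            score *= matches / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical rows: (mode of shipment, product importance).
rows = [("Ship", "low"), ("Ship", "high"), ("Flight", "low"), ("Road", "low")]
labels = [1, 1, 0, 0]
print(naive_bayes_categorical(rows, labels, ("Ship", "low")))
```

With many features, multiplying small probabilities can underflow, so implementations often sum log-probabilities instead.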
Train and Test :
The train_test_split function, most likely from scikit-learn in Python, is commonly used to
split a dataset into training and testing sets for model evaluation.
Here, 20% of the data is used for testing and the remaining 80% for training. The resulting train and test
sets contain the training and testing data, respectively. X_test contains the features of the testing
set, and Y_test contains the corresponding labels. The model's predictions of the target variable for the
testing set are stored in a separate variable so that they can be compared with Y_test.
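The 80/20 split itself is simple to sketch without a library. The helper split_train_test below is hypothetical and only approximates what a library function such as train_test_split does: it shuffles the rows with a fixed seed and holds out the requested fraction.

```python
import random

def split_train_test(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows and hold out a test_size fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of ten row indices.
train, test = split_train_test(list(range(10)))
print(len(train), len(test))
```

Shuffling before splitting matters when the file is ordered in some way, since otherwise the test set may not resemble the training set.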
Confusion Matrix :
A confusion matrix is a table used in classification to evaluate the performance of a machine
learning model. It's particularly useful when assessing the accuracy of a model on a dataset with
known true outcomes. The matrix summarizes the results of the model's predictions compared to
the actual outcomes.
1. True Positive (TP): Instances where the model correctly predicts the positive class.
2. True Negative (TN): Instances where the model correctly predicts the negative class.
3. False Positive (FP): Instances where the model predicts the positive class incorrectly
(Type I error).
4. False Negative (FN): Instances where the model predicts the negative class incorrectly
(Type II error).
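The four counts above can be tallied directly from the true and predicted labels. In this sketch the function name confusion_counts and the sample labels are hypothetical, and treating class 1 as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP and FN for a binary classification result."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
```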
Accuracy:
The accuracy of a Naive Bayes classifier, or of any classifier, is the number of instances for which
the model predicts the correct class label divided by the total number of instances. We
obtained an accuracy of 0.5668449197860963 (about 56.7%) for the Naive Bayes model.
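In terms of the confusion matrix, accuracy equals (TP + TN) divided by the total number of instances. A minimal sketch of the calculation follows; the function name accuracy and the labels are hypothetical and do not reproduce the reported figure.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)

# Hypothetical true labels and predictions: three of five are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
print(accuracy(y_true, y_pred))
```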