A V SeshaSai (RA2211003010469)
Milan P Joseph (RA2211003010478)
B Jaswanth Patnaik (RA2211003010501)
Bhargava Satya Himank Chinni (RA2211003010510)
SCHOOL OF COMPUTING
FACULTY OF ENGINEERING AND TECHNOLOGY
BONAFIDE CERTIFICATE
Certified that the Case Study Report titled Blinkit Dataset Analysis is the bonafide work of
A V SeshaSai (RA2211003010469), Milan P Joseph (RA2211003010478), B Jaswanth Patnaik
(RA2211003010501), and Bhargava Satya Himank Chinni (RA2211003010510), who carried out
the case study under my supervision. Certified further that, to the best of my knowledge, the work
reported herein does not form part of any other work.
Faculty Signature
Dr. R. Anita
Assistant Professor
Department of Computing Technologies
Date:
ABSTRACT
Accuracy and consistency of data are essential for both customer satisfaction and operational
efficiency in the quickly changing field of online grocery delivery. A thorough data wrangling
and cleaning procedure applied to a Blinkit customer order dataset is presented in this study. The
goal was to convert unstructured, unreliable data into a format that could be analyzed. Several
preprocessing methods were applied, including handling missing values, removing duplicates,
correcting data types, and treating outliers. Exploratory data transformation involved grouping
the data and reshaping it through pivoting and melting to surface revenue trends by restaurant.
Strategic imputation addressed key issues such as null values in critical columns (for example,
Order Status and Amount Paid). Additionally, the integrity of the dataset was
guaranteed by standardizing column names and validating monetary fields. The Interquartile
Range (IQR) method of outlier detection improved the quality of insights obtained by filtering
out anomalous data. Within the Blinkit ecosystem, the cleaned and organized dataset is now
appropriate for real-time decision-making, predictive modeling, and advanced analytics. This
study highlights the significance of robust data preprocessing in enabling meaningful business
intelligence and improving data-driven strategies.
INTRODUCTION
Ensuring high-quality data is essential for both operational success and customer satisfaction
in the fast-paced world of online grocery delivery. A fundamental step in getting datasets ready
for insightful analysis is data wrangling, which is the process of cleaning, transforming, and
organizing raw data. Inconsistencies and missing data can have a detrimental effect on business
intelligence and decision-making because customer orders, delivery performance, and
restaurant interactions occur at scale.
Blinkit, a well-known quick-commerce platform, manages a wide range of order-related data,
including customer information, payment details, delivery status, and restaurant performance
metrics. Like most real-world datasets, however, the raw Blinkit data can be untidy, with
outliers, inconsistent formats, duplicate entries, and missing values that must be fixed before
any trustworthy analysis can be conducted.
This project performs systematic data wrangling and cleaning using a real-world Blinkit order
dataset. Its main objectives are to deal with missing values, standardize fields, fix data types,
and prepare the dataset for analysis. These procedures are necessary to guarantee accuracy in
downstream analytics, including customer behaviour modelling, delivery optimization, and
revenue tracking.
The specific objectives of this project are:
• To identify and eliminate outliers and duplicate records that might distort analysis.
• To get a clean, trustworthy dataset ready for upcoming data science projects like trend
analysis, visualization, and predictive modelling.
EXPLORATORY DATA ANALYSIS (EDA)
Introduction to EDA
One crucial stage in the data analysis process is exploratory data analysis, or EDA. To gain a
deeper understanding of the dataset's structure and underlying relationships, it entails analyzing
it using summary statistics, pattern recognition, and visualizations. EDA assists in identifying
hidden trends, missing values, anomalies, and data imbalances that might otherwise go
overlooked. It also has a significant impact on how data preprocessing procedures are shaped
and how model-building tactics are informed.
To better understand customer behavior, order trends, and delivery performance, EDA was
applied to the Blinkit order dataset for this project. Numerous attributes, including order ID,
restaurant ID, customer information, payment amount, delivery staff, and order status, are
included in the dataset. Our goal is to derive valuable insights from the analysis of these features
in order to improve delivery logistics, boost customer satisfaction, and inform operational
strategies.
1. Data Overview
Exploratory Data Analysis (EDA) began with a general overview of the dataset. The first few
rows were displayed using df.head() to get a sense of the structure. The info() method was used
to inspect data types and check for null values, while describe() provided summary statistics
for numerical features.
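The overview step can be reproduced with a few Pandas calls. The following is a minimal sketch, assuming the order data has already been loaded into a DataFrame named df:

print(df.head())      # first few rows, to get a sense of the structure
df.info()             # column data types and non-null counts (prints a summary)
print(df.describe())  # summary statistics for numerical features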
• Understanding Feature Distribution: We analyzed the distribution of key variables
such as customer gender, order frequency, age, purchase amount, and delivery time. For
numerical features like cart value, delivery charges, and item quantity, we used
histograms and density plots to understand their spread, central tendency, and
purchasing behavior patterns.
• Identifying Relationships with User Retention: One of the main goals was to identify which
customer attributes were most strongly associated with repeat orders or drop-offs. This was
achieved through grouped bar charts, cross-tabulations, and statistical summaries. For example,
we compared retention rates across different delivery time slots, payment methods, and order
frequency to understand their influence on continued usage, as sketched below.
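A brief sketch of how such views can be produced with Seaborn and Matplotlib is shown below; the column names (Amount Paid, Order Status, OrderID) come from the dataset described above, while the specific plots are illustrative choices rather than the exact ones used.

import seaborn as sns
import matplotlib.pyplot as plt

# Spread and central tendency of a key numerical feature
sns.histplot(df['Amount Paid'], kde=True)
plt.title('Distribution of Amount Paid')
plt.show()

# Order counts per status: a simple grouped view of order outcomes
df.groupby('Order Status')['OrderID'].count().plot(kind='bar', title='Orders per Status')
plt.show()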
DATA WRANGLING METHODS
An essential first step in evaluating real-world datasets, such as the Blinkit Customer Dataset,
is data wrangling, also known as data preprocessing. It entails transforming raw data into an
analysis-ready format by cleaning, organizing, and enriching it. Because raw data frequently
contains inconsistencies such as missing values, duplicate entries, and formatting problems,
this procedure ensures the dataset is accurate, dependable, and ready to yield meaningful
insights.
Missing values can arise due to various reasons, such as incomplete data collection or system
errors. Identifying and properly handling missing data is crucial for maintaining data integrity. In
this project, missing values were detected using Pandas' isnull() function. Based on the type of
missing values, different imputation techniques were applied. Numerical fields such as 'Amount
Paid' were filled with 0.0, whereas categorical fields like 'Order Status' and 'Customer Name' were
assigned default placeholders such as 'Unknown' and 'Anonymous'. This approach ensures that the
dataset remains complete and reliable for further analysis.
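A minimal sketch of this detection-and-imputation step, using the column names and placeholder values described above:

# Count missing values in each column
print(df.isnull().sum())

# Fill numerical and categorical gaps with the placeholders noted above
df['Amount Paid'] = df['Amount Paid'].fillna(0.0)
df['Order Status'] = df['Order Status'].fillna('Unknown')
df['Customer Name'] = df['Customer Name'].fillna('Anonymous')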
Handling Duplicate Values
Duplicate rows were checked using df.duplicated().sum(). The result showed that the dataset
contained no duplicate records, ensuring data integrity for the analysis.
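The check itself is a single Pandas call:

print(df.duplicated().sum())   # 0 here confirms there are no duplicate rows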
The initial few rows of the dataset were inspected using df.head(). From this, we observed key
order-related features such as OrderID, RestaurantID, Amount Paid, Order
Status, Customer Name, and Delivery Guy Name. Columns like Delivery Guy
Name and Customer Name contain missing values and placeholders such as "Unassigned" and
"Anonymous," which were used to handle missing data. The OrderID column is critical for
uniquely identifying orders and was retained during analysis. Columns deemed irrelevant for
predictive modeling or analysis, such as those with excessive missing values or constant values,
were dropped from the dataset to streamline further processing.
Transforming the Dataset
Data transformation is required to convert raw data into a format suitable for analysis. This
includes normalization, encoding, and reshaping data. Normalization ensures that numerical
values are on the same scale, preventing bias in calculations. Categorical encoding is used to
convert text-based categories into numerical representations. Additionally, reshaping data allows
for better visualization and usability in further computations. Pivoting and melting techniques in
Pandas were utilized to restructure the dataset efficiently.
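As an illustration of the reshaping step, the sketch below pivots order amounts by restaurant and order status and then melts the result back into long form; the sum aggregation is an assumption, since the report does not show the exact call used.

# Wide view: total amount paid per restaurant and order status
pivot_df = df.pivot_table(index='RestaurantID', columns='Order Status',
                          values='Amount Paid', aggfunc='sum', fill_value=0)

# Long view: melt the pivoted table back into tidy rows for grouping and plotting
long_df = pivot_df.reset_index().melt(id_vars='RestaurantID',
                                      var_name='Order Status',
                                      value_name='Total Amount Paid')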
Outcome
After applying data wrangling techniques, the dataset became more structured and
meaningful. Missing values were successfully handled, transformations were applied to ensure
consistency, and the data was reshaped to facilitate efficient analysis. These improvements will
enhance data-driven decision-making and reporting accuracy.
DATA CLEANING AND PREPARATION
1. Introduction
Data cleaning is a vital step in ensuring that datasets are free from inconsistencies,
errors, and outliers. A clean dataset leads to more accurate analyses and prevents
misinterpretation of data. The Blinkit dataset required several cleaning procedures, including
removing duplicate entries, correcting data types, and handling outliers. By applying these
steps, we ensure that the dataset is reliable and suitable for use in predictive modeling and
statistical analysis.
Duplicate entries often occur due to multiple submissions, errors in data entry, or
system glitches. Keeping duplicate records can lead to biased results, making analysis
unreliable. To address this, we identified and removed duplicate entries using the
drop_duplicates() function in Pandas, ensuring that each record in the dataset is unique.
df = df.drop_duplicates()
Incorrect data types can lead to computational errors and misinterpretations. In this
dataset, some numerical values were mistakenly stored as strings, affecting mathematical
operations. To fix this, the 'Amount Paid' column was converted to a numerical format using
Pandas’ to_numeric() function. This correction ensures accurate calculations and analysis.
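A one-line sketch of the conversion; using errors='coerce' (an assumption here) turns any non-numeric strings into NaN so that they can be imputed with 0.0 as described earlier:

df['Amount Paid'] = pd.to_numeric(df['Amount Paid'], errors='coerce').fillna(0.0)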
Extreme outliers in the 'Amount Paid' column were then filtered out using the Interquartile Range (IQR) rule, keeping only values within 1.5 × IQR of the first and third quartiles:
Q1 = df['Amount Paid'].quantile(0.25)
Q3 = df['Amount Paid'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Amount Paid'] >= Q1 - 1.5 * IQR) & (df['Amount Paid'] <= Q3 + 1.5 * IQR)]
3. Outcome
After implementing data cleaning techniques, the dataset is now free from duplicate
records, incorrect data types, and extreme outliers. These improvements ensure that
data-driven insights are accurate and meaningful.
Model Building
1. Introduction
Data cleaning and model building are critical steps in preparing and analyzing datasets to
derive meaningful insights. The Blinkit dataset, containing food delivery order details,
required extensive cleaning to address inconsistencies, missing values, and outliers. A
clean dataset ensures reliable predictive modeling and accurate statistical analysis. The
data cleaning process involved removing duplicate entries, correcting data types,
handling missing values, and managing outliers. Following cleaning, a predictive model
was developed to classify order statuses, enabling operational insights for Blinkit. These
steps ensure the dataset is robust and suitable for data-driven decision-making.
Before building the model, we prepared the dataset to make it suitable for predictions. We
used the Amount Paid column as the main piece of information to predict Order Status, as
it shows how much customers spent on their orders. The Order Status was organized into
categories like Delivered, Pending, In Progress, Cancelled, and Unknown. We split the
data into two parts: one part to train the model and another to test how well it works. This
setup helps the model learn from the data and check its accuracy.
We built a model that learns from the Amount Paid to predict the Order Status. The model
looks for patterns, like whether higher payments are linked to delivered orders. It was
trained using the training part of the data, teaching it to recognize connections between
payment amounts and order outcomes. After training, we tested the model on the other part
of the data to see if it could correctly predict the status of new orders.
To understand how well the model works, we compared its predictions to the actual Order
Status in the test data. The model did a good job predicting common statuses like Delivered
and In Progress, but it was less accurate for statuses like Cancelled, which were less
common. This check helped us see the model’s strengths and areas where it could improve.
5. Outcome
The model successfully predicted Order Status for many orders, especially for Delivered
orders, showing that Amount Paid is useful for understanding order outcomes. The results
can help Blinkit identify why some orders are cancelled or delayed, improving their delivery
process. With a clean dataset and a working model, Blinkit can use these insights to make
better decisions and enhance customer satisfaction.
Results and Discussion
1. Introduction
Model building helps analyze data to make predictions and find useful patterns. For the
Blinkit dataset, which contains food delivery order details, we built a model to predict the
status of orders, such as Delivered, Pending, or Cancelled. This process helps Blinkit
understand what affects their orders and improve their service. The model used a cleaned
dataset and focused on the Amount Paid to predict Order Status, providing insights into
delivery patterns.
The model was tested to see how well it predicted Order Status. It correctly identified many
orders, especially those marked as Delivered, which were the most common in the dataset.
For example, orders with higher Amount Paid were often predicted as Delivered, showing a
clear pattern. However, the model was less accurate for Cancelled or Unknown statuses, as
these were less frequent. Overall, the model worked well for common order types but
struggled with rarer ones.
3. Analysis of Findings
The results show that Amount Paid is a helpful factor for predicting Order Status, especially
for Delivered orders. This suggests that customers who spend more are more likely to have
their orders completed successfully. The lower accuracy for Cancelled orders indicates that
other factors, like delivery issues or customer preferences, might also matter. The model’s
performance highlights that common order statuses are easier to predict than less common
ones, which could be due to having more data for Delivered orders.
These findings can help Blinkit improve their delivery process by focusing on factors that
lead to successful deliveries, like ensuring high-value orders are prioritized. To make the
model better, we could include more information, such as Restaurant ID or Delivery Guy
Name, to improve predictions for Cancelled orders. Collecting more data on less common
statuses could also help. This model provides a starting point for Blinkit to make smarter
decisions and enhance customer satisfaction.
Code
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt
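# NOTE: the report does not show the loading step; the file name below is a
# placeholder assumption standing in for the actual Blinkit order export.
df = pd.read_csv('blinkit_orders.csv')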
# Impute missing Customer Name with 'Anonymous' and Delivery Guy Name with 'Unassigned'
df['Customer Name'] = df['Customer Name'].fillna('Anonymous')
df['Delivery Guy Name'] = df['Delivery Guy Name'].fillna('Unassigned')
# Impute missing Order Status with 'Unknown'
df['Order Status'] = df['Order Status'].fillna('Unknown')
# Restaurant Popularity
df['Restaurant Popularity'] = df.groupby('RestaurantID')['OrderID'].transform('count')
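# NOTE: the following block is a hedged sketch reconstructing the model-building steps
# described in the report (label encoding, SMOTE balancing, train/test split, Random
# Forest training); the feature choice and parameters are assumptions, not the original code.

# Encode the categorical target (Order Status) into numeric labels
le = LabelEncoder()
y = le.fit_transform(df['Order Status'])

# Amount Paid as the main predictor, plus the derived Restaurant Popularity feature
X = df[['Amount Paid', 'Restaurant Popularity']]

# Balance rare statuses (e.g. Cancelled) by oversampling with SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

# Train a Random Forest classifier and predict on the held-out test set
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, and F1 per order status
print(classification_report(y_test, y_pred, target_names=le.classes_))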
# Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Conclusion
We successfully cleaned and transformed the Blinkit customer order dataset by addressing
key issues such as missing values, outliers, incorrect data types, and inconsistent column
formats. Through reshaping and grouping the data, we made it more suitable for analysis and
modeling. As a result, the dataset is now structured, reliable, and ready for advanced analytics
and machine learning. These preprocessing steps will significantly contribute to making
informed business decisions and building accurate predictive models. The main takeaway is
that good data preprocessing forms the foundation of any successful data science project, as
clean data leads to better insights, more accurate models, and smarter strategies.