
BLINKIT DATASET ANALYSIS

A CASE STUDY REPORT


Submitted by

A V SeshaSai (RA2211003010469)
Milan P Joseph (RA2211003010478)
B Jaswanth Patnaik (RA2211003010501)
Bhargava Satya Himank Chinni (RA2211003010510)

For the course


Data Science - 21CSS303T
In partial fulfillment of the requirements for the degree of
BACHELOR OF TECHNOLOGY

DEPARTMENT OF COMPUTING TECHNOLOGIES

SCHOOL OF COMPUTING

FACULTY OF ENGINEERING AND TECHNOLOGY

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

KATTANKULATHUR - 603 203.


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that A Case Study Report titled Blinkit Dataset Analysis is the bonafide work of
A V SeshaSai (RA2211003010469), Milan P Joseph (RA2211003010478), B Jaswanth Patnaik
(RA2211003010501), and Bhargava Satya Himank Chinni (RA2211003010510), who carried out
the case study under my supervision. Certified further that, to the best of my knowledge, the work
reported herein does not form part of any other work.

Faculty Signature

Dr. R. Anita
Assistant Professor
Department of Computing Technologies

Date:

ABSTRACT

Accuracy and consistency of data are essential for both customer satisfaction and operational
efficiency in the quickly changing field of online grocery delivery. A thorough data wrangling
and cleaning procedure applied to a Blinkit customer order dataset is presented in this study. The
goal was to convert unstructured, unreliable data into an analysis-ready format. Several preprocessing
methods were applied, including handling missing values, removing duplicates, correcting data types,
and removing outliers. Exploratory data transformation involved grouping records and reshaping the
dataset through pivoting and melting to reveal revenue trends by restaurant. Strategic imputation
addressed null values in crucial columns such as Order Status and Amount Paid, while standardized
column names and validated monetary fields preserved the integrity of the dataset. Outlier detection
with the Interquartile Range (IQR) method filtered out anomalous records and improved the quality of
the resulting insights. The cleaned and organized dataset is now suitable for real-time
decision-making, predictive modeling, and advanced analytics within the Blinkit ecosystem. This study
highlights the importance of robust data preprocessing in enabling meaningful business intelligence
and data-driven strategies.
INTRODUCTION

Ensuring high-quality data is essential for both operational success and customer satisfaction
in the fast-paced world of online grocery delivery. A fundamental step in getting datasets ready
for insightful analysis is data wrangling, which is the process of cleaning, transforming, and
organizing raw data. Inconsistencies and missing data can have a detrimental effect on business
intelligence and decision-making because customer orders, delivery performance, and
restaurant interactions occur at scale.

Customer information, payment details, delivery status, and restaurant performance metrics are
just a few of the many kinds of order-related data that Blinkit, a well-known quick-commerce platform,
manages. Like most real-world datasets, however, the raw Blinkit data can be untidy, with
outliers, inconsistent formats, duplicate entries, and missing values that must be fixed before
any trustworthy analysis can be conducted.

This project performs systematic data wrangling and cleaning using a real-world Blinkit order
dataset. Its main objectives are to deal with missing values, standardize fields, fix data types,
and prepare the dataset for analysis. These procedures are necessary to guarantee accuracy in
downstream analytics, including customer behaviour modelling, delivery optimization, and
revenue tracking.

This study's primary objectives are:

• To find and address inconsistent or missing data entries.

• To improve the dataset's usability by standardizing and restructuring it.

• To identify and eliminate outliers and duplicate records that might distort analysis.

• To get a clean, trustworthy dataset ready for upcoming data science projects like trend
analysis, visualization, and predictive modelling.
EXPLORATORY DATA ANALYSIS (EDA)

Introduction to EDA

One crucial stage in the data analysis process is exploratory data analysis, or EDA. To gain a
deeper understanding of the dataset's structure and underlying relationships, it entails analyzing
it using summary statistics, pattern recognition, and visualizations. EDA assists in identifying
hidden trends, missing values, anomalies, and data imbalances that might otherwise go
overlooked. It also has a significant impact on how data preprocessing procedures are shaped
and how model-building tactics are informed.

To better understand customer behavior, order trends, and delivery performance, EDA was
applied to the Blinkit order dataset for this project. Numerous attributes, including order ID,
restaurant ID, customer information, payment amount, delivery staff, and order status, are
included in the dataset. Our goal is to derive valuable insights from the analysis of these features
in order to improve delivery logistics, boost customer satisfaction, and inform operational
strategies.

The EDA focused on the following key aspects:

1. Data Overview
Exploratory Data Analysis (EDA) began with a general overview of the dataset. The first few
rows were displayed using df.head() to get a sense of the structure. The info() method was used
to inspect data types and check for null values, while describe() provided summary statistics
for numerical features; a short code sketch of these steps is given after this list.
• Understanding Feature Distribution: We analyzed the distribution of key variables
such as customer gender, order frequency, age, purchase amount, and delivery time. For
numerical features like cart value, delivery charges, and item quantity, we used
histograms and density plots to understand their spread, central tendency, and
purchasing behavior patterns.
• Identifying Relationships with User Retention: One of the main goals was to identify which
customer attributes were most strongly associated with repeat orders or drop-offs. This was
achieved through grouped bar charts, cross-tabulations, and statistical summaries. For example,
we compared retention rates across different delivery time slots, payment methods, and order
frequency to understand their influence on continued usage.
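
Below is a minimal, illustrative sketch of the overview and distribution steps described above. It assumes the dataset is loaded into a DataFrame df with the column names used later in the Code section (such as 'Amount Paid'); the exact file name is taken from that section.

# Overview of structure, data types, and summary statistics
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel('Copy of Blinkit_Customer_Dataset_(2)(1).xlsx')
print(df.head())       # first rows to inspect the structure
df.info()              # data types and null counts
print(df.describe())   # summary statistics for numerical features

# Distribution of a numerical feature
sns.histplot(df['Amount Paid'], bins=30, kde=True)
plt.title('Distribution of Amount Paid')
plt.show()
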
DATA WRANGLING METHODS

An essential first step in evaluating real-world datasets, such as the Blinkit Customer Dataset,
is data wrangling, also known as data preprocessing. It entails transforming raw data into an
analysis-ready format by cleaning, organizing, and enriching it. This procedure guarantees the
dataset is accurate, dependable, and prepared for significant insights because raw data
frequently contains inconsistencies like missing values, duplicate entries, and formatting
problems.

Data Wrangling Methods Applied

Handling Missing Values

Missing values can arise due to various reasons, such as incomplete data collection or system
errors. Identifying and properly handling missing data is crucial for maintaining data integrity. In
this project, missing values were detected using Pandas' isnull() function. Based on the type of
missing values, different imputation techniques were applied. Numerical fields such as 'Amount
Paid' were filled with 0.0, whereas categorical fields like 'Order Status' and 'Customer Name' were
assigned default placeholders such as 'Unknown' and 'Anonymous'. This approach ensures that the
dataset remains complete and reliable for further analysis.
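
The following sketch illustrates the imputation strategy described above; df is assumed to be the loaded Blinkit DataFrame, with column names as used in the Code section.

# Detect and impute missing values
print(df.isnull().sum())                                       # missing values per column
df['Amount Paid'] = df['Amount Paid'].fillna(0.0)              # numerical field filled with 0.0
df['Order Status'] = df['Order Status'].fillna('Unknown')      # categorical placeholder
df['Customer Name'] = df['Customer Name'].fillna('Anonymous')  # categorical placeholder
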
Handling Duplicate Values

Duplicate rows were checked using data.duplicated().sum(). The result showed that the dataset
contained no duplicate records, ensuring data integrity for the analysis.
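
A minimal check, using df for the loaded DataFrame as in the Code section:

# Count fully duplicated rows; 0 indicates no duplicate records
print(df.duplicated().sum())
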

Inspecting the Dataset

The initial few rows of the dataset were inspected using df.head(). From this, we observed key
order-related features such as OrderID, RestaurantID, Amount Paid, Order
Status, Customer Name, and Delivery Guy Name. Columns like Delivery Guy
Name and Customer Name contain missing values and placeholders such as "Unassigned" and
"Anonymous," which were used to handle missing data. The OrderID column is critical for
uniquely identifying orders and was retained during analysis. Columns deemed irrelevant for
predictive modeling or analysis, such as those with excessive missing values or constant values,
were dropped from the dataset to streamline further processing.
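
As a hedged illustration of the column-dropping step described above, the 50% threshold and the derived column lists below are illustrative rather than taken from the original analysis; df is the loaded Blinkit DataFrame.

# Drop columns with excessive missing values or a single constant value
threshold = 0.5 * len(df)  # illustrative cut-off: more than 50% missing
sparse_cols = [c for c in df.columns if df[c].isnull().sum() > threshold]
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=sparse_cols + constant_cols)
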
Transforming the Dataset

Data transformation is required to convert raw data into a format suitable for analysis. This
includes normalization, encoding, and reshaping data. Normalization ensures that numerical
values are on the same scale, preventing bias in calculations. Categorical encoding is used to
convert text-based categories into numerical representations. Additionally, reshaping data allows
for better visualization and usability in further computations. Pivoting and melting techniques in
Pandas were utilized to restructure the dataset efficiently.

# Pivoting and melting data


df_pivoted = df.pivot(index='OrderID', columns='Order Status', values='Amount Paid')
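
The melting step mentioned above can be sketched as the inverse of the pivot; this is an illustrative companion, assuming df_pivoted was produced by the pivot call shown above.

# Melting the pivoted table back into long (tidy) form
df_melted = df_pivoted.reset_index().melt(id_vars='OrderID', var_name='Order Status', value_name='Amount Paid')
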

Structuring the Dataset

A well-structured dataset enhances analytical efficiency. Grouping data based on key
attributes allows for better trend identification. In this dataset, orders were grouped by
'RestaurantID' to analyze total revenue per restaurant. Additionally, column names were
standardized by stripping unnecessary spaces, ensuring consistency in data handling.

# Grouping data for better analysis


df_grouped = df.groupby('RestaurantID').agg({'Amount Paid': 'sum'})

# Standardizing column names


df.columns = df.columns.str.strip()

Outcome

After applying data wrangling techniques, the dataset became more structured and
meaningful. Missing values were successfully handled, transformations were applied to ensure
consistency, and the data was reshaped to facilitate efficient analysis. These improvements will
enhance data-driven decision-making and reporting accuracy.
DATA CLEANING AND PREPARATION

1. Introduction

Data cleaning is a vital step in ensuring that datasets are free from inconsistencies,
errors, and outliers. A clean dataset leads to more accurate analyses and prevents
misinterpretation of data. The Blinkit dataset required several cleaning procedures, including
removing duplicate entries, correcting data types, and handling outliers. By applying these
steps, we ensure that the dataset is reliable and suitable for use in predictive modeling and
statistical analysis.

2. Data Cleaning Process

2.1 Removing Duplicates

Duplicate entries often occur due to multiple submissions, errors in data entry, or
system glitches. Keeping duplicate records can lead to biased results, making analysis
unreliable. To address this, we identified and removed duplicate entries using the
drop_duplicates() function in Pandas, ensuring that each record in the dataset is unique.

# Removing duplicate rows

df = df.drop_duplicates()

2.2 Correcting Data Types

Incorrect data types can lead to computational errors and misinterpretations. In this
dataset, some numerical values were mistakenly stored as strings, affecting mathematical
operations. To fix this, the 'Amount Paid' column was converted to a numerical format using
Pandas’ to_numeric() function. This correction ensures accurate calculations and analysis.

# Convert numerical columns

df['Amount Paid'] = pd.to_numeric(df['Amount Paid'], errors='coerce').fillna(0)


2.3 Handling Outliers & Validation

Outliers can significantly skew statistical analysis and affect decision-making. To identify and
remove outliers, we used the Interquartile Range (IQR) method. This approach helps in filtering
extreme values that might result from errors or anomalies in data collection. Additionally,
logical validation checks were performed to ensure that monetary values did not contain
negative amounts.

# Detecting and removing outliers using IQR

Q1 = df['Amount Paid'].quantile(0.25)

Q3 = df['Amount Paid'].quantile(0.75)

IQR = Q3 - Q1

# Removing extreme outliers

df = df[(df['Amount Paid'] >= Q1 - 1.5 * IQR) & (df['Amount Paid'] <= Q3 + 1.5 * IQR)]

# Ensure no negative payments

df = df[df['Amount Paid'] >= 0]

3. Outcome

After implementing data cleaning techniques, the dataset is now free from duplicate
records, incorrect data types, and extreme outliers. These improvements ensure that
data-driven insights are accurate and meaningful.
MODEL BUILDING

1. Introduction

Data cleaning and model building are critical steps in preparing and analyzing datasets to
derive meaningful insights. The Blinkit dataset, containing food delivery order details,
required extensive cleaning to address inconsistencies, missing values, and outliers. A
clean dataset ensures reliable predictive modeling and accurate statistical analysis. The
data cleaning process involved removing duplicate entries, correcting data types,
handling missing values, and managing outliers. Following cleaning, a predictive model
was developed to classify order statuses, enabling operational insights for Blinkit. These
steps ensure the dataset is robust and suitable for data-driven decision-making.

2. Preparing the Data

Before building the model, we prepared the dataset to make it suitable for predictions. We
used the Amount Paid column as the main piece of information to predict Order Status, as
it shows how much customers spent on their orders. The Order Status was organized into
categories like Delivered, Pending, In Progress, Cancelled, and Unknown. We split the
data into two parts: one part to train the model and another to test how well it works. This
setup helps the model learn from the data and check its accuracy.
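
A minimal sketch of this preparation step, mirroring the train/test split used in the Code section (column names as in the dataset):

# Feature, target, and train/test split
from sklearn.model_selection import train_test_split

X = df[['Amount Paid']]   # predictor described in this section
y = df['Order Status']    # target categories (Delivered, Pending, In Progress, Cancelled, Unknown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
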

3. Creating the Prediction Model

We built a model that learns from the Amount Paid to predict the Order Status. The model
looks for patterns, like whether higher payments are linked to delivered orders. It was
trained using the training part of the data, teaching it to recognize connections between
payment amounts and order outcomes. After training, we tested the model on the other part
of the data to see if it could correctly predict the status of new orders.
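
As a sketch of the training step, assuming the RandomForestClassifier used in the Code section and the split produced above:

# Train the classifier and predict statuses for unseen orders
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)     # learn patterns linking payment amounts to order outcomes
y_pred = model.predict(X_test)  # predictions on the held-out test data
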

4. Checking Model Performance

To understand how well the model works, we compared its predictions to the actual Order
Status in the test data. The model did a good job predicting common statuses like Delivered
and In Progress, but it was less accurate for statuses like Cancelled, which were less
common. This check helped us see the model’s strengths and areas where it could improve.
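
The comparison described above can be sketched with the evaluation utilities used in the Code section, assuming y_test and y_pred from the previous step:

# Compare predictions against actual statuses
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1
print(confusion_matrix(y_test, y_pred))       # where the model confuses one status for another
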
5. Outcome

The model successfully predicted Order Status for many orders, especially for Delivered
orders, showing that Amount Paid is useful for understanding order outcomes. The results
can help Blinkit identify why some orders are cancelled or delayed, improving their delivery
process. With a clean dataset and a working model, Blinkit can use these insights to make
better decisions and enhance customer satisfaction.
RESULTS AND DISCUSSION
1. Introduction

Model building helps analyze data to make predictions and find useful patterns. For the
Blinkit dataset, which contains food delivery order details, we built a model to predict the
status of orders, such as Delivered, Pending, or Cancelled. This process helps Blinkit
understand what affects their orders and improve their service. The model used a cleaned
dataset and focused on the Amount Paid to predict Order Status, providing insights into
delivery patterns.

2. Model Performance Outcomes

The model was tested to see how well it predicted Order Status. It correctly identified many
orders, especially those marked as Delivered, which were the most common in the dataset.
For example, orders with higher Amount Paid were often predicted as Delivered, showing a
clear pattern. However, the model was less accurate for Cancelled or Unknown statuses, as
these were less frequent. Overall, the model worked well for common order types but
struggled with rarer ones.

3. Analysis of Findings

The results show that Amount Paid is a helpful factor for predicting Order Status, especially
for Delivered orders. This suggests that customers who spend more are more likely to have
their orders completed successfully. The lower accuracy for Cancelled orders indicates that
other factors, like delivery issues or customer preferences, might also matter. The model’s
performance highlights that common order statuses are easier to predict than less common
ones, which could be due to having more data for Delivered orders.

4. Implications and Future Steps

These findings can help Blinkit improve their delivery process by focusing on factors that
lead to successful deliveries, like ensuring high-value orders are prioritized. To make the
model better, we could include more information, such as Restaurant ID or Delivery Guy
Name, to improve predictions for Cancelled orders. Collecting more data on less common
statuses could also help. This model provides a starting point for Blinkit to make smarter
decisions and enhance customer satisfaction.
CODE
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Load the dataset


# If the file is uploaded directly to Colab
from google.colab import files
uploaded = files.upload() # Upload the .xlsx file when prompted

# Read the Excel file


df = pd.read_excel('Copy of Blinkit_Customer_Dataset_(2)(1).xlsx')

# Alternatively, if using Google Drive:


# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_excel('/content/drive/My Drive/path_to_file/Copy of Blinkit_Customer_Dataset_(2)(1).xlsx')

# Display the first few rows and basic info


print("Initial Dataset Info:")
print(df.info())
print("\nFirst 5 Rows:")
print(df.head())

# Step 2: Data Wrangling and Cleaning

# 2.1 Handle Missing Values


print("\nMissing Values:")
print(df.isnull().sum())

# Impute missing OrderID with unique identifiers


df['OrderID'] = df['OrderID'].fillna(pd.Series('ORD' +
pd.Series(range(10000, 10000 + len(df))).astype(str)))

# Impute missing RestaurantID with 'REST_UNKNOWN'


df['RestaurantID'] = df['RestaurantID'].fillna('REST_UNKNOWN')

# Impute missing Amount Paid with median of non-missing values


df['Amount Paid'] = df['Amount Paid'].fillna(df['Amount Paid'].median())

# Impute missing Customer Name and Delivery Guy Name with 'Unknown'
df['Customer Name'] = df['Customer Name'].fillna('Unknown')
df['Delivery Guy Name'] = df['Delivery Guy Name'].fillna('Unknown')
# Impute missing Order Status with 'Unknown'
df['Order Status'] = df['Order Status'].fillna('Unknown')

# 2.2 Handle Duplicates


print("\nNumber of Duplicate Rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Check for duplicate OrderIDs


duplicate_order_ids = df[df['OrderID'].duplicated(keep=False)]
if not duplicate_order_ids.empty:
    print("\nDuplicate OrderIDs Found:", duplicate_order_ids)
    # Keep the first occurrence of duplicate OrderIDs
    df = df.drop_duplicates(subset=['OrderID'], keep='first')

# 2.3 Data Type Correction


df['OrderID'] = df['OrderID'].astype(str)
df['RestaurantID'] = df['RestaurantID'].astype(str)
df['Amount Paid'] = df['Amount Paid'].astype(float)
df['Order Status'] = df['Order Status'].astype('category')
df['Customer Name'] = df['Customer Name'].str.strip().str.title()
df['Delivery Guy Name'] = df['Delivery Guy Name'].str.strip().str.title()

# 2.4 Handle Inconsistent Data


# Standardize Order Status (remove leading/trailing spaces, ensure consistent case)
df['Order Status'] = df['Order Status'].str.strip().str.title()

# 2.5 Feature Engineering


# Create Order Value Category
bins = [0, 200, 500, float('inf')]
labels = ['Low', 'Medium', 'High']
df['Order Value Category'] = pd.cut(df['Amount Paid'], bins=bins,
labels=labels, include_lowest=True)

# Customer Order Frequency


df['Customer Order Frequency'] = df.groupby('Customer Name')['OrderID'].transform('count')

# Restaurant Popularity
df['Restaurant Popularity'] = df.groupby('RestaurantID')['OrderID'].transform('count')

# Delivery Guy Workload


df['Delivery Guy Workload'] = df.groupby('Delivery Guy Name')['OrderID'].transform('count')

# Binary Order Status (Completed vs. Not Completed)


df['Order Completed'] = df['Order Status'].apply(lambda x: 1 if x ==
'Delivered' else 0)

# Display cleaned dataset info


print("\nCleaned Dataset Info:")
print(df.info())
print("\nFirst 5 Rows of Cleaned Data:")
print(df.head())

# Step 3: Exploratory Data Analysis (EDA)


# Distribution of Order Status
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Order Status')
plt.title('Distribution of Order Status')
plt.xticks(rotation=45)
plt.show()

# Distribution of Amount Paid


plt.figure(figsize=(8, 6))
sns.histplot(df['Amount Paid'], bins=30, kde=True)
plt.title('Distribution of Amount Paid')
plt.show()

# Step 4: Model Building (Predicting Order Status)


# 4.1 Prepare Features and Target
# Select features (exclude OrderID, Customer Name, Delivery Guy Name for simplicity)
features = ['Amount Paid', 'Restaurant Popularity', 'Customer Order Frequency', 'Delivery Guy Workload']
X = df[features]
y = df['Order Status']

# Encode categorical target (Order Status)


le = LabelEncoder()
y_encoded = le.fit_transform(y)

# 4.2 Handle Imbalanced Data with SMOTE


smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y_encoded)

# 4.3 Train-Test Split


X_train, X_test, y_train, y_test = train_test_split(X_resampled,
y_resampled, test_size=0.2, random_state=42)

# 4.4 Train Random Forest Classifier


rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# 4.5 Evaluate Model


y_pred = rf_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

# Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# 4.6 Feature Importance


feature_importance = pd.DataFrame({'Feature': features, 'Importance':
rf_model.feature_importances_})
print("\nFeature Importance:")
print(feature_importance.sort_values(by='Importance', ascending=False))

# Step 5: Save the Cleaned Dataset (Optional)


df.to_csv('cleaned_blinkit_dataset.csv', index=False)
files.download('cleaned_blinkit_dataset.csv')  # Download the cleaned dataset
CONCLUSION

We successfully cleaned and transformed the Blinkit customer order dataset by addressing
key issues such as missing values, outliers, incorrect data types, and inconsistent column
formats. Through reshaping and grouping the data, we made it more suitable for analysis and
modeling. As a result, the dataset is now structured, reliable, and ready for advanced analytics
and machine learning. These preprocessing steps will significantly contribute to making
informed business decisions and building accurate predictive models. The main takeaway is
that good data preprocessing forms the foundation of any successful data science project, as
clean data leads to better insights, more accurate models, and smarter strategies.
