
BLINKIT DATASET ANALYSIS

A CASE STUDY REPORT


Submitted by

A V SeshaSai (RA2211003010469)
Milan P Joseph (RA2211003010478)
B Jaswanth Patnaik (RA2211003010501)
Bhargava Satya Himank Chinni (RA2211003010510)

For the course


Data Science - 21CSS303T
In partial fulfillment of the requirements for the degree of
BACHELOR OF TECHNOLOGY

DEPARTMENT OF COMPUTING TECHNOLOGIES

SCHOOL OF COMPUTING

FACULTY OF ENGINEERING AND TECHNOLOGY

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

KATTANKULATHUR - 603 203.


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that A Case Study Report titled Blinkit Dataset Analysis is the bonafide work of
A V SeshaSai (RA2211003010469), Milan P Joseph (RA2211003010478), B Jaswanth Patnaik
(RA2211003010501), and Bhargava Satya Himank Chinni (RA2211003010510), who carried out
the case study under my supervision. Certified further that, to the best of my knowledge, the work
reported herein does not form part of any other work.

Faculty Signature

Dr. R. Anita
Assistant Professor
Department of Computing Technologies

Date:

ABSTRACT

Accuracy and consistency of data are essential for both customer satisfaction and operational
efficiency in the quickly changing field of online grocery delivery. A thorough data wrangling
and cleaning procedure applied to a Blinkit customer order dataset is presented in this study. The
goal was to convert unstructured, unreliable data into an analysis-ready format. Several preprocessing
methods were applied, including handling missing values, removing duplicates, correcting data types,
and removing outliers. Exploratory data transformation involved grouping records and reshaping the
dataset through pivoting and melting to reveal revenue trends by restaurant. Strategic imputation
addressed null values in crucial columns such as Order Status and Amount Paid, while standardized
column names and validated monetary fields preserved the integrity of the dataset. Outlier detection
with the Interquartile Range (IQR) method filtered out anomalous records and improved the quality of
the resulting insights. The cleaned and organized dataset is now suitable for real-time
decision-making, predictive modeling, and advanced analytics within the Blinkit ecosystem. This study
highlights the importance of robust data preprocessing in enabling meaningful business intelligence
and data-driven strategies.
INTRODUCTION

Ensuring high-quality data is essential for both operational success and customer satisfaction
in the fast-paced world of online grocery delivery. A fundamental step in getting datasets ready
for insightful analysis is data wrangling, which is the process of cleaning, transforming, and
organizing raw data. Inconsistencies and missing data can have a detrimental effect on business
intelligence and decision-making because customer orders, delivery performance, and
restaurant interactions occur at scale.

Customer information, payment details, delivery status, and restaurant performance metrics are
just a few of the many kinds of order-related data that Blinkit, a well-known quick-commerce platform,
manages. Like most real-world datasets, however, the raw Blinkit data can be untidy, with
outliers, inconsistent formats, duplicate entries, and missing values that must be fixed before
any trustworthy analysis can be conducted.

This project performs systematic data wrangling and cleaning using a real-world Blinkit order
dataset. Its main objectives are to deal with missing values, standardize fields, fix data types,
and prepare the dataset for analysis. These procedures are necessary to guarantee accuracy in
downstream analytics, including customer behaviour modelling, delivery optimization, and
revenue tracking.

This study's primary objectives are:

• To find and address inconsistent or missing data entries.

• To improve the dataset's usability by standardizing and restructuring it.

• To identify and eliminate outliers and duplicate records that might distort analysis.

• To get a clean, trustworthy dataset ready for upcoming data science projects like trend
analysis, visualization, and predictive modelling.
EXPLORATORY DATA ANALYSIS (EDA)

Introduction to EDA

One crucial stage in the data analysis process is exploratory data analysis, or EDA. To gain a
deeper understanding of the dataset's structure and underlying relationships, it entails analyzing
it using summary statistics, pattern recognition, and visualizations. EDA assists in identifying
hidden trends, missing values, anomalies, and data imbalances that might otherwise go
overlooked. It also has a significant impact on how data preprocessing procedures are shaped
and how model-building tactics are informed.

To better understand customer behavior, order trends, and delivery performance, EDA was
applied to the Blinkit order dataset for this project. Numerous attributes, including order ID,
restaurant ID, customer information, payment amount, delivery staff, and order status, are
included in the dataset. Our goal is to derive valuable insights from the analysis of these features
in order to improve delivery logistics, boost customer satisfaction, and inform operational
strategies.

The EDA focused on the following key aspects:

1. Data Overview
Exploratory Data Analysis (EDA) began with a general overview of the dataset. The first few
rows were displayed using df.head() to get a sense of the structure. The info() method was used
to inspect data types and check for null values, while describe() provided summary statistics
for numerical features; a short code sketch of these steps is given after this list.
• Understanding Feature Distribution: We analyzed the distribution of key variables
such as customer gender, order frequency, age, purchase amount, and delivery time. For
numerical features like cart value, delivery charges, and item quantity, we used
histograms and density plots to understand their spread, central tendency, and
purchasing behavior patterns.
• Identifying Relationships with User Retention: One of the main goals was to identify which
customer attributes were most strongly associated with repeat orders or drop-offs. This was
achieved through grouped bar charts, cross-tabulations, and statistical summaries. For example,
we compared retention rates across different delivery time slots, payment methods, and order
frequency to understand their influence on continued usage.
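
Below is a minimal, illustrative sketch of the overview and distribution steps described above. It assumes the dataset is loaded into a DataFrame df with the column names used later in the Code section (such as 'Amount Paid'); the exact file name is taken from that section.

# Overview of structure, data types, and summary statistics
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel('Copy of Blinkit_Customer_Dataset_(2)(1).xlsx')
print(df.head())       # first rows to inspect the structure
df.info()              # data types and null counts
print(df.describe())   # summary statistics for numerical features

# Distribution of a numerical feature
sns.histplot(df['Amount Paid'], bins=30, kde=True)
plt.title('Distribution of Amount Paid')
plt.show()
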
DATA WRANGLING METHODS

An essential first step in evaluating real-world datasets, such as the Blinkit Customer Dataset,
is data wrangling, also known as data preprocessing. It entails transforming raw data into an
analysis-ready format by cleaning, organizing, and enriching it. This procedure guarantees the
dataset is accurate, dependable, and prepared for significant insights because raw data
frequently contains inconsistencies like missing values, duplicate entries, and formatting
problems.

Data Wrangling Methods Applied

Handling Missing Values

Missing values can arise due to various reasons, such as incomplete data collection or system
errors. Identifying and properly handling missing data is crucial for maintaining data integrity. In
this project, missing values were detected using Pandas' isnull() function. Based on the type of
missing values, different imputation techniques were applied. Numerical fields such as 'Amount
Paid' were filled with 0.0, whereas categorical fields like 'Order Status' and 'Customer Name' were
assigned default placeholders such as 'Unknown' and 'Anonymous'. This approach ensures that the
dataset remains complete and reliable for further analysis.
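
The following sketch illustrates the imputation strategy described above; df is assumed to be the loaded Blinkit DataFrame, with column names as used in the Code section.

# Detect and impute missing values
print(df.isnull().sum())                                       # missing values per column
df['Amount Paid'] = df['Amount Paid'].fillna(0.0)              # numerical field filled with 0.0
df['Order Status'] = df['Order Status'].fillna('Unknown')      # categorical placeholder
df['Customer Name'] = df['Customer Name'].fillna('Anonymous')  # categorical placeholder
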
Handling Duplicate Values

Duplicate rows were checked using data.duplicated().sum(). The result showed that the dataset
contained no duplicate records, ensuring data integrity for the analysis.
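
A minimal check, using df for the loaded DataFrame as in the Code section:

# Count fully duplicated rows; 0 indicates no duplicate records
print(df.duplicated().sum())
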

Inspecting the Dataset

The initial few rows of the dataset were inspected using df.head(). From this, we observed key
order-related features such as OrderID, RestaurantID, Amount Paid, Order
Status, Customer Name, and Delivery Guy Name. Columns like Delivery Guy
Name and Customer Name contain missing values and placeholders such as "Unassigned" and
"Anonymous," which were used to handle missing data. The OrderID column is critical for
uniquely identifying orders and was retained during analysis. Columns deemed irrelevant for
predictive modeling or analysis, such as those with excessive missing values or constant values,
were dropped from the dataset to streamline further processing.
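
As a hedged illustration of the column-dropping step described above, the 50% threshold and the derived column lists below are illustrative rather than taken from the original analysis; df is the loaded Blinkit DataFrame.

# Drop columns with excessive missing values or a single constant value
threshold = 0.5 * len(df)  # illustrative cut-off: more than 50% missing
sparse_cols = [c for c in df.columns if df[c].isnull().sum() > threshold]
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=sparse_cols + constant_cols)
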
Transforming the Dataset

Data transformation is required to convert raw data into a format suitable for analysis. This
includes normalization, encoding, and reshaping data. Normalization ensures that numerical
values are on the same scale, preventing bias in calculations. Categorical encoding is used to
convert text-based categories into numerical representations. Additionally, reshaping data allows
for better visualization and usability in further computations. Pivoting and melting techniques in
Pandas were utilized to restructure the dataset efficiently.

# Pivoting and melting data


df_pivoted = df.pivot(index='OrderID', columns='Order Status', values='Amount Paid')
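
The melting step mentioned above can be sketched as the inverse of the pivot; this is an illustrative companion, assuming df_pivoted was produced by the pivot call shown above.

# Melting the pivoted table back into long (tidy) form
df_melted = df_pivoted.reset_index().melt(id_vars='OrderID', var_name='Order Status', value_name='Amount Paid')
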

Structuring the Dataset

A well-structured dataset enhances analytical efficiency. Grouping data based on key
attributes allows for better trend identification. In this dataset, orders were grouped by
'RestaurantID' to analyze total revenue per restaurant. Additionally, column names were
standardized by stripping unnecessary spaces, ensuring consistency in data handling.

# Grouping data for better analysis


df_grouped = df.groupby('RestaurantID').agg({'Amount Paid': 'sum'})

# Standardizing column names


df.columns = df.columns.str.strip()

Outcome

After applying data wrangling techniques, the dataset became more structured and
meaningful. Missing values were successfully handled, transformations were applied to ensure
consistency, and the data was reshaped to facilitate efficient analysis. These improvements will
enhance data-driven decision-making and reporting accuracy.
DATA CLEANING AND PREPARATION

1. Introduction

Data cleaning is a vital step in ensuring that datasets are free from inconsistencies,
errors, and outliers. A clean dataset leads to more accurate analyses and prevents
misinterpretation of data. The Blinkit dataset required several cleaning procedures, including
removing duplicate entries, correcting data types, and handling outliers. By applying these
steps, we ensure that the dataset is reliable and suitable for use in predictive modeling and
statistical analysis.

2. Data Cleaning Process

2.1 Removing Duplicates

Duplicate entries often occur due to multiple submissions, errors in data entry, or
system glitches. Keeping duplicate records can lead to biased results, making analysis
unreliable. To address this, we identified and removed duplicate entries using the
drop_duplicates() function in Pandas, ensuring that each record in the dataset is unique.

# Removing duplicate rows

df = df.drop_duplicates()

2.2 Correcting Data Types

Incorrect data types can lead to computational errors and misinterpretations. In this
dataset, some numerical values were mistakenly stored as strings, affecting mathematical
operations. To fix this, the 'Amount Paid' column was converted to a numerical format using
Pandas’ to_numeric() function. This correction ensures accurate calculations and analysis.

# Convert numerical columns

df['Amount Paid'] = pd.to_numeric(df['Amount Paid'], errors='coerce').fillna(0)


2.3 Handling Outliers & Validation

Outliers can significantly skew statistical analysis and affect decision-making. To identify and
remove outliers, we used the Interquartile Range (IQR) method. This approach helps in filtering
extreme values that might result from errors or anomalies in data collection. Additionally,
logical validation checks were performed to ensure that monetary values did not contain
negative amounts.

# Detecting and removing outliers using IQR

Q1 = df['Amount Paid'].quantile(0.25)

Q3 = df['Amount Paid'].quantile(0.75)

IQR = Q3 - Q1

# Removing extreme outliers

df = df[(df['Amount Paid'] >= Q1 - 1.5 * IQR) & (df['Amount Paid'] <= Q3 + 1.5 * IQR)]

# Ensure no negative payments

df = df[df['Amount Paid'] >= 0]

3. Outcome

After implementing data cleaning techniques, the dataset is now free from duplicate
records, incorrect data types, and extreme outliers. These improvements ensure that
data-driven insights are accurate and meaningful.
MODEL BUILDING

1. Introduction

Data cleaning and model building are critical steps in preparing and analyzing datasets to
derive meaningful insights. The Blinkit dataset, containing food delivery order details,
required extensive cleaning to address inconsistencies, missing values, and outliers. A
clean dataset ensures reliable predictive modeling and accurate statistical analysis. The
data cleaning process involved removing duplicate entries, correcting data types,
handling missing values, and managing outliers. Following cleaning, a predictive model
was developed to classify order statuses, enabling operational insights for Blinkit. These
steps ensure the dataset is robust and suitable for data-driven decision-making.

2. Preparing the Data

Before building the model, we prepared the dataset to make it suitable for predictions. We
used the Amount Paid column as the main piece of information to predict Order Status, as
it shows how much customers spent on their orders. The Order Status was organized into
categories like Delivered, Pending, In Progress, Cancelled, and Unknown. We split the
data into two parts: one part to train the model and another to test how well it works. This
setup helps the model learn from the data and check its accuracy.
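
A minimal sketch of this preparation step, mirroring the train/test split used in the Code section (column names as in the dataset):

# Feature, target, and train/test split
from sklearn.model_selection import train_test_split

X = df[['Amount Paid']]   # predictor described in this section
y = df['Order Status']    # target categories (Delivered, Pending, In Progress, Cancelled, Unknown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
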

3. Creating the Prediction Model

We built a model that learns from the Amount Paid to predict the Order Status. The model
looks for patterns, like whether higher payments are linked to delivered orders. It was
trained using the training part of the data, teaching it to recognize connections between
payment amounts and order outcomes. After training, we tested the model on the other part
of the data to see if it could correctly predict the status of new orders.
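
As a sketch of the training step, assuming the RandomForestClassifier used in the Code section and the split produced above:

# Train the classifier and predict statuses for unseen orders
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)     # learn patterns linking payment amounts to order outcomes
y_pred = model.predict(X_test)  # predictions on the held-out test data
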

4. Checking Model Performance

To understand how well the model works, we compared its predictions to the actual Order
Status in the test data. The model did a good job predicting common statuses like Delivered
and In Progress, but it was less accurate for statuses like Cancelled, which were less
common. This check helped us see the model’s strengths and areas where it could improve.
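
The comparison described above can be sketched with the evaluation utilities used in the Code section, assuming y_test and y_pred from the previous step:

# Compare predictions against actual statuses
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1
print(confusion_matrix(y_test, y_pred))       # where the model confuses one status for another
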
5. Outcome

The model successfully predicted Order Status for many orders, especially for Delivered
orders, showing that Amount Paid is useful for understanding order outcomes. The results
can help Blinkit identify why some orders are cancelled or delayed, improving their delivery
process. With a clean dataset and a working model, Blinkit can use these insights to make
better decisions and enhance customer satisfaction.
RESULTS AND DISCUSSION
1. Introduction

Model building helps analyze data to make predictions and find useful patterns. For the
Blinkit dataset, which contains food delivery order details, we built a model to predict the
status of orders, such as Delivered, Pending, or Cancelled. This process helps Blinkit
understand what affects their orders and improve their service. The model used a cleaned
dataset and focused on the Amount Paid to predict Order Status, providing insights into
delivery patterns.

2. Model Performance Outcomes

The model was tested to see how well it predicted Order Status. It correctly identified many
orders, especially those marked as Delivered, which were the most common in the dataset.
For example, orders with higher Amount Paid were often predicted as Delivered, showing a
clear pattern. However, the model was less accurate for Cancelled or Unknown statuses, as
these were less frequent. Overall, the model worked well for common order types but
struggled with rarer ones.

3. Analysis of Findings

The results show that Amount Paid is a helpful factor for predicting Order Status, especially
for Delivered orders. This suggests that customers who spend more are more likely to have
their orders completed successfully. The lower accuracy for Cancelled orders indicates that
other factors, like delivery issues or customer preferences, might also matter. The model’s
performance highlights that common order statuses are easier to predict than less common
ones, which could be due to having more data for Delivered orders.

4. Implications and Future Steps

These findings can help Blinkit improve their delivery process by focusing on factors that
lead to successful deliveries, like ensuring high-value orders are prioritized. To make the
model better, we could include more information, such as Restaurant ID or Delivery Guy
Name, to improve predictions for Cancelled orders. Collecting more data on less common
statuses could also help. This model provides a starting point for Blinkit to make smarter
decisions and enhance customer satisfaction.
CODE
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Load the dataset


# If the file is uploaded directly to Colab
from google.colab import files
uploaded = files.upload() # Upload the .xlsx file when prompted

# Read the Excel file


df = pd.read_excel('Copy of Blinkit_Customer_Dataset_(2)(1).xlsx')

# Alternatively, if using Google Drive:


# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_excel('/content/drive/My Drive/path_to_file/Copy of Blinkit_Customer_Dataset_(2)(1).xlsx')

# Display the first few rows and basic info


print("Initial Dataset Info:")
print(df.info())
print("\nFirst 5 Rows:")
print(df.head())

# Step 2: Data Wrangling and Cleaning

# 2.1 Handle Missing Values


print("\nMissing Values:")
print(df.isnull().sum())

# Impute missing OrderID with unique identifiers


df['OrderID'] = df['OrderID'].fillna(pd.Series('ORD' +
pd.Series(range(10000, 10000 + len(df))).astype(str)))

# Impute missing RestaurantID with 'REST_UNKNOWN'


df['RestaurantID'] = df['RestaurantID'].fillna('REST_UNKNOWN')

# Impute missing Amount Paid with median of non-missing values


df['Amount Paid'] = df['Amount Paid'].fillna(df['Amount Paid'].median())

# Impute missing Customer Name and Delivery Guy Name with 'Unknown'
df['Customer Name'] = df['Customer Name'].fillna('Unknown')
df['Delivery Guy Name'] = df['Delivery Guy Name'].fillna('Unknown')
# Impute missing Order Status with 'Unknown'
df['Order Status'] = df['Order Status'].fillna('Unknown')

# 2.2 Handle Duplicates


print("\nNumber of Duplicate Rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Check for duplicate OrderIDs


duplicate_order_ids = df[df['OrderID'].duplicated(keep=False)]
if not duplicate_order_ids.empty:
    print("\nDuplicate OrderIDs Found:", duplicate_order_ids)
    # Keep the first occurrence of duplicate OrderIDs
    df = df.drop_duplicates(subset=['OrderID'], keep='first')

# 2.3 Data Type Correction


df['OrderID'] = df['OrderID'].astype(str)
df['RestaurantID'] = df['RestaurantID'].astype(str)
df['Amount Paid'] = df['Amount Paid'].astype(float)
df['Order Status'] = df['Order Status'].astype('category')
df['Customer Name'] = df['Customer Name'].str.strip().str.title()
df['Delivery Guy Name'] = df['Delivery Guy Name'].str.strip().str.title()

# 2.4 Handle Inconsistent Data


# Standardize Order Status (remove leading/trailing spaces, ensure consistent case)
df['Order Status'] = df['Order Status'].str.strip().str.title()

# 2.5 Feature Engineering


# Create Order Value Category
bins = [0, 200, 500, float('inf')]
labels = ['Low', 'Medium', 'High']
df['Order Value Category'] = pd.cut(df['Amount Paid'], bins=bins,
labels=labels, include_lowest=True)

# Customer Order Frequency


df['Customer Order Frequency'] = df.groupby('Customer Name')['OrderID'].transform('count')

# Restaurant Popularity
df['Restaurant Popularity'] = df.groupby('RestaurantID')['OrderID'].transform('count')

# Delivery Guy Workload


df['Delivery Guy Workload'] = df.groupby('Delivery Guy Name')['OrderID'].transform('count')

# Binary Order Status (Completed vs. Not Completed)


df['Order Completed'] = df['Order Status'].apply(lambda x: 1 if x ==
'Delivered' else 0)

# Display cleaned dataset info


print("\nCleaned Dataset Info:")
print(df.info())
print("\nFirst 5 Rows of Cleaned Data:")
print(df.head())

# Step 3: Exploratory Data Analysis (EDA)


# Distribution of Order Status
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Order Status')
plt.title('Distribution of Order Status')
plt.xticks(rotation=45)
plt.show()

# Distribution of Amount Paid


plt.figure(figsize=(8, 6))
sns.histplot(df['Amount Paid'], bins=30, kde=True)
plt.title('Distribution of Amount Paid')
plt.show()

# Step 4: Model Building (Predicting Order Status)


# 4.1 Prepare Features and Target
# Select features (exclude OrderID, Customer Name, Delivery Guy Name for simplicity)
features = ['Amount Paid', 'Restaurant Popularity', 'Customer Order Frequency', 'Delivery Guy Workload']
X = df[features]
y = df['Order Status']

# Encode categorical target (Order Status)


le = LabelEncoder()
y_encoded = le.fit_transform(y)

# 4.2 Handle Imbalanced Data with SMOTE


smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y_encoded)

# 4.3 Train-Test Split


X_train, X_test, y_train, y_test = train_test_split(X_resampled,
y_resampled, test_size=0.2, random_state=42)

# 4.4 Train Random Forest Classifier


rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# 4.5 Evaluate Model


y_pred = rf_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

# Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# 4.6 Feature Importance


feature_importance = pd.DataFrame({'Feature': features, 'Importance':
rf_model.feature_importances_})
print("\nFeature Importance:")
print(feature_importance.sort_values(by='Importance', ascending=False))

# Step 5: Save the Cleaned Dataset (Optional)


df.to_csv('cleaned_blinkit_dataset.csv', index=False)
files.download('cleaned_blinkit_dataset.csv')  # Download the cleaned dataset
CONCLUSION

We successfully cleaned and transformed the Blinkit customer order dataset by addressing
key issues such as missing values, outliers, incorrect data types, and inconsistent column
formats. Through reshaping and grouping the data, we made it more suitable for analysis and
modeling. As a result, the dataset is now structured, reliable, and ready for advanced analytics
and machine learning. These preprocessing steps will significantly contribute to making
informed business decisions and building accurate predictive models. The main takeaway is
that good data preprocessing forms the foundation of any successful data science project, as
clean data leads to better insights, more accurate models, and smarter strategies.
