TASK 1 Data - Quality - Analysis

Data Analyst KPMG

Uploaded by

hendra kuswantoro

Hello,

This is Meghna Rawat from the KPMG Data Analytics (Virtual Internship) team. Thank you
for providing us with the three datasets from Sprocket Central Pty Ltd. The table below
highlights the summary statistics from the datasets received. Please let us know if
the figures are not aligned with your understanding.

The following are the details of the analysis performed on the datasets:

| Table Name | Records Before Data Cleaning | Records After Data Cleaning | Table Analysis |
|---|---|---|---|
| Transaction Data | 20000 rows & 13 columns (1542 blank cells) | 19445 rows & 14 columns (0 blank cells) | Total profit: $10,930,284 (approx.); ‘Solex’ is the most purchased brand name; the most and least sold product lines are ‘Standard’ and ‘Mountain’ respectively |
| New Customer List | 1000 rows & 18 columns (152 blank cells) | 878 rows & 18 columns (0 blank cells) | Most new customers are from New South Wales, Australia; most customers own cars |
| Customer Demographic | 4000 rows & 13 columns (806 blank cells) | 3413 rows & 13 columns (0 blank cells) | Most customers are ‘mass customers’ in the wealth segment; most customers work in the manufacturing and financial services industries |
| Customer Address | 3999 rows & 6 columns (0 blank cells) | 3999 rows & 6 columns (0 blank cells) | Most customers are from New South Wales (NSW); most customers have a postcode between 2000 and 2190 |

The notable data quality issues encountered, and the methods used to mitigate the
identified data inconsistencies, are listed below. Recommendations are also
provided to prevent these issues from recurring and to improve the accuracy of the
underlying data used to drive business decisions.

● Additional customer_ids appear in the ‘Transactions’ and ‘Customer Address’ tables but
not in the ‘Customer Master (Customer Demographic)’ table
Mitigation: Please ensure that all tables cover the same period. Only customers in the Customer Master list will be
used as a training set for our model.
This indicates that the tables received may not be in sync with one another, which may skew the
analysis results if data records are missing. Please refer to the Excel file ‘data_outliers.xlsx’
for the list of outliers between tables.
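
The cross-table check described above can be sketched as a set difference on customer_id. This is a minimal illustration only: the sample rows, the variable names and the `find_orphan_ids` helper are hypothetical and are not taken from the Sprocket Central files.

```python
def find_orphan_ids(master_ids, other_ids):
    """Return the sorted customer_ids found in other_ids but missing from master_ids."""
    return sorted(set(other_ids) - set(master_ids))


# Hypothetical sample data standing in for the three tables.
customer_master = [1, 2, 3, 4]
transactions = [
    {"transaction_id": 101, "customer_id": 2},
    {"transaction_id": 102, "customer_id": 5},
    {"transaction_id": 103, "customer_id": 3},
]
customer_address = [{"customer_id": 1}, {"customer_id": 6}]

orphans_in_transactions = find_orphan_ids(
    customer_master, (t["customer_id"] for t in transactions)
)
orphans_in_address = find_orphan_ids(
    customer_master, (a["customer_id"] for a in customer_address)
)

# Keep only transactions whose customer exists in the master list.
master_set = set(customer_master)
training_transactions = [t for t in transactions if t["customer_id"] in master_set]

print(orphans_in_transactions)  # [5]
print(orphans_in_address)       # [6]
```

The orphaned identifiers are the kind of records one would expect to see listed in an outlier file, while the filtered list reflects the decision to train only on customers present in the master list.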

● Various columns, such as the brand of a purchase, online order or job title, have
empty values in certain records
Mitigation: If only a small number of rows are empty, filter those records out of the training set entirely.
Otherwise, if it is a core field, impute values based on the distribution in the training dataset.
For key datasets, such as transactions, fewer than 1% of transactions (totaling less than 0.7% of
revenue) have missing fields. These records have been removed from the training dataset.
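
One way to express the drop-or-impute rule above is sketched here. The 1% cut-off (`max_drop_share`), the fixed random seed, the `handle_missing` helper and the sample rows are all assumptions made for illustration; the source only says "a small number of rows".

```python
import random
from collections import Counter


def handle_missing(rows, field, max_drop_share=0.01, seed=0):
    """Drop rows with a missing `field` when they are rare; otherwise impute.

    Imputation draws each replacement at random, weighted by how often each
    observed value occurs, so the field keeps roughly the same distribution.
    """
    def is_missing(row):
        return row.get(field) in (None, "")

    present = [r for r in rows if not is_missing(r)]
    n_missing = len(rows) - len(present)
    if n_missing == 0:
        return list(rows)
    # Rare gaps (or nothing to learn a distribution from): drop the rows.
    if n_missing / len(rows) <= max_drop_share or not present:
        return present

    counts = Counter(r[field] for r in present)
    values, weights = zip(*counts.items())
    rng = random.Random(seed)
    return [
        {**r, field: rng.choices(values, weights=weights, k=1)[0]} if is_missing(r) else r
        for r in rows
    ]


# Hypothetical data: 1 gap in 200 rows is dropped, 1 gap in 4 rows is imputed.
rare_gaps = [{"id": i, "brand": "Solex"} for i in range(199)] + [{"id": 199, "brand": ""}]
common_gaps = [
    {"id": 0, "brand": "Solex"},
    {"id": 1, "brand": "Solex"},
    {"id": 2, "brand": None},
    {"id": 3, "brand": "Brand B"},
]

cleaned_rare = handle_missing(rare_gaps, "brand")
cleaned_common = handle_missing(common_gaps, "brand")
print(len(cleaned_rare), len(cleaned_common))
```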

● Inaccurate data in DOB (e.g. DOB is 1984 in NewCustomerList which is an incorrect
value for DOB)
Mitigation: Please ensure that the data provided is accurate, as inaccurate data can strongly affect the training set
for our model.
I have filtered such inaccurate data out of the dataset so that the subsequent steps are
easier to manage and produce correct results without generating errors or outliers. Also, an
additional table named ‘default’ has been removed as it consisted of trash values.
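
A DOB filter of this kind could look like the sketch below. The source does not state which rule was actually applied, so the rule here is an assumption: a value passes only if it parses as a full `YYYY-MM-DD` date, is not in the future, and implies an age of at most 100 years. The `is_plausible_dob` helper, the reference date and the sample values are hypothetical.

```python
from datetime import date, datetime


def is_plausible_dob(value, today, max_age_years=100, fmt="%Y-%m-%d"):
    """Return True if `value` is a full date implying an age of 0..max_age_years."""
    try:
        dob = datetime.strptime(str(value).strip(), fmt).date()
    except ValueError:
        return False  # not a complete date in the expected format
    if dob > today:
        return False  # born in the future
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age <= max_age_years


# Hypothetical reference date and sample values.
reference_day = date(2024, 1, 1)
samples = ["1990-05-17", "1984", "1850-01-01", "2090-01-01"]
valid_dobs = [s for s in samples if is_plausible_dob(s, reference_day)]
print(valid_dobs)  # ['1990-05-17']
```

Under this particular rule a bare year such as 1984 is rejected because it is not a complete date; a different rule would need a different check.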

● Inconsistent values for the same attribute (e.g. Victoria being represented as “V”,
“Vic” and “Victoria”)
Mitigation: Use regular expressions to replace extended values with abbreviations to
ensure consistency across addresses.
Recommendation: Enforce a drop-down list for the user entering the data rather
than a free text field.
In order to construct meaningful variables for the model, the data has been cleaned to avoid
multiple representations of the same value. Additionally, gender records marked ‘U’ have been
replaced based on the distribution from the training dataset.
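
The regular-expression mitigation above might be sketched as follows. The target abbreviations ("VIC", "NSW"), the pattern list and the `normalise_state` helper are assumptions for illustration; the source only names the three Victoria variants.

```python
import re

# Hypothetical mapping from accepted spellings to a single abbreviation.
STATE_PATTERNS = [
    (re.compile(r"^(v|vic|victoria)$", re.IGNORECASE), "VIC"),
    (re.compile(r"^(nsw|new south wales)$", re.IGNORECASE), "NSW"),
]


def normalise_state(raw):
    """Map known spellings of a state to its abbreviation; leave others unchanged."""
    value = re.sub(r"\s+", " ", raw.strip())  # trim and collapse whitespace
    for pattern, abbreviation in STATE_PATTERNS:
        if pattern.match(value):
            return abbreviation
    return value


normalised = [normalise_state(s) for s in ["V", "Vic", "Victoria", " new  south wales ", "QLD"]]
print(normalised)  # ['VIC', 'VIC', 'VIC', 'NSW', 'QLD']
```

Anchoring each pattern with `^` and `$` keeps a short form such as "V" from matching inside a longer, unrelated value.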

● Inconsistent data type for the same attribute (e.g. numeric values for some fields and
strings for others)
Mitigation: Convert selected character records to numeric values. Remove non-numeric characters from strings.
Recommendation: Ensure that fact tables in the given database have constraints on data types.
Having different data types for a given field makes it difficult to interpret results at a later
stage. Therefore, appropriate data transformations have been made to ensure consistent data types for
a given field.
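
A minimal sketch of the type conversion described above is given here, assuming that digits, a decimal point and a minus sign are the only characters worth keeping. The `to_number` helper and the sample values are hypothetical.

```python
import re


def to_number(value):
    """Convert a value to float, stripping non-numeric characters from strings.

    Returns None when nothing numeric can be recovered.
    """
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical mixed-type column.
mixed_column = [42, "71", "$1,234.50", " 19.99 ", "n/a"]
converted = [to_number(v) for v in mixed_column]
print(converted)  # [42.0, 71.0, 1234.5, 19.99, None]
```

Values that come back as None can then be handled with the same drop-or-impute decision used for other missing fields.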

Moving forward, the team will continue with the data cleaning, standardisation and
transformation process for the purpose of model analysis. Questions will be raised along the
way and assumptions documented. After we have completed this, it would be great to spend
some time with your data SME to ensure that all assumptions are aligned with Sprocket
Central’s understanding.

Kind regards,
Meghna Rawat
