TASK 1 Data - Quality - Analysis

Data Analyst KPMG

Hello,

This is Meghna Rawat from the KPMG Data Analytics (Virtual Internship) team. Thank you
for providing us with the three datasets from Sprocket Central Pty Ltd. The table below
highlights the summary statistics from the three datasets received. Please let us know if
the figures are not aligned with your understanding.

The following are the details of analysis done on the dataset:

Table Name: Transaction Data
Before data cleaning: 20000 rows & 13 columns (1542 blank cells)
After data cleaning: 19445 rows & 14 columns (0 blank cells)
Table analysis:
● Total profit: $10,930,284 (approx.)
● ‘Solex’ is the most purchased brand name
● The most and least sold product lines are ‘Standard’ and ‘Mountain’ respectively

Table Name: New Customer List
Before data cleaning: 1000 rows & 18 columns (152 blank cells)
After data cleaning: 878 rows & 18 columns (0 blank cells)
Table analysis:
● Most new customers are from New South Wales, Australia
● Most customers own cars

Table Name: Customer Demographic
Before data cleaning: 4000 rows & 13 columns (806 blank cells)
After data cleaning: 3413 rows & 13 columns (0 blank cells)
Table analysis:
● Most customers are ‘mass customers’ in the wealth segment
● Most customers work in the manufacturing and financial services industries

Table Name: Customer Address
Before data cleaning: 3999 rows & 6 columns (0 blank cells)
After data cleaning: 3999 rows & 6 columns (0 blank cells)
Table analysis:
● Most customers are from New South Wales (NSW)
● Most customers have a post code between 2000 and 2190

Notable data quality issues that were encountered, and the methods used to mitigate the
identified data inconsistencies, are as follows. Recommendations have also been
provided to avoid the recurrence of data quality issues and to improve the accuracy of the
underlying data used to drive business decisions.

● Additional customer_ids in the ‘Transactions table’ and ‘Customer Address table’ but
not in ‘Customer Master (Customer Demographic)’
Mitigation: Please ensure that all tables are from the same period. Only customers in the Customer Master list will be
used as a training set for our model.
This indicates that the datasets received may not be in sync with each other, which may skew the
analysis results if there are missing data records. Please refer to the Excel file ‘data_outliers.xlsx’
for the list of mismatched records between tables.
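
For illustration, the check for customer_ids that are missing from the Customer Master can be sketched as a simple set difference. This is a minimal sketch only: the function name and the sample ids below are hypothetical and are not taken from the Sprocket Central datasets.

```python
def ids_missing_from_master(master_ids, other_ids):
    """Return the sorted customer_ids found in another table but not in the master list."""
    return sorted(set(other_ids) - set(master_ids))


# Hypothetical sample ids, for illustration only.
customer_master = [1, 2, 3, 4]
transactions = [1, 2, 2, 5]
customer_address = [1, 3, 4, 6]

missing_in_transactions = ids_missing_from_master(customer_master, transactions)
missing_in_address = ids_missing_from_master(customer_master, customer_address)

print("In transactions but not in master:", missing_in_transactions)
print("In address but not in master:", missing_in_address)
```

Any id returned by such a check would be excluded from the training set, in line with the mitigation above.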

● Various columns, such as the brand of a purchase, online order or job title, have
empty values in certain records
Mitigation: If only a small number of rows have empty values, filter those records out of the training set for prediction.
Otherwise, if it is a core field, impute based on the distribution in the training dataset.
For key datasets, such as transactions, fewer than 1% of transactions (totaling less than 0.7% of
revenue) have missing fields. These records have been removed from the training dataset.
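
As a rough sketch of this step, the snippet below drops records with empty required fields and reports what share of rows and revenue the removed records represent. The field names, the use of list_price as a stand-in for revenue, and the sample rows are assumptions made for illustration; they are not the actual figures from the transactions data.

```python
def drop_incomplete(rows, required_fields):
    """Keep only rows where every required field is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    ]


def share_removed(rows, kept_rows, value_field):
    """Return (share of rows removed, share of total value removed)."""
    total_value = sum(row[value_field] for row in rows)
    kept_value = sum(row[value_field] for row in kept_rows)
    row_share = (len(rows) - len(kept_rows)) / len(rows)
    value_share = (total_value - kept_value) / total_value
    return row_share, value_share


# Hypothetical sample rows, for illustration only.
rows = [
    {"transaction_id": 1, "brand": "Solex", "online_order": True, "list_price": 100.0},
    {"transaction_id": 2, "brand": "", "online_order": False, "list_price": 50.0},
    {"transaction_id": 3, "brand": "Brand A", "online_order": None, "list_price": 50.0},
    {"transaction_id": 4, "brand": "Brand B", "online_order": False, "list_price": 200.0},
]

complete = drop_incomplete(rows, ["brand", "online_order"])
row_share, revenue_share = share_removed(rows, complete, "list_price")
print(len(complete), row_share, revenue_share)
```

Reporting both shares makes it easier to judge whether filtering is acceptable or whether imputation should be considered instead.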
● Inaccurate data in DOB (e.g. DOB is 1984 in NewCustomerList which is an incorrect
value for DOB)
Mitigation: Please ensure that the data provided is accurate, as inaccurate data can strongly affect the training set
for our model.
I have filtered such inaccurate data out of the dataset so that the subsequent steps are
easier to manage and produce correct results without generating errors or outliers. Also, an
additional table named ‘default’ has been removed as it consisted of trash values.
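
One possible way to flag inaccurate DOB values is sketched below. It assumes that DOB is stored as an ISO date string (YYYY-MM-DD) and that an age above 110 years is implausible. The storage format, the threshold, the reference date and the sample values are all assumptions for illustration, not the rule actually applied to the dataset.

```python
from datetime import date


def is_plausible_dob(value, as_of, max_age_years=110):
    """Return True if value parses as a date, is not in the future, and implies a plausible age."""
    try:
        dob = date.fromisoformat(value)
    except (TypeError, ValueError):
        return False  # missing or malformed value
    if dob > as_of:
        return False  # date of birth in the future
    # Subtract one year if the birthday has not yet occurred in the reference year.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    return age <= max_age_years


# Hypothetical reference date and sample values, for illustration only.
reference_date = date(2018, 1, 1)
samples = ["1990-05-17", "1850-01-01", "2030-01-01", "", None]
flags = [is_plausible_dob(value, reference_date) for value in samples]
print(flags)
```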

● Inconsistent values for the same attribute (e.g. Victoria being represented as “V”,
“Vic” and “Victoria”)
Mitigation: Use regular expressions to replace extended values with abbreviations to ensure consistency across addresses.
Recommendation: Enforce a drop-down list for the user entering the data rather than a free text field.
In order to construct meaningful variables for the model, the data has been cleaned to avoid
multiple representations of the same value. Additionally, gender records marked ‘U’ have been
replaced based on the distribution from the training dataset.
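
The two clean-up steps described here can be sketched as follows. The pattern list, the function names, the fixed random seed and the sample values are assumptions for illustration; the exact expressions and sampling approach used on the dataset may differ.

```python
import random
import re
from collections import Counter

# Hypothetical patterns mapping the known spellings of a state to one abbreviation.
STATE_PATTERNS = [
    (re.compile(r"^(v|vic|victoria)$", re.IGNORECASE), "VIC"),
    (re.compile(r"^(nsw|new south wales)$", re.IGNORECASE), "NSW"),
]


def normalise_state(value):
    """Map the known spellings of a state to its abbreviation; leave other values as they are."""
    cleaned = value.strip()
    for pattern, abbreviation in STATE_PATTERNS:
        if pattern.match(cleaned):
            return abbreviation
    return cleaned


def impute_unknown(values, unknown="U", seed=0):
    """Replace unknown entries by sampling from the distribution of the known entries."""
    counts = Counter(v for v in values if v != unknown)
    if not counts:
        return list(values)  # nothing to sample from
    categories = list(counts)
    weights = [counts[c] for c in categories]
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [
        v if v != unknown else rng.choices(categories, weights=weights)[0]
        for v in values
    ]


# Hypothetical sample values, for illustration only.
states = [normalise_state(s) for s in ["V", "Vic", " Victoria ", "New South Wales", "QLD"]]
genders = impute_unknown(["F", "M", "U", "F", "U", "M", "F"])
print(states)
print(genders)
```

Sampling in proportion to the known values keeps the overall gender distribution roughly unchanged, which is the intent of replacing ‘U’ based on the distribution.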

● Inconsistent data type for the same attribute (e.g. numeric values for some fields and
strings for others)
Mitigation: Convert selected records from character to numeric type. Remove non-numeric characters from strings.
Recommendation: Ensure that fact tables in the given database have constraints on data types.
Having different data types for a given field makes it difficult to interpret results at a later
stage. Therefore, appropriate data transformations have been made to ensure consistent data types for
a given field.
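
A minimal sketch of such a transformation is shown below. The helper name and the sample values are hypothetical, and returning None for values that cannot be converted is an assumption about how unusable entries would be handled.

```python
import re


def to_number(value):
    """Convert a value to float, stripping non-numeric characters from strings first."""
    if isinstance(value, (int, float)):
        return float(value)
    # Keep only digits, the decimal point and the minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None  # nothing numeric could be recovered


# Hypothetical sample values, for illustration only.
raw_values = ["$1,234.50", 42, " 71.49 ", "n/a"]
converted = [to_number(v) for v in raw_values]
print(converted)
```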

Moving forward, the team will continue with the data cleaning, standardisation and
transformation process for the purpose of model analysis. Questions will be raised along the
way and assumptions documented. After we have completed this, it would be great to spend
some time with your data SME to ensure that all assumptions are aligned with Sprocket
Central’s understanding.

Kind regards,
Meghna Rawat
