Project Description
1. Introduction
This project focuses on using Spark’s structured API and Spark SQL to create, query, and
audit data within a relational database environment. You will work with multiple CSV files,
perform data quality checks, and address business-related questions.
You will create tables from the provided CSV files (sales-0.csv and sales-1.csv) and perform various transformations or
queries using Spark SQL. Remember that there may be multiple ways to answer a question, so
your approach can be unique.
(Refer to in-class demos or Databricks documentation if you need a refresher on how to upload
CSV files and create tables.)
Using the Spark SQL environment, write queries to answer the questions below.
(Hint: Each question can be solved with one or more SQL statements. Be creative and
efficient!)
(This part is not graded, but you may want to attempt it if time permits or to deepen your
analysis.)
1. stage_sales Table
o Create a table called stage_sales by first loading sales-0.csv (which contains
the header).
o Then update it (append/insert) using sales-1.csv.
o This table serves as the staging area for your sales data before it goes into
production tables. (One way to build both this table and audit_log is sketched
after this list.)
2. audit_log Table
o Create another table called audit_log, which will hold records of all data quality
issues found in stage_sales.
o The columns in audit_log should include:
§ tbl: Name of the table being audited (e.g., stage_sales).
§ audit_type: A short label for the type of audit
(e.g., 'missing_orderID').
§ count: The number of records that have the specific audit problem.
§ ts: The timestamp when the audit was performed.
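One possible way to set up both tables in Spark SQL on Databricks is sketched below. The /FileStore/tables/ paths, the use of inferSchema, the column types in audit_log, and the presence of a header row in sales-1.csv are all assumptions; adjust them to your workspace and data.

-- Stage sales-0.csv (has a header row) as a temporary view,
-- then materialize it as a managed table.
CREATE TEMPORARY VIEW sales_0_raw
USING CSV
OPTIONS (path '/FileStore/tables/sales-0.csv', header 'true', inferSchema 'true');

CREATE TABLE stage_sales AS SELECT * FROM sales_0_raw;

-- Append sales-1.csv. Whether it has a header row is an assumption;
-- flip header to 'false' if it does not.
CREATE TEMPORARY VIEW sales_1_raw
USING CSV
OPTIONS (path '/FileStore/tables/sales-1.csv', header 'true', inferSchema 'true');

INSERT INTO stage_sales SELECT * FROM sales_1_raw;

-- audit_log holds one row per data quality finding. Column types are
-- assumptions; `count` is backquoted to avoid clashing with the function name.
CREATE TABLE audit_log (
  tbl        STRING,
  audit_type STRING,
  `count`    BIGINT,
  ts         TIMESTAMP
);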
You will now insert into audit_log the counts of records that violate specific rules. At a
minimum, capture the following types of data issues from stage_sales (sample INSERT
statements are sketched after this list):
1. Blank orderID
o In general, you want to check for blank or null values in important columns.
o Insert a record into audit_log indicating how many such rows exist.
2. Missing Location
o For records with missing or null location fields.
o Insert another record in audit_log for the count of these issues.
3. Invalid Dates
o For records with dates that do not match a valid format (e.g., YYYY-MM-DD).
o Insert the count of invalid date records into audit_log as well.
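A sketch of the three minimum audits follows. The column names orderID, location, and saleDate are assumptions taken from the question wording; match them to your actual stage_sales schema.

-- Blank or null orderID
INSERT INTO audit_log
SELECT 'stage_sales', 'missing_orderID', COUNT(*), current_timestamp()
FROM stage_sales
WHERE orderID IS NULL OR trim(orderID) = '';

-- Missing or null location
INSERT INTO audit_log
SELECT 'stage_sales', 'missing_location', COUNT(*), current_timestamp()
FROM stage_sales
WHERE location IS NULL OR trim(location) = '';

-- Invalid dates: to_date() yields NULL when the value does not parse
-- (assumes ANSI mode is off; with ANSI mode on, use try_to_date instead).
INSERT INTO audit_log
SELECT 'stage_sales', 'invalid_date', COUNT(*), current_timestamp()
FROM stage_sales
WHERE saleDate IS NULL OR to_date(saleDate, 'yyyy-MM-dd') IS NULL;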
(THE QUESTIONS BELOW ARE NOT GOING TO BE GRADED! But feel free to expand your
checks to include any other relevant data quality concerns: negative sales amounts, impossible
store IDs, etc. Starting-point queries are sketched after this list.)
1. Duplicate Records: Check if there are any exact duplicate rows in stage_sales.
2. Future Dates: Check if any sales dates are beyond 2015 or 2016 (depending on your data
context).
3. Unusually Large or Negative Sales: Identify any sales that are unreasonably large or
negative.
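Possible starting points for these optional checks, under the same assumed column names (amount is also an assumption):

-- 1. Exact duplicates: group on every column of your schema
SELECT orderID, location, saleDate, amount, COUNT(*) AS n
FROM stage_sales
GROUP BY orderID, location, saleDate, amount
HAVING COUNT(*) > 1;

-- 2. Future dates (pick the cutoff that fits your data context)
SELECT * FROM stage_sales
WHERE to_date(saleDate, 'yyyy-MM-dd') > DATE '2016-12-31';

-- 3. Unusually large or negative sales (thresholds are assumptions)
SELECT * FROM stage_sales
WHERE amount < 0 OR amount > 100000;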
5. Deliverables
1. Spark Notebook (PDDS-Project):
o Contains all code (SQL queries or DataFrame operations) that create tables,
answer business questions, and perform data quality checks.
o Well-commented with explanations of your approach.
2. Audit Log Entries:
o Show the results of your data quality checks (e.g., using SELECT * FROM
audit_log).
3. Short Write-Up (Optional):
o Summarize your findings, highlight any interesting insights, and note any
assumptions made.