
Project 1

Apache Spark, Relational Database & MySQL


Due January 7, 2025, 11:59 PM EST

1. Introduction
This project focuses on using Spark’s structured API and Spark SQL to create, query, and
audit data within a relational database environment. You will work with multiple CSV files,
perform data quality checks, and address business-related questions.

Key Learning Objectives

1. Use Spark Structured APIs
   Learn how to load data into Spark and manipulate large datasets using DataFrames and Spark transformations.
2. Perform Data Auditing
   Understand how to identify and log inconsistencies in the data (e.g., missing values or invalid dates).
3. Create and Query a Relational Database
   Practice creating database tables from CSV files and writing SQL queries to retrieve insights.
4. Business Insights & Decision Making
   Apply real-world techniques for analyzing sales, payroll, and other metrics to inform business decisions.

2. Project Files and Setup

You have the following files for this project:

1. Project Description (this file)
2. tbl2015payroll.csv
3. tbl2015property.csv
4. tbl2015sales.csv
5. sales-0.csv
6. sales-1.csv

Important:

You will create tables using the CSV files listed above and perform various transformations or queries using Spark SQL. Remember that there may be multiple ways to answer a question, so your approach can be unique.

3. Retrieving Information from a Database

3.1 Step 1: Create Spark Tables

1. Create a Databricks Notebook named PDDS-Project.
2. Upload the following CSV files:
   o tbl2015payroll.csv
   o tbl2015property.csv
   o tbl2015sales.csv
3. Create Spark tables from these CSVs (see the sketch after this step):
   o 2015payroll
   o 2015property
   o 2015sales

(Refer to in-class demos or Databricks documentation if you need a refresher on how to upload CSV files and create tables.)
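
For reference, one way to register an uploaded CSV as a Spark SQL table is sketched below. This is a minimal sketch, not the required approach: the /FileStore/tables/ path is an assumption about where Databricks places uploaded files, and the options assume each CSV carries a header row; adjust both to your workspace. Note the backticks, which Spark SQL requires for identifiers that begin with a digit.

CREATE TABLE `2015payroll`
USING CSV
OPTIONS (
  path '/FileStore/tables/tbl2015payroll.csv',  -- assumed upload location
  header 'true',       -- first row holds column names
  inferSchema 'true'   -- let Spark guess column types
);
-- Repeat for `2015property` and `2015sales`.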

3.2 Step 2: Answer Business Questions

Using the Spark SQL environment, write queries to answer the questions below. A hedged sketch of a few of these queries follows the list.
(Hint: Each question can be solved with one SQL statement or several. Be creative and efficient!)

1. Which store manager(s) make more than $100,000?
2. What is the total payroll for each position where total wages exceed $700,000?
   o Suggestion: Group by position and use SUM(wages), then filter on that sum.
3. Manager Pay Increase
   o The company wants to give a 3% wage increase to managers whose 2015 store sales exceed their 2014 store sales.
     1. Calculate the 3% increase for each eligible manager.
     2. Who receives the highest increase?
4. Feasibility of a 3% Payroll Increase (Company-Wide)
   o The company is skeptical about a 3% raise for all employees.
   o By store, calculate the difference between the current payroll and what the payroll would be if everyone received a 3% raise.
5. Sales by Store and State
   o List each store's total sales alongside the state in which the store is located.
   o Hint: Join 2015sales with 2015property on a store or location identifier.
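
The sketch below illustrates questions 1, 2, 4, and 5. The column names (position, wages, store, state, sales) and the 'Manager' position label are assumptions made for illustration; adjust them to match the schemas Spark infers from your CSVs.

-- Q1: managers earning more than $100,000 ('Manager' label is assumed)
SELECT *
FROM `2015payroll`
WHERE position = 'Manager'
  AND wages > 100000;

-- Q2: total payroll per position, keeping positions above $700,000
SELECT position, SUM(wages) AS total_wages
FROM `2015payroll`
GROUP BY position
HAVING SUM(wages) > 700000;

-- Q4: per-store cost of a 3% across-the-board raise
SELECT store,
       SUM(wages)        AS current_payroll,
       SUM(wages) * 1.03 AS payroll_with_raise,
       SUM(wages) * 0.03 AS difference
FROM `2015payroll`
GROUP BY store;

-- Q5: total sales per store with the store's state
SELECT s.store, p.state, SUM(s.sales) AS total_sales
FROM `2015sales` s
JOIN `2015property` p
  ON s.store = p.store
GROUP BY s.store, p.state;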
3.3 (Optional) Additional Business Questions

(This section will not be graded, but you may want to attempt it if time permits or to deepen your analysis.)

1. Which positions have the highest average wage?
2. Which store(s) or region(s) have the largest gap between 2014 and 2015 sales?
3. Top-Selling Products or Categories (if your data includes product details): Identify the top 3 categories or products generating the most revenue in 2015.
4. Year-Over-Year Sales Comparison: Which stores saw a sales decrease from 2014 to 2015?

4. Data Quality Checks

To ensure high data quality, you will create a staging table and an audit log table. These will help you identify and keep track of invalid or incomplete records.

4.1 Step 1: Create a Staging and Auditing Environment

1. stage_sales Table
   o Create a table called stage_sales by first loading sales-0.csv (which contains the header).
   o Then update it (append/insert) using sales-1.csv.
   o This table serves as the staging area for your sales data before it goes into production tables.
2. audit_log Table
   o Create another table called audit_log, which will hold records of all data quality issues found in stage_sales.
   o The columns in audit_log should include:
     § tbl: Name of the table being audited (e.g., stage_sales).
     § audit_type: A short label for the type of audit (e.g., 'missing_orderID').
     § count: The number of records that have the specific audit problem.
     § ts: The timestamp when the audit was performed.
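
A minimal sketch of this setup follows. The /FileStore/tables/ paths are assumptions about where your uploads land, and the sketch assumes sales-1.csv has the same column order as sales-0.csv but no header row, so INSERT INTO ... SELECT * can match columns by position.

-- Stage the headered file first
CREATE TEMPORARY VIEW sales_0_v
USING CSV
OPTIONS (path '/FileStore/tables/sales-0.csv', header 'true', inferSchema 'true');

CREATE TABLE stage_sales AS
SELECT * FROM sales_0_v;

-- Append the header-less file; columns match by position
CREATE TEMPORARY VIEW sales_1_v
USING CSV
OPTIONS (path '/FileStore/tables/sales-1.csv', header 'false', inferSchema 'true');

INSERT INTO stage_sales
SELECT * FROM sales_1_v;

-- Audit table; count is backticked because it is also a function name
CREATE TABLE audit_log (
  tbl        STRING,     -- table being audited
  audit_type STRING,     -- short label for the check
  `count`    BIGINT,     -- number of offending rows
  ts         TIMESTAMP   -- when the audit ran
);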

4.2 Step 2: Perform Data Quality Checks

You will now insert into audit_log the counts of records that violate specific rules. At a minimum, capture the following types of data issues from stage_sales (a sketch of these inserts follows the list):

1. Blank orderID
   o In general, you want to check for blank or null values in important columns.
   o Insert a record into audit_log indicating how many such rows exist.
2. Missing Location
   o For records with missing or null location fields.
   o Insert another record into audit_log for the count of these issues.
3. Invalid Dates
   o For records with dates that do not match a known valid format (e.g., YYYY-MM-DD or any valid date format).
   o Insert the count of invalid date records into audit_log as well.
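
One possible shape for these inserts is sketched below. It assumes stage_sales has columns named orderID, location, and orderDate; the last is a hypothetical name, so substitute whatever date column your CSVs actually contain. With ANSI mode off (the Spark default), to_date returns NULL for values it cannot parse, which makes it a convenient validity test (note that this also counts missing dates as invalid).

-- 1. Blank or null orderID
INSERT INTO audit_log
SELECT 'stage_sales', 'missing_orderID', COUNT(*), current_timestamp()
FROM stage_sales
WHERE orderID IS NULL OR trim(orderID) = '';

-- 2. Missing location
INSERT INTO audit_log
SELECT 'stage_sales', 'missing_location', COUNT(*), current_timestamp()
FROM stage_sales
WHERE location IS NULL OR trim(location) = '';

-- 3. Invalid dates (orderDate is an assumed column name)
INSERT INTO audit_log
SELECT 'stage_sales', 'invalid_date', COUNT(*), current_timestamp()
FROM stage_sales
WHERE to_date(CAST(orderDate AS STRING), 'yyyy-MM-dd') IS NULL;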

(THE QUESTIONS BELOW ARE NOT GOING TO BE GRADED! But feel free to expand your checks to include any other relevant data quality concerns: negative sales amounts, impossible store IDs, etc.)

4.3 (Optional) Additional Data Auditing Ideas

If time allows, consider logging the following additional scenarios:

1. Duplicate Records: Check if there are any exact duplicate rows in stage_sales.
2. Future Dates: Check if any sales dates are beyond 2015 or 2016 (depending on your data
context).
3. Unusually Large or Negative Sales: Identify any sales that are unreasonably large or
negative.
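
For the duplicate check, one approach that avoids listing every column is to compare the total row count with the distinct row count, as in the hedged sketch below.

-- Count exact duplicate rows without naming columns explicitly
INSERT INTO audit_log
SELECT
  'stage_sales',
  'duplicate_rows',
  (SELECT COUNT(*) FROM stage_sales)
    - (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM stage_sales) d),
  current_timestamp();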

5. Deliverables
1. Spark Notebook (PDDS-Project):
o Contains all code (SQL queries or DataFrame operations) that create tables,
answer business questions, and perform data quality checks.
o Well-commented with explanations of your approach.
2. Audit Log Entries:
o Show the results of your data quality checks (e.g., using SELECT * FROM
audit_log).
3. Short Write-Up (Optional):
o Summarize your findings, highlight any interesting insights, and note any
assumptions made.

6. Grading and Submission

• Uniqueness: Each student’s approach may differ in syntax or strategy; there is no single
“correct” approach.
• Completeness: Ensure you address all steps—creating tables, querying, auditing, and
logging data.
• Clarity: Write clean, organized code and meaningful explanations.
• Timeliness: Submit before the due date.
