Project Description
1. Introduction
This project focuses on using Spark’s structured API and Spark SQL to create, query, and
audit data within a relational database environment. You will work with multiple CSV files,
perform data quality checks, and address business-related questions.
You will create tables from the provided CSV files (sales-0.csv and sales-1.csv) and perform various transformations or
queries using Spark SQL. Remember that there may be multiple ways to answer a question, so
your approach can be unique.
(Refer to in-class demos or Databricks documentation if you need a refresher on how to upload
CSV files and create tables.)
Using the Spark SQL environment, write queries to answer the questions below.
(Hint: Each question can be solved with one or more SQL statements. Be creative and
efficient!)
(This part is not graded, but you may want to attempt it if time permits or to deepen your
analysis.)
1. stage_sales Table
o Create a table called stage_sales by first loading sales-0.csv (which contains
the header).
o Then update it (append/insert) using sales-1.csv.
o This table serves as the staging area for your sales data before it goes into
production tables. (One way to build both this table and audit_log is sketched
after this list.)
2. audit_log Table
o Create another table called audit_log, which will hold records of all data quality
issues found in stage_sales.
o The columns in audit_log should include:
§ tbl: Name of the table being audited (e.g., stage_sales).
§ audit_type: A short label for the type of audit
(e.g., 'missing_orderID').
§ count: The number of records that have the specific audit problem.
§ ts: The timestamp when the audit was performed.
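One possible way to set up both tables in Spark SQL on Databricks is sketched below. The /FileStore/tables/ paths, the use of inferSchema, the column types in audit_log, and the presence of a header row in sales-1.csv are all assumptions; adjust them to your workspace and data.

-- Stage sales-0.csv (has a header row) as a temporary view,
-- then materialize it as a managed table.
CREATE TEMPORARY VIEW sales_0_raw
USING CSV
OPTIONS (path '/FileStore/tables/sales-0.csv', header 'true', inferSchema 'true');

CREATE TABLE stage_sales AS SELECT * FROM sales_0_raw;

-- Append sales-1.csv. Whether it has a header row is an assumption;
-- flip header to 'false' if it does not.
CREATE TEMPORARY VIEW sales_1_raw
USING CSV
OPTIONS (path '/FileStore/tables/sales-1.csv', header 'true', inferSchema 'true');

INSERT INTO stage_sales SELECT * FROM sales_1_raw;

-- audit_log holds one row per data quality finding. Column types are
-- assumptions; `count` is backquoted to avoid clashing with the function name.
CREATE TABLE audit_log (
  tbl        STRING,
  audit_type STRING,
  `count`    BIGINT,
  ts         TIMESTAMP
);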
You will now insert into audit_log the counts of records that violate specific rules. At a
minimum, capture the following types of data issues from stage_sales (sample INSERT
statements are sketched after this list):
1. Blank orderID
o In general, you want to check for blank or null values in important columns.
o Insert a record into audit_log indicating how many such rows exist.
2. Missing Location
o For records with missing or null location fields.
o Insert another record in audit_log for the count of these issues.
3. Invalid Dates
o For records with dates that do not match a valid format (e.g., YYYY-MM-DD).
o Insert the count of invalid date records into audit_log as well.
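A sketch of the three minimum audits follows. The column names orderID, location, and saleDate are assumptions taken from the question wording; match them to your actual stage_sales schema.

-- Blank or null orderID
INSERT INTO audit_log
SELECT 'stage_sales', 'missing_orderID', COUNT(*), current_timestamp()
FROM stage_sales
WHERE orderID IS NULL OR trim(orderID) = '';

-- Missing or null location
INSERT INTO audit_log
SELECT 'stage_sales', 'missing_location', COUNT(*), current_timestamp()
FROM stage_sales
WHERE location IS NULL OR trim(location) = '';

-- Invalid dates: to_date() yields NULL when the value does not parse
-- (assumes ANSI mode is off; with ANSI mode on, use try_to_date instead).
INSERT INTO audit_log
SELECT 'stage_sales', 'invalid_date', COUNT(*), current_timestamp()
FROM stage_sales
WHERE saleDate IS NULL OR to_date(saleDate, 'yyyy-MM-dd') IS NULL;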
(THE QUESTIONS BELOW ARE NOT GOING TO BE GRADED! But feel free to expand your
checks to include any other relevant data quality concerns: negative sales amounts, impossible
store IDs, etc. Starting-point queries are sketched after this list.)
1. Duplicate Records: Check if there are any exact duplicate rows in stage_sales.
2. Future Dates: Check if any sales dates are beyond 2015 or 2016 (depending on your data
context).
3. Unusually Large or Negative Sales: Identify any sales that are unreasonably large or
negative.
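Possible starting points for these optional checks, under the same assumed column names (amount is also an assumption):

-- 1. Exact duplicates: group on every column of your schema
SELECT orderID, location, saleDate, amount, COUNT(*) AS n
FROM stage_sales
GROUP BY orderID, location, saleDate, amount
HAVING COUNT(*) > 1;

-- 2. Future dates (pick the cutoff that fits your data context)
SELECT * FROM stage_sales
WHERE to_date(saleDate, 'yyyy-MM-dd') > DATE '2016-12-31';

-- 3. Unusually large or negative sales (thresholds are assumptions)
SELECT * FROM stage_sales
WHERE amount < 0 OR amount > 100000;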
5. Deliverables
1. Spark Notebook (PDDS-Project):
o Contains all code (SQL queries or DataFrame operations) that create tables,
answer business questions, and perform data quality checks.
o Well-commented with explanations of your approach.
2. Audit Log Entries:
o Show the results of your data quality checks (e.g., using SELECT * FROM
audit_log).
3. Short Write-Up (Optional):
o Summarize your findings, highlight any interesting insights, and note any
assumptions made.