SQL - 02
Problem Statement:
You are a Data Analyst at Reliance Fresh. You have been tasked with studying the
farmers’ markets (mandis).
Run the query and note that the price values need to be rounded off.
Can we round off the price column to just 2 decimal places? Yes, using the ROUND
function.
● A SQL function is a piece of code that takes inputs that you give it (which are called
parameters), performs some operation on those inputs, and returns a value.
● You can use functions inline in your query to modify the raw values from the database
tables before displaying them in the output.
Function Syntax
ROUND()
● In the last query, the “price” field was displayed with four digits after the decimal point.
● Let’s say we wanted to display the number rounded to the nearest penny (in US
dollars), which is two digits after the decimal. That can be accomplished using
the ROUND() function.
SELECT
market_date,
customer_id,
vendor_id,
ROUND(quantity * cost_to_customer_per_qty, 2) AS price
FROM farmers_market.customer_purchases
LIMIT 10
TIP: The ROUND() function can also accept negative numbers for the second parameter, to
round digits that are to the left of the decimal point. For example, SELECT ROUND(1245,
-2) will return a value of 1200.
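A quick way to try both forms of ROUND() is to call it on literal values, with no table involved. This is a minimal sketch assuming a MySQL-style dialect; the value 3.14159 is an arbitrary example, and the second call is the one from the tip.

```sql
-- Round to two digits after the decimal point, and to the nearest hundred.
SELECT
  ROUND(3.14159, 2) AS two_decimals,     -- 3.14
  ROUND(1245, -2)   AS nearest_hundred;  -- 1200, as stated in the tip
```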
Transition -> Up until now, we have manipulated only numbers. What about strings?
There are also inline functions that can be used to modify string values in SQL.
In our customer table, there are separate columns for each customer’s first and last
names. Let’s quickly look at the customer table.
SELECT *
FROM farmers_market.customer
LIMIT 5
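The CONCAT() function joins these columns into a single full-name value:
SELECT
customer_id,
CONCAT(customer_first_name, ' ', customer_last_name) AS customer_name -- alias name is illustrative
FROM farmers_market.customer
LIMIT 5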
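● You can also sort the output by name, here by last name and then by first name:
SELECT
customer_id,
CONCAT(customer_first_name, ' ', customer_last_name) AS customer_name
FROM farmers_market.customer
ORDER BY customer_last_name, customer_first_name
LIMIT 5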
● It’s also possible to nest functions inside other functions, which the SQL
interpreter executes from the “inside” to the “outside.”
● UPPER() is a function that capitalizes string values.
● We can enclose the CONCAT() function inside it to uppercase the full name.
● Let’s also change the order of the concatenation parameters to put the last name
first, and add a comma after the last name.
SELECT
customer_id,
UPPER(CONCAT(customer_last_name, ', ', customer_first_name)) AS customer_name
FROM farmers_market.customer
LIMIT 5
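To see the inside-to-outside evaluation order on its own, the nested call can be run on literal strings. This is only a sketch: the name is taken from the examples in this lecture, and the column aliases are illustrative.

```sql
-- CONCAT() runs first and builds the string; UPPER() then capitalizes its result.
SELECT
  CONCAT('Diaz', ', ', 'Carlos')        AS inner_result,  -- 'Diaz, Carlos'
  UPPER(CONCAT('Diaz', ', ', 'Carlos')) AS outer_result;  -- 'DIAZ, CARLOS'
```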
Transition -> So, now you know how to query data in a straightforward manner, do
some inline calculations, and sort the results.
But one of the most common requirements is to filter rows based on certain conditions.
The WHERE clause goes after the FROM clause and before any GROUP BY,
ORDER BY, or LIMIT clauses in the SELECT query.
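SELECT
product_id,
product_name,
product_category_id
FROM farmers_market.product
WHERE product_category_id = 1
LIMIT 5
Similarly, to see only the purchases made by customer 4:
SELECT
market_date, customer_id,
vendor_id, product_id,
quantity,
quantity * cost_to_customer_per_qty AS price
FROM farmers_market.customer_purchases
WHERE customer_id = 4
LIMIT 5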
SELECT
product_id,
product_name
FROM farmers_market.product
WHERE
(product_id = 10
OR product_id > 3)
AND product_id < 8
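Explanation:
● When the product ID is 10, the WHERE clause in this query is evaluated as:
(TRUE OR TRUE) AND FALSE = TRUE AND FALSE = FALSE, so the row is not returned.
● If the parentheses were instead placed around the AND conditions, as in
product_id = 10 OR (product_id > 3 AND product_id < 8), it would be evaluated as:
TRUE OR (TRUE AND FALSE) = TRUE OR (FALSE) = TRUE, so the row is returned.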
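The effect of moving the parentheses can be checked on a few literal product IDs. The sketch below assumes a MySQL-style dialect, where a comparison in the SELECT list is shown as 1 (true) or 0 (false); the three IDs and the column aliases are made up for illustration.

```sql
-- Compare the two placements of parentheses side by side.
SELECT
  product_id,
  (product_id = 10 OR (product_id > 3 AND product_id < 8)) AS or_outside,
  ((product_id = 10 OR product_id > 3) AND product_id < 8) AS and_outside
FROM (
  SELECT 5 AS product_id   -- or_outside = 1, and_outside = 1
  UNION ALL SELECT 10      -- or_outside = 1, and_outside = 0
  UNION ALL SELECT 2       -- or_outside = 0, and_outside = 0
) AS sample_products;
```

Only the row with product ID 10 differs between the two columns, which is why the placement of parentheses matters whenever AND and OR are mixed.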
Multi-column filtering - WHERE clauses can also impose conditions using values in multiple
columns.
Question: Find the details of purchases made by customer 4 at vendor 7. We could use the
following query:
SELECT
market_date,
customer_id,
vendor_id,
quantity * cost_to_customer_per_qty AS price
FROM farmers_market.customer_purchases
WHERE
customer_id = 4
AND vendor_id = 7
Filtering based on strings
Question: Find the details of customers with the first name “Carlos” or the last name “Diaz”:
SELECT
customer_id,
customer_first_name,
customer_last_name
FROM farmers_market.customer
WHERE
customer_first_name = 'Carlos'
OR customer_last_name = 'Diaz'
BETWEEN
The last query that we wrote in the previous lecture was:
SELECT *
FROM farmers_market.vendor_booth_assignments
WHERE vendor_id = 2
AND market_date <= '2019-03-09'
ORDER BY market_date
We can also use the BETWEEN keyword to see if a value, such as a date, is within a specified
range of values.
Question: Find the booth assignments for vendor 7 for any market
date that occurred between April 3, 2019, and May 16, 2019,
including either of the two dates.
SELECT *
FROM farmers_market.vendor_booth_assignments
WHERE
vendor_id = 7
AND market_date BETWEEN '2019-04-03' AND '2019-05-16'
ORDER BY market_date
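BETWEEN includes both endpoints of the range, which can be checked on a handful of literal dates. This is a sketch under assumptions: the four dates are made-up sample values stored as plain YYYY-MM-DD strings (which compare in date order), and a MySQL-style dialect displays the comparison as 1 or 0.

```sql
-- Test dates just outside and exactly on the range boundaries.
SELECT
  market_date,
  market_date BETWEEN '2019-04-03' AND '2019-05-16' AS in_range
FROM (
  SELECT '2019-04-02' AS market_date  -- day before the range: 0
  UNION ALL SELECT '2019-04-03'       -- lower bound: 1
  UNION ALL SELECT '2019-05-16'       -- upper bound: 1
  UNION ALL SELECT '2019-05-17'       -- day after the range: 0
) AS sample_dates;
```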
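IN - Keyword
Approach 1: To return customers with any of several last names, we could chain
multiple OR conditions: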
SELECT
customer_id,
customer_first_name,
customer_last_name
FROM farmers_market.customer
WHERE
customer_last_name = 'Diaz'
OR customer_last_name = 'Edwards'
OR customer_last_name = 'Wilson'
Approach 2: An alternative way to do the same thing, which may come in handy if you
had a long list of names, is to use the IN keyword and provide a comma-separated list
of values to compare against.
The IN keyword will return TRUE for any row with a customer_last_name that is in the
provided list.
SELECT
customer_id,
customer_first_name,
customer_last_name
FROM farmers_market.customer
WHERE
customer_last_name IN ('Diaz', 'Edwards', 'Wilson')
Both queries return the same output, but whenever you have a long list of names or
values that you want to search against, IN is the better and more convenient choice.
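The equivalence of the two approaches can be seen by computing both conditions for the same rows. The sketch below assumes a MySQL-style dialect (conditions display as 1 or 0); the name 'Smith' and the column aliases are made up for illustration.

```sql
-- IN and a chain of ORs give the same answer for every row.
SELECT
  last_name,
  last_name IN ('Diaz', 'Edwards', 'Wilson') AS matched_by_in,
  (last_name = 'Diaz'
   OR last_name = 'Edwards'
   OR last_name = 'Wilson')                  AS matched_by_or
FROM (
  SELECT 'Diaz' AS last_name   -- 1, 1
  UNION ALL SELECT 'Wilson'    -- 1, 1
  UNION ALL SELECT 'Smith'     -- 0, 0
) AS sample_names;
```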
IN is also handy when you are unsure of a spelling. If someone asked you to look up the
customer ID for a name such as Manuel, but you don’t know how to spell the name exactly,
you might try searching against a list with multiple spellings, like this:
SELECT
customer_id,
customer_first_name,
customer_last_name
FROM farmers_market.customer
WHERE
customer_first_name IN ('Mann', 'Mannuel', 'Manuel')
LIKE keyword
Question: You want to get data about a customer you knew as
“Jerry,” but you aren’t sure if he was listed in the database as
“Jerry” or “Jeremy” or “Jeremiah.”
All you knew for sure was that the first three letters were “Jer.”
In SQL, instead of listing every variation you can think of, you can search for partially
matched strings using a comparison operator called LIKE, and wildcard
characters, which serve as a placeholder for unknown characters in a string.
% wildcard
The wildcard character % (percent sign) can serve as a stand-in for any number of
characters (including none).
So the comparison LIKE 'Jer%' will search for strings that start with “Jer” and have any
(or no) additional characters after the “r”:
SELECT
customer_id,
customer_first_name,
customer_last_name
FROM farmers_market.customer
WHERE
customer_first_name LIKE 'Jer%'
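The "any number of characters, including none" behavior of % can be checked on a few literal names. This is a sketch assuming a MySQL-style dialect (the comparison displays as 1 or 0); the sample names, including 'Jane', are illustrative and not taken from the customer table.

```sql
-- 'Jer%' matches anything that starts with "Jer", including "Jer" itself.
SELECT
  first_name,
  first_name LIKE 'Jer%' AS starts_with_jer
FROM (
  SELECT 'Jerry' AS first_name  -- 1
  UNION ALL SELECT 'Jeremiah'   -- 1
  UNION ALL SELECT 'Jer'        -- 1 (no characters after the "r")
  UNION ALL SELECT 'Jane'       -- 0
) AS sample_names;
```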
IS NULL
You have learnt how to find and treat missing values in Python, but what happens when
your database contains missing values?
● In the product table, the product_size field is optional, so it’s possible to add a
record for a product with no size.
Question: Find all of the products from the product table without sizes.
SELECT *
FROM farmers_market.product
WHERE product_size IS NULL
Blank vs NULL
Note: Keep in mind that “blank” and NULL are not the same thing in database
terms.
● If someone asked you to find all products that didn’t have product sizes, you
might also want to check for blank strings, which would equal '', or rows where
someone entered a space or any number of spaces into that field.
● The TRIM() function removes excess spaces from the beginning or end of a
string value, so if you use a combination of the TRIM() function and blank string
comparison, you can find any row that is blank or contains only spaces.
● In this case, the “Red Potatoes - Small” row, shown in Figure 3.14, has a
product_size with one space in it, ' ', so could be found using the following query:
SELECT *
FROM farmers_market.product
WHERE
product_size IS NULL
OR TRIM(product_size) = ''
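The TRIM() comparison can be tried on literal strings to confirm that it catches both empty values and values made only of spaces. A minimal sketch, assuming a MySQL-style dialect; the sample values and aliases are made up, and the brackets are added only to make the spaces visible.

```sql
-- After TRIM(), strings that contain only spaces become the empty string.
SELECT
  CONCAT('[', product_size, ']') AS shown_in_brackets,
  TRIM(product_size) = ''        AS is_blank
FROM (
  SELECT '' AS product_size  -- is_blank = 1
  UNION ALL SELECT ' '       -- is_blank = 1
  UNION ALL SELECT '   '     -- is_blank = 1
  UNION ALL SELECT 'small'   -- is_blank = 0
) AS sample_sizes;
```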
Warning about NULL values
● You might wonder why the comparison operator IS NULL is used instead of
= NULL, the way = is used to compare numbers.
● NULL is not actually a value, it’s the absence of a value, so it can’t be compared to
any existing value.
● If your query were filtered to WHERE product_size = NULL, no rows would be returned,
even though there is a record with a NULL product_size, because nothing “equals”
NULL, even NULL.
● If you wanted to return all records that don’t have NULL values in a field, you
could use the condition “[field name] IS NOT NULL” in the WHERE clause.
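The difference between = NULL and IS NULL shows up even without a table. The sketch below assumes a MySQL-style dialect, where true displays as 1; the literal 'small' and the aliases are illustrative.

```sql
-- Comparing with = yields NULL (unknown), never true; IS NULL yields true or false.
SELECT
  NULL = NULL         AS equals_null,  -- NULL, so a WHERE clause would reject the row
  NULL IS NULL        AS is_null,      -- 1
  'small' IS NOT NULL AS is_not_null;  -- 1
```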
When the IN list comparison was demonstrated earlier, it used a hard-coded list of
values.
● What if you wanted to filter to a list of values that was returned by another query?
● In other words, you wanted a dynamic list. There is a way to do that in SQL,
using a subquery (a query inside a query).
There is a value in the market_date_info table called market_rain_flag that has a value
of 0 if it didn’t rain while the market was open and a value of 1 if it did.
● 0 - it didn’t rain
● 1 - it did
First, let’s write a query that gets a list of market dates when it rained:
SELECT
market_date,
market_rain_flag
FROM farmers_market.market_date_info
WHERE market_rain_flag = 1
● Now let’s use the list of dates generated by that query to return purchases made
on those dates.
● Note that when using a query in an IN comparison, you can only return the field
you’re comparing to, so we will not include the market_rain_flag field in the
following subquery.
● Therefore, the query inside the parentheses just returns the dates.
● The “outer” query looks for customer_purchases records with a market_date
value in that list of dates.
SELECT
market_date,
customer_id,
vendor_id,
quantity * cost_to_customer_per_qty AS price
FROM farmers_market.customer_purchases
WHERE
market_date IN
(
SELECT market_date
FROM farmers_market.market_date_info
WHERE market_rain_flag = 1
)
LIMIT 5
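The same pattern can be tried without the farmers_market schema by replacing both tables with small inline row sets. This is only a sketch: the dates, customer IDs, and rain flags below are made-up sample data, and the aliases p and d are illustrative.

```sql
-- Outer query: keep only purchases whose date appears in the subquery's list.
SELECT p.market_date, p.customer_id
FROM (
  SELECT '2019-04-03' AS market_date, 1 AS customer_id
  UNION ALL SELECT '2019-04-06', 2
  UNION ALL SELECT '2019-04-10', 3
) AS p
WHERE p.market_date IN (
  -- Subquery: returns a single column, the dates on which it rained.
  SELECT d.market_date
  FROM (
    SELECT '2019-04-03' AS market_date, 1 AS market_rain_flag
    UNION ALL SELECT '2019-04-06', 0
    UNION ALL SELECT '2019-04-10', 1
  ) AS d
  WHERE d.market_rain_flag = 1
);
```

With this sample data, the purchases on 2019-04-03 and 2019-04-10 are returned, and the one on 2019-04-06 is filtered out because its rain flag is 0.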
So, WHERE works when you have conditional statements to filter out your rows.
But what if, instead of using conditional statements to filter rows, you want a column or
value in your dataset to be based on a conditional statement?
For example, instead of filtering your results for transactions over $50, say you just want
to return all rows and create a new column that flags each transaction as being above
or below $50.
CASE Statements
Note: If you’re familiar with scripting languages like Python that use “if” statements,
you’ll find that SQL handles conditional logic somewhat similarly, with different syntax.
Conditional flow: “If [one condition] is true, then [take this action].
Otherwise, [take this other action].”
For example: “If the weather forecast predicts it will rain today, then I’ll take an umbrella
with me. Otherwise, I’ll leave the umbrella at home.”
In SQL, the code to describe this type of logic is called a CASE statement, which
uses the following syntax:
CASE
WHEN [first conditional statement]
THEN [value or calculation]
WHEN [second conditional statement]
THEN [value or calculation]
ELSE [value or calculation]
END
This statement indicates that you want a column to contain different values under
different conditions.
CASE
WHEN weather_forecast = 'rain'
THEN 'take umbrella'
ELSE 'leave umbrella at home'
END
Execution Method
● The WHENs are evaluated in order, from top to bottom, and the first time a
condition evaluates to TRUE, the corresponding THEN part of the statement is
executed, and no other WHEN conditions are evaluated.
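The top-to-bottom evaluation can be seen with overlapping conditions: a value that satisfies more than one WHEN gets the label of the first match only. A minimal sketch with made-up prices and labels:

```sql
-- 3 satisfies both "price < 5" and "price < 10", but only the first WHEN applies.
SELECT
  price,
  CASE
    WHEN price < 5  THEN 'under 5'
    WHEN price < 10 THEN 'under 10'
    ELSE '10 or more'
  END AS label
FROM (
  SELECT 3 AS price    -- 'under 5'
  UNION ALL SELECT 7   -- 'under 10'
  UNION ALL SELECT 12  -- '10 or more'
) AS sample_prices;
```

Because of this ordering, the conditions should be listed from the most specific to the most general.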
Question: Find out which vendors primarily sell fresh produce and
which don’t.
The vendors we want to label as “Fresh Produce” have the word “Fresh” in the
vendor_type column.
We can use a CASE statement and the LIKE operator to create a new column, which
we’ll alias vendor_type_condensed, that condenses the vendor types to just “Fresh
Produce” or “Other”:
SELECT
vendor_id,
vendor_name,
vendor_type,
CASE
WHEN LOWER(vendor_type) LIKE '%fresh%'
THEN 'Fresh Produce'
ELSE 'Other'
END AS vendor_type_condensed
FROM farmers_market.vendor
We’re using the LOWER() function to lowercase the vendor type string, so that the
comparison doesn’t fail because of capitalization.
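The effect of LOWER() can be checked on a few differently capitalized strings. This is a sketch with made-up vendor types, not the actual contents of the vendor table; note also that whether LIKE is case-sensitive without LOWER() depends on the database and its collation settings, so lowercasing first is the safer habit.

```sql
-- All capitalizations of "fresh" are matched once the string is lowercased.
SELECT
  vendor_type,
  CASE
    WHEN LOWER(vendor_type) LIKE '%fresh%' THEN 'Fresh Produce'
    ELSE 'Other'
  END AS vendor_type_condensed
FROM (
  SELECT 'Fresh Focused' AS vendor_type  -- 'Fresh Produce'
  UNION ALL SELECT 'FRESH VARIETY'       -- 'Fresh Produce'
  UNION ALL SELECT 'Prepared Foods'      -- 'Other'
) AS sample_vendors;
```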
Creating binary flags using CASE
The farmers’ markets in our database all occur on Wednesday evenings or Saturday mornings.
Many machine learning algorithms won’t know what to do with the words “Wednesday” and
“Saturday” that appear in our database.
Binary Encoding: The algorithm could, however, use a numeric value as an input.
One approach to including the market day in our dataset is to generate a
binary flag field that indicates whether it’s a weekday or weekend market.
We can do this with a CASE statement, making a new column that contains a 1 if the
market occurs on a Saturday or Sunday, and a 0 if it doesn’t.
We’ll call this field “weekend_flag”:
SELECT
market_date,
CASE
WHEN market_day = 'Saturday' OR market_day = 'Sunday'
THEN 1 ELSE 0
END AS weekend_flag
FROM farmers_market.market_date_info
LIMIT 5
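The same flag can also be written with the IN keyword covered earlier, which reads a little more compactly than two OR conditions. A sketch on two literal day names (the inline rows are sample data, not the market_date_info table):

```sql
-- IN replaces "market_day = 'Saturday' OR market_day = 'Sunday'".
SELECT
  market_day,
  CASE
    WHEN market_day IN ('Saturday', 'Sunday') THEN 1
    ELSE 0
  END AS weekend_flag
FROM (
  SELECT 'Wednesday' AS market_day  -- weekend_flag = 0
  UNION ALL SELECT 'Saturday'       -- weekend_flag = 1
) AS sample_days;
```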
Grouping or Binning Continuous Values Using CASE
We had a query that filtered to only customer purchases where an item or quantity of an
item cost over $50, by putting a conditional statement in the WHERE clause.
But let’s say we wanted to return all rows, and instead of using that value as a filter, only
indicate whether the cost was over $50 or not.
SELECT
market_date,
customer_id,
vendor_id,
ROUND(quantity * cost_to_customer_per_qty, 2) AS price,
CASE
WHEN quantity * cost_to_customer_per_qty > 50
THEN 1 ELSE 0
END AS price_over_50
FROM farmers_market.customer_purchases
LIMIT 10
Now let’s say we wanted to group the purchases into price bins instead:
● under $5.00,
● $5.00–$9.99,
● $10.00–$19.99, or
● $20.00 and over.
We could accomplish that with a CASE statement in which we surround the values after
the THENs in single quotes to generate a column that contains a string label:
SELECT
market_date,
customer_id,
vendor_id,
ROUND(quantity * cost_to_customer_per_qty, 2) AS price,
CASE
WHEN quantity * cost_to_customer_per_qty < 5.00
THEN 'Under $5'
WHEN quantity * cost_to_customer_per_qty < 10.00
THEN '$5-$9.99'
WHEN quantity * cost_to_customer_per_qty < 20.00
THEN '$10-$19.99'
WHEN quantity * cost_to_customer_per_qty >= 20.00
THEN '$20 and Up'
END AS price_bin
FROM farmers_market.customer_purchases
LIMIT 10
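The binning logic can be tried on a few literal prices, including a boundary value and a missing one. This is a sketch with made-up prices; it also illustrates that a CASE statement with no ELSE returns NULL when none of the WHEN conditions is true, which is what happens for a NULL price.

```sql
-- Each price gets the label of the first bin whose condition it satisfies.
SELECT
  price,
  CASE
    WHEN price < 5.00   THEN 'Under $5'
    WHEN price < 10.00  THEN '$5-$9.99'
    WHEN price < 20.00  THEN '$10-$19.99'
    WHEN price >= 20.00 THEN '$20 and Up'
  END AS price_bin
FROM (
  SELECT 3.50 AS price    -- 'Under $5'
  UNION ALL SELECT 7.25   -- '$5-$9.99'
  UNION ALL SELECT 20.00  -- '$20 and Up' (boundary value)
  UNION ALL SELECT NULL   -- NULL: no condition is true and there is no ELSE
) AS sample_prices;
```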
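CASE can also be used to encode categorical values as numbers. For example, to convert
the booth price levels A, B, and C to 1, 2, and 3:
SELECT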
booth_number,
booth_price_level,
CASE
WHEN booth_price_level = 'A' THEN 1
WHEN booth_price_level = 'B' THEN 2
WHEN booth_price_level = 'C' THEN 3
END AS booth_price_level_numeric
FROM farmers_market.booth
LIMIT 5
SUMMARY
In this lecture, you learned
● How to filter rows using keywords such as BETWEEN, IN, and LIKE.
● How to find rows with NULL values using IS NULL, and exclude them using IS NOT NULL.
● How to use Subqueries.
● SQL CASE statement syntax for creating new columns with values based on
conditions.
● You also learned how to consolidate categorical values into fewer categories,
create binary flags, bin continuous values, and encode categorical values.
You should now be able to describe what the following two queries do.