Hive Part02 1682422243
OPTIMIZATION TECHNIQUES
PART 02
QUERY LEVEL OPTIMIZATION:
Joins Optimization:
TYPES OF JOIN:
1. Inner Join: Outputs only the matching records from both tables
2. Left Outer Join: Matching records + all remaining records from the left
table, with null values for the right table's columns
3. Right Outer Join: Matching records + all remaining records from the right
table, with null values for the left table's columns
4. Full Outer Join: Left Outer Join + Right Outer Join
-->When we perform an INNER JOIN with the small table on the left, and
that table fits into memory, Hive can run it as a map-side join: the
results come directly from the mappers and no reducer is involved.
-->Right Outer Join: If the big table is on the right, we get all matching
records plus all remaining right-table records with null values on the
left. Every mapper holds a full copy of the small left table, so for each
right-table record in its partition it knows for certain whether a match
exists. A right outer join can therefore be a MAP SIDE JOIN when the
right table is the big one.
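-->Full Outer Join: Left outer join (NO) + Right outer join (YES) ==> NO,
so a full outer join can't be a map-side join. With the small table on the
left, a left outer join fails because each mapper sees only its own part
of the big table and cannot tell whether a small-table record is
unmatched everywhere.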
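The reasoning above can be sketched in plain Python (not Hive itself). This is a minimal illustration with made-up table contents: the small table is copied to every "mapper", and the big table is split between two mappers. The names small, big_splits and map_join are hypothetical.

```python
# Sketch of a map-side (broadcast hash) join; all data here is invented.
small = {1: "A", 2: "B", 5: "E"}   # small table: key -> value, copied to every mapper
big_splits = [                      # big table, split across two mappers
    [(1, "x"), (3, "y")],
    [(2, "z"), (4, "w")],
]

def map_join(split, small_table, keep_unmatched_big):
    """Join one split of the big table against the in-memory small table."""
    out = []
    for key, big_val in split:
        if key in small_table:
            out.append((key, small_table[key], big_val))
        elif keep_unmatched_big:
            # Outer join that preserves the big table: the mapper holds the
            # whole small table, so it knows for certain there is no match.
            out.append((key, None, big_val))
    return out

inner = [row for s in big_splits for row in map_join(s, small, False)]
big_outer = [row for s in big_splits for row in map_join(s, small, True)]

# Preserving the small table's unmatched rows cannot be decided per mapper:
# each mapper only knows which small keys were missing from its own split.
missing_per_mapper = [set(small) - {k for k, _ in s} for s in big_splits]
```

Here the first mapper would wrongly report key 2 as unmatched and the second would wrongly report key 1, which is why the outer join that preserves the small table needs a reducer.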
1. Perform an inner join and observe that it invokes a map-side join
without using reducers
2. Perform a left / full outer join and observe that it invokes reducers as
well (small table on left)
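-->Left join when the large table is on the left: it now works as a
map-side join. Refer above for the explanation. The query uses the hint
/*+ MAPJOIN(c) */ with the customers table (alias c) on the left. Note that
in Hive the table named in the MAPJOIN hint is the one loaded into memory,
so the hint should name the small table, not the large one.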
TABLE 01:
0 - 4, 8, 12, 16, 20
1 - 1, 5, 9, 13, 17
2 - 2, 6, 10, 14
3 - 3, 7, 11, 15, 19
TABLE 02:
0 - 8, 16, 24, 32
1 - 1, 9, 17, 25
2 - 2, 10, 18, 26
3 - 3, 11, 19, 27
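-->If the bucket counts of the two tables are not integral multiples of
each other, we can't predict which buckets of the other table we have to
check for a given key
-->For example, id=3 can land in bucket 1 of table1 but bucket 0 of table2
(for instance with 2 and 3 buckets: 3 mod 2 = 1, 3 mod 3 = 0)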
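A small Python sketch of the bucket arithmetic, assuming integer keys are assigned to buckets by key modulo the bucket count (a simplification of Hive's hashing). The helper names bucket_of and candidate_buckets are hypothetical.

```python
def bucket_of(key, num_buckets):
    # Assumption: an integer key goes to bucket (key mod number of buckets).
    return key % num_buckets

def candidate_buckets(bucket, n_from, n_to):
    """Buckets of the n_to-bucket table that can hold keys found in `bucket`
    of the n_from-bucket table. Returns None when the bucket counts are not
    integral multiples, because then the keys can be in any bucket."""
    if n_to % n_from == 0:
        return [b for b in range(n_to) if b % n_from == bucket]
    if n_from % n_to == 0:
        return [bucket % n_to]
    return None

# 4 vs 8 buckets: keys from bucket 1 can only be in buckets 1 and 5.
multiples = candidate_buckets(1, 4, 8)
# 2 vs 3 buckets: no prediction is possible.
not_multiples = candidate_buckets(1, 2, 3)
```

With 2 and 3 buckets, the keys 1, 3 and 5 all sit in bucket 1 of the first table but fall into buckets 1, 0 and 2 of the second, so every bucket would have to be checked.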
-->The main difference between a map-side join and a bucket map join: in a
map-side join the hash table of the whole small table is loaded on every
node, whereas in a bucket map join only 1 bucket at a time is loaded into
memory, because we are dealing with 2 big tables.
Steps to follow to perform a bucket map join:
1. Bucket both tables on the join column
2. Set the bucketing properties in Hive to true:
set hive.enforce.bucketing=true;
set hive.optimize.bucketmapjoin=true;
-->Here a 1-to-1 mapping of buckets happens, because both tables have the
same number of buckets and the buckets are also sorted.
-->For example, mapper 1 works on bucket 0 of table1 and bucket 0 of table2
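The bucket-by-bucket idea can be sketched in Python. This is only an illustration under stated assumptions: both tables use the same (assumed) bucket count, keys are bucketed by key modulo that count, keys in table2 are unique, and all data is invented. The names to_buckets and join_bucket are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed: both tables bucketed on id into the same number of buckets

def to_buckets(rows, n=NUM_BUCKETS):
    """Group (key, value) rows by bucket number (key mod n)."""
    buckets = defaultdict(list)
    for key, val in rows:
        buckets[key % n].append((key, val))
    return buckets

table1 = to_buckets([(1, "a"), (4, "b"), (5, "c"), (7, "d")])
table2 = to_buckets([(1, "p"), (5, "q"), (6, "r"), (7, "s")])

def join_bucket(i):
    """Work of one mapper: hash only bucket i of table2, stream bucket i of table1."""
    lookup = dict(table2[i])  # only this one bucket is held in memory
    return [(k, v, lookup[k]) for k, v in table1[i] if k in lookup]

result = [row for i in range(NUM_BUCKETS) for row in join_bucket(i)]
```

Each call to join_bucket touches exactly one bucket from each table, so no mapper ever needs a whole table in memory.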
https://www.youtube.com/watch?v=TzsrO4zTQj8
Window functions operate on a set of rows and return a single value for
each row from the underlying query. The term window describes the set of
rows on which the function operates.
OVER CLAUSE:
-->The OVER clause partitions or orders the rows before the window
function is applied to them. It defines the window (the user-defined set
of rows) over which the window function is applied, and the function
computes a value for each row in that window.
Practiced on: https://www.hackerrank.com/challenges/average-population/problem
HackerRank doesn't support this operation, so no result set could be obtained
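To make the "one value per row" behaviour concrete, here is a plain Python sketch that emulates an average computed over a window partitioned by a key. The rows and the function name avg_over_partition are invented for illustration; this is not Hive code.

```python
from collections import defaultdict

rows = [("asia", 10), ("asia", 30), ("europe", 20)]  # (partition key, value)

def avg_over_partition(rows):
    """Emulates AVG(value) OVER (PARTITION BY key): one result per input row."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Unlike GROUP BY, every input row is kept and gets its window's average.
    return [(key, value, sum(groups[key]) / len(groups[key])) for key, value in rows]

windowed = avg_over_partition(rows)
```

The output has as many rows as the input, which is the key difference from a plain aggregate with GROUP BY.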
-->ROW_NUMBER() returns the row number of each record in the table
-->It needs to be combined with an ORDER BY clause
-->When a PARTITION BY clause is used, the row number restarts at 1 with
each new partition
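A rough Python emulation of row numbering with partitioning and ordering, using made-up employee rows (department, name, salary). The function name row_number and its parameters are hypothetical.

```python
from itertools import groupby

rows = [("sales", "amy", 300), ("sales", "bob", 500),
        ("hr", "cat", 400), ("hr", "dan", 200)]

def row_number(rows, partition_key, order_key):
    """Emulates ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)."""
    out = []
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, group in groupby(ordered, key=partition_key):
        for n, row in enumerate(group, start=1):  # numbering restarts at 1 per partition
            out.append(row + (n,))
    return out

# Partition by department, order by salary descending.
numbered = row_number(rows, partition_key=lambda r: r[0], order_key=lambda r: -r[2])
```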
RANKING FUNCTIONS:
How to rank the data?
-->Ranking functions rank the records based on the given ordering.
-->An ORDER BY clause is required
Rank():
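-->Tied records get the same rank, and the ranks that follow are skipped
(e.g. 1, 1, 3)
Dense_rank():
-->It doesn't skip any rank: tied records share a rank and the next record
gets the next consecutive rank (e.g. 1, 1, 2)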
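The difference between the two functions can be sketched in Python on a small list of invented scores, ordered descending. The function name rank_and_dense_rank is hypothetical.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values ordered descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that equals no real value
    for position, v in enumerate(ordered, start=1):
        if v != prev:
            rank = position  # RANK jumps to the row position after ties
            dense += 1       # DENSE_RANK just moves to the next integer
            prev = v
        out.append((v, rank, dense))
    return out

ranked = rank_and_dense_rank([90, 80, 90, 70])
```

The two 90s share rank 1 in both schemes. The 80 then gets rank 3 under Rank() but rank 2 under Dense_rank().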
ANALYTICAL FUNCTIONS:
-->They are used to perform analytical operations on the data
Lead() Function:
-->It is used to access data from a subsequent (later) row along with the
current row's data
Lead(column_name, offset_value, default_value)
Column_name: the column whose later value you want to read
Offset value: how many rows ahead to look
Default value: the value to return when there is no row that far ahead
Lag() Function:
-->It is used to access data from a preceding (earlier) row along with the
current row's data
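Both functions can be emulated in a few lines of Python over an already ordered column. This is a sketch with invented values; the helper names lead and lag mirror the SQL functions but are plain Python here, and a non-negative offset is assumed.

```python
def lead(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` past the end."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]

def lag(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` before the start."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

sales = [100, 120, 90, 150]      # assumed to be already in ORDER BY order
next_sales = lead(sales, 1, 0)   # each row paired with the following row's value
prev_sales = lag(sales, 1, 0)    # each row paired with the previous row's value
```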