Spark SQL String Functions
The table below lists the Spark SQL string functions with their Scala signatures and a
description of each.
STRING FUNCTION SIGNATURE
    STRING FUNCTION DESCRIPTION

ascii(e: Column): Column
    Computes the numeric value of the first character of the string column, and returns the result as an int column.
base64(e: Column): Column
    Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.
concat_ws(sep: String, exprs: Column*): Column
    Concatenates multiple input string columns together into a single string column, using the given separator.
decode(value: Column, charset: String): Column
    Decodes the first argument from binary into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
encode(value: Column, charset: String): Column
    Encodes the first argument from a string into binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
format_number(x: Column, d: Int): Column
    Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
format_string(format: String, arguments: Column*): Column
    Formats the arguments in printf-style and returns the result as a string column.
initcap(e: Column): Column
    Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, "hello world" will become "Hello World".
instr(str: Column, substring: String): Column
    Locates the position of the first occurrence of substring in the given string column. Returns null if either of the arguments is null.
length(e: Column): Column
    Computes the character length of a given string or the number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros.
lower(e: Column): Column
    Converts a string column to lower case.
levenshtein(l: Column, r: Column): Column
    Computes the Levenshtein distance of the two given string columns.
locate(substr: String, str: Column): Column
    Locates the position of the first occurrence of substr.
locate(substr: String, str: Column, pos: Int): Column
    Locates the position of the first occurrence of substr in a string column, after position pos.
lpad(str: Column, len: Int, pad: String): Column
    Left-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
ltrim(e: Column): Column
    Trims the spaces from the left end of the specified string value.
regexp_extract(e: Column, exp: String, groupIdx: Int): Column
    Extracts a specific group matched by a Java regex from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
regexp_replace(e: Column, pattern: String, replacement: String): Column
    Replaces all substrings of the specified string value that match pattern with replacement.
regexp_replace(e: Column, pattern: Column, replacement: Column): Column
    Replaces all substrings of the specified string value that match pattern with replacement.
unbase64(e: Column): Column
    Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
rpad(str: Column, len: Int, pad: String): Column
    Right-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
repeat(str: Column, n: Int): Column
    Repeats a string column n times, and returns it as a new string column.
rtrim(e: Column): Column
    Trims the spaces from the right end of the specified string value.
rtrim(e: Column, trimString: String): Column
    Trims the specified character string from the right end of the specified string column.
soundex(e: Column): Column
    Returns the soundex code for the specified expression.
split(str: Column, regex: String): Column
    Splits str around matches of the given regex.
split(str: Column, regex: String, limit: Int): Column
    Splits str around matches of the given regex. When limit is greater than 0, the resulting array has at most limit elements.
substring(str: Column, pos: Int, len: Int): Column
    Returns the substring that starts at `pos` and is of length `len` when str is String type, or the slice of the byte array that starts at `pos` (in bytes) and is of length `len` when str is Binary type.
substring_index(str: Column, delim: String, count: Int): Column
    Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
overlay(src: Column, replace: Column, pos: Column, len: Column): Column
    Overlays the specified portion of `src` with `replace`, starting from byte position `pos` of `src` and proceeding for `len` bytes.
overlay(src: Column, replace: Column, pos: Column): Column
    Overlays the specified portion of `src` with `replace`, starting from byte position `pos` of `src`.
translate(src: Column, matchingString: String, replaceString: String): Column
    Translates any character in src that matches a character in `matchingString` to the corresponding character in replaceString. The characters in replaceString correspond positionally to the characters in matchingString.
trim(e: Column): Column
    Trims the spaces from both ends of the specified string column.
trim(e: Column, trimString: String): Column
    Trims the specified character from both ends of the specified string column.
upper(e: Column): Column
    Converts a string column to upper case.
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExample")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
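To make a few of the string functions above concrete, here is a minimal sketch. The DataFrame, its column names, the sample row and the app name are made up for illustration; only the function calls come from the table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; the app name and sample data are hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("StringFunctionsSketch")
  .getOrCreate()
import spark.implicits._

val people = Seq(("james", "smith", "www.apache.org", "id-1042"))
  .toDF("first", "last", "host", "code")

val result = people.select(
  // Capitalize the first name, upper-case the last name, join with a space
  concat_ws(" ", initcap(col("first")), upper(col("last"))).as("full_name"),
  // Positive count: everything left of the 2nd "." counting from the left
  substring_index(col("host"), ".", 2).as("left_part"),
  // Negative count: everything right of the 2nd "." counting from the right
  substring_index(col("host"), ".", -2).as("right_part"),
  // First capture group of the regex
  regexp_extract(col("code"), "(\\d+)", 1).as("digits"),
  // Pad to 8 characters; a shorter target length truncates instead
  lpad(col("first"), 8, "*").as("padded"),
  lpad(col("first"), 3, "*").as("shortened"),
  // Character-by-character mapping: a -> 4, e -> 3
  translate(col("first"), "ae", "43").as("translated")
)
result.show(false)
```

For the sample row this gives "James SMITH", "www.apache", "apache.org", "1042", "***james", "jam" and "j4m3s" respectively.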
Spark SQL Date Functions
The table below lists the Spark SQL date functions with their Scala signatures and a
description of each.
DATE FUNCTION SIGNATURE
    DATE FUNCTION DESCRIPTION

current_date(): Column
    Returns the current date as a date column.
date_format(dateExpr: Column, format: String): Column
    Converts a date/timestamp/string to a string value in the format specified by the second argument.
to_date(e: Column): Column
    Converts the column into `DateType` by casting rules to `DateType`.
to_date(e: Column, fmt: String): Column
    Converts the column into a `DateType` with a specified format.
add_months(startDate: Column, numMonths: Int): Column
    Returns the date that is `numMonths` after `startDate`.
date_add(start: Column, days: Int): Column
    Returns the date that is `days` days after `start`.
date_sub(start: Column, days: Int): Column
    Returns the date that is `days` days before `start`.
datediff(end: Column, start: Column): Column
    Returns the number of days from `start` to `end`.
months_between(end: Column, start: Column): Column
    Returns number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end: Column, start: Column, roundOff: Boolean): Column
    Returns number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
next_day(date: Column, dayOfWeek: String): Column
    Returns the first date which is later than the value of the `date` column that is on the specified day of the week. For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column
    Returns date truncated to the unit specified by the format. For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01.
    format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.
date_trunc(format: String, timestamp: Column): Column
    Returns timestamp truncated to the unit specified by the format. For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00.
    format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day. Other options are: 'second', 'minute', 'hour', 'week', 'quarter'.
year(e: Column): Column
    Extracts the year as an integer from a given date/timestamp/string.
quarter(e: Column): Column
    Extracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): Column
    Extracts the month as an integer from a given date/timestamp/string.
dayofweek(e: Column): Column
    Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.
dayofmonth(e: Column): Column
    Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column
    Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): Column
    Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
last_day(e: Column): Column
    Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
from_unixtime(ut: Column): Column
    Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column
    Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp(): Column
    Returns the current Unix timestamp (in seconds) as a long.
unix_timestamp(s: Column): Column
    Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): Column
    Converts time string with given pattern to Unix timestamp (in seconds).
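Some of the functions in the table (quarter, last_day, date_trunc) are not covered by the examples later in this post, so here is a small sketch of them. The input value is the one used in the table descriptions; the DataFrame, column names and app name are assumptions made for illustration. Results are cast to string only to make them easy to compare.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; app name and column names are hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("DateFunctionsSketch")
  .getOrCreate()
import spark.implicits._

val df = Seq("2018-11-19 12:01:19")
  .toDF("input")
  .withColumn("ts", to_timestamp(col("input"))) // parse the string explicitly

val result = df.select(
  quarter(col("ts")).as("quarter"),
  last_day(col("ts")).cast("string").as("last_day"),
  trunc(col("ts"), "year").cast("string").as("trunc_year"),          // returns a date
  date_trunc("month", col("ts")).cast("string").as("trunc_month_ts") // returns a timestamp
)
result.show(false)
```

For this input, quarter is 4, last_day is 2018-11-30, trunc by year is 2018-01-01 and date_trunc by month is 2018-11-01 00:00:00. Note the difference in argument order: trunc takes the column first, date_trunc takes the format first.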
Spark SQL Timestamp Functions
Below are some of the Spark SQL timestamp functions. These functions operate on
both date and timestamp values.
The default format of the Spark Timestamp is yyyy-MM-dd HH:mm:ss.SSSS
TIMESTAMP FUNCTION SIGNATURE
    TIMESTAMP FUNCTION DESCRIPTION

current_timestamp(): Column
    Returns the current timestamp as a timestamp column.
hour(e: Column): Column
    Extracts the hours as an integer from a given date/timestamp/string.
minute(e: Column): Column
    Extracts the minutes as an integer from a given date/timestamp/string.
second(e: Column): Column
    Extracts the seconds as an integer from a given date/timestamp/string.
to_timestamp(s: Column): Column
    Converts to a timestamp by casting rules to `TimestampType`.
to_timestamp(s: Column, fmt: String): Column
    Converts time string with the given pattern to timestamp.
Spark Date and Timestamp Window Functions
Below are the Date and Timestamp window functions.
DATE & TIME WINDOW FUNCTION SYNTAX
    DATE & TIME WINDOW FUNCTION DESCRIPTION

window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
    Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
    Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
window(timeColumn: Column, windowDuration: String): Column
    Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
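The inclusive-start, exclusive-end behaviour of window() is easiest to see with a tumbling window. The sketch below uses made-up events (the column names ts and value, the sample rows and the app name are all assumptions) and sums value per 5-minute window.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; app name and sample events are hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("WindowSketch")
  .getOrCreate()
import spark.implicits._

val events = Seq(
  ("2019-11-16 12:03:10", 1),
  ("2019-11-16 12:05:00", 2), // exactly on a boundary: belongs to [12:05, 12:10)
  ("2019-11-16 12:09:59", 3)
).toDF("ts", "value")
  .withColumn("ts", to_timestamp(col("ts")))

// Tumbling 5-minute windows; the grouping column is a struct named "window"
// with "start" and "end" fields.
val totals = events
  .groupBy(window(col("ts"), "5 minutes"))
  .agg(sum("value").as("total"))
  .orderBy(col("window.start"))
  .select(
    date_format(col("window.start"), "HH:mm").as("start"),
    date_format(col("window.end"), "HH:mm").as("end"),
    col("total")
  )
totals.show(false)
```

The 12:05:00 event lands in the window starting at 12:05, so the totals are 1 for [12:00, 12:05) and 5 for [12:05, 12:10).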
Spark Date Functions Examples
Below are examples of the most commonly used Date Functions.
current_date() and date_format()
We will see how to get the current date and convert a date into a specific date format
using date_format(). The example below parses the date and converts it from
‘yyyy-MM-dd’ to ‘MM-dd-yyyy’ format.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"))
  .toDF("Input")
  .select(
    current_date().as("current_date"),
    col("Input"),
    date_format(col("Input"), "MM-dd-yyyy").as("format")
  ).show()
+------------+----------+----------+
|current_date|     Input|    format|
+------------+----------+----------+
|  2019-07-23|2019-01-23|01-23-2019|
+------------+----------+----------+
to_date()
The example below converts a string in the date format ‘MM/dd/yyyy’ to a DateType
(displayed as ‘yyyy-MM-dd’) using to_date().
import org.apache.spark.sql.functions._
Seq(("04/13/2019"))
  .toDF("Input")
  .select(
    col("Input"),
    to_date(col("Input"), "MM/dd/yyyy").as("to_date")
  ).show()
+----------+----------+
|     Input|   to_date|
+----------+----------+
|04/13/2019|2019-04-13|
+----------+----------+
datediff()
The example below returns the difference in days between two dates using datediff().
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select(
    col("input"),
    current_date(),
    datediff(current_date(), col("input")).as("diff")
  ).show()
+----------+--------------+----+
|     input|current_date()|diff|
+----------+--------------+----+
|2019-01-23|    2019-07-23| 181|
|2019-06-24|    2019-07-23|  29|
|2019-09-20|    2019-07-23| -59|
+----------+--------------+----+
months_between()
The example below returns the number of months between two dates
using months_between().
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("date")
  .select(
    col("date"),
    current_date(),
    datediff(current_date(), col("date")).as("datediff"),
    months_between(current_date(), col("date")).as("months_between")
  ).show()
+----------+--------------+--------+--------------+
|      date|current_date()|datediff|months_between|
+----------+--------------+--------+--------------+
|2019-01-23|    2019-07-23|     181|           6.0|
|2019-06-24|    2019-07-23|      29|    0.96774194|
|2019-09-20|    2019-07-23|     -59|   -1.90322581|
+----------+--------------+--------+--------------+
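The fractional values above follow from the "31 days per month" rule in the months_between() description. The plain Scala sketch below models that rule for date-only inputs so the arithmetic can be checked by hand. It is not Spark's implementation, and monthsBetweenSketch is a hypothetical helper name.

```scala
import java.time.LocalDate

// Hypothetical helper modelling the documented rule for date-only inputs:
// whole months when the days of month match (or both are month ends),
// otherwise whole months plus the day difference divided by 31,
// rounded to 8 digits as with roundOff = true.
def monthsBetweenSketch(end: LocalDate, start: LocalDate): Double = {
  val wholeMonths =
    (end.getYear - start.getYear) * 12 + (end.getMonthValue - start.getMonthValue)
  val bothLastDay =
    end.getDayOfMonth == end.lengthOfMonth && start.getDayOfMonth == start.lengthOfMonth
  if (end.getDayOfMonth == start.getDayOfMonth || bothLastDay) {
    wholeMonths.toDouble
  } else {
    val raw = wholeMonths + (end.getDayOfMonth - start.getDayOfMonth) / 31.0
    BigDecimal(raw).setScale(8, BigDecimal.RoundingMode.HALF_UP).toDouble
  }
}

val current = LocalDate.parse("2019-07-23")
println(monthsBetweenSketch(current, LocalDate.parse("2019-01-23"))) // same day of month
println(monthsBetweenSketch(current, LocalDate.parse("2019-06-24"))) // 1 + (23 - 24) / 31
println(monthsBetweenSketch(current, LocalDate.parse("2019-09-20"))) // -2 + (23 - 20) / 31
```

With 2019-07-23 as the current date, this reproduces the 6.0, 0.96774194 and -1.90322581 shown in the table.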
trunc()
The example below truncates a date to a specified unit using trunc().
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select(
    col("input"),
    trunc(col("input"), "Month").as("Month_Trunc"),
    trunc(col("input"), "Year").as("Year_Trunc")
  ).show()
+----------+-----------+----------+
|     input|Month_Trunc|Year_Trunc|
+----------+-----------+----------+
|2019-01-23| 2019-01-01|2019-01-01|
|2019-06-24| 2019-06-01|2019-01-01|
|2019-09-20| 2019-09-01|2019-01-01|
+----------+-----------+----------+
add_months(), date_add(), date_sub()
Here we add and subtract months and days from a given input date.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select(
    col("input"),
    add_months(col("input"), 3).as("add_months"),
    add_months(col("input"), -3).as("sub_months"),
    date_add(col("input"), 4).as("date_add"),
    date_sub(col("input"), 4).as("date_sub")
  ).show()
+----------+----------+----------+----------+----------+
|     input|add_months|sub_months|  date_add|  date_sub|
+----------+----------+----------+----------+----------+
|2019-01-23|2019-04-23|2018-10-23|2019-01-27|2019-01-19|
|2019-06-24|2019-09-24|2019-03-24|2019-06-28|2019-06-20|
|2019-09-20|2019-12-20|2019-06-20|2019-09-24|2019-09-16|
+----------+----------+----------+----------+----------+
year(), month(),
dayofweek(), dayofmonth(), dayofyear(),
next_day(), weekofyear()
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select(
    col("input"),
    year(col("input")).as("year"),
    month(col("input")).as("month"),
    dayofweek(col("input")).as("dayofweek"),
    dayofmonth(col("input")).as("dayofmonth"),
    dayofyear(col("input")).as("dayofyear"),
    next_day(col("input"), "Sunday").as("next_day"),
    weekofyear(col("input")).as("weekofyear")
  ).show()
+----------+----+-----+---------+----------+---------+----------+----------+
|     input|year|month|dayofweek|dayofmonth|dayofyear|  next_day|weekofyear|
+----------+----+-----+---------+----------+---------+----------+----------+
|2019-01-23|2019|    1|        4|        23|       23|2019-01-27|         4|
|2019-06-24|2019|    6|        2|        24|      175|2019-06-30|        26|
|2019-09-20|2019|    9|        6|        20|      263|2019-09-22|        38|
+----------+----+-----+---------+----------+---------+----------+----------+
Spark Timestamp Functions Examples
Below are examples of the most commonly used Timestamp Functions.
current_timestamp()
Returns the current timestamp as a timestamp column; the example shows it next to current_date().
import org.apache.spark.sql.functions._
val df = Seq((1)).toDF("seq")
val curDate = df
  .withColumn("current_date", current_date().as("current_date"))
  .withColumn("current_timestamp", current_timestamp().as("current_timestamp"))
curDate.show(false)
This yields the output below.
+---+------------+-----------------------+
|seq|current_date|current_timestamp      |
+---+------------+-----------------------+
|1  |2019-11-16  |2019-11-16 21:00:55.349|
+---+------------+-----------------------+
to_timestamp()
Converts a timestamp string to the Timestamp type.
import org.apache.spark.sql.functions._
val dfDate = Seq(("07-01-2019 12 01 19 406"),
  ("06-24-2019 12 01 19 406"),
  ("11-16-2019 16 44 55 406"),
  ("11-16-2019 16 50 59 406")).toDF("input_timestamp")
dfDate.withColumn("datetype_timestamp",
    to_timestamp(col("input_timestamp"), "MM-dd-yyyy HH mm ss SSS"))
  .show(false)
This yields the output below.
+-----------------------+-------------------+
|input_timestamp        |datetype_timestamp |
+-----------------------+-------------------+
|07-01-2019 12 01 19 406|2019-07-01 12:01:19|
|06-24-2019 12 01 19 406|2019-06-24 12:01:19|
|11-16-2019 16 44 55 406|2019-11-16 16:44:55|
|11-16-2019 16 50 59 406|2019-11-16 16:50:59|
+-----------------------+-------------------+
hour(), minute() and second()
import org.apache.spark.sql.functions._
val df = Seq(("2019-07-01 12:01:19.000"),
  ("2019-06-24 12:01:19.000"),
  ("2019-11-16 16:44:55.406"),
  ("2019-11-16 16:50:59.406")).toDF("input_timestamp")
df.withColumn("hour", hour(col("input_timestamp")))
  .withColumn("minute", minute(col("input_timestamp")))
  .withColumn("second", second(col("input_timestamp")))
  .show(false)
This yields the output below.
+-----------------------+----+------+------+
|input_timestamp        |hour|minute|second|
+-----------------------+----+------+------+
|2019-07-01 12:01:19.000|12  |1     |19    |
|2019-06-24 12:01:19.000|12  |1     |19    |
|2019-11-16 16:44:55.406|16  |44    |55    |
|2019-11-16 16:50:59.406|16  |50    |59    |
+-----------------------+----+------+------+
Conclusion:
In this post, I’ve consolidated the complete list of Spark Date and Timestamp
Functions, with a description of each and examples of some commonly used ones.
Happy Learning !!
Spark SQL Array Functions Complete List
Post author: Naveen (NNK)
Post category: Apache Spark / Spark SQL Functions
Post last modified: February 14, 2023
Spark SQL provides built-in standard array functions defined in the DataFrame API;
these come in handy when we need to operate on an array (ArrayType) column. All of
them accept an array column as input, plus several other arguments depending on the
function.
When possible, try to leverage the standard library functions, as they offer a little
more compile-time safety, handle nulls, and perform better than UDFs. If your
application is performance-critical, try to avoid custom UDFs at all costs, as they
come with no performance guarantees.
Spark SQL array functions are grouped as collection functions (“collection_funcs”) in
Spark SQL, along with several map functions.
Though I’ve explained these here with Scala, similar methods can be used to work with
Spark SQL array functions in PySpark, and if time permits I will cover that in the
future. If you are looking for PySpark, I would still recommend reading through this
article, as it gives you an idea of the Spark array functions and their usage.
Spark SQL Array Functions:
ARRAY FUNCTION SYNTAX
    ARRAY FUNCTION DESCRIPTION

array_contains(column: Column, value: Any)
    Checks if a value is present in an array column. Returns one of the values below.
    true - when the value is present in the array.
    false - when the value is not present.
    null - when the array is null.
array_distinct(e: Column)
    Returns the distinct values from the array after removing duplicates.
array_except(col1: Column, col2: Column)
    Returns all elements from the col1 array that are not in the col2 array.
array_intersect(col1: Column, col2: Column)
    Returns all elements that are present in both the col1 and col2 arrays.
array_join(column: Column, delimiter: String, nullReplacement: String)
array_join(column: Column, delimiter: String)
    Concatenates all elements of the array column using the provided delimiter. When null values are present, they are replaced with the 'nullReplacement' string.
array_max(e: Column)
    Returns the maximum value in an array.
array_min(e: Column)
    Returns the minimum value in an array.
array_position(column: Column, value: Any)
    Returns the position/index of the first occurrence of 'value' in the given array. The position is returned as a long and is not zero-based; it starts at 1.
    Returns zero when the value is not found.
    Returns null when any of the arguments are null.
array_remove(column: Column, element: Any)
    Returns an array after removing all occurrences of the provided 'element' from the given array.
array_repeat(e: Column, count: Int)
    Creates an array containing the first argument repeated the number of times given by the second argument.
array_repeat(left: Column, right: Column)
    Creates an array containing the first argument repeated the number of times given by the second argument.
array_sort(e: Column)
    Returns the sorted array of the given input array. All null values are placed at the end of the array.
array_union(col1: Column, col2: Column)
    Returns an array of the elements present in either array (all elements from both arrays), without duplicates.
arrays_overlap(a1: Column, a2: Column)
    true - if `a1` and `a2` have at least one non-null element in common.
    false - if `a1` and `a2` have completely different elements.
    null - if both the arrays are non-empty and any of them contains a `null`.
arrays_zip(e: Column*)
    Returns a merged array of structs in which the N-th struct contains all N-th values of the input arrays.
concat(exprs: Column*)
    Concatenates all elements from the given columns.
element_at(column: Column, value: Any)
    Returns the element of an array located at the 'value' input position.
exists(column: Column, f: Column => Column)
    Checks whether a predicate holds for at least one element in the array column.
explode(e: Column)
    Creates a row for each element in the array column.
explode_outer(e: Column)
    Creates a row for each element in the array column. Unlike explode, if the array is null or empty, it returns null.
filter(column: Column, f: Column => Column)
filter(column: Column, f: (Column, Column) => Column)
    Returns an array of the elements for which a predicate holds in a given array.
flatten(e: Column)
    Creates a single array from an array of arrays column.
forall(column: Column, f: Column => Column)
    Returns whether a predicate holds for every element in the array.
posexplode(e: Column)
    Creates a row for each element in the array and creates two columns: 'pos' to hold the position of the array element and 'col' to hold the actual array value.
posexplode_outer(e: Column)
    Creates a row for each element in the array and creates two columns: 'pos' to hold the position of the array element and 'col' to hold the actual array value. Unlike posexplode, if the array is null or empty, it returns null,null for the pos and col columns.
reverse(e: Column)
    Returns the array of elements in reverse order.
sequence(start: Column, stop: Column)
    Generates the sequence of numbers from start to stop.
sequence(start: Column, stop: Column, step: Column)
    Generates the sequence of numbers from start to stop, incrementing by the given step value.
shuffle(e: Column)
    Shuffles the given array.
size(e: Column)
    Returns the length of an array.
slice(x: Column, start: Int, length: Int)
    Returns an array of elements starting from position 'start' with the given length.
sort_array(e: Column)
    Sorts the array in ascending order. Null values are placed at the beginning.
sort_array(e: Column, asc: Boolean)
    Sorts the array in ascending or descending order based on the boolean parameter. For ascending order, null values are placed at the beginning; for descending order, they are placed at the end.
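transform(column: Column, f: Column => Column)
transform(column: Column, f: (Column, Column) => Column)
    Returns an array of elements after applying the transformation to each element.
zip_with(left: Column, right: Column, f: (Column, Column) => Column)
    Merges two input arrays, element-wise, into a single array using the function f.
aggregate(expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column)
    Aggregates the array elements into a single value: applies merge to the initial state zero and each element in turn, then converts the final state with finish.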
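Here is a minimal sketch that exercises several of the functions from the table on a single row. The DataFrame, its column names (letters, nums), the sample values and the app name are made up for illustration. The lambda-based functions (transform, filter, aggregate) are assumed to be available in the Scala API, which is the case from Spark 3.0 onwards.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; app name and sample data are hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ArrayFunctionsSketch")
  .getOrCreate()
import spark.implicits._

val df = Seq((Seq("b", "a", "c", "a"), Seq(1, 2, 3, 4))).toDF("letters", "nums")

val result = df.select(
  array_contains(col("letters"), "a").as("has_a"),
  array_position(col("letters"), "c").as("pos_c"),  // 1-based position
  array_position(col("letters"), "z").as("pos_z"),  // 0 when not found
  sort_array(col("letters"), false).as("sorted_desc"),
  array_join(col("letters"), "-").as("joined"),
  size(col("letters")).as("size"),
  transform(col("nums"), x => x * 2).as("doubled"),
  filter(col("nums"), x => x % 2 === 0).as("evens"),
  aggregate(col("nums"), lit(0), (acc, x) => acc + x).as("total")
)
result.show(false)
```

For this row, pos_c is 3, pos_z is 0, sorted_desc is [c, b, a, a], joined is "b-a-c-a", doubled is [2, 4, 6, 8], evens is [2, 4] and total is 10.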
Array function Examples
Before we start, let’s create a DataFrame with some sample data to work with.
root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- location: string (nullable = true)
+-----+---------+-----------+
|id   |dept     |properties |
+-----+---------+-----------+
|36636|Finance  |[3000, USA]|
|40288|Finance  |[5000, IND]|
|42114|Sales    |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales    |[6500, USA]|
+-----+---------+-----------+
map() – Spark SQL map functions
root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- propertiesMap: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
+-----+---------+---------------------------------+
|id   |dept     |propertiesMap                    |
+-----+---------+---------------------------------+
|36636|Finance  |[salary -> 3000, location -> USA]|
|40288|Finance  |[salary -> 5000, location -> IND]|
|42114|Sales    |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales    |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
map_keys() – Returns map keys from a Spark SQL DataFrame
df.select(col("id"),map_keys(col("propertiesMap"))).show(false)
Outputs all map keys from a Spark DataFrame
+-----+-----------------------+
|id   |map_keys(propertiesMap)|
+-----+-----------------------+
|36636|[salary, location]     |
|40288|[salary, location]     |
|42114|[salary, location]     |
|39192|[salary, location]     |
|34534|[salary, location]     |
+-----+-----------------------+
map_values() – Returns map values from a Spark DataFrame
df.select(col("id"),map_values(col("propertiesMap")))
.show(false)
This outputs the following.
+-----+-------------------------+
|id   |map_values(propertiesMap)|
+-----+-------------------------+
|36636|[3000, USA]              |
|40288|[5000, IND]              |
|42114|[3900, USA]              |
|39192|[2500, CAN]              |
|34534|[6500, USA]              |
+-----+-------------------------+
map_concat() – Concatenating two or more maps on DataFrame
+-------+---------------------------------------------+
|name   |mapConcat                                    |
+-------+---------------------------------------------+
|James  |[hair -> black, eye -> brown, height -> 5.9] |
|Michael|[hair -> brown, eye -> black, height -> 6]   |
|Robert |[hair -> red, eye -> gray, height -> 6.3]    |
|Maria  |[hair -> blond, eye -> red, height -> 5.6]   |
|Jen    |[white -> black, eye -> black, height -> 5.2]|
+-------+---------------------------------------------+
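The output above comes from the concatDF DataFrame built in the complete example at the end of this article. As a smaller, self-contained sketch of map_concat(), the snippet below merges two map columns on a single made-up row; the DataFrame, sample values and app name are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, map_concat}

// Local session for the sketch; app name and sample row are hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("MapConcatSketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(("James", Map("hair" -> "black"), Map("height" -> "5.9")))
  .toDF("name", "properties", "secondProp")

// Merge the entries of both map columns into one map column
val merged = df.select(
  col("name"),
  map_concat(col("properties"), col("secondProp")).as("mapConcat")
)
merged.show(false)
```

The keys of the two maps are kept distinct here on purpose, since how duplicate keys are handled can depend on the Spark version and configuration.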
map_from_entries() – convert array of StructType entries to map
Use the map_from_entries() SQL function to convert an array of StructType entries to a
map (MapType) on a Spark DataFrame. This function takes a DataFrame column of type
ArrayType[StructType] as an argument; passing any other type results in an error.
concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
.select("name","mapFromEntries")
.show(false)
Output:
+-------+-------------------------------+
|name   |mapFromEntries                 |
+-------+-------------------------------+
|James  |[Newark -> NY, Brooklyn -> NY] |
|Michael|[SanJose -> CA, Sandiago -> CA]|
|Robert |[LasVegas -> NV]               |
|Maria  |null                           |
|Jen    |[LAX -> CA, Orange -> CA]      |
+-------+-------------------------------+
map_entries() – convert map of StructType to array of StructType
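The complete example below does not call map_entries(), so here is a minimal sketch of it. It assumes a Spark version that exposes map_entries in the Scala functions object (3.0 or later, to the best of my knowledge); the DataFrame, sample row and app name are made up for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, map_entries}

// Local session for the sketch; app name and sample row are hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("MapEntriesSketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(("James", Map("hair" -> "black"))).toDF("name", "properties")

// Each map entry becomes a struct with "key" and "value" fields
val entriesDF = df.select(
  col("name"),
  map_entries(col("properties")).as("entries")
)
entriesDF.printSchema()
entriesDF.show(false)
```

This is the inverse direction of map_from_entries(): the map column becomes an array of key/value structs.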
package com.sparkbyexamples.spark.dataframe.functions.collection
import org.apache.spark.sql.functions.{col, explode, lit, map, map_concat,
map_from_entries, map_keys, map_values}
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StringType,
StructType}
import org.apache.spark.sql.{Column, Row, SparkSession}
import scala.collection.mutable
object MapFunctions extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
import spark.implicits._
val structureData = Seq(
Row("36636","Finance",Row(3000,"USA")),
Row("40288","Finance",Row(5000,"IND")),
Row("42114","Sales",Row(3900,"USA")),
Row("39192","Marketing",Row(2500,"CAN")),
Row("34534","Sales",Row(6500,"USA"))
)
val structureSchema = new StructType()
.add("id",StringType)
.add("dept",StringType)
.add("properties",new StructType()
.add("salary",IntegerType)
.add("location",StringType)
)
var df = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()
df.show(false)
// Convert to Map
val index = df.schema.fi eldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
var columns = mutable.LinkedHashSet[Column]()
propSchema.fi elds.foreach(fi eld =>{
columns.add(lit(fi eld.name))
columns.add(col("properties." + fi eld.name))
})
df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
df = df.drop("properties")
df.printSchema()
df.show(false)
//Retrieve all keys from a Map
val keys =
df.select(explode(map_keys(<pre></pre>quot;propertiesMap"))).as[String].distinct.
collect
print(keys.mkString(","))
// map_keys
df.select(col("id"),map_keys(col("propertiesMap")))
.show(false)
//map_values
df.select(col("id"),map_values(col("propertiesMap")))
.show(false)
//Creating DF with MapType
val arrayStructureData = Seq(
Row("James",List(Row("Newark","NY"),Row("Brooklyn","NY")),
Map("hair"->"black","eye"->"brown"),Map("height"->"5.9")),
Row("Michael",List(Row("SanJose","CA"),Row("Sandiago","CA")),
Map("hair"->"brown","eye"->"black"),Map("height"->"6")),
Row("Robert",List(Row("LasVegas","NV")),
Map("hair"->"red","eye"->"gray"),Map("height"->"6.3")),
Row("Maria",null,Map("hair"->"blond","eye"->"red"),Map("height"->"5.6")),
Row("Jen",List(Row("LAX","CA"),Row("Orange","CA")),
Map("white"->"black","eye"->"black"),Map("height"->"5.2"))
)
val arrayStructureSchema = new StructType()
.add("name",StringType)
.add("addresses", ArrayType(new StructType()
.add("city",StringType)
.add("state",StringType)))
.add("properties", MapType(StringType,StringType))
.add("secondProp", MapType(StringType,StringType))
val concatDF = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
concatDF.printSchema()
concatDF.show()
concatDF.withColumn("mapConcat",map_concat(col("properties"),col("secondProp")))
.select("name","mapConcat")
.show(false)
concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
.select("name","mapFromEntries")
.show(false)
}
Conclusion
In this article, you have learned how to convert an array of StructType to a map,
convert a map to an array of StructType, and concatenate several maps using SQL map
functions on Spark DataFrame columns.
Spark SQL Sort functions – complete list
Post author:Naveen (NNK)
Post category:Apache Spark / Spark SQL Functions
Post last modified:February 14, 2023
Spark SQL provides built-in standard sort functions defined in the DataFrame API.
These come in handy when we need to sort a DataFrame by one or more columns. Each
of these functions accepts a column name as a String and returns a Column.
When possible, try to leverage the standard library functions, as they offer a
little more compile-time safety, handle nulls, and perform better than UDFs. If
your application is performance-critical, avoid custom UDFs wherever you can, as
UDFs do not guarantee performance.
Spark SQL sort functions are grouped as “sort_funcs” in Spark SQL. They come in
handy when we want to sort columns in ascending or descending order.
These are primarily used with the sort function of the DataFrame or Dataset.
SPARK SQL SORT FUNCTION SYNTAX | SPARK FUNCTION DESCRIPTION
asc(columnName: String): Column | Specifies the ascending order of the sorting column on a DataFrame or Dataset.
asc_nulls_first(columnName: String): Column | Similar to asc, but null values are returned first, followed by non-null values.
asc_nulls_last(columnName: String): Column | Similar to asc, but non-null values are returned first, followed by null values.
desc(columnName: String): Column | Specifies the descending order of the sorting column on a DataFrame or Dataset.
desc_nulls_first(columnName: String): Column | Similar to desc, but null values are returned first, followed by non-null values.
desc_nulls_last(columnName: String): Column | Similar to desc, but non-null values are returned first, followed by null values.
asc() – ascending function
asc function is used to specify the ascending order of the sorting column on a
DataFrame or Dataset.
Syntax: asc(columnName: String): Column
asc_nulls_first() – ascending with nulls first
Similar to asc function, but null values are returned first, followed by non-null values.
Syntax: asc_nulls_first(columnName: String): Column
asc_nulls_last() – ascending with nulls last
Similar to asc function, but non-null values are returned first, followed by null values.
Syntax: asc_nulls_last(columnName: String): Column
desc() – descending function
desc function is used to specify the descending order of the sorting column on a
DataFrame or Dataset.
Syntax: desc(columnName: String): Column
desc_nulls_first() – descending with nulls first
Similar to desc function, but null values are returned first, followed by non-null values.
Syntax: desc_nulls_first(columnName: String): Column
desc_nulls_last() – descending with nulls last
Similar to desc function, but non-null values are returned first, followed by null values.
Syntax: desc_nulls_last(columnName: String): Column
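The sort functions above come with no usage example, so here is a minimal sketch of how they might be passed to sort(). The object name SortFunctionsSketch, its helper methods, and the three-row sample data are hypothetical and chosen only to include one null salary. It relies on Spark's default null ordering, where asc places nulls first and desc places nulls last.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{asc, asc_nulls_last, desc, desc_nulls_first}

object SortFunctionsSketch {
  // Hypothetical sample data: Maria has no salary, so her row carries a null.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("James", Some(3000)), ("Maria", None), ("Robert", Some(4100)))
      .toDF("name", "salary")
  }

  // Sorts by the given sort expression and returns the names in result order.
  def namesSortedBy(df: DataFrame, order: Column): Seq[String] =
    df.sort(order).select("name").collect().map(_.getString(0)).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SortFunctionsSketch")
      .getOrCreate()
    val df = sample(spark)
    Seq(
      "asc" -> asc("salary"),                           // nulls first by default
      "asc_nulls_last" -> asc_nulls_last("salary"),
      "desc" -> desc("salary"),                         // nulls last by default
      "desc_nulls_first" -> desc_nulls_first("salary")
    ).foreach { case (label, order) =>
      println(s"$label: ${namesSortedBy(df, order).mkString(", ")}")
    }
    spark.stop()
  }
}
```

The same Column values can be passed to orderBy() on a DataFrame or to a window specification.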
Reference : Spark Functions scala code
Spark SQL Aggregate Functions
Post author:Naveen (NNK)
Post category:Apache Spark / Spark SQL Functions
Post last modified:February 14, 2023
Spark SQL provides built-in standard aggregate functions defined in the DataFrame
API. These come in handy when we need to perform aggregate operations on DataFrame
columns. Aggregate functions operate on a group of rows and calculate a single
return value for every group.
All these aggregate functions accept a Column or a column name as a String, plus
other arguments depending on the function, and return a Column.
When possible, try to leverage the standard library functions, as they offer a
little more compile-time safety, handle nulls, and perform better than UDFs. If
your application is performance-critical, avoid custom UDFs wherever you can, as
UDFs do not guarantee performance.
Spark Aggregate Functions
Spark SQL aggregate functions are grouped as “agg_funcs” in Spark SQL. Below is a
list of functions defined under this group. Click on each link to learn with a Scala
example.
Note that every function below has another signature that takes a String column
name instead of a Column.
AGGREGATE FUNCTION SYNTAX | AGGREGATE FUNCTION DESCRIPTION
approx_count_distinct(e: Column) | Returns the approximate count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double) | Returns the approximate count of distinct items in a group; rsd is the maximum relative standard deviation allowed.
avg(e: Column) | Returns the average of values in the input column.
collect_list(e: Column) | Returns all values from an input column with duplicates.
collect_set(e: Column) | Returns all values from an input column with duplicate values eliminated.
corr(column1: Column, column2: Column) | Returns the Pearson Correlation Coefficient for two columns.
count(e: Column) | Returns the number of elements in a column.
countDistinct(expr: Column, exprs: Column*) | Returns the number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column) | Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column) | Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean) | Returns the first element in a column. When ignoreNulls is set to true, it returns the first non-null element.
first(e: Column): Column | Returns the first element in a column.
grouping(e: Column) | Indicates whether a specified column in a GROUP BY list is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column) | Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean) | Returns the last element in a column. When ignoreNulls is set to true, it returns the last non-null element.
last(e: Column) | Returns the last element in a column.
max(e: Column) | Returns the maximum value in a column.
mean(e: Column) | Alias for avg. Returns the average of the values in a column.
min(e: Column) | Returns the minimum value in a column.
skewness(e: Column) | Returns the skewness of the values in a group.
stddev(e: Column) | Alias for stddev_samp.
stddev_samp(e: Column) | Returns the sample standard deviation of values in a column.
stddev_pop(e: Column) | Returns the population standard deviation of the values in a column.
sum(e: Column) | Returns the sum of all values in a column.
sumDistinct(e: Column) | Returns the sum of all distinct values in a column.
variance(e: Column) | Alias for var_samp.
var_samp(e: Column) | Returns the unbiased variance of the values in a column.
var_pop(e: Column) | Returns the population variance of the values in a column.
Aggregate Functions Examples
First, let’s create a DataFrame to work with aggregate functions. All examples
provided here are also available in the GitHub project.
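The DataFrame creation step itself is not shown in this text. A minimal sketch of what it could look like follows, assuming a DataFrame with employee_name, department and salary columns. The rows are reconstructed from the outputs in this article (the salary order matches the collect_list result, and the total matches the sum result), so treat the exact names and the object name AggregateSampleData as assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AggregateSampleData {
  // Assumed sample rows, reconstructed so that they agree with the
  // collect_list, sum and min/max outputs shown in this article.
  def create(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("James", "Sales", 3000),
      ("Michael", "Sales", 4600),
      ("Robert", "Sales", 4100),
      ("Maria", "Finance", 3000),
      ("James", "Sales", 3000),
      ("Scott", "Finance", 3300),
      ("Jen", "Finance", 3900),
      ("Jeff", "Marketing", 3000),
      ("Kumar", "Marketing", 2000),
      ("Saif", "Sales", 4100)
    ).toDF("employee_name", "department", "salary")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("AggregateSampleData")
      .getOrCreate()
    val df = create(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

With a session in scope, `val df = AggregateSampleData.create(spark)` gives the `df` that the snippets in this article refer to.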
import spark.implicits._
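approx_count_distinct Aggregate Function
approx_count_distinct() function returns the approximate count of distinct items in a
group.
//approx_count_distinct()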
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))
//Prints approx_count_distinct: 6
avg (average) Aggregate Function
avg() function returns the average of values in the input column.
//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))
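collect_list Aggregate Function
collect_list() function returns all values from an input column with duplicates.
//collect_list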
df.select(collect_list("salary")).show(false)
+------------------------------------------------------------+
|collect_list(salary) |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+
collect_set Aggregate Function
collect_set() function returns all values from an input column with duplicate values
eliminated.
//collect_set
df.select(collect_set("salary")).show(false)
+------------------------------------+
|collect_set(salary) |
+------------------------------------+
|[4600, 3000, 3900, 4100, 3300, 2000]|
+------------------------------------+
countDistinct Aggregate Function
countDistinct() function returns the number of distinct elements in the given columns.
//countDistinct
val df2 = df.select(countDistinct("department", "salary"))
df2.show(false)
println("Distinct Count of Department & Salary: "+ df2.collect()(0)(0))
count function()
count() function returns number of elements in a column.
println("count: "+
df.select(count("salary")).collect()(0)(0))
//Prints count: 10
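grouping function()
grouping() indicates whether a given input column is aggregated or not. It returns 1
for aggregated or 0 for not aggregated in the result. If you call grouping directly on
the salary column in a plain select, you will get an error, because grouping() can only
be used with cube, rollup, or grouping sets.
first()
first() function returns the first element in a column. When ignoreNulls is set to
true, it returns the first non-null element.
//first
df.select(first("salary")).show(false)
+--------------------+
|first(salary, false)|
+--------------------+
|3000 |
+--------------------+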
last()
last() function returns the last element in a column. When ignoreNulls is set to
true, it returns the last non-null element.
//last
df.select(last("salary")).show(false)
+-------------------+
|last(salary, false)|
+-------------------+
|4100 |
+-------------------+
kurtosis()
kurtosis() function returns the kurtosis of the values in a group.
df.select(kurtosis("salary")).show(false)
+-------------------+
|kurtosis(salary) |
+-------------------+
|-0.6467803030303032|
+-------------------+
max()
max() function returns the maximum value in a column.
df.select(max("salary")).show(false)
+-----------+
|max(salary)|
+-----------+
|4600 |
+-----------+
min()
min() function returns the minimum value in a column.
df.select(min("salary")).show(false)
+-----------+
|min(salary)|
+-----------+
|2000 |
+-----------+
mean()
mean() function returns the average of the values in a column. It is an alias for avg().
df.select(mean("salary")).show(false)
+-----------+
|avg(salary)|
+-----------+
|3400.0 |
+-----------+
skewness()
skewness() function returns the skewness of the values in a group.
df.select(skewness("salary")).show(false)
+--------------------+
|skewness(salary) |
+--------------------+
|-0.12041791181069571|
+--------------------+
stddev(), stddev_samp() and stddev_pop()
stddev() is an alias for stddev_samp().
stddev_samp() function returns the sample standard deviation of values in a
column.
stddev_pop() function returns the population standard deviation of the values in a
column.
df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)
+-------------------+-------------------+------------------+
|stddev_samp(salary)|stddev_samp(salary)|stddev_pop(salary)|
+-------------------+-------------------+------------------+
|765.9416862050705 |765.9416862050705 |726.636084983398 |
+-------------------+-------------------+------------------+
sum()
sum() function returns the sum of all values in a column.
df.select(sum("salary")).show(false)
+-----------+
|sum(salary)|
+-----------+
|34000 |
+-----------+
sumDistinct()
sumDistinct() function returns the sum of all distinct values in a column.
df.select(sumDistinct("salary")).show(false)
+--------------------+
|sum(DISTINCT salary)|
+--------------------+
|20900 |
+--------------------+
variance(), var_samp(), var_pop()
variance() is an alias for var_samp().
var_samp() function returns the unbiased variance of the values in a column.
var_pop() function returns the population variance of the values in a column.
df.select(variance("salary"),var_samp("salary"),var_pop("salary"))
.show(false)
+-----------------+-----------------+---------------+
|var_samp(salary) |var_samp(salary) |var_pop(salary)|
+-----------------+-----------------+---------------+
|586666.6666666666|586666.6666666666|528000.0 |
+-----------------+-----------------+---------------+
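The difference between the sample and population variants is only the divisor: n - 1 for var_samp/stddev_samp and n for var_pop/stddev_pop. The plain Scala sketch below (no Spark needed) recomputes both from the ten salaries listed in the collect_list output, so the figures above can be checked by hand. The object and method names are hypothetical, and this is an illustration of the formulas, not Spark's internal implementation.

```scala
object VarianceSketch {
  // Salaries copied from the collect_list output shown earlier.
  val salaries: Seq[Double] =
    Seq(3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100).map(_.toDouble)

  // Sum of squared deviations from the mean, shared by both variants.
  private def sumOfSquaredDeviations(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => math.pow(x - mean, 2)).sum
  }

  // Sample (unbiased) variance divides by n - 1.
  def varSamp(xs: Seq[Double]): Double = sumOfSquaredDeviations(xs) / (xs.size - 1)

  // Population variance divides by n.
  def varPop(xs: Seq[Double]): Double = sumOfSquaredDeviations(xs) / xs.size

  def main(args: Array[String]): Unit = {
    println(s"var_samp    = ${varSamp(salaries)}")
    println(s"var_pop     = ${varPop(salaries)}")
    println(s"stddev_samp = ${math.sqrt(varSamp(salaries))}")
    println(s"stddev_pop  = ${math.sqrt(varPop(salaries))}")
  }
}
```

The sum of squared deviations for these salaries is 5,280,000, which gives 528000.0 when divided by 10 and about 586666.67 when divided by 9, in line with the tables above.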
Source code of Spark SQL Aggregate Functions examples
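package com.sparkbyexamples.spark.dataframe.functions.aggregate
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// The object wrapper, SparkSession setup and sample data are missing from this
// listing. They are reconstructed below as an assumption, consistent with the
// outputs shown above; the object name AggregateFunctions is hypothetical.
object AggregateFunctions extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._
  val simpleData = Seq(
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100))
  val df = simpleData.toDF("employee_name", "department", "salary")
  //approx_count_distinct()
  println("approx_count_distinct: "+
    df.select(approx_count_distinct("salary")).collect()(0)(0))
  //avg
  println("avg: "+
    df.select(avg("salary")).collect()(0)(0))
  //collect_list
  df.select(collect_list("salary")).show(false)
  //collect_set
  df.select(collect_set("salary")).show(false)
  //countDistinct
  val df2 = df.select(countDistinct("department", "salary"))
  df2.show(false)
  println("Distinct Count of Department & Salary: "+ df2.collect()(0)(0))
  //count
  println("count: "+
    df.select(count("salary")).collect()(0)(0))
  //first
  df.select(first("salary")).show(false)
  //last
  df.select(last("salary")).show(false)
  df.select(kurtosis("salary")).show(false)
  df.select(max("salary")).show(false)
  df.select(min("salary")).show(false)
  df.select(mean("salary")).show(false)
  df.select(skewness("salary")).show(false)
  df.select(stddev("salary"), stddev_samp("salary"),
    stddev_pop("salary")).show(false)
  df.select(sum("salary")).show(false)
  df.select(sumDistinct("salary")).show(false)
  df.select(variance("salary"),var_samp("salary"),
    var_pop("salary")).show(false)
}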
Conclusion
In this article, I’ve consolidated and listed all Spark SQL aggregate functions with
Scala examples, and you have also learned the benefits of using Spark SQL functions.
Happy Learning !!
Spark Window Functions with Examples
Post author:Naveen (NNK)
Post category:Apache Spark / Spark SQL Functions
Post last modified:January 17, 2023
Spark Window functions are used to calculate results such as the rank, row number,
etc. over a range of input rows, and they are available by importing
org.apache.spark.sql.functions._. This article explains the concept of window
functions, their usage and syntax, and finally how to use them with Spark SQL
and Spark’s DataFrame API. These come in handy when we need to perform aggregate
operations in a specific window frame on DataFrame columns.
When possible, try to leverage the standard library functions, as they offer a
little more compile-time safety, handle nulls, and perform better than UDFs. If
your application is performance-critical, avoid custom UDFs wherever you can, as
UDFs do not guarantee performance.
1. Spark Window Functions
Spark Window functions operate on a group of rows (like frame, partition) and
return a single value for every input row. Spark SQL supports three kinds of window
functions:
ranking functions
analytic functions
aggregate functions
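2. Spark Window Ranking functions
2.1 row_number Window Function
row_number() window function assigns a sequential row number, starting from 1, to
each row within a window partition.
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number().over(windowSpec))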
.show()
Yields below output.
+-------------+----------+------+----------+
|employee_name|department|salary|row_number|
+-------------+----------+------+----------+
| James| Sales| 3000| 1|
| James| Sales| 3000| 2|
| Robert| Sales| 4100| 3|
| Saif| Sales| 4100| 4|
| Michael| Sales| 4600| 5|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 2|
| Jen| Finance| 3900| 3|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+----------+
2.2 rank Window Function
rank() window function is used to provide a rank to the result within a window
partition. This function leaves gaps in rank when there are ties.
import org.apache.spark.sql.functions._
//rank
df.withColumn("rank",rank().over(windowSpec))
.show()
Yields below output.
+-------------+----------+------+----+
|employee_name|department|salary|rank|
+-------------+----------+------+----+
| James| Sales| 3000| 1|
| James| Sales| 3000| 1|
| Robert| Sales| 4100| 3|
| Saif| Sales| 4100| 3|
| Michael| Sales| 4600| 5|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 2|
| Jen| Finance| 3900| 3|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+----+
This is the same as the RANK function in SQL.
2.3 dense_rank Window Function
dense_rank() window function is used to get the rank of rows within a window
partition without any gaps. This is similar to the rank() function; the difference
is that rank() leaves gaps in the ranking when there are ties.
import org.apache.spark.sql.functions._
//dense_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec))
.show()
Yields below output.
+-------------+----------+------+----------+
|employee_name|department|salary|dense_rank|
+-------------+----------+------+----------+
| James| Sales| 3000| 1|
| James| Sales| 3000| 1|
| Robert| Sales| 4100| 2|
| Saif| Sales| 4100| 2|
| Michael| Sales| 4600| 3|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 2|
| Jen| Finance| 3900| 3|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+----------+
This is the same as the DENSE_RANK function in SQL.
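To see where the gaps come from, the plain Scala sketch below (no Spark needed) recomputes both rankings for the Sales partition from the tables above. rank counts every row with a smaller salary, while dense_rank counts only the distinct smaller salaries. The object and method names are hypothetical; this illustrates the definitions rather than Spark's implementation.

```scala
object RankSketch {
  // Sales salaries in window order (orderBy("salary")), taken from the tables above.
  val sales: Seq[Int] = Seq(3000, 3000, 4100, 4100, 4600)

  // rank: 1 + number of rows with a strictly smaller ordering value.
  def rank(sorted: Seq[Int]): Seq[Int] =
    sorted.map(v => sorted.count(_ < v) + 1)

  // dense_rank: 1 + number of distinct values that are strictly smaller.
  def denseRank(sorted: Seq[Int]): Seq[Int] = {
    val distinctValues = sorted.distinct
    sorted.map(v => distinctValues.count(_ < v) + 1)
  }

  def main(args: Array[String]): Unit = {
    println(s"rank:       ${rank(sales).mkString(", ")}")       // 1, 1, 3, 3, 5
    println(s"dense_rank: ${denseRank(sales).mkString(", ")}")  // 1, 1, 2, 2, 3
  }
}
```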
2.4 percent_rank Window Function
import org.apache.spark.sql.functions._
//percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec))
.show()
Yields below output.
+-------------+----------+------+------------+
|employee_name|department|salary|percent_rank|
+-------------+----------+------+------------+
| James| Sales| 3000| 0.0|
| James| Sales| 3000| 0.0|
| Robert| Sales| 4100| 0.5|
| Saif| Sales| 4100| 0.5|
| Michael| Sales| 4600| 1.0|
| Maria| Finance| 3000| 0.0|
| Scott| Finance| 3300| 0.5|
| Jen| Finance| 3900| 1.0|
| Kumar| Marketing| 2000| 0.0|
| Jeff | Marketing| 3000| 1.0|
+-------------+----------+------+------------+
This is the same as the PERCENT_RANK function in SQL.
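The values in the percent_rank column follow the usual SQL definition, (rank - 1) / (rows in partition - 1). The plain Scala sketch below reproduces the Sales and Finance columns from that formula; the names are hypothetical, and returning 0.0 for a single-row partition is included as the conventional edge case.

```scala
object PercentRankSketch {
  // percent_rank = (rank - 1) / (n - 1), with 0.0 for a single-row partition.
  def percentRank(sorted: Seq[Int]): Seq[Double] = {
    val n = sorted.size
    sorted.map { v =>
      val rank = sorted.count(_ < v) + 1
      if (n > 1) (rank - 1).toDouble / (n - 1) else 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    // Partitions in window order, taken from the tables above.
    println(percentRank(Seq(3000, 3000, 4100, 4100, 4600)).mkString(", ")) // Sales
    println(percentRank(Seq(3000, 3300, 3900)).mkString(", "))             // Finance
  }
}
```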
2.5 ntile Window Function
ntile(n) window function distributes the ordered rows of a window partition into n
buckets and returns the bucket number of each row. In the example below we pass 2
as the argument to ntile, so every row gets one of 2 values (1 or 2).
//ntile
df.withColumn("ntile",ntile(2).over(windowSpec))
.show()
Yields below output.
+-------------+----------+------+-----+
|employee_name|department|salary|ntile|
+-------------+----------+------+-----+
| James| Sales| 3000| 1|
| James| Sales| 3000| 1|
| Robert| Sales| 4100| 1|
| Saif| Sales| 4100| 2|
| Michael| Sales| 4600| 2|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 1|
| Jen| Finance| 3900| 2|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+-----+
This is the same as the NTILE function in SQL.
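When the partition size is not a multiple of the bucket count, the earlier buckets receive the extra rows, which is why Sales (5 rows) splits 3 and 2 in the output above. A plain Scala sketch of that distribution rule follows; the object and method names are hypothetical and it only models the bucket assignment, not Spark's execution.

```scala
object NtileSketch {
  // Assigns bucket numbers 1..buckets to rowCount ordered rows.
  // The first (rowCount % buckets) buckets get one extra row each.
  def ntile(rowCount: Int, buckets: Int): Seq[Int] = {
    val base = rowCount / buckets
    val extra = rowCount % buckets
    (1 to buckets).flatMap { b =>
      val size = if (b <= extra) base + 1 else base
      Seq.fill(size)(b)
    }
  }

  def main(args: Array[String]): Unit = {
    println(ntile(5, 2).mkString(", ")) // Sales:     1, 1, 1, 2, 2
    println(ntile(3, 2).mkString(", ")) // Finance:   1, 1, 2
    println(ntile(2, 2).mkString(", ")) // Marketing: 1, 2
  }
}
```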
3. Spark Window Analytic functions
3.1 cume_dist Window Function
cume_dist() window function is used to get the cumulative distribution of values
within a window partition.
This is the same as the CUME_DIST function in SQL.
//cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec))
.show()
+-------------+----------+------+------------------+
|employee_name|department|salary| cume_dist|
+-------------+----------+------+------------------+
| James| Sales| 3000| 0.4|
| James| Sales| 3000| 0.4|
| Robert| Sales| 4100| 0.8|
| Saif| Sales| 4100| 0.8|
| Michael| Sales| 4600| 1.0|
| Maria| Finance| 3000|0.3333333333333333|
| Scott| Finance| 3300|0.6666666666666666|
| Jen| Finance| 3900| 1.0|
| Kumar| Marketing| 2000| 0.5|
| Jeff | Marketing| 3000| 1.0|
+-------------+----------+------+------------------+
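Each cume_dist value is the fraction of rows in the partition whose ordering value is less than or equal to the current row's value. The plain Scala sketch below recomputes the Sales and Finance columns from that definition; the names are hypothetical.

```scala
object CumeDistSketch {
  // cume_dist = (rows with value <= current value) / (rows in partition)
  def cumeDist(sorted: Seq[Int]): Seq[Double] = {
    val n = sorted.size
    sorted.map(v => sorted.count(_ <= v).toDouble / n)
  }

  def main(args: Array[String]): Unit = {
    // Partitions in window order, taken from the table above.
    println(cumeDist(Seq(3000, 3000, 4100, 4100, 4600)).mkString(", ")) // Sales
    println(cumeDist(Seq(3000, 3300, 3900)).mkString(", "))             // Finance
  }
}
```

Tied rows share the same value because both count toward "less than or equal", which is why the two 3000 rows in Sales both show 0.4.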
3.2 lag Window Function
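lag() window function returns the value that is offset rows before the current row
within the window partition, and null when there are fewer than offset rows before it.
This is the same as the LAG function in SQL.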
//lag
df.withColumn("lag",lag("salary",2).over(windowSpec))
.show()
+-------------+----------+------+----+
|employee_name|department|salary| lag|
+-------------+----------+------+----+
| James| Sales| 3000|null|
| James| Sales| 3000|null|
| Robert| Sales| 4100|3000|
| Saif| Sales| 4100|3000|
| Michael| Sales| 4600|4100|
| Maria| Finance| 3000|null|
| Scott| Finance| 3300|null|
| Jen| Finance| 3900|3000|
| Kumar| Marketing| 2000|null|
| Jeff | Marketing| 3000|null|
+-------------+----------+------+----+
3.3 lead Window Function
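lead() window function returns the value that is offset rows after the current row
within the window partition, and null when there are fewer than offset rows after it.
This is the same as the LEAD function in SQL.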
//lead
df.withColumn("lead",lead("salary",2).over(windowSpec))
.show()
+-------------+----------+------+----+
|employee_name|department|salary|lead|
+-------------+----------+------+----+
| James| Sales| 3000|4100|
| James| Sales| 3000|4100|
| Robert| Sales| 4100|4600|
| Saif| Sales| 4100|null|
| Michael| Sales| 4600|null|
| Maria| Finance| 3000|3900|
| Scott| Finance| 3300|null|
| Jen| Finance| 3900|null|
| Kumar| Marketing| 2000|null|
| Jeff | Marketing| 3000|null|
+-------------+----------+------+----+
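Both functions are simply a shifted lookup inside the ordered partition. The plain Scala sketch below models that with Option, where None plays the role of the null shown in the tables. The object and method names are hypothetical, and the offset of 2 matches the examples above.

```scala
object LagLeadSketch {
  // Value `offset` rows before each row, or None when no such row exists.
  def lag[A](rows: Seq[A], offset: Int): Seq[Option[A]] =
    rows.indices.map(i => rows.lift(i - offset))

  // Value `offset` rows after each row, or None when no such row exists.
  def lead[A](rows: Seq[A], offset: Int): Seq[Option[A]] =
    rows.indices.map(i => rows.lift(i + offset))

  def main(args: Array[String]): Unit = {
    // Sales salaries in window order, taken from the tables above.
    val sales = Seq(3000, 3000, 4100, 4100, 4600)
    println(lag(sales, 2).map(_.getOrElse("null")).mkString(", "))
    println(lead(sales, 2).map(_.getOrElse("null")).mkString(", "))
  }
}
```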
4. Spark Window Aggregate Functions
In this section, I will explain how to calculate the avg, sum, min and max for each
department using Spark SQL aggregate window functions and WindowSpec. When working
with aggregate functions, we don’t need to use an order by clause.
+----------+------+-----+----+----+
|department| avg| sum| min| max|
+----------+------+-----+----+----+
| Sales|3760.0|18800|3000|4600|
| Finance|3400.0|10200|3000|3900|
| Marketing|2500.0| 5000|2000|3000|
+----------+------+-----+----+----+
Please refer to Spark SQL Aggregate Functions for more aggregate functions.
5. Source Code of Window Functions Example
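import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// The object wrapper, SparkSession setup and sample data are missing from this
// listing. They are reconstructed below as an assumption, consistent with the
// outputs shown above; the object name WindowFunctions is hypothetical.
object WindowFunctions extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._
  val simpleData = Seq(
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100))
  val df = simpleData.toDF("employee_name", "department", "salary")
  //row_number
  val windowSpec = Window.partitionBy("department").orderBy("salary")
  df.withColumn("row_number",row_number().over(windowSpec))
    .show()
  //rank
  df.withColumn("rank",rank().over(windowSpec))
    .show()
  //dense_rank
  df.withColumn("dense_rank",dense_rank().over(windowSpec))
    .show()
  //percent_rank
  df.withColumn("percent_rank",percent_rank().over(windowSpec))
    .show()
  //ntile
  df.withColumn("ntile",ntile(2).over(windowSpec))
    .show()
  //cume_dist
  df.withColumn("cume_dist",cume_dist().over(windowSpec))
    .show()
  //lag
  df.withColumn("lag",lag("salary",2).over(windowSpec))
    .show()
  //lead
  df.withColumn("lead",lead("salary",2).over(windowSpec))
    .show()
  //Aggregate Functions: the window has no orderBy, so each aggregate covers
  //the whole department; row_number keeps one output row per department.
  val windowSpecAgg = Window.partitionBy("department")
  df.withColumn("row",row_number().over(windowSpec))
    .withColumn("avg", avg(col("salary")).over(windowSpecAgg))
    .withColumn("sum", sum(col("salary")).over(windowSpecAgg))
    .withColumn("min", min(col("salary")).over(windowSpecAgg))
    .withColumn("max", max(col("salary")).over(windowSpecAgg))
    .where(col("row")===1).select("department","avg","sum","min","max")
    .show()
}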
The complete source code is available at GitHub for reference.
6. Conclusion
In this tutorial, you have learned what Spark SQL window functions are, their syntax,
and how to use them with aggregate functions, along with several examples in Scala.
Spark Most Used JSON Functions with Examples
Post author:Naveen (NNK)
Post category:Apache Spark
Post last modified:January 31, 2023
Spark SQL provides a set of JSON functions to parse a JSON string and to query it
for specific values. In this article, I will explain the most used JSON
functions with Scala examples.
1. Spark JSON Functions
from_json() – Converts a JSON string into a Struct type or Map type.
to_json() – Converts a MapType or Struct type to a JSON string.
json_tuple() – Extracts data from JSON and creates new columns from it.
get_json_object() – Extracts a JSON element from a JSON string based on the JSON
path specified.
schema_of_json() – Creates a schema string from a JSON string.
2. Create DataFrame with a Column Containing a JSON String
In order to explain these functions, let’s first create a DataFrame with a column
containing a JSON string.
val jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
import spark.implicits._
val df=data.toDF("id","value")
df.show(false)
//+---+--------------------------------------------------------------------------+
//|id |value |
//+---+--------------------------------------------------------------------------+
//|1 |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
//+---+--------------------------------------------------------------------------+
3. from_json() – Converts JSON string into Struct type or Map type
The example below converts the JSON string to a map of key-value pairs. Converting
it to a struct type is left as an exercise; refer to Convert JSON string to Struct
type column.
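The code for this step is not shown in this text, so here is a minimal sketch that should be consistent with the schema and output that follow, assuming the JSON string from section 2. It parses the value column with from_json() using a MapType(StringType, StringType) schema and, for completeness, also applies the inverse to_json() call described in section 4. The object name and the parsed/serialized helper names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{MapType, StringType}

object JsonMapRoundTripSketch {
  val jsonString =
    """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""

  // from_json: parse the JSON string column into a map<string,string> column.
  def parsed(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq((1, jsonString)).toDF("id", "value")
    df.withColumn("value", from_json(col("value"), MapType(StringType, StringType)))
  }

  // to_json: serialize the map column back into a JSON string column.
  def serialized(df2: DataFrame): DataFrame =
    df2.withColumn("value", to_json(col("value")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("JsonMapRoundTripSketch")
      .getOrCreate()
    val df2 = parsed(spark)
    df2.printSchema()
    df2.show(false)
    serialized(df2).show(false)
    spark.stop()
  }
}
```

Because the map's value type is string, the numeric Zipcode comes back from to_json() as the quoted string "704".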
//root
// |-- id: integer (nullable = false)
// |-- value: map (nullable = true)
// | |-- key: string
// | |-- value: string (valueContainsNull = true)
//+---+---------------------------------------------------------------------------+
//|id |value |
//+---+---------------------------------------------------------------------------+
//|1 |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
//+---+---------------------------------------------------------------------------+
4. to_json() – Converts MapType or Struct type to JSON string
Here, I am using df2, which was created in the from_json() example above.
//+---+----------------------------------------------------------------------------+
//|id |value |
//+---+----------------------------------------------------------------------------+
//|1 |{"Zipcode":"704","ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
//+---+----------------------------------------------------------------------------+
5. json_tuple() – Extract the Data from JSON and create them as new columns
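The call that produces the output in this section is not shown in this text. A minimal sketch, assuming the DataFrame from section 2, is to pass the JSON column and the wanted field names to json_tuple() and then rename the generated columns with toDF(). The object name and the extract helper are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, json_tuple}

object JsonTupleSketch {
  val jsonString =
    """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""

  // json_tuple emits one string column per requested field (named c0, c1, ...),
  // so toDF is used to give the result readable column names.
  def extract(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq((1, jsonString)).toDF("id", "value")
    df.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City"))
      .toDF("id", "Zipcode", "ZipCodeType", "City")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("JsonTupleSketch")
      .getOrCreate()
    extract(spark).show(false)
    spark.stop()
  }
}
```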
//+---+-------+-----------+-----------+
//|id |Zipcode|ZipCodeType|City |
//+---+-------+-----------+-----------+
//|1 |704 |STANDARD |PARC PARQUE|
//+---+-------+-----------+-----------+
6. get_json_object() – Extracts JSON element from a JSON string based on the JSON
path specified
//+---+-----------+
//|id |ZipCodeType|
//+---+-----------+
//|1 |STANDARD |
//+---+-----------+
7. schema_of_json() – Create schema string from JSON string.
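No schema_of_json() call appears in this text, so the following is a minimal sketch of one way it might be used, assuming the JSON string from section 2. schema_of_json() expects a literal JSON string, so it is wrapped in lit() and evaluated against a one-row DataFrame from spark.range(1). The object name and the inferSchema helper are hypothetical, and the exact formatting of the returned schema string varies between Spark versions, so no expected output is shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, schema_of_json}

object SchemaOfJsonSketch {
  val jsonString =
    """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""

  // Returns the schema inferred from the literal JSON string, as a DDL-style string.
  def inferSchema(spark: SparkSession): String =
    spark.range(1)
      .select(schema_of_json(lit(jsonString)))
      .first()
      .getString(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SchemaOfJsonSketch")
      .getOrCreate()
    println(inferSchema(spark))
    spark.stop()
  }
}
```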
import org.apache.spark.sql.SparkSession
df.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").as("ZipCodeType"))
.show(false)