Spark SQL String Functions


Click on each link in the table below for more explanation and working examples of each string function with Scala.
ascii(e: Column): Column
  Computes the numeric value of the first character of the string column, and returns the result as an int column.
base64(e: Column): Column
  Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.
concat_ws(sep: String, exprs: Column*): Column
  Concatenates multiple input string columns together into a single string column, using the given separator.
decode(value: Column, charset: String): Column
  Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
encode(value: Column, charset: String): Column
  Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
format_number(x: Column, d: Int): Column
  Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
format_string(format: String, arguments: Column*): Column
  Formats the arguments in printf-style and returns the result as a string column.
initcap(e: Column): Column
  Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, "hello world" will become "Hello World".
instr(str: Column, substring: String): Column
  Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
length(e: Column): Column
  Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros.
lower(e: Column): Column
  Converts a string column to lower case.
levenshtein(l: Column, r: Column): Column
  Computes the Levenshtein distance of the two given string columns.
locate(substr: String, str: Column): Column
  Locate the position of the first occurrence of substr.
locate(substr: String, str: Column, pos: Int): Column
  Locate the position of the first occurrence of substr in a string column, after position pos.
lpad(str: Column, len: Int, pad: String): Column
  Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
ltrim(e: Column): Column
  Trim the spaces from the left end of the specified string value.
regexp_extract(e: Column, exp: String, groupIdx: Int): Column
  Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
regexp_replace(e: Column, pattern: String, replacement: String): Column
  Replace all substrings of the specified string value that match regexp with rep.
regexp_replace(e: Column, pattern: Column, replacement: Column): Column
  Replace all substrings of the specified string value that match regexp with rep.
unbase64(e: Column): Column
  Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
rpad(str: Column, len: Int, pad: String): Column
  Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
repeat(str: Column, n: Int): Column
  Repeats a string column n times, and returns it as a new string column.
rtrim(e: Column): Column
  Trim the spaces from the right end of the specified string value.
rtrim(e: Column, trimString: String): Column
  Trim the specified character string from the right end of the specified string column.
soundex(e: Column): Column
  Returns the soundex code for the specified expression.
split(str: Column, regex: String): Column
  Splits str around matches of the given regex.
split(str: Column, regex: String, limit: Int): Column
  Splits str around matches of the given regex.
substring(str: Column, pos: Int, len: Int): Column
  Substring starts at `pos` and is of length `len` when str is String type, or returns the slice of the byte array that starts at `pos` in byte and is of length `len` when str is Binary type.
substring_index(str: Column, delim: String, count: Int): Column
  Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
overlay(src: Column, replaceString: String, pos: Int, len: Int): Column
  Overlay the specified portion of `src` with `replaceString`, starting from byte position `pos` of `inputString` and proceeding for `len` bytes.
overlay(src: Column, replaceString: String, pos: Int): Column
  Overlay the specified portion of `src` with `replaceString`, starting from byte position `pos` of `inputString`.
translate(src: Column, matchingString: String, replaceString: String): Column
  Translate any character in the src by a character in replaceString. The characters in replaceString correspond to the characters in matchingString. The translate will happen when any character in the string matches the character in the `matchingString`.
trim(e: Column): Column
  Trim the spaces from both ends of the specified string column.
trim(e: Column, trimString: String): Column
  Trim the specified character from both ends of the specified string column.
upper(e: Column): Column
  Converts a string column to upper case.
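As a quick illustration of a few of these functions, here is a minimal sketch (not from the original article; the data and column names are made up), assuming a SparkSession named spark with its implicits imported as in the setup shown later in this document:

import org.apache.spark.sql.functions._
import spark.implicits._

Seq(("  Hello World  ", "Spark"))
  .toDF("text", "word")
  .select(
    trim(col("text")).as("trimmed"),                            // "Hello World"
    upper(col("text")).as("upper"),                             // "  HELLO WORLD  "
    concat_ws("-", col("text"), col("word")).as("concat_ws"),
    regexp_replace(col("text"), "World", "Spark").as("regexp_replace"),
    substring(col("text"), 3, 5).as("substring")                // "Hello"
  ).show(false)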

Spark SQL Date and Timestamp Functions


Post author:Naveen (NNK)
Post category:Apache Spark  / Spark SQL Functions
Post last modified: February 14, 2023
Spark SQL provides built-in standard Date and Timestamp (date and time) functions defined in the DataFrame API; these come in handy when we need to perform operations on dates and times. All of them accept input as a Date type, Timestamp type, or String. If a String, it should be in a format that can be cast to a date, such as yyyy-MM-dd, or to a timestamp, such as yyyy-MM-dd HH:mm:ss.SSSS; they return a date or timestamp respectively, and return null if the input was a string that could not be cast to a date or timestamp.
When possible, try to leverage the standard library functions, as they are a little more compile-time safe, handle nulls, and perform better when compared to a Spark UDF. If your application is performance-critical, try to avoid custom UDFs at all costs, as they do not guarantee performance.
For readability, I've grouped the Date and Timestamp functions into the following:
Spark SQL Date Functions  
Spark SQL Timestamp Functions
Date and Timestamp Window Functions
Before you use any of the examples below, make sure you create a SparkSession and import the SQL functions.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExample")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
Copy
Spark SQL Date Functions
Click on each link in the table below for more explanation and working examples in Scala.
current_date(): Column
  Returns the current date as a date column.
date_format(dateExpr: Column, format: String): Column
  Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
to_date(e: Column): Column
  Converts the column into `DateType` by casting rules to `DateType`.
to_date(e: Column, fmt: String): Column
  Converts the column into a `DateType` with a specified format.
add_months(startDate: Column, numMonths: Int): Column
  Returns the date that is `numMonths` after `startDate`.
date_add(start: Column, days: Int): Column
date_sub(start: Column, days: Int): Column
  Returns the date that is `days` days after (date_add) or before (date_sub) `start`.
datediff(end: Column, start: Column): Column
  Returns the number of days from `start` to `end`.
months_between(end: Column, start: Column): Column
  Returns the number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end: Column, start: Column, roundOff: Boolean): Column
  Returns the number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
next_day(date: Column, dayOfWeek: String): Column
  Returns the first date which is later than the value of the `date` column that is on the specified day of the week. For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column
  Returns date truncated to the unit specified by the format. For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01. format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.
date_trunc(format: String, timestamp: Column): Column
  Returns timestamp truncated to the unit specified by the format. For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00. format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day. Other options are: 'second', 'minute', 'hour', 'week', 'month', 'quarter'.
year(e: Column): Column
  Extracts the year as an integer from a given date/timestamp/string.
quarter(e: Column): Column
  Extracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): Column
  Extracts the month as an integer from a given date/timestamp/string.
dayofweek(e: Column): Column
  Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.
dayofmonth(e: Column): Column
  Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column
  Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): Column
  Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
last_day(e: Column): Column
  Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
from_unixtime(ut: Column): Column
  Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column
  Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp(): Column
  Returns the current Unix timestamp (in seconds) as a long.
unix_timestamp(s: Column): Column
  Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): Column
  Converts time string with given pattern to Unix timestamp (in seconds).
Spark SQL Timestamp Functions
Below are some of the Spark SQL Timestamp functions; these functions operate on both date and timestamp values. Select each link for a description and example of each function.
The default format of the Spark Timestamp is yyyy-MM-dd HH:mm:ss.SSSS.
current_timestamp(): Column
  Returns the current timestamp as a timestamp column.
hour(e: Column): Column
  Extracts the hours as an integer from a given date/timestamp/string.
minute(e: Column): Column
  Extracts the minutes as an integer from a given date/timestamp/string.
second(e: Column): Column
  Extracts the seconds as an integer from a given date/timestamp/string.
to_timestamp(s: Column): Column
  Converts to a timestamp by casting rules to `TimestampType`.
to_timestamp(s: Column, fmt: String): Column
  Converts time string with the given pattern to timestamp.
Spark Date and Timestamp Window Functions
Below are the Date and Timestamp window functions.
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
  Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
  Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
window(timeColumn: Column, windowDuration: String): Column
  Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
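As a minimal sketch of the window() function (the event data and column names below are made up), the following buckets rows into 5-minute tumbling windows and sums a value per window:

import org.apache.spark.sql.functions._
import spark.implicits._

val events = Seq(
    ("2019-01-01 12:01:00", 10),
    ("2019-01-01 12:03:00", 20),
    ("2019-01-01 12:07:00", 30))
  .toDF("time", "value")
  .withColumn("time", to_timestamp(col("time")))   // cast the string to TimestampType

events.groupBy(window(col("time"), "5 minutes"))    // tumbling 5-minute windows
  .agg(sum("value").as("total"))
  .show(false)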
Spark Date Functions Examples
Below are the most commonly used examples of Date functions.
current_date() and date_format()
We will see how to get the current date and convert a date into a specific format using date_format() with a Scala example. The below example parses the date and converts it from 'yyyy-MM-dd' to 'MM-dd-yyyy' format.

import org.apache.spark.sql.functions._
Seq(("2019-01-23"))
  .toDF("Input")
  .select(
    current_date().as("current_date"),
    col("Input"),
    date_format(col("Input"), "MM-dd-yyyy").as("format")
  ).show()
Copy
+------------+----------+-----------+
|current_date| Input |format |
+------------+----------+-----------+
| 2019-07-23 |2019-01-23| 01-23-2019 |
+------------+----------+-----------+
Copy
to_date()
The below example converts a string in 'MM/dd/yyyy' date format to a DateType ('yyyy-MM-dd') using to_date() with a Scala example.

import org.apache.spark.sql.functions._
Seq(("04/13/2019"))
  .toDF("Input")
  .select( col("Input"),
    to_date(col("Input"), "MM/dd/yyyy").as("to_date")
  ).show()
Copy

+----------+----------+
|Input |to_date |
+----------+----------+
|04/13/2019|2019-04-13|
+----------+----------+
Copy
datediff()
The below example returns the difference between two dates using datediff() with a Scala example.

import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select( col("input"), current_date(),
    datediff(current_date(),col("input")).as("diff")
  ).show()
Copy

+----------+--------------+--------+
| input |current_date()| diff |
+----------+--------------+--------+
|2019-01-23| 2019-07-23 | 181 |
|2019-06-24| 2019-07-23 | 29 |
|2019-09-20| 2019-07-23 | -59 |
+----------+--------------+--------+
Copy
months_between()
The below example returns the months between two dates using months_between() with Scala.

import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("date")
  .select( col("date"), current_date(),
    datediff(current_date(),col("date")).as("datediff"),
    months_between(current_date(),col("date")).as("months_between")
  ).show()
Copy

+----------+--------------+--------+--------------+
| date |current_date()|datediff |months_between|
+----------+--------------+--------+--------------+
|2019-01-23| 2019-07-23 | 181| 6.0|
|2019-06-24| 2019-07-23 | 29| 0.96774194|
|2019-09-20| 2019-07-23 | -59| -1.90322581|
+----------+--------------+--------+--------------+
Copy
trunc()
The below example truncates a date to a specified unit using trunc() with Scala.

import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select( col("input"),
    trunc(col("input"),"Month").as("Month_Trunc"),
    trunc(col("input"),"Year").as("Month_Year"),
    trunc(col("input"),"Month").as("Month_Trunc")
  ).show()
Copy

+----------+-----------+----------+-----------+
| input |Month_Trunc|Month_Year|Month_Trunc|
+----------+-----------+----------+-----------+
|2019-01-23| 2019-01-01|2019-01-01| 2019-01-01|
|2019-06-24| 2019-06-01|2019-01-01| 2019-06-01|
|2019-09-20| 2019-09-01|2019-01-01| 2019-09-01|
+----------+-----------+----------+-----------+
Copy
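trunc() truncates the date portion; for completeness, here is a minimal sketch (the input values are made up, not from the original article) of date_trunc() from the table above, which can also truncate within the time portion:

import org.apache.spark.sql.functions._
Seq(("2019-01-23 11:42:05"),("2019-06-24 18:03:56"))
  .toDF("input")
  .select( col("input"),
    date_trunc("month", to_timestamp(col("input"))).as("month_trunc"), // 2019-01-01 00:00:00, 2019-06-01 00:00:00
    date_trunc("hour", to_timestamp(col("input"))).as("hour_trunc")    // truncates minutes and seconds
  ).show(false)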
add_months(), date_add(), date_sub()
Here we are adding and subtracting dates and months from a given input.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20")).toDF("input")
  .select( col("input"),
    add_months(col("input"),3).as("add_months"),
    add_months(col("input"),-3).as("sub_months"),
    date_add(col("input"),4).as("date_add"),
    date_sub(col("input"),4).as("date_sub")
  ).show()
Copy

+----------+----------+----------+----------+----------+
| input |add_months|sub_months| date_add | date_sub |
+----------+----------+----------+----------+----------+
|2019-01-23|2019-04-23|2018-10-23|2019-01-27|2019-01-19|
|2019-06-24|2019-09-24|2019-03-24|2019-06-28|2019-06-20|
|2019-09-20|2019-12-20|2019-06-20|2019-09-24|2019-09-16|
+----------+----------+----------+----------+----------+
Copy
year(), month()
dayofweek(), dayofmonth(), dayofyear()
next_day(), weekofyear()

import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select( col("input"), year(col("input")).as("year"),
    month(col("input")).as("month"),
    dayofweek(col("input")).as("dayofweek"),
    dayofmonth(col("input")).as("dayofmonth"),
    dayofyear(col("input")).as("dayofyear"),
    next_day(col("input"),"Sunday").as("next_day"),
    weekofyear(col("input")).as("weekofyear")
  ).show()
Copy

+----------+----+-----+---------+----------+---------+----------+----------+
| input|year|month|dayofweek|dayofmonth|dayofyear| next_day|weekofyear|
+----------+----+-----+---------+----------+---------+----------+----------+
|2019-01-23|2019| 1| 4| 23| 23|2019-01-27| 4|
|2019-06-24|2019| 6| 2| 24| 175|2019-06-30| 26|
|2019-09-20|2019| 9| 6| 20| 263|2019-09-22| 38|
+----------+----+-----+---------+----------+---------+----------+----------+
Copy
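unix_timestamp() and from_unixtime()
The examples above do not cover unix_timestamp() and from_unixtime() from the table; a minimal sketch (the input value is made up, not from the original article):

import org.apache.spark.sql.functions._
Seq(("2019-01-23 12:01:19"))
  .toDF("input")
  .select( col("input"),
    unix_timestamp(col("input")).as("unix_seconds"),   // seconds since 1970-01-01 00:00:00 UTC
    from_unixtime(unix_timestamp(col("input")), "MM-dd-yyyy HH:mm").as("formatted")
  ).show(false)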
Spark Timestamp Functions Examples
Below are the most commonly used examples of Timestamp functions.
current_timestamp()
Returns the current timestamp in the Spark default format yyyy-MM-dd HH:mm:ss.

import org.apache.spark.sql.functions._
val df = Seq((1)).toDF("seq")
val curDate = df.withColumn("current_date",current_date().as("current_date"))
  .withColumn("current_timestamp",current_timestamp().as("current_timestamp"))
curDate.show(false)
Copy
Yields below output.

+---+------------+-----------------------+
|seq|current_date|current_timestamp |
+---+------------+-----------------------+
|1 |2019-11-16 |2019-11-16 21:00:55.349|
+---+------------+-----------------------+
Copy
to_timestamp()
Converts string timestamp to Timestamp type format.

import org.apache.spark.sql.functions._
val dfDate = Seq(("07-01-2019 12 01 19 406"),
  ("06-24-2019 12 01 19 406"),
  ("11-16-2019 16 44 55 406"),
  ("11-16-2019 16 50 59 406")).toDF("input_timestamp")

dfDate.withColumn("datetype_timestamp",
    to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH mm ss SSS"))
  .show(false)
Copy
Yields below output

+-----------------------+-------------------+
|input_timestamp |datetype_timestamp |
+-----------------------+-------------------+
|07-01-2019 12 01 19 406|2019-07-01 12:01:19|
|06-24-2019 12 01 19 406|2019-06-24 12:01:19|
|11-16-2019 16 44 55 406|2019-11-16 16:44:55|
|11-16-2019 16 50 59 406|2019-11-16 16:50:59|
+-----------------------+-------------------+
Copy
hour(), minute() and second()

import org.apache.spark.sql.functions._
val df = Seq(("2019-07-01 12:01:19.000"),
  ("2019-06-24 12:01:19.000"),
  ("2019-11-16 16:44:55.406"),
  ("2019-11-16 16:50:59.406")).toDF("input_timestamp")

df.withColumn("hour", hour(col("input_timestamp")))
  .withColumn("minute", minute(col("input_timestamp")))
  .withColumn("second", second(col("input_timestamp")))
  .show(false)
Copy
Yields below output

+-----------------------+----+------+------+
|input_timestamp |hour|minute|second|
+-----------------------+----+------+------+
|2019-07-01 12:01:19.000|12 |1 |19 |
|2019-06-24 12:01:19.000|12 |1 |19 |
|2019-11-16 16:44:55.406|16 |44 |55 |
|2019-11-16 16:50:59.406|16 |50 |59 |
+-----------------------+----+------+------+
Copy
Conclusion:
In this post, I've consolidated the complete list of Spark Date and Timestamp functions, with a description and examples of the commonly used ones. You can find more information about these at the following blog.
Happy Learning !!
Spark SQL Array Functions Complete List
Post author:Naveen (NNK)
Post category:Apache Spark  / Spark SQL Functions
Post last modified: February 14, 2023
Spark SQL provides built-in standard array functions defined in the DataFrame API; these come in handy when we need to perform operations on array (ArrayType) columns. All of them accept input as an array column and several other arguments based on the function.
When possible, try to leverage the standard library functions, as they are a little more compile-time safe, handle nulls, and perform better when compared to UDFs. If your application is performance-critical, try to avoid custom UDFs at all costs, as they do not guarantee performance.
Spark SQL array functions are grouped as collection functions ("collection_funcs") in Spark SQL along with several map functions. These array functions come in handy when we want to perform operations and transformations on array columns.
 
Though I've explained here with Scala, a similar method could be used to work with Spark SQL array functions in PySpark, and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article as it will give you an idea of Spark array functions and their usage.
 
Spark SQL Array Functions:
array_contains(column: Column, value: Any)
  Checks if a value is present in an array column. Returns: true if the value is present in the array; false when the value is not present; null when the array is null.
array_distinct(e: Column)
  Returns distinct values from the array after removing duplicates.
array_except(col1: Column, col2: Column)
  Returns all elements from col1 array but not in col2 array.
array_intersect(col1: Column, col2: Column)
  Returns all elements that are present in both col1 and col2 arrays.
array_join(column: Column, delimiter: String, nullReplacement: String)
array_join(column: Column, delimiter: String)
  Concatenates all elements of the array column using the provided delimiter. When null values are present, they are replaced with the 'nullReplacement' string.
array_max(e: Column)
  Returns the maximum value in an array.
array_min(e: Column)
  Returns the minimum value in an array.
array_position(column: Column, value: Any)
  Returns the position/index of the first occurrence of the 'value' in the given array. Returns the position as a long type; the position is not zero-based but starts with 1. Returns zero when the value is not found. Returns null when any of the arguments are null.
array_remove(column: Column, element: Any)
  Returns an array after removing all occurrences of the provided 'value' from the given array.
array_repeat(e: Column, count: Int)
  Creates an array containing the first argument repeated the number of times given by the second argument.
array_repeat(left: Column, right: Column)
  Creates an array containing the first argument repeated the number of times given by the second argument.
array_sort(e: Column)
  Returns the sorted array of the given input array. All null values are placed at the end of the array.
array_union(col1: Column, col2: Column)
  Returns an array of the elements that are present in both arrays (all elements from both arrays) without duplicates.
arrays_overlap(a1: Column, a2: Column)
  Returns true if `a1` and `a2` have at least one non-null element in common; false if `a1` and `a2` have completely different elements; null if both arrays are non-empty and either of them contains a `null`.
arrays_zip(e: Column*)
  Returns a merged array of structs in which the N-th struct contains all N-th values of the input.
concat(exprs: Column*)
  Concatenates all elements from the given columns.
element_at(column: Column, value: Any)
  Returns the element of the array located at the 'value' input position.
exists(column: Column, f: Column => Column)
  Checks whether the predicate holds for at least one element in the array column.
explode(e: Column)
  Creates a row for each element in the array column.
explode_outer(e: Column)
  Creates a row for each element in the array column. Unlike explode, if the array is null or empty, it returns null.
filter(column: Column, f: Column => Column)
filter(column: Column, f: (Column, Column) => Column)
  Returns an array of elements for which a predicate holds in a given array.
flatten(e: Column)
  Creates a single array from an array of arrays column.
forall(column: Column, f: Column => Column)
  Returns whether a predicate holds for every element in the array.
posexplode(e: Column)
  Creates a row for each element in the array and creates two columns: 'pos' to hold the position of the array element and 'col' to hold the actual array value.
posexplode_outer(e: Column)
  Creates a row for each element in the array and creates two columns: 'pos' to hold the position of the array element and 'col' to hold the actual array value. Unlike posexplode, if the array is null or empty, it returns null, null for the pos and col columns.
reverse(e: Column)
  Returns the array of elements in reverse order.
sequence(start: Column, stop: Column)
  Generates the sequence of numbers from start to stop.
sequence(start: Column, stop: Column, step: Column)
  Generates the sequence of numbers from start to stop, incrementing by the given step value.
shuffle(e: Column)
  Shuffles the given array.
size(e: Column)
  Returns the length of an array.
slice(x: Column, start: Int, length: Int)
  Returns an array of elements from position 'start' with the given length.
sort_array(e: Column)
  Sorts the array in ascending order. Null values are placed at the beginning.
sort_array(e: Column, asc: Boolean)
  Sorts the array in ascending or descending order based on the boolean parameter. For ascending, null values are placed at the beginning; for descending, they are placed at the end.
transform(column: Column, f: Column => Column)
transform(column: Column, f: (Column, Column) => Column)
  Returns an array of elements after applying the transformation.
zip_with(left: Column, right: Column, f: (Column, Column) => Column)
  Merges the two input arrays.
aggregate(expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column)
  Aggregates the array elements.
Array function Examples
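As a minimal sketch (the data is made up, not from the original article), the following exercises a few of the array functions listed above, assuming a SparkSession named spark with its implicits imported:

import org.apache.spark.sql.functions._
import spark.implicits._

val arrayDF = Seq(
    (Seq("Java", "Scala", "Python"), Seq("Scala", "R")))
  .toDF("languages1", "languages2")

arrayDF.select(
  array_contains(col("languages1"), "Scala").as("array_contains"),     // true
  array_union(col("languages1"), col("languages2")).as("array_union"),
  array_except(col("languages1"), col("languages2")).as("array_except"),
  sort_array(col("languages1")).as("sort_array"),
  size(col("languages1")).as("size")                                   // 3
).show(false)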

Spark SQL Map functions – complete list


Post author:Naveen (NNK)
Post category:Apache Spark  / Spark SQL Functions
Post last modified: February 7, 2023
In this article, I will explain the usage of the Spark SQL map functions map(), map_keys(), map_values(), map_concat(), and map_from_entries() on DataFrame columns using Scala examples.
Though I've explained here with Scala, a similar method could be used to work with Spark SQL map functions in PySpark, and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article as it will give you an idea of Spark map functions and their usage.
Spark SQL provides built-in standard map functions defined in the DataFrame API; these come in handy when we need to perform operations on map (MapType) columns. All these functions accept input as a map column and several other arguments based on the function.
When possible, try to leverage the standard library functions, as they are a little more compile-time safe, handle nulls, and perform better when compared to UDFs. If your application is performance-critical, try to avoid custom UDFs at all costs, as they do not guarantee performance.
Spark SQL map Functions
Spark SQL map functions are grouped as "collection_funcs" in Spark SQL along with several array functions. These map functions are useful when we want to concatenate two or more map columns, convert arrays of StructType entries to a map column, etc.

map
  Creates a new map column.
map_keys
  Returns an array containing the keys of the map.
map_values
  Returns an array containing the values of the map.
map_concat
  Merges maps specified in arguments.
map_from_entries
  Returns a map from the given array of StructType entries.
map_entries
  Returns an array of all StructType entries in the given map.
explode(e: Column)
  Creates a new row for every key-value pair in the map, ignoring null & empty. It creates two new columns, one for the key and one for the value.
explode_outer(e: Column)
  Creates a new row for every key-value pair in the map, including null & empty. It creates two new columns, one for the key and one for the value.
posexplode(e: Column)
  Creates a new row for each key-value pair in a map, ignoring null & empty. It also creates 3 columns: "pos" to hold the position of the map element, and "key" and "value" columns for every row.
posexplode_outer(e: Column)
  Creates a new row for each key-value pair in a map, including null & empty. It also creates 3 columns: "pos" to hold the position of the map element, and "key" and "value" columns for every row.
transform_keys(expr: Column, f: (Column, Column) => Column)
  Transforms the map by applying the function to every key-value pair and returns a transformed map.
transform_values(expr: Column, f: (Column, Column) => Column)
  Transforms the map by applying the function to every key-value pair and returns a transformed map.
map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column)
  Merges two maps into a single map.
element_at(column: Column, value: Any)
  Returns the value of a key in a map.
size(e: Column)
  Returns the length of a map column.

Before we start, let’s create a DataFrame  with some sample data to work with.

val structureData = Seq(
  Row("36636","Finance",Row(3000,"USA")),
  Row("40288","Finance",Row(5000,"IND")),
  Row("42114","Sales",Row(3900,"USA")),
  Row("39192","Marketing",Row(2500,"CAN")),
  Row("34534","Sales",Row(6500,"USA"))
)
val structureSchema = new StructType()
  .add("id",StringType)
  .add("dept",StringType)
  .add("properties",new StructType()
    .add("salary",IntegerType)
    .add("location",StringType)
  )
var df = spark.createDataFrame(
  spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()
df.show(false)
Outputs the below schema and data.

root
|-- id: string (nullable = true)
|-- dept: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- salary: integer (nullable = true)
| |-- location: string (nullable = true)
+-----+---------+-----------+
|id |dept |properties |
+-----+---------+-----------+
|36636|Finance |[3000, USA]|
|40288|Finance |[5000, IND]|
|42114|Sales |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales |[6500, USA]|
+-----+---------+-----------+
map() – Spark SQL map functions

Syntax - map(cols: Column*): Column


The org.apache.spark.sql.functions.map() SQL function is used to create a map column of MapType on a DataFrame. The input columns to the map function must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, …).
Note: All key columns must have the same data type and can't be null, and all value columns must have the same data type. The below snippet converts all columns from the "properties" struct into map key-value pairs in a "propertiesMap" column.

val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field => {
  columns.add(lit(field.name))
  columns.add(col("properties." + field.name))
})
df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
df = df.drop("properties")
df.printSchema()
df.show(false)
First, we find the "properties" column on the Spark DataFrame using df.schema.fieldIndex("properties") and retrieve all columns and their values into a LinkedHashSet. We need a LinkedHashSet in order to maintain the insertion order of the key and value pairs, and finally we use the map() function with the key, value set pairs.

root
|-- id: string (nullable = true)
|-- dept: string (nullable = true)
|-- propertiesMap: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+-----+---------+---------------------------------+
|id |dept |propertiesMap |
+-----+---------+---------------------------------+
|36636|Finance |[salary -> 3000, location -> USA]|
|40288|Finance |[salary -> 5000, location -> IND]|
|42114|Sales |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
map_keys() – Returns map keys from a Spark SQL DataFrame

Syntax - map_keys(e: Column): Column


Use the map_keys() Spark function in order to retrieve all keys from a Spark DataFrame MapType column. Note that map_keys takes an argument of MapType; passing any other type returns an error at run time.

df.select(col("id"),map_keys(col("propertiesMap"))).show(false)
Outputs all map keys from a Spark DataFrame

+-----+-----------------------+
|id |map_keys(propertiesMap)|
+-----+-----------------------+
|36636|[salary, location] |
|40288|[salary, location] |
|42114|[salary, location] |
|39192|[salary, location] |
|34534|[salary, location] |
+-----+-----------------------+
map_values() – Returns map values from a Spark DataFrame

Syntax - map_values(e: Column): Column


Use the map_values() Spark function in order to retrieve all values from a Spark DataFrame MapType column. Note that map_values takes an argument of MapType; passing any other type returns an error at run time.

df.select(col("id"),map_values(col("propertiesMap")))
.show(false)
Outputs following.

+-----+-------------------------+
|id |map_values(propertiesMap)|
+-----+-------------------------+
|36636|[3000, USA] |
|40288|[5000, IND] |
|42114|[3900, USA] |
|39192|[2500, CAN] |
|34534|[6500, USA] |
+-----+-------------------------+
map_concat() – Concatenating two or more maps on DataFrame

Syntax - map_concat(cols: Column*): Column


Use the Spark SQL map_concat() function in order to concatenate keys and values from more than one map into a single map. All arguments to this function should be MapType; passing any other type results in a run-time error.

val arrayStructureData = Seq(
  Row("James",List(Row("Newark","NY"),Row("Brooklyn","NY")),
    Map("hair"->"black","eye"->"brown"), Map("height"->"5.9")),
  Row("Michael",List(Row("SanJose","CA"),Row("Sandiago","CA")),
    Map("hair"->"brown","eye"->"black"),Map("height"->"6")),
  Row("Robert",List(Row("LasVegas","NV")),
    Map("hair"->"red","eye"->"gray"),Map("height"->"6.3")),
  Row("Maria",null,Map("hair"->"blond","eye"->"red"),Map("height"->"5.6")),
  Row("Jen",List(Row("LAX","CA"),Row("Orange","CA")),
    Map("white"->"black","eye"->"black"),Map("height"->"5.2"))
)
val arrayStructureSchema = new StructType()
  .add("name",StringType)
  .add("addresses", ArrayType(new StructType()
    .add("city",StringType)
    .add("state",StringType)))
  .add("properties", MapType(StringType,StringType))
  .add("secondProp", MapType(StringType,StringType))
val concatDF = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
concatDF.withColumn("mapConcat",map_concat(col("properties"),col("secondProp")))
  .select("name","mapConcat")
  .show(false)
Output:

+-------+---------------------------------------------+
|name |mapConcat |
+-------+---------------------------------------------+
|James |[hair -> black, eye -> brown, height -> 5.9] |
|Michael|[hair -> brown, eye -> black, height -> 6] |
|Robert |[hair -> red, eye -> gray, height -> 6.3] |
|Maria |[hair -> blond, eye -> red, height -> 5.6] |
|Jen |[white -> black, eye -> black, height -> 5.2]|
+-------+---------------------------------------------+
map_from_entries() – convert array of StructType entries to map
Use the map_from_entries() SQL function to convert an array of StructType entries to a map (MapType) on a Spark DataFrame. This function takes a DataFrame column of ArrayType[StructType] as an argument; passing any other type results in an error.

Syntax - map_from_entries(e: Column): Column

concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
.select("name","mapFromEntries")
.show(false)
Output:
+-------+-------------------------------+
|name |mapFromEntries |
+-------+-------------------------------+
|James |[Newark -> NY, Brooklyn -> NY] |
|Michael|[SanJose -> CA, Sandiago -> CA]|
|Robert |[LasVegas -> NV] |
|Maria |null |
|Jen |[LAX -> CA, Orange -> CA] |
+-------+-------------------------------+
map_entries() – convert map of StructType to array of StructType

Syntax - map_entries(e: Column): Column


Use the Spark SQL map_entries() function to convert a map of StructType to an array of StructType on a DataFrame column.
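A minimal sketch (not from the original article) on the df with the propertiesMap column created earlier, assuming map_entries is imported from org.apache.spark.sql.functions; it goes in the opposite direction of map_from_entries():

// Each map entry becomes a struct with "key" and "value" fields
df.select(col("id"), map_entries(col("propertiesMap")).as("mapEntries"))
  .show(false)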
Complete Spark SQL map functions example

package com.sparkbyexamples.spark.dataframe.functions.collection
import org.apache.spark.sql.functions.{col, explode, lit, map, map_concat,
  map_from_entries, map_keys, map_values}
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StringType,
  StructType}
import org.apache.spark.sql.{Column, Row, SparkSession}
import scala.collection.mutable
object MapFunctions extends App {
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()
  import spark.implicits._
  val structureData = Seq(
    Row("36636","Finance",Row(3000,"USA")),
    Row("40288","Finance",Row(5000,"IND")),
    Row("42114","Sales",Row(3900,"USA")),
    Row("39192","Marketing",Row(2500,"CAN")),
    Row("34534","Sales",Row(6500,"USA"))
  )
  val structureSchema = new StructType()
    .add("id",StringType)
    .add("dept",StringType)
    .add("properties",new StructType()
      .add("salary",IntegerType)
      .add("location",StringType)
    )
  var df = spark.createDataFrame(
    spark.sparkContext.parallelize(structureData),structureSchema)
  df.printSchema()
  df.show(false)
  // Convert to Map
  val index = df.schema.fieldIndex("properties")
  val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
  var columns = mutable.LinkedHashSet[Column]()
  propSchema.fields.foreach(field => {
    columns.add(lit(field.name))
    columns.add(col("properties." + field.name))
  })
  df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
  df = df.drop("properties")
  df.printSchema()
  df.show(false)
  // Retrieve all keys from a Map
  val keys =
    df.select(explode(map_keys($"propertiesMap"))).as[String].distinct.collect
  print(keys.mkString(","))
  // map_keys
  df.select(col("id"),map_keys(col("propertiesMap")))
    .show(false)
  // map_values
  df.select(col("id"),map_values(col("propertiesMap")))
    .show(false)
  // Creating DF with MapType
  val arrayStructureData = Seq(
    Row("James",List(Row("Newark","NY"),Row("Brooklyn","NY")),
      Map("hair"->"black","eye"->"brown"), Map("height"->"5.9")),
    Row("Michael",List(Row("SanJose","CA"),Row("Sandiago","CA")),
      Map("hair"->"brown","eye"->"black"),Map("height"->"6")),
    Row("Robert",List(Row("LasVegas","NV")),
      Map("hair"->"red","eye"->"gray"),Map("height"->"6.3")),
    Row("Maria",null,Map("hair"->"blond","eye"->"red"),Map("height"->"5.6")),
    Row("Jen",List(Row("LAX","CA"),Row("Orange","CA")),
      Map("white"->"black","eye"->"black"),Map("height"->"5.2"))
  )
  val arrayStructureSchema = new StructType()
    .add("name",StringType)
    .add("addresses", ArrayType(new StructType()
      .add("city",StringType)
      .add("state",StringType)))
    .add("properties", MapType(StringType,StringType))
    .add("secondProp", MapType(StringType,StringType))
  val concatDF = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
  concatDF.printSchema()
  concatDF.show()

  concatDF.withColumn("mapConcat",map_concat(col("properties"),col("secondProp")))
    .select("name","mapConcat")
    .show(false)
  concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
    .select("name","mapFromEntries")
    .show(false)
}
Conclusion
In this article, you have learned how to convert an array of StructType to a map, convert a map of StructType to an array, and concatenate several maps using SQL map functions on Spark DataFrame columns.
Spark SQL Sort functions – complete list
Post author:Naveen (NNK)
Post category:Apache Spark  / Spark SQL Functions
Post last modified: February 14, 2023
Spark SQL provides built-in standard sort functions defined in the DataFrame API; these come in handy when we need to sort DataFrame columns. All of them accept input as a column name in a String and return a Column type.
When possible, try to leverage the standard library functions, as they are a little more compile-time safe, handle nulls, and perform better when compared to UDFs. If your application is performance-critical, try to avoid custom UDFs at all costs, as UDFs do not guarantee performance.
Spark SQL sort functions are grouped as "sort_funcs" in Spark SQL; these sort functions come in handy when we want to perform any ascending or descending operations on columns.
These are primarily used with the sort function of the DataFrame or Dataset.
asc(columnName: String): Column
  asc function is used to specify the ascending order of the sorting column on DataFrame or DataSet.
asc_nulls_first(columnName: String): Column
  Similar to asc function but null values return first and then non-null values.
asc_nulls_last(columnName: String): Column
  Similar to asc function but non-null values return first and then null values.
desc(columnName: String): Column
  desc function is used to specify the descending order of the DataFrame or DataSet sorting column.
desc_nulls_first(columnName: String): Column
  Similar to desc function but null values return first and then non-null values.
desc_nulls_last(columnName: String): Column
  Similar to desc function but non-null values return first and then null values.
asc() – ascending function
asc function is used to specify the ascending order of the sorting column on
DataFrame or DataSet.
Syntax: asc(columnName: String): Column
Copy
asc_nulls_first() – ascending with nulls first
Similar to asc function but null values return first and then non-null values.
asc_nulls_first(columnName: String): Column
Copy
asc_nulls_last()  – ascending with nulls last
Similar to asc function but non-null values return first and then null values.
asc_nulls_last(columnName: String): Column
Copy
desc() – descending function
desc function is used to specify the descending order of the  DataFrame or
DataSet sorting column.
desc(columnName: String): Column
Copy
desc_nulls_first() – descending with nulls first
Similar to desc function but null values return first and then non-null values.
desc_nulls_first(columnName: String): Column
Copy
desc_nulls_last() – descending with nulls last
Similar to desc function but non-null values return first and then null values.
desc_nulls_last(columnName: String): Column
Copy
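As a minimal sketch (assuming a DataFrame df with "department" and "salary" columns, like the employee data used elsewhere on this site), the sort functions above are passed to sort() or orderBy():

import org.apache.spark.sql.functions._

// ascending by department, then descending by salary with nulls placed last
df.sort(asc("department"), desc_nulls_last("salary"))
  .show(false)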
Reference : Spark Functions scala code
Spark SQL Aggregate Functions
Post author:Naveen (NNK)
Post category:Apache Spark  / Spark SQL Functions
Post last modified: February 14, 2023
Spark SQL provides built-in standard aggregate functions defined in the DataFrame API; these come in handy when we need to perform aggregate operations on DataFrame columns. Aggregate functions operate on a group of rows and calculate a single return value for every group.
All these aggregate functions accept input as a Column type or column name in a String, plus several other arguments based on the function, and return a Column type.
When possible, try to leverage the standard library functions, as they are a little more compile-time safe, handle nulls, and perform better when compared to UDFs. If your application is performance-critical, try to avoid custom UDFs at all costs, as they do not guarantee performance.
Spark Aggregate Functions
Spark SQL aggregate functions are grouped as "agg_funcs" in Spark SQL. Below is a list of functions defined under this group. Click on each link to learn with a Scala example.
Note that every function below has another signature which takes a String column name instead of a Column.
approx_count_distinct(e: Column)
  Returns the count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)
  Returns the count of distinct items in a group.
avg(e: Column)
  Returns the average of values in the input column.
collect_list(e: Column)
  Returns all values from an input column with duplicates.
collect_set(e: Column)
  Returns all values from an input column with duplicate values eliminated.
corr(column1: Column, column2: Column)
  Returns the Pearson Correlation Coefficient for two columns.
count(e: Column)
  Returns the number of elements in a column.
countDistinct(expr: Column, exprs: Column*)
  Returns the number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)
  Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)
  Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)
  Returns the first element in a column; when ignoreNulls is set to true, it returns the first non-null element.
first(e: Column): Column
  Returns the first element in a column.
grouping(e: Column)
  Indicates whether a specified column in a GROUP BY list is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)
  Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)
  Returns the last element in a column; when ignoreNulls is set to true, it returns the last non-null element.
last(e: Column)
  Returns the last element in a column.
max(e: Column)
  Returns the maximum value in a column.
mean(e: Column)
  Alias for avg. Returns the average of the values in a column.
min(e: Column)
  Returns the minimum value in a column.
skewness(e: Column)
  Returns the skewness of the values in a group.
stddev(e: Column)
  Alias for `stddev_samp`.
stddev_samp(e: Column)
  Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)
  Returns the population standard deviation of the values in a column.
sum(e: Column)
  Returns the sum of all values in a column.
sumDistinct(e: Column)
  Returns the sum of all distinct values in a column.
variance(e: Column)
  Alias for `var_samp`.
var_samp(e: Column)
  Returns the unbiased variance of the values in a column.
var_pop(e: Column)
  Returns the population variance of the values in a column.
Aggregate Functions Examples
First, let's create a DataFrame to work with aggregate functions. All the examples provided here are also available at the GitHub project.

import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),
  ("Michael", "Sales", 4600),
  ("Robert", "Sales", 4100),
  ("Maria", "Finance", 3000),
  ("James", "Sales", 3000),
  ("Scott", "Finance", 3300),
  ("Jen", "Finance", 3900),
  ("Jeff", "Marketing", 3000),
  ("Kumar", "Marketing", 2000),
  ("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()
Copy
Yields below output.
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+-------------+----------+------+
Copy
approx_count_distinct Aggregate Function
approx_count_distinct() function returns the count of distinct items in a group.

//approx_count_distinct()
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))

//Prints approx_count_distinct: 6
Copy
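The second signature listed in the table takes a maximum relative standard deviation (rsd); a minimal sketch (not from the original article) allowing 10% estimation error:

//approx_count_distinct() with rsd
println("approx_count_distinct with rsd: "+
  df.select(approx_count_distinct("salary", 0.1)).collect()(0)(0))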
avg (average) Aggregate Function
avg() function returns the average of values in the input column.

//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))

//Prints avg: 3400.0


Copy
collect_list Aggregate Function
collect_list() function returns all values from an input column with duplicates.

//collect_list
df.select(collect_list("salary")).show(false)

+------------------------------------------------------------+
|collect_list(salary) |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+
Copy
collect_set Aggregate Function
collect_set() function returns all values from an input column with duplicate values
eliminated.

//collect_set
df.select(collect_set("salary")).show(false)

+------------------------------------+
|collect_set(salary) |
+------------------------------------+
|[4600, 3000, 3900, 4100, 3300, 2000]|
+------------------------------------+
Copy
countDistinct Aggregate Function
countDistinct() function returns the number of distinct elements in a columns

//countDistinct
val df2 = df.select(countDistinct("department", "salary"))
df2.show(false)
println("Distinct Count of Department & Salary: "+ df2.collect()(0)(0))
Copy
count function()
count() function returns number of elements in a column.

println("count: "+
df.select(count("salary")).collect()(0))

Prints count: 10
Copy
grouping function()
grouping() indicates whether a given input column is aggregated or not. It returns 1 for aggregated or 0 for not aggregated in the result. If you try grouping directly on the salary column, you will get the below error.

Exception in thread "main" org.apache.spark.sql.AnalysisException:


// grouping() can only be used with GroupingSets/Cube/Rollup
Copy
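As the error says, grouping() has to be combined with cube(), rollup(), or grouping sets; a minimal sketch (not from the original article) using cube() on the department column:

df.cube("department")
  .agg(grouping("department").as("grouping"), sum("salary").as("sum_salary"))
  .show(false)
// grouping = 1 on the grand-total row where department is rolled up, 0 otherwise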
first function()
first() function returns the first element in a column. When ignoreNulls is set to true, it returns the first non-null element.

//first
df.select(first("salary")).show(false)

+--------------------+
|first(salary, false)|
+--------------------+
|3000 |
+--------------------+
Copy
last()
last() function returns the last element in a column. When ignoreNulls is set to true, it returns the last non-null element.

//last
df.select(last("salary")).show(false)

+-------------------+
|last(salary, false)|
+-------------------+
|4100 |
+-------------------+
Copy
kurtosis()
kurtosis() function returns the kurtosis of the values in a group.

df.select(kurtosis("salary")).show(false)

+-------------------+
|kurtosis(salary) |
+-------------------+
|-0.6467803030303032|
+-------------------+
Copy
max()
max() function returns the maximum value in a column.

df.select(max("salary")).show(false)

+-----------+
|max(salary)|
+-----------+
|4600 |
+-----------+
Copy
min()
min() function returns the minimum value in a column.

df.select(min("salary")).show(false)
+-----------+
|min(salary)|
+-----------+
|2000 |
+-----------+
Copy
mean()
mean() function returns the average of the values in a column. Alias for Avg

df.select(mean("salary")).show(false)

+-----------+
|avg(salary)|
+-----------+
|3400.0 |
+-----------+
Copy
skewness()
skewness() function returns the skewness of the values in a group.

df.select(skewness("salary")).show(false)

+--------------------+
|skewness(salary) |
+--------------------+
|-0.12041791181069571|
+--------------------+
Copy
stddev(), stddev_samp() and stddev_pop()
stddev() alias for stddev_samp.
stddev_samp() function returns the sample standard deviation of values in a
column.
stddev_pop() function returns the population standard deviation of the values in a
column.

df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)

+-------------------+-------------------+------------------+
|stddev_samp(salary)|stddev_samp(salary)|stddev_pop(salary)|
+-------------------+-------------------+------------------+
|765.9416862050705 |765.9416862050705 |726.636084983398 |
+-------------------+-------------------+------------------+
Copy
sum()
sum() function Returns the sum of all values in a column.

df.select(sum("salary")).show(false)

+-----------+
|sum(salary)|
+-----------+
|34000 |
+-----------+
Copy
sumDistinct()
sumDistinct() function returns the sum of all distinct values in a column.

df.select(sumDistinct("salary")).show(false)

+--------------------+
|sum(DISTINCT salary)|
+--------------------+
|20900 |
+--------------------+
Copy
variance(), var_samp(), var_pop()
variance() alias for var_samp
var_samp() function returns the unbiased variance of the values in a column.
var_pop() function returns the population variance of the values in a column.

df.select(variance("salary"),var_samp("salary"),var_pop("salary"))
.show(false)

+-----------------+-----------------+---------------+
|var_samp(salary) |var_samp(salary) |var_pop(salary)|
+-----------------+-----------------+---------------+
|586666.6666666666|586666.6666666666|528000.0 |
+-----------------+-----------------+---------------+
Copy
Source code of Spark SQL Aggregate Functions examples

package com.sparkbyexamples.spark.dataframe.functions.aggregate

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregateFunctions extends App {

val spark: SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),


("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff ", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()

//approx_count_distinct()
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))

//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))

//collect_list

df.select(collect_list("salary")).show(false)

//collect_set

df.select(collect_set("salary")).show(false)

//countDistinct
val df2 = df.select(countDistinct("department", "salary"))
df2.show(false)
println("Distinct Count of Department & Salary: "+ df2.collect()(0)(0))

println("count: "+
df.select(count("salary")).collect()(0))
//first
df.select(first("salary")).show(false)

//last
df.select(last("salary")).show(false)

//Exception in thread "main" org.apache.spark.sql.AnalysisException:


// grouping() can only be used with GroupingSets/Cube/Rollup;
//df.select(grouping("salary")).show(false)

df.select(kurtosis("salary")).show(false)

df.select(max("salary")).show(false)

df.select(min("salary")).show(false)

df.select(mean("salary")).show(false)

df.select(skewness("salary")).show(false)

df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)

df.select(sum("salary")).show(false)

df.select(sumDistinct("salary")).show(false)

df.select(variance("salary"),var_samp("salary"),
var_pop("salary")).show(false)
}
Copy
Conclusion
In this article, I've consolidated and listed all Spark SQL aggregate functions with Scala examples, and also covered the benefits of using Spark SQL functions.
Happy Learning !!
Spark Window Functions with Examples
Post author: Naveen (NNK)
Post category: Apache Spark / Spark SQL Functions
Post last modified: January 17, 2023
Spark Window functions are used to calculate results such as the rank, row number, etc. over a range of input rows, and they are available by importing org.apache.spark.sql.functions._. This article explains the concept of window functions, their usage and syntax, and finally how to use them with Spark SQL and Spark's DataFrame API. They come in handy when we need to perform aggregate operations within a specific window frame on DataFrame columns.
When possible, prefer these standard library functions: they offer a bit more compile-time safety, handle nulls, and perform better than UDFs. If your application is performance-critical, avoid custom UDFs at all costs, as their performance is not guaranteed.
1. Spark Window Functions
Spark Window functions operate on a group of rows (like frame, partition) and
return a single value for every input row. Spark SQL supports three kinds of window
functions:
ranking functions
analytic functions
aggregate functions

Spark Window Functions


The below table defines the Ranking and Analytic functions; for aggregate functions, we can use any existing aggregate function as a window function.
To perform an operation on a group, we first need to partition the data using Window.partitionBy(), and for the row number and rank functions we additionally need to order the partitioned data using an orderBy clause.
Click on each link to know more about these functions along with the Scala examples.
row_number(): Column – Returns a sequential number starting from 1 within a window partition.
rank(): Column – Returns the rank of rows within a window partition, with gaps.
percent_rank(): Column – Returns the percentile rank of rows within a window partition.
dense_rank(): Column – Returns the rank of rows within a window partition without any gaps, whereas rank() returns rank with gaps.
ntile(n: Int): Column – Returns the ntile id in a window partition.
cume_dist(): Column – Returns the cumulative distribution of values within a window partition.
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column – Returns the value that is `offset` rows before the current row, and `null` if there are fewer than `offset` rows before the current row.
lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column – Returns the value that is `offset` rows after the current row, and `null` if there are fewer than `offset` rows after the current row.
Before we start with an example, first let's create a DataFrame to work with.

import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),


("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff ", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()
Copy
Yields below output
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff | Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+-------------+----------+------+
Copy
2. Spark Window Ranking functions
2.1 row_number Window Function
row_number() window function is used to assign a sequential row number, starting from 1, to each row within a window partition.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number.over(windowSpec))
.show()
Copy
Yields below output.

+-------------+----------+------+----------+
|employee_name|department|salary|row_number|
+-------------+----------+------+----------+
| James| Sales| 3000| 1|
| James| Sales| 3000| 2|
| Robert| Sales| 4100| 3|
| Saif| Sales| 4100| 4|
| Michael| Sales| 4600| 5|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 2|
| Jen| Finance| 3900| 3|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+----------+
Copy
2.2 rank Window Function
rank() window function is used to provide a rank to each row within a window partition. This function leaves gaps in rank when there are ties.

import org.apache.spark.sql.functions._
//rank
df.withColumn("rank",rank().over(windowSpec))
.show()
Copy
Yields below output.

+-------------+----------+------+----+
|employee_name|department|salary|rank|
+-------------+----------+------+----+
| James| Sales| 3000| 1|
| James| Sales| 3000| 1|
| Robert| Sales| 4100| 3|
| Saif| Sales| 4100| 3|
| Michael| Sales| 4600| 5|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 2|
| Jen| Finance| 3900| 3|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+----+
Copy
This is the same as the RANK function in SQL.
2.3 dense_rank Window Function
dense_rank() window function is used to get the rank of rows within a window partition without any gaps. It is similar to rank(), the difference being that rank() leaves gaps in rank when there are ties.

import org.apache.spark.sql.functions._
//dense_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec))
.show()
Copy
Yields below output.

+-------------+----------+------+----------+
|employee_name|department|salary|dense_rank|
+-------------+----------+------+----------+
| James| Sales| 3000| 1|
| James| Sales| 3000| 1|
| Robert| Sales| 4100| 2|
| Saif| Sales| 4100| 2|
| Michael| Sales| 4600| 3|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 2|
| Jen| Finance| 3900| 3|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+----------+
Copy
This is the same as the DENSE_RANK function in SQL.
2.4 percent_rank Window Function
percent_rank() window function returns the relative rank of rows within a window partition as a fraction between 0 and 1, computed as (rank - 1) / (number of rows in the partition - 1).

import org.apache.spark.sql.functions._
//percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec))
.show()
Copy
Yields below output.

+-------------+----------+------+------------+
|employee_name|department|salary|percent_rank|
+-------------+----------+------+------------+
| James| Sales| 3000| 0.0|
| James| Sales| 3000| 0.0|
| Robert| Sales| 4100| 0.5|
| Saif| Sales| 4100| 0.5|
| Michael| Sales| 4600| 1.0|
| Maria| Finance| 3000| 0.0|
| Scott| Finance| 3300| 0.5|
| Jen| Finance| 3900| 1.0|
| Kumar| Marketing| 2000| 0.0|
| Jeff | Marketing| 3000| 1.0|
+-------------+----------+------+------------+
Copy
This is the same as the PERCENT_RANK function in SQL.
2.5 ntile Window Function
ntile() window function distributes the rows of a window partition into n buckets and returns the bucket number of each row. In the below example we pass 2 as the argument to ntile, hence it returns a bucket number between 1 and 2.

//ntile
df.withColumn("ntile",ntile(2).over(windowSpec))
.show()
Copy
Yields below output.

+-------------+----------+------+-----+
|employee_name|department|salary|ntile|
+-------------+----------+------+-----+
| James| Sales| 3000| 1|
| James| Sales| 3000| 1|
| Robert| Sales| 4100| 1|
| Saif| Sales| 4100| 2|
| Michael| Sales| 4600| 2|
| Maria| Finance| 3000| 1|
| Scott| Finance| 3300| 1|
| Jen| Finance| 3900| 2|
| Kumar| Marketing| 2000| 1|
| Jeff | Marketing| 3000| 2|
+-------------+----------+------+-----+
Copy
This is the same as the NTILE function in SQL.
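Since each of these ranking functions has a direct SQL counterpart, the same results can be produced with a plain SQL query. Below is a minimal sketch (assuming the df created above; the temporary view name employee_tbl and the column aliases are only for illustration):

// Register a temporary view and express the ranking functions in Spark SQL
df.createOrReplaceTempView("employee_tbl")
spark.sql(
  """SELECT employee_name, department, salary,
    |  ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary) AS row_num,
    |  RANK()       OVER (PARTITION BY department ORDER BY salary) AS rnk,
    |  DENSE_RANK() OVER (PARTITION BY department ORDER BY salary) AS dense_rnk
    |FROM employee_tbl""".stripMargin)
  .show()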
3. Spark Window Analytic functions
3.1 cume_dist Window Function
cume_dist() window function is used to get the cumulative distribution of values
within a window partition.
This is the same as the DENSE_RANK function in SQL.

//cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec))
.show()
Copy

+-------------+----------+------+------------------+
|employee_name|department|salary| cume_dist|
+-------------+----------+------+------------------+
| James| Sales| 3000| 0.4|
| James| Sales| 3000| 0.4|
| Robert| Sales| 4100| 0.8|
| Saif| Sales| 4100| 0.8|
| Michael| Sales| 4600| 1.0|
| Maria| Finance| 3000|0.3333333333333333|
| Scott| Finance| 3300|0.6666666666666666|
| Jen| Finance| 3900| 1.0|
| Kumar| Marketing| 2000| 0.5|
| Jeff | Marketing| 3000| 1.0|
+-------------+----------+------+------------------+
Copy
3.2 lag Window Function
lag() window function returns the value that is `offset` rows before the current row within the partition, and null when there are fewer than `offset` preceding rows; here we use an offset of 2 on the salary column. This is the same as the LAG function in SQL.

//lag
df.withColumn("lag",lag("salary",2).over(windowSpec))
.show()
Copy

+-------------+----------+------+----+
|employee_name|department|salary| lag|
+-------------+----------+------+----+
| James| Sales| 3000|null|
| James| Sales| 3000|null|
| Robert| Sales| 4100|3000|
| Saif| Sales| 4100|3000|
| Michael| Sales| 4600|4100|
| Maria| Finance| 3000|null|
| Scott| Finance| 3300|null|
| Jen| Finance| 3900|3000|
| Kumar| Marketing| 2000|null|
| Jeff | Marketing| 3000|null|
+-------------+----------+------+----+
Copy
3.3 lead Window Function
lead() window function returns the value that is `offset` rows after the current row within the partition, and null when there are fewer than `offset` following rows; here we again use an offset of 2. This is the same as the LEAD function in SQL. A sketch of the default-value variants of lag() and lead() follows the output below.

//lead
df.withColumn("lead",lead("salary",2).over(windowSpec))
.show()
Copy

+-------------+----------+------+----+
|employee_name|department|salary|lead|
+-------------+----------+------+----+
| James| Sales| 3000|4100|
| James| Sales| 3000|4100|
| Robert| Sales| 4100|4600|
| Saif| Sales| 4100|null|
| Michael| Sales| 4600|null|
| Maria| Finance| 3000|3900|
| Scott| Finance| 3300|null|
| Jen| Finance| 3900|null|
| Kumar| Marketing| 2000|null|
| Jeff | Marketing| 3000|null|
+-------------+----------+------+----+
Copy
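Both lag() and lead() also have a three-argument variant that takes a default value to return instead of null when the offset falls outside the partition. A minimal sketch (the default value 0 here is only for illustration):

// Replace the leading/trailing nulls with a default value of 0
df.withColumn("lag", lag("salary", 2, 0).over(windowSpec))
  .withColumn("lead", lead("salary", 2, 0).over(windowSpec))
  .show()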
4. Spark Window Aggregate Functions
In this section, I will explain how to calculate the avg, sum, min, and max for each department using Spark SQL Aggregate window functions and WindowSpec. When working with aggregate window functions, we don't need an orderBy clause in the window specification.

val windowSpecAgg = Window.partitionBy("department")

val aggDF = df.withColumn("row", row_number.over(windowSpec))
  .withColumn("avg", avg(col("salary")).over(windowSpecAgg))
  .withColumn("sum", sum(col("salary")).over(windowSpecAgg))
  .withColumn("min", min(col("salary")).over(windowSpecAgg))
  .withColumn("max", max(col("salary")).over(windowSpecAgg))
  .where(col("row") === 1)
  .select("department", "avg", "sum", "min", "max")
  .show()
Copy
This yields below output

+----------+------+-----+----+----+
|department| avg| sum| min| max|
+----------+------+-----+----+----+
| Sales|3760.0|18800|3000|4600|
| Finance|3400.0|10200|3000|3900|
| Marketing|2500.0| 5000|2000|3000|
+----------+------+-----+----+----+
Copy
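Note that windowSpecAgg above deliberately has no orderBy. If an orderBy is added, each aggregate is computed over the default frame (from the start of the partition up to the current row's ordering value), which effectively turns it into a running aggregate. A minimal sketch, assuming the same df; the runningSpec name is only for illustration:

// With orderBy in the spec, sum() becomes a running total within each department
val runningSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("running_sum", sum(col("salary")).over(runningSpec))
  .show()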
Please refer to Spark SQL Aggregate Functions for more aggregate functions.
5. Source Code of Window Functions Example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object WindowFunctions extends App {

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),


("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff ", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()

//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number.over(windowSpec))
.show()

//rank
df.withColumn("rank",rank().over(windowSpec))
.show()

//dens_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec))
.show()

//percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec))
.show()

//ntile
df.withColumn("ntile",ntile(2).over(windowSpec))
.show()

//cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec))
.show()

//lag
df.withColumn("lag",lag("salary",2).over(windowSpec))
.show()

//lead
df.withColumn("lead",lead("salary",2).over(windowSpec))
.show()
//Aggregate Functions
val windowSpecAgg = Window.partitionBy("department")
val aggDF = df.withColumn("row",row_number.over(windowSpec))
.withColumn("avg", avg(col("salary")).over(windowSpecAgg))
.withColumn("sum", sum(col("salary")).over(windowSpecAgg))
.withColumn("min", min(col("salary")).over(windowSpecAgg))
.withColumn("max", max(col("salary")).over(windowSpecAgg))
.where(col("row")===1).select("department","avg","sum","min","max")
.show()
}
Copy
The complete source code is available at  GitHub  for reference.
6. Conclusion
In this tutorial, you have learned what Spark SQL Window functions are, their syntax, and how to use them with aggregate functions, along with several examples in Scala.
Spark Most Used JSON Functions with Examples
Post author: Naveen (NNK)
Post category: Apache Spark
Post last modified: January 31, 2023
Spark SQL provides a set of JSON functions to parse a JSON string and to query and extract specific values from it. In this article, I will explain the most used JSON functions with Scala examples.
1. Spark JSON Functions
from_json() – Converts a JSON string into a Struct type or Map type.
to_json() – Converts a MapType or Struct type to a JSON string.
json_tuple() – Extracts data from a JSON string and creates them as new columns.
get_json_object() – Extracts a JSON element from a JSON string based on the specified JSON path.
schema_of_json() – Creates a schema string from a JSON string.
2. Create DataFrame with Column contains JSON String
In order to explain these functions, first let's create a DataFrame with a column that contains a JSON string.

val jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC
PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
import spark.implicits._
val df=data.toDF("id","value")
df.show(false)

//+---+--------------------------------------------------------------------------+
//|id |value |
//+---+--------------------------------------------------------------------------+
//|1 |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC
PARQUE","State":"PR"}|
//+---+--------------------------------------------------------------------------+
Copy
3. from_json() – Converts JSON string into Struct type or Map type
The below example converts the JSON string into Map key-value pairs. Converting to a struct type is sketched after the output below; also refer to Convert JSON string to Struct type column.

import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{MapType, StringType}

val df2 = df.withColumn("value", from_json(col("value"), MapType(StringType, StringType)))
df2.printSchema()
df2.show(false)

//root
// |-- id: integer (nullable = false)
// |-- value: map (nullable = true)
// | |-- key: string
// | |-- value: string (valueContainsNull = true)

//+---+---------------------------------------------------------------------------+
//|id |value |
//+---+---------------------------------------------------------------------------+
//|1 |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
//+---+---------------------------------------------------------------------------+
Copy
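As mentioned above, from_json() can also take a StructType schema instead of a MapType. Below is a minimal sketch; the zipSchema name and the field types are my own choice based on the sample JSON, not something defined in this article.

// Hypothetical schema built to match the sample JSON above
import org.apache.spark.sql.types.{StructType, StringType, IntegerType}
val zipSchema = new StructType()
  .add("Zipcode", IntegerType)
  .add("ZipCodeType", StringType)
  .add("City", StringType)
  .add("State", StringType)

df.withColumn("value", from_json(col("value"), zipSchema))
  .printSchema()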
4. to_json() – Converts MapType or Struct type to JSON string
Here, I am using the df2 created in the from_json() example above.

import org.apache.spark.sql.functions.to_json

df2.withColumn("value", to_json(col("value")))
  .show(false)

//+---+----------------------------------------------------------------------------+
//|id |value |
//+---+----------------------------------------------------------------------------+
//|1 |{"Zipcode":"704","ZipCodeType":"STANDARD","City":"PARC
PARQUE","State":"PR"}|
//+---+----------------------------------------------------------------------------+
Copy
5. json_tuple() – Extract the Data from JSON and create them as new columns

import org.apache.spark.sql.functions.json_tuple

df.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City"))
  .toDF("id", "Zipcode", "ZipCodeType", "City")
  .show(false)

//+---+-------+-----------+-----------+
//|id |Zipcode|ZipCodeType|City |
//+---+-------+-----------+-----------+
//|1 |704 |STANDARD |PARC PARQUE|
//+---+-------+-----------+-----------+
Copy
6. get_json_object() – Extracts JSON element from a JSON string based on the specified JSON path

import org.apache.spark.sql.functions.get_json_object

df.select(col("id"), get_json_object(col("value"), "$.ZipCodeType").as("ZipCodeType"))
  .show(false)

//+---+-----------+
//|id |ZipCodeType|
//+---+-----------+
//|1 |STANDARD |
//+---+-----------+
Copy
7. schema_of_json() – Create schema string from JSON string.

import org.apache.spark.sql.functions.{schema_of_json, lit}

val schemaStr = spark.range(1)
  .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}""")))
  .collect()(0)(0)
println(schemaStr)
//struct<City:string,State:string,ZipCodeType:string,Zipcode:bigint>
Copy
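On Spark 2.4 and later, from_json() also accepts the schema as a Column, so the output of schema_of_json() can be fed to it directly without building a schema by hand. A minimal sketch, assuming that Spark version and the df and jsonString defined above:

// Derive the schema from a sample record and use it to parse the JSON column
df.withColumn("value", from_json(col("value"), schema_of_json(lit(jsonString))))
  .printSchema()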
8. Complete Example

import org.apache.spark.sql.SparkSession

object JsonFunctions extends App {

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

val jsonString = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
import spark.implicits._
val df = data.toDF("id", "value")
df.show(false)

//from_json() - convert the JSON string column to a Map column
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{MapType, StringType}
val df2 = df.withColumn("value", from_json(col("value"), MapType(StringType, StringType)))
df2.printSchema()
df2.show(false)

//to_json() - convert the Map column back to a JSON string
import org.apache.spark.sql.functions.to_json
df2.withColumn("value", to_json(col("value")))
  .show(false)

//json_tuple() - extract JSON fields as new columns
import org.apache.spark.sql.functions.json_tuple
df.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City"))
  .toDF("id", "Zipcode", "ZipCodeType", "City")
  .show(false)

//get_json_object() - extract a single element by JSON path
import org.apache.spark.sql.functions.get_json_object
df.select(col("id"), get_json_object(col("value"), "$.ZipCodeType").as("ZipCodeType"))
  .show(false)

//schema_of_json() - derive a schema string from a JSON string
import org.apache.spark.sql.functions.{schema_of_json, lit}
val schemaStr = spark.range(1)
  .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}""")))
  .collect()(0)(0)
println(schemaStr)
}
Copy
