Time Series Analysis in Spark SQL
I presented a script to process data with a Time Series Analysis algorithm in SQL
and PL/SQL earlier. In this article, roughly 90% of the code is carried over from
that previous article and the remaining 10% is PySpark code.
I used the following URLs to install and set up Spark on my desktop.
https://www.youtube.com/watch?v=IQfG0faDrzE
https://www.youtube.com/watch?v=WQErwxRTiW0
http://media.sundog-soft.com/spark-python-install.pdf
There are slight variations in the way Time Series Analysis is performed from
presentation to presentation. For the SQL and PL/SQL programming, and for the
Spark SQL presented in this article, I mostly followed the video presentation
below on Time Series Analysis.
https://www.youtube.com/watch?v=HIWXdHlDSFs --TIME SERIES ANALYSIS
Questions to be answered:
01) Using the ratio-to-moving-average method, calculate seasonally adjusted indices
for each quarter.
02) Obtain a regression trend line representing the above data.
03) Obtain a seasonally adjusted trend estimate for the 4th quarter of 2011.
Please modify the code with the location of the CSV file on your machine.
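The layout of the CSV file is not reproduced here; based on how the columns are
used below, it is assumed to contain one row per year, with the year in the first
column followed by four quarterly value columns (e.g. Year, Q1, Q2, Q3, Q4).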
"""
#spark-submit.cmd python/pysparkTimeSeriesAnalysis.py
from pyspark.sql import SparkSession
from pyspark import SparkContext,SparkConf
from pyspark.sql.functions import *
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("TimeSeriesAnalysis") \
.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.debug.maxToStringFields",100)
spark.conf.set("spark.sql.crossJoin.enabled", "true") #To enable cartesian product
in sql
df = spark.read.csv("e:/data/TimeSeries.csv",inferSchema=True,header=True)
df.printSchema() #printSchema() prints the schema itself, so it does not need to be wrapped in print()
print(df.columns)
df.show()
#trim alternative: when a DataFrame column name contains spaces,
#select that column by position with df.columns[<position number>]
df.select(df.columns[3]).show()
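#Only a sketch (not part of the original flow): if the column names contain
#spaces, they can also be normalised once so that the columns can later be
#referenced by name instead of by position.
df_renamed = df
for c in df_renamed.columns:
    df_renamed = df_renamed.withColumnRenamed(c, c.strip().replace(" ", "_"))
df_renamed.printSchema()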
#unpivot the table, i.e. rotate the quarter columns into rows
#a crosstab-style reshape can be implemented with the "explode" function
df = df.select(array(col(df.columns[1]), col(df.columns[2]),
                     col(df.columns[3]), col(df.columns[4])).alias("val"))
df = df.withColumn("val", explode(col("val")))
df.show()
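#Only a sketch of an alternative unpivot using Spark SQL's stack() function;
#it re-reads the CSV because df has just been overwritten with the exploded
#values, and it assumes the quarterly values sit in columns 1..4.
raw = spark.read.csv("e:/data/TimeSeries.csv", inferSchema=True, header=True)
stack_expr = "stack(4, " + ", ".join(
    "'{0}', `{0}`".format(c) for c in raw.columns[1:5]
) + ") as (quarter, val)"
raw.selectExpr(stack_expr).show()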
w = Window().orderBy("val")
df.select("*", row_number().over(w).alias("id")).show() #add a rownum/row_num/rowid to the output data; numbering starts from "1"
df2 = df.withColumn("rowid", row_number().over(Window.orderBy(monotonically_increasing_id())) + 0) #rownum starts from "1"
df2.show()
df = df.repartition(1).withColumn("rnum", monotonically_increasing_id() + 1) #add a rownum to the output data
#by default monotonically_increasing_id starts with "0"; add "+ 1" to start with "1"
df.select("*").show() #rnum in the output now starts from "1"
df.registerTempTable("DF")
spark.catalog.cacheTable("DF")
spark.sql("select rnum + 1 as id,val from DF").show() #you can add "+ 1" while
selecting data from the DataFrame also
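#Note: registerTempTable is deprecated in Spark 2.x; the equivalent call there is
#df.createOrReplaceTempView("DF")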
#######################################
spark.sql("""
with
frqma01 as (select round(avg(val),2) frqma_val from DF where rnum>=1 and rnum<=4 ),
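#The full CTE chain is not shown above; what follows is only a rough sketch of
#the same ratio-to-moving-average step using window functions, not the original
#query. Edge rows use partial windows here and would be excluded in the full
#calculation of the seasonal indices.
spark.sql("""
with ma as (
  select rnum, val,
         avg(val) over (order by rnum rows between 2 preceding and 1 following) as ma4_back,
         avg(val) over (order by rnum rows between 1 preceding and 2 following) as ma4_fwd
  from DF
)
select rnum, val,
       round((ma4_back + ma4_fwd) / 2, 2)         as centred_ma,
       round(val / ((ma4_back + ma4_fwd) / 2), 4) as ratio_to_ma
from ma
order by rnum
""").show()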
spark.stop()
#References:
#http://www.orafaq.com/node/3187 "TIME SERIES ANALYSIS IN SQL AND PL/SQL"
#http://www.orafaq.com/node/3204
#https://stackoverflow.com/questions/33742895/how-to-show-full-column-content-in-a-spark-dataframe