4220 6 (DataFormat)
Instructor: Li Yang
RDD (resilient distributed dataset) or Dataframe?
• Dataframe
• Spark Streaming is moving towards Structured Streaming, which is built largely on the Dataframe API
• RDD
• Dataframe
• Transformations
• Actions
rdd_new = sc.textFile("/mnt/S3data/sample.text")
• Two common transformations on a single RDD (for example, our RDD = {('Lily', (18,1.65,100)), ('John', (16,1.85,150)), ('Ann', (22,1.55,90))}):
• filter: keep only the elements of an RDD that satisfy a condition, creating a new RDD (similar to a WHERE clause in SQL)
• map: apply a function to every element to create a new RDD, e.g. rdd.map(lambda x: (x[0], x[1][1]*39.37)) gives approximately {('Lily', 65.0), ('John', 72.8), ('Ann', 61.0)}, i.e. name + height in inches (heights are stored in metres, and 1 m = 39.37 in)
• union, intersection
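The transformations above can be sketched without a Spark cluster. The block below is a plain-Python stand-in, not PySpark itself: a list plays the role of the RDD, and each step names the PySpark call it mirrors in a comment. The second dataset (`others`, including 'Bob') is hypothetical, and heights are assumed to be in metres.

```python
# Plain-Python stand-in for the example RDD (a list instead of a distributed dataset).
people = [('Lily', (18, 1.65, 100)), ('John', (16, 1.85, 150)), ('Ann', (22, 1.55, 90))]

# filter: keep records whose age (x[1][0]) is at least 18.
# PySpark equivalent: rdd.filter(lambda x: x[1][0] >= 18)
adults = list(filter(lambda x: x[1][0] >= 18, people))

# map: build (name, height in inches); heights are assumed to be in metres.
# PySpark equivalent: rdd.map(lambda x: (x[0], x[1][1] * 39.37))
heights_in = list(map(lambda x: (x[0], round(x[1][1] * 39.37, 1)), people))

# union keeps duplicates; intersection returns only the distinct common elements.
# PySpark equivalents: rdd_a.union(rdd_b) and rdd_a.intersection(rdd_b)
others = [('Ann', (22, 1.55, 90)), ('Bob', (30, 1.75, 160))]  # hypothetical data
union = people + others
intersection = sorted(set(people) & set(others))

print(adults)
print(heights_in)
print(len(union), intersection)
```

In Spark these calls are lazy: nothing is computed until an action is invoked on the result.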
• Action
• collect: return a list of all the data in the RDD (should not be used on large datasets, because everything is brought back to the driver)
• reduce: apply a function to two elements of the RDD to create a new element, and continue until only one element is left
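A minimal sketch of the two actions, again using a plain Python list as a stand-in for the RDD so it runs without Spark. `functools.reduce` folds elements pairwise in the same spirit as the RDD `reduce`; in Spark the function should be commutative and associative, because partitions are combined in no fixed order.

```python
from functools import reduce

# Plain-Python stand-in for the example RDD.
people = [('Lily', (18, 1.65, 100)), ('John', (16, 1.85, 150)), ('Ann', (22, 1.55, 90))]

# collect: in PySpark, rdd.collect() returns all elements as a Python list.
# The stand-in is already a list, so this is just a copy.
collected = list(people)

# reduce: combine elements two at a time until one value is left.
# PySpark equivalent: rdd.map(lambda x: x[1][0]).reduce(lambda a, b: a + b)
ages = [x[1][0] for x in people]
total_age = reduce(lambda a, b: a + b, ages)

# reduce can also pick a single record, here the oldest person.
oldest = reduce(lambda a, b: a if a[1][0] >= b[1][0] else b, people)

print(total_age, oldest[0])
```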
Python Function
• Definition 1:
return output
return x+2
• Definition 2:
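The code for the two definitions did not survive on the slide. A possible reading, offered only as an assumption, is that Definition 1 is a named function written with `def` and Definition 2 is a `lambda` expression, which is the form the RDD examples use. The names `add_two` and `add_two_lambda` are illustrative.

```python
# Definition 1 (assumed): a named function declared with def.
def add_two(x):
    return x + 2

# Definition 2 (assumed): an anonymous function created with lambda
# and bound to a name so it can be reused.
add_two_lambda = lambda x: x + 2

# Both forms are called the same way.
print(add_two(3), add_two_lambda(3))
```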
• Slicing dataframe:
• Rows (python): pd[0:3]
• Accessing dataframe:
• (python): pd.loc[0,'column']
• (python): pd.iloc[0,1]
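A small sketch of slicing and accessing, on hypothetical data. The slides name the dataframe variable `pd`; it is called `df` here so that it does not shadow the usual `import pandas as pd` alias.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'name': ['Lily', 'John', 'Ann', 'Bob'],
    'age': [18, 16, 22, 30],
})

first_three = df[0:3]          # rows at positions 0, 1 and 2
by_label = df.loc[0, 'age']    # row label 0, column label 'age'
by_position = df.iloc[0, 1]    # row position 0, column position 1

print(first_three)
print(by_label, by_position)
```

`loc` selects by label and `iloc` by integer position; the two agree here only because the default row labels are 0, 1, 2, ...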
Pandas Dataframe
• pd.dropna(axis=0, inplace=True)
• pd.drop_duplicates(subset='column', inplace=True)
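The two cleaning calls can be tried on a small hypothetical dataframe (named `df` rather than `pd`, to keep the pandas alias free). With `inplace=True` both calls modify the dataframe and return `None`.

```python
import pandas as pd

# Hypothetical data: one missing age and one repeated name.
df = pd.DataFrame({
    'name': ['Lily', 'John', 'John', 'Ann'],
    'age': [18, 16, 16, float('nan')],
})

# Drop every row (axis=0) that contains a missing value: removes the 'Ann' row.
df.dropna(axis=0, inplace=True)

# Drop rows whose 'name' repeats an earlier row: keeps the first 'John'.
df.drop_duplicates(subset='name', inplace=True)

print(df)
```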
Numpy
• A numpy array is a grid of values, all of the same type
• A simpler data format than a dataframe; operations on numpy arrays are similar to Matlab matrix operations
• np.array([[1,2,3],[4,5,6]])
• np.zeros((2,2)), np.ones((2,2))
• np.arange(0,3,0.1)
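A short sketch of the constructors above, plus two matrix operations of the Matlab-like kind the slide mentions. The operations chosen are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2x3 array, all integers
z = np.zeros((2, 2))                   # 2x2 array of 0.0
o = np.ones((2, 2))                    # 2x2 array of 1.0
r = np.arange(0, 3, 0.1)               # values from 0 up to (excluding) 3, step 0.1

doubled = a * 2      # element-wise multiplication
product = a @ a.T    # matrix product: (2x3) times (3x2) gives 2x2

print(a.shape, z.dtype)
print(product)
```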
• Dataframe (Spark)
  • Advantages: easy I/O; easy to convert to other data formats; easy for SQL queries
  • Disadvantages: hard to access and manipulate the data
• RDD
  • Advantages: very powerful for manipulating the data
  • Disadvantages: hard to master, since you need to be very familiar with it
• Pandas Dataframe
  • Advantages: easy to access and manipulate data
  • Disadvantages: something wrong with I/O on databricks
• Numpy
  • Advantages: easy to work on matrix manipulations
  • Disadvantages: not fit for string-type data
• My suggestion:
1. Construct a dataframe from the dataset, then convert it to a pandas dataframe for further operations
2. If you need simple matrix operations, convert it to numpy; if you need complicated data operations, convert it to an RDD
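The pandas-to-numpy step of this workflow can be sketched as follows. To keep the block runnable without Spark, the pandas dataframe is built directly from hypothetical data; the Spark-side conversions appear only in comments.

```python
import pandas as pd

# With a Spark dataframe `spark_df` (not created here), step 1 would be:
#   pandas_df = spark_df.toPandas()
# and the RDD behind a Spark dataframe is available as spark_df.rdd.
# A pandas dataframe is built directly so the sketch runs without Spark.
pandas_df = pd.DataFrame({
    'age': [18, 16, 22],
    'height_m': [1.65, 1.85, 1.55],   # hypothetical column name
})

# Step 2: convert to numpy for simple matrix operations.
matrix = pandas_df.to_numpy()      # 3x2 float array
col_means = matrix.mean(axis=0)    # mean of each column

print(matrix.shape, col_means)
```

`toPandas()` collects the whole dataset onto the driver, so it suits data that fits in memory.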
Other modules
• re: regular-expression module (package) for searching and manipulating strings
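A few common `re` calls, shown on a made-up sentence; the patterns are illustrative.

```python
import re

text = "Lily is 18, John is 16, Ann is 22"   # made-up input

# findall: every non-overlapping match of the pattern, as strings.
ages = [int(n) for n in re.findall(r"\d+", text)]

# With groups in the pattern, findall returns one tuple per match.
pairs = re.findall(r"(\w+) is (\d+)", text)

# sub: replace every match with a replacement string.
masked = re.sub(r"\d+", "?", text)

print(ages)
print(pairs)
print(masked)
```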