4220 6 (DataFormat)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

RDD, pandas dataframe, numpy

Instructor: Li Yang
RDD (resilient distributed dataset) or Dataframe ?
• Dataframe

• New trend in spark

• MLlib is on a shift to dataframe based API

• Spark streaming is moving towards structured streaming that is heavily based on dataframe API

• RDD

• Out of date? No!

• RDD is the basic building block of spark

• Dataframe is build on top of RDD


When to use RDD or Dataframe ?

• RDD

• Low-level transformation, action and control on datasets

• Unstructured data, i.e., media streams or streams of text

• need to manipulate data with complicated functional programming

• Dataframe

• Simple processing, i.e., sum, average, SQL queries

• High-level optimization, i.e., MLlib


RDD (resilient distributed datasets)

• RDD examples: {‘a’,’b’,’c’,’d’}, {0,-1.5,2.5,3}, {(‘a’,1), (‘c’,-2),(‘d’,10)}

• RDD can contain any types of objects, including user-defined one

• RDD is a capsulation of a large dataset. In Spark all the work is expressed


as manipulations on RDD (create a new RDD; transform a RDD; call
operations on RDD)

• A special RDD is pair RDD. It has a key-value pair: (key, value)

• For example: {(‘Lily’, (18,1.65,100)), (‘John’, (16,1.85,150)), (‘Ann’,


(22,1.55,90))}. Here name is key, (age, height, weight) is value.

Remark: spark distributes the data contained in RDD across clusters


Operations on RDD

• Transformations

• Apply some functions to data in RDD to create a new RDD

• Actions

• Compute a result based on RDD


Create RDD

• Create RDD directly from external dataset file

• (python code): sc = SparkContext(“local”, “texfile”)

rdd_new = sc.textFile(“/mnt/S3data/sample.text”)

• Create RDD from existing Dataframe

• (python code) rdd_new = df.rdd

• Conversion from rdd to dataframe: rdd_new.toDF()


Transforms on RDD
• Transformations

• Apply some functions to data in RDD to create a new RDD

• Two common transformations on single RDD (for example, our RDD = {(‘Lily’, (18,1.65,100)), (‘John’, (16,1.85,150)), (‘Ann’, (22,1.55,90))}):

• Filter : select some data in RDD to create a new RDD (similar to select in SQL)

• rdd.filter(lambda x: x[0]==`John’) : choose John’s data

• rdd.filter(lambda x: x[1][0]<=20) : choose students’ data whose age <=20

• Map: apply a function to each data in RDD (very useful)

• rdd.map(lambda x: (x[0], x[1][1]*0.393)) : a new RDD = {(‘Lily’, 64.9), (‘John’, 72.8), (‘Ann’, 61)}, name+height in inches

• Transformations on two RDD:

• union, intersection

• Transformations on two pair RDD:

• join: merge two RDD by key values to generate a new RDD


Actions on RDD

• Action

• return a final value based on RDD

• Several common actions on single RDD:

• collect: return a list of all the data in RDD (shouldn’t be used for large dataset)

• take(n) : return the n rows of data in RDD

• count : return the number of rows

• reduce : apply a function to any two elements of RDD to create a new element and
continue until there is only one element in RDD
Python Function

• Definition 1:

• (python): def fun_name(input) :

return output

• Example: def Addition(x) :

return x+2

• Defintion 2:

• (python): lambda input: output

• Example: lambda x: x+2


Pandas Dataframe

• Pandas dataframe is a 2-dimensional labeled data structure with columns


of potentially different types

• Difference between dataframe and pandas dataframe?

• Pandas dataframe is build on top of dataframe. It is a higher level.

• Pandas dataframe is easier to be manipulated

• Dataframe → pandas dataframe: df.toPandas()

• Pandas dataframe → dataframe: spark.createDataFrame(df)


Pandas Dataframe
• Call pandas module (python): import pandas as pd

• Construct pandas dataframe (python): pd_new = df.toPandas()

• Display pandas dataframe (python): pd.head(2), df.tail(2)

• Basic manipulations of pandas dataframe (python): pd.columns, pd.describe()

• Slicing dataframe:

• Columns(python): pd[[‘column1’, ‘column2’]]

• Rows(python): pd[0:3]

• Accessing dataframe:

• (python): pd.loc[0,’column’]

• (python): pd.iloc[0,1]
Pandas Dataframe

• Apply function on columns (python): pd[‘colum1’,’colum2’].apply(fun)

• Merging two pandas dataframes (python): pd.merge(pd1, pd2, on=‘column’)

• Concatenate two pandas dataframe (python): pd.concat([pd1,pd2])

• Other basic manipulations of pandas dataframe (python):

• pd.dropna(axis=0, inplace=True)

• pd.drop_duplicates(subset=‘column’, inplace=True)
Numpy
• A numpy array is a grid of values, all of the same type

• A simpler data format of dataframe. Operations on numpy are similar to Matlab matrix operation

• Call a numpy (python): import numpy as np

• Construct a numpy (python):

• np.array([[1,2,3],[4,5,6]])

• np1 = pd.to_numpy(), (conversion from pandas dataframe)

• np.zeros((2,2)), np.ones((2,2))

• np.arange(0,3,0.1)

• Access data (python): np1[0,1], np1[0,:], np1[:,0]

• Matrix operations (elementwise) (python): np1+np2, np1-np2, np1*np2, np1/np2

• Real Matrix operations (python): np1.dot(np2), np1.T

• Conversion between pandas dataframe and numpy (python):

• Pandas → numpy: pd.to_numpy()

• numpy → pandas: pd.Dataframe(N1) (here N1 is a numpy)


Comparison data formats
Dataframe RDD Pandas Dataframe Numpy

Advantages Easy I/O; easy to transfer Very powerful to Easy to access and Easy to work on matrix
to other data formats ; easy manipulate the data manipulate data manipulations
for sql query

Disadvantage Hard to access and Hard to master it since you Something wrong with I/O Not fit for string type data
manipulate the data need to be very familiar on databricks
with it

• My suggestion:

1. Construct dataframe from dataset; Transform it to pandas dataframe for further operations

2. If you need simple matrix operations, transfer it to numpy; if you need complicated data operations, transfer it to rdd
Other modules
• re: module (package) to work on strings

• Call re (python): import re

• Substitute a string in a text(python): re.sub(‘<br>’, text)

• Search a string in a text(python): re.search(‘<br>’, text)

• matplotlib: plotting module similar to Matlab plot

• Call it (python): import matplotlib.pyplot as plt

• Plot a function with x as horizontal axis and y as vertical axis: plt.plot(x,y)

• imageio: basic operations on images

• Call it (python): from imageio import imread, imsave

• Read an image (python): img = imread(‘input image file.jpeg’)

• Show an image (python): plt.imshow(img)

• Save an array as image (python): imsave(‘output image file.jpeg’, img)

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy