Py Spark Samples

The document provides examples of using PySpark for data manipulation, including creating DataFrames, selecting the first row of each group, and performing SQL queries on the data. It demonstrates how to convert a DataFrame to a Pandas DataFrame and save it as an Excel file, as well as how to work with JSON data. Additionally, it shows how to create DataFrames with explicit schemas and how to display their contents and schema.

https://sparkbyexamples.com/pyspark/pyspark-select-first-row-of-each-group/

from pyspark.sql import SparkSession,Row


spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]

df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()

# collect the Spark DataFrame to the driver as a pandas DataFrame
pandas_df = df.toPandas()
# write it out as an Excel file (requires an engine such as openpyxl)
pandas_df.to_excel("tmp.xlsx")

df.createOrReplaceTempView("EMP")
spark.sql("select Name, Department, Salary from "+
" (select *, row_number() OVER (PARTITION BY department ORDER BY salary DESC)
as rn " +
" FROM EMP) tmp where rn <= 1").show()

spark.sql("select Department, SUM(Salary) Sal FROM EMP GROUP BY Department").show()


--------------------------------

df = spark.read.json("1mb.json")

----------------------------------

# needed to work with dates and timestamps
from datetime import datetime, date

# needed for pandas interoperability
import pandas as pd

# need to import to use pyspark
from pyspark.sql import Row

# need to import for session creation
from pyspark.sql import SparkSession

# creating (or reusing) the session
spark = SparkSession.builder.getOrCreate()

# PySpark DataFrame built from an RDD of tuples
rdd = spark.sparkContext.parallelize([
    (1, 4., 'GFG1', date(2000, 8, 1), datetime(2000, 8, 1, 12, 0)),
    (2, 8., 'GFG2', date(2000, 6, 2), datetime(2000, 6, 2, 12, 0)),
    (3, 5., 'GFG3', date(2000, 5, 3), datetime(2000, 5, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df  # in an interactive shell, this line echoes the DataFrame's repr

# show table
df.show()

# show schema
df.printSchema()
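
The pandas import above goes unused in this snippet; the same DataFrame can also be built directly from a pandas DataFrame, which is presumably why it is included. A minimal sketch:

# build the same data as a pandas DataFrame, then convert it to Spark
pdf = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4., 8., 5.],
    'c': ['GFG1', 'GFG2', 'GFG3'],
    'd': [date(2000, 8, 1), date(2000, 6, 2), date(2000, 5, 3)],
    'e': [datetime(2000, 8, 1, 12, 0), datetime(2000, 6, 2, 12, 0),
          datetime(2000, 5, 3, 12, 0)]
})
df_from_pandas = spark.createDataFrame(pdf)
df_from_pandas.show()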

---------------------------------------------------
# needed to work with dates and timestamps
from datetime import datetime, date

# needed for pandas interoperability
import pandas as pd

# need to import to use pyspark
from pyspark.sql import Row

# need to import for session creation
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# PySpark DataFrame with an explicit schema
df = spark.createDataFrame([
    (1, 4., 'GFG1', date(2000, 8, 1), datetime(2000, 8, 1, 12, 0)),
    (2, 8., 'GFG2', date(2000, 6, 2), datetime(2000, 6, 2, 12, 0)),
    (3, 5., 'GFG3', date(2000, 5, 3), datetime(2000, 5, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')

# show table
df.show()

# show schema
df.printSchema()
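
The DDL string passed as schema above is shorthand; the same schema can be spelled out with StructType, which is handy when schemas are built programmatically. A sketch of the equivalent definition:

from pyspark.sql.types import (StructType, StructField, LongType,
                               DoubleType, StringType, DateType, TimestampType)

# equivalent to 'a long, b double, c string, d date, e timestamp'
explicit_schema = StructType([
    StructField('a', LongType()),
    StructField('b', DoubleType()),
    StructField('c', StringType()),
    StructField('d', DateType()),
    StructField('e', TimestampType()),
])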

-------------------------------------------------------

from datetime import datetime, date

import pandas as pd

# need to import to use pyspark
from pyspark.sql import Row

# need to import for session creation
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# employee rows (the original snippet used Scala-style val/Seq syntax; shown here in Python)
data = [('James', '', 'Smith', '1991-04-01', 'M', 3000),
        ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
        ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
        ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
        ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', -1)]

columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]

df = spark.createDataFrame(data, schema=columns)
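
As in the earlier sections, the result can be inspected with show() and printSchema(); note that dob is inferred as a string here, since the dates are given as string literals:

df.show()
df.printSchema()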
