Steps to deploy a local Spark cluster w/ Docker. Bonus: a ready-to-use notebook for model prediction on PySpark using a spark.ml Pipeline() on a well-known dataset.

matthieuvion/spark-cluster


How to: deploy a local Spark cluster (standalone) w/ Docker (Linux)

License: MIT made-with-python

Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]

Deploying a local Spark cluster (standalone) can be tricky.
Most online resources focus on a single-driver installation, w/ Spark in a custom env or using jupyter-docker-stacks.
Here are my notes on running Spark locally, with a JupyterLab interface, one Master and one Worker, using Docker Compose.
All the PySpark dependencies come already configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way --not local though-- using Databricks Community Edition (free).

1. Prerequisites


  • Install Docker Engine, either through Docker Desktop or as the standalone engine. I personally use the latter.
  • Make sure Docker Compose is installed, or install it.
  • Resources:
    Medium article on installing and basic usage of Docker. The official Docker resources should be enough, though.
    Jupyter-Docker-Stacks: "A set of ready-to-run Docker images containing Jupyter applications".
    The source article I (very slightly) adapted the docker-compose file from.
    Install Docker Engine (apt-get), official resource.
    Install Docker Compose (apt-get), official resource.

2. How to


After installing Docker Engine/Compose on Linux, do not forget the post-installation steps.

  1. Git clone this repository, or create a new one (name of your choice)
  2. Open a terminal, cd into your directory, and make sure the docker-compose.yml file is present (copy it in if needed):
    version: '3'
    services:
      spark:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=master
        ports:
          - '8080:8080'
          - '7077:7077'
        volumes:
          - $PWD:/home/jovyan/work
      spark-worker:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=4G
          - SPARK_EXECUTOR_MEMORY=4G
          - SPARK_WORKER_CORES=4
        ports:
          - '8081:8081'
        volumes:
          - $PWD:/home/jovyan/work
      jupyter:
        image: jupyter/pyspark-notebook:spark-3.3.1
        ports:
          - '8888:8888'
        volumes:
          - $PWD:/home/jovyan/work

Basically, the yml file tells Docker Compose how to run the Spark Master, the Worker and JupyterLab. Your local disk (current working directory) is mounted into the containers every time you run this command:

  3. Run Docker Compose
cd my-directory
docker compose up
# or, depending on your Docker Compose install:
docker-compose up

Docker Compose will automatically download the needed images (bitnami/spark:3.3.1 for the Master and Worker, jupyter/pyspark-notebook for the JupyterLab interface) and run the whole thing. On subsequent runs, it will only start the containers.

3. Profit: JupyterLab interface, Spark cluster (standalone) mode

Access the different interfaces at:

JupyterLab interface: http://localhost:8888
Spark Master: http://localhost:8080
Spark Worker: http://localhost:8081

You can use the demo notebook spark-cluster.ipynb for a ready-to-use PySpark notebook, or simply create a new one and start a SparkSession like this:

from pyspark.sql import SparkSession

# SparkSession bound to the Dockerized Spark master
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")
    .master(URL_SPARK)
    .getOrCreate()
)
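
To check that the session actually runs on the cluster and that your working directory is mounted as declared in docker-compose.yml, a minimal sanity check (not part of the demo notebook) could look like this:

import os

# Tiny distributed job: should print 1000 if the worker is reachable
df = spark.range(1000)
print(df.count())

# Your host's current directory is mounted at this path by docker-compose.yml
print(os.listdir("/home/jovyan/work"))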

Bonus: Notebook, predict using spark.ml Pipeline()


If you use spark-cluster.ipynb, the demo shows how to build a spark.ml prediction Pipeline() with a random forest regressor on a well-known dataset.
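
For reference, here is a rough sketch of that pattern, not the notebook's exact code: the column names (feat1, feat2, feat3, label) and the train_df / test_df DataFrames are placeholders you would replace with your own data.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Assemble the raw feature columns into a single "features" vector
assembler = VectorAssembler(inputCols=["feat1", "feat2", "feat3"], outputCol="features")

# Random forest regressor predicting the "label" column
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=50)

pipeline = Pipeline(stages=[assembler, rf])

# train_df / test_df: placeholder DataFrames, e.g. from df.randomSplit([0.8, 0.2])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

# Evaluate the fitted pipeline with RMSE
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print("RMSE:", evaluator.evaluate(predictions))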
