
    On-Demand Spark clusters with GPU acceleration

    on March 29, 2021

    Apache Spark has become the de facto standard for processing large amounts of stationary and streaming data in a distributed fashion. The addition of the MLlib library, consisting of common learning algorithms and utilities, opened up Spark for a wide range of machine learning tasks and paved the way for running complex machine learning workflows on top of Apache Spark clusters. Some of the key benefits of using Spark for machine learning include: 

    • Distributed Learning - Parallelize compute-heavy workloads such as distributed training or hyper-parameter tuning 
    • Interactive Exploratory Analysis - Efficiently load large data sets in a distributed manner. Explore and understand the data using a familiar interface with Spark SQL 
    • Featurization and Transformation - Sample, aggregate, and re-label large data sets.

    At the same time, the use of Spark by Data Scientists presents its own set of challenges: 

    • Complexity – Apache Spark uses a layered architecture that mandates a master node, a cluster manager, and a set of worker nodes. Quite often Spark is not deployed in isolation but sits on top of a virtualized infrastructure (e.g. virtual machines or OS-level virtualization). Maintaining the cluster and the underlying infrastructure configuration can be a complex and time-consuming task.
    • Lack of GPU acceleration – Complex machine learning workloads, especially those involving Deep Learning, benefit from GPU architectures that are well adapted for vector and matrix operations. The executor-level, CPU-centric parallelization that Spark provides is typically no match for the large, fast registers and optimized bandwidth of the GPU architecture.
    • Cost – Keeping a Spark cluster up and running while using it only intermittently can quickly become a costly exercise (especially if Spark is running in the cloud). Quite often Spark is only needed for a fraction of the ML pipeline (e.g. data pre-processing), as the result set it produces fits comfortably in something like a cuDF DataFrame.

    To address the challenges associated with complexity and cost, Domino offers the ability to dynamically provision and orchestrate a Spark cluster directly on the infrastructure backing the Domino instance. This allows Domino users to get quick access to Spark without having to rely on their IT team to create and manage one for them. The Spark workloads are fully containerized on the Domino Kubernetes cluster, and users can access Spark interactively through a Domino workspace (e.g. JupyterLab) or in batch mode through a Domino job or spark-submit. Moreover, because Domino can provision and de-provision clusters automatically, users can spin up Spark clusters on demand, use them as part of a complex pipeline, and tear them down once the stage they were needed for is complete.

    Spark driver interacting with worker nodes within Domino

    To solve the need for GPU-accelerated Spark, Domino has teamed up with Nvidia. The Domino platform has been capable of leveraging GPU-accelerated hardware (both in the cloud and on-premises) for quite some time, and thanks to its underlying Kubernetes architecture it can natively deploy and use NGC containers out of the box. This, for example, enables Data Scientists to natively use NVIDIA RAPIDS – a suite of software libraries, built on CUDA-X AI, that gives them the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. In addition, Domino supports the integration of the RAPIDS Accelerator for Apache Spark, which combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. These capabilities allow Domino to provide streamlined access to GPU-accelerated ML/DL frameworks and GPU-accelerated Apache Spark components through a unified and Data Scientist-friendly UI.

    Showing GPU accelerated architecture - Spark components with Nvidia SQL/DF plugin and accelerated ML/DL frameworks on top of Spark 3.0 core

    Configuring Spark clusters with RAPIDS Accelerator in Domino 

    By default, Domino does not come with a Spark-compatible Compute Environment (Docker image), so our first task is to create one. Creating a new Compute Environment is a well-documented process, so feel free to check the official documentation if you need a refresher.  

    The key steps are to give the new environment a name (e.g. Spark 3.0.0 GPU) and use bitnami/spark:2.4.6 as the base image. Domino's on-demand Spark functionality has been developed and tested using the open-source Spark images from Bitnami. You could equally use the bitnami/spark:3.0.0 image: since we replace the bundled Spark installation in the next step, the exact base tag makes little difference.

    Screenshot of the Domino New Environment UI

    Next, we need to edit the Compute Environment’s Dockerfile to bring Spark up to 3.0.0, add the NVIDIA CUDA drivers, the RAPIDS accelerator, and the GPU discovery script. Adding the code below to the Dockerfile instructions triggers a compute environment rebuild.

    USER root


    ### Modify the Spark and Hadoop versions below as needed (they are referenced throughout)
    ENV SPARK_VERSION=3.0.0
    ENV HADOOP_VERSION=2.7.7

    RUN apt-get update && \
        apt-get install -y wget && \
        rm -r /var/lib/apt/lists /var/cache/apt/archives

    ENV HADOOP_HOME=/opt/hadoop
    ENV HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
    ENV SPARK_HOME=/opt/bitnami/spark

    ### Remove the pre-installed Spark since it is pre-bundled with hadoop but preserve the python env
    WORKDIR /opt/bitnami
    RUN rm -rf ${SPARK_HOME}

    ### Install the desired Hadoop-free Spark distribution
    RUN wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
        tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
        rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
        mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
        chmod -R 777 ${SPARK_HOME}/conf

    ### Install the desired Hadoop libraries
    RUN wget -q https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
        tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
        rm hadoop-${HADOOP_VERSION}.tar.gz && \
        mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

    ### Setup the Hadoop libraries classpath
    RUN echo 'export SPARK_DIST_CLASSPATH="$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*:/opt/sparkRapidsPlugin"' >> ${SPARK_HOME}/conf/spark-env.sh

    ### This is important to maintain compatibility with Bitnami
    RUN /opt/bitnami/scripts/spark/postunpack.sh


    ### Set up the NVIDIA CUDA apt repositories (adjust the repo paths to match the base OS if needed)
    RUN apt-get update && apt-get install -y --no-install-recommends \
        gnupg2 curl ca-certificates && \
        curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
        echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
        echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
        apt-get purge --autoremove -y curl && \
        rm -rf /var/lib/apt/lists/*

    ENV CUDA_VERSION 10.1.243
    ENV CUDA_PKG_VERSION 10-1=$CUDA_VERSION-1


    # For libraries in the cuda-compat-* package:
    RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION \
        cuda-compat-10-1 && \
        ln -s cuda-10.1 /usr/local/cuda && \
        rm -rf /var/lib/apt/lists/*

    # Required for nvidia-docker v1
    RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
        echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

    ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
    ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

    # nvidia-container-runtime
    ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411"


    # Library versions matching CUDA 10.1.243; adjust as needed
    ENV NCCL_VERSION 2.4.8

    RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-libraries-$CUDA_PKG_VERSION \
        cuda-nvtx-$CUDA_PKG_VERSION \
        libcublas10=10.2.1.243-1 \
        libnccl2=$NCCL_VERSION-1+cuda10.1 && \
        apt-mark hold libnccl2 && \
        rm -rf /var/lib/apt/lists/*

    RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-nvml-dev-$CUDA_PKG_VERSION \
        cuda-command-line-tools-$CUDA_PKG_VERSION \
        cuda-libraries-dev-$CUDA_PKG_VERSION \
        cuda-minimal-build-$CUDA_PKG_VERSION \
        libnccl-dev=$NCCL_VERSION-1+cuda10.1 \
        libcublas-dev=10.2.1.243-1 && \
        rm -rf /var/lib/apt/lists/*

    ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs

    # cuDNN version matching CUDA 10.1; adjust as needed
    ENV CUDNN_VERSION 7.6.5.32
    LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

    RUN apt-get update && apt-get install -y --no-install-recommends \
        libcudnn7=$CUDNN_VERSION-1+cuda10.1 \
        libcudnn7-dev=$CUDNN_VERSION-1+cuda10.1 && \
        apt-mark hold libcudnn7 && \
        rm -rf /var/lib/apt/lists/*

    # GPU Discovery Script
    ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
    RUN wget -q https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh -P $SPARK_RAPIDS_DIR
    RUN chmod +x $SPARK_RAPIDS_DIR/getGpusResources.sh
    RUN echo 'export SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"' >> ${SPARK_HOME}/conf/spark-env.sh


    RUN wget -q -P $SPARK_RAPIDS_DIR
    RUN wget -q -P $SPARK_RAPIDS_DIR
    ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.14-cuda10-1.jar
    ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.1.0.jar
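    As a quick sanity check of the wiring above, the two jar paths exported by the Dockerfile combine into the ':'-separated classpath value that the driver and executors will need. A minimal, hypothetical sketch (the fallback paths simply mirror the ENV values above):

    ```python
    import os

    # Fall back to the paths set in the Dockerfile above when the
    # variables are not present in the current environment.
    cudf_jar = os.environ.get(
        "SPARK_CUDF_JAR", "/opt/sparkRapidsPlugin/cudf-0.14-cuda10-1.jar")
    plugin_jar = os.environ.get(
        "SPARK_RAPIDS_PLUGIN_JAR",
        "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar")

    # Both jars must appear on the driver and executor classpaths,
    # joined with ':', as in the SparkSession configuration shown later.
    extra_class_path = ":".join([plugin_jar, cudf_jar])
    print(extra_class_path)
    ```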


    We can verify that the new environment has been successfully built by inspecting the Revisions section and making sure that the active revision is the most recent one.

    Showing Domino's Revisions tab with successfully built version of the GPU compute environment.

    Now that we have a Spark environment with the RAPIDS accelerator in place, we need to create a Workspace environment - an environment that will host the IDE that we'll use to interact with Spark.

    The process of creating a custom PySpark workspace environment is fully covered in the Domino official documentation. It is similar to how we built the Spark environment above, the key differences being that we use a Domino base image (instead of bitnami) and that we also need to configure pluggable workspaces tools. The latter enables access to the web-based tools inside the compute environment (e.g. JupyterLab).

    To build the workspace environment we create a new Compute Environment (Spark 3.0.0 RAPIDS Workspace Py3.6) using dominodatalab/base:Ubuntu18_DAD_Py3.6_R3.6_20200508 as the base image and we add the following contents to the Dockerfile instructions section:


    RUN mkdir -p /opt/domino

    ### Modify the Hadoop and Spark versions below as needed.
    ENV SPARK_VERSION=3.0.0
    ENV HADOOP_VERSION=2.7.7
    ENV HADOOP_HOME=/opt/domino/hadoop
    ENV HADOOP_CONF_DIR=/opt/domino/hadoop/etc/hadoop
    ENV SPARK_HOME=/opt/domino/spark

    ### Install the desired Hadoop-free Spark distribution
    RUN rm -rf ${SPARK_HOME} && \
        wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
        tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
        rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
        mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
        chmod -R 777 ${SPARK_HOME}/conf

    ### Install the desired Hadoop libraries
    RUN rm -rf ${HADOOP_HOME} && \
        wget -q https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
        tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
        rm hadoop-${HADOOP_VERSION}.tar.gz && \
        mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

    ### Setup the Hadoop libraries classpath and Spark related envars for proper init in Domino
    RUN echo "export SPARK_HOME=${SPARK_HOME}" >> /home/ubuntu/.domino-defaults
    RUN echo "export HADOOP_HOME=${HADOOP_HOME}" >> /home/ubuntu/.domino-defaults
    RUN echo "export HADOOP_CONF_DIR=${HADOOP_CONF_DIR}" >> /home/ubuntu/.domino-defaults
    RUN echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native" >> /home/ubuntu/.domino-defaults
    RUN echo "export PATH=\$PATH:${SPARK_HOME}/bin:${HADOOP_HOME}/bin" >> /home/ubuntu/.domino-defaults
    RUN echo "export SPARK_DIST_CLASSPATH=\"\$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*\"" >> ${SPARK_HOME}/conf/spark-env.sh

    ### Complete the PySpark setup from the Spark distribution files
    RUN cd ${SPARK_HOME}/python && python setup.py install

    ### Optionally copy spark-submit to spark-submit.sh to be able to run it from Domino jobs
    RUN spark_submit_path=$(which spark-submit) && \
        cp ${spark_submit_path} ${spark_submit_path}.sh

    ### RAPIDS Accelerator and cuDF jars (from Maven Central)
    ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
    RUN wget -q https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar -P $SPARK_RAPIDS_DIR
    RUN wget -q https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar -P $SPARK_RAPIDS_DIR
    ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.14-cuda10-1.jar
    ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.1.0.jar

    Notice that we also add the RAPIDS accelerator at the end and set a number of environment variables to make the plugin readily available in the preferred IDE (e.g. JupyterLab). We also add the following mapping to the Pluggable Workspaces Tools section in order to make Jupyter and JupyterLab available through the Domino UI.

    jupyter:
      title: "Jupyter (Python, R, Julia)"
      iconUrl: "/assets/images/workspace-logos/Jupyter.svg"
      start: [ "/var/opt/workspaces/jupyter/start" ]
      httpProxy:
        port: 8888
        rewrite: false
        internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
        requireSubdomain: false
      supportedFileExtensions: [ ".ipynb" ]
    jupyterlab:
      title: "JupyterLab"
      iconUrl: "/assets/images/workspace-logos/jupyterlab.svg"
      start: [ "/var/opt/workspaces/Jupyterlab/start" ]
      httpProxy:
        internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
        port: 8888
        rewrite: false
        requireSubdomain: false
    vscode:
      title: "vscode"
      iconUrl: "/assets/images/workspace-logos/vscode.svg"
      start: [ "/var/opt/workspaces/vscode/start" ]
      httpProxy:
        port: 8888
        requireSubdomain: false
    rstudio:
      title: "RStudio"
      iconUrl: "/assets/images/workspace-logos/Rstudio.svg"
      start: [ "/var/opt/workspaces/rstudio/start" ]
      httpProxy:
        port: 8888
        requireSubdomain: false

    After the workspace and Spark environments are made available, everything is in place for launching GPU-accelerated Spark clusters. All we need to do at this point is to go to an arbitrary project and define a new Workspace. We can name the workspace On Demand Spark, select the Spark 3.0.0 RAPIDS Workspace Py3.6 environment, and mark JupyterLab as the desired IDE. The selected hardware tier for the workspace can be relatively small as most of the heavy lifting will be carried out by the Spark cluster.

    The Launch New Workspace screen in Domino. Environment is set to Spark 3.0.0 Workspace, IDE is set to JupyterLab.

    On the Compute Cluster screen, we select Spark, set the number of executors that we want Domino to create for the cluster, and select hardware tiers for the Spark executors and Spark driver. We need to make sure that these hardware tiers have Nvidia GPUs if we are to benefit from using the RAPIDS accelerator.

    Compute Cluster tab of the Launch New Workspace dialog, showing 2 executors, GPU HW tier for the executors, and GPU HW tier for the spark master. Compute environment is set to Spark 3.0.0 GPU

    Once the cluster is up and running we will be presented with an instance of JupyterLab. The workspace will also feature an extra tab - Spark Web UI, which provides access to the web interface of the running Spark application and allows us to monitor and inspect the relevant job executions.

    We can then create a notebook with a minimal example to smoke test the configuration. First, we establish a connection to the on-demand cluster and create an application:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.task.cpus", 1) \
        .config("spark.driver.extraClassPath", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar:/opt/sparkRapidsPlugin/cudf-0.14-cuda10-1.jar") \
        .config("spark.executor.extraClassPath", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar:/opt/sparkRapidsPlugin/cudf-0.14-cuda10-1.jar") \
        .config("spark.executor.resource.gpu.amount", 1) \
        .config("spark.executor.cores", 6) \
        .config("spark.task.resource.gpu.amount", 0.15) \
        .config("spark.rapids.sql.concurrentGpuTasks", 1) \
        .config("spark.rapids.memory.pinnedPool.size", "2G") \
        .config("spark.locality.wait", "0s") \
        .config("spark.sql.files.maxPartitionBytes", "512m") \
        .config("spark.sql.shuffle.partitions", 10) \
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
        .appName("MyGPUAppName") \
        .getOrCreate()
    Note that we keep parts of the configuration dynamic, as it will vary based on the specific GPU hardware tier that is running the execution. 

    • spark.task.cpus - the number of CPU cores to allocate for each task
    • spark.task.resource.gpu.amount - the number of GPUs per task. Note that this can be a decimal and should be set in line with the number of CPUs available on the executor hardware tier. In this test we set it to 0.15, which is slightly under 1/6 (6 CPU cores sharing a single GPU)
    • spark.executor.resource.gpu.amount - the number of GPUs available in the hardware tier (we have one V100 here)
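    The relationship between executor cores and the per-task GPU fraction can be sketched with a small helper (a hypothetical utility, not part of Spark or Domino):

    ```python
    import math

    def per_task_gpu_amount(gpus_per_executor, executor_cores, precision=2):
        """Largest per-task GPU fraction, rounded down to `precision` decimals,
        that still lets every executor core run a task concurrently without
        oversubscribing the executor's GPUs."""
        factor = 10 ** precision
        return math.floor(gpus_per_executor / executor_cores * factor) / factor

    # 1 GPU shared by 6 executor cores allows at most 0.16 GPU per task;
    # the configuration above uses the slightly more conservative 0.15.
    fraction = per_task_gpu_amount(1, 6)
    print(fraction)
    ```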

    After the application is initialised and connected to the cluster, it appears in the Spark Web UI section of the workspace:

    Spark Web UI tab showing 2 workers, and the MyGPUAppName application using 12 cores and 1 gpu per executor

    We can then run a simple outer join task that looks like this:

    df1 = spark.sparkContext.parallelize(range(1, 100)).map(lambda x: (x, "a" * x)).toDF()
    df2 = spark.sparkContext.parallelize(range(1, 100)).map(lambda x: (x, "b" * x)).toDF()
    df = df1.join(df2, how="outer")
    df.count()
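    Since neither DataFrame specifies a join condition, the join should degenerate to a cartesian product. A pure-Python sketch of the same data (no Spark required) shows the row count we expect count() to return:

    ```python
    # Mirror the rows that parallelize(...).map(...) produces on the cluster.
    rows1 = [(x, "a" * x) for x in range(1, 100)]
    rows2 = [(x, "b" * x) for x in range(1, 100)]

    # A join without a condition pairs every row with every other row,
    # so the expected result size is the product of the two row counts.
    expected_count = len(rows1) * len(rows2)
    print(expected_count)
    ```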

    After the count() action completes, we can inspect the DAG for the first job (for example) and clearly see that Spark is using GPU-accelerated operations (e.g. GpuColumnarExchange, GpuHashAggregate, etc.)

    Spark DAG visualisation showing 2 stages with standard operations replaced by GPU-accelerated operations (e.g. GpuHashAggregate, GPUColumnarExchange etc.)


    In this post, we showed that configuring an on-demand Apache Spark cluster with the RAPIDS Accelerator and GPU backends is a fairly straightforward process in Domino. Besides removing the need to deal with the underlying infrastructure, reducing costs through on-demand provisioning, and providing out-of-the-box reproducibility, this setup also significantly reduces processing times, making data science teams more efficient and enabling them to achieve higher model velocity.

    Benchmark plot showing 3.8x speed-up and 50% cost savings on ETL workloads.

    A benchmark published by Nvidia shows 3.8x speed up and 50% cost reduction for an ETL workload executed on the FannieMae Mortgage Dataset (~200GB) using V100 GPU instances.
