26. July 2016

Spark Scaleout

Orbit is designed to work with several scaleout (i.e. cluster) architectures. Out of the box, Orbit supports Spark. For other scaleout architectures, the IScaleout interface must be implemented and referenced in config.properties (in /resources).
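
As a purely hypothetical sketch (neither the property key nor the class name below are the actual Orbit identifiers; check config.properties for the real key), such a reference would look roughly like this:

# hypothetical key and class name, for illustration only
scaleout.implementation=com.mycompany.orbit.MyScaleoutImpl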

Currently Orbit works with Spark 2.2.1. For other Spark versions you have to rebuild the map-reduce-exec-spark dependency and define a different Spark version in its build.gradle.
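
As a rough sketch (the exact property name in build.gradle, the Scala suffix, and the Gradle module path are assumptions and may differ in the actual project), such a rebuild could look like this:

# in map-reduce-exec-spark/build.gradle (names are assumptions):
#   def sparkVersion = '2.3.0'
#   compile "org.apache.spark:spark-core_2.11:${sparkVersion}"
# then rebuild the module:
./gradlew :map-reduce-exec-spark:build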

To use a scaleout infrastructure, Orbit needs to be connected to an image server, e.g. Omero. The scaleout functionality does not work if you use Orbit in standalone mode, i.e. opening files from the local file system (because the scaleout infrastructure will not have access to your local files).

Spark Setup (Master and Slaves)

Orbit can be used with Mesos, YARN, or Spark standalone mode. Here I describe briefly how to set up a Spark standalone cluster.

An existing Spark cluster can be used. If you want to set up a new cluster (which can also be just a single multi-core remote machine!), I recommend using a Docker distribution of Spark, e.g. sequenceiq.
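
As a hedged example (the sequenceiq images bundle older Spark versions and run Spark on YARN, so the tag and ports below may not match Spark 2.2.1; check the image documentation for the exact command), a single-machine setup can be started like this:

docker run -it -p 8088:8088 -p 8042:8042 -h sandbox sequenceiq/spark:1.6.0 bash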

The spark-env.sh file on the server side (master and slaves) defines the Spark setup (usually in $SPARK_HOME/conf). The total worker cores (number of workers * cores per worker) should match the number of cores you have on each machine. An example spark-env.sh might look like this:

SPARK_PUBLIC_DNS=<Computername>
SPARK_LOCAL_IP=127.0.0.1
SPARK_WORKER_INSTANCES=1
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=1G
SPARK_DRIVER_MEMORY=1G
SPARK_MASTER_IP=0.0.0.0
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8089
SPARK_WORKER_WEBUI_PORT=8081

Then, in $SPARK_HOME/sbin, start the master node with start-master.sh, and on all slaves (the master can be a slave, too!) run start-slave.sh spark://<master-ip>:7077.
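
For example (paths assume a default Spark installation):

cd $SPARK_HOME/sbin
./start-master.sh                           # on the master machine
./start-slave.sh spark://<master-ip>:7077   # on every slave (the master can be a slave, too)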

Orbit Setup

This section describes the Orbit configuration needed to work with a Spark cluster.

Orbit needs a SparkConf.properties file, which can be located in Orbit's execution directory or in the user home folder. At startup, Orbit writes to the log which file is used. Here is a typical SparkConf.properties:

spark.master=spark://<Spark-master-name>:7077
spark.app.name=OrbitSpark
spark.default.parallelism=100
# further spark options
Orbit.samba.share=domain/user:password@smb://<computername>/<share>

This config mainly contains Spark parameters.

  • The Spark master is essential, of course, so that Orbit knows where to connect.
  • The app name can be set to anything, but it helps you identify the Orbit jobs on the Spark status page.
  • The parallelism is also very important because it defines how many worker cores will be used. Set it to the total number of worker cores in your cluster (see the example after this list).
  • For further Spark parameters, please have a look at the Spark configuration website.
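
For example, with 5 worker machines and 8 worker cores each (5 * 8 = 40 worker cores in total), the setting would be:

# example values: 5 workers with 8 cores each
spark.default.parallelism=40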

In addition to the Spark parameters, Orbit needs a remote share to store models and results. By default a Samba share is used for that. Therefore, please set up a Samba share somewhere in your network (e.g. on a Windows machine simply share a folder, or use smbd on Linux) and define it via the Orbit.samba.share parameter. 'Domain' can be omitted, and so can 'user' and 'password' if they are not needed for the share.
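
For example (hostname, share name, and credentials are placeholders):

# with domain, user and password
Orbit.samba.share=MYDOMAIN/orbituser:secret@smb://fileserver01/orbitshare
# without credentials (domain, user and password omitted)
Orbit.samba.share=smb://fileserver01/orbitshare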

Hint: Out of the box Orbit supports Samba, but Orbit can be configured to work with other remote data stores, e.g. HDFS. To make Orbit work with something other than Samba, you have to implement the IRemoteContextStore interface (part of mapReduceGeneric) and use it in your IScaleout implementation (getRemoteContextStore()).

With this configuration Orbit is ready to be used in scaleout mode (Batch -> scaleout execution). For debugging, I strongly recommend having a look at the Orbit log.