Playing with Spark in sbt

Unless you have a cluster with Apache Spark at your disposal, you may want to play a bit with Spark on your own machine. The standard VMs or Docker images (e.g. Cloudera, Hortonworks, IBM, MapR, Oracle) do not offer the latest and greatest. If you really want the bleeding edge of Spark, you have to install it locally yourself, roll your own Docker container, or simply use sbt.

If you have already installed sbt on your machine, read on. If not, have a look here to see how to set up your machine.

With sbt available, create a folder in which you can play around, your ‘sandbox’. I’ll assume you have created the folder under /path/to/sandbox. On Windows, also create a sub-folder inside it for Spark’s so-called warehouse directory. Let’s call that sub-folder ‘warehouse’.

All you have to do now is create a very simple build.sbt file that has the following contents:

name := "sandbox"
version := "0.0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.1.0"

val grpId = "org.apache.spark"
libraryDependencies ++= Seq(grpId %% "spark-core"      % sparkVersion,
                            grpId %% "spark-sql"       % sparkVersion,
                            grpId %% "spark-streaming" % sparkVersion,
                            grpId %% "spark-mllib"     % sparkVersion,
                            grpId %% "spark-graphx"    % sparkVersion)

Now you can simply run sbt console, and sbt will download all the packages you need to run Spark from the command line. A minimal setup that’s a bit like the Spark shell itself can be achieved as follows:

import org.apache.spark._
import org.apache.spark.sql._

// Run Spark locally, using as many worker threads as there are cores.
val conf = new SparkConf()
val sc = new SparkContext("local[*]", "sandbox", conf)

// getOrCreate() picks up the SparkContext created above.
val ss = SparkSession
  .builder
  .config("spark.sql.warehouse.dir", "/path/to/sandbox/warehouse")
  .getOrCreate()

// Enables $-notation, toDF/toDS on local collections, and so on.
import ss.implicits._

// Add your code here!

It is important that you add the Spark warehouse directory option on Windows machines because otherwise Java will complain about relative paths in absolute URIs, which is caused by this bug.
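With that in place, you can start tinkering right away where the ‘Add your code here!’ comment sits. As a quick sanity check, something along the following lines should work; the sample data is made up purely for illustration:

// RDD API: count the multiples of 7 among the first thousand integers.
val numbers = sc.parallelize(1 to 1000)
println(numbers.filter(_ % 7 == 0).count())

// DataFrame API: build a tiny table from a local Seq and filter it.
val fruit = Seq(("apple", 3), ("banana", 5), ("cherry", 7)).toDF("name", "qty")
fruit.filter($"qty" > 4).show()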

Of course, this does not provide a complete development environment replete with Hadoop or any other software. It does, however, provide a very simple way to tinker with Spark in the comfort of a command line or your favourite IDE.

Have fun!