Shell Scripts to Ease Spark Application Development

When creating Apache Spark applications the basic structure is pretty much always the same: for sbt you need the same build.sbt, the same imports, and the same application skeleton. All that really changes is the main entry point, that is, the fully qualified class name. Since that is easy to automate, I present a couple of shell scripts that generate these basic building blocks to kick-start Spark application development and that allow you to easily upgrade the versions in the configuration.

Configuration: app.cfg

The basic configuration file is on GitHub:

#!/bin/bash

ORGANIZATION="tech.databaseline"
DEV_FOLDER=$HOME/Development
SCALA_FOLDER=$DEV_FOLDER/scala
ORG_FOLDER=${ORGANIZATION//.//}

SCALA_VERSION="2.11.8"
SBT_VERSION="0.13.13"          # sbt release, pinned in project/build.properties
SBT_ASSEMBLY_VERSION="0.14.1"  # sbt-assembly plugin, referenced in project/plugins.sbt
SCALATEST_VERSION="3.0.1"
SPARK_VERSION="2.1.0"
HADOOP_VERSION="2.7.3"

You can use the ORGANIZATION variable to set your organization (the equivalent of Maven's groupId), DEV_FOLDER to point to the local folder where you want your development efforts to live, and SCALA_FOLDER in case you have a dedicated folder within your development folder where Scala code lives. ORG_FOLDER simply derives the nested folder structure from ORGANIZATION, so you do not have to do anything about it.
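
For instance, with the defaults above the parameter expansion behind ORG_FOLDER turns the dotted organization into the nested package path that is used later for the source folders:

# Replace every dot in ORGANIZATION with a slash to obtain the package path
ORGANIZATION="tech.databaseline"
ORG_FOLDER=${ORGANIZATION//.//}
echo "$ORG_FOLDER"   # prints: tech/databaseline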

In case you want to pin specific versions on your system, for instance the Hadoop (client) version, you can do so in the version definitions that make up the second half of the file.

Creation: sparkSetup.sh

Once that is set up, you can execute ./sparkSetup.sh MyClassName and it generates the scaffold for an sbt-based Spark/Scala project with ScalaTest already included. The plug-in that allows you to use sbt assembly to create a single fat JAR is added automatically by the script as well. The main entry point will be MyClassName, prefixed with the ORGANIZATION package as usual. You can easily adapt the dependencies in the script to your needs:

#!/bin/bash
PROJECT="$1"
ORIGINAL_FOLDER="$(pwd)"
CFG_FILE="app.cfg"
# A class (project) name is mandatory
if [ -z "$PROJECT" ]; then
  echo "Usage: $0 MyClassName" && exit 1
fi
if [ -r "$CFG_FILE" ]; then
  echo "Sourcing application configuration..." && source "$CFG_FILE"
else
  echo "No application configuration found." && exit 1
fi
# -----------------------------------------------------------------------------
# Create main folder, build properties, and plugins file
# -----------------------------------------------------------------------------
cd "$SCALA_FOLDER" || exit 1
mkdir "$PROJECT"
cd "$PROJECT" || exit 1
mkdir project
echo "sbt.version=$SBT_VERSION" > project/build.properties
echo "addSbtPlugin(\"com.eed3si9n\" % \"sbt-assembly\" % \"$SBT_ASSEMBLY_VERSION\")" > project/plugins.sbt
# -----------------------------------------------------------------------------
# Create build.sbt
# -----------------------------------------------------------------------------
echo "name := \"$PROJECT\"
version := \"0.0.1\"
scalaVersion := \"$SCALA_VERSION\"
organization := \"$ORGANIZATION\"

val sparkVersion = \"$SPARK_VERSION\"
val suffix = \"provided\"
val test = \"test\"

scalacOptions ++= Seq(
  \"-unchecked\",
  \"-feature\",
  \"-deprecation\",
  \"-encoding\", \"UTF-8\",
  \"-Xlint\",
  \"-Xfatal-warnings\",
  \"-Ywarn-adapted-args\",
  \"-Ywarn-dead-code\",
  \"-Ywarn-numeric-widen\",
  \"-Ywarn-unused\",
  \"-Ywarn-unused-import\",
  \"-Ywarn-value-discard\"
)

libraryDependencies ++= Seq(
  \"org.apache.spark\"  %% \"spark-core\"    % sparkVersion           % suffix,
  \"org.apache.spark\"  %% \"spark-sql\"     % sparkVersion           % suffix,
  \"org.apache.spark\"  %% \"spark-hive\"    % sparkVersion           % suffix, // needed for enableHiveSupport()
  \"org.apache.hadoop\" %  \"hadoop-client\" % \"$HADOOP_VERSION\"    % suffix,
  \"org.scalatest\"     %% \"scalatest\"     % \"$SCALATEST_VERSION\" % test
)" > build.sbt
# -----------------------------------------------------------------------------
# Create folder structure
# -----------------------------------------------------------------------------
mkdir -p src/{main,test}/scala/$ORG_FOLDER
# -----------------------------------------------------------------------------
# Create main entry point
# -----------------------------------------------------------------------------
echo "package $ORGANIZATION

import org.apache.spark._
import org.apache.spark.sql._

object $PROJECT extends App {
  val spark = SparkSession
    .builder()
    .appName(\"$PROJECT\")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  try {
    // ...
  } finally {
    spark.stop()
  }
}" > src/main/scala/$ORG_FOLDER/$PROJECT.scala
# -----------------------------------------------------------------------------
# Create UnitSpec class with common mixins
# -----------------------------------------------------------------------------
echo "package $ORGANIZATION

import org.scalatest._

abstract class UnitSpec extends FlatSpec with Matchers with OptionValues with BeforeAndAfterAll" > src/test/scala/$ORG_FOLDER/UnitSpec.scala
# -----------------------------------------------------------------------------
# Create unit test scaffolding
# -----------------------------------------------------------------------------
echo "package $ORGANIZATION

import org.apache.spark._
import org.apache.spark.sql._

class ${PROJECT}Spec extends UnitSpec {
  val spark = SparkSession
    .builder()
    .appName(\"Suite: $PROJECT\")
    .master(\"local[*]\")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  \"A $PROJECT\" should \"...\" in {

  }

  it should \"...\" in {

  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}" > src/test/scala/$ORG_FOLDER/${PROJECT}Spec.scala
# -----------------------------------------------------------------------------
# Return to original location
# -----------------------------------------------------------------------------
cd "$ORIGINAL_FOLDER"
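
For example, with the configuration shown earlier and a (hypothetical) class name of WordCount, a single invocation produces the complete project layout:

./sparkSetup.sh WordCount

# Generated under $HOME/Development/scala/WordCount:
#   build.sbt
#   project/build.properties
#   project/plugins.sbt
#   src/main/scala/tech/databaseline/WordCount.scala
#   src/test/scala/tech/databaseline/UnitSpec.scala
#   src/test/scala/tech/databaseline/WordCountSpec.scala

From there, sbt assembly builds the fat JAR you can hand to spark-submit.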

Update: updateAppConfig.sh

Whenever you change the Scala or sbt version, or there is a new Spark release with the latest and greatest features you want to use in your application, it can be a slight pain to manage these versions yourself. Sure, you can look up the correct versions and update app.cfg by hand, but developers are typically a little bit lazy. That is why I created updateAppConfig.sh.

It reads your system’s current Scala and sbt versions and checks Maven for newer versions of the standard dependencies: ScalaTest, Spark, and Hadoop.
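
Concretely, the script builds a query against the Maven Central search API (the Solr endpoint at search.maven.org) and extracts the version from the JSON it returns. You can run the URL-encoded query it generates for spark-core yourself; the version numbers and the abbreviated response below are merely illustrative:

curl -s "https://search.maven.org/solrsearch/select?q=g:%22org.apache.spark%22%20AND%20a:%22spark-core_2.11%22&core=gav&rows=1&wt=json"

# Abbreviated response; the script pulls out the "v" field with sed:
#   {"response":{"numFound":12,"docs":[{"id":"org.apache.spark:spark-core_2.11:2.1.0", ..., "v":"2.1.0", ...}]}}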

In case you want the script to manage additional libraries (e.g. HBase or Scalding) as well, just add them to the MAVEN_CONFIG array that is already in place; an example follows after the script. Please observe the comment on trailing underscores to make sure you use the correct notation.

FUNCTIONS="$(realpath ../functions.sh)"
if [ -r "$FUNCTIONS" ]; then
  source "$FUNCTIONS"
else
  echo "Cannot load functions file." && exit 1
fi

CFG_FILE="app.cfg"
if [ -r "$CFG_FILE" ]; then
  source "$CFG_FILE"
else
  echo "Cannot load configuration file." && exit 2
fi
# -------------------------------------------------------------------------------------------------
# Internal configuration: VERSION prefix|groupId|artifactId
# -------------------------------------------------------------------------------------------------
MAVEN_URL="https://search.maven.org/solrsearch/select?"

MAVEN_CONFIG=("SCALATEST|org.scalatest|scalatest_" \
              "SPARK|org.apache.spark|spark-core_" \
              "HADOOP|org.apache.hadoop|hadoop-client")
# -------------------------------------------------------------------------------------------------
# Scala and sbt versions
# -------------------------------------------------------------------------------------------------
CURR_SCALA_VERSION=$(scala -version 2>&1 | sed 's/.*version \([[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\).*/\1/')
CURR_SCALA_VERSION_SHORT=$(echo $CURR_SCALA_VERSION | sed 's/\([[:digit:]]*\.[[:digit:]]*\)\..*/\1/')
CURR_SBT_VERSION=$(sbt --version 2>&1 | sed 's/.*version \([[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\).*/\1/')
# -------------------------------------------------------------------------------------------------
# Update Scala and sbt version configurations in CFG_FILE
# -------------------------------------------------------------------------------------------------
if [ "$CURR_SCALA_VERSION" != "$SCALA_VERSION" ]; then
  sed -i "s/^SCALA_VERSION=.*$/SCALA_VERSION=\"$CURR_SCALA_VERSION\" # $SCALA_VERSION/" "$CFG_FILE"
fi

if [ "$CURR_SBT_VERSION" != "$SBT_VERSION" ]; then
  sed -i "s/^SBT_VERSION=.*$/SBT_VERSION=\"$CURR_SBT_VERSION\" # $SBT_VERSION/" "$CFG_FILE"
fi
# -------------------------------------------------------------------------------------------------
# Update version configurations in CFG_FILE based on MAVEN_CONFIG
# -------------------------------------------------------------------------------------------------
for i in "${!MAVEN_CONFIG[@]}"; do
  lib="${MAVEN_CONFIG[i]}"
  # Split the entry on the pipe character into id, groupId, and artifactId
  libConfig=(${lib//|/ })
  libId="${libConfig[0]}"
  libGroupId="${libConfig[1]}"
  libArtifactId="${libConfig[2]}"

  # Add the short Scala version if the artifact configuration ends with an underscore
  if [ "${libArtifactId: -1}" = "_" ]; then
    libArtifactId=$libArtifactId$CURR_SCALA_VERSION_SHORT
  fi

  # Build the REST URL and extract the version (results are already sorted, newest first)
  query="q=g:\"$libGroupId\" AND a:\"$libArtifactId\""
  opts="&core=gav&rows=1&wt=json"
  libUrl="$(encodeURL "$MAVEN_URL$query$opts")"
  libInfo="$(curl -s -X GET "$libUrl")"
  libVersion="$(echo "$libInfo" | sed 's/.*"v":"\([^"]*\)".*/\1/')"

  # Build the configuration string and replace the corresponding line in CFG_FILE
  configString="${libId}_VERSION=\"$libVersion\""
  sed -i "s/^${libId}_VERSION=.*$/$configString/" "$CFG_FILE"
done

Each entry you add to the list consists of three fields: the first is the variable name the script uses (it shows up in app.cfg with the suffix _VERSION), the second the groupId, and the third the artifactId. You then also need to add the corresponding line to the libraryDependencies sequence in sparkSetup.sh's build.sbt template.
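
For example, to have the script manage the HBase client library too (org.apache.hbase and hbase-client are that library's actual Maven coordinates; no trailing underscore, because it is a plain Java artifact), extend the array as follows and add a matching HBASE_VERSION="..." line to app.cfg so the sed replacement has something to update:

MAVEN_CONFIG=("SCALATEST|org.scalatest|scalatest_" \
              "SPARK|org.apache.spark|spark-core_" \
              "HADOOP|org.apache.hadoop|hadoop-client" \
              "HBASE|org.apache.hbase|hbase-client")

The corresponding dependency line in sparkSetup.sh's build.sbt template would then read "org.apache.hbase" % "hbase-client" % "$HBASE_VERSION" (with the double quotes escaped inside the echo block).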

Whenever you run updateAppConfig.sh, it updates app.cfg accordingly. Obviously, if you do not want a particular entry to be overwritten (because you are stuck with a specific version), simply remove that line from the MAVEN_CONFIG array. That is all there is to it. Please note that you should compile your application with the same Scala version as the one used to build Spark; you can find out which version that is by simply starting spark-shell, which prints the Scala version on startup.
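
To illustrate what the replacements look like: after an upgrade, app.cfg might contain lines such as the ones below, where the Scala and sbt entries keep the previous value as a trailing comment while the library entries are simply overwritten. The version numbers shown are, of course, only examples:

SCALA_VERSION="2.11.11" # 2.11.8
SBT_VERSION="0.13.15"   # 0.13.13
SCALATEST_VERSION="3.0.3"
SPARK_VERSION="2.2.0"
HADOOP_VERSION="2.8.1"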

Important to note is that the updateAppConfig.sh script depends on a shared functions file, functions.sh, which is sourced from the directory above (see the top of the script) and which you can also find on GitHub. Specifically, its encodeURL function is used to construct the libUrl variable that is used to query Maven's Solr-based search API.
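
The functions file itself is not reproduced in this post. To give an idea of what it needs to do, here is a minimal sketch of such a URL-encoding helper; it is an illustration based on what the update script passes in (spaces and double quotes), not the actual implementation from the repository:

# Percent-encode the characters that appear in the Maven query string:
# spaces become %20 and double quotes become %22.
encodeURL() {
  local url=$1
  url=${url// /%20}
  url=${url//\"/%22}
  echo "$url"
}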

At some point I may add the ability to automatically propagate entries from updateAppConfig.sh to the dependencies in sparkSetup.sh, but I have not done so yet. Moving MAVEN_CONFIG to a separate file together with the static configuration (i.e. the settings that are not related to library versions) would be a sensible way to do that, in case you want to try it yourself.