Shell Scripts to Ease Spark Application Development
When creating Apache Spark applications, the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the same application skeleton. All that really changes is the main entry point, that is, the fully qualified class. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.
Configuration: app.cfg
The basic configuration file is on GitHub:
#!/bin/bash
ORGANIZATION="tech.databaseline"
DEV_FOLDER=$HOME/Development
SCALA_FOLDER=$DEV_FOLDER/scala
ORG_FOLDER=${ORGANIZATION//.//}
SCALA_VERSION="2.11.8"
SBT_VERSION="0.14.1"
SCALATEST_VERSION=""
SPARK_VERSION="2.1.0"
HADOOP_VERSION="2.7.3"
You can use the ORGANIZATION variable to set your organization ID, DEV_FOLDER to point to the local folder where you’d like your development efforts to be saved, and SCALA_FOLDER in case you have a special folder within your development folder where Scala code lives. The ORG_FOLDER simply derives the nested folder structure from ORGANIZATION, so you don’t have to do anything about that. In case you want to pin specific versions on your system, for instance the Hadoop (client) version, you can do so in the section with the Scala-specific definitions.
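The ORG_FOLDER line relies on bash’s pattern substitution, which turns the dotted organization ID into a nested path. A quick check of what it produces:

```shell
# ${var//pattern/replacement} replaces ALL occurrences of the pattern,
# so every dot in the organization ID becomes a path separator.
ORGANIZATION="tech.databaseline"
ORG_FOLDER=${ORGANIZATION//.//}
echo "$ORG_FOLDER"   # tech/databaseline
```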
Creation: sparkSetup.sh
Once that’s set up, you can execute ./sparkSetup.sh MyClassName and it’ll generate the scaffold for an sbt-based Spark/Scala project with ScalaTest already included. The plug-in that allows you to use sbt assembly to create a single fat JAR is also added automatically by the script. The main entry point will be MyClassName prefixed by what’s in ORGANIZATION, as per usual. Some packages are commented out, but you can easily adapt the script to your needs:
#!/bin/bash
PROJECT="$1"
ORIGINAL_FOLDER="$(pwd)"
CFG_FILE="app.cfg"
if [ -r "$CFG_FILE" ]; then
  echo "Sourcing application configuration..." && source "$CFG_FILE"
else
  echo "No application configuration found." && exit 1
fi
# -----------------------------------------------------------------------------
# Create main folder and plugins file
# -----------------------------------------------------------------------------
cd "$SCALA_FOLDER" || exit 1
mkdir "$PROJECT"
cd "$PROJECT" || exit 1
mkdir project
echo "addSbtPlugin(\"com.eed3si9n\" % \"sbt-assembly\" % \"$SBT_VERSION\")" > project/plugins.sbt
# -----------------------------------------------------------------------------
# Create build.sbt
# -----------------------------------------------------------------------------
echo "name := \"$PROJECT\"
version := \"0.0.1\"
scalaVersion := \"$SCALA_VERSION\"
organization := \"$ORGANIZATION\"
val sparkVersion = \"$SPARK_VERSION\"
val suffix = \"provided\"
val test = \"test\"
scalacOptions ++= Seq(
  \"-unchecked\",
  \"-feature\",
  \"-deprecation\",
  \"-encoding\", \"UTF-8\",
  \"-Xlint\",
  \"-Xfatal-warnings\",
  \"-Ywarn-adapted-args\",
  \"-Ywarn-dead-code\",
  \"-Ywarn-numeric-widen\",
  \"-Ywarn-unused\",
  \"-Ywarn-unused-import\",
  \"-Ywarn-value-discard\"
)
libraryDependencies ++= Seq(
  \"org.apache.spark\" %% \"spark-core\" % sparkVersion % suffix,
  \"org.apache.spark\" %% \"spark-sql\" % sparkVersion % suffix,
  \"org.apache.hadoop\" % \"hadoop-client\" % \"$HADOOP_VERSION\" % suffix,
  \"org.scalatest\" %% \"scalatest\" % \"$SCALATEST_VERSION\" % test
)" > build.sbt
# -----------------------------------------------------------------------------
# Create folder structure
# -----------------------------------------------------------------------------
mkdir -p src/{main,test}/scala/$ORG_FOLDER
# -----------------------------------------------------------------------------
# Create main entry point
# -----------------------------------------------------------------------------
echo "package $ORGANIZATION
import org.apache.spark._
import org.apache.spark.sql._
object $PROJECT extends App {
  val spark = SparkSession
    .builder()
    .appName(\"$PROJECT\")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  try {
    // ...
  } finally {
    spark.stop()
  }
}" > src/main/scala/$ORG_FOLDER/$PROJECT.scala
# -----------------------------------------------------------------------------
# Create UnitSpec class with common mixins
# -----------------------------------------------------------------------------
echo "package $ORGANIZATION
import org.scalatest._
abstract class UnitSpec extends FlatSpec with Matchers with OptionValues with BeforeAndAfterAll" > src/test/scala/$ORG_FOLDER/UnitSpec.scala
# -----------------------------------------------------------------------------
# Create unit test scaffolding
# -----------------------------------------------------------------------------
echo "package $ORGANIZATION
import org.apache.spark._
import org.apache.spark.sql._
class ${PROJECT}Spec extends UnitSpec {
  val spark = SparkSession
    .builder()
    .appName(\"Suite: $PROJECT\")
    .master(\"local[*]\")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  \"A $PROJECT\" should \"...\" in {
  }

  it should \"...\" in {
  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}" > src/test/scala/$ORG_FOLDER/"$PROJECT"Spec.scala
# -----------------------------------------------------------------------------
# Return to original location
# -----------------------------------------------------------------------------
cd "$ORIGINAL_FOLDER"
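To give a feel for what the script leaves behind, here is a self-contained sketch that reproduces just the folder layout in a temporary directory; the project name WordCount and the tech.databaseline organization are only examples:

```shell
# Sketch of the directory tree sparkSetup.sh produces (folders only;
# build.sbt, plugins.sbt, and the Scala sources are omitted here).
PROJECT="WordCount"
ORG_FOLDER="tech/databaseline"
TMP="$(mktemp -d)"
cd "$TMP"
mkdir -p "$PROJECT"/project "$PROJECT"/src/{main,test}/scala/$ORG_FOLDER
find "$PROJECT" -type d | sort   # prints the nested directory tree
```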
Update: updateAppConfig.sh
Whenever you change the Scala or sbt version, or there is a new release of Spark with the latest and greatest you want to use in your application, it can be a slight pain to manage these versions yourself. Sure, you can look up the correct version and then update app.cfg, but developers are typically a little bit lazy. That’s why I created updateAppConfig.sh. It reads your system’s current Scala and sbt versions and checks Maven Central for newer versions of the standard dependencies: ScalaTest, Spark, and Hadoop. In case you want the script to manage additional libraries (e.g. HBase or Scalding) as well, just add them to the MAVEN_CONFIG array that’s already in place. Please observe the comments on the underscores to make sure you use the correct notation.
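The underscore convention matters because Scala libraries are cross-published per Scala binary version (e.g. scalatest_2.11), whereas plain Java artifacts such as hadoop-client are not. A minimal sketch of the check the script performs:

```shell
CURR_SCALA_VERSION_SHORT="2.11"
for libArtifactId in "scalatest_" "hadoop-client"; do
  # A trailing underscore marks a cross-built Scala artifact, which needs
  # the short (binary) Scala version appended; Java artifacts stay as-is.
  if [ "${libArtifactId: -1}" = "_" ]; then
    libArtifactId=$libArtifactId$CURR_SCALA_VERSION_SHORT
  fi
  echo "$libArtifactId"
done
# scalatest_2.11
# hadoop-client
```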
FUNCTIONS="$(realpath ../functions.sh)"
if [ -r "$FUNCTIONS" ]; then
  source "$FUNCTIONS"
else
  echo "Cannot load functions file." && exit 1
fi
CFG_FILE="$(basename app.cfg)"
if [ -r "$CFG_FILE" ]; then
  source "$CFG_FILE"
else
  echo "Cannot load configuration file." && exit 2
fi
# -------------------------------------------------------------------------------------------------
# Internal configuration: VERSION prefix|groupId|artifactId
# -------------------------------------------------------------------------------------------------
MAVEN_URL="https://search.maven.org/solrsearch/select?"
MAVEN_CONFIG=("SCALATEST|org.scalatest|scalatest_" \
              "SPARK|org.apache.spark|spark-core_" \
              "HADOOP|org.apache.hadoop|hadoop-client")
# -------------------------------------------------------------------------------------------------
# Scala and sbt versions
# -------------------------------------------------------------------------------------------------
CURR_SCALA_VERSION=$(scala -version 2>&1 | sed 's/.*version \([[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\).*/\1/')
CURR_SCALA_VERSION_SHORT=$(echo $CURR_SCALA_VERSION | sed 's/\([[:digit:]]*\.[[:digit:]]*\)\..*/\1/')
CURR_SBT_VERSION=$(sbt --version 2>&1 | sed 's/.*version \([[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\).*/\1/')
# -------------------------------------------------------------------------------------------------
# Update Scala and sbt version configurations in CFG_FILE
# -------------------------------------------------------------------------------------------------
if [ "$CURR_SCALA_VERSION" != "$SCALA_VERSION" ]; then
  sed -i "s/^SCALA_VERSION=.*$/SCALA_VERSION=\"$CURR_SCALA_VERSION\" # $SCALA_VERSION/" "$CFG_FILE"
fi
if [ "$CURR_SBT_VERSION" != "$SBT_VERSION" ]; then
  sed -i "s/^SBT_VERSION=.*$/SBT_VERSION=\"$CURR_SBT_VERSION\" # $SBT_VERSION/" "$CFG_FILE"
fi
# -------------------------------------------------------------------------------------------------
# Update version configurations in CFG_FILE based on MAVEN_CONFIG
# -------------------------------------------------------------------------------------------------
for i in "${!MAVEN_CONFIG[@]}"; do
  lib="${MAVEN_CONFIG[i]}"
  libConfig=(${lib//|/ })
  libId="${libConfig[0]}"
  libGroupId="${libConfig[1]}"
  libArtifactId="${libConfig[2]}"

  # Add the short Scala version if the artifact configuration ends with an underscore
  if [ "${libArtifactId: -1}" = "_" ]; then
    libArtifactId=$libArtifactId$CURR_SCALA_VERSION_SHORT
  fi

  # Build REST URL and extract the version (already sorted)
  query="q=g:\"$libGroupId\" AND a:\"$libArtifactId\""
  opts="&core=gav&rows=1&wt=json"
  libUrl="$(encodeURL $MAVEN_URL$query$opts)"
  libInfo="$(curl -s -X GET "$libUrl")"
  libVersion="$(echo $libInfo | sed 's/.*"v":"\([^"]*\)".*/\1/')"

  # Replace the version in CFG_FILE
  sed -i "s/^${libId}_VERSION=.*$/${libId}_VERSION=\"$libVersion\"/" "$CFG_FILE"
done
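The final sed relies on the Solr API returning the newest version in the "v" field of the first (and, with rows=1, only) document. A sketch of that extraction step on a sample response; the JSON below is illustrative and abridged, not a verbatim API reply:

```shell
# Abridged, illustrative response from the search.maven.org Solr endpoint.
libInfo='{"response":{"docs":[{"g":"org.scalatest","a":"scalatest_2.11","v":"3.0.1"}]}}'
# Strip everything around the value of the "v" field.
libVersion="$(echo $libInfo | sed 's/.*"v":"\([^"]*\)".*/\1/')"
echo "$libVersion"   # 3.0.1
```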
You have to add the entry to the list, where the first column is the variable name that the script uses (you’ll see it appear with the suffix _VERSION), the second the groupId, and the third the artifactId. Then, add the appropriate line to the libraryDependencies in sparkSetup.sh. Whenever you run updateAppConfig.sh, it updates app.cfg accordingly. Obviously, if you do not want a particular entry to be overwritten (because you’re stuck with a particular version), simply remove that line from the MAVEN_CONFIG array.
That’s all there is to it.
Please note that you should compile your application with the same Scala version as the one used to build Spark. You can find out which version that was by simply starting the Spark shell: its welcome banner lists the Scala version.
Important to note is that the updateAppConfig.sh script depends on the script library functions.sh, which you can also find on GitHub. Specifically, its encodeURL function is used to construct the libUrl variable that queries Maven’s Solr API.
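functions.sh itself is not reproduced here, but encodeURL needs to percent-encode the characters that would otherwise break the URL, notably the spaces and double quotes in the q= query. A hypothetical minimal version could look like this; the actual implementation on GitHub may differ:

```shell
# Hypothetical sketch: percent-encode only the characters that appear in
# the Solr query built by updateAppConfig.sh (space and double quote).
encodeURL() {
  local url="$1"
  url="${url// /%20}"   # encode spaces
  url="${url//\"/%22}"  # encode double quotes
  echo "$url"
}

encodeURL 'https://search.maven.org/solrsearch/select?q=g:"org.scalatest" AND a:"scalatest_2.11"'
```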
At some point I may add the ability to automatically propagate entries from updateAppConfig.sh to the dependencies in sparkSetup.sh, but I haven’t yet. Moving MAVEN_CONFIG to a separate file together with the static configuration (i.e. stuff not related to library versions) would be an acceptable solution, in case you want to try it yourself.