Reading JSON Resource Files in Apache Spark

If you need to read a JSON file from a resources directory and make its contents available as a plain String, an RDD, or even a Dataset in Spark 2.1 (with Scala 2.11), you’ve come to the right place.

Previously I have written about how to get started with Spark application development using Scala and sbt, how to use shell scripts to create a basic outline of a Spark application, and how to use sbt to experiment with Spark without the need for a full-blown Spark installation.

I’m now going to piggy-back on those posts and show you how to read a JSON file from src/main/resources. You may want to do this in a unit test, or simply to try out how best to extract information from a JSON document.

The following snippet reads a JSON file from the resources directory, trims leading and trailing spaces, and collapses repeated whitespace characters:

def readJsonResource(file: String): String = {
  // Resolve the file against the classpath root; the leading slash is required here.
  val stream = getClass.getResourceAsStream(s"/$file")
  scala.io.Source.fromInputStream(stream)
    .getLines
    .mkString(" ")
    .trim
    .replaceAll("\\s+", " ")
}

The trimming and regular-expression replacement are not strictly necessary, but they make the string slightly easier on the eyes. Note that from Scala 2.12 onwards, you can use scala.io.Source.fromResource(file) instead, in which case you do not have to provide an initial forward slash. However, the current Spark version (2.1.0) has been compiled against 2.11, so we have to stick with 2.11.
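For the curious, here is a sketch of what the 2.12 variant could look like; note the absence of the leading slash:

// Scala 2.12+ only: fromResource resolves against the classpath root itself,
// so no leading forward slash is needed.
def readJsonResource(file: String): String =
  scala.io.Source.fromResource(file)
    .getLines
    .mkString(" ")
    .trim
    .replaceAll("\\s+", " ")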

If you then have implicit values of types SparkContext and SparkSession in scope, you can easily make an RDD[String] or a Dataset[Row] out of a JSON file:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row, SparkSession}

def readJsonResourceAsRDD(file: String)(implicit sparkCtxt: SparkContext): RDD[String] =
  sparkCtxt.parallelize(List(readJsonResource(file)))

def readJsonResourceAsDataset(file: String)(implicit sparkSession: SparkSession): Dataset[Row] =
  sparkSession.read.json(readJsonResourceAsRDD(file)(sparkSession.sparkContext))
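To illustrate, here is a minimal sketch of how you might bring the implicits in scope and call these functions; the local master, the application name, and the file data.json are my own assumptions, not part of the sandbox project:

implicit val spark: SparkSession = SparkSession.builder()
  .master("local[*]")       // run Spark locally, one thread per core
  .appName("json-sandbox")  // hypothetical application name
  .getOrCreate()
implicit val sc: SparkContext = spark.sparkContext

// Assumes a file data.json exists in src/main/resources.
val ds = readJsonResourceAsDataset("data.json")
ds.printSchema()
ds.show()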

Instead of copy-pasting these functions, just go to GitHub and download the minimal sandbox project. That way, you can play around with Spark and your JSON file as much as you like, and perhaps even add more packages to build.sbt. It also comes with a tiny bit of documentation on the functions I’ve shown and a Scalastyle plugin, so that your IDE can complain when you write awfully formatted code.

It is not recommended to specify a schema with any data types other than String for JSON data, unless you are absolutely sure that every key-value pair in every JSON document Spark is handed is formatted correctly and consistently. The reason is that the parser has only very limited type-coercion capabilities, which can cause data corruption or loss: a single failed type cast is enough to turn the entire row into NULLs. Spark does not throw an exception when that happens, so be very careful when using a schema.
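If you do provide a schema, sticking to String columns sidesteps the coercion problem entirely. Here is a minimal sketch, assuming the implicits from the earlier snippet are in scope; the column names are hypothetical:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// An all-String schema: the parser never has to coerce types,
// so a single malformed value cannot null out the entire row.
val schema = StructType(Seq(
  StructField("id", StringType),    // hypothetical column
  StructField("name", StringType)   // hypothetical column
))

val withSchema = spark.read
  .schema(schema)
  .json(readJsonResourceAsRDD("data.json"))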

In IntelliJ IDEA, for instance, you can use an interactive Scala console to fiddle with Spark and JSON. To get that up and running, do the following:

  1. Import the sbt project, and set the JDK and Scala SDK.
  2. Go to Run → Edit Configurations… and, under ‘Use class path and SDK of module’, choose ‘sandbox’ (or whatever you have chosen to call the project).
  3. Select the code you want to execute.
  4. Right-click on the selection and pick ‘Send Selection to Scala Console’ or use whatever keyboard shortcut you feel is comfortable.

There is a region in SparkSandbox.scala labelled Setup, which can be collapsed (i.e. folded) in its entirety. That way, you can execute the entire section with code to set up Spark, including the auxiliary functions I have shown above, with a single-line selection. It’s there to make selecting and executing a tad easier.
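If you want to mimic that in your own code, IntelliJ IDEA recognizes //region and //endregion markers as foldable regions. A minimal sketch of what such a Setup region could contain; the actual contents of SparkSandbox.scala may of course differ:

//region Setup
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

implicit val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("sandbox")
  .getOrCreate()
implicit val sc: SparkContext = spark.sparkContext
//endregion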

In case you run into java.net.UnknownHostException, you have to execute the following command:

echo "127.0.0.1 localhost.localdomain $(hostname)" | sudo tee --append /etc/hosts > /dev/null

Now you can execute Spark (and Scala) code to tinker with JSON data from the comfort of your favourite IDE.