Spark Actions, Laziness, and Caching

Timing matters when you reason about what happens as operations execute on Spark’s RDDs. The documentation on actions is quite clear, but it doesn’t hurt to look at a very simple example whose behaviour may seem counter-intuitive unless you are already familiar with transformations, actions, and laziness.

Operations on RDDs, as I have previously mentioned, are divided into transformations and actions. Transformations are lazy: they are not evaluated when you define them, only when their result is needed. ‘When needed’ means when an action is called, because actions force Spark to come up with an answer.

As a corollary, nothing gets computed before you call an action. Until then, all Spark does is build the DAG (directed acyclic graph) of transformations. The computation runs only once you have invoked an action on the RDD.
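To make this concrete, here is a minimal sketch meant to be pasted into a Spark shell running in local mode. The accumulator `evaluated`, the application name, and the sample data are assumptions made for illustration only. The accumulator counts how many elements the map function has actually processed, so it stays at zero until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("laziness-sketch"))

// Hypothetical counter: incremented once per element the map function touches.
val evaluated = sc.longAccumulator("evaluated")

// A transformation: Spark only records it in the DAG, nothing runs yet.
val doubled = sc.parallelize(1 to 5).map { x => evaluated.add(1); x * 2 }

val before = evaluated.sum        // 0: the map function has not run

// An action: forces Spark to execute the DAG and produce an answer.
val total = doubled.reduce(_ + _) // 30

val after = evaluated.sum         // 5: every element was processed by the action
```

Calling `doubled.toDebugString` before the action prints the lineage Spark has recorded so far, which is a handy way to inspect the DAG without triggering any computation.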

Persisting RDDs is lazy too, as stated in the documentation:

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.

Let’s take a look at a simple example that clearly shows the laziness of persisting or caching. The code below jumps between Bash and a Spark shell (Scala), so you can see what happens and when. If you want to reproduce the code on your machine, just open two windows or tabs: one with Spark and one with Bash (or whatever shell you prefer). The relevant output is shown on the right.

If caching were an action (which it isn’t!), the number of files would be different in the first instance. In the second instance, the absence of the extra.txt file causes an exception: the DAG is recomputed from the last known point, which in this case means from scratch, and the additional file has been removed in the meantime. When we do the same for files2, we do not see an exception because the file exists again. Note, however, that its contents have been modified in the meantime. So, when persisting and unpersisting, make sure you get the timing right.
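For readers who prefer a single window, the first part of this experiment can be sketched in Scala alone. This is not the two-window listing described above but a minimal sketch under assumptions: Spark runs in local mode, the Bash steps are replaced by java.nio file operations, and the temporary directory, the files a.txt and b.txt, and all variable names are hypothetical. Only extra.txt is taken from the text.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-cache-files-sketch"))

// Hypothetical throwaway directory with two small text files (3 lines in total).
val dir = Files.createTempDirectory("lazy-cache-sketch")
Files.write(dir.resolve("a.txt"), "one\ntwo\n".getBytes(UTF_8))
Files.write(dir.resolve("b.txt"), "three\n".getBytes(UTF_8))

// cache() only marks the RDD for persistence; the directory is not read yet.
val lines = sc.textFile(dir.toUri.toString).cache()

// Stand-in for the Bash step: add extra.txt after cache() but before any action.
val extra = dir.resolve("extra.txt")
Files.write(extra, "four\n".getBytes(UTF_8))

// The first action reads the directory as it is now, so extra.txt is included.
val firstCount = lines.count()   // 4, not 3

// Remove the file again. The second action is served from the cache,
// so it does not notice that the file is gone.
Files.delete(extra)
val secondCount = lines.count()  // still 4
```

If cache() were eager, firstCount would be 3, because extra.txt did not exist when cache() was called. Once the RDD is unpersisted, a further action has to go back to the files on disk, which is the situation in which the missing extra.txt becomes a problem.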

So, just in case you didn’t know already: cache() and persist() are evaluated lazily. They only mark an RDD for persistence; the computation itself is triggered by the first action.
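The whole life cycle can be checked with a small sketch for a Spark shell in local mode. As before, the accumulator `computed`, the application name, and the sample data are assumptions chosen for illustration. The accumulator counts how often the map function runs, which shows when Spark computes, when it reads from the cache, and when it has to recompute after unpersist().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the shell's context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-cache-sketch"))

// Hypothetical counter: incremented once per element the map function touches.
val computed = sc.longAccumulator("computed")

val squares = sc.parallelize(1 to 5).map { x => computed.add(1); x * x }.cache()

// cache() has set the storage level, but nothing has been computed or stored.
val marked = squares.getStorageLevel == StorageLevel.MEMORY_ONLY // true
val afterCache = computed.sum         // 0

squares.count()                       // first action: computes and fills the cache
val afterFirstAction = computed.sum   // 5

squares.count()                       // second action: served from the cache
val afterSecondAction = computed.sum  // 5, the map function did not run again

squares.unpersist(blocking = true)    // drop the cached partitions
squares.count()                       // recomputed from the start of the lineage
val afterUnpersist = computed.sum     // 10
```

Note that accumulator updates made inside transformations are only reliable for a demonstration like this one, where no task is retried.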