Spark Actions, Laziness, and Caching
Ian Hellström | 1 April 2016 | 2 min read
Time is important when thinking about what happens when executing operations on Spark’s
The documentation on actions is quite clear
but it doesn’t hurt to look at a very simple example
that may be somewhat counter-intuitive unless you are already familiar with transformations, actions, and laziness.
Operations on RDDs, as I have previously mentioned, are divided into transformations and actions. Transformations are lazy, which means they are only evaluated when needed; they are delayed-action operations. ‘When needed’ means when an action is called, because actions force Spark to come up with an answer.
As a corollary, before you call an action nothing gets computed.
All Spark does is build the DAG (i.e. directed acyclic graph).
The computation runs after you have invoked an action on the
Persisting of RDDs is lazy, as stated in the documentation:
You can mark an RDD to be persisted using the
cache()methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.
Let’s take a look at a simple example that clearly shows the laziness of persisting or caching. The code below jumps between Bash and a Spark shell (Scala), so you see what happens in terms of timing. If you want to reproduce the code on your machine, just open two windows or tabs: one with Spark and one with Bash (or whatever shell you prefer). The relevant output is shown on the right.
If caching were an action – which it isn’t! – the number of files would be different in the first instance.
In the second instance, the absence of the
extra.txt file causes an exception because the DAG is recalculated from the last known point, which in this case means from scratch; the additional file has been removed in the meantime.
When we do the same for
files2, we do not see an exception because the file exists again.
Please observe that the contents have been modified though.
So, when persisting and unpersisting, make sure you get the timing right.
So, just in case you didn’t know already:
persist() are evaluated lazily and their computations are triggered by actions.