A Glossary of Some Software Terminology

Continuous integration, Docker, Jenkins, Vagrant, DevOps, PaaS, serverless… If you are sometimes confused as to what the latest buzzwords in the tech industry mean, you have come to the right place

read on...

The Problems with Visual Programming Languages in Data Engineering

Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.

read on...

Shell Scripts to Ease Spark Application Development

When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is the fully qualified class. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.

read on...

Spark Actions, Laziness, and Caching

Time is important when thinking about what happens when executing operations on Spark’s RDDs. The documentation on actions is quite clear but it doesn’t hurt to look at a very simple example that may be somewhat counter-intuitive unless you are already familiar with transformations, actions, and laziness.

read on...

An Overview of Apache Streaming Technologies

There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering amount of options with sometimes few major differences that are not well documented or easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.

read on...