As technology evolves at a rapid rate, it may sometimes be difficult to keep up. While I am certainly not the world’s leading expert on anything, I thought I’d share how I keep abreast of the latest developments in the industry.
Unless you have a cluster with Apache Spark installed on it at your disposal, you may want to play a bit with Spark on your own machine. The standard VMs or Docker images (e.g. Cloudera, Hortonworks, IBM, MapR, Oracle) do not offer the latest and greatest. If you really want the bleeding edge of Spark, you have to install it locally yourself, roll your own Docker container, or simply use sbt.
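As a sketch of the sbt route, a minimal build.sbt that pulls Spark onto the classpath might look like the following (the project name and version numbers here are illustrative, not a recommendation):

```scala
// Minimal build.sbt for experimenting with Spark locally.
// Versions are examples only; pick whatever release you want to try.
name := "spark-sandbox"

scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"
```

With this in place, `sbt console` drops you into a REPL with Spark's dependencies resolved, which is often all you need to experiment on a single machine.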
Continuous integration, Docker, Jenkins, Vagrant, DevOps, PaaS, serverless… If you are sometimes confused as to what the latest buzzwords in the tech industry mean, you have come to the right place.
Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.
Apache Phoenix is a SQL skin for HBase, a distributed key-value store. Phoenix offers two flavours of UPSERT, but from that it may not be obvious how to update non-key columns, as you need the whole key to add or modify records.
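To make that concrete, here is a hedged sketch of the point (the table and column names below are made up for illustration):

```sql
-- Hypothetical table: the row key is (id), so every UPSERT must supply it.
CREATE TABLE IF NOT EXISTS users (
  id   BIGINT NOT NULL PRIMARY KEY,
  name VARCHAR,
  city VARCHAR
);

-- Even to change only the non-key column city, you still repeat the full key:
UPSERT INTO users (id, city) VALUES (42, 'Utrecht');
```

Because Phoenix has no standalone UPDATE statement, modifying a single non-key column is still an UPSERT that identifies the row by its complete key.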
When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is, the fully qualified class name. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.
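The idea can be sketched as a small shell script like the one below. This is not the post’s actual script; the project layout, file contents, and version numbers are placeholder assumptions to show the general shape:

```shell
#!/bin/sh
# Sketch: generate a minimal Spark/sbt project skeleton.
# All names and versions below are illustrative placeholders.
APP_NAME=${1:-spark-skeleton}

mkdir -p "$APP_NAME/src/main/scala"

# A build.sbt with versions pinned in one place, easy to bump later.
cat > "$APP_NAME/build.sbt" <<EOF
name := "$APP_NAME"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
EOF

# The skeleton application: only the class name really varies per project.
cat > "$APP_NAME/src/main/scala/Main.scala" <<EOF
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("$APP_NAME").getOrCreate()
    // application logic goes here
    spark.stop()
  }
}
EOF

echo "created $APP_NAME"
```

Running it with a project name as the first argument stamps out a ready-to-compile sbt project, so starting a new Spark application is one command instead of copy-pasting boilerplate.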
Time is important when thinking about what happens when executing operations on Spark’s RDDs. The documentation on actions is quite clear, but it doesn’t hurt to look at a very simple example that may be somewhat counter-intuitive unless you are already familiar with transformations, actions, and laziness.
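As one illustrative sketch of that kind of surprise (not necessarily the post’s own example): a transformation that closes over a mutable variable uses the variable’s value at the time the action runs, not at the time the transformation was defined.

```scala
import org.apache.spark.sql.SparkSession

object LazinessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("laziness-demo")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 5)

    var factor = 2
    val scaled = rdd.map(_ * factor) // transformation: nothing is computed yet

    factor = 10 // reassigned before any action runs

    // The action triggers the actual computation; the closure sees factor = 10,
    // not the 2 it appeared to capture above.
    println(scaled.collect().mkString(", ")) // prints 10, 20, 30, 40, 50
    spark.stop()
  }
}
```

Transformations only record lineage; work happens when an action such as `collect` forces evaluation, which is exactly why the timing of side effects and captured state can be counter-intuitive.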
There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering number of options, often with few major differences that are neither well documented nor easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.