archaeology
audio
blog
data engineering
Published on 12 March 2016
There are many technologies for streaming data: simple event processors, stream processors, and complex event processors.
Even within the open-source community there is a bewildering number of options, sometimes with only a few major differences, and those differences are neither well documented nor easy to find.
That’s why I’ve decided to create an overview of Apache streaming technologies, including
Flume,
NiFi,
Gearpump,
Apex,
Kafka Streams,
Spark Streaming,
Storm (and Trident),
Flink,
Samza,
Ignite,
and Beam.
data governance
data integration
Published on 8 June 2014
In this second post of a four-part series on the challenges of data integration, I want to talk about project management.
Data integration is, as I have said before, not simply a matter of throwing technical people at a business problem.
Not literally of course: most people do not like being flung at things, abstract or concrete, but probably at the latter a bit less than at the former.
Project management is the key to your success.
Sure, you need able people to build the data warehouse, but without a solid foundation in project management your project will tip over at the slightest sigh.
And please take it from someone who has been there, done that, got the T-shirt, and has outgrown it: there will be a lot of sighs during the project, even full-blown tornadoes… To weather any storm, you and the entire organization have to live project management practices.
Project management is not the silver bullet, but it can protect you against the most common enemies: no idea, no plan, no back-up plan, and no support.
data science
Published on 1 February 2016
It has – perhaps somewhat prematurely – been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay.
The available literature, the majority of courses in both the virtual and real world, and the media all promote the image of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day.
The reality for many in the field is quite different.
Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity.
Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world. It’s time to make the case for industrial data science.
Hadoop
hustle
leadership
machine learning
music
Oracle
product management
productivity
quantum computing
round-up
Scala
Published on 9 November 2018
Algebird, Cats, and Scalaz are Scala libraries that provide, among other things, type classes that are based on algebraic constructs, such as groups, rings, and monoids.
Contrary to what I had initially, and perhaps naively, believed, these libraries do not perform compile-time or run-time checks on the laws governing these algebraic structures, which means it’s possible to create a class that is an instance of such a type class yet violates the laws of that type class.
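To make that concrete, here is a minimal sketch in plain Scala. The `Monoid` trait below is a hypothetical stand-in written for illustration, not the actual definition from Algebird, Cats, or Scalaz, and the names `MonoidLaws`, `intAddition`, `intSubtraction`, `isAssociative`, and `hasIdentity` are likewise assumptions. It shows that an instance based on subtraction compiles without complaint even though it breaks both the associativity and the identity law.

```scala
// Hypothetical, simplified type class for illustration only;
// the real libraries define richer versions of this idea.
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

object MonoidLaws {
  // Lawful instance: addition is associative and 0 is its identity.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Unlawful instance: subtraction is neither associative nor does 0
  // act as a left identity, yet the compiler accepts it.
  val intSubtraction: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 0
    def combine(x: Int, y: Int): Int = x - y
  }

  // Spot check of associativity for one triple of values.
  def isAssociative[A](m: Monoid[A], x: A, y: A, z: A): Boolean =
    m.combine(m.combine(x, y), z) == m.combine(x, m.combine(y, z))

  // Spot check of the left and right identity laws for one value.
  def hasIdentity[A](m: Monoid[A], x: A): Boolean =
    m.combine(m.empty, x) == x && m.combine(x, m.empty) == x
}
```

These helpers only test the laws for the specific values passed in, so a passing check is not a proof; a property-based test that samples many values would be the more thorough approach.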
Spark
stereoscopy
studio
technology