Topics

archaeology 

Aegean Bronze Age Chronology

Published on 5 September 2022

I present a timeline for the Aegean Bronze Age based on the high chronology.

audio 

Learn Audio Engineering

Published on 6 September 2021

What is the best way to learn audio engineering for the (wannabe) home studio owner in 2021?

blog 

Migrations: from Wordpress to Firebase

Published on 11 December 2019

As of December 2019, databaseline.tech is served from Firebase to deliver all content securely, reliably and quickly, for free. This article details the background and various reasons for the migrations that have happened over the years.

data engineering 

An Overview of Apache Streaming Technologies

Published on 12 March 2016

There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering number of options, often with only minor differences that are neither well documented nor easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.

The Problems with Visual Programming Languages in Data Engineering

Published on 22 September 2016

Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.

data governance 

The Challenges of Data Integration: Data Governance (Part 4/4)

Published on 6 July 2014

The fourth and final part of the series on the challenges of data integration is about data governance. Data governance is not so much a challenge as it is a critical component of continued success. Data integration is usually only the Band-Aid that is applied to a particular business problem. Beyond that it has the power to transform a business, but to do so you need to continuously guard, monitor, analyse, and improve your data and related business processes, so that the information you glean from it is always sound.

Data ≠ Information

Published on 3 August 2014

Big Data. Every time I hear that buzz phrase whizzing around I have the inexplicable urge to smack it back into the hive.

The Data Quality Doctor

Published on 31 August 2014

When it comes to data quality, many companies are still stuck in firefighting mode: whenever a problem arises, a quick fix is introduced, often without proper documentation, after which the problem is filed under ‘H’ for “hopefully not my problem the next time around”. The patches are often temporary, haphazard, and cause more problems downstream than they really solve. In most instances there is no obvious reason for it, which makes a bad situation worse. It reminds me of the saying that the treatment is worse than the disease, so it’s time to bring in the doctor…

Frustration: The Data Governance Sinkhole

Published on 23 November 2014

Plentiful are the companies that revel in fancy descriptions of data-driven decision-making cultures on their corporate websites. Scarce are they who actually have a data governance office to back up these grand claims, for any data-centred programme without clear definitions and business processes regarding data is doomed to fail. What is less known about data governance is that there is a phase during which companies run a risk of losing their best and brightest because of inaction or even worse: wrong actions. Michael Lopp has written an excellent article on why bored engineers quit, and as bad as that may be in general situations it is disastrous in the early phases of a data governance programme.

Why Govern Your Data?

Published on 1 February 2015

The way a company looks at its data is indicative of its readiness to embrace a data governance programme: is data a by-product of doing business or an asset that requires attention and resources? One of the key questions with data governance is, ‘Why?’ Why should you govern your data? What’s the benefit?

Dimensions of Data Quality

Published on 5 May 2021

Data is like underwear: preferably hidden from outsiders, and always a lot dirtier than you think.

data integration 

The Challenges of Data Integration: Technical (Part 1/4)

Published on 25 May 2014

Data integration is a formidable challenge. For one, data integration is never the goal of an organization. Similarly, a data warehouse is never the objective. It is merely a vehicle that can drive you to your destination. Data storage and integration for data’s sake are a waste of time, money, resources, and nerves. Without a clear business case, effective leadership, and strong support, including but not limited to a highly visible and respected sponsor, any data integration project is doomed from the get-go.

The Challenges of Data Integration: Project Management (Part 2/4)

Published on 8 June 2014

In this second post of a four-part series on the challenges of data integration I want to talk about project management. Data integration is, as I have said before, not simply a matter of throwing technical people at a business problem. Not literally of course: most people do not like being flung at things, abstract or concrete, but probably at the latter a bit less than at the former. Project management is the key to your success. Sure, you need able people to build the data warehouse, but without a solid foundation in project management your project will tip over at the slightest sigh. And please take it from someone who has been there, done that, got the T-shirt, and has outgrown it: there will be a lot of sighs during the project, even full-blown tornadoes… To weather any storm, you and the entire organization have to live project management practices. Project management is not the silver bullet, but it can protect you against the most common enemies: no idea, no plan, no back-up plan, and no support.

The Challenges of Data Integration: People (Part 3/4)

Published on 22 June 2014

The third part of the four-part series on the challenges of data integration deals with people. I have already hinted at a few people issues in the first and second parts on technical and project management challenges, respectively, but I have not gone into specifics. Where people work together there will be conflicts. As we shall see, data integration projects can be particularly tricky, as they require dirty data to be ‘smuggled’ over organizational borders into enemy territory.

The Challenges of Data Integration: Data Governance (Part 4/4)

Published on 6 July 2014

The fourth and final part of the series on the challenges of data integration is about data governance. Data governance is not so much a challenge as it is a critical component of continued success. Data integration is usually only the Band-Aid that is applied to a particular business problem. Beyond that it has the power to transform a business, but to do so you need to continuously guard, monitor, analyse, and improve your data and related business processes, so that the information you glean from it is always sound.

data science 

The Football Fortune Tellers

Published on 13 July 2014

It was of course a public relations gimmick by Goldman Sachs and PricewaterhouseCoopers to ‘predict’ the outcome of the 2014 FIFA World Cup. Their best and brightest looked into their crystal footballs at historical data and created statistical models that foresaw what has recently transpired. Let’s take a look at what they claimed before the World Cup and what really happened in Brazil in order to answer the question: Were they right?

Homelessness: A Problem with Data but Few Solutions

Published on 14 September 2014

A while ago I had the pleasure to visit Strasbourg, the capital of the Alsace region and home of the European parliament. Strasbourg is a lovely city to spend a few days, go shopping, enjoy French and Alsatian cuisine, and take in the scenery, most notably the picturesque Petite France area with its half-timbered houses and restaurants serving local delicacies along the river Ill. The story I want to tell has very little to do with Strasbourg though. It is about homelessness, statistics on homelessness, and more to the point: the lack of action by our governments to solve the problem of homelessness.

What Makes A New York Times Article Popular? (Part 1/2)

Published on 22 May 2015

I completed the MITx course The Analytics Edge on edX, which I can wholeheartedly recommend to anyone who is interested in analytics. As a part of the MOOC, there was a competition on Kaggle to build a predictive model to answer the question, ‘What makes a New York Times blog article popular?’

What Makes A New York Times Article Popular? (Part 2/2)

Published on 22 May 2015

In the previous post I talked about the details of the data set for the Kaggle challenge to build a model that predicts which New York Times blog articles become popular. In this post I shall discuss how I went about discovering and adding features, and building a predictive model that landed me in the top 15% of all entries.
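
To give a flavour of the kind of model involved, here is a minimal sketch in Python with scikit-learn that combines headline text features with a simple metadata column. It is not the original solution; the file name and the column names (Headline, WordCount, Popular) are assumptions modelled on the competition data set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical file and column names, modelled on the Kaggle data set.
df = pd.read_csv("nyt_blog_articles.csv")

features = ColumnTransformer([
    ("headline", TfidfVectorizer(max_features=5000), "Headline"),  # text column
    ("length", "passthrough", ["WordCount"]),                      # numeric column
])
model = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[["Headline", "WordCount"]], df["Popular"], random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```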

The Case for Industrial Data Science

Published on 1 February 2016

It has – perhaps somewhat prematurely – been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay. The available literature, the majority of courses in both the virtual and real world, and the media all project the image of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day. The reality for many in the field is quite different. Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity. Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world, and it’s time to make the case for industrial data science.

Introducing Zoose

Published on 1 March 2022

Zoose is an open-source Docker container image for Jupyter notebooks, pre-loaded with common Python and R packages for data science as well as a Neo4j web server for graph analytics with Cypher.

Introducing Zoose 2.0

Published on 22 August 2022

Zoose 2.0 introduces many improvements and Zoose now also includes common quantum computing libraries!

Zoose 3.0 ♥ Gitpod

Published on 15 October 2022

Zoose for Gitpod offers a powerful Python notebook experience with both VSCode and JupyterLab. Absolutely no installation needed!

Zoose on GitHub Codespaces

Published on 14 November 2022

Spin up Zoose in VSCode with only two clicks.

Hadoop 

An Overview of File and Serialization Formats in Hadoop

Published on 7 December 2015

Apache Hadoop is the de facto standard in Big Data platforms. It’s open source, it’s free, and its ecosystem is gargantuan. When dumping data into Hadoop, the question often arises which container and which serialization format to use. In this post I provide a summary of the various file and serialization formats.

Batch-Updating Non-Key Columns in Apache Phoenix

Published on 15 July 2016

Apache Phoenix is a SQL skin for HBase, a distributed key-value store. Phoenix offers two flavours of UPSERT, but it may not be obvious how to update non-key columns, since you need the whole key to add or modify records.
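
As a rough sketch of the general idea (not necessarily the post’s exact approach): because an UPSERT needs the full primary key, a batch update of a non-key column can be expressed as UPSERT … SELECT, where the SELECT returns the existing keys alongside the new value. Below the query is issued through the phoenixdb Python driver against a Phoenix Query Server; the connection URL, table, and column names are made up for illustration.

```python
import phoenixdb

# Assumed Phoenix Query Server endpoint; table and columns are illustrative.
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# The SELECT supplies the full primary key (order_id) for each affected row,
# while the non-key column (status) is overridden with a constant.
cursor.execute("""
    UPSERT INTO orders (order_id, status)
    SELECT order_id, 'ARCHIVED'
    FROM orders
    WHERE order_date < TO_DATE('2016-01-01')
""")
conn.close()
```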

Shell Scripts to Ease Life with Hadoop

Published on 10 February 2017

Interaction with HDFS via the file system shell commands or YARN’s commands is cumbersome. I have collected several helpful functions in a shell script to make life with Hadoop and YARN a tad more bearable. Here I’ll go through the salient bits.

Shell Scripts to Check Data Integrity in Hive

Published on 11 February 2017

Apache Hive does not come with an out-of-the-box way to check tables for duplicate entries or a ready-made method to inspect column contents, such as R’s summary function. In this post I shall show a shell script replete with functions to do exactly that: count duplicates and show basic column statistics. These two functions are ideal when you want to perform a quick sanity check on the data stored in or accessible with Hive.
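
The scripts themselves are written in shell, but the core duplicate check boils down to a simple GROUP BY … HAVING query. As a rough illustration, here is the same idea issued from Python via PyHive; the HiveServer2 host, database, table, and column names are assumptions.

```python
from pyhive import hive

# Assumed HiveServer2 endpoint and table/column names.
conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Keys that appear more than once are duplicates.
cursor.execute("""
    SELECT order_id, COUNT(*) AS occurrences
    FROM sales.orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
""")
for order_id, occurrences in cursor.fetchall():
    print(order_id, occurrences)
```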

hustle 

So You Want to Be a Billionaire?

Published on 1 May 2021

Then you’d better think about your background.

Personal Priorities

Published on 3 May 2021

What is important in life and how can you figure it out before it is too late?

leadership 

The Quit Upwards Paradox

Published on 13 February 2020

There can be many reasons why employees quit. Many people leave for jobs with better remuneration at other companies, yet the companies they leave behind hire people who do the same. Therein lies a paradox: why are fewer people promoted than hired from outside, especially since those who quit obtain pay rises significantly larger than those who stay at the same company?

Take-Home Interview Challenges are Biased

Published on 14 February 2020

A common practice in hiring in the tech industry is the take-home interview challenge, which, despite its neutral appearance, is implicitly biased and can actually decrease diversity in an industry that is already predominantly young, white, and male.

Essential Reading for Data and Machine Learning

Published on 25 February 2020

Here’s my list of essential reading for people in the data and machine learning space.

machine learning 

Lean Data and Machine Learning Operations

Published on 13 December 2019

Lean Data and Machine Learning Operations (D/MLOps) is the adoption of the ‘lean’ philosophy from manufacturing. Its aim is to continuously improve the operation of data and machine learning pipelines.

A Tour of End-to-End Machine Learning Platforms

Published on 21 February 2020

Machine Learning (ML) is known as the high-interest credit card of technical debt. It is relatively easy to get started with a model that is good enough for a particular business problem, but to make that model work in a production environment that scales and can deal with messy, changing data semantics and relationships, and evolving schemas in an automated and reliable fashion, that is another matter altogether. If you’re interested in learning more about a few well-known ML platforms, you’ve come to the right place!

A Brief History of Machine Learning Platforms

Published on 23 September 2020

Data and machine learning technologies for the big data era have been developed in the last twenty years. Let’s review these two decades in terms of data processing and machine learning frameworks as well as machine learning platforms that have been spotted in the wild.

What is a Feature Store?

Published on 27 January 2021

What is all the fuss about feature stores in machine learning?

Machine Learning Platforms in 2021

Published on 2 February 2021

How many machine learning platforms run on Kubernetes? Which machine learning platforms can run in air-gapped environments? How common are feature stores in current machine learning platforms?

Machine Learning Industry Research

Published on 8 February 2021

Check out the latest industry research for enterprise machine learning.

DevSecOps for ML

Published on 21 April 2021

Go distroless and reduce the image size as well as the number of CVEs for machine learning containers.

How to Spot a Rogue Data Scientist

Published on 30 April 2021

The last few years have seen a massive influx of “data scientists” who complete a few MOOCs, participate in a Kaggle competition or two, and think that experience is representative of life as a data scientist in the enterprise. If that mindset persists, and they show no desire to change it, they are what I call rogue data scientists.

Of Research and Rogues

Published on 30 April 2021

When was the last time you read an article in the press on machine learning operations?

Sorry, Batman is Busy

Published on 10 May 2021

The Riddler is back in town and no one can deploy a single machine learning model. Throwing more resources at the problem won’t solve anything and Commissioner Gordon knows it. There is only one man who can save Gotham from the pitfalls of perennial productionless prototypes: Batman!

ML Cards for D/MLOps Governance

Published on 11 May 2021

Machine learning without proper attention to code or data is futile. But how do you approach data and machine learning governance in practice? With cards!

Run PyTorch in Your Browser

Published on 3 November 2022

Learn how to run PyTorch in your browser in VSCode or JupyterLab. No installation required.

DevOps for ML

Published on 27 November 2022

While DevOps can be applied to machine learning, there are subtle differences that may not be obvious to the casual observer.

machine learning 

Optimization with TensorFlow

Published on 28 August 2019

TensorFlow is a free, open-source machine learning framework that’s geared towards deep learning. Optimization algorithms are at the heart of artificial neural networks. We can therefore let TensorFlow solve numerical optimization problems.
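
As a minimal sketch of that idea (not necessarily the example used in the post), here is TensorFlow 2 minimizing a plain two-variable function by computing gradients with GradientTape and applying them with the Adam optimizer.

```python
import tensorflow as tf

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2, whose minimum is at (3, -1).
x = tf.Variable(0.0)
y = tf.Variable(0.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

for _ in range(1000):
    with tf.GradientTape() as tape:
        loss = (x - 3.0) ** 2 + (y + 1.0) ** 2
    grads = tape.gradient(loss, [x, y])          # automatic differentiation
    optimizer.apply_gradients(zip(grads, [x, y]))

print(f"x = {x.numpy():.3f}, y = {y.numpy():.3f}")  # approaches (3, -1)
```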

music 

How Do Musicians Make Money? (Part 1/2)

Published on 12 February 2017

For most of human history, music has been composed and performed by amateurs, enslaved individuals, and professionals in direct employment of the nobility. Music as a full-time career option without some form of servitude is a fairly recent phenomenon. Before we dive into the complexities of the music industry and how musicians make money now, let’s go back in time and see how we got here.

How Do Musicians Make Money? (Part 2/2)

Published on 12 February 2017

2016 was a good year for the music business: revenue from streaming and digital downloads exceeded that of physical sales for the first time in the history of the industry. On the one hand, companies such as Spotify and Apple have made digital music easily accessible and legal. On the other hand, each stream only pays artists fractions of a cent, which means that all but the most popular artists make very little money from streaming. Fair compensation for artists in a digital world is a recurring topic in media and industry. In this article I want to look at the ways professional musicians make money now.

Blockchains and Music

Published on 3 March 2017

There are two crucial pieces of information for each song: who performed it (performing rights) and who owns the rights to it (copyrights). As of today, there is no central system that maintains this information. Is this really such a problem? Well, Spotify had to settle a $30m lawsuit a while ago because they had no idea whom to pay. There is also an infamous case in which 107% of the rights to a single song were sold. So, yes, it’s a pretty big deal. That’s why I want to take a look at blockchains as a possible solution.

Learn Audio Engineering

Published on 6 September 2021

What is the best way to learn audio engineering for the (wannabe) home studio owner in 2021?

Oracle 

Oracle Database Optimization for Developers

Published on 17 August 2014

Let’s talk about my goings-on over at Read The Docs (RTD).

How to Multiply Across a Hierarchy in Oracle: SQL Statements (Part 1/2)

Published on 28 September 2014

Hierarchical and recursive queries pop up from time to time in the life of a database developer. Complex hierarchies are often best handled by databases that are dedicated to such structures, such as Neo4j, a popular graph database. Most relational database management systems can generally deal with reasonable amounts of hierarchical data too. The classical example of hierarchical queries in SQL is the employees table: construct the organization chart with the CEO sitting at the top of the tree and all employees dangling from the branches on which their respective managers have placed their bottoms. While a directed acyclic graph of all company talent may tickle your fancy, I very much doubt it.

How to Multiply Across a Hierarchy in Oracle: Performance (Part 2/2)

Published on 12 October 2014

In the first part I showed several options to calculate aggregation functions along various branches of a hierarchy, where the values of a particular row depend on the values of its predecessors; specifically, the current row’s value is the product of its predecessors’ values. Performance was the elephant in the room, so it’s time to take a look at the relative performance of the various alternatives presented.
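
Conceptually, the calculation is simple: each row’s result is its own value multiplied by the values of all of its ancestors. The posts do this in Oracle SQL; the toy Python sketch below (with made-up names and values) only illustrates what is being computed, not how the database computes it.

```python
# Toy hierarchy: each node has a value and (at most) one parent.
values = {"CEO": 1.0, "VP Sales": 0.8, "Sales Rep": 0.5}
parents = {"CEO": None, "VP Sales": "CEO", "Sales Rep": "VP Sales"}

def hierarchy_product(node):
    """Product of the node's value and the values of all its ancestors."""
    product = 1.0
    while node is not None:
        product *= values[node]
        node = parents[node]
    return product

print(hierarchy_product("Sales Rep"))  # 0.5 * 0.8 * 1.0 = 0.4
```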

Searching The Oracle Data Dictionary

Published on 9 November 2014

Databases and especially data warehouses typically consist of many dozens of tables and views. Good documentation is essential but even the best documentation cannot answer your questions as quickly as you want the information.

Tuning Distributed Queries in Oracle

Published on 7 December 2014

When it comes to SQL statements and optimizing queries on relational databases, probably the first thing developers (ought to) look at is the execution plan. The execution plan shows you what the database engine thinks is the best way to execute a query and it gives estimates of relevant runtime indicators that influenced the optimizer’s decision. When a query involves calls to remote databases you may not always get the best execution (plan) available, because Oracle always runs the query on the local database as it has no way of estimating the cost of network traffic and thus no way of weighing the pros and cons of running your query remotely versus locally. Many tips and tricks have been noted by gurus and of course Oracle, but I was recently asked to tune a query that involved more than the textbook cases typically shown online.

Unit Testing PL/SQL Code?

Published on 21 December 2014

In almost all areas of software development, unit testing is not only common sense but also common practice. After all, hardly any serious software vendor would dare ship applications without having properly tested their functionality. When it comes to databases, many organizations still live in the Dark Ages. With Oracle SQL Developer there is absolutely no reason to remain in the dark: unit testing PL/SQL components is easy, free, and fully integrated into the IDE.

Oracle SQL and PL/SQL Coding Guidelines

Published on 4 January 2015

Coding standards are important because they reduce the cost of maintenance. To enable database developers on the same team to read one another’s code more easily, and to have consistency in the code produced and to be maintained, I have prepared a set of coding conventions for Oracle SQL and PL/SQL. These are by no means the be-all and end-all of Oracle Database standards, and in some instances you may not agree with the conventions I have proposed. That’s why I have created an easy-to-share, easy-to-edit Markdown document with these guidelines, including a snazzy CSS3 style sheet, in my GitHub repository. You can adapt these guidelines for your organization’s needs as you see fit; an attribution would be grand but I won’t sue you if you’re dishonest.

An Overview of PL/SQL Collection Types

Published on 1 March 2015

Collections are core components in Oracle PL/SQL programs. You can (temporarily) store data from the database or local variables in collections and pass these collections to subprograms. Collections are also critical to bulk operations, such as BULK COLLECT and FORALL, as well as table functions, both simple and pipelined. Bulk operations and table functions are critical to high-performance code. Here, I provide an overview of their characteristics as an introduction for novice PL/SQL developers and as a one-stop reference.

Checking Data Type Consistency in Oracle

Published on 29 March 2015

In large databases it can be a challenge to have data type consistency across many tables and views, especially since SQL does not understand PL/SQL’s %TYPE attribute. When designing the overall structure of the tables, tools such as SQL Developer’s Data Modeller can be used to reduce the pain associated with potential data type inconsistencies. However, as databases grow and evolve, data types may diverge and cause headaches when moving data back and forth. Here I present a utility to identify and automatically fix many of these issues.

Connecting to Oracle Database VM From A Host

Published on 26 April 2015

Although the pre-built Oracle Database 12c VMs come with Oracle SQL Developer and APEX, you may not want to leave the host environment and develop in the virtual machine (guest). Sure, you can set up a shared folder and enable bi-directional copy-paste functionality thanks to the so-called Guest Additions, but it’s not the same as working in your own host OS. In this post I describe how you can connect from the host to the guest on which the VM resides with a few simple tweaks. I have also included a simple installation overview of SQL Developer for Ubuntu.

Oracle Date Arithmetic Weirdness

Published on 8 July 2015

Although the date arithmetic in Oracle Database is well documented, it is not always as clear as it could be. In this blog post I want to point out a few common traps with regard to date calculations in Oracle that you should be aware of, especially with regard to intervals.

ETL: A Simple Package to Load Data from Views

Published on 29 February 2016

A common, native way to load data into tables in Oracle is to create a view to load the data from. Depending on how the view is built, you can either refresh (i.e. overwrite) the data in a table or append fresh data to the table. Here, I present a simple package ETL that only requires you to maintain a configuration table and obviously the source views (or tables) and target tables.

product management 

Principles of Product Management

Published on 27 August 2019

Over the years I have worked with some good product managers, heard of a few great ones, and endured others who were less than stellar. From these observations and my own experience, I have distilled a set of principles for product managers, a collection of dos and don’ts.

Product Prioritization with ICE·T

Published on 27 July 2021

ICE·T is a product prioritization model that works well for high-level prioritization and applies to internal products and platforms, too.

The 70 + 3×10 Model

Published on 2 August 2021

The 70 + 3×10 model is a simple but effective model for DevOps teams to distribute their time among feature development, maintenance and support, operations, and learning.

The Product Management Opportunity in Quantum Computing

Published on 6 December 2022

Recent surveys by Omdia and Q-CTRL point to an urgent need for product management in quantum computing: vendors and early adopters disagree on the top priorities for near-term use cases.

Research Papers with Logseq

Published on 30 January 2023

Here’s how I organize and annotate research papers in Logseq.

PM, Pope of Nope

Published on 8 February 2023

A good product manager is the Pope of Nope.

productivity 

How to Stay Up to Date with Trends in Tech

Published on 23 January 2017

As technology evolves at a rapid rate, it may sometimes be difficult to keep up. While I am certainly not the world’s leading expert on anything, I thought I’d share how I keep abreast of the latest developments in the industry.

How to Stay Up to Date with Trends in Tech (Revisited)

Published on 22 November 2019

Almost three years ago I wrote about how I keep up with industry trends. My habits have changed quite a bit, so I thought I’d write another post about it, as it can be daunting to find pertinent information in a timely manner that is not overwhelming, especially when you are starting out in your technology career.

How to Boost Your GCP Productivity in Chrome

Published on 9 March 2020

When you work with the Google Cloud Platform, console.cloud.google.com is your home base outside of the Cloud SDK. It requires a lot of clicking, even with pinned products in the side bar. As a developer I prefer the keyboard to the mouse. So, how can you use the keyboard to boost your productivity in Google Chrome?

Business Email Etiquette

Published on 13 March 2020

It’s a familiar situation: you’ve spent the better part of an hour crafting an email with all relevant background information and various suggestions before you hit ‘Send’. A day goes by. Then another. You wonder, “Did no one read it?” The truth is: somebody probably did, but none of the recipients were inclined to go through your magnum opus. Too bad, but not unavoidable. Here is my business email etiquette guide to make it easier for recipients to reciprocate with ‘Reply’.

PKM with Logseq and Readwise

Published on 1 August 2022

Learn more about my personal knowledge management (PKM) setup with Logseq connected to Readwise, which aggregates highlights from Command, Feedly, Kindle, and Pocket.

Thoughts on Writing

Published on 18 February 2023

What is it your reader needs to do, feel, think, decide, or understand? Help your reader to achieve that as quickly as possible, and never exceed 10 seconds to get your main point across.

Tips for Presentations

Published on 19 February 2023

Here are five tips to create great presentations. Or at least ones that do not suck the life out of your audience.

Useful Research Tools

Published on 25 February 2023

I lift the veil on a few tools I rely on when researching various topics.

Running Meetings

Published on 26 February 2023

Meetings are not documents, but the amount of rambling in them is a lot higher. That often stems from a few core mistakes.

quantum computing 

A Brief History of Quantum Programming

Published on 29 July 2022

The era of the quantum computer is nigh. Where are we today and how have we arrived here in terms of software for quantum computers?

What is Quantum Computing?

Published on 29 August 2022

Whereas current computers use bits that are each either zero or one, quantum bits can be both zero and one. This quirky property of quantum physics can be used to our advantage in quantum computers.
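
A tiny NumPy sketch (no quantum SDK required) illustrates the point: applying a Hadamard gate to the state |0> produces an equal superposition, so a measurement returns 0 or 1 with 50% probability each.

```python
import numpy as np

ket0 = np.array([1.0, 0.0])                           # the state |0>
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # Hadamard gate

superposition = H @ ket0                 # (|0> + |1>) / sqrt(2)
probabilities = np.abs(superposition) ** 2  # Born rule: measurement outcomes

print(superposition)   # [0.70710678 0.70710678]
print(probabilities)   # [0.5 0.5]
```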

Learn Quantum Computing

Published on 7 September 2022

No idea where to begin with quantum computing? Here is a curated list of articles, books, videos, courses, and online programmes to get ready for the revolution!

Roads to Quantum Advantage

Published on 8 September 2022

When can we expect quantum computers to be able to solve real problems more efficiently than classical computers?

Near-Term Quantum Use Cases

Published on 19 November 2022

What are possible use cases for quantum computing in the current NISQ era? And who’s already using quantum technologies today?

Quantum Tech in the Global 500

Published on 20 November 2022

What is the state of quantum technologies in the Fortune Global 500? What are the top use cases?

15 Questions about Quantum

Published on 5 December 2022

Fifteen questions (and answers) about quantum computing you always wanted but were afraid to ask.

A Review of the Best Quantum Computing Courses for All Skill Levels

Published on 1 June 2023

No idea where to begin with quantum computing? Here is a review of the various courses I have completed.

round-up 

Weekend Reading Round-Up (Sept. '17)

Published on 15 September 2017

Below I have collected an initial batch of recent research articles and posts on various topics, such as deep learning, graphs, music, and Scala, that may be of interest to readers of Databaseline.

Weekend Reading Round-Up (Oct. '17)

Published on 13 October 2017

In this month’s reading round-up I look at businesses in Africa, category theory and Scala, fancy copy-pasting of code, neuromorphic microchips, machine learning, philosophy, supercomputers, topology, and of course data.

Weekend Reading Round-Up (Nov. '17)

Published on 17 November 2017

How easy is it to deceive a deep neural network? Does the gender of a leader affect team cohesion? Can music be classified by looking at entropy alone?

Weekend Reading Round-Up (Dec. '17)

Published on 21 December 2017

Hot off the press! Articles on deep learning, early-warning signs of critical transitions, recommendation engines, support vector machines, and music.

Scala 

Passing Functions in Scala

Published on 1 June 2018

In Scala there are multiple ways to pass one function to another: by value, by name, and as a function, either anonymous or defined. The differences are subtle, and, if you’re not careful, you may end up with surprising results.

Associativity in Semigroups in Scala: Algebird, Cats, and Scalaz

Published on 9 November 2018

Algebird, Cats, and Scalaz are Scala libraries that provide, among other things, type classes that are based on algebraic constructs, such as groups, rings, and monoids. Contrary to what I had initially believed, perhaps naively, these libraries do not perform compile- (or run-)time checks on the laws governing these algebraic structures, which means it’s possible to create an instance of such a type class that nevertheless violates its laws.

Differences in Conversions of Java Numbers in Scala 2.11, 2.12, and 2.13

Published on 21 March 2019

There are a few subtle changes between Scala 2.11 and 2.12/2.13 when it comes to conversions between Java and Scala types that you may not be aware of: nullable boxed primitives, such as numbers.

Joins in Scio

Published on 15 August 2019

Scio is Spotify’s open-source Scala API for Apache Beam and Google Cloud Dataflow. It’s used by data engineers at Spotify to process many petabytes of data each day. Let’s look at the different joins it supports and how and when to use each.

How to Get Started with Scala

Published on 2 December 2019

Scala is a key language in the data space. While Python is the lingua franca of data science and machine learning, Scala frequently pops up in data engineering and backend systems. It provides a type-safe, functional layer on top of the battle-hardened JVM, which means it benefits from a rich ecosystem that’s available without all of Java’s boilerplate. Scala comes with its own REPL, so it is as easy and fast as Python to experiment with code. But what are the best ways to learn Scala? Here are a few of my suggestions.

Scalafmt with Docker

Published on 4 March 2020

Scalafmt is a popular formatter for Scala. The formatting it produces is not always identical across versions, even with the same configuration file. To ensure all developers on a team format the code in the same way, I’ll show you how to roll your own Scalafmt action container with Docker.

Spark 

Setting up Scala for Spark App Development

Published on 4 January 2016

Apache Spark is a popular framework for distributed computing, both within and without the Hadoop ecosystem. Spark offers interactive shells for Scala as well as Python. Applications can be written in any language for which there is an API: Scala, Python, Java, or R. Since it can be daunting to set up your environment to begin developing applications, I have created a presentation that gets you up and running with Spark, Scala, sbt, and ScalaTest in (almost) no time.

Spark Actions, Laziness, and Caching

Published on 1 April 2016

Timing is important when thinking about what happens when you execute operations on Spark’s RDDs. The documentation on actions is quite clear but it doesn’t hurt to look at a very simple example that may be somewhat counter-intuitive unless you are already familiar with transformations, actions, and laziness.
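
A minimal sketch of the behaviour, here in PySpark (the post itself may use Scala): transformations such as map only build up a lineage, actions such as count trigger the actual computation, and without cache() every action recomputes that lineage from scratch.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("laziness-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)  # nothing runs yet

print(rdd.count())   # first action: the map is executed now
print(rdd.sum())     # second action: the map is executed again

rdd.cache()
print(rdd.count())   # this action materializes the RDD in memory
print(rdd.sum())     # served from the cache, no recomputation of the map
```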

Shell Scripts to Ease Spark Application Development

Published on 25 April 2016

When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is the fully qualified class. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.

Playing with Spark in sbt

Published on 13 January 2017

Unless you have a cluster with Apache Spark installed on it at your disposal, you may want to play a bit with Spark on your own machine. The standard VMs or docker images (e.g. Cloudera, Hortonworks, IBM, MapR, Oracle) do not offer the latest and greatest. If you really want the bleeding edge of Spark, you have to install it locally yourself, roll your own Docker container, or simply use sbt.

Reading JSON Resource Files in Apache Spark

Published on 17 February 2017

If you need to read a JSON file from a resources directory and have these contents available as a basic String or an RDD and/or even Dataset in Spark 2.1 (with Scala 2.11), you’ve come to the right place.

stereoscopy 

A Beginner's Guide to Stereoscopic Photography on a Smartphone: Basics

Published on 9 October 2018

3D and VR may sound like recent inventions, but the former has been around since the mid-1800s. Stereoscopic cards were extremely popular in Paris in the 1860s (e.g. Les Diableries). In a series of posts I’ll focus on how stereoscopic imaging works, how to create stereoscopic images with any camera or smartphone, and what mistakes beginners tend to make and how to avoid them.

A Beginner's Guide to Stereoscopic Photography on a Smartphone: Pitfalls

Published on 9 October 2018

Stereoscopic photography is possible with very simple tools: a viewer and a camera. In this post we’ll take a look at common mistakes beginners make, how to spot them, and how to fix these problems, so your stereo images look great.

A Beginner's Guide to Stereoscopic Photography on a Smartphone: Kúla Bebe

Published on 23 July 2019

While you can take great stereoscopic pictures with a smartphone using sequential shots, as I described in the first and second parts of this three-part series on smartphone stereo photography, sometimes you want to capture a three-dimensional scene in motion. Even that’s possible but you do need additional gear, for instance the Kúla Bebe.

studio 

Learn Audio Engineering

Published on 6 September 2021

What is the best way to learn audio engineering for the (wannabe) home studio owner in 2021?

technology 

Numerical Algorithms: Variational Integrators

Published on 26 October 2014

When I was rummaging through my digital attic I found code that I had worked on a few years ago. It is not related to data or database technologies, but it is interesting in itself, so I thought I’d share it. To calculate integrals in mathematics or solve differential equations in physics or chemistry you often need a computer’s help. Only in very rare cases can you express the integral symbolically. In most instances you have to be satisfied with a numerical solution, which for many problems is perfectly fine.
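
The original code is not reproduced here, but the flavour of a variational (symplectic) integrator is easy to sketch: the Störmer–Verlet/leapfrog scheme below integrates a harmonic oscillator and, unlike a naive Euler step, keeps the energy bounded over long runs.

```python
import numpy as np

def leapfrog(q0, p0, force, dt, steps):
    """Integrate dq/dt = p, dp/dt = force(q) (unit mass) with leapfrog."""
    q, p = q0, p0
    trajectory = []
    for _ in range(steps):
        p_half = p + 0.5 * dt * force(q)   # half kick
        q = q + dt * p_half                # drift
        p = p_half + 0.5 * dt * force(q)   # half kick
        trajectory.append((q, p))
    return np.array(trajectory)

# Harmonic oscillator: V(q) = q^2 / 2, so force(q) = -q.
traj = leapfrog(q0=1.0, p0=0.0, force=lambda q: -q, dt=0.1, steps=1000)
energy = 0.5 * traj[:, 1] ** 2 + 0.5 * traj[:, 0] ** 2
print(f"Energy drift: {energy.max() - energy.min():.2e}")  # stays small (bounded)
```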

Mapping a Value Stream in Neo4j

Published on 16 August 2015

The canonical use cases of a graph database such as Neo4j are social networks. In logistics and manufacturing, networks also arise naturally. In particular, supply chains and value streams spring to mind. They may not be as large as Facebook’s social graph of all its users, but seeing them for the beasts they truly are can be beneficial. In this post I therefore want to talk about how you can model a value stream in Neo4j and how you can extract valuable information from it.
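
As a rough sketch of what such a model might look like (the node label, relationship type, and properties below are assumptions, not the post’s exact schema): process steps become nodes, material flows become relationships carrying lead times, and Cypher can then aggregate along a path, here via the official Neo4j Python driver.

```python
from neo4j import GraphDatabase

# Assumed connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Three process steps connected by material flows with lead times.
    session.run("""
        CREATE (a:Step {name: 'Stamping'}),
               (b:Step {name: 'Welding'}),
               (c:Step {name: 'Assembly'}),
               (a)-[:FEEDS {lead_time_days: 2}]->(b),
               (b)-[:FEEDS {lead_time_days: 3}]->(c)
    """)
    # Total lead time along the path from the first to the last step.
    result = session.run("""
        MATCH path = (:Step {name: 'Stamping'})-[:FEEDS*]->(:Step {name: 'Assembly'})
        RETURN reduce(total = 0, r IN relationships(path) | total + r.lead_time_days)
               AS total_lead_time
    """)
    print(result.single()["total_lead_time"])  # 5

driver.close()
```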

The Fast Track to Julia

Published on 13 September 2015

Julia is a fairly new and promising programming language that is designed for technical computing. Here I present a cheat sheet, or rather cheat page, with the salient features. Interested? Head on over to databaseline.tech/julia.html.

An Overview of Apache Streaming Technologies

Published on 12 March 2016

There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering number of options, often with only minor differences that are neither well documented nor easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.

A Glossary of Some Software Terminology

Published on 8 December 2016

Continuous integration, Docker, Jenkins, Vagrant, DevOps, PaaS, serverless… If you are sometimes confused as to what the latest buzzwords in the tech industry mean, you have come to the right place.

Collecting Table and Column Statistics in MemSQL

Published on 5 February 2017

MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not automatically collect table (i.e. all columns) and column (i.e. range) statistics. These statistics are important to the query optimizer. Here I’ll present a lightweight shell script that collects table and column (i.e. range) statistics based on a configuration file.

2BIG in DevOps

Published on 25 November 2019

Needless complexity in production systems is contrary to best practices and common sense. Yet quite a few developers relish beautifully crafted solutions that, in theory, can cater to everyone’s taste, but in practice serve no one. Stability and predictability are paramount to operations, yet the desire to experiment with technologies and frameworks often outweighs such considerations. To that end I’d like to present a simple principle for DevOps: 2BIG.

Shell Scripts for Virtual Python Environments

Published on 6 March 2020

Python development without virtual environments is a pain; with virtual environments it quickly gets messy. Here is a shell script to make life with Python easier.
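
The script in the post is written in shell; as a rough illustration of what such helpers automate, the Python standard library alone can create a virtual environment and install requirements into it.

```python
import subprocess
import sys
import venv
from pathlib import Path

env_dir = Path(".venv")
venv.create(env_dir, with_pip=True)          # equivalent to: python -m venv .venv

# Use the environment's own pip to install dependencies.
pip = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "pip"
subprocess.run([str(pip), "install", "-r", "requirements.txt"], check=True)
```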

Open-Source Data and Machine Learning Software by the Numbers

Published on 16 March 2020

Let’s take a closer look at the more popular open-source tools for data engineering, machine learning, and container orchestration.