Open-Source Data and Machine Learning Software by the Numbers

Ian Hellström | 16 March 2020 | 3 min read

Let’s take a closer look at the more popular open-source tools for data engineering, machine learning, and container orchestration.

While popularity is by no means an indication of quality, it does show towards which technologies the industry converges. That can inform product management, especially with regard to integrations.

To gauge popularity, I shall look at open-source software available on GitHub, and focus on the number of stars on each repository as well as Google searches in the category ‘software’ over the past five years.

Why Google? It has been the world’s leading search engine for at least ten years straight, so it does capture trends fairly accurately.

Data Engineering

The chart below shows the number of stars on GitHub over time for various data processing frameworks. It has been produced with Star History.

Stars on GitHub for data processing frameworks

It’s obvious that Apache Spark leads the pack on GitHub as well as Google.

Scio is Spotify’s open-source Scala API for Apache Beam, so not a data processing framework. In fact, Apache Beam is itself an SDK, for which you choose different runners (e.g. Spark, Flink, or Google Cloud Dataflow).

Workflow Management

Spotify’s Luigi has been losing ground to Airflow, which was created by Airbnb, and Argo is trending in Google web searches.

Stars on GitHub for workflow management systems for data pipelines

Chronos is a fault-tolerant alternative to cron, which is perhaps not a workflow scheduler in the same sense as Airflow and Luigi. Please note that Pinball is no longer maintained.

Uber’s Cadence and Conductor by Netflix are not included, as these tools are geared towards microservices and not data pipelines.

Machine Learning

TensorFlow rules the machine learning libraries:

Stars on GitHub for machine learning frameworks

When you look on Google Trends, you see that TensorFlow is almost twice as often sought as the next framework: PyTorch. The latter is catching up though.

Storage

For databases of all kinds, DB-Engines is the standard popularity meter. What is interesting to note is that certain trends visible in GitHub are not transparent in DB-Engines’ ranking.

SQL: OLTP

PostgreSQL is the open-source leader of relational OLTP databases on GitHub:

Stars on GitHub for OLTP databases

While DB-Engines lists both OLTP and OLAP RDBMSs in the same category, MySQL is its favourite. In terms of Google searches, MySQL is way ahead of Postgres, so Google and DB-Engines agree on that.

Please observe that some projects are mirrored to GitHub (e.g. PostgreSQL and SQLite), so stars and other statistics may not be perfectly representative of the project’s main activity elsewhere.

SQL: OLAP and SQL Engines

Presto leads in the category of data warehouses and SQL engines:

Stars on GitHub for OLAP databases and SQL engines

Some entries are (distributed) SQL query engines and not relational databases per se. These include Presto, Hive, Drill, and Impala. Hive, for example, is a SQL engine on top of Hadoop.

Combined

When you compare OLTP databases and OLAP engines together, Presto and ClickHouse are ahead of PostgreSQL on GitHub, even though DB-Engines lists these in 26th and 39th position, respectively. Below I list the combined RDBMSs and query engines based on their GitHub stars and contributors (as of this writing). The rank shown is from DB-Engines where available.

Name Stars Contributors Rank
Presto 10,266 340 26
ClickHouse 9,831 404 39
Druid 9,296 331 56
PostgreSQL 6,444 33 4
MySQL 4,887 74 2
MariaDB 3,227 221 8
Hive 3,013 194 10
Kylin 2,570 167
Pinot 2,474 101
Drill 1,349 126 46
Impala 403 127 22
SQLite 298 7

The lack of contributors for SQLite is likely due to the fact that the GitHub repository is a mirror.

Google agrees with DB-Engines, which indicates that the community on GitHub is perhaps not entirely representative of popularity in general.

NoSQL

Key-Value Stores

Both DB-Engines and GitHub agree that Redis is the most popular key-value store:

Stars on GitHub for key-value stores

Aerospike does not have nearly as many stars as DB-Engine’s ranking (6th) suggests.

Document Stores

In the category of document-oriented databases, GitHub has Elasticsearch, RethinkDB, and MongoDB as the top 3, whereas DB-Engines sees MongoDB as the current leader:

Stars on GitHub for document-oriented databases

Note that Elasticsearch is a search engine for documents, which is why DB-Engines has it listed as the most popular search engine rather than a document store.

Wide Column Stores

Cassandra is the queen of wide column stores:

Stars on GitHub for wide column stores

ScyllaDB has been catching up to Cassandra on GitHub although it’s in a stable 8th position according to DB-Engines.

Graph Databases

For graph databases, Neo4j is still ahead, although ArangoDB, a multi-model NoSQL database, is gaining in popularity:

Stars on GitHub for graph databases

Container Orchestration

Undoubtedly, Kubernetes is captain of the container orchestrators.

Stars on GitHub for container orchestrators

Mesos is not a container but rather generic workload orchestrator. Marathon is the direct competition of Kubernetes, but typically runs on Mesos, which is why I have included both.

As can be expected, web searches on Google back up Kubernetes’ supremacy.