For most of human history, music has been composed and performed by amateurs, enslaved individuals, and professionals in direct employment of the nobility. Music as a full-time career option without some form servitude has only been a fairly recent phenomenon. Before we shall dive into the complexities of the music industry and how musicians make money now, let’s go back in time and see how we got here.
Apache Hive does not come with an out-of-the-box way to check tables for duplicate entries or a ready-made method to inspect column contents, such as for instance R’s
In this post I shall show a shell scripts replete with functions to do exactly that: count duplicates and show basic column statistics.
These two functions are ideal when you want to perform a quick sanity check on the data stored in or accessible with Hive.
Interaction with HDFS via the file system shell commands or YARN’s commands is cumbersome. I have collected several helpful functions in a shell script to make life with Hadoop and YARN a tad more bearable. Here I’ll go through the salient bits.
MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not automatically collect table (i.e. all columns) and column (i.e. range) statistics. These statistics are important to the query optimizer. Here I’ll present a lightweight shell script that collects table and column (i.e. range) statistics based on a configuration file.
As technology evolves at a rapid rate, it may sometimes be difficult to keep up. While I am certainly not the world’s leading expert on anything, I thought I’d share how I keep abreast of the latest developments in the industry.
Unless you have a cluster with Apache Spark installed on it at your disposal, you may want to play a bit with Spark on your own machine. The standard VMs or docker images (e.g. Cloudera, Hortonworks, IBM, MapR, Oracle) do not offer the latest and greatest. If you really want the bleeding edge of Spark, you have to install it locally yourself, roll your own Docker container, or simply use sbt.