Collecting Table and Column Statistics in MemSQL

MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not automatically collect table (i.e. all columns) and column (i.e. range) statistics. These statistics are important to the query optimizer. Here I’ll present a lightweight shell script that collects table and column (i.e. range) statistics based on a configuration file.

MemSQL supports two main storage engines: rowstore and columnstore. The difference is that rowstore tables are kept (and typically replicated) in memory, whereas columnstore tables are stored on disk. Before columnstore tables are flushed to disk, a rowstore is used to buffer the data, which is why – without flushing – the MemSQL Ops dashboard often shows that rows are ‘missing’ right after an insert.

Without regular statistics collection, the MemSQL Ops dashboard will also inform you that the ‘[d]ata volume has significantly changed since the last time ANALYZE TABLE was run. Run ANALYZE TABLE on each table to improve query performance and refresh schema.’

All in all, it seems that flushing, optimizing, and analysing are important tasks that ought to be automated as much as possible.

Below is the script you can find on the repository, which is where you’ll also find sample configuration files:

  • memsql.cfg, which contains the server details and the location of the log file that will be maintained by the script;
  • tables.cfg, which contains the tables and possibly columns that require statistics collection. Table statistics are collected for all tables listed. Tables that are followed by a colon and then a comma-separated list of columns will also have range statistics for these columns collected.
#!/bin/bash
source memsql.cfg
readarray -t tabs < tables.cfg

[[ $(find $MEM_LOG -type f -size +$MEM_MAX_LOG_SIZE 2>/dev/null) ]] && rm -f "$MEM_LOG"

function exec_memsql() {
  mysql -u "$MEM_USER" -h "$MEM_SERVER" -P "$MEM_PORT" -e "$1" "$MEM_SCHEMA"
}

function add_log_entry() {
  echo "$(date +"%Y-%m-%d %H:%M:%S.%4N")|$(hostname -f)|$USER|$HOME|$1|$2" >> "$MEM_LOG"
}

function collect_col_stats() {
  add_log_entry "$1" "Row flush started"
  exec_memsql "OPTIMIZE TABLE $1 FLUSH"
  add_log_entry "$1" "Row flush completed"

  add_log_entry "$1" "Table optimization started"
  exec_memsql "OPTIMIZE TABLE $1"
  add_log_entry "$1" "Table optimization completed"

  add_log_entry "$1" "Column statistics collection started"
  exec_memsql "ANALYZE TABLE $1"
  add_log_entry "$1" "Column statistics collection completed"
}

function collect_range_stats() {
  add_log_entry "$1" "Range statistics collection started"
  exec_memsql "ANALYZE TABLE $1 COLUMNS $2 ENABLE"
  add_log_entry "$1" "Range statistics collection completed"
}

function trim() {
  echo "${*// /}"
}

add_log_entry "Script" "Started"

for tab in "${tabs[@]%:*}"; do collect_col_stats "$tab" & done

for entry in "${tabs[@]}"; do
  if [[ "$entry" == *":"* ]]; then
    tab="${entry%:*}"
    cols="$(trim "${entry#*:}")"
    collect_range_stats "$tab" "$cols"
  fi
done

add_log_entry "Script" "Completed"

The meat of the entire script is contained in the two for loops near the end. The first one flushes, optimizes, and analyses tables. The flush and optimization only affect columnstore tables. This for loop runs the operations in parallel because there is no performance issue as the regular (i.e. non-full) optimization runs in the background in multiple threads. The second loop collects range statistics on specified columns. This one runs in sequence to avoid a situation in which many potentially long-running queries hit the database at the same time.

You can either run the script ad hoc or on a schedule with, say, cron. The latter is obviously preferred if MemSQL ingests data continuously or regularly.

Note that the script assumes that all tables in the configuration file live in the same schema. To have a more flexible script that is capable of analysing tables in various schemas, simply remove $MEM_SCHEMA from the exec_memsql function, and extract the schema from the configuration file. Once that is done, you can either use ANALYZE TABLE schema.table or USING schema ANALYZE TABLE table in both for loops.

Notes

MemSQL is wire-compatible with MySQL, which means that migration from MySQL to MemSQL is pretty easy and also that there are no additional drivers: the ones for MySQL will do. There are a few differences though.

A bunch of features from MySQL are not supported in MemSQL, such as stored procedures, user-defined functions, triggers, and foreign keys. Secondary indexes (added with either CREATE INDEX or ALTER TABLE table ADD INDEX) on columnar tables, and cross-database references in views are also not supported.

An annoyance I ran into a while ago is when a query with a window function causes an internal shuffle of a table’s shard partitions: ‘reshuffling a select containing window function’ is out of MemSQL’s reach too, at least as of this writing.

Unfortunately, MemSQL has fewer window (a.k.a. analytic) functions than, say, Oracle Database, IBM DB2, and nowadays even Microsoft SQL Server. Note that MySQL has in fact no window functions at all, although its community fork MariaDB does, even though window functions have been around since the SQL:1999 and SQL:2003. MemSQL does sport the (named) window clause, which is something only PostgreSQL can claim too.

What is cool is that MemSQL comes with Pipelines (formerly known as Streamliner), its own built-in Spark connector, which is used by Pinterest.