Shell Scripts to Ease Life with Hadoop

Interacting with HDFS through the file system shell or with YARN through its command-line tools is cumbersome. I have collected several helpful functions in a shell script to make life with Hadoop and YARN a tad more bearable. Here I’ll go through the salient bits.

Configurations

When I log on to a server for the first time I want to know what the default settings for Hadoop are. The function hconf displays the primary and secondary name nodes, any backup nodes, the block size (in MB), the replication factor and interval, and the maximum number of objects:

function hconf() {
  local priNameNodes=$(hdfs getconf -namenodes 2> /dev/null)
  local secNameNodes=$(hdfs getconf -secondaryNameNodes 2> /dev/null)
  local backupNodes=$(hdfs getconf -backupNodes 2> /dev/null)
  local blockSize=$(hdfs getconf -confKey dfs.block.size 2> /dev/null)
  local replication=$(hdfs getconf -confKey dfs.replication 2> /dev/null)
  local replInterval=$(hdfs getconf -confKey dfs.replication.interval 2> /dev/null)
  local maxObjects=$(hdfs getconf -confKey dfs.max.objects 2> /dev/null)

  # dfs.block.size is reported in bytes, so convert it to MB
  blockSize=$(echo "$blockSize/1024/1024" | bc)

  # a value of zero for dfs.max.objects means there is no limit
  if [ "$maxObjects" = "0" ]; then
    maxObjects="unlimited"
  fi

  echo "Primary namenode(s):      $priNameNodes"
  echo "Secondary namenode(s):    $secNameNodes"
  echo "Backup node(s):           $backupNodes"
  echo "Block size (MB):          $blockSize"
  echo "Replication factor:       $replication"
  echo "Replication interval (s): $replInterval"
  echo "Max. # of objects:        $maxObjects"
}

Sure, the information is available in Ambari or Cloudera Manager in case you are stuck with Hortonworks’ or Cloudera’s distribution, respectively. A single call to hconf, however, reveals all the basic settings in one go, and the function is independent of the distribution.
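On a small test cluster the output looks more or less as follows; all values below are made up, and the exact format of the node addresses depends on how hdfs getconf reports them on your installation:

Primary namenode(s):      nn1.example.com
Secondary namenode(s):    nn2.example.com
Backup node(s):
Block size (MB):          128
Replication factor:       3
Replication interval (s): 3
Max. # of objects:        unlimited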

The hdfs dfs Family

The combo hdfs dfs is a synonym for hadoop fs when you work with HDFS. Since you end up typing a lot of hdfs dfs -XYZ variations, it makes sense to abbreviate that:

function hdp() {
  [ $# -eq 0 ] && echo "$FUNCNAME: at least one argument is required" && return 1

  # prepend the dash to the subcommand and pass the remaining arguments
  # along verbatim (quoted, so paths with spaces survive)
  local cmd="$1"; shift
  hdfs dfs -"$cmd" "$@"
}

This way you can type hdp ls /data instead of hdfs dfs -ls /data. That’s 12 instead of 18 characters. If you are really lazy you can rename the function to h, saving two additional characters.

Am I secretly advertising for HDP, the Hortonworks Data Platform? Nope: hdp are the three consonants in ‘Hadoop’.

But hang on! We can do better:

function hls() {
  local folder="$(__hdfs_folder "$1")"

  hdp ls "$folder"
}

Now we can simply type hls /data, which is only 9 characters. Or, if we set HDP_DATA=/data in aliases.sh and source it, we can just type: hls. Sweet!
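The relevant line in aliases.sh is nothing more exciting than a single variable; /data is merely an example, so point it at whatever your cluster uses:

# default HDFS data directory used by the helper functions
export HDP_DATA=/data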

By the way, what does the wacky __hdfs_folder thingy do? Have a look:

function __hdfs_folder() {
  local dir="${1:-$HDP_DATA}"

  # strip a trailing slash, except for the root directory itself
  if [ "$dir" != "/" ]; then
    dir="${dir%/}"
  fi

  local full_dir=""
  if [ "$dir" != "" ]; then
    # ${dir%${dir#?}} expands to the first character of $dir
    if [ "${dir%${dir#?}}" = "/" ]; then
      full_dir="$dir"               # absolute path: use as-is
    else
      full_dir="$HDP_DATA/$dir"     # relative path: anchor it at $HDP_DATA
    fi
  fi

  echo "$full_dir"
}

What this ancillary function does is create an absolute, normalized HDFS path based on the arguments provided. It’s best explained with a few examples:

Input                Output
/                    /
/some/fancy/dir      /some/fancy/dir
/some/fancy/dir/     /some/fancy/dir
a/relative/dir       $HDP_DATA/a/relative/dir
a/relative/dir/      $HDP_DATA/a/relative/dir
(blank)              $HDP_DATA

It strips the trailing forward slash, and, if the path does not start with a forward slash (i.e. it is not anchored at the root), it treats the path as relative to HDP_DATA, which, among other things, is defined in aliases.sh. This variable holds the standard data directory in HDFS, often /data, but you are free to set it to whatever suits your needs best. Since you often end up looking in the same ‘master’ directory, __hdfs_folder always interprets relative paths with respect to HDP_DATA.

Just in case you're wondering why I'm mixing camel and snake case: the former is for 'public' functions and the latter for 'private' ones. It appears to be an unwritten rule to use leading underscores for auxiliary functions. No such distinction can technically be made since Bash does not have access modifiers, hence the quotes.

Most of the Hadoop-specific functions that require paths as arguments use __hdfs_folder internally. For instance, to create a directory relative to the default data directory (and set both user and group permissions to ‘rwx’), you can use hmkdir some/relative/dir. In a similar vein you can remove said directory (and all its contents) with hrm some/relative/dir.
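The implementation of hmkdir boils down to something like the sketch below; the version in the repository may differ in its details, but the idea is the same:

function hmkdir() {
  [ $# -ne 1 ] && echo "$FUNCNAME: one argument is required" && return 1

  local folder="$(__hdfs_folder "$1")"

  # create the directory (and any missing parents) ...
  hdp mkdir -p "$folder" || return 1
  # ... and grant both user and group full (rwx) permissions
  hdp chmod ug+rwx "$folder"
}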

Since it’s easy to over-eagerly remove folders, hrm actually checks whether you’re trying to remove the default data location, and if that is the case, it stops. So, when using the library function, you’ll never remove all data at once. That does not mean that you can’t remove all folders in separate commands, but it does provide a basic sanity check. A normal hdfs dfs -rm -R $HDP_DATA command does no such thing!
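A sketch of hrm with that sanity check built in might look as follows; again, the actual library code may differ slightly:

function hrm() {
  [ $# -ne 1 ] && echo "$FUNCNAME: one argument is required" && return 1

  local folder="$(__hdfs_folder "$1")"

  # refuse to wipe the default data directory itself (note that an
  # empty argument also resolves to $HDP_DATA)
  if [ "$folder" = "$HDP_DATA" ]; then
    echo "$FUNCNAME: refusing to remove the default data directory $HDP_DATA"
    return 1
  fi

  hdp rm -R "$folder"
}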

Note that not all Linux commands and/or options are available in Hadoop. For instance, generating a directory tree is not as easy as issuing tree. However, the library feels your pain and provides that functionality for you: just type in htree and you’ll see all files and folders in a tree. Similarly, there is a command hdu that gives you information about disk usage per HDFS folder (in a human readable format).
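Both are thin wrappers: hdu hands the work to hdfs dfs -du -h, and htree fakes a tree(1)-like view by indenting a recursive listing. The sketches below capture the gist; the real implementations may be a bit more elaborate:

function htree() {
  local folder="$(__hdfs_folder "$1")"

  # recursively list everything below the folder, keep only the paths,
  # and turn each path component into one level of indentation
  hdp ls -R "$folder" | \
    grep '^[-d]' | \
    awk '{ print $NF }' | \
    sed -e "s|^$folder||" -e 's|[^/]*/|   |g'
}

function hdu() {
  local folder="$(__hdfs_folder "$1")"

  # disk usage per entry, in a human-readable format
  hdp du -h "$folder"
}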

The function hcount lists the number of directories, the number of files, and the total file size for each entry directly below a given directory:

function hcount() {
  local folder="$(__hdfs_folder "$1")"

  # run 'hdfs dfs -count -h' on every entry directly below the folder
  hdfs dfs -ls -C "$folder" | awk '{ system("hdfs dfs -count -h " $1) }'
}

Another one of my favourites is hclear, which removes all empty subdirectories from a directory:

function hclear() {
  [ $# -ne 1 ] && echo "$FUNCNAME: one argument is required" && return 1

  local folder="$(__hdfs_folder "$1")"

  # list all entries directly below the folder, count each one, and
  # remove those that do not contain a single file
  hdp ls -C -d "$folder/*" | \
    awk '{ system("hdfs dfs -count " $1) }' | \
    awk '$2 == 0 { system("hdfs dfs -rm -R " $4) }'
}

The ls command lists the entries directly below the directory, one per line (-C), without descending into them (-d). For each entry a call to hdfs dfs -count is made, which prints the directory count, the file count, the content size, and the path name, in that order. If the second column, i.e. the number of files contained within the entry, is zero, the entire subdirectory is removed.
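For reference, running hdfs dfs -count on an empty and a non-empty directory produces output along these lines (the paths and numbers are made up):

           1            0                  0 /data/logs/empty
           3           12            4731002 /data/logs/busy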

YARN

The YARN and Spark history servers are both useful for a quick glance at the status of jobs managed by YARN. The URLs of these services can be found in mapred-site.xml and spark-defaults.conf, respectively:

grep -n1 mapreduce.jobhistory.webapp.https.address /etc/hadoop/conf/mapred-site.xml
grep spark.yarn.historyServer /etc/spark/conf/spark-defaults.conf

Analyses of the logs are, however, best done via the command line, where you can use the familiar GNU/Linux goodies. For instance, to list the ten most recent jobs submitted to YARN:

# -appTypes filters by application type (e.g. MAPREDUCE) and -appStates
# by state (e.g. SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED);
# sorting works because applicationIds are incremental, and tail keeps
# the last 10 entries
yarn application -list \
  -appTypes SPARK \
  -appStates ALL \
  | sort \
  | tail

This is too long for my taste, so let’s shorten it:

function ylist() {
  local num="${1:-10}"
  local apps="${2:-SPARK}"
  local states="${3:-ALL}"

  yarn application -list \
    -appTypes "${apps^^}" \
    -appStates "${states^^}" \
    | sort \
    | tail -n "$num"
}

All arguments are optional: the first is the number of recent applications to show, the second the application type, and the third the (comma-separated) list of states. Internally, both the application type and the states are converted to upper case.
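A couple of example invocations:

# the ten most recent Spark applications, in any state (all defaults)
ylist

# the last 20 MapReduce jobs that are currently running
ylist 20 mapreduce running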

Similarly, you often end up looking at logs and errors contained within these logs, so we might as well package these two use cases in functions too:

function ylog() {
  [ $# -ne 1 ] && echo "$FUNCNAME: one argument is required" && return 1

  yarn logs -applicationId "$1"
}

function yerr() {
  [ $# -ne 1 ] && echo "$FUNCNAME: one argument is required" && return 1

  yarn logs -applicationId "$1" | grep -n5 ' ERROR '
}

The yerr function prints five lines of context before and after each line that contains ‘ ERROR ’, which is helpful when you have to look through stack traces.
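For example, with an applicationId copied from ylist’s output (the id below is made up):

# the complete aggregated log of a single application
ylog application_1536123456789_0042

# only the ERROR lines, each with five lines of context
yerr application_1536123456789_0042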

As always, you can find the code on GitHub.