# Optimization with TensorFlow

Ian Hellström | 28 August 2019 | 4 min read

TensorFlow is a free, open-source machine learning framework that’s geared towards deep learning. Optimization algorithms are at the heart of artificial neural networks. We can therefore let TensorFlow solve numerical optimization problems.

## TensorFlow and TFX

For end-to-end machine learning (ML) workflows there is TensorFlow eXtended (TFX), which runs on Kubeflow, an ML orchestration framework that leverages Kubernetes. Inside a TFX workflow the following components are available:

- `ExampleGen` ingests and splits the input dataset. As of version 0.13, only CSV files and BigQuery result sets are supported!
- `StatisticsGen` calculates statistics for the dataset.
- `SchemaGen` examines the statistics and creates a data schema.
- `ExampleValidator` looks for anomalies and missing values in the dataset.
- `Transform` performs re-usable feature engineering on the dataset.
- `Trainer` trains the model, for instance with (canned) estimators.
- `Evaluator` performs analysis of the training results.
- `ModelValidator` ensures that the model is ‘good enough’ to be pushed to production.
- `Pusher` deploys the model to a serving infrastructure.

TensorFlow Serving can be used for serving the saved model. TensorFlow Serving is a C++ backend that has built-in support for hot-swapping models in production. For local development, Apache Airflow is used in the current TFX repository set-up, although it’s possible to use any orchestration framework, including Spotify’s open-source Luigi.

The TFX User Guide and the various TensorFlow Dev Summit talks contain plenty of useful information. An overview of TFX and Kubeflow Pipelines is available here.

## Optimization

Only the `Transform` and `Trainer` components include TensorFlow user code, which is what I’ll focus on now, in particular the optimizers.

### Algorithms

TensorFlow comes with a few optimization algorithms.

- The `GradientDescentOptimizer` is the simplest and most intuitive option. For high learning rates, it can easily miss the optimal value, and for low learning rates it is excruciatingly slow. The algorithm is also prone to oscillate between values. Its learning rate is typically set in the range [0.0001, 0.1] and gradually decreased until the algorithm converges. This makes it a very finicky optimizer.
- The next-best alternative is the `MomentumOptimizer`, which has a lot in common with the gradient-descent optimizer, but typically converges more quickly and suffers less from oscillations.
- The `AdagradOptimizer` is an adaptive gradient descent algorithm that allows different variables to converge at different rates.
- The `RMSPropOptimizer` is a gradient descent algorithm with a decaying learning rate.
- Probably the best optimization algorithm in TensorFlow is the `AdamOptimizer`. It’s often recommended as the default and only in exceptional cases should anyone pick another algorithm.
- An algorithm that ‘follows the regularized leader’ is the `FtrlOptimizer`. In each iteration the solution with the least loss over all past rounds (i.e. the ‘leader’) is selected. This is known as ‘follow the leader’ (FTL). The problem with that is that it can easily oscillate between local minima. To stabilize FTL, a regularization term is added to the loss up to the current step, which is referred to as FTRL, ‘follow the regularized leader’.
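To make the difference between the first two optimizers concrete, here is a minimal plain-Python sketch of the textbook gradient-descent and momentum update rules, applied to the first loss function used later in this article. It is not TensorFlow’s implementation: the helper names (`grad_loss1`, `gradient_descent`, `momentum`) and the momentum value `beta=0.5` are assumptions chosen for illustration.

```python
# Textbook update rules in plain Python; an illustrative sketch, not TensorFlow code.


def grad_loss1(x, y):
    """Gradient of 4x^2 + 4y^2 - 4xy - 12x, the first loss in the script."""
    return 8.0 * x - 4.0 * y - 12.0, 8.0 * y - 4.0 * x


def gradient_descent(learning_rate=0.1, steps=100):
    """Step against the gradient with a fixed learning rate, starting at (0, 0)."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad_loss1(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


def momentum(learning_rate=0.1, beta=0.5, steps=100):
    """Same loop, but each step follows a decayed sum of past gradients."""
    x, y = 0.0, 0.0
    vx, vy = 0.0, 0.0  # accumulated gradients
    for _ in range(steps):
        gx, gy = grad_loss1(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= learning_rate * vx
        y -= learning_rate * vy
    return x, y


if __name__ == "__main__":
    print("gradient descent:", gradient_descent())
    print("momentum:        ", momentum())
```

Both loops approach (2, 1) for a learning rate of 0.1. Calling `gradient_descent(learning_rate=0.2)` instead diverges for this loss, because the largest curvature of the function is 12 and each step then multiplies the error along that direction by 1 - 0.2 × 12 = -1.4. That is the ‘missing the optimal value’ behaviour described above.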

There are also `Proximal*` versions of both the Adagrad and gradient-descent optimizers.
These methods can solve *non-differentiable* convex optimization problems.
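The general idea behind proximal methods can be sketched in a few lines of plain Python. The example below minimizes 0.5 (x - target)² + l1 |x|, which is not differentiable at x = 0, by alternating a gradient step on the smooth part with a soft-thresholding (proximal) step for the absolute value. The function names and the toy problem are assumptions for illustration and do not mirror TensorFlow’s `Proximal*` classes.

```python
# Proximal gradient descent on a one-dimensional toy problem (illustrative sketch).


def soft_threshold(value, threshold):
    """Proximal operator of threshold * |x|: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def proximal_gradient(target, l1, start=0.0, learning_rate=0.5, steps=50):
    """Minimize 0.5 * (x - target)**2 + l1 * abs(x)."""
    x = start
    for _ in range(steps):
        gradient = x - target  # derivative of the smooth part only
        x = soft_threshold(x - learning_rate * gradient, learning_rate * l1)
    return x


if __name__ == "__main__":
    print(proximal_gradient(target=3.0, l1=1.0))             # minimum away from the kink
    print(proximal_gradient(target=0.5, l1=1.0, start=5.0))  # minimum exactly at the kink
```

The exact minimizer of this toy problem is `soft_threshold(target, l1)`, so the first call approaches 2 and the second lands on 0, the point where the objective has no derivative.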

Internally, these optimization algorithms use automatic differentiation to obtain accurate gradients at different points of interest.
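As a rough illustration of what automatic differentiation means, the sketch below differentiates the second loss function of this article with dual numbers. Note the swap: this is forward-mode differentiation, chosen because it fits in a few lines, whereas TensorFlow computes gradients in reverse mode (backpropagation) over its graph. The `Dual` class and the helper functions are hypothetical and not part of any TensorFlow API.

```python
import math


class Dual:
    """A number value + deriv * eps with eps**2 == 0; deriv carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (derivative zero).
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(d):
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)


def exp(d):
    e = math.exp(d.value)
    return Dual(e, e * d.deriv)


def loss2(x):
    """sin(x) * exp(-0.01 x^2), the second loss in the script."""
    return sin(x) * exp(-0.01 * x * x)


def derivative(f, x):
    """Evaluate f at x + eps and read off the coefficient of eps."""
    return f(Dual(x, 1.0)).deriv


if __name__ == "__main__":
    print(derivative(loss2, 0.0))
    print(derivative(loss2, 1.540005942))
```

Unlike finite differences, the derivative is exact up to floating-point rounding, because every elementary operation propagates its own derivative rule.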

### Set-up

As long as you have Python 3.4 or newer, you can get started with TensorFlow on your machine by:

- installing it directly on your machine (e.g. with `pip`); or
- using a virtual environment; or
- using a Docker image.

The third option is probably the easiest and does not even require Python on the host:

```
docker pull tensorflow/tensorflow:latest-py3
docker run -it --rm -p 6006:6006 -v $(pwd):/tmp -w /tmp \
tensorflow/tensorflow:latest-py3 python ./script.py
```

Here, `script.py` is a Python/TensorFlow script you wish to execute.
You can also run an interactive Python session:

```
docker run -it --rm -p 6006:6006 -w /tmp \
tensorflow/tensorflow:latest-py3 python
```

Since Python 2.7 reaches its end of life in 2020, there is little reason to not use Python 3, hence the tag.

If you are not familiar with the Docker CLI, let’s break down the entire command:

- `docker run tensorflow/tensorflow:latest-py3 python` runs `python` in a container created from the specified image;
- `-it` keeps standard input open and allocates a pseudo-terminal, which makes the session interactive;
- `--rm` removes the container after it exits;
- `-p 6006:6006` publishes port 6006 of the container as port 6006 on the host, so we can use `localhost:6006` in any browser to see TensorBoard;
- `-w /tmp` sets the work directory to `/tmp`.

### Problems

For the various (unconstrained) optimization problems I’ll use the ones discussed in an introduction to genetic algorithms in optimization.
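The objective (or loss) functions are defined in the `losses` dictionary of the script below.

The first has a global minimum at (x, y) = (2, 1), and the second has a global maximum at approximately x = 1.540005942. The second function is odd, so its global minimum sits at approximately x = -1.540005942, which is the optimum that matters when the script minimizes it.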
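Written out from the `losses` dictionary in the script, the two objectives and the conditions for their stationary points read as follows. This is a reconstruction from the code, so the symbols $f_1$ and $f_2$ are my own labels:

$$
f_1(x, y) = 4x^2 + 4y^2 - 4xy - 12x, \qquad f_2(x) = \sin(x)\, e^{-0.01 x^2}.
$$

For the first function, setting the gradient to zero gives

$$
\nabla f_1 = \begin{pmatrix} 8x - 4y - 12 \\ 8y - 4x \end{pmatrix} = 0
\;\Longrightarrow\; x = 2y, \quad 12y = 12
\;\Longrightarrow\; (x, y) = (2, 1).
$$

The Hessian $\begin{pmatrix} 8 & -4 \\ -4 & 8 \end{pmatrix}$ has eigenvalues 4 and 12, so it is positive definite and the stationary point is the unique global minimum, with $f_1(2, 1) = -12$.

For the second function,

$$
f_2'(x) = e^{-0.01 x^2} \left( \cos x - 0.02\, x \sin x \right) = 0
\;\Longrightarrow\; \cot x = 0.02\, x,
$$

whose smallest positive root is $x \approx 1.540$.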

```
import os
import sys
import argparse

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

FLAGS = None


def main(_):
    learning_rate = FLAGS.learning_rate
    epochs = FLAGS.epochs

    x_var = tf.Variable(0.0, name='x_opt')
    y_var = tf.Variable(0.0, name='y_opt')
    step_var = tf.Variable(0, trainable=False)

    losses = {1: 4.0 * x_var * x_var + 4.0 * y_var * y_var - 4.0 * x_var * y_var - 12.0 * x_var,
              2: tf.math.sin(x_var) * tf.math.exp(- 0.01 * x_var * x_var)}

    optimizers = {'gradientdescent': tf.train.GradientDescentOptimizer(learning_rate),
                  'momentum': tf.train.MomentumOptimizer(learning_rate, FLAGS.momentum),
                  'adagrad': tf.train.AdagradOptimizer(learning_rate),
                  'adam': tf.train.AdamOptimizer(learning_rate),
                  'ftrl': tf.train.FtrlOptimizer(learning_rate),
                  'rmsprop': tf.train.RMSPropOptimizer(learning_rate)}

    loss = losses[FLAGS.loss]
    optimizer = optimizers[FLAGS.optimizer.lower()]

    # Objective function must be defined outside of tf.Session() for all stateful optimizers
    # (i.e. all but GradientDescent).
    # See: https://github.com/tensorflow/tensorflow/issues/8057
    objective = optimizer.minimize(loss, global_step=step_var)
    optimizer_name = type(optimizer).__name__

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

    (x_op, y_op) = (tf.summary.scalar('x', x_var), tf.summary.scalar('y', y_var))
    summary_op = tf.summary.merge([x_op, y_op])

    log_file = '{}/log_optimizer={}_loss={}_learning_rate={}_epochs={}'.format(FLAGS.log,
                                                                               optimizer_name.lower(),
                                                                               FLAGS.loss,
                                                                               learning_rate,
                                                                               epochs)
    writer = tf.summary.FileWriter(log_file, graph=tf.get_default_graph())

    with tf.Session() as sess:
        sess.run(init)

        for epoch in range(epochs):
            _, step, x_res, y_res, summary = sess.run(
                [objective, step_var, x_var, y_var, summary_op])
            tf.logging.info('%s - epoch %d / step %d: estimated optimum = (%f, %f)' % (
                optimizer_name, epoch, step, x_res, y_res))
            writer.add_summary(summary, global_step=step)
            writer.flush()

        saver.save(sess, os.getcwd() + '/output')

        x_opt = sess.run(x_var)
        y_opt = sess.run(y_var)
        tf.logging.info('%s - computed optimum = (%f, %f)' % (optimizer_name, x_opt, y_opt))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(prog='TensorFlow optimizer demonstration',
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        '--optimizer',
        type=str,
        help='Optimizer to be used: GradientDescent, Momentum, Adagrad, Adam, Ftrl, or RMSProp',
        default='GradientDescent'
    )
    parser.add_argument(
        '--loss',
        type=int,
        help='Loss function to be used: 1 or 2',
        default=1
    )
    parser.add_argument(
        '--log',
        type=str,
        help='Location of the logs',
        default='/tmp/tf'
    )
    parser.add_argument(
        '--learning-rate',
        type=float,
        help='Learning rate for optimizer',
        default=0.1
    )
    parser.add_argument(
        '--momentum',
        type=float,
        help='Momentum for the momentum optimizer',
        default=0.01
    )
    parser.add_argument(
        '--epochs',
        type=int,
        help='Number of epochs/iterations',
        default=100
    )

    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
```

You execute the script with the command

```
docker run -it --rm -p 6006:6006 -v $(pwd):/tmp -w /tmp \
tensorflow/tensorflow:latest-py3 python ./optimize.py \
--optimizer Adagrad \
--loss 1 \
--learning-rate 0.01 \
--epochs 50
```

This presumes the Python script is called `optimize.py` and it is located in the present working directory on the host.
It runs the script for 50 epochs using the `AdagradOptimizer` with a learning rate of 0.01 for the first loss function.
The optimizer names are case insensitive.

You can also supply `-h` or `--help` to obtain information on the accepted arguments.

### TensorBoard

The code writes its log files to `/tmp/tf`, that is, to the `tf` directory under the work directory.
Inside an interactive session on the container you can execute `tensorboard --logdir /tmp/tf` to open TensorBoard and visualize the progress.
On the host machine (i.e. the computer on which you’ve executed `docker run`), navigate to `localhost:6006` to see TensorBoard.

To run all optimizers, let’s create a simple script `optimize.sh`:

```
#!/bin/bash
rm -rf tf
OPTIMIZERS=("GradientDescent" "Momentum" "Adagrad" "Adam" "FTRL" "RMSProp")
LOSSES=(1 2)
for optimizer in "${OPTIMIZERS[@]}"; do
  for loss in "${LOSSES[@]}"; do
    python optimize.py --optimizer "$optimizer" --loss "$loss"
  done
done
tensorboard --logdir /tmp/tf
```

We run it with:

```
docker run -it --rm -p 6006:6006 -v $(pwd):/tmp -w /tmp \
tensorflow/tensorflow:latest-py3 ./optimize.sh
```

After successful completion you can navigate to `localhost:6006` and you should see TensorBoard.

By default, it shows the values for `x` and `y` for each epoch for all optimizers and loss functions.
You can use a regex `loss=1` or `loss=2` in the ‘Runs’ menu on the left (not shown here) to filter only for a single loss function.
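The filter works because every run directory carries the loss number in its name. The sketch below rebuilds the run names with the same format string as the script’s `log_file` and applies the regex with Python’s `re.search`. It assumes that TensorBoard matches the pattern anywhere in the run name; `run_name` and `filter_runs` are hypothetical helpers, not TensorBoard functions.

```python
import re

# Class names of the six optimizers in the script; the script lower-cases them
# when it builds the log directory name.
OPTIMIZER_NAMES = [
    'GradientDescentOptimizer',
    'MomentumOptimizer',
    'AdagradOptimizer',
    'AdamOptimizer',
    'FtrlOptimizer',
    'RMSPropOptimizer',
]


def run_name(optimizer_name, loss, learning_rate=0.1, epochs=100):
    """Directory name below the log location, built like log_file in the script."""
    return 'log_optimizer={}_loss={}_learning_rate={}_epochs={}'.format(
        optimizer_name.lower(), loss, learning_rate, epochs)


def filter_runs(runs, pattern):
    """Keep the runs whose name contains a match for the regular expression."""
    regex = re.compile(pattern)
    return [run for run in runs if regex.search(run)]


# The twelve runs produced by optimize.sh with the default arguments.
RUNS = [run_name(name, loss) for name in OPTIMIZER_NAMES for loss in (1, 2)]

if __name__ == "__main__":
    for run in filter_runs(RUNS, 'loss=1'):
        print(run)
```

A pattern such as `adam.*loss=2` narrows the selection further to a single optimizer and loss function.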

For the first loss function we see that Adagrad and FTRL have not converged after 100 iterations for the given (default) learning rate. For the second loss function, only Adam and RMSProp have approximated the optimum within 100 epochs.