Use GNU Parallel for Regression Testing
=======================================

I get a sinking feeling in my stomach when I look at non-parallelized test automation lacking regression analysis. The former can be hard, but the latter isn’t. I’ll show how both are achievable with some simple scripts and the help of GNU parallel, which has recently become one of my favorite tools.

Regression testing is different from normal testing in that a prior result gets compared with a later result to see if a particular change fixed or broke something. Usually you want to limit the differences to one change at a time so that you know that it was precisely that change which caused the new passes or failures. If you change, for example, the versions of two components, you wouldn’t know which component’s update caused the change in test results.
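
The comparison itself is mechanical: diff the exit codes of the two runs, keyed by test name. A toy sketch in bash (the test names and exit codes here are made up):

```shell
# Toy data: exit codes from an "old" and a "new" run, keyed by test name.
declare -A old=( [test_a]=0 [test_b]=0 [test_c]=1 )
declare -A new=( [test_a]=0 [test_b]=1 [test_c]=0 )

regressed=()
fixed=()
for t in "${!old[@]}"; do
  # Passed before, fails now: a regression.
  if [[ ${old[$t]} -eq 0 && ${new[$t]} -ne 0 ]]; then regressed+=("$t"); fi
  # Failed before, passes now: a fix.
  if [[ ${old[$t]} -ne 0 && ${new[$t]} -eq 0 ]]; then fixed+=("$t"); fi
done
echo "regressed: ${regressed[*]}"
echo "fixed: ${fixed[*]}"
```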

Parallelized testing simply refers to running more than one test at a time so that they complete faster. For tests which do not place heavy load on your system, this can be easy to manage, but I rarely work with tests like this. Working in High-Performance Computing means dealing with workloads and tests that are designed to stress your hardware, which makes parallelizing them without crashing your systems difficult.

Setting up our workloads
------------------------

For demonstration purposes, we’ll use the example workloads from Intel’s oneAPI Deep Neural Network Library (oneDNN).

Dockerfile

# syntax=docker/dockerfile:latest

ARG ONEAPI_VERSION=2023.0.0

FROM intel/oneapi-hpckit:${ONEAPI_VERSION}-devel-ubuntu22.04 AS oneapi-hpckit

FROM ubuntu:jammy AS base
SHELL ["/bin/bash", "-e", "-c"]
ENV CCACHE_DIR=/cache/ccache

FROM base AS toolchain
ENV DEBIAN_FRONTEND=noninteractive
RUN --mount=type=cache,target=/var/cache/apt <<EOF
rm /etc/apt/apt.conf.d/docker-clean
apt-get update
apt-get install -y \
  build-essential \
  ccache \
  cmake \
  git \
  ninja-build
EOF

FROM toolchain AS build-onednn
COPY --link --from=oneapi-hpckit /opt/intel/oneapi/compiler /opt/intel/oneapi/compiler
COPY --link --from=oneapi-hpckit /opt/intel/oneapi/tbb /opt/intel/oneapi/tbb
COPY --link --from=oneapi-hpckit /opt/intel/oneapi/setvars.sh /opt/intel/oneapi/setvars.sh
WORKDIR /onednn-git
ENV CCACHE_BASEDIR=/onednn-git
ENV CC=icx
ENV CXX=icpx
RUN --mount=type=cache,target=/cache/ccache <<EOF
git clone https://github.com/oneapi-src/oneDNN.git .
git checkout v3.2
source /opt/intel/oneapi/setvars.sh
mkdir build
cd build
cmake \
  -D DNNL_CPU_RUNTIME=DPCPP \
  -D DNNL_GPU_RUNTIME=DPCPP \
  -D CMAKE_ASM_COMPILER=icx \
  -D CMAKE_C_COMPILER_LAUNCHER=ccache \
  -D CMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -D BUILD_SHARED_LIBS=0 \
  -G Ninja \
  ..
ninja -j$(nproc)
EOF

FROM toolchain AS build-parallel
ADD https://ftp.gnu.org/gnu/parallel/parallel-20230422.tar.bz2 .
RUN <<EOF
tar xf parallel-20230422.tar.bz2
cd parallel-20230422
./configure --prefix=$PWD/install
make
make install
EOF

FROM base AS onednn
COPY --link --from=build-parallel parallel-20230422/install/bin/* /usr/local/bin/
COPY --link --from=oneapi-hpckit /opt/intel/oneapi/compiler /opt/intel/oneapi/compiler
COPY --link --from=oneapi-hpckit /opt/intel/oneapi/tbb /opt/intel/oneapi/tbb
COPY --link --from=oneapi-hpckit /opt/intel/oneapi/setvars.sh /opt/intel/oneapi/setvars.sh
COPY --link --from=build-onednn /onednn-git/build/src/libdnnl.so* /usr/local/lib/
COPY --link --from=build-onednn /onednn-git/build/examples /workloads
ENV LD_LIBRARY_PATH=/usr/local/lib
WORKDIR /workloads

Build this with:

docker buildx build -t localhost/onednn-examples - < Dockerfile

This produces a container image with the oneDNN example workloads. We can execute one of them like this:

docker run --rm -i localhost/onednn-examples bash -e <<EOF
source /opt/intel/oneapi/setvars.sh
./getting-started-cpp
EOF

After a few seconds, this should print the following:

:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: compiler -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
 
Example passed on CPU.

Running in parallel
-------------------

GNU parallel allows us to run the workloads concurrently, taking better advantage of a system’s compute resources.

docker run \
  --rm \
  --interactive \
  $(if (! docker info 2>/dev/null | grep -q rootless); then echo \
    --user=$(id -u):$(id -g); fi) \
  -v $PWD:$PWD \
  -e LOG_FILE=$PWD/results.tsv \
  localhost/onednn-examples:latest \
  bash -e <<EOF
source /opt/intel/oneapi/setvars.sh
memtotal=\$(grep MemTotal /proc/meminfo | awk '{print \$2}')
echo -e "JobRuntime\tExitval\tCommand" > $PWD/results.tsv
parallel \
  --will-cite \
  --color \
  --tag \
  --line-buffer \
  --retries 0 \
  --delay 1 \
  --jobs '50%' \
  --memfree \$((memtotal / 4 * 3))K <<EndOfJobs
$PWD/ectime ./bnorm-u8-via-binary-postops-cpp
$PWD/ectime ./cnn-inference-f32-cpp
$PWD/ectime ./cnn-inference-int8-cpp
$PWD/ectime ./cnn-training-bf16-cpp
$PWD/ectime ./cnn-training-f32-cpp
$PWD/ectime ./cross-engine-reorder-cpp
$PWD/ectime ./getting-started-cpp
$PWD/ectime ./matmul-perf-cpp
$PWD/ectime ./memory-format-propagation-cpp
$PWD/ectime ./performance-profiling-cpp
$PWD/ectime ./primitives-augru-cpp
$PWD/ectime ./primitives-batch-normalization-cpp
$PWD/ectime ./primitives-binary-cpp
$PWD/ectime ./primitives-concat-cpp
$PWD/ectime ./primitives-convolution-cpp
$PWD/ectime ./primitives-eltwise-cpp
$PWD/ectime ./primitives-inner-product-cpp
$PWD/ectime ./primitives-layer-normalization-cpp
$PWD/ectime ./primitives-lrn-cpp
$PWD/ectime ./primitives-lstm-cpp
$PWD/ectime ./primitives-matmul-cpp
$PWD/ectime ./primitives-pooling-cpp
$PWD/ectime ./primitives-prelu-cpp
$PWD/ectime ./primitives-reduction-cpp
$PWD/ectime ./primitives-reorder-cpp
$PWD/ectime ./primitives-resampling-cpp
$PWD/ectime ./primitives-shuffle-cpp
$PWD/ectime ./primitives-softmax-cpp
$PWD/ectime ./primitives-sum-cpp
$PWD/ectime ./rnn-training-f32-cpp
$PWD/ectime ./sycl-interop-buffer-cpp
$PWD/ectime ./sycl-interop-usm-cpp
$PWD/ectime ./tutorials-matmul-inference-int8-matmul-cpp
EndOfJobs
EOF

The parallel command is the real magic here.

--color, --tag, and --line-buffer control the presentation of each job’s output: every job gets a unique background color, each output line is prefixed with the name of the job (the workload command), and lines from different jobs may interleave when one program has output ready before another finishes, without colliding with each other mid-line.

--jobs '50%' limits the number of concurrently running jobs to 50% of the system’s CPU threads.

--delay 1 helps deal with memory usage spikes by waiting one second between job launches, giving parallel sufficient time to detect climbing memory usage and kill jobs before memory is exhausted.

--memfree limits the number of running jobs based on free memory. With memtotal / 4 * 3, new jobs are not started unless at least 75% of the system’s total memory is free, and if free memory drops below half of that value (37.5% of total), the youngest job is killed and put back on the queue to be run later. I have found this to be a good threshold for the types of workloads I run, which can have unpredictably large memory spikes. --retries 0 instructs GNU parallel to requeue each job as often as necessary for it to eventually succeed.
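
To make the thresholds concrete, here is a small sketch using a hypothetical machine with 16 GiB of memory (a real run would read MemTotal from /proc/meminfo, as above):

```shell
# Hypothetical MemTotal of 16 GiB, in kB as /proc/meminfo reports it.
memtotal=16777216

# The value passed to --memfree: only start jobs while 75% of memory is free.
memfree_limit=$((memtotal / 4 * 3))

# GNU parallel kills the youngest job when free memory falls below 50%
# of the --memfree value, i.e. 37.5% of total memory here.
kill_threshold=$((memfree_limit / 2))

echo "--memfree ${memfree_limit}K (kill jobs below ${kill_threshold}K free)"
```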

Using --retries 0 has the consequence that we cannot run each workload directly: a genuinely failing job would be retried indefinitely, just like one requeued for memory reasons. That is undesirable for regression testing, where we want to record which jobs fail rather than retry them until they succeed. That is why each job is prefixed with $PWD/ectime, a small wrapper script which records the exit code (“ec”) and runtime (“time”) of each job into $LOG_FILE. It exits with status 0 regardless of whether the command it ran failed.

ectime

#! /usr/bin/env bash

# Run a command, then append its runtime and exit code to $LOG_FILE.
# The wrapper itself always exits 0 so GNU parallel never retries the job.
LOG_FILE=${LOG_FILE:-$PWD/log.tsv}

start_time=$(date +%s)
"$@"
exit_code=$?
end_time=$(date +%s)
duration=$((end_time - start_time))

echo -e "${duration}\t${exit_code}\t$*" >> "${LOG_FILE}"
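
As a quick, self-contained sanity check of this wrapper, the sketch below copies the script into a temporary directory (the paths are illustrative) and runs it over one passing and one failing command; each appends a row, and ectime itself exits 0 both times:

```shell
# Write the ectime wrapper into a temp dir and exercise it.
dir=$(mktemp -d)
cat > "$dir/ectime" <<'SCRIPT'
#! /usr/bin/env bash
LOG_FILE=${LOG_FILE:-$PWD/log.tsv}
start_time=$(date +%s)
"$@"
exit_code=$?
end_time=$(date +%s)
duration=$((end_time - start_time))
echo -e "${duration}\t${exit_code}\t$*" >> "${LOG_FILE}"
SCRIPT
chmod +x "$dir/ectime"

export LOG_FILE="$dir/log.tsv"
"$dir/ectime" true   # logs exit code 0
"$dir/ectime" false  # logs exit code 1, but the wrapper still exits 0
cat "$LOG_FILE"
```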

Running this produces a tab-delimited results file with the runtime and exit value of each workload. In this case, they all passed.

results.tsv

JobRuntime  Exitval  Command
1           0        ./bnorm-u8-via-binary-postops-cpp
0           0        ./cnn-inference-int8-cpp
0           0        ./cnn-training-bf16-cpp
4           0        ./cnn-inference-f32-cpp
0           0        ./cross-engine-reorder-cpp
1           0        ./cnn-training-f32-cpp
0           0        ./getting-started-cpp
1           0        ./memory-format-propagation-cpp
0           0        ./primitives-augru-cpp
2           0        ./performance-profiling-cpp
0           0        ./primitives-batch-normalization-cpp
0           0        ./primitives-binary-cpp
0           0        ./primitives-concat-cpp
0           0        ./primitives-convolution-cpp
0           0        ./primitives-eltwise-cpp
1           0        ./primitives-inner-product-cpp
0           0        ./primitives-layer-normalization-cpp
0           0        ./primitives-lrn-cpp
0           0        ./primitives-lstm-cpp
0           0        ./primitives-matmul-cpp
0           0        ./primitives-pooling-cpp
0           0        ./primitives-prelu-cpp
0           0        ./primitives-reduction-cpp
17          0        ./matmul-perf-cpp
0           0        ./primitives-reorder-cpp
0           0        ./primitives-resampling-cpp
0           0        ./primitives-shuffle-cpp
0           0        ./primitives-softmax-cpp
0           0        ./primitives-sum-cpp
0           0        ./rnn-training-f32-cpp
0           0        ./sycl-interop-buffer-cpp
0           0        ./sycl-interop-usm-cpp
0           0        ./tutorials-matmul-inference-int8-matmul-cpp

Regression analysis
-------------------

To demonstrate the regression analysis tool, let’s invent a second run and edit some of the results so that those workloads appear to have failed.

results2.tsv

JobRuntime  Exitval  Command
1           0        ./bnorm-u8-via-binary-postops-cpp
0           0        ./cnn-inference-int8-cpp
0           0        ./cnn-training-bf16-cpp
4           0        ./cnn-inference-f32-cpp
0           0        ./cross-engine-reorder-cpp
1           0        ./cnn-training-f32-cpp
0           0        ./getting-started-cpp
1           0        ./memory-format-propagation-cpp
0           0        ./primitives-augru-cpp
2           0        ./performance-profiling-cpp
0           0        ./primitives-batch-normalization-cpp
0           1        ./primitives-binary-cpp
0           0        ./primitives-concat-cpp
0           0        ./primitives-convolution-cpp
0           0        ./primitives-eltwise-cpp
1           0        ./primitives-inner-product-cpp
0           0        ./primitives-layer-normalization-cpp
0           0        ./primitives-lrn-cpp
0           0        ./primitives-lstm-cpp
0           1        ./primitives-matmul-cpp
0           1        ./primitives-pooling-cpp
0           0        ./primitives-prelu-cpp
0           0        ./primitives-reduction-cpp
17          0        ./matmul-perf-cpp
0           0        ./primitives-reorder-cpp
0           0        ./primitives-resampling-cpp
0           0        ./primitives-shuffle-cpp
0           0        ./primitives-softmax-cpp
0           0        ./primitives-sum-cpp
0           0        ./rnn-training-f32-cpp
0           0        ./sycl-interop-buffer-cpp
0           0        ./sycl-interop-usm-cpp
0           0        ./tutorials-matmul-inference-int8-matmul-cpp

Using a simple Python script, we can print a colorful regression analysis table showing both the “old” and the “new” exit codes for each workload.

compare.py

import csv
import argparse
from termcolor import cprint

parser = argparse.ArgumentParser(prog='compare.py')
parser.add_argument('old_log', type=argparse.FileType('r'))
parser.add_argument('new_log', type=argparse.FileType('r'))
args = parser.parse_args()

def read_log(log):
    reader = csv.DictReader(log, delimiter='\t')
    return {row['Command']: int(row['Exitval']) for row in reader}

old = read_log(args.old_log)
new = read_log(args.new_log)

def print_result(command, old, new):
    if old == 0 and (new == 0 or new is None):
        printer = print                 # still passing: default colors
    elif old == 0 and new != 0:
        printer = lambda x: cprint(x, 'black', 'on_red', attrs=['bold'])    # newly failing
    elif old != 0 and new == 0:
        printer = lambda x: cprint(x, 'black', 'on_green', attrs=['bold'])  # newly passing
    else:
        printer = lambda x: cprint(x, 'red', attrs=['bold'])                # still failing
    if old is None:
        old = ""
    if new is None:
        new = ""
    printer(f'{old:>3} {new:>3} {command}')

def print_table(old, new):
    print('Old New Command')
    longest = max(len(command) for command in old.keys() | new.keys())
    print('-' * (15 + max(0, longest - 7)))
    for command in sorted(old.keys() | new.keys()):
        print_result(command, old.get(command), new.get(command))

print_table(old, new)

Run this with:

python3 compare.py results.tsv results2.tsv

This renders the following table, with still-failing workloads in red, still-passing workloads in the default colors, newly failing workloads on a red background, and newly passing workloads on a green background.

Old New Command
----------------------------------------------------
  0   0 ./bnorm-u8-via-binary-postops-cpp
  0   0 ./cnn-inference-int8-cpp
  0   0 ./cnn-training-bf16-cpp
  0   0 ./cnn-inference-f32-cpp
  0   0 ./cross-engine-reorder-cpp
  0   0 ./cnn-training-f32-cpp
  0   0 ./getting-started-cpp
  0   0 ./memory-format-propagation-cpp
  0   0 ./primitives-augru-cpp
  0   0 ./performance-profiling-cpp
  0   0 ./primitives-batch-normalization-cpp
  0   1 ./primitives-binary-cpp
  0   0 ./primitives-concat-cpp
  0   0 ./primitives-convolution-cpp
  0   0 ./primitives-eltwise-cpp
  0   0 ./primitives-inner-product-cpp
  0   0 ./primitives-layer-normalization-cpp
  0   0 ./primitives-lrn-cpp
  0   0 ./primitives-lstm-cpp
  0   1 ./primitives-matmul-cpp
  0   1 ./primitives-pooling-cpp
  0   0 ./primitives-prelu-cpp
  0   0 ./primitives-reduction-cpp
  0   0 ./matmul-perf-cpp
  0   0 ./primitives-reorder-cpp
  0   0 ./primitives-resampling-cpp
  0   0 ./primitives-shuffle-cpp
  0   0 ./primitives-softmax-cpp
  0   0 ./primitives-sum-cpp
  0   0 ./rnn-training-f32-cpp
  0   0 ./sycl-interop-buffer-cpp
  0   0 ./sycl-interop-usm-cpp
  0   0 ./tutorials-matmul-inference-int8-matmul-cpp

This achieves our original objective of performing a regression analysis on results from tests that were run in parallel.

· gnu-parallel, testing