You're Not Crazy: Dockerfile Build Caching for ADD/COPY is Broken

Consider this workflow. You have created a git repository on your local machine containing a Dockerfile and a file it copies into the resulting image.

Dockerfile

# syntax=docker/dockerfile:1.5
FROM debian:bullseye
ADD cat-me cat-me
RUN cat cat-me

cat-me

mow!

You also add a simple GitHub Action that builds and pushes a cache of this image to a registry, so that developers checking out the repository and running the build do not need to waste time rebuilding parts that have already been built.

.github/workflows/build.yml

on:
  push:
    branches: [master]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - uses: docker/login-action@v2
      with:
        registry: ghcr.io
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    - uses: docker/setup-buildx-action@v2
    - run: |
        docker buildx build \
          --pull \
          --cache-to type=inline \
          --cache-from type=registry,ref=ghcr.io/lynxqtel/cat:latest \
          --push \
          -t ghcr.io/lynxqtel/cat:latest \
          .

This uses BuildKit to build and push an image with its cache inlined, which allows the results of individual build steps to be reused in subsequent builds via --cache-from, both in the Actions workflow and on developers' machines. Developers who have cloned the repository can download the automation's builds as cache instead of running every step on their own machines. This can be extremely useful if there are many variants of the image that could be built, if the automation machines are significantly more powerful or parallelizable than the developers' machines, or if the builds just take a very long time.

Unfortunately, the above is a fanciful idea that does not hold in practice. The setup as described will not always produce cache hits when a developer runs a comparable build command.
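For concreteness, a "comparable build command" on a developer's machine might look like the following sketch. The local tag cat:dev and the DOCKER override variable are assumptions made for illustration; only the --cache-from reference comes from the workflow above.

```shell
# Sketch of a developer-side build that tries to reuse the published cache.
# DOCKER is a hypothetical override so the command can be dry-run (DOCKER=echo).
# cat:dev is a hypothetical local tag, not something the workflow defines.
dev_build() {
  "${DOCKER:-docker}" buildx build \
    --pull \
    --cache-from type=registry,ref=ghcr.io/lynxqtel/cat:latest \
    -t cat:dev \
    "${1:-.}"
}
```

Running dev_build from the repository root would be expected to report the ADD step as cached, and this is exactly the step that may miss.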

User Groups and UMASK

A process's umask controls the default permissions of the files it creates. On Ubuntu systems, /etc/login.defs sets the defaults UMASK 022 and USERGROUPS_ENAB yes. The bits set in the umask are cleared from the requested mode (mode AND NOT umask), so they determine which user, group, and other permission bits can never be set on a new file. The file provides the following commentary on these settings:

# UMASK is the default umask value for pam_umask and is used by
# useradd and newusers to set the mode of the new home directories.
# 022 is the "historical" value in Debian for UMASK
# 027, or even 077, could be considered better for privacy
# There is no One True Answer here : each sysadmin must make up his/her
# mind.
#
# If USERGROUPS_ENAB is set to "yes", that will modify this UMASK default value
# for private user groups, i. e. the uid is the same as gid, and username is
# the same as the primary group name: for these, the user permissions will be
# used as group permissions, e. g. 022 will become 002.

What this means is that for a developer cloning a repository under an account whose GID matches its UID (a private user group), the files in the checkout will have permissions -rw-rw-r--, because the USERGROUPS_ENAB setting changes the umask from 022 to 002. If the account's GID does not match its UID, the umask is left unmodified at 022 and the checked-out files have permissions -rw-r--r--. In other words, whether or not an account's primary group is a private user group affects the permissions of new files.
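The effect is easy to reproduce without Git. The sketch below creates an empty file under a given umask and prints its mode. It assumes GNU coreutils (stat -c); the function names mode_under_umask and has_private_group are made up for this example.

```shell
# Print the octal mode of a file freshly created under the given umask.
# Runs in a subshell so the caller's umask is left untouched.
mode_under_umask() {
  (
    umask "$1"
    tmp=$(mktemp -d)
    : > "$tmp/f"            # new regular files are requested as 0666, minus the umask bits
    stat -c '%a' "$tmp/f"   # GNU stat; BSD/macOS stat uses different flags
    rm -rf "$tmp"
  )
}

# Succeeds when the current account looks like it has a private user group
# (UID equals GID and the user name equals the primary group name).
has_private_group() {
  [ "$(id -u)" = "$(id -g)" ] && [ "$(id -un)" = "$(id -gn)" ]
}

mode_under_umask 022   # prints 644, i.e. -rw-r--r--
mode_under_umask 002   # prints 664, i.e. -rw-rw-r--
```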

ADD/COPY Caching

Both the Docker daemon and builders using the docker-container buildx driver compute a checksum of the file(s) in the build context named in an ADD or COPY instruction. The files' permissions are included in this checksum, and it is currently not possible to prevent different permissions from producing different checksums. The --chmod option does not help either: it is applied to the file after the checksum is computed, not before.
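To see why a permission bit alone is enough to change the result, consider this toy digest that mixes a file's mode into a hash of its content. It is only an illustration of the idea, not BuildKit's actual checksum algorithm; mode_aware_digest is a made-up name, and GNU stat and sha256sum are assumed.

```shell
# Toy illustration only: NOT the algorithm Docker or BuildKit uses.
# Hashes the file's octal mode together with a digest of its bytes.
mode_aware_digest() {
  mode=$(stat -c '%a' "$1")                      # e.g. 644 or 664 (GNU stat)
  content=$(sha256sum < "$1" | cut -d' ' -f1)    # digest of the bytes alone
  printf '%s %s\n' "$mode" "$content" | sha256sum | cut -d' ' -f1
}
```

Two copies of cat-me with identical bytes but modes 644 and 664 get different digests under this scheme, which mirrors what happens to the ADD step's cache key.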

It is in this way that a different effective umask between two checkouts causes ADD/COPY instructions to produce different checksums. The result is a cache miss whenever the developer's umask differs from the one in effect on the machine that built and pushed the cache.

Solutions

There is no elegant way to deal with this. It can be addressed on either the user side or the producer side.

On the user side, a developer can use a script to set the permissions of the files in the Docker build context before performing a build, for example from a Git post-checkout hook.

On the producer side, the automation can publish a separate cache build of the same image for each umask variant expected among the developers. In this case, each developer's build command needs an additional --cache-from flag for every variant.

The latter approach spares developers from having to run their builds as part of, or alongside, permission-fixing shell scripts, but it adds complexity to what may already be a complicated tagging scheme. I slightly prefer the former: straightforward setup instructions for a Git checkout hook that unifies the permissions of the entire Docker build context effectively solve the problem with no further action or accommodation needed.
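A minimal sketch of the user-side approach follows. It assumes the cache was built from files with modes 644 (755 for executables and directories), which should be confirmed against the machine that pushes the cache. The function name normalize_perms is hypothetical.

```shell
# Force every file in the build context to a single, umask-independent mode.
# Assumption: the cache producer's checkout used 644/755; adjust if it did not.
normalize_perms() {
  ctx=${1:-.}
  # Directories: rwxr-xr-x. The .git directory is skipped in every pass.
  find "$ctx" -name .git -prune -o -type d -exec chmod 755 {} +
  # Files already executable by their owner: rwxr-xr-x.
  find "$ctx" -name .git -prune -o -type f -perm -100 -exec chmod 755 {} +
  # All remaining regular files: rw-r--r--.
  find "$ctx" -name .git -prune -o -type f ! -perm -100 -exec chmod 644 {} +
}
```

Saved as an executable .git/hooks/post-checkout script that ends with a call such as normalize_perms "$(git rev-parse --show-toplevel)", this runs after every checkout. Because post-checkout is not invoked by merges, the same script may also need to be installed as a post-merge hook.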

Until an official solution to this problem is developed and merged, developers performing Docker image builds should take care to check that the permissions on the files they wish to include in the build match the permissions used when the caches were created. Git records only the executable bit, not the full file mode, so even with a completely clean repository different developers can get different cache hit rates simply because of whether or not their GID matches their UID, and in the process of trying to discover this come dangerously close to descending into madness.
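One way to perform that check before a build is to list every file whose mode differs from what the cache producer is assumed to have used. The sketch below again assumes 644 and 755 as the expected modes and GNU stat; list_unexpected_modes is a made-up name.

```shell
# Print "<mode> <path>" for each regular file in the build context whose
# mode is neither 644 nor 755 (the modes assumed to match the published cache).
list_unexpected_modes() {
  ctx=${1:-.}
  find "$ctx" -name .git -prune -o \
    -type f ! -perm 644 ! -perm 755 -exec stat -c '%a %n' {} +
}
```

Empty output means the context matches the assumed modes; any line printed points at a file likely to cause a cache miss in its ADD or COPY step.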

March 20, 2023 · docker, caching

P’ar Aed