Identifying What To Build in a Mono-Repo With Bash & Git

Bob Van Dell II (Awesome Bob)
Jan 31, 2023

Across my career as a software engineer, I’ve found myself optimizing CI/CD pipelines several times to eliminate waste and inefficiencies. In many cases, I work with mono-repos, which make it easy to apply the same build process and quality controls across similar software components. In this situation, I must either build everything, always, or find an efficient way to build only what has changed.

Mono-repo-specific tooling exists for many runtimes, and some of it is language-agnostic. However, a reliable approach I’ve found for building only what has changed is to lean on Bash and Git. Git is the de facto source of truth on “what has changed,” and Bash is available in most build environments. I also feel most engineers should be somewhat familiar with Bash, especially if they build software across enough runtimes or languages to need a reusable solution like this.

Assumptions

Before we begin, there are some assumptions this approach makes about the software development process:

  1. Following continuous-integration (CI) best practices, you build on every meaningful commit. Here, “meaningful” means a change that deterministically alters the output of the build (input => output).
  2. PR builds are compared to the HEAD commit of the main CI branch. You can replace main with your particular CI branch name, or compare against a release branch for hotfixes, etc.
  3. You squash-merge PRs into your main branch, so each merge lands as a single commit. That lets a build of main compare its HEAD commit to HEAD~1, the commit immediately before it.
  4. This one may not apply to you, but we have a convention that components live under a particular directory. Each can be built independently from its respective component directory (dependencies between components are a separate issue I won’t cover here). For example, in our container image mono-repo, all directories under our ./images directory are the root of a container image.
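
To make assumptions 2 and 3 concrete, here is a minimal sketch that builds a throwaway repository and runs both comparisons. The directory layout, the feature branch name, and the commit messages are made up for illustration; only the two git diff --name-only calls mirror what the rest of this post relies on.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; every name below is illustrative only.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # make sure the first branch is "main"
git config user.email "ci@example.com"
git config user.name "CI"

mkdir -p images/base images/app
echo "FROM scratch" > images/base/Dockerfile
echo "FROM scratch" > images/app/Dockerfile
git add .
git commit -q -m "initial"

# A PR branch that touches only images/app
git checkout -q -b feature
echo "# change" >> images/app/Dockerfile
git commit -q -am "update app"

# PR build: compare the checked-out PR to main (assumption 2)
pr_changes="$(git diff --name-only main)"
echo "PR vs main: $pr_changes"

# Squash-merge into main, then compare main to its previous commit (assumption 3)
git checkout -q main
git merge -q --squash feature
git commit -q -m "update app (squashed)"
main_changes="$(git diff --name-only HEAD~1)"
echo "main vs HEAD~1: $main_changes"
```

Both comparisons should report the same single file, images/app/Dockerfile, which is why the same script can serve PR builds and main builds with only the comparison target swapped.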

Use Case: PR

To build up our solution, we’ll start with the use case that we have a PR on our container images mono-repo that we intend to merge into our main branch.

Our implementation will live in a Bash script located in ./scripts/find_changed_images.sh and our images are built from files located in ./images/{image name}.

First, let’s look at how we can identify all the files that differ between our PR and main:

#!/usr/bin/env bash
set -euo pipefail

# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"

for file_changed in $(git diff --name-only "${COMPARE_TO}"); do
  echo "$file_changed"
done

This script is a great start, but it will list every file created, deleted, or modified, and we only want a list of directories that contain meaningful changes. So let’s update our implementation to only look for changes under our ./images directory by adding a filter to our output of the original git diff command with | grep '^images/':

#!/usr/bin/env bash
set -euo pipefail

# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"

for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
  echo "$file_changed"
done

We’ve made even more progress, but we only need each image’s root directory under our ./images directory. We can get that by using a combination of dirname, awk, and [ -d "./images/$top" ] like so:

#!/usr/bin/env bash
set -euo pipefail

# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"

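for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
  # second path component, e.g. images/nginx/conf/x -> nginx
  top=$(dirname "$file_changed" | awk 'BEGIN { FS="/" } {print $2}')
  # use `if` so a failed test on the last file does not become the script's exit status
  if [ -n "$top" ] && [ -d "./images/$top" ]; then echo "$top"; fi
done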
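
To see what the dirname and awk combination does to a single path, here is a small sketch that wraps it in a function. The name image_root_of is hypothetical and not part of the script above; the pipeline inside it is the same one used in the loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the same dirname + awk pipeline used in the loop.
image_root_of() {
  # dirname drops the file name; awk prints the second "/"-separated field
  dirname "$1" | awk 'BEGIN { FS="/" } { print $2 }'
}

image_root_of "images/app/Dockerfile"          # prints: app
image_root_of "images/app/conf/default.conf"   # prints: app
image_root_of "images/README.md"               # prints an empty line (no image root)
```

The last call shows why a file sitting directly under ./images yields an empty name: dirname returns just images, so there is no second field to print.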

Now that we have our image root directories, we can de-duplicate our list by creating a function for our current implementation and defining a new loop that uses that function’s output and filters it to unique values with | sort -u:

#!/usr/bin/env bash
set -euo pipefail

# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"

get_all_changed_image_dirs() {
  local file_changed top
  for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
    top=$(dirname "$file_changed" | awk 'BEGIN { FS="/" } {print $2}')
    # use `if` so a failed test on the last file does not become the function's exit status
    if [ -n "$top" ] && [ -d "./images/$top" ]; then echo "$top"; fi
  done
}

for dir in $(get_all_changed_image_dirs | sort -u); do
  echo "$dir"
done
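
One caveat with for ... in $(...) is that it splits on whitespace, so a changed path containing a space would be broken apart. As a hedged alternative, the sketch below reads paths line by line from stdin and de-duplicates with the same sort -u. The function name unique_image_roots is hypothetical, and it uses parameter expansion in place of dirname and awk; it is not the script from this post, just one way the same step could be written.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads changed paths on stdin, prints each image root once.
unique_image_roots() {
  local path rest
  while IFS= read -r path; do
    case "$path" in
      images/*/*)                      # only paths inside an image directory
        rest="${path#images/}"         # strip the leading "images/"
        printf '%s\n' "${rest%%/*}"    # keep everything before the next "/"
        ;;
    esac
  done | sort -u
}

# Usage with sample input (in the real script the input would come from git diff):
printf '%s\n' \
  'images/app/Dockerfile' \
  'images/app/conf/a b.conf' \
  'images/base/Dockerfile' \
  'images/README.md' \
  'README.md' | unique_image_roots
```

For the sample input this prints app and base, one per line. Unlike the script above, it does not check that the directory still exists on disk, so a deleted image directory would still be listed.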

Lastly, since we’re building images with Docker, we only want to output directories that contain a Dockerfile in their root. So we add one more filter to our new loop that checks [ -f "./images/$dir/Dockerfile" ] before echoing "$dir":

#!/usr/bin/env bash
set -euo pipefail

# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"

get_all_changed_image_dirs() {
  local file_changed top
  for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
    top=$(dirname "$file_changed" | awk 'BEGIN { FS="/" } {print $2}')
    if [ -n "$top" ] && [ -d "./images/$top" ]; then echo "$top"; fi
  done
}

for dir in $(get_all_changed_image_dirs | sort -u); do
  # use `if` so a directory without a Dockerfile does not make the script exit non-zero
  if [ -f "./images/$dir/Dockerfile" ]; then echo "$dir"; fi
done

And that’s it!

Now we can pass the output of this script to our build command to build the container images from directories that have changed:

for image_dir in $(./scripts/find_changed_images.sh); do
  docker build -t "$image_dir" "./images/$image_dir"
done
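
Before wiring this into a pipeline, it can help to see which builds would run without actually running them. The sketch below is one way to do that. The function name build_changed_images and the DRY_RUN variable are hypothetical names I’m introducing for illustration; the only real command it calls is docker build -t.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: reads image directory names on stdin, one per line.
# DRY_RUN=1 (an illustrative variable) prints each command instead of running it.
build_changed_images() {
  local image_dir
  while IFS= read -r image_dir; do
    [ -n "$image_dir" ] || continue    # skip blank lines
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker build -t $image_dir ./images/$image_dir"
    else
      # </dev/null keeps docker from consuming the list we are reading
      docker build -t "$image_dir" "./images/$image_dir" </dev/null
    fi
  done
}
```

A dry run against the script from this post would then look like ./scripts/find_changed_images.sh | DRY_RUN=1 build_changed_images, and dropping DRY_RUN=1 performs the real builds.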

Use case: main branch

The only difference between our PR builds and our main branch build is the value we set for COMPARE_TO:

for image_dir in $(COMPARE_TO='HEAD~1' ./scripts/find_changed_images.sh); do
  docker build -t "$image_dir" "./images/$image_dir"
done

Many build systems let you set different environment variables for builds of the main branch, builds of a release/* branch, and builds run against PRs. That gives you a lot of flexibility in choosing which comparison decides the directories to build.
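
If your build system does not make per-branch variables convenient, the same choice can be made in Bash. This is a minimal sketch under stated assumptions: the function name compare_target_for and its two arguments are hypothetical, and treating release/* branches like main (comparing to HEAD~1) is my assumption for illustration, not a rule from this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a COMPARE_TO value from the branch name and build type.
# $1 = branch being built, $2 = "true" when the build is for a PR (default "false")
compare_target_for() {
  local branch="$1" is_pr="${2:-false}"
  if [ "$is_pr" = "true" ]; then
    echo "main"                        # PR builds compare to main
    return 0
  fi
  case "$branch" in
    main|release/*) echo "HEAD~1" ;;   # assumption: squash-merged, so compare to the previous commit
    *)              echo "main" ;;     # anything else falls back to comparing to main
  esac
}

compare_target_for "main"               # prints: HEAD~1
compare_target_for "release/1.2"        # prints: HEAD~1
compare_target_for "my-feature" "true"  # prints: main
```

The result can be passed straight through, for example COMPARE_TO="$(compare_target_for "$branch")" ./scripts/find_changed_images.sh, where $branch holds whatever branch variable your build system exposes.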
