Identifying What To Build in a Mono-Repo With Bash & Git
Across my career as a software engineer, I’ve found myself optimizing CI/CD pipelines several times to eliminate waste and inefficiencies. In many cases, I work with mono-repos that ensure the same build process and quality controls are applied consistently across similar software components. In this situation, I must either build everything, always, or find an efficient way to build only what has changed.
Mono-repo-specific tooling exists for many runtimes, and some of it is language-agnostic. However, a reliable approach I’ve found to building only what has changed is to leverage Bash and Git. Git is the de facto source of truth on “what has changed,” and Bash is available in most build environments. I also feel that most engineers building software should be somewhat familiar with Bash, especially if they build software across enough runtimes or languages to need a reusable solution like this.
Assumptions
Before we begin, there are some assumptions this approach makes about the software development process:
- Following continuous-integration (CI) best practices, you build on every meaningful commit. We can define meaningful as changes that have a deterministic difference in the output of our build (input => output).
- PR builds are compared to the HEAD commit of the main CI branch. You can replace main with your particular CI branch name or compare a release branch for hotfixes, etc.
- You squash-merge commits before merging to your main branch. This enables comparing changes to the head commit of main to HEAD~1, which is the commit before the HEAD commit of main.
- This one may not apply to you, but we have a convention that components live under a particular directory. Each can be built independently from its respective component directory (dependencies between components are a separate issue I won’t cover here). For example, in our container image mono-repo, all directories under our ./images directory are the root of a container image.
Use Case: PR
To build up our solution, we’ll start with the use case that we have a PR on our container images mono-repo that we intend to merge into our main branch.
Our implementation will live in a Bash script located in ./scripts/find_changed_images.sh and our images are built from files located in ./images/{image name}.
First, let’s look at how we can identify all the files that differ between our PR and main:
#!/usr/bin/env bash
set -euo pipefail
# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"
for file_changed in $(git diff --name-only "${COMPARE_TO}"); do
  echo "$file_changed"
done
This script is a great start, but it will list every file created, deleted, or modified, and we only want a list of directories that contain meaningful changes. So let’s update our implementation to only look for changes under our ./images directory by adding a filter to the output of the original git diff command with | grep '^images/':
#!/usr/bin/env bash
set -euo pipefail
# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"
for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
  echo "$file_changed"
done
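One thing to watch with this filter under set -euo pipefail: grep exits with status 1 when no line matches. Inside the word list of a for loop that status is discarded, but if you later capture the same pipeline in a variable assignment, a PR that touches nothing under images/ would abort the script. Here is a minimal sketch of a more forgiving filter, where the function name filter_image_paths is my own placeholder and not part of the original script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads paths on stdin and keeps only those under images/.
filter_image_paths() {
  # grep exits 1 when nothing matches; accept that as "no changes" but
  # still fail on a real grep error (status 2 or higher).
  grep '^images/' || [ "$?" -eq 1 ]
}

# Intended use (needs a Git checkout, so it is left commented out here):
#   changed="$(git diff --name-only "${COMPARE_TO:-main}" | filter_image_paths)"
```

With this in place, an empty result is simply an empty string rather than a failed build step.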
We’ve made even more progress, but we only need each image’s root directory under our ./images directory. We can get that by using a combination of dirname, awk, and [ -d "./images/$top" ] like so:
#!/usr/bin/env bash
set -euo pipefail
# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"
for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
  top=$(dirname "$file_changed" | awk 'BEGIN { FS="/" } {print $2}')
  [ -d "./images/$top" ] && echo "$top"
done
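To make the dirname and awk step concrete, here is a small sketch that wraps it in a function and runs it on a few sample paths. The function name image_root_of and the sample paths are illustrative assumptions, not part of the original script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the second path segment of a changed file,
# which is the image directory name for paths shaped like images/<name>/...
image_root_of() {
  dirname "$1" | awk 'BEGIN { FS="/" } { print $2 }'
}

image_root_of "images/nginx/Dockerfile"         # prints: nginx
image_root_of "images/nginx/conf/default.conf"  # prints: nginx
image_root_of "images/README.md"                # prints an empty line
```

Note the last case: a file directly under ./images yields an empty name, and [ -d "./images/" ] is still true, so the loop can echo a blank line for it. A for loop over the script's output drops blank lines through word splitting, but it is worth knowing if you consume the output another way.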
Now that we have our image root directories, we can de-duplicate our list by creating a function for our current implementation and defining a new loop that uses that function’s output and filters it to unique values with | sort -u:
#!/usr/bin/env bash
set -euo pipefail
# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"
get_all_changed_image_dirs() {
  for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
    top=$(dirname "$file_changed" | awk 'BEGIN { FS="/" } {print $2}')
    [ -d "./images/$top" ] && echo "$top"
  done
}
for dir in $(get_all_changed_image_dirs | sort -u); do
  echo "$dir"
done
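If you prefer to avoid the nested loops, the same extraction and de-duplication can be sketched as a single pipeline that reads paths on stdin. This is an alternative under stated assumptions, not the script above: the function name changed_image_roots is hypothetical, and unlike the original it does not check that the directory still exists, so a deleted image would still be listed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads changed paths on stdin, prints unique image names.
changed_image_roots() {
  # Keep paths shaped like images/<name>/<something>, print <name>, de-duplicate.
  awk -F/ '$1 == "images" && NF >= 3 { print $2 }' | sort -u
}

printf '%s\n' \
  "images/nginx/Dockerfile" \
  "images/nginx/conf/default.conf" \
  "images/redis/Dockerfile" \
  "images/README.md" \
  "README.md" | changed_image_roots
# prints:
# nginx
# redis
```

In a real script the input would come from git diff --name-only "${COMPARE_TO}" instead of printf. Reading lines from a pipe also sidesteps the word splitting that a for loop over $(...) applies to paths containing spaces.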
Lastly, since we’re building images with docker, we only want to output directories that contain a Dockerfile in their root. So we add one more filter with [ -f "./images/$dir/Dockerfile" ] && echo "$dir" to our new loop:
#!/usr/bin/env bash
set -euo pipefail
# default value of "main" if not already set
COMPARE_TO="${COMPARE_TO:-main}"
get_all_changed_image_dirs() {
  for file_changed in $(git diff --name-only "${COMPARE_TO}" | grep '^images/'); do
    top=$(dirname "$file_changed" | awk 'BEGIN { FS="/" } {print $2}')
    [ -d "./images/$top" ] && echo "$top"
  done
}
for dir in $(get_all_changed_image_dirs | sort -u); do
  [ -f "./images/$dir/Dockerfile" ] && echo "$dir"
done
And that’s it!
Now we can pass the output of this script to our build command to build the container images from directories that have changed:
for image_dir in $(./scripts/find_changed_images.sh); do
  docker build -t "$image_dir" "./images/$image_dir"
done
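One subtlety of the [ -f ... ] && echo filter in the script: when the last directory in the sorted list has no Dockerfile, the test fails on the final iteration and the script itself exits with status 1, even though nothing went wrong. A for loop over the script's output ignores that status, but a CI step that runs the script directly, or captures its output under set -e, may treat it as a failure. Here is a minimal sketch of the same filter written with if, so that a non-match is not reported as an error. The function name only_with_dockerfile and its images-root argument are assumptions for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads image directory names on stdin and prints the
# ones that contain a Dockerfile under the given images root.
only_with_dockerfile() {
  local images_root="$1" dir
  while IFS= read -r dir; do
    # "if" returns 0 when the condition is false, unlike "test && echo".
    if [ -n "$dir" ] && [ -f "$images_root/$dir/Dockerfile" ]; then
      echo "$dir"
    fi
  done
}

# Intended use inside find_changed_images.sh (left commented out here):
#   get_all_changed_image_dirs | sort -u | only_with_dockerfile ./images
```

The output is the same list of directory names, so the build loop does not need to change.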
Use Case: main branch
The only difference between our PR builds and our main branch build is what we set our COMPARE_TO value to:
for image_dir in $(COMPARE_TO='HEAD~1' ./scripts/find_changed_images.sh); do
  docker build -t "$image_dir" "./images/$image_dir"
done
In various build systems, you can set different environment variables for changes to a main branch than for changes to a release/* branch, or for builds run against PRs. That gives you a lot of flexibility in deciding which changed files determine which directories get built.
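As one way to picture this, the choice of COMPARE_TO could be sketched as a small function keyed on the branch being built. Everything here is an assumption for illustration: the function name resolve_compare_to, its two arguments, and the choice to treat release/* branches like main. Your build system will expose the branch name and PR status through its own variables.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: picks the Git revision to diff against.
#   $1 = name of the branch being built
#   $2 = "true" when the build is for a PR (defaults to "false")
resolve_compare_to() {
  local branch="$1" is_pr="${2:-false}"
  if [ "$is_pr" = "true" ]; then
    echo "main"      # PR builds are compared to the CI branch
    return
  fi
  case "$branch" in
    main | release/*) echo "HEAD~1" ;;  # squash-merged: compare to the previous commit
    *) echo "main" ;;                   # anything else: compare to the CI branch
  esac
}

resolve_compare_to "main"                # prints: HEAD~1
resolve_compare_to "feature/login" true  # prints: main
```

The result can then be passed along in the same way as before, for example COMPARE_TO="$(resolve_compare_to "$branch" "$is_pr")" ./scripts/find_changed_images.sh, where branch and is_pr are placeholders for whatever your CI provides.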