This post is the second of a 2-part series about running Kubernetes build jobs. In part 1 we reviewed the basic building blocks for deploying workloads to Kubernetes – the Docker image/container and the Kubernetes pod. In this post, we utilize the Kubernetes job object which provides better fault-tolerance and scalability. It’s strongly recommended that you read the first part to understand the low-level building blocks on which this post depends.
Running the Example Code
To run the code you need the prerequisites listed below. It’s recommended to run the commands yourself as you follow along with the post:
- Install Docker
- Install kubectl and connect it to a Kubernetes cluster. You can use kind or Minikube to run a development cluster locally.
- Fork the example code repository so you can publish releases
- Create a personal access token with permissions for managing releases
All the commands shown in the post should be run from the root of the repository. After forking the code repository, clone it and open a terminal in its root directory.
To make the code samples easier to run, set the following environment variables in your shell (replace YourGitHub* values with your relevant details):
Also, set the following environment variable which will make following code samples shorter and easier to follow:
More Robust Workload Execution Using Kubernetes Job Objects
In the previous post, we saw how to run pods on our Kubernetes cluster. While this can work fine for many use cases, it does have some subtle shortcomings. A Kubernetes cluster can be very dynamic: nodes can be stopped for upgrades, or a pod can be scheduled on a node without enough RAM, which causes the pod to die unexpectedly. For this reason it is not recommended to use pods directly. The best practice is to use higher-level abstractions which allow Kubernetes to handle these and other unexpected failures.
The recommended object for running CI build jobs or other one-off processing tasks is the Kubernetes job object. The job object schedules pods and manages them to ensure the job runs to completion.
Jobs are defined in Kubernetes YAML files. All the example YAML files are available in the code repository under the manifests/ directory. We use shell templates based on envsubst to simplify the creation of multiple objects from the same template.
Let’s start with a simple example that converts the pod we used in part 1 into a Kubernetes job object:
- name: builder
args: ["$OS/$ARCH", "$GITHUB_USER", "$TAG"]
- name: TOKEN
- name: The name of the job object that will be created. We generate it dynamically from environment variables using envsubst.
- image: The Docker image we want to deploy. This is the same image we used in part 1, which compiles and publishes a simple hello world Go binary. It could also be changed to handle C++ compilation or any other build or processing task.
- args: The arguments to pass to the image, in this case – the OS architecture to compile, the GitHub user name and tag name to publish the binary to.
- env: We add the GitHub token as an environment variable inside the container to allow the script to publish the binary to your GitHub.
- restartPolicy: Kubernetes by default tries to restart pods which stop for any reason, but for build jobs that’s not desired as we want to run the job once without restarting.
Publish a new release named v0.0.3 and deploy a few jobs to your cluster using the following command:
cat manifests/single-pod-job.yaml | OS=linux ARCH=amd64 TAG=v0.0.3 envsubst | kubectl apply -f -
cat manifests/single-pod-job.yaml | OS=linux ARCH=386 TAG=v0.0.3 envsubst | kubectl apply -f -
cat manifests/single-pod-job.yaml | OS=windows ARCH=arm TAG=v0.0.3 envsubst | kubectl apply -f -
Let’s break down these commands to see what each part means:
- cat manifests/single-pod-job.yaml |
Opens the manifest file and forwards its contents to the next command.
- OS=linux ARCH=amd64 TAG=v0.0.3 envsubst |
This runs the envsubst command, which provides basic templating capabilities. It is part of GNU gettext and is preinstalled on many Linux distributions, but it is not a shell builtin, so you may need to install it. It replaces the environment variable references inside the manifest file with their actual values, allowing you to reuse the same file multiple times with different values. The result is then forwarded to the next command.
- kubectl apply -f -
Apply the manifest on the Kubernetes cluster – causing the creation of the job object.
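To make the templating step more concrete, here is a minimal Python sketch of the kind of substitution envsubst performs. It is an approximation, not a replacement: it uses string.Template from the standard library, the name pattern is an assumption inferred from the job names used in this post, and example-user is a placeholder. One difference to be aware of is that envsubst replaces unset variables with an empty string, while safe_substitute leaves unknown placeholders untouched.

```python
from string import Template

# Fragment modelled on the job manifest. The name pattern is an assumption
# inferred from the job names used in this post, not copied from the repository.
MANIFEST_FRAGMENT = """\
name: builder-$TAG-$OS-$ARCH
args: ["$OS/$ARCH", "$GITHUB_USER", "$TAG"]
"""


def render(template_text, variables):
    """Replace $NAME and ${NAME} placeholders; unknown names are left as-is."""
    return Template(template_text).safe_substitute(variables)


if __name__ == "__main__":
    values = {
        "OS": "linux",
        "ARCH": "amd64",
        "TAG": "v0.0.3",
        "GITHUB_USER": "example-user",  # placeholder value
    }
    print(render(MANIFEST_FRAGMENT, values))
```

Rendering the same fragment with different OS and ARCH values produces a differently named job each time, which is why one manifest file can be applied several times.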
After running these commands you can list the pods, just as we did in the previous post:
kubectl get pods
But, you will now also see job objects:
kubectl get jobs
The job objects keep track of the pods and make sure each pod runs to completion. In case of failure the job will retry and schedule a new pod up to 6 times (configurable via the backoffLimit attribute). This means that unexpected failures, such as a node failure or a pod running out of RAM, will not by themselves prevent the job from completing: Kubernetes keeps rescheduling the work until it succeeds or the retry limit is reached.
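The retry behavior can be pictured with a small Python model. This is a simplified sketch of the idea, not the Kubernetes controller: run_job and its task argument are hypothetical names, each call to task stands in for one pod, and the real controller also waits with an increasing back-off delay between retries, which is omitted here.

```python
def run_job(task, backoff_limit=6):
    """Run task until it succeeds or the retry budget is used up.

    Returns a (succeeded, attempts) tuple. One initial attempt is made,
    followed by at most backoff_limit retries.
    """
    attempts = 0
    failures = 0
    while failures <= backoff_limit:
        attempts += 1
        try:
            task()  # stands in for one pod running the build
            return True, attempts
        except Exception:
            failures += 1  # the "pod" failed; try again with a new one
    return False, attempts
```

With the default limit, a task that fails twice because of a lost node and then succeeds still completes, while a task that always fails is eventually given up on instead of being retried forever.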
When the jobs are complete, you should clean up all the pods to prevent clutter on your cluster by deleting the job objects:
kubectl delete job builder-v0.0.3-linux-amd64 builder-v0.0.3-linux-386 builder-v0.0.3-windows-arm
When you delete the job objects, the created pods will also be deleted.
Scaling up Using Job Queues
The build script supports 44 OS architectures and if your cluster has the capacity it would be best to run all of them in parallel. However, all the examples so far require scheduling each job separately. One of the strengths of the Kubernetes job object is the ability to schedule many parallel pods and wait for them to complete processing.
To use that functionality we need a queue to store the items which need to be processed and handle the queue logic – getting items from the queue, handling timeouts/errors, etc. There are many different ways to implement that and it’s worth checking if your company has existing solutions. For this example, I will show a simple queue implementation based on Redis and minimal Python “glue” code.
You can see all the code under the builder-queue/ directory. I will highlight parts of the code below:
- builder-queue/Dockerfile – extends the builder image which contains the build script. This can easily be modified to extend any build script or processing task with the queue functionality. We add Python 3 and the rq library, which handles the queue implementation.
- builder-queue/builder_queue_entrypoint.sh – overrides the builder entrypoint and adds commands to handle the queue – get info about the queue, add items to the queue and run a worker which processes items from the queue.
- builder-queue/builder_queue.py – handles adding items to the queue. Supports adding all supported OS architectures to the queue as individual items.
- builder-queue/builder_queue_lib.py – handles the processing of each item. All it does is call the original entrypoint of the builder script, which compiles and publishes the example binary for the requested OS architecture.
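The division of work between these files can be sketched in a few lines of Python. This is not the repository code: build_items, process_item, the three-entry PLATFORMS list and the /entrypoint.sh path are all hypothetical names chosen for illustration, and the sketch leaves out rq and Redis entirely. It only shows the two ideas described above: expanding "all" into one item per OS architecture, and processing an item by delegating to the original builder entrypoint.

```python
import subprocess

# Hypothetical subset of targets; the real build script supports many more.
PLATFORMS = ["linux/amd64", "linux/386", "windows/arm"]

# Hypothetical path to the original builder entrypoint inside the image.
BUILDER_ENTRYPOINT = "/entrypoint.sh"


def build_items(target, github_user, tag, platforms=PLATFORMS):
    """Expand 'all' into one item per platform, otherwise return a single item."""
    selected = platforms if target == "all" else [target]
    return [(platform, github_user, tag) for platform in selected]


def process_item(item, run=subprocess.run):
    """Build one item by calling the original entrypoint with the item's arguments."""
    platform, github_user, tag = item
    # check=True makes a failed build raise, so the failure is not silently ignored.
    return run([BUILDER_ENTRYPOINT, platform, github_user, tag], check=True)
```

Passing the runner as a parameter keeps the sketch testable without the builder image: in a test, run can be replaced by a function that only records the command.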
To deploy this job queue, we first need our queue server. In this case, we will use Redis. The code repository contains a simple YAML file, manifests/redis.yaml, which defines a Redis deployment and service. The following commands will deploy it and wait for it to be available:
kubectl apply -f manifests/redis.yaml &&\
kubectl wait deployment/redis --for condition=available
- kubectl apply -f manifests/redis.yaml: Apply the given manifest file to the cluster. In this case, we don’t need to pipe it through envsubst because there are no environment variables to replace.
- kubectl wait deployment/redis --for condition=available: Wait until the Redis deployment is available.
To add jobs to the queue and query the queue status we need access to this Redis server. We can use the kubectl port-forward command to enable this:
kubectl port-forward deployment/redis 6379 &
Now local port 6379 is forwarded to the Redis deployment on your Kubernetes cluster.
Deploy a new release named v0.0.4 and run the following command to add all the OS architectures to the queue:
docker run --network host $QUEUE_IMAGE --rq-add all $GITHUB_USER v0.0.4
This command simply adds items to the Redis queue, one item per OS architecture. The logic lives in builder-queue/builder_queue.py if you want to review it.
Now everything is ready to start running the actual workloads. We will use the following YAML:
- name: builder-queue
- name: TOKEN
- name: RQ_REDIS_HOST
The main difference from the simpler job we used earlier is the parallelism attribute. In this case, we set it to 4, meaning that 4 parallel pods will be started. We use the jobs-builder-queue image with the --rq-worker argument, which processes jobs from the queue until no items are left. We set the restartPolicy to OnFailure so that if there is an error the pod will be restarted, but when there are no items left in the queue the process exits with a successful return code and the pod is not restarted.
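The worker pattern described here can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: threads stand in for the parallel pods, an in-memory queue.Queue stands in for the Redis queue, and run_workers, drain and handle are hypothetical names rather than anything from the repository or the rq library.

```python
import queue
import threading


def drain(work_queue, handle, results):
    """Process items until the queue is empty, then return normally.

    Returning normally corresponds to the worker exiting with a successful
    return code, so nothing needs to be restarted.
    """
    while True:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            return
        results.append(handle(item))


def run_workers(items, handle, parallelism=4):
    """Start `parallelism` workers on a shared queue and wait for all of them."""
    work_queue = queue.Queue()
    for item in items:
        work_queue.put(item)
    results = []
    workers = [
        threading.Thread(target=drain, args=(work_queue, handle, results))
        for _ in range(parallelism)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    targets = ["linux/amd64", "linux/386", "windows/arm"]
    print(sorted(run_workers(targets, lambda platform: f"built {platform}")))
```

Because every worker pulls from the same queue, the number of workers can change without touching the list of items, which is the same property that lets the parallelism attribute be raised or lowered independently of the queue contents.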
Deploy this job using the following command:
cat manifests/multi-pod-job.yaml | envsubst | kubectl apply -f -
Check the pods as they are being created and wait for them to be Running:
kubectl get pods
You should see 4 pods as we specified 4 in the parallelism attribute. When pods are running, you can check the queue status using the following command:
docker run --network host $QUEUE_IMAGE --rq-info
You should see all the workers in a busy state. The number of items in the queue should decrease.
When all items in the queue have been processed, you should see all the pods with status Completed.
You can now stop the port-forward to the Redis deployment by running the following:
Now you can clean up the created pods by deleting the job object, to prevent clutter on your cluster:
kubectl delete job builder-queue
In this post, we expanded on the previous post and used the full power of Kubernetes fault-tolerance and scalability. I recommend reading more about the Kubernetes job object to fully understand all the available features and configuration options.
The simple example we saw of a hello world program in Go can easily be expanded to C++ compilation or any other build job/data processing/time-consuming task. The scale of pods can easily be increased to launch hundreds of parallel pods instead of the 4 we used in this example by changing the parallelism attribute. While your CI system is the most likely place to handle CI jobs, sometimes you will reach some limitations and it’s very useful to have the power of Kubernetes via the job object in your tool belt to use when needed.