Jacob Garber

Bootstrapping a Highly Available Container Registry

Deploying a bare-metal Kubernetes cluster is very much a build-your-own adventure: kubeadm will create a cluster for you, but fundamental building blocks like the load balancer, persistent storage, and ingress you need to set up yourself. And what if you are running in an air-gapped environment? No incoming or outgoing internet connections; at best you get to SSH in.1 Since you can't reach Docker Hub or Quay, you need to set up a registry somewhere locally. Furthermore, this registry needs to be highly available, but since we are bootstrapping a cluster, we can't take advantage of any of the HA machinery Kubernetes gives us, because it doesn't exist yet (sorry, Harbor). It took me a fair bit of head-scratching to figure out how to solve this.

To set the stage, let's say you are deploying onto three nodes inside a private network, and are going to implement failover using a virtual IP managed at Layer 2 (ARP or NDP). Broadly speaking, we can split the deployment into three parts:

  • A backend that stores the container image layers and replicates them across the other nodes.
  • A frontend that serves client requests (e.g. push/pull) and dispatches them to the backend.
  • A virtual IP manager that checks the health of the previous two components and will move the VIP to a different node if they are not healthy.

For the frontend I found two main options:

  • Distribution Registry: This is the original container registry developed by Docker, which has since been donated to the CNCF. It's fairly bare-bones, but it's the de facto way of running a registry yourself.
  • Zot Registry is a bit of a newcomer here, and seems to have been developed by Cisco before being open sourced. Zot comes with many more features than the distribution registry, including a GUI. (Also, retention policies! Never manually delete old images again!) One potential hiccup with this registry is that it strictly conforms to the OCI specifications, so you will be unable to push container images to it using docker push, since that uses the Docker image format. This is easy to work around, though, by using Podman or skopeo, both of which can push images in the OCI format.
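As a concrete sketch, pushing an image in OCI format might look like the following. The registry address and image names here are placeholders, and whether you need to disable TLS verification depends on your setup:

```
# Copy a locally built image to the registry, converting the manifest to OCI.
# "registry.internal:5000" and "myapp" are placeholders.
skopeo copy --format oci --dest-tls-verify=false \
    containers-storage:localhost/myapp:1.0 \
    docker://registry.internal:5000/myapp:1.0

# Or with Podman, which can push in OCI format directly:
podman push --format oci --tls-verify=false \
    localhost/myapp:1.0 registry.internal:5000/myapp:1.0
```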

If it isn't clear already I like Zot quite a bit: it strikes a nice balance between features and simplicity, plus the GUI is nice too.

The backend is the heart of the system and is the most difficult to setup. In general, replicated storage can be done at two levels:

  • Block Storage. This operates at a very low level and replicates the individual blocks of your file system. Due to this generality, replicated block storage systems such as Ceph are often very complicated and resource-intensive. Not the choice here.
  • Object Storage. This operates at a high level and replicates individual "objects", which are just blobs of data. Most object stores use the S3 API, and fortunately Zot supports it as a backend. There are several good options here:2
    • Garage. An interesting project, unlike many other systems it uses CRDTs for replication instead of a consensus algorithm like Raft.
    • SeaweedFS. I've heard excellent things about this, and people have been able to run it with billions of files. Definitely a good choice; however, Garage seemed simpler to deploy, so that's what I went with.
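To wire the two together, Zot can be pointed at Garage's S3 API. A minimal sketch of the relevant part of Zot's config.json might look like this — the endpoint (Garage's S3 API listens on port 3900 by default), bucket name, and credentials are all placeholders, and the exact keys should be checked against Zot's storage documentation:

```
{
  "distSpecVersion": "1.1.0",
  "http": { "address": "0.0.0.0", "port": "5000" },
  "storage": {
    "rootDirectory": "/var/lib/zot",
    "storageDriver": {
      "name": "s3",
      "regionendpoint": "http://127.0.0.1:3900",
      "region": "garage",
      "bucket": "zot-registry",
      "accesskey": "<garage-access-key>",
      "secretkey": "<garage-secret-key>",
      "forcepathstyle": true
    }
  }
}
```

Each node runs its own Zot pointed at the local Garage instance, and Garage takes care of replicating the blobs between nodes.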

Finally, managing the virtual IP. Keepalived is the simplest and most traditional choice here, though (self-plug!) if you want more resilient leader election you could also try kayak.
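For keepalived, the health check can be as simple as curling the registry's /v2/ endpoint. A sketch of the configuration might look like this — the interface name, virtual router ID, and VIP are placeholders for your environment:

```
vrrp_script chk_registry {
    # Succeeds only if the local Zot instance answers on the registry API.
    script "/usr/bin/curl -sf http://127.0.0.1:5000/v2/"
    interval 5
    fall 2
    rise 2
}

vrrp_instance registry_vip {
    state BACKUP
    interface eth0          # placeholder interface name
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24       # placeholder VIP
    }
    track_script {
        chk_registry
    }
}
```

If Zot stops responding on a node, keepalived drops out of the VRRP election there and the VIP moves to a healthy node.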

All together then:

[Diagram: zot, garage, and keepalived running on three nodes, with arrows indicating how they communicate with each other.]

For deployment, all of the above programs are static binaries and can be easily run using systemd. Another option could be Podman quadlets.345
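For instance, a minimal systemd unit for Zot might look like the following (the binary and config paths are assumptions; Garage and keepalived get analogous units):

```
[Unit]
Description=Zot container registry
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/zot serve /etc/zot/config.json
Restart=on-failure
User=zot

[Install]
WantedBy=multi-user.target
```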

There are more things you can add to this setup that would be useful in production (e.g. a reverse proxy to load balance between the nodes and do TLS termination), but that's the fundamental idea. In practice it works pretty well!

1

Which itself is actually fairly liberating through the magic of port forwards and SOCKS proxies.

2

MinIO is not one of them.

3

If doing this, be careful about running Podman and Kubernetes on the same node. Podman will try to set up its own firewall rules and they might conflict with what Kubernetes or your CNI does. I think Podman doesn't create firewall rules if you run the containers with host network mode, so that might work. I know for sure this does not work with Docker.

4

For the containerization purists out there, in theory you can avoid the firewall issues if you deploy these as static pods with kubeadm instead. However, this creates a bootstrapping loop with the other control plane pods, you won't be able to use your registry if you nuke your cluster to re-install it, and there's a strong chance a cluster problem could lead to cascading failures.

5

In other words, just use systemd. systemd is great! The kubelet and CRI use systemd! You can do it, I believe!