The following is an interview with Filip Viskovic of Improbable on how they used Anka to set up self-service creation and deployment of on-demand macOS CI environments for 100s of remote developers. Improbable is a company dedicated to providing services for extraordinary multiplayer games. They have been users of Anka since 2019.
Before you found Anka, do you remember what sort of problems you were experiencing with your build/test/ci/cd tooling and setup?
Improbable’s previous build system(s) and its problems
Our CI systems started as a collection of bare-metal and VM environments enrolled into an agent-based configuration management system. These environments were shared by all developers at the company with little in the way of access control and a lack of clear ownership between a CI agent and a team. Over time the number of fixes not captured in code prevented us from confidently rebuilding these environments. This became an operational headache as we scaled our technical needs. Our developers saw an increase in their CI outages which hindered them from working on their products.
It became obvious that this solution was not fit for a company with a rapidly growing developer population. As such, my team was formed to tackle our company-wide CI pain points. For Windows and Linux, we designed a set of systems that allow our end-users to create, deploy, and maintain their CI in the cloud. This is done by using ephemeral VM instances which we scale according to demand. The VM images are built via a pipeline that uses Packer and Ansible to capture a golden image. Our end users then use the id of this image to define in code where they want their pipeline to execute. This has been a great solution for us, as it allows our team of approximately 6 to provide a self-service CI platform to the 100s of our developers across the globe without the need to embed build engineers in each team. You can read more about our design goals and final implementation in our company blog post.
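As a rough illustration of what "defining in code where a pipeline executes" can look like, here is a minimal sketch. The `Step` schema, the image ids, and the commands are all hypothetical and are not Improbable's actual pipeline format; the point is only that each step names an immutable golden image by id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One CI step pinned to a golden image (hypothetical schema)."""
    name: str
    image_id: str  # id emitted by the image-build pipeline (made up here)
    command: str


def images_used(steps):
    """Return the distinct image ids a pipeline depends on, in first-use order."""
    seen = []
    for step in steps:
        if step.image_id not in seen:
            seen.append(step.image_id)
    return seen


# Illustrative pipeline; every id and command below is an assumption.
PIPELINE = [
    Step("build", "linux-builder-2021-03-01", "make build"),
    Step("test", "linux-builder-2021-03-01", "make test"),
    Step("package", "windows-builder-2021-02-15", "package.bat"),
]

if __name__ == "__main__":
    print(images_used(PIPELINE))
```

Because the image id lives alongside the pipeline definition, checking out an old revision also recovers the environment that revision was built in.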
We maintained the bare-metal style of management for our Mac agents until we were able to port all builds happening on the other two operating systems onto our new platform. For macOS support, we wanted to implement a similar solution, as close to our ideal as possible.
In short, we wanted to solve the following pain points:
- We had heaps of bare-metal agents, which we charitably described as inconsistent. Even though we had agent-based configuration management in place, it was not enough to ensure build environment consistency without additional complexity. It was also possible for configuration management to change the state of the agent while a build was running.
- Upgrading the OS in place on a bare-metal machine often causes the build environment to break in fun and mysterious ways. Xcode upgrades also fall under this category.
- Broad changes by our team to the agent baseline behaviour (git or perforce configuration, monitoring and logging, security, etc) required manual testing steps across all of our supported operating systems.
- We had a strong requirement to be able to reproduce historical builds and the CI environments used to build them.
- Managing Macs involves some additional tasks when building them into CI agents. Building a machine from scratch requires manually running the OS installer and completing the out-of-box setup process. Automation for these steps and other security authorisations, such as approval of system extensions, could only be achieved by using MDM workflows.
- The testing and building of new agents by engineers was bottlenecked by our ability to assign Ops capacity to wrangle hardware. During the global pandemic, this stressed our internal processes.
- On the flip side, we have several CI jobs that can only be run on Improbable owned and managed hardware. Our public certifications and other sensitive requirements prevent us from using the cloud for these jobs.
- The fragility of in-place OS upgrades on bare-metal machines runs against our need to stay current with macOS: we have to be on the latest version within ~6 months of its release. Again, our security compliance sets hard requirements on our developer environments.
Do you remember what originally drew your attention to Anka compared to other solutions?
So to help us achieve reproducible builds for macOS, we tested Anka. We spiked it for a couple of weeks and were able to implement an effective local workflow for testing Mac CI environments. It satisfied our initial requirements and provided some additional benefits:
- All the benefits of running workloads inside of an immutable VM along with Docker-like behaviour and layered images.
- By moving the builds to happen inside of a VM, the host’s only responsibility is running Anka. These thinly-provisioned hosts are simpler to maintain and could be built from any Mac hardware. We have the option of scaling onto the cloud if needed in the future.
- Anka has a Packer builder. Since we’re already using Packer for building Linux and Windows VMs, this means less bespoke code for us to maintain.
- Veertu also provides an image registry for the storage and distribution of VMs. We’re able to easily run this in our Kubernetes cluster with a persistent volume claim.
While testing Anka, what sort of surprises did you experience?
We took our on-premise Mac Minis and turned them into Anka-VM-nodes. The agreement between our team and our end users is that we are responsible for the availability of these nodes, while our end users are only concerned with the VM environments in which their CI jobs run. For building new VM images, we’ve integrated Anka into our pre-existing Packer/Ansible implementation, so our end users can use the existing pipelines and processes they are accustomed to. Since there are limitations to running macOS as true ephemeral machines, we’ve opted to keep our hosting on-premise for now.
We carefully re-architected our environments to avoid common mistakes and built these benefits into our base VM layer, which we call the OS layer. No credentials or other secrets are ever baked into a VM; they are instead dynamically injected by the Anka-node hosts. We try to remove or disable anything that may change the state of the VM when it’s in use. The VMs do not run with admin privileges. By using this “batteries-included” approach, we’ve also reduced the amount of context required by our end users to use macOS.
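The idea of injecting secrets at launch time rather than baking them into the image can be sketched as follows. This is an assumption-laden illustration, not a description of how the Anka-node hosts actually do it: the function name, the secret name `CI_API_TOKEN`, and the use of environment variables as the delivery mechanism are all hypothetical.

```python
def build_job_env(image_env, host_secrets, required):
    """Return the environment for one job run.

    image_env    -- variables baked into the VM image (must hold no secrets)
    host_secrets -- secrets available only on the host (hypothetical store)
    required     -- names of the secrets this job needs
    """
    missing = [name for name in required if name not in host_secrets]
    if missing:
        raise KeyError(f"host has no secret(s): {', '.join(missing)}")
    env = dict(image_env)  # copy, so the image's own env is never mutated
    env.update({name: host_secrets[name] for name in required})
    return env


if __name__ == "__main__":
    image_env = {"PATH": "/usr/bin:/bin"}
    host_secrets = {"CI_API_TOKEN": "example-token", "UNRELATED": "x"}
    print(sorted(build_job_env(image_env, host_secrets, ["CI_API_TOKEN"])))
```

Only the secrets a job asks for are passed in, and they exist for the lifetime of that run rather than in any stored image layer.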
Where is Improbable today
- For the first time, our users can remotely self-service the creation and deployment of macOS CI environments. Previously, the iteration loop for creating new Mac CI machines from scratch could take up to a day, and many iterations may be required to build a reliable agent. With Anka, we’ve been able to reduce this loop to ~25 minutes. The deployment of Anka images onto our nodes takes a few minutes and is further sped up by locally caching the OS layer of the image.
- The separation of the Anka VM and the Anka-node allows us to perform OS upgrades and testing separately from each other. Our end users get the choice of upgrading when they can, while our team can maintain security compliance on the nodes without drastically affecting capacity.
- We’ve been able to automate the manual testing of agent baseline behaviour by including macOS environment building within our matrix-build pipeline, reducing our manual testing needs by 10s of hours each quarter.
- Flaky CI environments can be almost instantly rolled back by restarting the VM from a specific image tag. The benefit of this feature is hard to estimate, but it eliminates many categories of debugging costs for our users.
- If needed, we’re able to double our capacity of macOS CI agents by running two VMs on a single node.
- Anka provides useful workarounds for automating the out-of-box experience and approving system extensions. We no longer need to maintain an MDM implementation for our build environments. The IT and security teams, who maintain our company MDM services, are no longer on the critical path for making changes in our environments.
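The tag-based rollback mentioned above comes down to choosing which image tag a VM is started from. The sketch below models only that selection logic; it does not call Anka or its registry, and the tag names and the `choose_tag` helper are hypothetical.

```python
def choose_tag(history, bad_tags):
    """Pick the newest tag that has not been marked bad.

    history  -- tags in publication order, oldest first
    bad_tags -- tags known to produce flaky or broken environments
    """
    for tag in reversed(history):
        if tag not in bad_tags:
            return tag
    raise LookupError("no known-good image tag available")


if __name__ == "__main__":
    history = ["os-layer-v1", "os-layer-v2", "os-layer-v3"]  # made-up tags
    # v3 turned out to be flaky, so new VMs start from v2 instead.
    print(choose_tag(history, {"os-layer-v3"}))
```

Since every tag refers to an immutable image, rolling back is a matter of starting the next VM from an older tag; nothing on the node has to be repaired in place.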
We’re currently porting all of our bare-metal environments into Anka and seeing some great results so far. I hope I was able to give you some insight into what we’ve built and the impact that we’ve seen.