
Armada: the decision to Open Source

5 October 2021
  • Software Engineering

We made the decision to open-source the Armada Project before we wrote a single line of code.

This was a completely new choice for G-Research. We have made a few projects public and open-source over the years (ParquetSharp and geras, for example), but we had always started those projects internally and moved them into the open when we felt they were ready. Others we inherited when their original maintainers no longer had the time or passion to contribute (consuldotnet, SparkMagic).

We had never, however, started a project as open-source from its inception. With Armada, we made a conscious decision to start completely open from the outset. I’d like to share our rationale for this decision and some of our learnings from along the way.

Two Approaches to an Open-Source Project

The main alternative to starting a project in the open is to work on it internally, bring it as close to perfection as possible, and open-source it later. There are pros and cons to each approach.

Open Sourcing Later

Pros

  • Get to confirm that your project works in production before trying to sell the public on it – everyone wants to be part of something that works and is good
  • You don’t have to share all your early mistakes and missteps with the world – everyone wants to look good, seemingly right from the start

Cons

  • You receive input from the community later in the process, possibly leading to a project that doesn’t apply to the needs of a more general population – nobody wants what you’ve built because it only solves your specific problem
  • You may inadvertently tie your project to some internal process or architecture, thereby requiring more work to release the project publicly – nobody wants what you’ve built because it only solves a problem internal to your systems
  • You may never get around to it. So many things, so little time – nobody wants what you’ve built because it doesn’t scratch the one specific itch that they have

Open-Source from the Start

Pros

  • By sharing the work publicly, you are encouraged to follow best practices from the start – everyone likes to be seen doing the right thing
  • It allows others to join your project, potentially providing valuable input into the design of the project or actual code contributions that accelerate the project’s progress – everyone likes to contribute in a meaningful way 
  • The design of the project is (hopefully) free of the constraints or dependencies of your company’s internal development process or infrastructure from the get-go – everyone who needs to be involved in the decision-making process is

Cons

  • Sharing the project too early may actually put some potential collaborators off the project – nobody wants to get behind a half-baked project.
  • G-Research is particularly sensitive to security concerns, so developing a project publicly costs us extra development time. We have an elaborate system of levers and pulleys to synchronise code from inside to outside and vice versa; the system is brittle, and a simple mistake can break the sync. Your own company may not have quite as onerous a process, but there is often some additional operational overhead associated with contributing to open-source projects – nobody wants to over-share things that they shouldn’t

There isn’t a right or wrong choice here. In some situations, we continue to prefer the “Open-Source Later” approach, particularly in situations where we feel that it is valuable to verify that our approach works in our production environment first. For example, we recently released and moved the Aerospike plugin for Vault to the Aerospike-community repository, but only after we felt confident that it was indeed working correctly for our own internal users.

The Rationale for Armada

The rationale for starting Armada in the open was rooted in the aims of the project itself.

We thought long and hard about creating yet another batch-processing tool. We worried that it was an increasingly crowded space and that someone else would eventually build something meeting the same requirements we had. Namely, we needed something to coordinate the fair-use scheduling of batch-processing jobs across Kubernetes clusters.

For us, the number of machines needed to satisfy our quants’ insatiable appetite for processing exceeds the recommended size of a single Kubernetes cluster – perhaps by an order of magnitude. The official guidance on large clusters is 5,000 nodes per cluster (https://kubernetes.io/docs/setup/best-practices/cluster-large/), although our experience suggested a lower number was a more practical threshold. We could push the limit by tweaking a lot of little settings, but it seemed like pushing the envelope in this dimension would ultimately lead to unexpected issues.

Another consideration was that we wanted the ability to perform cluster-wide upgrades without taking down the entire cluster. This, again, suggested multiple clusters running side-by-side which would allow us to selectively take a cluster offline for maintenance without affecting the overall running of the business.

While our rationale was directing us towards a multi-cluster setup, we also wanted to keep the interface to the system simple. We didn’t want to burden our users by requiring intimate knowledge of the infrastructure their jobs run on. Armada sits above all of our clusters and distributes batch-processing jobs to them intelligently. Users submit their job to Armada, and the complexity of where the job actually runs is abstracted away. Our hope is that this makes our users’ lives easier by providing a simple way to submit a batch job to the massive amount of infrastructure required to run all of G-Research’s jobs.
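
To make this concrete, below is a rough sketch of what submitting a job to Armada looks like from a user’s point of view. The field names and command are illustrative, based on Armada’s public examples at the time of writing, and may have changed since – consult the Armada documentation for the current schema.

```yaml
# jobs.yaml – an illustrative Armada job submission file.
# The user describes what to run (a standard Kubernetes pod spec) and which
# queue it belongs to; Armada decides which cluster the job actually lands on.
queue: my-team            # hypothetical queue name
jobSetId: nightly-batch   # hypothetical job set identifier
jobs:
  - priority: 1
    podSpec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          args: ["sleep", "60s"]
          resources:
            requests:
              cpu: 150m
              memory: 64Mi
            limits:
              cpu: 150m
              memory: 64Mi
```

Submitting it is then a single command along the lines of `armadactl submit jobs.yaml` – note that nothing in the file names a specific cluster.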

Open From the Start

With all this in mind, we still could have developed the code in-house first and then open-sourced it later. Our decision to start as an open-source project from the beginning was really driven by three points:

  1. We hoped a coalition of willing contributors would be able to help design and build a better tool than if we built it by ourselves. We knew other organisations suffered from the same problems we did and were working on their own ways of addressing the same concerns
  2. Often, when a useful tool is built in-house, someone in the open-source world has the same idea. With more contributors behind it, the open-source tool frequently eclipses our internal efforts and we are forced to switch. If we put our initial efforts towards the eventual open-source tool, we set up the possibility that we don’t have to pay that migration cost
  3. Given the core position that this technology would take in our overall infrastructure, the temptation to build something bespoke would be too great if we started in-house. To force ourselves to generalise the architecture, we needed to start in the open to avoid falling into this trap

Collaboration from the Get-Go

We benefit from having more input into the product. We believe that a diversity of ideas, viewpoints and methodologies is one of the strengths of open-source software and will make our own software stronger if we adhere to the same principles.

Before the project even started, we spent a lot of time talking to our peers and confirming that this could be a useful project. We reached out to people who had built other versions of batch-scheduling capabilities into Kubernetes, including k8s-batch and Volcano. We eventually all sat down and collaborated on a design document outlining our shared, overarching goals for bringing batch processing to Kubernetes natively.

Among the group were representatives from research institutions such as CERN and the University of Michigan; corporate representatives from Google, Amazon, Huawei, VMWare and JD.com; and various CNCF core committers and some Apache PMC members. We also presented our work to the CNCF End-User group that we are a part of, to see if the issues we were trying to solve were relevant for anyone in that audience.

By soliciting all of this feedback, we grew increasingly assured that what we were trying to do – batch processing across multiple Kubernetes clusters – was a concern for other parties. Others shared our need and this gave us the encouragement to try and fill it. 

If others suffered the same problems, there might be others willing to help build the solution with us. From our initial conversations with this group, there was clearly a willingness to give input and advice and encouragement. All of this helped convince us that we should undertake the project and initiate it in the open.

Avoid Building and Switching

We hoped that we could avoid building a proprietary system in-house and then migrating to an open-source alternative later. We thought that if we managed to build something that was useful to us and others, it had a chance of being the thing we would use for years to come.

From another angle, if our project was truly useful, something similar would almost certainly be developed at some point in the open-source world. Because the Kubernetes ecosystem develops so fast, if the open-source alternative caught on it would quickly eclipse whatever we had built internally. Eventually, we would be forced to choose between maintaining an in-house tool with fewer features than the open-source alternative or migrating away from what we had built. That was a decision we didn’t want to have to make.

It’s still early days for the Armada Project. We hope our solution will catch on, and we encourage this by working out in the open – in full view of a community of collaborators and contributors – ensuring that others are engaged with our development process and understand the key design decisions at every step.
