Technology 25/08/2020

Sparking a new level of scale

Written by Mitul Bhakta, Senior Software Engineer at G-Research.

As a developer at G-Research you are always challenged to answer the question “What about the 10x?” Often the answer may be “out of scope”, but the key point is that you consider it.

In 2017 we began work to replace one of our main processing frameworks and, especially given the extreme market volatility seen in 2020, it has become a great example of where we have benefited from thinking about scale.

Rewind To 2017

To process market simulations we relied on a C#-based distributed computing framework built in-house over 10 years earlier. The framework was responsible for aggregating terabytes of data into various sets of time series, which would ultimately be visualised on a UI. A lot, however, has changed in 10 years. G-Research is no longer just a .NET shop, there has been a boom in open source software and, most importantly for us, the platform was struggling with 1x of our use case, let alone 10x! It was decided that we were going to ditch the entire system and replace it with… Apache Spark.

The Big Rewrite

So here is what we had:

  • A distributed compute framework written in house and in C#.
  • At least 10 years’ worth of application code written in C#. (Just think of the number of bugs you fixed by fixing bugs you fixed by fixing…you get the idea)
  • Terabytes of data stored as protobuf on file shares.
  • A team of almost exclusively C# developers.

And here is where we decided that we wanted to be:

  • Leverage the open source Apache Spark platform for distributing work.
  • Rewrite all the business logic in Scala so that we could use Spark natively.
  • Backfill historical data into HDFS and use it to store all data going forward.
  • Do this whilst having no Scala experience.

Rewriting an entire legacy system with minimal tests into not only a new framework but also a new programming language, neither of which we had any experience with? We sure like a challenge at G-Research!

Off To A Great Start

A few weeks into the project, I was up and running with Spark, Scala no longer looked like a foreign language, and I was ready to get going.

At first glance, the existing code base seemed incredibly complex and difficult to read, mainly because it had been optimised in a number of weird and wonderful ways to eke out every last bit of performance. Luckily for me, extracting the initial business logic was fairly straightforward, and so I was quickly able to translate thousands of lines of .NET code into only a handful of Spark transformations. These transformations formed part of a pipeline architecture in which each transformation is stateless and each pipeline can be split and merged to aid debugging.
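The pipeline idea can be sketched with plain Scala functions (all names here, such as SimRecord and the individual stages, are illustrative rather than taken from the real codebase):

```scala
// A hypothetical sketch of stateless pipeline stages: each stage is a
// pure function over a batch of records, so stages can be run, tested,
// split, and merged independently.
case class SimRecord(series: String, timestamp: Long, value: Double)

object Pipeline {
  type Stage = Seq[SimRecord] => Seq[SimRecord]

  // Drop records with invalid values.
  val dropInvalid: Stage = _.filter(r => !r.value.isNaN)

  // Rescale values (e.g. fractions to percentages).
  val scale: Stage = _.map(r => r.copy(value = r.value * 100))

  // Merging stages is just function composition, applied left to right.
  def merge(stages: Stage*): Stage = stages.reduce(_ andThen _)
}
```

Because every stage is stateless, debugging a bad output is a matter of running each stage in isolation on the same input and inspecting where the values diverge.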

But What’s the Catch?

The initial implementation was elegant, simple, and worked really well when run locally with a subset of data. Unfortunately life is less straightforward, and running on the cluster with even ¾ of the representative data simply didn’t work. I had, of course, neglected one of the main considerations when processing ‘big data’: how and in what shape to store your data.

Reading books and searching Stack Overflow helped to overcome some of the issues. (Top tip: make sure to consider the size of your files on HDFS in relation to the HDFS block size!) Other issues, however, came down to pure trial and error (I tried a good 10 or 12 different partitioning strategies). Once I was confident in the data layout, and after a few more code optimisations (think broadcasting data, avoiding shuffles and spills, and tweaking executor and driver memory) and weeks of painstaking data reconciliation, the application was ready to deploy.
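To illustrate the block-size tip: one common approach is to pick a repartition count so that each output file lands close to the HDFS block size. This is a minimal sketch, assuming a 128 MB block size; the helper name and the figure are illustrative, not our production values:

```scala
// Illustrative helper: choose a partition count so that output files
// written to HDFS are roughly one block each, avoiding both the
// many-small-files problem and oversized multi-block files.
object Partitioning {
  val BlockSizeBytes: Long = 128L * 1024 * 1024 // common HDFS default

  def targetPartitions(datasetBytes: Long,
                       blockSize: Long = BlockSizeBytes): Int =
    math.max(1, math.ceil(datasetBytes.toDouble / blockSize).toInt)
}
```

In Spark this would feed something like `df.repartition(Partitioning.targetPartitions(estimatedBytes))` before the write, where the dataset size estimate itself is the hard part in practice.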

Don’t Forget About Deployment

Now it was time to productionise. In the big data world, disks fail, executors fail and, in general, things have a propensity to just fail. At the time, job orchestration technologies such as Apache Airflow were still not mature enough for the company’s liking, and what we did have (Apache Oozie or cron) didn’t give us data dependencies or complex job dependencies. With time ticking away, I decided to spend some 10% time on rolling my own.

From 10% To 100%

Using Kafka as the backing engine, I built a job processing framework whereby producers could place job descriptor objects onto various Kafka topics. Then, by leveraging the exactly-once processing guarantees of the Kafka Streams API, multiple consumers would use these descriptors to submit and track various Spark jobs. It gave us exactly what we needed and meant that, as far as our pipeline was concerned, the time at which a job started was irrelevant. All you needed to do was put the right job descriptor onto the right Kafka topic and you were done. After a few weeks of developing this on the side and painfully trying out some of the open source options, I decided to roll the dice and use this 10% project in production. It was a success, and we still use it to this day!
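The core idea can be sketched without any real Kafka machinery: producers publish descriptors, and consumers deduplicate by job id so that a redelivered message is never submitted twice. Everything here (the in-memory stand-in for the topic, the field names) is an assumption for illustration:

```scala
// Hedged sketch of the job framework's contract: a descriptor fully
// identifies a Spark job, and consumption is idempotent on jobId, so
// message redelivery is safe.
import scala.collection.mutable

case class JobDescriptor(jobId: String, mainClass: String, args: Seq[String])

class JobConsumer(submit: JobDescriptor => Unit) {
  private val seen = mutable.Set.empty[String]

  // Returns true if the job was submitted, false if it was a duplicate.
  def onMessage(d: JobDescriptor): Boolean =
    if (seen.add(d.jobId)) { submit(d); true } else false
}
```

In the real system the `submit` callback would hand the descriptor to Spark (e.g. via spark-submit) and the dedup state would live in Kafka offsets rather than an in-memory set, but the decoupling is the same: producers never need to know when, or by whom, a job will be run.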

Getting Data Out

Last but not least, we needed to figure out how to make use of it all. The ultimate goal for this data was to visualise, plot, and compare it across multiple horizons via a UI. The aggregated results were stored in SQL Server, and all of our services aggressively cached data at various layers, because using a relational database to do relational queries was, in our case, still really slow. (A few months ago we ended up migrating these giant single-instance Windows services onto Kubernetes, which in itself brought a whole bunch of challenges!)

Given the aggressive caching, query performance wasn’t much of a concern; querying in a relational manner, however, was important. Apache Hive, and in particular Hive LLAP, was the obvious choice. It required very little code change in our services, as the SQL did not differ much from what was already being used. The problem, however, was reliability. Our LLAP instance would sometimes fail to execute queries, the services would sometimes crash or, even worse, a query would block and time out forever! For years we had been blessed with the stability of SQL Server and so had never thought we would face issues with the big data equivalent. In hindsight that was rather naïve, but thanks to the incredibly hard work of our big data engineering team, LLAP finally became stable and, just over a year after it first started, the Spark rewrite was complete.
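One standard way to live with a flaky query engine while its stability improves is to bound every query with a timeout and a retry budget. This is a generic sketch of that pattern, not our production code; the names and the retry counts are assumptions:

```scala
// Illustrative resilience wrapper: run a query on a separate thread,
// give up on it after a timeout, and retry a bounded number of times.
// A hung query then costs a bounded delay instead of blocking forever.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

object Resilient {
  def withRetries[A](attempts: Int, timeout: Duration)(query: => A): Try[A] = {
    val result = Try(Await.result(Future(query), timeout))
    if (result.isSuccess || attempts <= 1) result
    else withRetries(attempts - 1, timeout)(query)
  }
}
```

Note the caveat: a timed-out query may still be running server-side, so this only protects the caller; the aggressive caching described above is what kept the retried load tolerable.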

The Results

The results were even better than we could have imagined. Previously it would take anything from 13 to 36 hours to process a set of simulations, and on average we would run 10 a week. Today, on Spark, the number of sets we process has increased fourfold and yet, on average, each takes only 30 minutes to process! Not only has the performance increase been phenomenal, but having all the raw data in an easy-to-access format has enabled our researchers to run their own notebooks and answer questions that were simply impossible to answer two years ago.

Was It Worth It?

It was a lot of hard work, involving many painful days debugging Spark exceptions and reconciling values, but it was worth it. We came out the other side with a great product that our users love, a system that has proven to scale with the ever-changing demands of the financial markets and, on a more selfish note, I got to play around with and learn a great piece of technology. What more can you ask for as a developer?
