Deep dives in debugging: when it really isn't your fault

Deep dives in debugging: when it really isn’t your fault

7 January 2020

Technology

An oft-repeated maxim in programming is “if you think it’s a compiler bug, it’s not a compiler bug”. For those of us working in managed languages, a corollary might be “if you think it’s a runtime bug, it’s not a runtime bug”. Time spent trying to pin blame on these widely-used tools presents a poor value proposition when so many things within your control are capable of going wrong. Sometimes, however, you’re faced with an issue that really makes you challenge this. We recently encountered such an issue and are excited to share the story, as well as the lessons we learned along the way.

The incident

Here at G-Research, our researchers use a large compute farm on a daily basis. They are very demanding customers and the volume of jobs they submit is extremely high. The farm is split into two sections – the CPU farm and the GPU farm. When users submit jobs, they specify whether or not their jobs will benefit from GPU acceleration, and our farm scheduler matches the job with the appropriate nodes.

With the farm being as heavily-used as it is, maintaining the nodes is a common operational task and usually happens smoothly in the background without users noticing.

However, this time we received complaints almost as soon as the first nodes landed back in the farm after a routine monthly upgrade. Our users started reporting that some (but not all) of their jobs on those nodes were hung, and making no progress. We quickly obtained stack traces for some of the affected jobs, and noted that they all seemed to be stuck waiting for an asynchronous POST request to an internal HTTP service to finish.

The gut reaction was to check for issues with that service. However, all instances seemed to be alive and were serving requests – in particular, there had been no errors logged since users started seeing problems. A client-side hang while using an asynchronous HTTP client suggested a possible deadlock issue, but the code was very simple and it was hard to believe that a threading issue had just popped up out of nowhere.

With user complaints coming in thick and fast, we realised that all the hanging jobs were on nodes with GPUs. This alone was not enough to identify a root cause, but it did make us wonder if anything had changed with the GPU farm in the past hour or so. A quick phone call with our infrastructure team confirmed that a software upgrade was currently being rolled out.

We now had to confirm that all of the jobs with problems had been running on freshly-upgraded nodes. Fortunately, we log a lot of information about each farm job to a database, so this was simply a case of constructing a SQL query that gave us a list of all nodes that had hung jobs.

When the list came back, it looked remarkably similar to the list of nodes that had just been upgraded! So, we pulled all the upgraded nodes from the farm, halted the upgrade process, and told our users to resubmit their jobs. Our researchers were now happy and their jobs were progressing as normal.

Narrowing down the issue

We were not done yet. Now there were a significant number of very expensive machines sitting idle, unable to accept new jobs. We considered rolling the nodes back to how they were pre-upgrade – indeed, the opportunity-cost of not doing so was extremely high.

On the other hand, we needed to identify the root cause of the problem. We couldn’t just decide to cancel the upgrade. We chose to treat the upgraded nodes as an expensive private testing farm which we could use to investigate this issue. This turned out to be an excellent decision – since the chance of a job hanging was relatively low, we needed to run thousands of jobs in order to have any statistical confidence in our testing. Technically, you only need a single node to do this, but minimising the iteration time is very important in an open-ended investigation.

Our first step was, of course, to try and reproduce the issue. We ran thousands of jobs and observed that many of them hung while trying to make HTTP requests to any one of a variety of different services. Some of the jobs even crashed with incredibly suspicious errors – AccessViolationExceptions and IndexOutOfRangeExceptions deep in .NET code, among others.

This piqued our interest – were we seeing memory corruption issues? In any case, we had reproduced the problem, so we began trying to narrow down the cause.

We turned our attention to the list of software updates that were included in the latest farm node image, and one stuck out like a sore thumb – a .NET Framework upgrade from 4.7 to 4.8. We took one of our broken testing nodes, uninstalled 4.8 and installed 4.7 – now we had a single node with an image that differed from the broken nodes on a single axis. A few thousand jobs later, we were convinced that this node was behaving normally; and thus we had confirmed that the .NET Framework upgrade had broken our applications in some way.

Remember how I mentioned our farm is split into two sections? The CPU farm had also had .NET Framework 4.8 rolled out to its nodes, but we hadn’t seen any problematic jobs running on its CPU nodes.

This was confusing – could the presence of a GPU be causing bizarre hangs in the .NET HTTP client? We put this thought to one side for a moment and took a look at how our applications behave differently when they detect a GPU. The major difference was in initialisation code, where there are two additional steps if a GPU is present: initialise Alea (a GPU acceleration library for .NET), and modify the CPU affinity of the process to allow it to run on specific cores. Having only two parameters to vary made it very easy to further narrow things down by creating modified versions of our application that skipped various initialisation steps.

After several thousand more test jobs, it became clear that the CPU affinity modification running on .NET Framework 4.8 led to jobs hanging or crashing with suspected memory corruption.

You’re probably wondering why we do this processor affinity modification in the first place, given that we are operating a compute farm environment where users are requesting the exact number of CPU cores that they need. Most jobs running on our GPU farm only request a single CPU core. This means that the processor affinity of the process is set to a single core when it starts up. For performance reasons, we would ideally like that core to be located in the NUMA node (i.e. CPU socket) closest to the GPU that has been assigned to the job. However, our farm scheduler isn’t capable of understanding this when choosing which core a job gets. To work around this, our GPU initialisation logic unpins the process to allow it to run on all the cores of that NUMA node. We can get away with this because CPUs are typically not the bottleneck in our GPU jobs, and we’ve extensively tested that this is a sensible thing to do and leads to tangible performance gains.

Creating a standalone program to reproduce the issue

At this point, we were pretty sure there was a bug introduced somewhere in between .NET Framework 4.7 and 4.8 that breaks our unpinning code. Time to talk to Microsoft! Of course, .NET has been open source for quite some time, and it wouldn’t be a good idea to start dumping our proprietary code into GitHub issues. We needed to create the smallest possible standalone program that still reproduced the fault. The first iteration of this looked something like the following:

public static int Main()
{
    using (var process = Process.GetCurrentProcess())
    {
        Console.WriteLine($"Starting affinity: {process.ProcessorAffinity.ToInt64():X16}");
        process.ProcessorAffinity = new IntPtr(0x0000_0000_FFFF_FFFF); // We're running on 32-core farm nodes
        Console.WriteLine($"New affinity: {process.ProcessorAffinity.ToInt64():X16}");

        using (var client = new HttpClient())
        {
            Console.WriteLine("Making an HTTP request...");
            var result = client.GetAsync("http://someinternalservice/health/ready").Result;
            Console.WriteLine($"Status code: {result.StatusCode}");
        }
    }

    return 0;
}

We ran this on our farm nodes, and some of the jobs had console output that looked like:

Starting affinity: 0000000000000002
New affinity: 00000000FFFFFFFF
Making an HTTP request...

(never completes)

That’s the hang! Now we needed to try and reproduce this hang on our workstations. This required a little bit of extra code to simulate the setup that we have on the farm nodes whereby the process is initially pinned to a single core and then unpins itself.

We added a small launcher program that sets its processor affinity to a single core, then spawns a subprocess which runs the code above (and does this in a loop since the issue isn’t reproducible every single time). This worked nicely – the hang was typically reproduced within a couple of iterations, and as an added bonus we were still seeing the occasional memory corruption. At this point the testing farm was no longer needed, so we used our automated build processes to downgrade .NET Framework on the nodes and dropped them back into the main farm.

With the time pressure now off, we refined the reproduction code. Was an HTTP request required to reproduce, or might something simpler also trigger the problem? The HttpClient class in .NET provides a whole bunch of async methods, so how about we try just kicking off some Tasks? As it turned out, replacing the HttpClient.GetAsync() call with a Task.Run(() => Thread.Sleep(10)) still left us able to reproduce the problem.

Sounds like the bug might be somewhere in the thread pool. Indeed, this theory was supported by many of the strange memory corruption stack traces we bumped into:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
  at System.Threading.ThreadPoolWorkQueue.Dequeue(ThreadPoolWorkQueueThreadLocals tl, IThreadPoolWorkItem& callback, Boolean& missedSteal)
  at System.Threading.ThreadPoolWorkQueue.Dispatch()
  ...

Our small reproducible example led to some interesting questions about concurrency within .NET. What if there was a concurrency bug in the .NET garbage collector? For those who are unfamiliar with garbage collection in .NET, there are some good articles available on the Microsoft website: fundamentals and performance. Two particular configuration parameters are interesting:

gcConcurrent – Controls concurrent/background garbage collection. Set to ‘enabled’ by default.
gcServer – Switches between workstation and server garbage collection. Server garbage collection is intended for high throughput applications, and can either be non-concurrent or background.

Varying these parameters did not impact the reproducibility of the issue. Interestingly, we later learned that server garbage collection is never used in a single-processor environment, regardless of the configuration parameters. The GC (garbage collection) idea was good but didn’t seem to bear fruit – what about more general concurrency issues?

Recall that the problem occurs when unpinning a process from a single CPU core to multiple CPU cores. What if there was code within the .NET Framework that made an assumption that processor affinity would never change? What if, based on that assumption, it uses routines optimised for single-processor setups? What if those routines break in a multiprocessor scenario because they’re not using atomic operations?

While confirming this ourselves would have taken a significant amount of time (the .NET codebase is massive!), there was a simple test to try: change our launcher program to set its processor affinity to two cores (rather than one). By doing this, we’re still increasing the number of available cores for our child process, but we’re explicitly starting in a multiprocessor environment. As it turned out, our hunch was correct – we were unable to reproduce the issue!

The final thing we wanted to know was how widespread this bug was. We had proven that it was introduced in .NET Framework 4.8, but what about .NET Core? We’re currently going through a .NET Core migration process, and using our unpinning logic in a .NET Core program would have to happen in the not-too-distant future. We made a couple of quick modifications to our testing program in order to multitarget it for .NET Core 2.1 and 3.0. We could still reproduce the fault. Up until this point, we had been running entirely on Windows. We tried to reproduce on Ubuntu 18.04 but we were unable to – this suggested that the issue was probably buried in the platform-specific internals of the Common Language Runtime (CLR).

To sum up our findings:

When a process runs with an initial CPU affinity of one core, then unpins itself to multiple cores, then runs asynchronous Tasks, there is a chance of memory corruption occurring.
This memory corruption often presents as deadlock, but sometimes bizarre exceptions are thrown from deep within the framework.
The problem has been confirmed to occur with .NET Framework 4.8, .NET Core 2.1, and .NET Core 3.0.
We have a small test program which reliably reproduces the issue within a short period of time.
The gcConcurrent and gcServer settings don’t seem to affect reproducibility.

Reporting the bug to Microsoft

We definitely had more than enough material to put together a good bug report. At this point, we felt that the most productive move was to escalate to Microsoft by opening an issue on the official .NET Core CLR GitHub repository. You can find the ticket here along with the source code for the program reproducing the issue (also available as a Gist). The posted issue was picked up quickly. Within a very short amount of time, the root cause was identified.

As it turned out, our hunches were partially correct – the bug did in fact lie around the garbage collector. It was due to an erroneous assumption that CPU affinity would never increase from one core to multiple cores. The turnaround from Microsoft was excellent and a pull request was raised, reviewed, and merged in short order. The fix has been marked for release in .NET Core 3.1 and the January 2020 update for .NET Framework.

Epilogue

The investigation on our part was intense and difficult at times. It certainly took time to isolate the triggering application code and reduce it to a reasonable size, but this was a great experience and we learned a lot along the way.

We identified parts of our farm node upgrade process that could be improved – we needed to make better use of our ‘canary’ nodes to try and catch bugs before they were noticed on a wider scale. (Of course, this is non-trivial when running user-supplied code that typically has a noisy failure rate). The interaction with Microsoft drove home the importance of putting together a good bug report when interacting with another team – internal or external. We are glad to be able to share a glimpse of one of the many challenging problems we have encountered and ultimately solved in the Quant Platform group here at G-Research.

Sean Saville – Software Engineer, Quant Platform

If you have enjoyed reading this, or think you can do a better job than we did, we are hiring!