06/10/2021

Taking out the garbage – fixing a garbage collection issue for Docker on .NET

Marcin Krystianc, Senior Software Developer
Robert Smith, Software Engineer

The problem 

Historically, some of G-Research’s applications have been built on .NET and Windows. However, with the arrival of .NET Core and its excellent Linux support, G-Research began moving these applications to Kubernetes on Linux. 

Most .NET applications show similar performance on Windows and Linux. However, one of our batch applications took around 40% longer to run on Linux compared to Windows – this particular application makes heavy use of the file system. 

Looking at memory usage over time (with a 36GiB memory limit), we can see an interesting difference between the Windows and Linux runs, as shown in the table below.

Minutes since Start   Windows Memory Usage (MiB)   Linux Memory Usage (MiB)
10                    6976                         36000
20                    10200                        36000
30                    13919                        36000
40                    14065                        36000
50                    15422                        36000
60                    15422                        36000
70                    15422                        36000
80                    15422                        36000
90                    15422                        36000
100                   15422                        36000
110                   15422                        36000
120                   16984                        36000
130                   -                            36000
140                   -                            36000
150                   -                            36000
160                   -                            36000
170                   -                            36000

 

Windows memory usage looks sensible, but on Linux the application is always at the memory limit, and yet never fails due to lack of memory. What is going on? 

Modern operating systems use spare memory to keep a cache of files on disk. This is known as the file cache or page cache. File cache greatly improves performance as RAM is much faster than disk. File cache has minimal downside since, if memory is needed for something else, the file cache can simply be reclaimed. 

On Windows, we limit memory usage via a Windows Job Object. Job Object memory limits (and memory usage figures as shown in the table above) do not include the file cache. 

On Linux, Kubernetes uses Docker to run our application. Docker limits memory usage via a cgroup. Cgroup memory limits (and usage figures) include the file cache. 

Since our application makes heavy use of disk, Linux quite reasonably uses all spare memory up to the cgroup limit as file cache. This means that, with a container limit of 36GiB, the cgroup will report a memory usage of 36GiB most of the time (even though a lot of this is only file cache). 

This may sound like a harmless quirk of cgroups, but it has a seriously negative effect – with .NET 5, the .NET garbage collector (GC) mistakenly thinks it’s about to run out of memory (even though Linux will reclaim file cache as needed). When it thinks it’s short of memory, the GC behaves much more aggressively (see the .NET garbage collection documentation), which significantly slows down the application. 

This effect leads to a ridiculous experimental result: containers with the same memory limit, running on machines with different amounts of free RAM, give very different performance results.

Free RAM at Start   Application Run Time (s)   G2 Garbage Collection Count
~700GB              10174                      823
39GB                10421                      812
27GB                5428                       105

A crazy experimental result: Linux/Docker performance doubles when the machine is low on memory! 

The table above shows time taken by our application running in a Docker container with a 36GB limit. Reducing free RAM on the machine below 36GB almost doubles the performance of our application! What is going on? 

When there is >36GB RAM free, file cache is limited by the 36GB container limit. Container memory usage will equal the limit, and the GC will think it’s out of memory. 

When there is <36GB RAM free, file cache is limited by available RAM. Container memory usage will be below the 36GB limit, and the GC won’t think it’s out of memory. Compared to the >36GB free case, runtime almost halves (from over 10,000s to 5428s), and G2 garbage collections fall roughly eightfold (from over 800 to 105). 
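The two cases above can be sketched as a toy model in Python. This is only an illustration of the reasoning, not a measurement tool: it assumes the application's disk activity can fill any amount of file cache, and it assumes a GC high-memory threshold of 90% (the real threshold is configurable). The function names are made up for this example.

```python
# Toy model of the two cases described above; not a measurement tool.
# Assumption: disk activity is heavy enough to fill any available file cache.

CONTAINER_LIMIT_GB = 36    # container memory limit from the article
HIGH_MEMORY_PERCENT = 90   # assumed GC threshold; the real value is configurable


def reported_usage_gb(free_ram_gb: float, limit_gb: float = CONTAINER_LIMIT_GB) -> float:
    """cgroup-reported usage (application memory plus file cache), in GB."""
    # File cache grows until it hits either the container limit or the
    # RAM actually free on the host, whichever is smaller.
    return min(free_ram_gb, limit_gb)


def gc_sees_high_memory(free_ram_gb: float) -> bool:
    """True if the reported memory load reaches the assumed GC threshold."""
    load_percent = 100 * reported_usage_gb(free_ram_gb) / CONTAINER_LIMIT_GB
    return load_percent >= HIGH_MEMORY_PERCENT


for free_ram_gb in (700, 39, 27):  # free-RAM values from the table above
    print(free_ram_gb, reported_usage_gb(free_ram_gb), gc_sees_high_memory(free_ram_gb))
```

In this model the 700GB and 39GB hosts both report usage at the 36GB limit, while the 27GB host reports 75% of the limit and stays under the assumed threshold, matching the pattern in the table.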

The solution 

So what have we found? You can control the point at which the GC enters aggressive/slow mode via the System.GC.HighMemoryPercent parameter in runtimeconfig.json. 

In theory, setting this to 100% should stop the GC from ever entering aggressive mode. But, in practice, this makes no difference because HighMemoryPercent is capped at 99% in the GC code. So, having failed to find a workaround, we set about modifying the .NET GC. 

We reported the issue on the dotnet/runtime repository on GitHub, describing what we knew about the problem so far.

Intending to file a good bug report, we wanted to include information on how to reproduce the bug. We planned to show repro steps simple enough for anyone to follow. We had evidence that it was possible to double the performance of our closed-source application by reducing available memory on the machine. However, we could not provide an example using our closed-source app, as it relies on our internal infrastructure and is commercially sensitive.

Thus, we needed to write a new app that could reproduce the same problem and be shared publicly. For this purpose we created a test application which:

– Periodically allocated some memory to put memory pressure on the .NET GC

– Performed some file operations to populate the file cache 

– Printed GC stats every second so we could spot the problem

Each behaviour of the test application was configurable via command line arguments, so it was easy to experiment with different scenarios.

Once we had implemented the test application, our next goal was to find a scenario in which adding I/O pressure to the Docker container increased the frequency of garbage collections in our application.

Surprisingly, the problem wasn’t easy to reproduce. It took at least a few dozen trials to discover the right combination of input arguments and repro steps, but we finally found it. After posting a comment with detailed repro steps, the bug report was complete. Now it was time to find the problem in the implementation of the .NET GC and fix it.

As mentioned before, the problem was related to the cgroup mechanism and the fact that it reports memory usage including the file cache. By looking at the GC code we could see that, in the case of a .NET application running inside a container on Linux, the memory load value was being read from the memory.usage_in_bytes file.

To find out more about this file we reached for the documentation, which pointed us to another file called memory.stat. The difference between memory.usage_in_bytes and memory.stat is that the former stores only a single value, whereas the latter is more granular and stores separate values for each memory usage type (“cache”, “rss”, “swap”, etc.).
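As an illustration of the difference between the two files, here is a minimal Python sketch that parses text in the memory.stat layout. The sample values are made up for this example, and the directory path is an assumption about where cgroup v1 memory files are typically mounted inside a container; it may differ on your system.

```python
from pathlib import Path

# Assumed cgroup v1 location inside the container; it may differ on your system.
CGROUP_MEMORY_DIR = Path("/sys/fs/cgroup/memory")


def parse_memory_stat(text: str) -> dict[str, int]:
    """Turn the 'name value' lines of memory.stat into a dict of byte counts."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_cgroup_v1(directory: Path = CGROUP_MEMORY_DIR) -> tuple[int, dict[str, int]]:
    """Read the single usage figure and the per-type breakdown from a container."""
    usage = int((directory / "memory.usage_in_bytes").read_text())
    stats = parse_memory_stat((directory / "memory.stat").read_text())
    return usage, stats


# Made-up sample in the memory.stat layout, not output from a real container.
SAMPLE_MEMORY_STAT = """\
cache 22548578304
rss 16106127360
swap 0
"""

sample = parse_memory_stat(SAMPLE_MEMORY_STAT)
print(sample["cache"], sample["rss"], sample["swap"])
```

Inside a container that uses cgroup v1, calling read_cgroup_v1() should return the single usage figure alongside the breakdown, which makes it easy to see how much of the reported usage is file cache.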

At this point, we were getting close to the solution. Now we had to decide which fields from the new file should be used for the purpose of memory load calculation.

Unfortunately, the documentation of these fields turned out to be very limited, so this wasn’t a straightforward task – some of these fields are independent, whereas others are aggregates of other fields. We also needed to ensure that we correctly supported scenarios where, for example, application memory is swapped out, or memory is not swappable at all because it is used as a ramdisk. Finally, after several ‘try and test’ cycles, and with some help from the community, we came up with a formula on which we could base the fix.
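To show why the choice of fields matters, here is a small Python sketch that computes a memory load percentage from a chosen set of memory.stat fields. It is not the formula from the merged patch, which this article does not spell out: the function name, the sample values and the field selections are assumptions made purely for illustration.

```python
GIB = 1024 ** 3


def memory_load_percent(stats: dict[str, int], limit_bytes: int,
                        fields: tuple[str, ...]) -> float:
    """Memory load as a percentage of the limit, counting only the given fields."""
    used = sum(stats.get(name, 0) for name in fields)
    return 100 * used / limit_bytes


# Made-up figures: 15GiB of application memory and 21GiB of file cache
# inside a container with a 36GiB limit.
stats = {"cache": 21 * GIB, "rss": 15 * GIB, "swap": 0}
limit = 36 * GIB

# Counting the file cache puts the container exactly at its limit.
with_cache = memory_load_percent(stats, limit, ("cache", "rss"))

# Illustrative selection only; NOT the field set used by the real fix.
without_cache = memory_load_percent(stats, limit, ("rss",))

print(f"{with_cache:.1f}% with cache, {without_cache:.1f}% without")
```

Simply dropping the cache figure, as in the second call, is too crude for a real fix: as noted above, memory that cannot be reclaimed (such as a ramdisk) still has to be counted, which is why the final field selection needed several iterations.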

Our patch has been merged into the main branch and will be included in the next preview release of .NET 6 (6.0.100-preview.5). We have also submitted a servicing patch for .NET 5, which has been accepted and will be available in an upcoming patch release.

Conclusions

This bug has existed since .NET Core 2.1.5 (released in October 2018) and was reported by us in March 2021, which means that it went unnoticed for roughly 30 months! So why was a bug that can cause a 2x slowdown in a .NET application not reported before? 

There are several potential reasons. First of all, we think that we were simply lucky – as we were porting existing applications from Windows to Linux, we already had a baseline for comparison. Furthermore, for some of our applications the performance impact was very significant but not easy to explain, so it encouraged us to find the root cause. 

It may also be that this bug simply didn’t affect very many people. We don’t really know that for sure though – we can only speculate. In our case, not all our .NET applications running inside Linux containers were affected by it and it wasn’t easy to reproduce outside the production environment. So, the problem might not be that common in the outside world.

However, this fix has had a very positive impact for us at G-Research, and hopefully our contributions will continue to help others in the community.
