Introducing ParquetSharp

23 August 2018

Software Engineering

We are thrilled to announce a new open-source library from G-Research: ParquetSharp. ParquetSharp is a .NET library for reading and writing Apache Parquet files.

Give it a go and tell us what you think: https://github.com/G-Research/ParquetSharp

Why Parquet?

We were looking for a replacement for both the CSV files and the proprietary file formats that our teams and services use to exchange and store information. As our data requirements keep growing, we focused on a solution that would offer efficient storage, fast reading and writing, good multi-platform availability, and ease of integration.

We looked at a few alternatives to CSV, namely Arrow/Feather, HDF5 and Parquet.

CSV

Comma-Separated Values (CSV) is a common row-oriented file format for storing and exchanging data. It’s relatively easy to parse and even easier to write, as long as you don’t have too much data to handle. Its biggest advantage is that the data is human-readable and can be inspected with any text editor.

Surprisingly, there is no CSV standard; the interpretation of each text field is left to the application. This makes higher-dimension data (e.g. arrays) harder to exchange between platforms. Reading and writing are both slow since they require extensive text-to-value conversions. The resulting file size is not very good either, even with aggressive compression (which makes reading and writing slower still).
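To make the typing problem concrete, here is a minimal sketch using Python's standard csv module. The sample rows and the helper name csv_round_trip are made up for illustration; the point is only that every field comes back as text and the application has to know the schema to convert it.

```python
import csv
import io

def csv_round_trip(rows):
    """Write rows to CSV text, then read them back the way a plain CSV reader would."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer))

# Hypothetical sample data: an integer, a float and a boolean per row.
original = [[1, 2.5, True], [2, 0.1, False]]
restored = csv_round_trip(original)

# Every field is now a string; the file itself carries no type information.
print(restored)

# The application must supply the schema and convert each field by hand.
typed = [[int(a), float(b), c == "True"] for a, b, c in restored]
print(typed == original)
```

The per-field conversions in the last step are the kind of work a binary, typed format avoids.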

Arrow / Feather

Feather is a straight-to-disk memory dump of Apache’s Arrow. The latter is a development platform for in-memory column-oriented data. The obvious advantages are the reading and writing speeds, as they are only limited by how quickly the system can read and write the content to disk with little to no transformation. The downsides are the lack of interoperability with non-Arrow systems, and the resulting file size.
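The "memory dump" idea can be sketched with Python's standard array module. This is not the Arrow or Feather layout (which also carries schema metadata); it only illustrates, under that simplification, why writing and reading a typed column needs almost no transformation. The function names and sample values are hypothetical.

```python
from array import array

def dump_column(values):
    """Serialise a column of 64-bit floats by copying its in-memory buffer."""
    return array("d", values).tobytes()

def load_column(raw):
    """Rebuild the column from raw bytes; no text parsing is involved."""
    column = array("d")
    column.frombytes(raw)
    return column.tolist()

prices = [101.25, 101.5, 99.75]
raw = dump_column(prices)

print(len(raw))                    # bytes on disk: one fixed-width slot per value
print(load_column(raw) == prices)  # values survive the round trip unchanged
```

The flip side is visible too: each value always occupies its full width, which is why an uncompressed dump tends to produce larger files.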

HDF5

HDF (Hierarchical Data Format) is a hierarchical file format. Think XML, but binary. It’s extremely flexible, but so big and complicated that only one implementation exists. Neither its reading and writing speeds nor its file sizes are particularly good. For our purposes, we did not need the hierarchical layout provided by HDF either.

HDF seems to have been a favoured file format for many years but is slowly being abandoned for better alternatives, a feeling perhaps best summarised by the following post: https://cyrille.rossant.net/moving-away-hdf5/

Parquet

Parquet is a column-oriented file format. Unlike Feather, where the data is dumped straight from memory, Parquet groups the column data into chunks and stores them using fast encodings (e.g. run-length encoding, delta encoding) that reduce the data footprint while limiting the impact on serialisation speed. The chunks can themselves be compressed by a higher-level compression algorithm such as Snappy, Brotli, GZip, or LZ4.
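The two encodings named above can be sketched in a few lines of Python. This is a simplified illustration of the general techniques, not Parquet's actual on-disk encodings (which add bit-packing and block structure); the function names and sample columns are made up for the example.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def run_length_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]

def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

# A column with long runs of repeated values suits run-length encoding.
exchange = ["A", "A", "A", "B", "B", "A"]
print(run_length_encode(exchange))

# A slowly increasing column (e.g. timestamps) suits delta encoding:
# the differences are small numbers that need fewer bits to store.
timestamps = [1000, 1001, 1002, 1004]
print(delta_encode(timestamps))
```

Both transformations are cheap single passes over a column, which is why they cost little in serialisation speed, and they work well precisely because a column-oriented layout places similar values next to each other.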

Compared to other file formats, Parquet tends to offer a good balance between reading/writing speeds and file size (especially if fast compression algorithms such as Snappy are used). A good comparison can be found in the following post. Some of our internal tests have shown >25x speed improvements over CSV+GZip while producing files that are 50% smaller.

https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/

Parquet has three official Apache implementations: one in Java, one in C++, and one in Rust. The C++ library is available in Python via PyArrow. There is also an independent pure .NET implementation known as Parquet.NET.

https://parquet.apache.org/documentation/latest/

https://github.com/elastacloud/parquet-dotnet

Why ParquetSharp?

We desired a Parquet implementation with the following properties:

– Cross platform (i.e. Windows and Linux).

– Callable from .NET Core.

– Good performance.

– Well maintained.

– Close to official Parquet reference implementations.

Parquet.NET does tick a lot of boxes. Being pure .NET, its main advantage is its cross-platform portability. However, we were left disappointed by its reading and writing performance. It also started as a high-level library and lacks finer control over low-level aspects of Parquet such as row groups and statistics (since our original investigation, Parquet.NET v3 has been released and addresses some of these issues).

Not finding an existing solution that met these requirements to our satisfaction, we decided to implement a .NET wrapper around apache-parquet-cpp, starting at version 1.4.0. The resulting ParquetSharp library tries to stick closely to the existing C++ API, although it does provide higher-level APIs to make it easier to use from .NET. We also decided that the user should always be able to access the lower-level API if needed.

Tests using in-house data show ParquetSharp being >4x to >10x faster than Parquet.NET, depending on the Parquet.NET version (including v3), the data shape, and whether data is being read or written.

Current status

ParquetSharp is already used in production by G-Research. It covers most of the existing Apache C++ API, although some parts are yet to be ported (e.g. ColumnPath). We are focused on fixing bugs as they are uncovered, and will port more of the API based on ongoing feedback (including yours).

ParquetSharp successfully compiles and runs on Windows x64. Linux support is in progress; in theory the code should just work, but the build system still has to support it. We expect a working Linux build within a few months, once we are happy with the state of ParquetSharp on Windows.

By Tanguy – Software Engineer