Faster string processing in Pandas

Itamar Turner-Trauring, from our Open-Source Team, looks at ways to speed up string processing in Python

Imagine you have a Pandas DataFrame with 100 million rows and a column of strings you want to manipulate.

Python’s string library is quite fast, but processing 100 million strings can add up.

This was the problem faced by one of the developers at G-Research, and so I spent some time researching and measuring alternatives to help them get results faster.

And to help you get faster results, here’s what I learnt about speeding up string operations in Pandas.

Python string operations are quite fast

There are two potential sources of slowness when dealing with this situation:

  • Overhead from the large number of items. 100 million strings means iterating over a sequence of 100 million entries, calling functions hundreds of millions of times, and so on.
  • The string operations themselves. It was not clear to me from the outset how fast the actual Python string implementation was.

A little bit of experimentation suggested that Python’s string implementation is in many cases quite fast and optimised.

For example, I looked at converting a string to uppercase, useful for normalization; lowercase would also work just as well.

For strings that are ASCII-only, Python can do its work much faster.

Here’s what IPython reports:

In [1]: %timeit "hello world".upper()
57 ns ± 0.855 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [2]: %timeit "hello world\u1234".upper()
151 ns ± 0.359 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

So for standard string operations, it’s quite plausible that Python’s implementation is sufficient.
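
Outside IPython, the same comparison can be reproduced with the standard library’s timeit module. This is a minimal sketch: the helper name time_upper and the loop count are illustrative choices rather than part of the original measurements, and the absolute numbers will vary with the machine and Python version.

```python
import timeit

ASCII_TEXT = "hello world"
NON_ASCII_TEXT = "hello world\u1234"


def time_upper(text: str, number: int = 200_000) -> float:
    """Return the average time, in nanoseconds, of one text.upper() call."""
    total_seconds = timeit.timeit(text.upper, number=number)
    return total_seconds / number * 1e9


ascii_ns = time_upper(ASCII_TEXT)
non_ascii_ns = time_upper(NON_ASCII_TEXT)
print(f"ASCII-only: {ascii_ns:.0f} ns per call")
print(f"Non-ASCII:  {non_ascii_ns:.0f} ns per call")
```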

The problem: 100 million strings

The issue with performance in this situation is therefore more about the fact that you’re doing the same operations over and over and over again, 100 million times.

To see what the impact is, let’s create some strings, and a function to measure run time:

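import pandas as pd
import time

# One million sample strings, each prefixed with its index.
SERIES = pd.Series([
    "{} hello world how are you".format(i)
    for i in range(1_000_000)
])

def measure(what, f):
    # Time a single call to f() and print the elapsed wall-clock seconds.
    start = time.time()
    f()
    print(f"{what} elapsed: {time.time() - start}")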

We can compare two variants: the naive apply() and Pandas’ “vectorized” string operations via .str.

“Vectorized” means the code is supposed to take into account the fact that it’s doing the same thing over and over again; actual implementation may vary, but the hope is that this will result in a speedup.

def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()

measure("Pandas apply()",
        lambda: SERIES.apply(replace_then_upper))
measure("Pandas .str.",
        lambda: SERIES.str.replace("l", "").str.upper())

If we run that we get the following results:

Pandas apply() elapsed: 0.39

Pandas .str elapsed: 0.73

Turns out apply() is much faster.

Even if we only do a single operation (replace()) instead of two in a row, apply() is still faster:

Pandas apply() elapsed: 0.32

Pandas .str elapsed: 0.43

Lesson #1: Pandas’ “vectorized” string operations are often slower than just using apply().

There is a bug filed in the Pandas issue tracker about this.

Lesson #2: If performance matters, you should measure different methods instead of trusting that one implementation is better than the other.
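
A single timed run can be noisy, so when comparing methods it may help to repeat each one a few times and keep the best result. The following is a minimal sketch using only the standard library; best_of and the candidates dictionary are hypothetical names introduced for this example, and the two candidates are stand-ins for whatever implementations you want to compare.

```python
import time
from typing import Callable


def best_of(f: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time, in seconds, over several runs of f()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        f()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)


def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()


# A small sample so the example finishes quickly; scale it up for real tests.
STRINGS = [f"{i} hello world how are you" for i in range(10_000)]

candidates = {
    "list comprehension": lambda: [replace_then_upper(s) for s in STRINGS],
    "map()": lambda: list(map(replace_then_upper, STRINGS)),
}

for name, candidate in candidates.items():
    print(f"{name}: {best_of(candidate):.4f} seconds")
```

Any zero-argument callable can be timed the same way, including the lambdas passed to measure() above.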

Speeding things up

While we’ve found a faster variation, it would be good to speed things up even more.

So the next step was to try to get rid of the Python overhead of doing the same thing over and over again.

Two of my attempts didn’t help:

  • Numba with nopython mode was much slower. To be fair, it’s mostly oriented towards numeric operations, but I was still a little surprised it doesn’t help purely by reducing Python function call overhead.
  • NumPy’s “vectorized” string operations (yes, they exist!) were also slower.
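
For reference, NumPy’s string routines live in the numpy.char module. The sketch below applies the same replace-then-uppercase transformation with them. It illustrates the API only and is not necessarily the exact code that was benchmarked; the sample strings are made up for the example.

```python
import numpy as np

# numpy.char works on arrays with a string dtype (here "<U11"),
# not on arrays of arbitrary Python objects.
strings = np.array(["hello world", "all is well"])

# Each call processes the whole array and returns a new string array.
without_l = np.char.replace(strings, "l", "")
result = np.char.upper(without_l)
print(result)
```

Note that these functions expect a string dtype, so an object-dtype column coming from Pandas would likely need converting first, for example with astype(str).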

That being said, let’s not forget lesson #2: you shouldn’t take my word for it, but test for yourself for your particular application.

And even if these results are true now, they might not be true in the future as the relevant libraries evolve.

What did work was Cython. Cython is a Python-like language that compiles to C and from there to a C extension.

By avoiding some of Python’s slow overhead, we can hopefully speed things up:

cimport cython
cimport numpy as np
import numpy as np

ctypedef str (*string_transform)(str)

cdef np.ndarray transform_memoryview(str[:] arr, string_transform f):
    cdef str[:] result_view
    cdef Py_ssize_t i
    cdef str s
    result = np.ndarray((len(arr),), dtype=object)
    result_view = result
    for i in range(len(arr)):
        s = arr[i]
        result_view[i] = f(s)
    return result

cdef str replace_then_upper(str s):
    return s.replace("l", "").upper()

def transform_replace_then_upper(arr: np.ndarray) -> np.ndarray:
    return transform_memoryview(arr, replace_then_upper)

The basic way this works:

  • There is a reusable cdef function called transform_memoryview, essentially a C function, that iterates over a NumPy array and applies another C function to each element. It uses the fast memoryview API provided by Cython.
  • There is another C function, replace_then_upper, that does the transformation we want.
  • Finally, a def function, the one that actually gets exposed to Python, calls transform_memoryview with replace_then_upper.

If you want to create multiple functions exposed to Python, you can reuse transform_memoryview, and the fact the two underlying functions are C functions means the C compiler can inline them (which it seems to), reducing overhead a bit more.
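
To make the structure easier to follow, here is a plain Python and NumPy sketch of the same logic. It only illustrates what the Cython code does and has none of its speed benefit; transform_python and the sample input are names assumed for this example. A version like this could also serve as a reference when checking the compiled function’s output.

```python
import numpy as np


def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()


def transform_python(arr: np.ndarray, f) -> np.ndarray:
    # Allocate an object array to hold the resulting Python strings.
    result = np.empty(len(arr), dtype=object)
    # Apply f to each element in turn; this loop is what Cython compiles to C.
    for i in range(len(arr)):
        result[i] = f(arr[i])
    return result


sample = np.array(["hello world", "all is well"], dtype=object)
print(transform_python(sample, replace_then_upper))
```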

Here are the performance numbers comparing the Cython approach to the previous two:

Method            Elapsed time (lower is better)
Pandas .str.      0.73 seconds
Pandas apply()    0.39 seconds
Cython            0.24 seconds

While Cython can be built ahead of time, you can also compile it on demand, including in Jupyter.

The compilation takes time, of course, a few seconds on my computer.

For a speedup from 0.39 seconds to 0.24 seconds on one million strings, it’s clearly not worth it. But for the original use case of 100+ million strings, where the time saved grows with the number of rows, the few seconds of on-demand Cython compilation still leave you ahead when doing interactive exploration.

And for a batch job that runs multiple times, you only need to compile once.

Lesson #3: Python has lots of overhead.

If your code is running the same operation over and over again, consider using Cython to remove that overhead.

Other approaches

Speeding up your existing code is one approach, but there are others.

Considering our motivating example:

  • Is the core problem latency or cost? If the issue is just latency and you don’t care about cost, parallelism using multiple CPUs can get you faster response times.
  • Can those 100 million rows be filtered down in advance in some more efficient way?
  • Does the data need to be a string? Perhaps the originating system could write out something more easily usable.
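
As a rough illustration of the parallelism option, the sketch below splits the strings into chunks and hands them to a pool of worker processes using the standard library’s concurrent.futures module. The function names, worker count, and chunk size are all assumptions made for this example. Whether it helps in practice depends on how the cost of copying strings between processes compares to the work done on them, so it is worth measuring.

```python
from concurrent.futures import ProcessPoolExecutor


def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()


def transform_chunk(chunk: list[str]) -> list[str]:
    # Runs inside a worker process. Handling many strings per call means the
    # cost of sending data between processes is paid per chunk, not per string.
    return [replace_then_upper(s) for s in chunk]


def split_into_chunks(items: list[str], chunk_size: int) -> list[list[str]]:
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def parallel_transform(
    items: list[str], workers: int = 4, chunk_size: int = 100_000
) -> list[str]:
    chunks = split_into_chunks(items, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns the chunk results in the same order as the input.
        results = pool.map(transform_chunk, chunks)
        return [s for chunk in results for s in chunk]
```

When calling parallel_transform from a script, do so under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module.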

So if the direct approach doesn’t work or isn’t sufficient, it can pay to try from a different angle.

Itamar Turner-Trauring, Open-Source Team at G-Research
