How to back up Splunk Indexer Clusters

1 June 2020
  • Technology

What’s the big problem?

Anyone familiar with Splunk indexer clusters will have wrestled with this boring but essential challenge: how do you back up the data on your indexers without lots of duplication?

You could use the "cold to frozen" script approach to send buckets somewhere when they finally age out, but what if you want to back up data that's still in Splunk? Say, data that is no more than a day old?

One of the benefits of an indexer cluster is replication. You can ask it to store one or more searchable (or unsearchable) replicas within the site the data arrived in, and also in remote sites. This has resilience and performance benefits, but obviously creates replicas of data. We wanted to find a way of backing it up while a) removing bucket duplication and b) gaining any other space efficiencies we could, without losing any raw data.

Here is an example configuration for replication settings (set in server.conf on your cluster master):

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2

The above would suit a two-site cluster, and requires there to be one searchable replica in each site, with a total of two searchable replicas (irrespective of the number of indexers in each site).

If an indexer reboots or crashes, the replication targets require the other indexers in the same site to re-create/re-acquire the now-missing buckets. The buckets that were available on the missing indexer are replicated to the remaining indexers (provided there is disk space) until the replication targets are met once again. When you restore the missing indexer to service, you then have surplus replicas of data in that site, and you can go and remove the surplus buckets.

Furthermore, in our environment new data and its replicas are written into a (hot) filesystem on fast storage. We use Splunk's "volume" configuration, which moves the oldest hot buckets within the hot volume to a different (cold) filesystem on slower storage, so data moves from hot to cold as it ages.
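For reference, that arrangement looks something like the indexes.conf sketch below. The paths, size and index name are illustrative only, but the volume names match those used by the backup script later on:

[volume:hot_volume]
path = /splunk_hot
maxVolumeDataSizeMB = 2000000

[volume:cold_volume]
path = /splunk_cold

[my_index]
homePath = volume:hot_volume/my_index/db
coldPath = volume:cold_volume/my_index/colddb
thawedPath = $SPLUNK_DB/my_index/thaweddb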

To summarise, the set of buckets on each filesystem of each indexer is pretty volatile: data is moving around on the hosts themselves (the hot-to-cold cycle) and from host to host (bucket replication from the other nodes in the cluster).

What we used to do

When we tackled this problem we had 2 sites and 3 indexers in each site. Our replication targets were as above ("one searchable replica per site"), so for every bucket there's a searchable replica of it in the other site. This is about the simplest case you can have; if you configure more replicas of buckets, data volatility only gets more severe.

Our old backup strategy was to rsync all hot and cold filesystems off each indexer (3 indexers * 2 filesystems * 2 sites = 12 filesystems, copied to 12 directories (indexer:/hot, indexer:/cold * 6) on NAS storage). With all the data moving around, we invariably ended up backing up duplicates of data, many times more than the data we actually had stored in the indexers. Everything in the hot/warm filesystems inevitably gets moved to the cold filesystems. We had some thinning scripts looking for duplicates, but it's hard to do this safely.

Thinning measures

We'd also worked out that we could fully recover (think 'fsck' or 'rebuild') individual buckets from the journal.gz file alone. This allowed us to simply capture the journal.gz file for each bucket; if we needed to, we could get all that raw data searchable again, given enough disk space for the expanded/rebuilt buckets and some time to rebuild them.

Splunk process for this: Available here

Also, restored buckets can be put into any index (i.e. something built-in like "main", rather than the index the data was originally written into) and onto any indexer. Splunk isn't too fussy here: if it's a valid bucket in an index directory, it can search it.
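To make that concrete, here is a minimal sketch of restoring a single backed-up journal. The backup layout, paths and target index are all hypothetical and the splunk binary location will vary; the only Splunk command relied on is the documented splunk rebuild:

    import os
    import shutil
    import subprocess

    # Hypothetical paths - adjust for your environment.
    backed_up_journal = '/backup/my_index/db_1590969600_1590883200_12_0A1B2C3D-4E5F-6789-ABCD-EF0123456789/journal.gz'
    bucket_name = 'db_1590969600_1590883200_12_0A1B2C3D-4E5F-6789-ABCD-EF0123456789'
    thawed_dir = '/opt/splunk/var/lib/splunk/main/thaweddb'   # restoring into the built-in 'main' index

    # Recreate the bucket skeleton containing only the journal...
    bucket_dir = os.path.join(thawed_dir, bucket_name)
    os.makedirs(os.path.join(bucket_dir, 'rawdata'))
    shutil.copy(backed_up_journal, os.path.join(bucket_dir, 'rawdata', 'journal.gz'))

    # ...then ask Splunk to rebuild the index and metadata files from it.
    subprocess.check_call(['/opt/splunk/bin/splunk', 'rebuild', bucket_dir, 'main'])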

Data in our indexes will inevitably be expired due to age constraints, but we wanted a record of all data that had ever passed through our indexes. We also decided that we would not attempt to back up data models or summary indexes, as both can largely be regenerated from the raw data if required (this assumes you're actually indexing all your key temporal data, as opposed to using non-temporal storage such as databases or lookups).

Bucket names

Splunk Ref: Available here

If you consider each bucket on each indexer to be a segment of time within each index, then the name of each warm/cold bucket has a useful encoding in it:

<db|rb>_<NewestTime>_<OldestTime>_<Seq>_<GUID>

These fields/tokens are:

  • Whether this is a primary or a replica bucket
  • Time of the most recent event (newest/youngest)
  • Time of the least recent event (oldest)
  • Sequence number (for the indexer, in the index)
  • GUID of the indexer that first received the data

Primary buckets (whose names start db_…) tell us the indexer the data first arrived on. Replica buckets (whose names start rb_…) can appear on any other indexer, but still carry the GUID of the primary indexer that created them. So you can work out which indexer received the data from either the primary or any replica you might have.

Primary Hot buckets have a slightly different format:

hot_v<seq>

whereas the replica bucket name will be:

<seq>_<GUID of primary indexer>

What hot bucket names do not yet have is a start or end time (because they're still being appended to); Splunk renames a hot bucket into the warm/cold format when it rolls from hot to warm. From the replicated bucket's directory name we can determine the primary indexer GUID and sequence number, which, together with the index, gives us sufficient metadata to uniquely identify each bucket.

Side note: your parsing rules are important. If data in a bucket was somehow interpreted with an incorrect time, this will affect the start or end time in the bucket name. We found lots of buckets called db_0_0_xxx_GUID where Splunk had treated a 0 in the data as a timestamp (doh!). We also found a number of buckets where the start/end time was way into the future, for the same basic reason.

Since there is no reliable timestamp on these buckets to tell you the time window of the data they really contain, you need to include them in any restore process (just in case).
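As an illustration, pulling those fields out of a warm/cold bucket directory name might look like the sketch below (parse_bucket_name is our own hypothetical helper, not part of any Splunk API):

    import os

    def parse_bucket_name(bucket_dir):
        """Split db_<newest>_<oldest>_<seq>_<guid> (or rb_...) into its fields."""
        prefix, newest, oldest, seq, guid = os.path.basename(bucket_dir).split('_', 4)
        return {
            'is_replica': prefix == 'rb',
            'newest': int(newest),   # epoch time of the most recent event
            'oldest': int(oldest),   # epoch time of the least recent event
            'seq': int(seq),
            'guid': guid,            # GUID of the indexer that first received the data
        }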

The backup algorithm

  1. We capture the Indexer GUID, the sequence number, start and end times from the bucket directory name.
  2. The only file we need to back up is <bucketname>/rawdata/journal.gz. We can use the rebuild process to restore it.
  3. This file is present (and the same) in both primary and replicas of any given bucket. Therefore, it doesn’t matter if we back it up from the primary or replica holding indexer.
  4. We don't need to record whether we backed up the primary or the replica of a bucket. We also get two chances to capture it in the event of a host/filesystem failure.

A Splunk index is an aggregate of the data that was captured on all receiving indexers. We have 6 indexers, so for any given time period all 6 may have received events generated in our environment and written them into a bucket. If it's a busy index, there may be many buckets generated by each indexer, with overlapping start and end times. In quieter indexes, buckets may span much longer periods of time and there might only be one from each indexer. We can use the start and end times in each bucket name to select which buckets to rebuild in the event of a restore.
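A sketch of that selection test, reusing the hypothetical parse_bucket_name() above:

    def bucket_overlaps(bucket_dir, want_start, want_end):
        """True if the bucket's event time range overlaps the restore window."""
        b = parse_bucket_name(bucket_dir)
        # Buckets with bogus 0 or far-future timestamps (see the side note above)
        # won't match a sensible window, so restore those unconditionally.
        return b['oldest'] <= want_end and b['newest'] >= want_start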

We also no longer need to retain the source hot/cold filesystem or origin hostname in the target directory structure. We have the receiving indexer GUID and the start and finish times on every bucket, so we can collate all buckets (from every indexer) into one target directory per index, giving one long contiguous sequence of buckets for all time.
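The resulting backup layout might look something like the sketch below; the directory scheme is purely illustrative:

    import os

    def backup_destination(backup_root, index, bucket_dir):
        """Collate every bucket for an index under one directory, keyed by bucket name."""
        bucket_name = os.path.basename(bucket_dir)   # db_<new>_<old>_<seq>_<guid> or rb_...
        # Primary and replica hold the same journal, so normalise rb_ to db_ and
        # both end up de-duplicated onto the same destination path.
        if bucket_name.startswith('rb_'):
            bucket_name = 'db_' + bucket_name.split('_', 1)[1]
        return os.path.join(backup_root, index, bucket_name, 'journal.gz')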

Our backup script examines the journal.gz file from every bucket in every index. We also need to run it on every indexer, as they all back up into the same destination directories so that the data is collated.

Using the index, indexer GUID, sequence number, start time and end time, the backup script has to compare the local journal.gz file to the journal.gz file where it should be in the backup share.

It's possible (and in fact the most common case) that we already have the bucket backed up from its replica (or a previous backup run). If we have the journal.gz already, we check it's the same size/modification time, then skip it and move on.

Obviously, if the journal.gz in the backup share is smaller or older, it might have been a failed copy or have come from a truncated/faulty bucket, so we overwrite what is in the destination share.
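A sketch of what such a comparison helper might look like (the real script's should_copy(), used in the copy code further down, may differ):

    import os

    def should_copy(journal, dst):
        """Copy if the backed-up journal is missing, smaller or older than the local one."""
        if not os.path.exists(dst):
            return True
        src, bak = os.stat(journal), os.stat(dst)
        # A smaller or older file in the share is likely a failed/partial copy
        # (or came from a truncated bucket), so overwrite it.
        return bak.st_size < src.st_size or bak.st_mtime < src.st_mtime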

Our indexers have enough storage on them to hold 200,000 buckets, spanning many indexes. As you might expect, the backup script has to stat the journal.gz file for every bucket locally, and also compare this with what exists in the target NAS share. We can, however, shortcut searching the local indexer filesystems with a directory walk: instead, we query some REST endpoints to enumerate all indexes and all buckets within each index and generate the list for us.

Link for more info.

However, this process of comparing every journal.gz file on the local indexer filesystems to the target file on the destination network share is still IO intensive and takes some hours to complete (depending on latency to the destination share).

REST calls

Use REST to find the path on disk of the volumes:


    # 'client' is the script's REST client, pointed at the local indexer's management port.
    hot_volume = client.get('/services/properties/indexes/volume:hot_volume/path').text
    cold_volume = client.get('/services/properties/indexes/volume:cold_volume/path').text

Use a REST call to list the paths on disk of the buckets in hot/cold/frozen for every index (excluding some built-in ones we don't want to back up):


        # 'name' is each index stanza returned from /services/properties/indexes
        # (the enclosing loop is omitted here); skip volumes and built-in indexes.
        if not name.startswith('volume:') and name not in ['splunklogger','_thefishbucket','default','history','_introspection','_telemetry']:
            # Resolve volume: references to the real on-disk paths found above.
            hot = client.get('/services/properties/indexes/{0}/homePath'.format(name)).text.replace('volume:hot_volume', hot_volume)
            cold = client.get('/services/properties/indexes/{0}/coldPath'.format(name)).text.replace('volume:cold_volume', cold_volume)
            frozen = client.get('/services/properties/indexes/{0}/coldToFrozenDir'.format(name)).text

This is so we don't need to maintain an independent config file listing which indexes to back up: the backup script asks the local indexer where to look for buckets for all of its local indexes. It also means that if you remove an index from indexes.conf, the backup script will no longer know about it, but the files remain on your indexer filesystems. You might want to go and remove those files to reclaim the disk!

The actual copy algorithm runs multi-threaded, and as we run it on all indexers at the same time it's quite possible for two indexers to attempt to write/update the same journal.gz file at the same time. So we need a try/except around the copy, where we try to take an exclusive lock on the target before copying it.

 
    if should_copy(journal, dst):
        stat = os.stat(journal)
        logging.info('action=backup, target=%s, dest=%s, noop=%s, size_bytes=%s, mtime=%s', journal, dst, noop, stat.st_size, stat.st_mtime)
        if not noop:
            try:
                if not os.path.exists(dst_dir): os.makedirs(dst_dir)
                # Take the exclusive lock; if another indexer is already copying this
                # bucket, touch() raises OSError and we log 'File copy in progress' below.
                touch(lock)
                try:
                    shutil.copy(journal, dst)
                    if is_cold:
                        # The bucket is now cold, so tidy up under the hot/warm destination directory.
                        cleanup_warm(hot_dst_dir)
                finally:
                    # Always release the lock, even if the copy failed.
                    os.remove(lock)
            except OSError:
                logging.error('action=skip, target=%s, dest=%s, noop=%s, msg="File copy in progress"', journal, dst, noop)
            except Exception as why:
                logging.error('action=copyfail, target=%s, dest=%s, msg=%s', journal, dst, str(why))
    else:
        logging.info('action=skip, target=%s, dest=%s, noop=%s', journal, dst, noop)

We also log what happens as we scan every bucket, whether we copied it or skipped it, or whether the copy failed for some reason.
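The locking works because the lock file is created exclusively, so a second indexer arriving mid-copy hits the OSError branch above. A sketch of a touch() helper that behaves that way (the real one may differ):

    import os

    def touch(lock):
        """Create the lock file, raising OSError if another indexer already holds it."""
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)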

Backing up hot buckets

We initially didn't attempt this, wanting to ensure the backup of non-changing data was reliable first. However, it turns out this is largely feasible too: in the case where you have a reasonably quiet index (where buckets roll relatively infrequently), we discovered it's possible to grab the journal file while it is still being appended to. This let us reduce the window in which data has no out-of-Splunk backup to just a day.

The approach is largely the same, but we have to be a little inventive with bucket names in the destination share. Hot buckets (and hot bucket replicas) do not have start/end time values in their names; they only have the sequence number and the GUID of the indexer where the bucket is being written, and even that is only easily read from the replica's name!

The bucket written to the destination share has the following name format:

<seq>_XXXXX_YYYYY_<Primary indexer GUID>

The journal file is copied into it. When the bucket finally rolls to warm and is renamed, the backup script renames the destination backup directory too.
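A sketch of that rename step, with hypothetical helper and argument names; the placeholder format matches the destination naming described above:

    import os

    def rename_rolled_hot_backup(index_backup_dir, seq, guid, rolled_name):
        """Rename <seq>_XXXXX_YYYYY_<guid> to the bucket's real post-roll db_... name."""
        placeholder = os.path.join(index_backup_dir, '{0}_XXXXX_YYYYY_{1}'.format(seq, guid))
        target = os.path.join(index_backup_dir, rolled_name)
        if os.path.isdir(placeholder) and not os.path.exists(target):
            os.rename(placeholder, target)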

We tested this a few times and we’ve found rebuilding “hot buckets” to be *reasonably* reliable. (We haven’t had one fail, but we weren’t testing exhaustively).

After all, this is an indexer cluster: you have multiple copies of this data in Splunk already, so something incredibly catastrophic must have occurred for you to lose both replicas of this data in each site.

Additional benefits

This has also helped us when clients come to us demanding 5 or 10 year retention of their data: we are able to state that we can keep their data indefinitely (should we need to restore very old data), while keeping a more realistic amount (6 months/1 year/2 years) in Splunk, saving on disk and bucket count.

Gavin Davenport – Big Data Platform Engineering
