
Diagnosing Ranger KMS Hangs

30 April 2019
  • Technology

Introduction

One of the primary aims of the Big Data Platform Engineering team at G-Research is to provide our customers with reliable and secure Big Data platforms.

To help us achieve this we use HDFS transparent encryption, which allows us to encrypt files that are stored on HDFS.

Our implementation includes Ranger KMS, an open source cryptographic key management service.

Recently, we had a number of incidents where Ranger KMS would hang, causing jobs running on our Hadoop cluster to fail.

This blog post describes the tools and techniques we used to debug this issue, and explains how you can prevent it from happening in your own environment.

Background

Transparent encryption introduces the concept of an encryption zone: a special directory whose contents are transparently encrypted on write and transparently decrypted on read. Each encryption zone is associated with a single encryption zone key (EZK), and each file within the zone has its own unique data encryption key (DEK). The plain text DEK is encrypted into an encrypted data encryption key (EDEK), which is persisted as an extended attribute in the NameNode's metadata for the file.

Access to the EZK provides access to all of the data in the encryption zone, and since it may not be acceptable for Hadoop administrators to have access to that data, EZKs cannot be stored in HDFS. Instead, a secure key server is used to generate and store encryption keys (EZKs and DEKs) and to manage access to them. We use Ranger KMS as our key server; it is a Java application backed by a relational database (in our case, PostgreSQL).

In summary, the key server is involved whenever an HDFS client needs to create a new file in an encryption zone or read an existing one. On file creation the key server generates an EDEK, and on read it decrypts the EDEK back into a DEK.
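
To make this flow concrete, the sketch below uses Hadoop's KeyProviderCryptoExtension API to generate and then decrypt an EDEK against a KMS, which is broadly what the NameNode and DFSClient do on file create and read. The KMS URI and key name are placeholders, and the sketch assumes the key already exists and the caller is authorised for the GENERATE_EEK and DECRYPT_EEK operations; it is an illustration, not how our platform invokes the KMS.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.crypto.key.KeyProvider;
import org.apache.hadoop.crypto.key.KeyProviderCryptoExtension;
import org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.EncryptedKeyVersion;
import org.apache.hadoop.crypto.key.KeyProviderFactory;

public class EdekRoundTrip {
  public static void main(String[] args) throws Exception {
    // Hypothetical KMS endpoint and EZK name - substitute your own values.
    URI kmsUri = URI.create("kms://https@kms.example.com:9292/kms");
    String ezkName = "dir_user_key";

    Configuration conf = new Configuration();
    KeyProvider provider = KeyProviderFactory.get(kmsUri, conf);
    KeyProviderCryptoExtension ext =
        KeyProviderCryptoExtension.createKeyProviderCryptoExtension(provider);

    // On file creation the KMS generates an EDEK under the zone's EZK...
    EncryptedKeyVersion edek = ext.generateEncryptedKey(ezkName);

    // ...and on read it decrypts the EDEK back into the DEK used to decrypt the file.
    KeyProvider.KeyVersion dek = ext.decryptEncryptedKey(edek);
    System.out.println("DEK length in bytes: " + dek.getMaterial().length);
  }
}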

The problem

One afternoon we became aware that jobs were failing in our cluster, and attempts to submit new ones were hanging and timing out. Our production Hadoop platform was effectively unavailable.

It was clear from the errors in the YARN container logs that Ranger KMS was not responding to EDEK generation and decryption requests.


java.net.SocketTimeoutException: Read timed out
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:171)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
      at sun.security.ssl.InputRecord.read(InputRecord.java:503)
      at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
      at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
      ...
      at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:578)
      at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:566)
      at org.apache.hadoop.crypto.key.kms.KMSClientProvider.decryptEncryptedKey(KMSClientProvider.java:834)
      at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:210)
      at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:206)
      at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.doOp(LoadBalancingKMSClientProvider.java:95)
      at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:206)
      at org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
      at org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1404)

We reviewed the KMS server logs and found nothing that indicated the cause of the problem. There was a steady stream of INFO messages up until the time of the hang, and then nothing more was logged. It seemed as though the process was blocked. We used the top command to check CPU utilisation for the Ranger KMS process, and confirmed it was idle:

top -H -p <pid>

Our troubleshooting efforts continued in vain, and after roughly 13 minutes the KMS suddenly sprang back into action, as the access logs show:


[18/Oct/2018:16:28:35 +0000] "OPTIONS /kms/v1/keyversion/<custom>%400/_eek?eek_op=decrypt HTTP/1.1" 200 638
[18/Oct/2018:16:41:58 +0000] "POST /kms/v1/keyversion/dir_user_key%400/_eek?eek_op=decrypt HTTP/1.1" 200 94

We did not know what had caused the hang. We knew that a colleague had been attempting to generate a new encryption key at the time of the incident. Could this have been the trigger?

We needed to be better prepared if this happened again, so we decided to turn on DEBUG logging for Ranger KMS and its UI.

For clusters that are managed by Apache Ambari, Ranger KMS debug is enabled using the following Advanced kms-log4j settings:


log4j.logger=DEBUG,kms
log4j.rootLogger=DEBUG,kms
log4j.logger.org.apache.hadoop=DEBUG

For the UI, in Advanced admin-log4j:

log4j.category.xa=debug,xa_log_appender

After restarting Ranger and Ranger KMS for the changes to take effect, we tried listing and generating keys. To our relief, both operations completed without incident.

It was reassuring to know that listing and generating keys didn't seem to be the cause of the issue, but frustratingly we were no closer to knowing what was.

The Plot Thickens

A few weeks later, a colleague was creating a new encryption key in our non-production environment and the problem struck once again. This time we were prepared.

The good news was that debug messages were being written to the logs whilst the hang was ongoing. Unfortunately, they didn’t seem to be telling us anything useful!

Next, we used the jstack command to capture a series of thread dumps of the Ranger KMS process:


for i in {0..19} ; do sudo -u kms jstack -l <KMS_PID> > /tmp/jstack-kms-$(date +%Y-%m-%d.%H:%M:%S).jstack ; sleep 5 ; done

After reviewing the thread dumps, a very clear pattern began to emerge.

Firstly, each dump included a number of threads in the WAITING state. They were waiting on a lock, and their stack traces included the KeyAuthorizationKeyProvider.decryptEncryptedKey method.


"http-bio-9393-exec-47597" #115997 daemon prio=5 os_prio=0 tid=0x00007fa428076800 nid=0x4335 waiting on condition [0x00007fa461be4000]
    java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for  <0x00000005c18ff9b0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
      ...
      at org.apache.hadoop.crypto.key.kms.server.KMS.decryptEncryptedKey(KMS.java:451)
      ...
      at java.lang.Thread.run(Thread.java:745)

A review of the Ranger KMS source code confirmed that KeyAuthorizationKeyProvider.decryptEncryptedKey is responsible for decrypting EDEKs and that it acquires a read lock:


@Override
public KeyVersion decryptEncryptedKey(EncryptedKeyVersion encryptedKeyVersion)
    throws IOException, GeneralSecurityException {
  readLock.lock();
  try {
    verifyKeyVersionBelongsToKey(encryptedKeyVersion);
    doAccessCheck(encryptedKeyVersion.getEncryptionKeyName(), KeyOpType.DECRYPT_EEK);
    return provider.decryptEncryptedKey(encryptedKeyVersion);
  } finally {
    readLock.unlock();
  }
}

The same threads appeared again and again in each thread dump, indicating that they were continually waiting to acquire the lock. No wonder those DECRYPT_EEK operations were timing out.
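
A quick way to confirm this kind of pattern, assuming the dumps were written to /tmp as in the loop above, is to count the threads parked on the lock address from the dump in each file:

# Count, per dump, the threads parked waiting for the ReentrantReadWriteLock seen above
grep -c 'parking to wait for.*0x00000005c18ff9b0' /tmp/jstack-kms-*.jstack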

Our next observation was that the thread attempting to create the new encryption key (via the KeyAuthorizationKeyProvider.createKey method) was RUNNABLE (i.e. executing) and was attempting to read from its PostgreSQL socket:


"http-bio-9393-exec-47512" #115751 daemon prio=5 os_prio=0 tid=0x00007fa35400a000 nid=0x68c6 runnable [0x00007fa4609d5000]
    java.lang.Thread.State: RUNNABLE
      at java.net.SocketInputStream.socketRead0(Native Method)
      ...
        at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
      ...
        at org.apache.hadoop.crypto.key.kms.server.KeyAuthorizationKeyProvider.createKey(KeyAuthorizationKeyProvider.java:171)
      ...
        at java.lang.Thread.run(Thread.java:745)

A review of the source code confirmed that the KeyAuthorizationKeyProvider.createKey method acquires the write lock before it creates the key:


@Override
public KeyVersion createKey(String name, byte[] material, Options options)
    throws IOException {
  writeLock.lock();
  try {
    authorizeCreateKey(name, options, getUser());
    return provider.createKey(name, material, options);
  } finally {
    writeLock.unlock();
  }
}

To summarise: the thread trying to create the key was holding the write lock while it waited for a database query to complete. All of the decrypt threads needed the read lock, and would remain blocked until the database operation finished. We therefore needed to understand why the database operation was taking so long…
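
The locking behaviour itself is easy to reproduce outside Ranger. The standalone sketch below (plain java.util.concurrent.locks.ReentrantReadWriteLock, not Ranger code) shows reader threads parking while a writer holds the write lock during a slow operation, which is exactly the shape we saw in the thread dumps:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockStarvation {
  private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

  public static void main(String[] args) throws InterruptedException {
    // The "createKey" stand-in: takes the write lock, then stalls as if waiting on a dead DB connection.
    Thread writer = new Thread(() -> {
      lock.writeLock().lock();
      try {
        System.out.println("writer: holding write lock, simulating a hung database call");
        TimeUnit.SECONDS.sleep(10);
      } catch (InterruptedException ignored) {
      } finally {
        lock.writeLock().unlock();
      }
    });

    // The "decryptEncryptedKey" stand-in: each reader parks until the writer releases the lock.
    Runnable reader = () -> {
      System.out.println(Thread.currentThread().getName() + ": waiting for read lock");
      lock.readLock().lock();
      try {
        System.out.println(Thread.currentThread().getName() + ": acquired read lock");
      } finally {
        lock.readLock().unlock();
      }
    };

    writer.start();
    TimeUnit.MILLISECONDS.sleep(100); // ensure the writer grabs the lock first
    for (int i = 0; i < 3; i++) {
      new Thread(reader, "reader-" + i).start();
    }
    writer.join();
  }
}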

Fortunately, PostgreSQL includes the pg_stat_activity statistics view, which shows the current activity of each database backend process. Looking at this view, we were surprised to find that no clients were connected to the Ranger KMS database at all. In other words, the KMS database connections were dead.
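
For reference, a query along the following lines is enough to see whether any backends are connected to the KMS database; the database name rangerkms here is a placeholder for whatever your KMS schema is called:

SELECT pid, usename, client_addr, state, query_start
FROM pg_stat_activity
WHERE datname = 'rangerkms';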

We immediately suspected that the networking infrastructure between KMS and PostgreSQL was the cause.

Running traceroute on the Ranger KMS host confirmed that PostgreSQL was behind a firewall, and a review of the firewall logs showed the following errors:

TCP packet out of state: First packet isn’t SYN tcp_flags=RST-ACK

The source and destination IP addresses were those of Ranger KMS and PostgreSQL respectively, and the error message indicated that the firewall could not associate the packet with an existing connection in its state table. This typically happens when a firewall's idle timeout silently removes the session from its state table, and a quick chat with the network team confirmed that this was exactly what had occurred. No wonder we couldn't reproduce the issue initially: KMS had recently been restarted, so the firewall had not yet dropped its sessions.

Now that we understood the root cause, it was simple to fix: enable TCP keepalives on Ranger KMS and PostgreSQL.

For KMS, we added tcpKeepAlive=true to the JDBC connection string:

jdbc:postgresql://hostname:port/dbname?tcpKeepAlive=true

For PostgreSQL, we specified the following configuration parameters (an illustrative example follows the list):

  • tcp_keepalives_idle
  • tcp_keepalives_count
  • tcp_keepalives_interval
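
As a rough sketch, the postgresql.conf values below probe a connection after five minutes of idleness and declare it dead after three missed probes; the numbers are purely illustrative and should be chosen to sit comfortably inside your firewall's idle timeout:

# postgresql.conf - illustrative values only
tcp_keepalives_idle = 300       # seconds of idleness before the first keepalive probe
tcp_keepalives_interval = 30    # seconds between probes
tcp_keepalives_count = 3        # missed probes before the connection is considered dead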

Bear in mind that the default Linux TCP keepalive configuration is to wait two hours before sending the first keepalive probe, and then to send a probe every 75 seconds up to nine consecutive times. This means it would take over two hours for a broken connection to be detected, so you may want to configure the Ranger KMS host to send keepalive probes sooner and more frequently. In addition, the default value of tcp_retries2 (which determines how many times an unacknowledged TCP packet is retransmitted before an established connection is abandoned) is 15 retransmissions, which works out to a timeout of roughly 15 minutes, and you might want to revise this at the same time.
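
On the Ranger KMS host these are kernel parameters, so the tuning might look something like the sysctl snippet below; the file path and values are illustrative only and should be tested against your own firewall timeouts:

# /etc/sysctl.d/99-kms-keepalive.conf - illustrative values only
# Defaults are 7200s idle, 75s between probes, 9 probes and 15 retransmissions.
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_retries2 = 8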

Conclusion

Thread dumps are a useful tool and should be one of the first things you try when diagnosing unresponsive Java applications.

Austin Hackett – Big Data Platform Engineer
