
The Object Is Dead. Long Live The Object!

JVM development is not an easy business. A production-grade JVM consists of many complex modules, and sometimes things go terribly wrong even though each module runs like clockwork in isolation. Some bugs are sporadic and hard to reproduce, and that makes debugging even more complicated. That’s why testing is very important for us. We have a lot of automated tests that run every night on all supported machine configurations. And once in a while they would hit a bug in code supposed to be flawless since time immemorial…

Mystery

Recently, after some innocent-looking commits, several tests for ResourceBundle started to fail sporadically. There is nothing special about those tests: they just verify that retrieving resources from resource bundles of various types works. But for some reason, various unexpected errors would appear during certain runs.

Let’s consider the following code snippet similar to the problematic test. Basically, it attempts to obtain a resource with the specified name from a resource bundle:

public static boolean checkResourceFromBundle(String expected) {
    try {
        URLClassLoader ucl =
                new URLClassLoader(new URL[]{new URL(urlString)});

        ResourceBundle bundle =
                ResourceBundle.getBundle(bundleName, locale, ucl);

        String resource = bundle.getString(resourceName);
        return expected.equals(resource);
    } catch (Exception e) {
        System.out.println("FAILED: unexpected exception:");
        e.printStackTrace();
        return false;
    }
}

If urlString, bundleName and locale are valid objects of their respective types, and a bundle with that name and locale can be found at the given URL, this method should never print the error message from the catch block, right?

It turns out that in some very specific cases it would do just that.

Moreover, it can fail not only when compiled with Excelsior JET, but also when run on the HotSpot JVM (OpenJDK 8).

So, how can it be?

The debugging process was quite tricky: in the end, we tracked the problem down to a clash between advanced features of two distinct JVM modules.

Investigation

The problem occurs sporadically, so we had better call the checkResourceFromBundle method in a loop:

public static void main(String[] args) {
    for (int i = 0; i < MAX_ITERATIONS; i++) {
        if (!checkResourceFromBundle(expectedResource)) {
            System.out.println("FAILED at iteration number " + i);
            return;
        }
    }
    System.out.println("PASSED");
}
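For readers who want to try this without the original test resources, here is a minimal self-contained sketch of the same loop. Everything the original snippet leaves undefined is an assumption made for illustration: the class name BundleLoopSketch, the bundle name "Bundle", the key "greeting", the value "hello", the iteration count, and the temporary directory that stands in for the real bundle location.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical stand-in for the real test: all names and values are assumptions.
final class BundleLoopSketch {
    static final String BUNDLE_NAME = "Bundle";
    static final Locale LOCALE = Locale.US;
    static final String RESOURCE_NAME = "greeting";
    static final String EXPECTED = "hello";
    static final int MAX_ITERATIONS = 100_000;

    // Temporary directory containing Bundle_en_US.properties, created once.
    static final URL BUNDLE_URL = createBundleDir();

    static URL createBundleDir() {
        try {
            Path dir = Files.createTempDirectory("bundle-sketch");
            Path file = dir.resolve(BUNDLE_NAME + "_en_US.properties");
            Files.write(file, (RESOURCE_NAME + "=" + EXPECTED + "\n")
                    .getBytes(StandardCharsets.ISO_8859_1));
            // Deleted in reverse registration order: the file first, then the directory.
            dir.toFile().deleteOnExit();
            file.toFile().deleteOnExit();
            // For an existing directory, toUri() ends with '/', as URLClassLoader expects.
            return dir.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean checkResourceFromBundle(String expected) {
        try {
            // Deliberately neither closed nor stored: as in the article's snippet,
            // passing the loader to getBundle() is its last use.
            URLClassLoader ucl = new URLClassLoader(new URL[]{BUNDLE_URL});
            ResourceBundle bundle = ResourceBundle.getBundle(BUNDLE_NAME, LOCALE, ucl);
            return expected.equals(bundle.getString(RESOURCE_NAME));
        } catch (Exception e) {
            System.out.println("FAILED: unexpected exception:");
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < MAX_ITERATIONS; i++) {
            if (!checkResourceFromBundle(EXPECTED)) {
                System.out.println("FAILED at iteration number " + i);
                return;
            }
        }
        System.out.println("PASSED");
    }
}
```

The sketch avoids try-with-resources for the class loader on purpose: closing it at the end of the method would count as a later use and could hide the effect under investigation.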

Now, let’s compile the above sample with javac and run it on HotSpot. One important detail: let’s specify a rather small heap with -Xmx. We’ll see later why this helps to reproduce the problem in a more reasonable time frame.

> /opt/jdk-8/bin/javac *.java
> /opt/jdk-8/bin/java -Xmx4096k Test

The result may look like this:

FAILED: unexpected exception:
java.util.MissingResourceException: Can't find bundle for base name Bundle, locale en_US
at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1564)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1387)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:1082)
at Krol.checkResourceFromBundle(Test.java:21)
at Krol.main(Test.java:34)
Caused by: java.lang.NullPointerException
at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2640)
at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1501)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1465)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1419)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1361)
... 3 more
FAILED at iteration number 2984

All files and resource bundles stayed the same throughout the test execution. Nothing changed, yet after 2984 successful iterations an unexpected NullPointerException interrupted the run.

Reveal

Okay, let’s see: a small -Xmx, a problem that reproduces sporadically, and an unexpected NPE as the most apparent symptom… One could guess that it is somehow connected with the garbage collector (GC), and would be totally right. Let’s look at the place where the NPE gets thrown, ResourceBundle.java:2640:

public ResourceBundle newBundle(String baseName, Locale locale,
                                String format, ClassLoader loader,
                                boolean reload) {
    ...
    Class<? extends ResourceBundle> bundleClass
        = (Class<? extends ResourceBundle>) loader.loadClass(bundleName);

    ...
}

A NullPointerException can only be thrown here if loader is null. So let’s follow the stack trace and find its definition. At ResourceBundle.java:1501 there is the following call:

bundle = control.newBundle(cacheKey.getName(), targetLocale, format,
                           cacheKey.getLoader(), reload);

where cacheKey is an instance of the CacheKey class:

private static class CacheKey implements Cloneable {
    ...
    private LoaderReference loaderRef;
    ...
    ClassLoader getLoader() {
        return (loaderRef != null) ? loaderRef.get() : null;
    }
    ...
}


and LoaderReference is actually a weak reference to a ClassLoader instance:

private static class LoaderReference extends WeakReference<ClassLoader>
                                     implements CacheKeyReference {
    ...
}

It means that if loaderRef holds the last reference to a ClassLoader instance, the GC can collect the latter without any problems. In our example, cacheKey was created in the getBundleImpl() method:

private static ResourceBundle getBundleImpl(String baseName,
                                            Locale locale,
                                            ClassLoader loader,
                                            Control control) {
    ...
    CacheKey cacheKey = new CacheKey(baseName, locale, loader);
    ...
}

and that was the last use of the given URLClassLoader! As there are no more uses of that object, the GC has every right to decide that it is already dead. So, if you are (not so) lucky, it will collect the ClassLoader object, and newBundle() will throw an NPE moments later. We provoked this by specifying a small -Xmx value and executing the example in a loop. That’s why you should always check that the result of a WeakReference.get() call is not null. (Yeah, this was the TL;DR.)
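The advice about checking WeakReference.get() can be sketched as follows. This is an illustration only, not code from the JDK: the class WeakGetSketch and the method describe() are hypothetical names, and Reference.clear() is used to imitate deterministically what the GC does when it clears a weak reference.

```java
import java.lang.ref.WeakReference;

// Hypothetical illustration of defensive use of WeakReference.get().
final class WeakGetSketch {
    /** Returns the referent's string form, or a marker if the referent is gone. */
    static String describe(WeakReference<Object> ref) {
        // Read the referent once into a strong local variable. Calling get()
        // twice would allow the referent to vanish between the two calls.
        Object referent = ref.get();
        if (referent == null) {
            return "<collected>";
        }
        return referent.toString();
    }

    public static void main(String[] args) {
        // A string literal is interned and stays strongly reachable.
        Object payload = "payload";
        WeakReference<Object> ref = new WeakReference<>(payload);
        System.out.println(describe(ref)); // prints "payload"

        // clear() stands in for the garbage collector here.
        ref.clear();
        System.out.println(describe(ref)); // prints "<collected>"
    }
}
```

Copying the result of get() into a local variable also gives the caller a strong reference for as long as that variable is in use, so the null check and the subsequent use see the same object.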

Stealthy Bug

Good, now we understand what is going on here. But there are still some unanswered questions, so let’s dig a little bit deeper.

We noticed the problem because some tests started to fail. But why didn’t they fail before that? It looks like the problem was always there…

Seemingly, it was those innocent-looking changes in Excelsior JET that indirectly affected memory consumption patterns in some scenarios. GC started to happen more frequently, or less frequently, or at different moments of time, and that has caused the problem to reveal itself. It was just like the butterfly effect.

We have also confirmed that the bug is in the reference implementation of the Java SE library. How did it manage to remain unnoticed by Oracle engineers and the OpenJDK community?

The thing is that the problematic scenario is actually rather artificial for HotSpot. First of all, a GC cycle must occur within the very short time span between the creation of a CacheKey instance and the attempt to retrieve its class loader. But, more importantly, the getBundleImpl() method must already have been compiled with C1 or C2.

Unlike those compilers, the HotSpot interpreter is not capable of determining whether some use of a local variable is the last one or not. The GC has no choice but to make the conservative assumption that an object, or, more precisely, an object reference stored in a local variable of an interpreted method, is alive. Long live the object! That’s why the problem does not manifest itself if HotSpot is forced to run in interpreter-only mode with -Xint:

> /opt/jdk-8/bin/javac *.java 
> /opt/jdk-8/bin/java -Xmx4096k -Xint Test
PASSED

Many applications gather their resources from bundles only once, during startup. Therefore, the number of invocations of the problematic method is unlikely to exceed any of the HotSpot compile thresholds (see footnote 1).

Now you understand why this bug hardly ever bites in real-life situations if the application runs on the HotSpot VM. But the Excelsior JET AOT compiler always compiles all methods of all classes supplied as input, so the problem is more noticeable to its users.

Treatment

All right, but nevertheless, the problem is present in the Oracle JDK. Will it be fixed?

We are not sure about JDK 8, but it turns out that the problem has already been fixed in JDK 9. The ResourceBundle class was changed to support modules, but the new logic for modules is quite similar to the old one for class loaders:

private static final class CacheKey {
    ...
    private final KeyElementReference<Module> moduleRef;
    private final KeyElementReference<Module> callerRef;
    ...
}

Here, KeyElementReference also extends WeakReference, but cacheKey is now used properly:

Module module = cacheKey.getModule();
if (module == null) {
    // should not happen
    throw new InternalError(
                "Module for cache key: " + cacheKey + " has been GCed.");
}

But why would the code marked as “should not happen” never execute? Looking at the new getBundleImpl() method implementation, you may notice that calls to the Reference.reachabilityFence() method were added before the return statement:

private static ResourceBundle getBundleImpl(String baseName,
                                            Locale locale,
                                            Class<?> caller,
                                            ClassLoader loader,
                                            Control control) {
    ...

    CacheKey cacheKey = new CacheKey(baseName, locale, module,
                                     callerModule);
    ...
    // keep callerModule and module reachable for as long
    // as we are operating with WeakReference(s) to them
    // (in CacheKey)...
    Reference.reachabilityFence(callerModule);
    Reference.reachabilityFence(module);

    return bundle;
}

These calls are needed solely to prolong the life of the objects passed to them. The GC won’t collect those objects before getBundleImpl() completes its job, and that solves the problem:

> /opt/jdk-9/bin/javac *.java
> /opt/jdk-9/bin/java -Xmx4096k Test
PASSED
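The same pattern can be applied in user code on Java 9 and later. The sketch below is an assumption-laden illustration rather than the JDK's own code: FenceSketch and lengthViaWeakRef() are hypothetical names, and the try/finally arrangement follows the usage recommended in the Reference.reachabilityFence() documentation.

```java
import java.lang.ref.Reference;
import java.lang.ref.WeakReference;

// Hypothetical illustration of Reference.reachabilityFence() (Java 9+).
final class FenceSketch {
    static int lengthViaWeakRef(StringBuilder owner) {
        WeakReference<StringBuilder> ref = new WeakReference<>(owner);
        try {
            // From here on the code only goes through 'ref', never 'owner',
            // which mirrors how getBundleImpl() works through its CacheKey.
            StringBuilder sb = ref.get();
            if (sb == null) {
                // Should not happen: the fence below keeps 'owner' reachable.
                throw new IllegalStateException("referent was collected");
            }
            return sb.length();
        } finally {
            // Keeps 'owner' strongly reachable at least until this point,
            // even though nothing else uses it after the weak reference is created.
            Reference.reachabilityFence(owner);
        }
    }

    public static void main(String[] args) {
        System.out.println(lengthViaWeakRef(new StringBuilder("abc"))); // prints 3
    }
}
```

Placing the fence in a finally block guarantees it is reached on every exit path, including exceptional ones.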

However, there is one tricky detail still left. Let’s look at the definition of reachabilityFence():

@DontInline
public static void reachabilityFence(Object ref) {
    // Does nothing, because this method is annotated with @DontInline
    // HotSpot needs to retain the ref and not GC it before a call to
    // this method
}

A call to an empty static method is just a sitting duck for any optimizing compiler, which would definitely try to inline that call so as to make it disappear without a trace (see footnote 2). If the compiler inlined the calls of reachabilityFence(), those Module instances would receive their death sentences right after the creation of weak references to them. But pay attention to the annotation: @DontInline is a proprietary HotSpot annotation telling its compilers to avoid inlining particular methods. HotSpot compilers love to inline empty static methods, but the annotation prevents them from doing that, making the fence work as expected.

Excelsior JET loves to inline empty static methods too, so we have to prevent inlining of reachabilityFence() as well. We could use our own internal mechanisms to achieve that, or teach our compilers to recognize the HotSpot-specific @DontInline annotation. The choice is not obvious, though. We’ll make the final decision in the course of our work on Java 9 support.


That’s all for today! Be careful when using weak references, and always check that their referents are not null, because they may have been dead for a long time already.

Your comments are very welcome here, on Reddit, and on Hacker News.

Source Code Links

  1. ResourceBundle.java in OpenJDK 8
  2. Fixed ResourceBundle.java in OpenJDK 9
  3. Example that reproduces the problem (GitHub)

Footnotes

  1. See e.g. What the JIT!? Anatomy of the OpenJDK HotSpot VM by Monica Beckwith for details.
  2. Except maybe for the initialization of the respective class.