
Portable Profilers And Where to Find Them

Disclaimer

This post explains how to gather information about the run-time behavior of a Java application that can then be used to analyze its performance issues and bottlenecks. This goal can be achieved both from within the JVM and from the outside, but this post mostly considers this topic from the JVM’s point of view. Gathering such information is a bit like spying on the application to identify its most popular places and habits. However, please note that this is just a figurative comparison. This post is not about hacking third-party applications or bypassing their security mechanisms. It is about gathering statistics that can be used for optimizations.

Goals And Requirements

Many JVMs implement so-called just-in-time (JIT) compilation, which takes place at the application’s run time. In particular, that provides an opportunity to gather an execution profile and find the hottest methods, which are then recompiled at a higher optimization level to achieve the best performance.

In contrast, Excelsior JET uses ahead-of-time (AOT) compilation as the main compilation mode. This means that classes are compiled down to native code before execution if possible. There is also no such thing as recompilation of a class or a method at run time. However, profile guided optimizations (PGO) are still possible with AOT. The main difference is that the profile data has to be collected by the application developer by running the application in profiling mode and then fed to the compiler.

In order to introduce PGO in Excelsior JET 12, we had to enable the gathering of application execution profiles first. We formulated several requirements and limitations for that feature:

  1. A profile should contain a list of hot methods along with their call chains;
  2. Profiling should be available on all supported target platforms. That means it had to work on each supported O/S: Windows, macOS, and Linux; and also on each supported CPU: x86, AMD64, and ARM;
  3. It is unacceptable to introduce a separate build mode only for profiling. It should be possible to enable the profiler just by specifying additional VM options before launching an application;
  4. The performance of an application with profiler enabled should be acceptable for running it in a production environment.

So, how to implement such a thing in a JVM?

Instrumentation

The first thing that comes to mind is code instrumentation. Many profilers modify either the bytecode of the application or its binary executable, augmenting each method and loop body with code that increments the respective counter each time the method is invoked or the loop iterates. As a result, all method calls and loop iterations are counted precisely at run time.
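
In source-level terms, the transformation looks roughly like this (a hypothetical sketch: the counter names are made up, and a real instrumenter injects equivalent bytecode or machine code rather than Java source):

```java
import java.util.concurrent.atomic.AtomicLong;

public class InstrumentationSketch {
    // Hypothetical per-method counters that an instrumenter would maintain.
    static final AtomicLong INVOCATIONS = new AtomicLong();
    static final AtomicLong BACK_EDGES  = new AtomicLong();

    // What an instrumented method looks like, expressed at the source level.
    static int sum(int counts, int val) {
        INVOCATIONS.incrementAndGet();       // injected at method entry
        int res = 0;
        for (int i = 0; i < counts; i++) {
            BACK_EDGES.incrementAndGet();    // injected on the loop back edge
            res += val;
        }
        return res;
    }

    public static void main(String[] args) {
        sum(10, 3);
        sum(5, 2);
        System.out.println(INVOCATIONS.get() + " invocations, "
                + BACK_EDGES.get() + " loop iterations");
        // prints: 2 invocations, 15 loop iterations
    }
}
```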

This is the way the HotSpot JVM obtains application profile data. Both the template code that is executed by the interpreter and C1-compiled code contain special snippets that gather statistics about every method. Once a method is recognized as hot, it is scheduled for (re)compilation in a more advanced mode for better performance. After several iterations, the highest optimization level is reached and the “final” native code version of the method, without any profiling parts, is generated (usually with the C2 compiler).

For example, let’s suppose the following small method is invoked frequently enough to be compiled with C1:

public static int testMethod(int counts, int val) {
    int res = 0;
    for (int i = 0; i < counts; i++) {
        res += val;
    }
    return res;
}

The following output will be generated with the -XX:+PrintAssembly diagnostic option enabled:

# {method} {0x0000000016930308} 'testMethod' '(II)I' in 'Test'
# parm0:    rdx       = int
# parm1:    r8        = int
#           [sp+0x40]  (sp of caller)
0x0000000002683b80: mov    DWORD PTR [rsp-0x6000],eax
0x0000000002683b87: push   rbp
0x0000000002683b88: sub    rsp,0x30
0x0000000002683b8c: movabs rax,0x16930560     ;   {metadata(method data for {method} {0x0000000016930308} 'testMethod' '(II)I' in 'Test')}
0x0000000002683b96: mov    esi,DWORD PTR [rax+0xdc]
0x0000000002683b9c: add    esi,0x8
0x0000000002683b9f: mov    DWORD PTR [rax+0xdc],esi
0x0000000002683ba5: movabs rax,0x16930300     ;   {metadata({method} {0x0000000016930308} 'testMethod' '(II)I' in 'Test')}
0x0000000002683baf: and    esi,0x1ff8
0x0000000002683bb5: cmp    esi,0x0
...

The highlighted commands (the ones updating DWORD PTR [rax+0xdc]) increment the invocation counter for testMethod. It will then be used to decide whether to recompile that method for better performance.

An instrumentation-based profiler can be implemented from outside the JVM as well. For that, you would instrument the bytecode of the application, e.g. with the help of the Javassist or ASM libraries. There are also frameworks that let you enable profiling of your application by slightly modifying its source code: adding an annotation to the target method, or an aspect to execute the profiling code around it.

However, instrumentation is not an option for an AOT-centric JVM such as Excelsior JET:

  1. We do no recompilation at run time, so we would have had to introduce a separate build step for producing an instrumented native version of the application, violating requirement (3) above;
  2. Furthermore, the performance of the instrumented version would have been poor, as if HotSpot were limited to the C1 compiler, always inserting profiling code into methods that would never get optimized with C2. There is no way users would agree to deploy such an application in production.

In other words, an instrumenting profiler in Excelsior JET wouldn’t act like a spy, but rather like Big Brother that watches your every step and annoys everyone.

Sampling Profilers

An alternative approach to instrumentation is sampling. A sampling profiler inspects the application and records information about its execution at regular intervals of time. Unlike a profile gathered with the help of instrumentation, such a profile is not exact, but a statistical approximation. In general, this approach can give you an acceptable profile, but the contribution of small, lightweight methods with execution times shorter than the sampling interval can be significantly understated. This should be taken into account when performing optimizations based on such a profile. The good thing is that sampling profilers usually have less impact on application performance and need less space to store the profile data.
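
A toy simulation makes the understatement visible. Here a made-up execution timeline spends 10% of its time in a short method, yet an unluckily aligned sampling interval misses it entirely:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SamplingSketch {
    // Sample the timeline every `interval` ticks and count what was running.
    public static Map<String, Integer> sample(List<String> timeline, int interval) {
        Map<String, Integer> profile = new TreeMap<>();
        for (int t = 0; t < timeline.size(); t += interval) {
            profile.merge(timeline.get(t), 1, Integer::sum);
        }
        return profile;
    }

    public static void main(String[] args) {
        // 9 ticks of heavyMethod followed by 1 tick of tinyMethod, repeated.
        List<String> timeline = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            timeline.add(i % 10 == 9 ? "tinyMethod" : "heavyMethod");
        }
        // Sampling every 10 ticks always lands on heavyMethod, so tinyMethod,
        // which accounts for 10% of the run time, never shows up:
        System.out.println(sample(timeline, 10));  // prints: {heavyMethod=10}
    }
}
```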

Sample Profiling on Safepoints

Safepoints are places in the code where:

  1. The executing thread can be parked on JVM’s request,
  2. Accurate call stack information is available.

The JVM suspends threads at safepoints for internal needs, e.g. for garbage collection or to retrieve their stack traces. It is guaranteed that any thread that is executing Java code can reach the nearest safepoint and stop there in a relatively short time [1].

Many sampling profilers (VisualVM, YourKit, JProfiler, the NetBeans profiler, etc.) inspect threads only at safepoints. They force all threads to stop at their nearest safepoints and then iterate over them, gathering the stack traces that are available there.

This functionality is even easier to implement than bytecode instrumentation. You just need to create a native thread, attach it to the JVM, and periodically call the ThreadMXBean.dumpAllThreads(…) method. It works exactly as described above: it stops threads at safepoints and retrieves their stack traces. All you have to do is record that information somewhere to form a profile [2].
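
As a sketch (the class and field names are made up, and this version runs inside the application itself rather than in an attached native thread), such a sampler boils down to calling dumpAllThreads in a loop and recording the tops of the returned stacks:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class SafepointSampler {
    // How many samples each top-of-stack method has accumulated.
    final Map<String, Integer> hits = new HashMap<>();
    private final ThreadMXBean bean = ManagementFactory.getThreadMXBean();

    // One sample: stop all threads at safepoints and record their stack tops.
    void sampleOnce() {
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SafepointSampler sampler = new SafepointSampler();
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) x += System.nanoTime();
        }, "worker");
        worker.setDaemon(true);
        worker.start();
        for (int i = 0; i < 50; i++) {   // sample every 10 ms for ~0.5 s
            sampler.sampleOnce();
            Thread.sleep(10);
        }
        worker.interrupt();
        System.out.println("collected " + sampler.hits.size() + " distinct stack tops");
    }
}
```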

As you can see, safepoint-based profilers are easy to implement and they can work everywhere: on every OS and CPU, and also on every JVM [3]. Unfortunately, these are their only advantages.

The bad news is that every production-grade JVM goes to great lengths to place as few safepoints as possible. You see, it is great to have full info about thread execution at every safepoint, but that does not come for free: safepoints cause performance degradation and increase the amount of data that has to be stored. So JIT and AOT compilers alike rely on code analysis and heuristics to identify places where safepoints are not actually necessary. For example, safepoints are usually placed on backward branches of loops and in method epilogues. However, one common optimization is to remove safepoints from so-called counted loops. (Those are loops that are guaranteed to end after a finite number of iterations.) This optimization improves performance greatly, but can sometimes cause trouble. Of course, JVMs usually have an option to disable it to avoid hanging on finite but long loops.

It turns out that a lot of code can execute between two sequential safepoints in the control flow of a thread, and all that code would be totally invisible to the profiler. For instance, in our previous sample method there is only one safepoint:

public static int testMethod(int counts, int val) {
    int res = 0;
    for (int i = 0; i < counts; i++) {
        res += val;
        // The loop is recognized as counted => no safepoints here
    }
    return res;
    // The only safepoint is in the epilogue
}

A safepoint-based profiler would have to wait until the loop has finished executing and only then inspect the thread, thus gathering no information about the code of the loop body!

Moreover, suspending all threads at safepoints is itself quite an expensive operation, so you just can’t inspect threads very frequently in this manner.

All in all, instead of observing threads in their natural habitat, safepoint-based profilers notice them only when they are trapped in cages. Such observation, of course, can give some information, but does not give the whole picture.

You can read more about safepoint-based profilers and their problems in the post “Why (Most) Sampling Java Profilers Are F*cking Terrible” by Nitsan Wakart.

AsyncGetCallTrace Profilers

So instrumentation is expensive and a profile gathered at safepoints is not very informative. It looks like we need to dive under the JVM’s hood to improve.

Let’s start with an OpenJDK internal API called AsyncGetCallTrace. Please note that it is not an official JVM API, which means that it can change unexpectedly and is most probably unavailable in JVMs that are not derived from OpenJDK. AsyncGetCallTrace (AGCT) was originally designed for Oracle Solaris Studio, but there are already several third-party profilers based on this API. The killer feature of these profilers is the ability to collect call chains between safepoints.

How can they achieve that?

AGCT-based profilers perform sampling via POSIX signals: they send the SIGPROF signal to the application at regular intervals of time. That signal interrupts one of the currently executing threads, which then runs a special handler for it. In effect, the more often a thread occupies the CPU, the more often it gets interrupted with SIGPROF.

The signal handler has access to the native execution context of the interrupted thread (registers and PC). It can therefore call the AsyncGetCallTrace method of the internal OpenJDK API, which attempts to compute the current call chain of a thread given its native context and stack contents.

The key point is that the interrupted thread doesn’t have to be at a safepoint. It can be anywhere, including non-Java code, but the AsyncGetCallTrace method can still retrieve its stack trace (at least it tries).

This is pretty cool, actually, because:

  1. AGCT profilers can inspect code between safepoints.
  2. Sampling via signals has some extra benefits:
    1. Only threads that actually occupy the CPU catch the signals, which is why this technique is sometimes called “honest” profiling. Such honesty is a good thing, as these threads are the most interesting targets when searching for hot methods.
    2. Each time, the profiler inspects only one thread (per CPU core). This means that the performance degradation caused by the profiler doesn’t depend on the total number of application threads, which can far exceed the number of CPU cores.

However, there are also some flies in the ointment:

  1. The AsyncGetCallTrace function does its best to obtain the stack trace, but that is just not always possible. The thread is not at a safepoint, after all, so there are many situations where AsyncGetCallTrace can say nothing about the thread and returns “unknown” as a result;
  2. AGCT profilers use POSIX signals and therefore cannot be ported to Windows;
  3. AGCT profilers have a blind spot: they do not see sleeping and hanging threads. It is okay when you are looking only for hot spots in the code, but sometimes you want to gather information about all threads.

Nevertheless, using AGCT is certainly a step up from suspending threads at safepoints.

You can find more information about profiling with AGCT in these great posts:

Prying Profiler in Excelsior JET

In Excelsior JET we had to avoid sampling via POSIX signals, as we needed a common solution for all supported platforms including Windows. At the same time, we have full control over all threads in the JVM and can forcibly suspend any thread at any moment by OS-specific means.

So we decided to sacrifice the “honesty” of the profiler for the sake of its universality.

We called our solution the “Prying Profiler”, because it collects context information from threads when they do not expect it, for example, between safepoints. Here is how it works:

  1. With the help of special VM options the user specifies a profiling mode and sampling interval,
  2. The JVM spawns a new high priority internal thread for the profiler,
  3. The profiler inspects Java threads with the specified frequency. For each thread, it performs the following steps:
    1. Decides whether it is necessary to inspect the thread,
    2. If yes, forcibly suspends the thread (right where it is, not necessarily at a safepoint),
    3. Grabs the context of the thread and restores its stack trace with the help of black magic (our version of the AsyncGetCallTrace function actually),
    4. Resumes the thread.
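
The steps above can be sketched in plain Java (all names are made up; in the real profiler, suspension is done by OS-specific means and the stack is reconstructed from the raw thread context, whereas here Thread.getStackTrace() stands in for that black magic and, unlike the real thing, cooperates with safepoints):

```java
import java.util.HashMap;
import java.util.Map;

public class PryingProfilerSketch extends Thread {
    // Accumulated profile: top-of-stack method name -> number of samples.
    final Map<String, Integer> profile = new HashMap<>();
    private final long intervalMs;
    volatile boolean stopped;

    PryingProfilerSketch(long intervalMs) {
        super("prying-profiler");
        this.intervalMs = intervalMs;
        setDaemon(true);
        setPriority(Thread.MAX_PRIORITY);        // step 2: high-priority thread
    }

    // Step 3.1: decide whether the thread is worth inspecting.
    private boolean shouldInspect(Thread t) {
        return t != this && t.getState() == Thread.State.RUNNABLE;
    }

    @Override public void run() {
        while (!stopped) {
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                if (!shouldInspect(t)) continue;
                // Steps 3.2-3.4: suspend, grab the context, resume.
                // Stubbed here: getStackTrace() does all three, at a safepoint.
                StackTraceElement[] stack = t.getStackTrace();
                if (stack.length > 0) {
                    profile.merge(stack[0].getMethodName(), 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMs);        // step 1: sampling interval
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```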

The main difference from a signal-based profiler is that we iterate over all threads on every inspection [4]. This means that the more live threads the application has, the greater the effect of profiling on its performance. From this point of view, a prying profiler is more like a safepoint-based profiler than a signal-based one.

At the same time, it inspects threads between safepoints just like AGCT. Moreover, when iterating over the thread list, it can filter out threads that do not need to be profiled: e.g., non-Java, dying, and newly born threads.

One of the challenges here is restoring the stack trace of a thread that was stopped at some arbitrary place and not filtered out during iteration over the thread list. Fortunately, most Java methods compiled with Excelsior JET have the same stack frame format. The frame is formed in the prologue of the method and then destroyed in its epilogue. If we are lucky, the suspended thread is somewhere between the prologue and the epilogue of a certain Java method. If that happens to be the case, it is fairly easy to retrieve the caller of the current method and its stack frame. A slightly worse situation is when we find ourselves inside a prologue or epilogue, but with the help of some tricks we can still retrieve the caller. However, there are also methods whose stack frames the compiler optimizes as much as possible [5]. For a thread suspended inside such a method, the only way to find the caller is through heuristics. Luckily, such methods are not very common.

Of course, we worried about the impact profiling would have on the performance of applications with many threads. However, it turned out that it is almost always possible to specify a sampling interval that yields a precise enough profile with acceptable performance degradation.

The default frequency is one inspection per millisecond, which usually causes about 10% degradation. Some benchmark results are published here.

So the Prying Profiler is not perfect, but it satisfies our requirements:

  • It produces execution profiles with good precision. We first verified it with Intel VTune, and the PGO results later confirmed that the collected profiles are precise enough;
  • It is easy to turn on, just enable a couple of VM options and you are done. No special build modes or anything;
  • It works everywhere, including Windows;
  • It works rather fast. Usually.

But can we do even better? The short answer is: yes, but you would have to give up portability:

Hardware Profiling

Many modern CPUs facilitate gathering an execution profile at the hardware level. Unfortunately, that technique is very CPU- and O/S-specific, no silver bullets here. We have prototyped a profiler based on Linux perf events. It is amazingly fast: about 1-2% performance degradation, no matter how many threads there are in the application.

However, it only works on Linux systems with kernel version >= 2.6.31, and only on CPUs with hardware counter support. Hardware profiling on other systems is possible, but usually only in ring 0, so the profiler has to be a system driver. Requiring a driver installation on each end-user machine is not something we can afford for our profiler.

The good news is that we already have the Prying Profiler, which works everywhere. This means we can use it as a fallback and enable the more effective profilers where it is possible and needed.

This post is already too long, so we will cover that profiler, and hardware profiling in general, in a separate post. One last thing: we called the hardware profiler prototype the “Clairvoyant Profiler”, because from the JVM’s point of view it looks just like a magic trick, and like any trick, it works only under very specific conditions.

Try it yourself!

The profiler we’ve built into our JVM was originally designed only to gather input data for profile-guided optimizations. And it does the job, you can already try both profiling and PGO with Excelsior JET 12.0.

However, it turned out to also be a very useful tool for debugging and performance audits. At the moment, only the developers of Excelsior JET can utilize the profiler for such purposes, but we plan to make it public via JVMTI, so stay tuned.

That’s all for this post! Hope it was interesting for you. Don’t miss the upcoming post about hardware profiling on different systems.

You can post comments below, or join the discussions on Reddit and/or Hacker News.

Footnotes

  1. However, the JVM specification contains neither a definition of that “short time” nor an algorithm for safepoint placement in the code.
  2. Some JVMs allow disabling such functionality to protect your application against reverse engineering. Read more about such protection in Excelsior JET here.
  3. Every JVM that has safepoints, that is.
  4. We are not the only ones who tried to make an effective profiler that iterates over JVM’s thread list. Here is the description of a JVM profiler from IBM, which also iterates over the thread list and suspends threads for sampling, but gathers the call chains in a different way.
  5. The compiler proves that stack iteration just can’t start from such methods: no exceptions are thrown there and there are no safepoints, so no call stack scanning for stack trace building or GC can begin there. What the compiler doesn’t know (and doesn’t care about, actually) is that there is a spy that wants to retrieve a stack trace from this method and account for it in the profile.