
Inside Java 9 – Performance, Compiler, and More

By Nicolai Parlog

Java 9 has a lot to offer besides modularity: new language features and a lot of new or improved APIs, GNU-style command options, multi-release JARs, improved logging, and more. Let’s explore this “more” and look at performance improvements, many thanks to string trickery, the compiler, garbage collection, and JavaDoc.

Performance Improvements

Java becomes more performant from release to release and 9 is no exception. There are a couple of interesting changes targeted at reducing CPU cycles or saving memory.

Compact Strings

When you look at a Java application’s heap and rip out all the object headers and pointers that we use to organize state, only raw data remains. What does it consist of? Primitives of course – many, many, many of which are chars, lumped together in char arrays that back String instances. As it turns out these arrays occupy somewhere between 20 % and 30 % of an average application’s live data (including headers and pointers). Any improvement in this area would be a big win for a huge portion of Java programs! And indeed, there is room for improvement.

A char takes up two bytes because it represents a full UTF-16 code unit but as it turns out the overwhelming majority of strings only require ISO-8859-1, which is a single byte. This is huge! With a new representation that only uses a single byte when possible, memory footprint caused by strings could be cut almost in half. This would reduce memory consumption for average applications by 10 % to 15 % and also reduce runtime by spending less time collecting garbage.

Of course that’s only true if it came without overhead. Free lunch anyone? JEP 254 gave it a try…

Implementation

In Java 8 String has a field char[] value – that’s the array we just discussed, which holds the string’s characters. The idea is to use a byte array instead and spend either one or two bytes per character, depending on the required encoding.

This may sound like a case for a variable-width encoding like UTF-8, where the distinction between one and two bytes is made per character. But then there’d be no way to predict which array slots a given character occupies, so random access (e.g. charAt(int)) would have to perform a linear scan. Degrading random access from constant to linear time was an unacceptable regression.

Instead, either each character can be encoded with a single byte, in which case this is the chosen representation, or if at least one of them requires two, two bytes will be used for all of them. A new field coder will denote how the bytes encode characters and many methods in String evaluate it to pick the correct code path.

When a new string is constructed in Java 8, the char array is usually created afresh and then populated from the constructor parameters. For example, when new String(myChars) is called, Arrays.copyOf is used to assign a copy of myChars to value. This is done to prevent sharing the array with user code, and there are only a select few cases where the array is not copied, for example when a string is created from another. So since the value array is never shared with code outside of String, the refactoring to a byte array is safe (yay for encapsulation). And because constructor arguments are copied anyway, transforming them adds no prohibitive overhead.

Here’s how that looks:

// this is a simplified version of a String constructor,
// where `char[] value` is the argument
if (COMPACT_STRINGS) {
    byte[] val = StringUTF16.compress(value);
    if (val != null) {
        this.value = val;
        this.coder = LATIN1;
        return;
    }
}
this.coder = UTF16;
this.value = StringUTF16.toBytes(value);

There are a couple of things to note here:

  • The boolean flag COMPACT_STRINGS, which is the implementation of the command line flag -XX:-CompactStrings and with which the entire feature can be disabled.
  • The utility class StringUTF16 is first used to try and compress the value array to single bytes and, should that fail and return null, convert it to double bytes instead.
  • The coder field is assigned the respective constant that marks which case applies.
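To make the compress step concrete, here is a toy sketch of the Latin-1 compression attempt. This is illustration code under my own assumptions, not the JDK’s actual StringUTF16 implementation:

```java
public class CompressSketch {

    // Try to store every char in a single byte (Latin-1);
    // return null as soon as one char needs two bytes.
    static byte[] compress(char[] chars) {
        byte[] latin1 = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] > 0xFF) {
                return null; // at least one char requires UTF-16
            }
            latin1[i] = (byte) chars[i];
        }
        return latin1;
    }

    public static void main(String[] args) {
        System.out.println(compress("hello".toCharArray()) != null);      // ASCII fits
        System.out.println(compress("h\u00E9llo".toCharArray()) != null); // é is Latin-1
        System.out.println(compress("\u03C0".toCharArray()) != null);     // Greek π is not
    }
}
```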

If you find this topic so interesting that you’re still awake at this point, I highly recommend watching Aleksey Shipilev’s instructive and entertaining talk on compact strings and indyfied string concatenation with the great subtitle:

Why those [expletive] [expletive] [expletive] cannot do the feature in a month, but spend a year instead?!

Performance

Before we really look at performance there’s a nifty little detail to observe. The JVM 8-byte-aligns objects in memory, which means that when an object takes up less than a multiple of 8 bytes, the rest is wasted. In the JVM’s most common configuration, a 64-bit VM with compressed references, a String requires 20 bytes (12 for the object header, 4 for the value array, and a final 4 for the cached hash) – which leaves 4 more bytes to squeeze in the coder field without adding to the footprint. Nice.
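The alignment math from the paragraph above, written out (using the byte counts the text cites for a 64-bit VM with compressed references):

```java
public class AlignmentMath {
    public static void main(String[] args) {
        int header = 12;   // object header
        int valueRef = 4;  // compressed reference to the value array
        int hash = 4;      // cached hash code
        int used = header + valueRef + hash;  // 20 bytes of actual state
        int aligned = (used + 7) / 8 * 8;     // the JVM pads up to the next multiple of 8
        int spare = aligned - used;           // room left for the byte-sized coder field
        System.out.println(used + " " + aligned + " " + spare);
    }
}
```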

Compact strings are largely a memory optimization so it would make sense to observe the garbage collector. Trying to make sense of G1’s logs was beyond the scope of this post, though, so I focused on runtime performance. This makes sense because if strings require less memory, creating them should also be faster.

To gauge runtime performance I ran this code:

// requires: import java.util.List; import java.util.stream.IntStream;
// and a static import of java.util.stream.Collectors.toList
long launchTime = System.currentTimeMillis();
List<String> strings = IntStream.rangeClosed(1, 10_000_000)
        .mapToObj(Integer::toString)
        .collect(toList());
long runTime = System.currentTimeMillis() - launchTime;
System.out.println("Generated " + strings.size() + " strings in " + runTime + " ms.");

launchTime = System.currentTimeMillis();
String appended = strings.stream()
        .limit(100_000)
        .reduce("", (left, right) -> left + right);
runTime = System.currentTimeMillis() - launchTime;
System.out.println("Created string of length " + appended.length() + " in " + runTime + " ms.");

First it creates a list of ten million strings, then it concatenates the first 100,000 of them in a spectacularly naive way. And indeed, running the code either with compact strings (the default on Java 9) or without (with -XX:-CompactStrings), I observed a considerable difference:

# with compact strings
Generated 10000000 strings in 1044 ms.
Created string of length 488895 in 3244 ms.
# without compact strings
Generated 10000000 strings in 1075 ms.
Created string of length 488895 in 7005 ms.

Now, whenever somebody talks about microbenchmarks like this, you should immediately mistrust them if they don’t use JMH. But in this case I didn’t want to go through the potential trouble of running JMH with Java 9, so I took the easy way out. This means the results could be total rubbish because some optimization or other screwed me over. Hence take them with a truckload of salt and see them as a first indication rather than proof of improved performance.

But you don’t have to trust me. In the talk linked above Aleksey shows his measurements, starting at 36:30, citing 1.36x better throughput and 45 % less garbage.

Indified String Concatenation

Quick repetition of how string concatenation works… Say you write the following:

String s = greeting + ", "  + place + "!";

Then the compiler will create bytecode that uses a StringBuilder to create s by first appending the individual parts and then calling toString to get the result. At runtime, the JIT compiler may recognize these append chains and, if it does, boost performance considerably. It will generate code that checks the arguments’ lengths, creates an array of the correct size, copies the characters straight into that array, and, voilà, wraps it into a String.
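Written out by hand, that pre-Java-9 desugaring looks roughly like this:

```java
public class ConcatDesugared {
    public static void main(String[] args) {
        String greeting = "Hello";
        String place = "world";
        // what javac emitted for `greeting + ", " + place + "!"` before Java 9
        String s = new StringBuilder()
                .append(greeting)
                .append(", ")
                .append(place)
                .append("!")
                .toString();
        System.out.println(s);
    }
}
```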

It doesn’t get better than that but recognizing these append-chains and proving they can be replaced with the optimized code is not trivial and breaks down quickly. Apparently all you need is a long or a double in that concatenation and the JIT will not be able to optimize.

But why so much effort? Why not just have a method String.concat(String... args) that the bytecode calls? Because creating a varargs array on a performance-critical path is not the best idea. Also, primitives don’t really go well with that unless you toString all of them beforehand, which in turn prevents stringifying them straight into the target array. And don’t even think about String.concat(Object... args), which would box every primitive.

So another solution is needed to get better performance. The next best thing is to let javac emit better bytecode but that has drawbacks as well:

  • Every time a new optimization is implemented, the bytecode changes again.
  • For users to profit from these optimizations, they have to recompile their code – something Java generally avoids if feasible.
  • Since all JVMs should be able to JIT compile all variants, the testing matrix explodes.

So what else can be done? Maybe, an abstraction is missing here? Can’t the bytecode just declare the intent of “concat these things” and let the JVM handle the rest?

Yes, this is pretty much the solution employed by JEP 280 – at least for the former part. Thanks to the magic of invokedynamic, the bytecode can express the intent and arguments (without boxing) but the JVM does not have to provide that functionality and can instead route back into the JDK for an implementation. This is great because within the JDK all kinds of private APIs can be used for various tricks (javac can only use public APIs).

Let me once again refer you to Aleksey’s talk – the second half, starting at 37:58, covers this part. It also contains some numbers, which show a speed-up of up to 2.6x and up to 70 % less garbage – and this is without compact strings!

Another Mixed Bag

There’s another string-related improvement but this one I didn’t quite get. As I understand it different JVM processes can share loaded classes via class-data sharing (CDS) archives. In these archives strings in the class data (more precisely, the constant pool) are represented as UTF-8 strings and turned into String instances on demand. The memory footprint can be reduced by not always creating new instances but sharing them across different JVMs. For the garbage collector to cooperate with this mechanism it needs to provide a feature called pinned regions, which only G1 has. This understanding seems to be clashing with the JEP’s title Store Interned Strings in CDS Archives, so if this interests you, you should take a look for yourself. (JEP 250)

A basic building block of Java concurrency is the monitor – each object has one and each monitor can be owned by at most one thread at a time. For a thread to gain ownership of a monitor, it must call a synchronized method declared by that object or enter a synchronized block that synchronizes on the object. If several threads try to do that at the same time, all but one are placed in a wait set and the monitor is said to be contended, which creates a performance bottleneck. For one, the application itself wastes time waiting, but on top of that the JVM has to do some work orchestrating the lock contention and choosing a new thread once the monitor becomes available again. Java 9 refines this orchestration, which should improve performance in highly contended code. (JEP 143)
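As a refresher, this is the kind of code where many threads compete for a single object’s monitor (a minimal sketch):

```java
public class Contention {
    private static final Object lock = new Object();
    private static int count = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    synchronized (lock) { // all eight threads contend for this one monitor
                        count++;
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(count); // 800000 – the monitor made the increments atomic
    }
}
```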

In Java 2D all anti-aliasing (except for fonts) is performed by a so-called rasterizer. This is an internal subsystem with no API available to Java developers. But it lies on the hot path and its performance is crucial for many graphics-intensive applications. OpenJDK uses Pisces, Oracle JDK uses Ductus, where the former shows much poorer performance than the latter. Pisces is now to be replaced with the Marlin graphics renderer, which promises superior performance at the same quality and accuracy. It is likely that Marlin will match Ductus in terms of quality, accuracy, and single-threaded performance and even surpass it in multithreaded scenarios. (JEP 265, some history and context)

Anecdotal evidence suggests that running an application with an active security manager degrades performance by 10 % to 15 %. An effort was undertaken to reduce this gap with various small optimizations. (JEP 232)

The SPARC and Intel CPUs recently introduced instructions that are well-suited for cryptographic operations. These were used to improve performance of GHASH and RSA computation. (JEP 246)


Garbage Collection

One of Java 9’s most contested changes, second only to Project Jigsaw, is that Garbage First (G1) will become the new default garbage collector (JEP 248). Lucky for me, it has been production-ready since Java 8, so I don’t really have to discuss it now.

Quick summary: G1 limits pause times and gives up some throughput to achieve that. Implementation-wise, it does not separate the heap into contiguous spaces like Eden, young, and old but into fixed-size regions, where G1 assigns a role to a region when it starts using it and resets the role once it has collected the region’s entire content. Speaking of collections, those focus on the regions with the most garbage – hence the name – because that promises the least work.

G1 is an interesting beast and I recommend taking some time to look at it. If you don’t want to do that by yourself, stick around because this channel will discuss it soon. One nice detail I found is string deduplication (introduced in 8u20 by JEP 192), where G1 identifies String instances that have equal value arrays and makes them share the same array instance. Apparently, duplicate strings are common and this optimization saves about 10 % of heap space – that was before compact strings, though, so maybe it’s closer to 5 % now.
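String deduplication is off by default. On a G1-collected JVM it can be switched on with a flag (the application name here is illustrative):

```shell
# string deduplication only works with G1, which is the default collector in Java 9
java -XX:+UseG1GC -XX:+UseStringDeduplication MyApp
```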

Finally, JEP 214 removed some GC options that JEP 173 deprecated.


Compiler

Compile for Older Java Versions

Have you ever used the -source and -target options to compile your code for an older JRE, only to see it crash at runtime because some method call failed with a seemingly inexplicable error? A possible reason is that you forgot to specify -bootclasspath. Without it, the compiler links against the current version’s core library API, which can make the bytecode incompatible with older versions. To fix this common operating error, the Java 9 compiler comes with a --release flag that sets all three options to consistent values.
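In practice the difference looks like this (the rt.jar path is illustrative):

```shell
# error-prone: -source and -target alone still link against the current JDK's API
javac -source 8 -target 8 -bootclasspath /path/to/jdk8/rt.jar Main.java

# Java 9: one flag keeps source, target, and the linked API in sync
javac --release 8 Main.java
```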

JVM Compiler Interface

I found this extremely interesting! JEP 243 developed the Java Virtual Machine Compiler Interface (JVMCI) – a set of Java interfaces, implementations of which the JVM can use to perform just-in-time compilation, thus replacing the C1/C2 compiler. This is still an experimental feature and has to be explicitly activated on the command line but the trajectory is clear: to have a JIT compiler that is implemented in Java. A likely candidate is the one developed by the Graal project, which already implements JVMCI.

In case you’re asking yourself “But why?”, here’s what the JEP has to say about that:

An optimizing compiler is a complex piece of software that benefits greatly from the features provided by Java such as automatic memory management, exception handling, synchronization, excellent (and free) IDEs, excellent unit testing support and runtime extensibility via service loaders just to name a few. In addition, a compiler does not require the low-level language features required by many other JVM subsystems such as the bytecode interpreter and garbage collector. These observations strongly suggest that writing a JVM compiler in Java should allow production of a high quality compiler that will be easier to maintain and improve than existing compilers developed in C or C++.

Makes sense, right?

Ahead-of-Time Compilation

Java is all “Write Once, Run Anywhere” and that’s great but what if you’re not willing to pay for that? If you want to spin up a JVM for a single method call (did anybody say serverless?), then the JIT is not going to do you much good – for maximum performance you need machine code before you even launch.

Enter ahead-of-time compilation (JEP 295)! With it, you can use the Graal compiler shipping with your local JDK to compile the code that you’re going to use and then tell java to use those artifacts instead of the bytecode it has lying around. Here’s a little snippet from the JEP that compiles user code and the required JDK module and launches Java with them:

jaotc --output libHelloWorld.so HelloWorld.class
jaotc --output libjava.base.so --module java.base
java9 -XX:AOTLibrary=./libHelloWorld.so,./libjava.base.so HelloWorld

This raises a number of questions:

  • What if the bytecode and machine code versions of a module do not align?
  • What if the runtime is launched with different VM parameters than the code was compiled against? (Compressed object pointers, for example.)
  • Should the compiled code collect profiling information to allow further optimization?

These and more are of course being addressed and the JEP is a good source for answers.

Much like the JVMCI, this is clearly an experimental feature (I didn’t even find it in the most recent build) and it still has severe limitations – most notably that it only works on 64-bit Linux systems and can only compile the java.base module. But it’s an interesting direction Java is going into.

Internals

The compiler got a little performance boost as well. In certain scenarios (e.g. for nested lambdas) type inference would have an exponential runtime – not good. Tiered attribution fixes that. (JEP 215, 2 minute video summary)
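A shape that stressed the old inference engine is nested lambdas whose types all have to be inferred together (this particular example is mine, not from the JEP):

```java
import java.util.function.Function;

public class NestedLambdas {
    public static void main(String[] args) {
        // every nesting level adds to the work type inference has to do:
        // the types of a, b, and c are all inferred from the target type
        Function<Integer, Function<Integer, Function<Integer, Integer>>> add3 =
                a -> b -> c -> a + b + c;
        System.out.println(add3.apply(1).apply(2).apply(3));
    }
}
```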

Since the annotations pipeline was created for Java 5 it had to be extended several times to accommodate new features like repeating annotations, annotations on types, and new syntactically valid positions due to lambda expressions. “[It] could not handle such cases out of the box; as a result, the original design has been stretched in order to accommodate the new use cases, leading to a brittle and hard-to-maintain implementation.” For Java 9 a complete redesign, the annotations pipeline 2.0, was created and implemented. It does not add any features but should provide a better basis for future extensions. (JEP 217, Annotations Pipeline 2.0 Project)

Then there’s something about enabling runtime-manageable, method-dependent compiler flags, which I don’t get at all. (JEP 165)

Finally, JEP 237 integrated the JDK 9 port for Linux/AArch64 into OpenJDK.

JavaDoc

Did you take a look at Java 9’s provisional Javadoc? Did you notice anything new? If not, go there now, locate the text box in the upper right corner and start typing the name of a JDK class. Neat, right? (JEP 225)

Javadoc can generate HTML 5 pages now, complete with “structural elements such as header, footer, nav, etc.” and improved accessibility thanks to ARIA. (JEP 224)

Saving the best for last, at least regarding JavaDoc, I want to finish with the new Doclet API. Doclets are JavaDoc plugins that you can create yourself to process your Javadoc comments (you are commenting your code, right?). The old API was extremely esoteric, with the most, err, interesting feature being that you had to create static methods with just the right names so that the tool could call into your plugin. (Was that before interfaces or what?) The new API does away with such craziness. It also gives access to the Language Model API and DocTree API to let you navigate the source code and create output. (JEP 221)

No More!

This concludes the Java 9 tour for now. Ignoring a couple of small things, these three articles present everything Java 9 has to offer.

But so far we’ve just scratched the surface – over the course of the next months we will publish more about Java 9, going into detail on any number of the topics. So watch this space, for example via RSS, or subscribe to our newsletter.
