Are we There Yet? When do we Move to GraalVM?

Before I get into the subject of this week's post, I have other great news. I just sent the last (14th) chapter to the publisher for my upcoming debugging book titled: “Practical Debugging at Scale. Cloud Native Debugging in Kubernetes and Production”. It’s been fun to write and I hope you’ll enjoy reading it. Follow me or watch this space for an update with a link to pre-order. It should come in soon…

This is the perfect time to raise the question, just as Spring Native is coming to the forefront: is it time to move to GraalVM? Spoiler: it depends. Yes, if you’re building serverless. Probably not if you’re building pretty much anything else, with a few exceptions for some microservices.

Before I begin, I want to qualify that I’m talking about native image (SubstrateVM), which is what most people mean when they say GraalVM. That specific feature took over a much larger and more ambitious project that includes some amazing capabilities such as polyglot programming. GraalVM native images let us compile our Java projects to native code. The compiler analyzes the application and removes code that isn’t needed, which can reduce the size and startup time of a binary significantly. I’ve seen a 10-20x improvement in startup time, and that’s a lot. RAM usage is also much lower, sometimes by a similar scale but usually not as significantly.

Perhaps ironically GraalVM could even be more secure than a typical JVM since it lacks dynamic features. An object serialization attack would probably be much harder to conduct against GraalVM. Hopefully, I didn’t inadvertently challenge every security researcher to prove me wrong right now…

The Downsides

I’m a performance geek. With these numbers I would normally rush to compile my apps to native code and get on my way to faster performance. But the situation isn’t as clear cut. A JVM might still perform better in production for longer running processes. Traditional deployments aren’t as dependent on startup time or even on RAM. There are some microservices where both might have an impact, but in most cases, this isn’t the only consideration.

First off, the performance story for GraalVM is more nuanced than just startup time. For long-running large applications, the startup time and memory differences aren’t as significant. Runtime performance isn’t clear-cut and can often favor other JVMs. Even that comes with caveats, since native image supports profiling to generate optimization hints for the compiler, along with other interesting tools.

Some libraries are challenging to adapt to native image (e.g. FreeMarker). There are tricks, such as the tracing agent, which you can run on a regular JVM to detect dynamic code. Using the results of the tracing agent run, you can package a native app. But this is a more complex project than just adding a dependency in Maven.
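To make “dynamic code” concrete, here is a minimal sketch of the kind of reflective lookup that native image can’t resolve on its own. The class and method names (`ReflectiveGreeter`, `greetVia`) are hypothetical and only serve the illustration. Because the class name arrives as a string at run time, static analysis can’t know which class to keep in the binary. Running such code on a regular JVM with the tracing agent attached (for example via `-agentlib:native-image-agent=config-output-dir=<some-dir>`, where the directory is a placeholder) should record the reflective access so it can be fed into the native build.

```java
import java.lang.reflect.Method;

// Hypothetical example class; the names are illustrative only.
public class ReflectiveGreeter {

    public static class EnglishGreeting {
        public String greet(String name) {
            return "Hello, " + name;
        }
    }

    // The class name is only known at run time, so a closed-world analysis
    // cannot tell which class and method must be kept in the native binary.
    static String greetVia(String className, String name) {
        try {
            Class<?> type = Class.forName(className);
            Object instance = type.getDeclaredConstructor().newInstance();
            Method greet = type.getMethod("greet", String.class);
            return (String) greet.invoke(instance, name);
        } catch (ReflectiveOperationException e) {
            // On a native image without reflection metadata, this is
            // roughly where the failure would surface.
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real application the name might come from a config file.
        String className = args.length > 0
                ? args[0]
                : EnglishGreeting.class.getName();
        System.out.println(greetVia(className, "GraalVM"));
    }
}
```

On a standard JVM this just works, which is exactly why such code tends to go unnoticed until the first native build.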

GraalVM compilation speed has improved considerably over the past few years, but it’s still much slower than a typical Maven build. But this isn’t the worst part. The worst part is the relatively lean observability story.

To be clear, tools like JFR and other capabilities of Java are supported. Even JMX is coming. This means you can use jconsole and other amazing JVM capabilities on a native executable. That is spectacular. You can even debug the native executable: IntelliJ IDEA just added native support for debugging the executable directly. Another tremendous improvement!

But some things still aren’t supported. The Agent API is where many JVM-level extensions reside. Apparently there’s some work on bringing support for these features, but it probably won’t cover everything the existing tools rely on. Still, it would be an enormous boost.

So Should We Use It?

The last time I started a Spring Boot project was well before 3.0, and I picked an early preview of Spring Native as an option in the Initializr creation wizard. So I am very much for experimenting with GraalVM. I think it’s an amazing option that’s rapidly becoming more compelling with time. In fact, for many CLI tools it’s probably the best option already.

Whether we should use it in production is a different question. For some cases it just won’t be practical. If you build GUI applications or rely on dynamic class loading, then the question is moot. But again, Spring Boot 3 is very exciting and I’m eager to migrate projects to it (and JDK 17). When we migrate these projects, should I aim for “native first”?

As it stands right now, I intend to have a GraalVM native image build target in the CI. However, we will probably deploy in the cloud with a standard JVM. The main reasons behind that are all the above, but most of all the observability and familiarity aspects. When we build for scale, the individual performance of a specific node isn’t as significant. It’s important, but the bigger picture and the ability to fix issues at scale is even more important.

Imagine going through traces between multiple servers and looking at timings and following an issue to its root cause. This is at the core of high scale production issues. Traces help us understand the root cause of a performance issue or failure. The cool thing about traces is that they’re free. Free as in, we don’t need to write much code for them. Our code gets instrumented seamlessly to include that functionality.

The rising star in the world of tracing is OpenTelemetry, and it uses an agent API. It isn’t unique in that field; the agent API is prevalent in the industry. Without the agent API many features that are essential for high scale systems (tracing, developer observability, error handling, APMs, etc.) are effectively gone.
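As a rough illustration of what the agent API offers, here is a minimal sketch of a Java agent built on `java.lang.instrument`. The class names (`LoadLoggingAgent`, `RecordingTransformer`) are hypothetical, and this is not how any particular tracing tool is implemented. A real agent would be packaged in a jar with a `Premain-Class` manifest entry and attached with `-javaagent`, and it would typically rewrite bytecode rather than just record class names.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical agent for illustration only.
public class LoadLoggingAgent {

    // Internal names (e.g. "java/util/ArrayList") of classes seen so far.
    static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    public static class RecordingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null) {
                SEEN.add(className);
            }
            // Returning null tells the JVM to keep the bytecode unchanged.
            // A tracing agent would return instrumented bytecode here.
            return null;
        }
    }

    // The JVM calls this before main() when started with -javaagent.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new RecordingTransformer());
    }

    // Small demonstration that calls the transformer directly,
    // without attaching the agent to a JVM.
    public static void main(String[] args) {
        new RecordingTransformer()
                .transform(null, "com/example/Demo", null, null, new byte[0]);
        System.out.println(SEEN);
    }
}
```

The hook works because a JVM loads bytecode at run time and lets the transformer see it first. A native image has already been compiled ahead of time, which is why this style of seamless instrumentation doesn’t carry over directly.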

When is it Fine?

Serverless is the ideal case. While it also needs developer observability tools, it already has some issues with such extensions. E.g. Lambda fails with some agent configurations. Notice that there are tracing solutions for AWS, so that aspect is addressed. For serverless, using a native image saves costs and speeds up results, and it does so with no real downside. This is a straightforward decision.

In other cases, I try to keep my finger on the pulse since these things flip overnight. That’s why I recommend experimenting with GraalVM right now. This will position you well for a future where we might shift VMs.

The reason we still deploy with standard JVMs is that our deployments don’t see a noticeable advantage from GraalVM at this time. We don’t have the scale or the spin-up/spin-down costs that would make the transition worthwhile.

Finally

In a discussion on Twitter, I predicted it will take 10 years for 50% of Java developers to move to GraalVM unless Leyden suddenly changes the dynamics and makes the standard JVM much more efficient. Java developers are slow to move; I consider that a feature, not a bug.

At version 22, GraalVM is in a completely different position than it was a mere 3 years ago. Its auxiliary tooling and 3rd party support are both finally picking up and it’s poised to cross the chasm. I think it already did that for CLI tools. Even if you don’t pick it up right now, you should try it because there’s a lot to learn when working with it.

One of the biggest benefits it brings with it is attention to the reflective code we have all over our applications. Reducing that code will improve the quality of the application and increase the share of imperative logic, which is easier to debug. It will make failures clearer and will probably improve performance for regular JVMs too. The work vendors need to do to support GraalVM is great for all of us.
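Here is a small sketch of what trading reflection for imperative logic can look like. Instead of instantiating a handler by class name, the code keeps an explicit map from keys to constructors. All names (`ExplicitRegistry`, `Handler`, the two handlers) are hypothetical and only illustrate the idea.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical example; the names are illustrative only.
public class ExplicitRegistry {

    interface Handler {
        String handle(String input);
    }

    static class UpperCaseHandler implements Handler {
        @Override
        public String handle(String input) {
            return input.toUpperCase(Locale.ROOT);
        }
    }

    static class ReverseHandler implements Handler {
        @Override
        public String handle(String input) {
            return new StringBuilder(input).reverse().toString();
        }
    }

    // Every handler is referenced directly in code, so both the compiler
    // and a static analysis can see exactly which classes are in use.
    private static final Map<String, Supplier<Handler>> HANDLERS = Map.of(
            "upper", UpperCaseHandler::new,
            "reverse", ReverseHandler::new);

    static Handler create(String key) {
        Supplier<Handler> factory = HANDLERS.get(key);
        if (factory == null) {
            // A clear failure at the lookup site, with no reflection involved.
            throw new IllegalArgumentException("Unknown handler: " + key);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        System.out.println(create("upper").handle("graal"));
        System.out.println(create("reverse").handle("graal"));
    }
}
```

A typo in a key now fails with an explicit message at the lookup, and stepping through `create` in a debugger shows plain method calls rather than reflective plumbing.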

I also barely touched the polyglot aspects of GraalVM, which are some of its most exciting features. Integrating Java and Python code into a native binary is a powerful proposition.