Faster startup with Spring Boot 3.2 and CRaC, part 1 - Automatic checkpoint

// Magnus Larsson

With Spring Boot 3.2 and Spring Framework 6.1, we get support for Coordinated Restore at Checkpoint (CRaC), a mechanism that enables Java applications to start up faster. With Spring Boot, we can use CRaC in a simplified way, known as Automatic Checkpoint/Restore at startup. Even though it is not as powerful as the standard way of using CRaC, this blog post will show an example where the startup time of Spring Boot applications is decreased by 90%. The sample applications are based on chapter 6 in my book on building microservices with Spring Boot.

Overview

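The blog post is divided into the following sections:

  1. Introducing CRaC, benefits, and challenges
  2. Creating CRaC-based Docker images with a Dockerfile
  3. Trying out CRaC with automatic checkpoint/restore
  4. Summary
  5. Next blog post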

Let’s start learning about CRaC and its benefits and challenges.

1. Introducing CRaC, benefits, and challenges

Coordinated Restore at Checkpoint (CRaC) is a feature in OpenJDK, initially developed by Azul, to enhance the startup performance of Java applications by allowing them to restore to a previously saved state quickly. CRaC enables Java applications to save their state at a specific point in time (checkpoint) and then restore from that state at a later time. This is particularly useful for scenarios where fast startup times are crucial, such as serverless environments, microservices, and, in general, applications that must be able to scale up their instances quickly and also support scale-to-zero when not being used.

This introduction will first explain a bit about how CRaC works, then discuss some of the challenges and considerations associated with it, and finally describe how Spring Boot 3.2 integrates with it. The introduction is divided into the following subsections:

  1.1. How CRaC Works
  1.2. Challenges and Considerations
  1.3. Spring Boot 3.2 integration with CRaC

1.1. How CRaC Works

  1. Checkpoint Creation: At a chosen point during the application’s execution, a checkpoint is created. This involves capturing the entire state of the Java application, including the heap, stack, and all active threads. The state is then serialized and saved to the file system. During the checkpoint process, the application is typically paused to ensure a consistent state is captured. This pause is coordinated to minimize disruption and ensure the application can resume correctly.

    Before taking the checkpoint, some requests are usually sent to the application to ensure that it is warmed up, i.e., all relevant classes are loaded, and the JVM HotSpot engine has had a chance to optimize the bytecode according to how it is being used at runtime.

    Commands to perform a checkpoint:

     java -XX:CRaCCheckpointTo=<some-folder> -jar my_app.jar
     # Make calls to the app to warm up the JVM...
     jcmd my_app.jar JDK.checkpoint
    
  2. State Restoration: When the application is started from the checkpoint, the previously saved state is deserialized from the file system and loaded back into memory. The application then continues execution from the exact point where the checkpoint was taken, bypassing the usual startup sequence.

    Command to restore from a checkpoint:

     java -XX:CRaCRestoreFrom=<some-folder>
    

Restoring from a checkpoint allows applications to skip the initial startup process, including class loading, warmup initialization, and other startup routines, significantly reducing startup times.

For more information, see Azul’s documentation, What is CRaC?

1.2. Challenges and Considerations

As with any new technology, CRaC comes with a new set of challenges and considerations:

  1. State Management: Open files and connections to external resources, such as databases, must be closed before the checkpoint is taken. After the restore, they must be reopened.

    CRaC exposes a Java lifecycle interface that applications can use to handle this, org.crac.Resource, with the callback methods beforeCheckpoint and afterRestore.

  2. Sensitive information: Credentials and secrets stored in the JVM’s memory will be serialized into the files created by the checkpoint. Therefore, these files need to be protected. An alternative is to run the checkpoint command against a temporary environment that uses other credentials and replace the credentials on restore.

  3. Linux dependency: The checkpoint technique is based on a Linux feature called CRIU, “Checkpoint/Restore In Userspace”. This feature only works on Linux, so the easiest way to test CRaC on a Mac or a Windows PC is to package the application into a Linux Docker image.

  4. Linux privileges required: CRIU requires special Linux privileges. As a result, the Docker commands that build Docker images and create Docker containers also need these privileges to be able to run.

  5. Storage Overhead: Storing and managing checkpoint data requires additional storage resources, and the checkpoint size can impact the restoration time. The original jar file is also required to be able to restart a Java application from a checkpoint.

I will describe how to handle these challenges in the section on creating Docker images.

1.3. Spring Boot 3.2 integration with CRaC

Spring Boot 3.2 (and the underlying Spring Framework) helps with closing and reopening connections to external resources. Before the creation of the checkpoint, Spring stops all running beans, giving them a chance to close resources if needed. After a restore, the same beans are restarted, allowing them to reopen connections to the resources.

The only thing that needs to be added to a Spring Boot 3.2-based application is a dependency on the crac library. Using Gradle, it looks like the following in the build.gradle file:

dependencies {
    implementation 'org.crac:crac'
}

Note: The normal Spring Boot BOM mechanism takes care of versioning the crac dependency.

The automatic closing and reopening of connections handled by Spring Boot usually works. Unfortunately, when this blog post was written, some Spring modules lacked this support. To track the state of CRaC support in the Spring ecosystem, a dedicated test project, Spring Lifecycle Smoke Tests, has been created. The current state can be found on the project’s status page.

If required, an application can register callback methods to be called before a checkpoint and after a restore by implementing the above-mentioned Resource interface. The microservices used in this blog post have been extended to register callback methods to demonstrate how they can be used. The code looks like this:

import org.crac.*;

public class MyApplication implements Resource {

  public MyApplication() {
    Core.getGlobalContext().register(this);
  }

  @Override
  public void beforeCheckpoint(Context<? extends Resource> context) {
    LOG.info("CRaC's beforeCheckpoint callback method called...");
  }

  @Override
  public void afterRestore(Context<? extends Resource> context) {
    LOG.info("CRaC's afterRestore callback method called...");
  }
}

Spring Boot 3.2 provides a simplified alternative for taking a checkpoint compared to the default on-demand alternative described above. It is called automatic checkpoint/restore at startup. It is triggered by adding the JVM system property -Dspring.context.checkpoint=onRefresh to the java -jar command. When set, a checkpoint is created automatically when the application is started. The checkpoint is created after Spring beans have been created but not started, i.e., after most of the initialization work but before the application starts. For details, see Spring Boot docs and Spring Framework docs.

With an automatic checkpoint, we don’t get a fully warmed-up application, and the runtime configuration must be specified at build time. This means that the resulting Docker images will be runtime-specific and contain sensitive information from the configuration, like credentials and secrets. Therefore, the Docker images must be stored in a private and protected container registry.

Note: If this doesn’t meet your requirements, you can opt for the on-demand checkpoint, which I will describe in the next blog post.

With CRaC and Spring Boot 3.2’s support for CRaC covered, let’s see how we can create Docker images for Spring Boot applications that use CRaC.

2. Creating CRaC-based Docker images with a Dockerfile

While learning how to use CRaC, I studied several blog posts on using CRaC with Spring Boot 3.2 applications. They all use rather complex bash scripts (depending on your bash experience) using Docker commands like docker run, docker exec, and docker commit. Even though they work, it seems like an unnecessarily complex solution compared to producing a Docker image using a Dockerfile.

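So, I decided to develop a Dockerfile that runs the checkpoint command as a RUN command in the Dockerfile. It turned out to have its own challenges, as described below. I will begin by describing my initial attempt and then explain the problems I stumbled into and how I solved them, one by one, until I reach a fully working solution. The walkthrough is divided into the following subsections:

  2.1. First attempt
  2.2. Problem #1, privileged builds with docker build
  2.3. Problem #2, CRaC returns exit status 137 instead of 0
  2.4. Problem #3, Runtime configuration
  2.5. Problem #4, Spring Data JPA
  2.6. The resulting Dockerfile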

Let’s start with a first attempt and see where it leads us.

2.1. First attempt

My initial plan was to create a Dockerfile based on a multi-stage build, where the first stage creates the checkpoint using a JDK-based base image, and the second stage uses a JRE-based base image for runtime. However, while writing this blog post, I failed to find a base image for a Java 21 JRE supporting CRaC. So I changed my mind and used a regular Dockerfile instead, with a base image from Azul: azul/zulu-openjdk:21.0.3-21.34-jdk-crac

Note: Bellsoft also provides base images for CRaC; see Liberica JDK with CRaC Support as an alternative to Azul.

The first version of the Dockerfile looks like this:

FROM azul/zulu-openjdk:21.0.3-21.34-jdk-crac

ADD build/libs/*.jar app.jar

RUN java -Dspring.context.checkpoint=onRefresh -XX:CRaCCheckpointTo=checkpoint -jar app.jar

EXPOSE 8080
ENTRYPOINT ["java", "-XX:CRaCRestoreFrom=checkpoint"]

Unfortunately, this Dockerfile can’t be used as is, since creating a CRaC checkpoint requires the build to run privileged commands.

2.2. Problem #1, privileged builds with docker build

As mentioned in the section 1.2. Challenges and Considerations, CRIU, which CRaC is based on, requires special Linux privileges to perform a checkpoint. The standard docker build command doesn’t allow privileged builds, so it can’t be used to build Docker images using the above Dockerfile.

Note: The --privileged flag that can be used in docker run commands is not supported by docker build.

Fortunately, Docker provides an improved builder backend called BuildKit. Using BuildKit, we can create a custom builder that is insecure, meaning it allows a Dockerfile to run privileged commands. To communicate with BuildKit, we can use Docker’s CLI tool buildx.

The following command can be used to create an insecure builder named insecure-builder:

docker buildx create --name insecure-builder --buildkitd-flags '--allow-insecure-entitlement security.insecure'

Note: The builder runs in isolation within a Docker container created by the docker buildx create command. You can run a docker ps command to reveal the container. When the builder is no longer required, it can be removed with the command: docker buildx rm insecure-builder.

The insecure builder can be used to build a Docker image with a command like:

docker buildx --builder insecure-builder build --allow security.insecure --load .

Note: The --load flag loads the built image into the regular local Docker image cache. Since the builder runs in an isolated container, its result will not end up in the regular local Docker image cache by default.

RUN commands in a Dockerfile that require privileges must be given the flag --security=insecure, i.e., be written as RUN --security=insecure followed by the command. The --security flag is only in preview and must therefore be enabled in the Dockerfile by adding the following line as the first line in the Dockerfile:

# syntax=docker/dockerfile:1.3-labs

For more details on BuildKit and docker buildx, see Docker Build architecture.

We can now perform the build; however, the way CRaC is implemented stops the build, as we will learn in the next section.

2.3. Problem #2, CRaC returns exit status 137 instead of 0

On a successful checkpoint, the java -Dspring.context.checkpoint=onRefresh -XX:CRaCCheckpointTo... command is terminated forcefully (like using kill -9) and returns the exit status 137 instead of 0, causing the Docker build command to fail.

To prevent the build from stopping, the java command is extended with a test that verifies that 137 is returned and, if so, returns 0 instead. The following is added to the java command: || if [ $? -eq 137 ]; then return 0; else return 1; fi.

Note: || means that the command following will be executed if the first command fails.
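The status 137 follows the common shell convention of 128 plus the signal number, where SIGKILL is signal 9. The following is a minimal sketch of the same test wrapped in a shell function, so that it can be tried outside a Dockerfile. The function name accept_checkpoint_exit is made up for illustration, and sh -c 'kill -9 $$' only simulates a process that is killed forcefully; it does not perform a real checkpoint.

```shell
#!/bin/sh
# Hypothetical helper (not part of the book's source code): run a command and
# treat exit status 137 (128 + 9, i.e., terminated by SIGKILL) as success.
# Any other non-zero status is reported as a failure.
accept_checkpoint_exit() {
  "$@" || if [ $? -eq 137 ]; then return 0; else return 1; fi
}

# Simulate a process that is killed forcefully, similar to how the JVM is
# terminated after a successful checkpoint.
if accept_checkpoint_exit sh -c 'kill -9 $$'; then
  echo "status 137 accepted"
fi
```

Inside a function, return is well defined in every POSIX shell, which makes the sketch easy to experiment with before putting the one-liner into a RUN command.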

With CRaC working in a Dockerfile, let’s move on and learn about the challenges with runtime configuration and how to handle them.

2.4. Problem #3, Runtime configuration

Using Spring Boot’s automatic checkpoint/restore at startup, there is no way to specify runtime configuration on restore; at least, I haven’t found a way to do it. This means that the runtime configuration has to be specified at build time. Sensitive information from the runtime configuration, such as credentials used for connecting to a database, will be written to the checkpoint files. Since the Docker images will contain these checkpoint files, they also need to be handled in a secure way.

The Spring Framework documentation contains a warning about this, copied from the section Automatic checkpoint/restore at startup:

As mentioned above, and especially in use cases where the CRaC files are shipped as part of a deployable artifact (a container image, for example), operate with the assumption that any sensitive data “seen” by the JVM ends up in the CRaC files, and assess carefully the related security implications.

So, let’s assume that we can protect the Docker images, for example, in a private registry with proper authorization in place and that we can specify the runtime configuration at build time.

In Chapter 6 of the book, the source code specifies the runtime configuration in the configuration files, application.yml, in a Spring profile named docker.

The RUN command, which performs the checkpoint, has been extended to include an environment variable that declares what Spring profile to use: SPRING_PROFILES_ACTIVE=docker.

Note: If you have the runtime configuration in a separate file, you can add the file to the Docker image and point it out using an environment variable like SPRING_CONFIG_LOCATION=file:runtime-configuration.yml.

With the challenges of proper runtime configuration covered, we have only one problem left to handle: Spring Data JPA’s lack of support for CRaC without some extra work.

2.5. Problem #4, Spring Data JPA

Spring Data JPA does not work out-of-the-box with CRaC, as documented in the Smoke Tests project; see the section about Prevent early database interaction. This means that auto-creation of database tables when starting up the application is not possible when using CRaC. Instead, the creation has to be performed outside of the application startup process.

Note: This restriction does not apply to embedded SQL databases. For example, the Spring PetClinic application works with CRaC without any modifications since it uses an embedded SQL database by default.

To address these deficiencies, the following changes have been made in the source code of Chapter 6:

  1. Manual creation of a SQL DDL script, create-tables.sql

    Since we can no longer rely on the application to create the required database tables, a SQL DDL script has been created. To enable the application to create the script file, a Spring profile create-ddl-script has been added in the review microservice’s configuration file, microservices/review-service/src/main/resources/application.yml. It looks like:

     spring.config.activate.on-profile: create-ddl-script
    
     spring.jpa.properties.jakarta.persistence.schema-generation:
       create-source: metadata
       scripts:
         action: create
         create-target: crac/sql-scripts/create-tables.sql
    

    The SQL DDL file has been created by starting the MySQL database and, next, the application with the new Spring profile. Once connected to the database, the application and database are shut down. Sample commands:

     docker compose up -d mysql  
     SPRING_PROFILES_ACTIVE=create-ddl-script java -jar microservices/review-service/build/libs/review-service-1.0.0-SNAPSHOT.jar
     # CTRL/C once "Connected to MySQL: jdbc:mysql://localhost/review-db" is written to the log output
     docker compose down  
    

    The resulting SQL DDL script, crac/sql-scripts/create-tables.sql, has been added to Chapter 6’s source code.

  2. The Docker Compose file configures MySQL to execute the SQL DDL script at startup.

    A CRaC-specific version of the Docker Compose file has been created, crac/docker-compose-crac.yml. To create the tables when the database is starting up, the SQL DDL script is used as an init script. The SQL DDL script is mapped into the init-folder /docker-entrypoint-initdb.d with the following volume-mapping in the Docker Compose file:

     volumes:
       - "./sql-scripts/create-tables.sql:/docker-entrypoint-initdb.d/create-tables.sql"
    
  3. Added a runtime-specific Spring profile in the review microservice’s configuration file.

    The guidelines in the Smoke Tests project’s JPA section have been followed by adding an extra Spring profile named crac. It looks like the following in the review microservice’s configuration file:

     spring.config.activate.on-profile: crac
    
     spring.jpa.database-platform: org.hibernate.dialect.MySQLDialect
     spring.jpa.properties.hibernate.temp.use_jdbc_metadata_defaults: false
     spring.jpa.hibernate.ddl-auto: none
     spring.sql.init.mode: never
     spring.datasource.hikari.allow-pool-suspension: true
    
  4. Finally, the Spring profile crac is added to the RUN command in the Dockerfile to activate the configuration when the checkpoint is performed.

2.6. The resulting Dockerfile

Finally, we are done with handling the problems resulting from using a Dockerfile to build a Spring Boot application that can restore quickly using CRaC in a Docker image.

The resulting Dockerfile, crac/Dockerfile-crac-automatic, looks like:

# syntax=docker/dockerfile:1.3-labs

FROM azul/zulu-openjdk:21.0.3-21.34-jdk-crac

ADD build/libs/*.jar app.jar

RUN --security=insecure \
  SPRING_PROFILES_ACTIVE=docker,crac \
  java -Dspring.context.checkpoint=onRefresh \
       -XX:CRaCCheckpointTo=checkpoint -jar app.jar \
  || if [ $? -eq 137 ]; then return 0; else return 1; fi

EXPOSE 8080
ENTRYPOINT ["java", "-XX:CRaCRestoreFrom=checkpoint"]

Note: One and the same Dockerfile is used by all microservices to create CRaC versions of their Docker images.

We are now ready to try it out!

3. Trying out CRaC with automatic checkpoint/restore

To try out CRaC, we will use the microservice system landscape used in Chapter 6 of my book. If you are not familiar with the system landscape, it looks like the following:

[Figure: system landscape]

Chapter 6 uses Docker Compose to manage (build, start, and stop) the system landscape.

Note: If you don’t have all the tools used in this blog post installed in your environment, you can look into Chapters 21 and 22 for installation instructions.

To try out CRaC, we need to get the source code from GitHub, compile it, and create the Docker images for each microservice using a custom insecure Docker builder. Next, we can use Docker Compose to start up the system landscape and run the end-to-end validation script that comes with the book to ensure that everything works as expected. We will wrap up the try-out section by comparing the startup times of the microservices when they start with and without using CRaC.

We will go through each step in the following subsections:

3.1. Getting the source code

Run the following commands to get the source code from GitHub, jump into the Chapter06 folder, check out the branch used in this blog post, SB3.2-crac-automatic, and ensure that a Java 21 JDK is used (Eclipse Temurin is used here):

git clone https://github.com/PacktPublishing/Microservices-with-Spring-Boot-and-Spring-Cloud-Third-Edition.git
cd Microservices-with-Spring-Boot-and-Spring-Cloud-Third-Edition/Chapter06
git checkout SB3.2-crac-automatic
sdk use java 21.0.3-tem

3.2. Building the CRaC-based Docker images

Start with compiling the microservices’ source code:

./gradlew build

If not already created, create the insecure builder with the command:

docker buildx create --name insecure-builder --buildkitd-flags '--allow-insecure-entitlement security.insecure'
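The "if not already created" check can also be scripted. The following is a minimal sketch, assuming that docker buildx inspect returns a non-zero exit status when no builder with the given name exists; the function name ensure_insecure_builder is hypothetical and not part of the book's source code.

```shell
#!/bin/sh
# Hypothetical helper: create the insecure builder only if it does not exist.
# Assumes "docker buildx inspect <name>" fails for an unknown builder name.
ensure_insecure_builder() {
  name="${1:-insecure-builder}"
  if docker buildx inspect "$name" >/dev/null 2>&1; then
    echo "builder $name already exists"
  else
    docker buildx create --name "$name" \
      --buildkitd-flags '--allow-insecure-entitlement security.insecure'
  fi
}
```

Calling ensure_insecure_builder without arguments uses the builder name from this blog post, insecure-builder, and is safe to repeat.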

Now we can build a Docker image, where the build performs a CRaC checkpoint for each of the microservices with the commands:

docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t product-composite-crac --load microservices/product-composite-service

docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t product-crac --load microservices/product-service

docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t recommendation-crac --load microservices/recommendation-service

docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t review-crac --load microservices/review-service
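Since the four build commands differ only in the name of the microservice, they can be expressed as a loop. The following is a sketch rather than a script from the book's source code: the function name build_crac_images and the DRY_RUN variable are made up for illustration, while the docker buildx arguments are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: build the CRaC-based Docker image for each microservice.
# With DRY_RUN=1, the commands are only printed, not executed.
build_crac_images() {
  for service in product-composite product recommendation review; do
    # Assemble the same command line as in the four commands above.
    set -- docker buildx --builder insecure-builder build \
      --allow security.insecure \
      -f crac/Dockerfile-crac-automatic \
      -t "${service}-crac" \
      --load "microservices/${service}-service"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$*"
    else
      "$@" || return 1   # stop at the first failing build
    fi
  done
}
```

Setting DRY_RUN=1 before calling build_crac_images prints the four commands for inspection; without it, the images are built one after the other, and the loop stops at the first failure.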

3.3. Running end-to-end tests

To start up the system landscape, we will use Docker Compose. Since CRaC requires special Linux privileges, a CRaC-specific docker-compose file comes with the source code, crac/docker-compose-crac.yml. Each microservice is given the required privilege, CHECKPOINT_RESTORE, by specifying:

cap_add:
  - CHECKPOINT_RESTORE

Note: Several blog posts on CRaC suggest using privileged containers, i.e., starting them with docker run --privileged or adding privileged: true in the Docker Compose file. This is a really bad idea since an attacker who gets control over such a container can easily take control of the host that runs Docker. For more information, see Docker’s documentation on Runtime privilege and Linux capabilities.

The final addition to the CRaC-specific Docker Compose file is the volume mapping for MySQL to add the init file described above in the section 2.5. Problem #4, Spring Data JPA:

volumes:
  - "./sql-scripts/create-tables.sql:/docker-entrypoint-initdb.d/create-tables.sql"

Using this Docker Compose file, we can start up the system landscape and run the end-to-end verification script with the following commands:

export COMPOSE_FILE=crac/docker-compose-crac.yml
docker compose up -d

Let’s start with verifying that the CRaC afterRestore callback methods were called:

docker compose logs | grep "CRaC's afterRestore callback method called..."

Expect something like:

...ReviewServiceApplication           : CRaC's afterRestore callback method called...
...RecommendationServiceApplication   : CRaC's afterRestore callback method called...
...ProductServiceApplication          : CRaC's afterRestore callback method called...
...ProductCompositeServiceApplication : CRaC's afterRestore callback method called...

Now, run the end-to-end verification script:

./test-em-all.bash

If the script ends with a log output similar to:

End, all tests OK: Fri Jun 28 17:40:43 CEST 2024

…it means all tests ran OK, and the microservices behave as expected.

Bring the system landscape down with the commands:

docker compose down
unset COMPOSE_FILE

After verifying that the microservices behave correctly when started from a CRaC checkpoint, we can compare their startup times with microservices started without using CRaC.

3.4. Comparing startup times with and without CRaC

Now over to the most interesting part: How much faster do the microservices start up when performing a restore from a checkpoint compared to a regular cold start?

The tests have been run on a MacBook Pro M1 with 64 GB memory.

Let’s start with measuring startup times without using CRaC.

3.4.1. Startup times without CRaC

To start the microservices without CRaC, we will use the default Docker Compose file. So, we must ensure that the COMPOSE_FILE environment variable is unset before we build the Docker images for the microservices. After that, we can start the database services, MongoDB and MySQL:

unset COMPOSE_FILE
docker compose build
docker compose up -d mongodb mysql

Verify that the databases are reporting healthy with the command: docker compose ps. Repeat the command until both report they are healthy. Expect a response like this:

NAME                ... STATUS                  ... 
chapter06-mongodb-1 ... Up 13 seconds (healthy) ...
chapter06-mysql-1   ... Up 13 seconds (healthy) ...

Next, start the microservices and look in the logs for the startup time (searching for the word Started). Repeat the logs command until logs are shown for all four microservices:

docker compose up -d
docker compose logs | grep Started

Look for a response like:

...Started ProductCompositeServiceApplication in 1.659 seconds
...Started ProductServiceApplication in 2.219 seconds
...Started RecommendationServiceApplication in 2.203 seconds
...Started ReviewServiceApplication in 3.476 seconds
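Repeating the logs command by hand can be replaced with a small polling loop. The following is a minimal sketch; the function name wait_for_log_lines and its parameters are hypothetical, and it simply counts matching log lines, so it assumes that each microservice writes exactly one line that matches the pattern.

```shell
#!/bin/sh
# Hypothetical helper: poll the Docker Compose logs until at least <expected>
# lines match <pattern>, giving up after <attempts> tries, one second apart.
wait_for_log_lines() {
  pattern="$1"
  expected="$2"
  attempts="${3:-60}"
  while [ "$attempts" -gt 0 ]; do
    # grep -c prints 0 and exits with status 1 when nothing matches,
    # hence the "|| true".
    count=$(docker compose logs 2>/dev/null | grep -c -e "$pattern" || true)
    if [ "${count:-0}" -ge "$expected" ]; then
      docker compose logs 2>/dev/null | grep -e "$pattern"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  echo "Timed out waiting for $expected lines matching '$pattern'" >&2
  return 1
}
```

For the four microservices, a call like wait_for_log_lines "Started .*Application" 4 prints the startup lines once all of them are available. The same function can be used with another pattern when the services log a different message.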

Finally, bring down the system landscape:

docker compose down

3.4.2. Startup times with CRaC

First, declare that we will use the CRaC-specific Docker Compose file and start the database services, MongoDB and MySQL:

export COMPOSE_FILE=crac/docker-compose-crac.yml
docker compose up -d mongodb mysql

Verify that the databases are reporting healthy with the command: docker compose ps. Repeat the command until both report they are healthy. Expect a response like this:

NAME           ... STATUS                  ...
crac-mongodb-1 ... Up 10 seconds (healthy) ...
crac-mysql-1   ... Up 10 seconds (healthy) ...

Next, start the microservices and look in the logs for the startup time (this time searching for the word Restored). Repeat the logs command until logs are shown for all four microservices:

docker compose up -d
docker compose logs | grep Restored

Look for a response like:

...Restored ProductCompositeServiceApplication in 0.131 seconds
...Restored ProductServiceApplication in 0.225 seconds
...Restored RecommendationServiceApplication in 0.236 seconds
...Restored ReviewServiceApplication in 0.154 seconds

Finally, bring down the system landscape:

docker compose down
unset COMPOSE_FILE

Now, we can compare the startup times!

3.4.3. Comparing startup times between JVM and CRaC

Here is a summary of the startup times, along with calculations of how many times faster the CRaC-enabled microservice starts and the reduction of startup times in percentage:

Microservice       Without CRaC (s)  With CRaC (s)  CRaC times faster  CRaC reduced startup time
product-composite  1.659             0.131          12.7               92%
product            2.219             0.225          9.9                90%
recommendation     2.203             0.236          9.3                89%
review             3.476             0.154          22.6               96%

Generally, we can see a 10-fold performance improvement in startup times or 90% shorter startup time; that’s a lot!

Note: The improvement in the Review microservice is even better since it no longer handles the creation of database tables. However, this improvement is irrelevant when comparing improvements using CRaC, so let’s discard the figures for the Review microservice.
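The two calculated columns can be reproduced from the measured startup times: "times faster" is the time without CRaC divided by the time with CRaC, and the reduction is one minus the inverse of that ratio. The following sketch uses awk for the arithmetic; the function name startup_gain is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<times faster> <reduction in percent>" for a
# startup time without CRaC and one with CRaC, both given in seconds.
startup_gain() {
  awk -v cold="$1" -v crac="$2" \
    'BEGIN { printf "%.1f %.0f%%\n", cold / crac, (1 - crac / cold) * 100 }'
}

startup_gain 1.659 0.131   # product-composite, prints "12.7 92%"
startup_gain 2.219 0.225   # product, prints "9.9 90%"
startup_gain 2.203 0.236   # recommendation, prints "9.3 89%"
startup_gain 3.476 0.154   # review, prints "22.6 96%"
```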

4. Summary

Coordinated Restore at Checkpoint (CRaC) is a powerful feature in OpenJDK that improves the startup performance of Java applications by allowing them to resume from a previously saved state, a.k.a. a checkpoint. With Spring Boot 3.2, we also get a simplified way of creating a checkpoint using CRaC, known as automatic checkpoint/restore at startup.

The tests in this blog post indicate a 10-fold improvement in startup performance, i.e., a 90% reduction in startup time when using automatic checkpoint/restore at startup.

The blog post also explained how Docker images using CRaC can be built using a Dockerfile instead of the complex bash scripts suggested by most blog posts on the subject. This, however, comes with some challenges of its own, like using custom Docker builders for privileged builds, as explained in the blog post.

Using Docker images created using automatic checkpoint/restore at startup comes with a price. The Docker images will contain runtime-specific and sensitive information, such as credentials to connect to a database at runtime. Therefore, they must be protected from unauthorized use.

The Spring Boot support for CRaC does not fully cover all modules in Spring’s ecosystem, forcing some workarounds to be applied, e.g., when using Spring Data JPA.

Also, when using automatic checkpoint/restore at startup, the JVM HotSpot engine cannot be warmed up before the checkpoint. If optimal execution time for the first requests being processed is important, automatic checkpoint/restore at startup is probably not the way to go.

5. Next blog post

In the next blog post, I will show you how to use regular on-demand checkpoints to solve some of the considerations with automatic checkpoint/restore at startup.

Specifically, it will address the problems of specifying the runtime configuration at build time and storing sensitive runtime configuration in the Docker images, and show how the Java VM can be warmed up before performing the checkpoint.
