Performance Tuning in Camunda 8

Fine-tune your end-to-end solution in Camunda 8 with these performance tips and tests.

Performance tuning an end-to-end solution can be tricky. Without clarity on what you’re fine-tuning for, it can turn into an infinite exercise. Before you begin, define clear performance targets.

It can be helpful to set targets for metrics like:

  • Process instance (PI) creation/sec
  • Task completion/sec
  • Latency requirements for job workers
  • PI completion latencies

Not every one of these metrics will be significant for every use case, but it’s important to be clear about the requirements for your particular use case.

Prerequisites

Before we get much further in this guide, make sure you’re familiar with the core Camunda 8 concepts this article builds on, such as process instances, tasks, job workers, messages, and partitions.

It is, of course, also a prerequisite to have some tooling in place to monitor the various metrics that the Camunda platform already exposes. These metrics are key to observing and fine-tuning performance and are already visualized as part of our sample Grafana chart.

You can see an interactive example using Grafana here.

To help draw correlations between the various parts of your end-to-end solution, you’ll want to have a similar dashboard with metrics exposed from the Camunda platform as well as the code implementing the business logic.

Factors influencing performance

A number of factors influence your end-to-end solution’s performance, so let’s consider a few of them one by one.

Process model

The number of task instances (TIs) to be executed per second influences your hardware requirements. The higher the TI/sec to be executed, the more hardware you need.

The TI/sec metric includes not only user tasks and service tasks, but also call activities. Call activities incur significant processing because each one requires the creation of another PI. Therefore, count them when you’re estimating TI/sec.

For example, a process model with five service tasks and five call activities will need more resources than a process model with just five service tasks. Similarly, a process model with 20 TIs (10 service tasks and 10 call activities) will have different resource consumption from another process model with 20 TIs (18 service tasks and two call activities). That can be significant, especially in high-throughput scenarios.

It’s important to run your benchmarks with a process model that closely resembles your production workload to get an accurate estimate of your hardware requirements.

Message subscriptions

When designing your message subscriptions, keep in mind that all messages with the same correlation key route to the same partition. Too many messages (thousands!) with the same correlation key will overwhelm one specific partition. Instead, ensure that correlation keys are unique per PI.

For example, when a specific PI is started by a message start event and has multiple subsequent message catch events, using the same correlation key for the start message and for all subsequent catch events avoids interpartition communication, which can be a performance booster. Otherwise, messages are forwarded from the partition that receives the message to the partition where the process instance lives.

Note that this could change beyond Camunda 8.6 if the logic to forward messages to the partitions changes.

You can gain the highest performance from messages that are not buffered, that is, messages with TTL=0. Buffered messages must expire at some point, and expiring them takes resources from the engine.
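As a minimal sketch with the Zeebe Java client (the "order-payment-received" message name and the orderId-based correlation key are hypothetical, not values from this article), publishing a non-buffered message with a PI-unique correlation key looks roughly like this:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.util.Map;

public class OrderMessagePublisher {

  // Sketch only: the message name and orderId-based correlation key are
  // illustrative placeholders.
  public static void publishPaymentReceived(ZeebeClient client, String orderId) {
    client
        .newPublishMessageCommand()
        .messageName("order-payment-received")
        // Unique per PI, so a single partition never has to absorb
        // thousands of messages sharing one correlation key.
        .correlationKey(orderId)
        // TTL=0: the message is not buffered, so the engine never has to
        // spend resources expiring it later.
        .timeToLive(Duration.ZERO)
        .variables(Map.of("orderId", orderId))
        .send()
        .join();
  }
}
```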

If message buffering is required, you can tweak the experimental ttlCheckerBatchLimit and ttlCheckerInterval values and test the performance differences.

Multi-instance call activities

Because all the child instances are created on the same partition as the root PI, multi-instance call activities can create issues. If the cardinality of the multi-instance is very high (we’ve seen a scenario with a cardinality of ~8k), that specific partition gets overloaded and ends up with backpressure.

If this happens only occasionally, the system slows down and eventually catches up with the processing. Still, for such use cases, we recommend running benchmarks and sizing your environment using an empirical approach.

As an alternative, consider sending out multiple messages that in turn trigger PIs via a message start event.

Here is an example from the Camunda community hub, demonstrating the approach using messages instead of call activities in a high cardinality multi-instance case.
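For orientation before you dig into that example, here is a minimal sketch of the same idea with the Zeebe Java client; it is not the community hub implementation, and the "start-item-processing" message name and item IDs are placeholders. Each message targets a message start event in the child process model, and the distinct correlation keys route the resulting child PIs across partitions instead of piling them onto the parent’s partition.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.util.List;
import java.util.Map;

public class FanOutViaMessages {

  // Instead of one multi-instance call activity creating thousands of
  // children on the parent's partition, publish one message per item and
  // let a message start event in the child model create each PI.
  public static void fanOut(ZeebeClient client, List<String> itemIds) {
    for (String itemId : itemIds) {
      client
          .newPublishMessageCommand()
          .messageName("start-item-processing") // hypothetical message name
          // A distinct correlation key per item means the resulting child
          // PIs are spread across partitions rather than piling up on one.
          .correlationKey(itemId)
          .timeToLive(Duration.ZERO)
          .variables(Map.of("itemId", itemId))
          .send(); // collect the returned futures if you need to await them
    }
  }
}
```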

Job workers

Note that JobTimeout must be longer than the average time that the worker code takes to complete execution.

The values of MaxJobsActive and JobTimeout must be set according to the time taken to execute the code in each worker. With a very short JobTimeout and a high MaxJobsActive, jobs can be activated and then sit queued at the job worker client for so long that they time out before the worker even gets to process them. This results in slow job completion.

There is no one-size-fits-all recommendation for MaxJobsActive. You have to experiment with it and set it according to the dynamics of your environment, such as:

  • The actual time to run the worker code
  • The size of the data in relation to the memory allocated to the worker client pod
  • Job timeout

We recommend running a few benchmarks and identifying the optimal values using an empirical approach.
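To make these knobs concrete, here is a minimal sketch with the Zeebe Java client; the "charge-credit-card" job type, the handler body, and the numbers are placeholders to tune in your own benchmarks, not recommendations:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;

public class PaymentWorker {

  public static JobWorker open(ZeebeClient client) {
    return client
        .newWorker()
        .jobType("charge-credit-card") // hypothetical job type
        .handler((jobClient, job) -> {
          // ... business logic that takes, say, ~200ms on average ...
          jobClient.newCompleteCommand(job.getKey()).send().join();
        })
        // Keep the in-client queue small enough that activated jobs are
        // handled well before their timeout expires.
        .maxJobsActive(32)
        // Must comfortably exceed the average handler execution time plus
        // any time a job may spend queued in the client.
        .timeout(Duration.ofSeconds(30))
        .open();
  }
}
```

If jobs still time out in the client-side queue, lower maxJobsActive, raise the timeout, or both, and then re-run the benchmark.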

The longer the jobs take to complete, the more Zeebe resources are required. For instance, running the same test case with a 20ms delay at the job workers consumes less hardware at Zeebe than running it with a 3000ms delay at the job workers, with no other parameter change. So when mocking your APIs in tests, ensure that your performance tests have an equivalent delay in the mocked workers.

For high-performance use cases, Camunda’s guide on writing good workers can be helpful. For example, implementing your workers in a non-blocking way helps at high throughput.

In many cases, it’s a good idea to have metrics around each worker’s implementation. While the Java client provides a counter tracking the count of jobs activated and jobs handled, you could gather further metrics from individual clients (e.g., execution duration and other use-case-specific metrics). For latency-sensitive or high-throughput use cases, you can use these metrics for troubleshooting and evaluating performance.
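As a rough sketch of both ideas, a non-blocking, instrumented handler could look like the following. Micrometer is our assumption here (the Zeebe client does not mandate it), and the "charge-credit-card" job type and metric names are hypothetical:

```java
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class InstrumentedHandler {

  private final MeterRegistry registry;
  private final Timer completionDuration;

  public InstrumentedHandler(MeterRegistry registry) {
    this.registry = registry;
    this.completionDuration = Timer.builder("worker.job.completion.duration")
        .tag("jobType", "charge-credit-card") // hypothetical job type
        .register(registry);
  }

  public void handle(JobClient client, ActivatedJob job) {
    Timer.Sample sample = Timer.start(registry);
    // Non-blocking: send the complete command without join(), so the
    // handler thread is free for the next job while the broker responds.
    client.newCompleteCommand(job.getKey())
        .send()
        .whenComplete((response, error) -> {
          sample.stop(completionDuration);
          if (error != null) {
            registry.counter("worker.job.completion.errors").increment();
          }
        });
  }
}
```

Exposing this registry through your usual monitoring stack lets you correlate worker-side latency with the Zeebe metrics in Grafana.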

Check out Camunda’s documentation for more insight.

To ensure that no critical transactions are lost in case of errors or backpressure when calling the Camunda APIs, you need a robust error handling and retry strategy. Spring Zeebe has this built in.

Check out a sample of Spring Zeebe’s implementation.
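If you’re not using Spring Zeebe, a minimal hand-rolled sketch of the same idea with the plain Java client could look like the following; the retry backoff and error handling here are illustrative assumptions, not the Spring Zeebe implementation:

```java
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import java.time.Duration;

public class ResilientHandler {

  public void handle(JobClient client, ActivatedJob job) {
    try {
      // ... call the downstream system / run the business logic ...
    } catch (Exception e) {
      // Fail the job so the broker redelivers it; decrementing retries means
      // it eventually raises an incident instead of retrying forever.
      client.newFailCommand(job.getKey())
          .retries(job.getRetries() - 1)
          .retryBackoff(Duration.ofSeconds(10)) // available in recent 8.x clients
          .errorMessage(e.getMessage())
          .send()
          .join();
      return;
    }
    client.newCompleteCommand(job.getKey()).send().join();
  }
}
```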

CPU type

The type of CPU used for the brokers influences the performance of Zeebe. So when running benchmarks or performance tests, perform them on the same CPU type as your production environment.

Our benchmarks have shown that the latest generation of CPUs delivers higher performance. For instance, we compared the performance of the same workload between N2D (second- and third-generation AMD) and C3D (fourth-generation AMD) machines on GCP. The latter needed only 58% of the CPUs to deliver the same throughput.

To compare cost benefits, run benchmarks, assess which machine types are cost-effective, and choose the machine type accordingly. Also assess whether running the load on a slower but cheaper CPU type or a faster but slightly more expensive one gives you better price-performance.

We found that with on-demand VMs, the C3D type has better price-performance. However, spot VM discounts on GCP are much higher for the N2D VMs, so the cost benefit is nullified in the case of spot VMs.

Disk type

Zeebe writes its state to disk, which means the disk speed influences the processing latency. We recommend high-performance SSD disks with an absolute minimum of 1,000 IOPS. Each infrastructure provider offers a variety of disk type options. Based on our observations, we’ve recommended storage classes for GCP, Azure, OpenShift, and AWS in our documentation.

However, for high-performance use cases, we recommend an empirical approach. Run benchmarks to compare the performance of the same test case with different disk types/IOPS variations to identify the optimal configuration.

Elasticsearch

The performance of Elasticsearch has an impact, especially in high-throughput use cases. If Zeebe records are not exported efficiently, Zeebe’s log grows bigger. While this is not a problem for Zeebe itself, it affects log compaction, and if the log can’t be compacted for a long time, performance suffers.

You can see this in long-running benchmarks, where the performance drops after a point.

To define the number of shards per index, each component exposes environment variables you can leverage.

You’ll have to properly configure sharding for all Elasticsearch indices (Zeebe, Operate, Tasklist, and Optimize). This ensures that the Elasticsearch resources are efficiently utilized and that exporting keeps pace with processing in Zeebe.

To start with, set the number of shards equal to the number of Elasticsearch nodes, or a multiple of it. There is no one-size-fits-all setup, unfortunately. You will need to identify the optimum number of shards based on the data size and the number of nodes in your Elasticsearch cluster.

The following Grafana screenshot showcases a few sample benchmark runs. They explore the effect of optimal sharding in Elasticsearch on the efficiency of exporting by varying only the number of shards for the same test configuration.

Grafana screenshot showcasing a few sample benchmark runs

The number of records not exported is a useful metric for identifying whether Elasticsearch is slow. Benchmarks demonstrate that when the exporters export efficiently, process instance execution time improves twofold.

Let’s take a look at another Grafana screenshot. This one demonstrates that PI execution time at the 99th percentile goes up with slower exporting. This could have a significant impact on time-sensitive processes.

Grafana screenshot demonstrating PI execution time

By default, Elasticsearch heap size allocation is a percentage of the available RAM. So consider setting a heap size explicitly or allocating more RAM to Elasticsearch when you observe slowness in ES.

For more insight into fine-tuning heap size configuration, check out Elasticsearch’s best practices.

How fast your data can be written to and read from the disk can affect how quickly Elasticsearch can index new documents and return query results. Therefore, the disk IOPS and disk space utilization both influence the performance of Elasticsearch. Our benchmarks show this clearly.

In the following screenshot, you can see the same test case repeated with different disk sizes. In this case, the 1 TB disk had more IOPS, and Elasticsearch’s performance differed significantly, even though a 64 GB disk was sufficient to hold the data for that short load test.

Screenshot showing the same test case repeated with different disk sizes

High-performance use case

To squeeze out more performance, let’s experiment a bit and find the optimal values for a few key broker parameters.

Raft Flush

Our benchmarks demonstrate that the commit and write latencies drop significantly when the raft flush is delayed or disabled. However, completely turning off the flush may not be suitable for all cases. Only disable explicit flushing if you are willing to accept potential data loss in exchange for better performance.

Before disabling it, try the delayed options. These provide a trade-off between safety and performance.

Here are some environment variables you can use to modify raft flush settings:

  • ZEEBE_BROKER_CLUSTER_RAFT_FLUSH_ENABLED: set this to false to turn off raft flush entirely. Again, this is not recommended due to the risk of data loss.
  • ZEEBE_BROKER_CLUSTER_RAFT_FLUSH_DELAYTIME: sets the delayed flush interval; you can test an interval < 30s.

Broker data LogSegmentSize

By default, the size of data log segment files is 128MB. Our benchmarks have demonstrated that having smaller log segment sizes than the default helps decrease the latency significantly.

You can use the following environment variable to define the log segment size: ZEEBE_BROKER_DATA_LOGSEGMENTSIZE.

Experiment with smaller log segment sizes for better performance.

CPUs per partition replica

Based on our benchmark observations, a ratio of the number of partition replicas per CPU in the range of 1.2–1.5 has delivered optimal performance. To understand this, consider the following example:

  • Number of vCPUs allocated to the brokers: 7
  • Number of brokers: 14
  • Number of partitions: 42
  • Replication factor: 3

Total number of partitions = Number of partitions x Replication factor = 42 x 3 = 126

Total number of CPUs = 7 x 14 = 98

Partition replica per CPU = Total number of partitions / Total number of CPUs = 126 / 98 = 1.286

This number is not set in stone; our benchmarks were based on N2D (second/third generation AMD) machines. You can use these suggestions as a starting point. Experiment further in test benchmark runs to find the optimal settings for your environment.

ioThreadCount and cpuThreadCount

Since there cannot be one-size-fits-all values for these parameters, experiment with them.

It’s advisable to have at least one CPU thread per processing (leader) partition on the broker. For instance, consider the following example:

  • Number of vCPUs per broker node: 7
  • Number of brokers: 14
  • Number of partitions: 42
  • Replication factor: 3

Total number of partitions = Number of partitions x Replication factor = 42 x 3 = 126

Total partitions per broker = 126 / 14 = 9

Total number of leaders per broker = 42 / 14 = 3

So each broker could have three processing partitions that need at least one CPU thread. The other six replicated partitions also need some CPU space, so a cpuThreadCount of four or five makes sense. Then you can run benchmarks to identify the optimal value for this setting.

Refer to our benchmark test case template to explore other parameters for high-performance use cases.

Troubleshooting throughput issues

To get PI completion metrics in Grafana, set the following environment variable to `true`: ZEEBE_BROKER_EXECUTION_METRICS_EXPORTER_ENABLED.

If you’re observing backpressure, look at the latency metrics and check the disk speed for Zeebe. For example, on AWS, gp3 volumes provide a predictable baseline performance of 3,000 IOPS regardless of the volume size. SSD persistent disk volumes, in contrast, deliver a consistent baseline IOPS performance, but it varies based on volume size.

For disks whose performance varies with the volume size, consider allocating more disk space to ensure that the required IOPS is delivered. This applies to Zeebe as well as Elasticsearch disks.

With slower disks, the commit latency and record-write latency would be higher. Optimal values as observed from our benchmarks are:

  • Commit latency: ~400ms
  • Record-write latency: ~25ms

While other factors also influence these latencies, it’s critical to keep them within the optimal ranges, and Zeebe’s disk speed plays a key role in that.

If disk speed is not an issue, then consider modifying the raft flush delay and log segment size.

How is CPU utilization? Is the CPU throttled? If the allotted CPU is not fully used but there is still backpressure, the hardware is underutilized while there is more load waiting to be processed. Consider adding a few more partitions to the node.

On the other hand, if there is CPU throttling and if the CPU allotted is fairly well used, then consider scaling up the nodes. This will help redistribute the partitions to the newly added nodes and provide more hardware resources for the partitions to consume.

Ensure that there is no throttling by allocating sufficient CPU. This will yield better performance and sometimes even reduce CPU usage. When the brokers are CPU throttled, it can lead to more load for the system in general (due to timeouts, yields, etc.), which in turn requires more CPU and causes more throttling.

If you’re not hitting the required throughput, find out whether the gateway is keeping up. Does it have sufficient resources? We have observed in some cases that the gateway delivers better performance when more resources are allocated, so if there is no indication of issues at the engine level, consider increasing gateway resources.

The Zeebe Gateway uses one thread by default. However, if the gateway doesn’t exhaust its available resources while also not keeping up with the load, set this to a higher number via the zeebe.gateway.threads.managementThreads property. The corresponding environment variable is ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS.

Consider checking all the proxies and load balancers in your infrastructure for limitations or short default timeouts. We ran into a situation where an NGINX ingress controller was limiting the number of connections, which affected connectivity from the client pod to the gateway. We observed a lot of intermittent deadline-exceeded exceptions on the client, with a clear indication that the client was unable to establish a connection to the gateway.

For instance, in the following error message from a Java Zeebe client, each UNAVAILABLE entry in the closed=[] list refers to an unsuccessful connection attempt. The buffered_nanos property within the open=[] list refers to the number of nanoseconds a connection attempt to the destination (given in the remote_addr property) has currently been in progress. In this case, this indicates that the client is unable to connect to the gateway.

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9.999995499s. Name resolution delay 0.000000000 seconds. [closed=[UNAVAILABLE, UNAVAILABLE, UNAVAILABLE, UNAVAILABLE, UNAVAILABLE], open=[[buffered_nanos=446461400, buffered_nanos=736575000, remote_addr=zeebe-gateway-ingress-url:443]]]

To understand why it’s important to ensure that the job stream is not closed unexpectedly when job streaming is enabled, check out our documentation.

Is your job completion not on par with job creation? Check the JobLifeTime metric. Is it higher than the average time taken to run the worker code?

Check whether the Camunda client application has sufficient resources (CPU, memory, etc.). Is there client-side backpressure? In other words, are the job workers signaling to the gateway that they are blocked when it tries to stream jobs to them? If yes, consider scaling your client applications.

Isolate the performance of Zeebe with the same BPMN/DMN but with dummy/mock workers that have a completion delay matching the time it takes for the actual worker code to run. Always run this test to ensure that your cluster sizing is appropriate for the load.
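A minimal sketch of such a mock worker with the Java client follows; the "charge-credit-card" job type and the simulated latency are placeholders for your real model and your measured worker latency:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;

public class MockWorker {

  // Opens a dummy worker that only simulates the real worker's latency, so
  // the benchmark measures Zeebe itself rather than downstream dependencies.
  public static JobWorker open(ZeebeClient client, Duration simulatedLatency) {
    return client
        .newWorker()
        .jobType("charge-credit-card") // hypothetical job type from the real model
        .handler((jobClient, job) -> {
          Thread.sleep(simulatedLatency.toMillis()); // e.g., ~200ms, matching the real worker
          jobClient.newCompleteCommand(job.getKey()).send().join();
        })
        .open();
  }
}
```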

To identify performance bottlenecks in the Zeebe client implementation, isolate the dependencies in the job workers. For example, if there are connections to a database or other messaging systems, check whether the drop in performance is due to throttling on those endpoints.

Conclusion

We could go on and on about performance tuning with Camunda 8, but to avoid an “infinite exercise” as we first mentioned, this guide will close here. We hope it provides you with sufficient tips and metrics for fine-tuning your end-to-end solution with Camunda.

Don’t forget to check out the many resources mentioned throughout the article.
