Scaling Workflow Engines at Intuit with Camunda 8 and Zeebe

The team at Intuit put Camunda 8 to the test. Even when simulating the load of a tax-season peak, Camunda 8 effectively scaled to meet the needed demand.

While every organization seeks ways to increase business process efficiency and reduce latency, it is critically important for Intuit, which serves 100 million customers worldwide with TurboTax, Credit Karma, QuickBooks, and Mailchimp in its mission to power prosperity around the world. To fulfill that mission, Intuit’s development team is constantly on the lookout for cutting-edge technologies and techniques that can help drive digital transformation.

That’s why Intuit, which has partnered with Camunda to power our process orchestration practice for five years, wanted to put Camunda 8 and its promise of endless scalability to the test and determine if Camunda 8 could help us scale our process workflow instances without sacrificing speed, particularly through extremely high-volume periods. Here’s how we designed our experiment—and what we discovered throughout the process.

Context

We have been using Camunda 7 for the last five years at Intuit, deploying it across multiple swimlanes with multi-tenancy and offering isolation.

Currently we automate workflows for a variety of use cases at Intuit, including complex approvals, reminders, and notifications for our mid-market customers from QuickBooks (QBO). At the same time, Camunda empowers our tax experts by automating tax filings for TurboTax customers. More than 70 offerings use automated workflows powered by Camunda.

How we are using Camunda

Currently, we are addressing these use cases with Camunda:

  1. High-volume simple reminders; these are generated at a volume of approximately one million reminders over only a few hours each day, at a process creation rate of 50 transactions per second (TPS) and an external task rate of 150 TPS.
  2. Complex tax filing workflows that operate at a high throughput speed during peak tax season.
  3. Approval workflows that are smaller in scale, but also need to have a low latency.

For each of these workflow categories, we have a set of external tasks that connect to our workflow service to create tasks, send emails, and invoke a number of custom adapters.
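
As a rough illustration of that pattern, the sketch below shows a Camunda 7 external task worker subscribing to a single topic with the official external task client. The topic name, base URL, and variable are hypothetical placeholders, not our actual production configuration.

```java
import org.camunda.bpm.client.ExternalTaskClient;

public class ReminderWorker {
    public static void main(String[] args) {
        // Connect to the Camunda 7 engine's REST API (URL is a placeholder)
        ExternalTaskClient client = ExternalTaskClient.create()
                .baseUrl("http://localhost:8080/engine-rest")
                .maxTasks(50)                  // tasks fetched and locked per request
                .asyncResponseTimeout(10_000)  // long polling to reduce fetch latency
                .build();

        // Subscribe to a topic; the handler does the work and completes the task
        client.subscribe("send-reminder-email")   // hypothetical topic name
                .lockDuration(20_000)
                .handler((externalTask, externalTaskService) -> {
                    String customerId = externalTask.getVariable("customerId");
                    // ... call the notification/email adapter for this customer ...
                    externalTaskService.complete(externalTask);
                })
                .open();
    }
}
```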

Current Camunda 7 scale and benchmarks

Camunda 7 scaling

As the team’s interest in implementing process orchestration at scale grew, we began to run into some challenges handling our target volume in Camunda 7. These challenges included:

Current bottlenecks with Camunda 7

  1. External task scaling; our team at Intuit identified that external tasks did not scale beyond 150-160 TPS on average per topic.
  2. Number of jobs; the number of jobs did not scale beyond 1000-1200 TPS.
  3. External task TP99 latency; the TP99 was up to four seconds in some instances.
  4. Relational Database Service (RDS) spikes during sudden traffic surges, caused by the corresponding surge in jobs.
  5. RDS spikes when there was a surge in timer job execution.

Camunda 7 scale achievements

At first, the team sought to solve these challenges within Camunda 7 through thoughtful BPMN diagramming and testing. While we saw some improvements, we made observations that led us to conclude that Camunda 7 was not the right fit for our near-real-time use cases.

BPMN used

Bpmn-diagram-attempt

Intuit was able to scale up to 10 TPS with a single topic.

| Simulation Name | Duration (Hours) | Max TPS | Avg TPS | Processes Created/Closed | External Task Lag (Seconds) |
|---|---|---|---|---|---|
| 10 TPS for 8 hours, single topic | 8 | 270 | 133 | 288599 | 1.9 |
| 12 TPS for 8 hours, single topic | 1 | 340 | 210 | 420000 | >4 |
| 20 TPS for 8 hours, two topics | 8 | 470 | 260 | 577125 | 1.6 |

We subsequently scaled up the number of topics (and hence workers) and were able to sustain external task fetch-and-lock at 700 TPS on average.

During this testing, we observed the following:

  1. As we increased the number of topics, the CPU and input/output operations per second (IOPS) load on RDS became very high. As a result, we had to scale the RDS database (DB) vertically.
  2. It was difficult to keep external task latency under one second, so Camunda 7 was deemed unsuitable for our near-real-time use cases.

With that decision made, the team turned to Camunda 8 and its workflow and decision engine, Zeebe.

Camunda 8 and Zeebe

As we evaluated using Camunda 8 and Zeebe for our workflows, we implemented the following guidelines while completing our scale benchmarking:

  1. TP99 of external task execution should be less than 500 milliseconds (ms) (see the percentile sketch after this list).
  2. Each process would have at least 10-15 variables.
  3. No spikes in Elasticsearch or any pod memory.
  4. Each test run would last at least 45 minutes to 1 hour to avoid anomalies. Once a configuration looked optimal, we would certify it by running for at least 12 hours.
  5. Currently, we have only one active region, and disaster recovery is not implemented.
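
To make the first guideline concrete, here is a minimal, hypothetical sketch of how a TP99 value can be computed from raw round-trip latency samples using the nearest-rank method; it is illustrative only and not our actual benchmarking harness.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PercentileCheck {

    // Nearest-rank percentile, e.g. percentile(samples, 99.0) for TP99
    static long percentile(List<Long> latenciesMs, double pct) {
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }

    public static void main(String[] args) {
        List<Long> roundTripsMs = List.of(120L, 95L, 480L, 210L, 330L, 150L);
        long tp99 = percentile(roundTripsMs, 99.0);
        System.out.println("TP99 = " + tp99 + " ms (target: < 500 ms)");
    }
}
```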

Testing the waters

We started with a small performance test with a single external task. This was implemented with a simple BPMN model, with a start process, one external task, and an end process.

Test-camunda8-process
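
For reference, the load for such a test can be generated with the Zeebe Java client roughly as in the sketch below; the gateway address, the process ID benchmark-process, and the variables are illustrative assumptions rather than our exact test setup.

```java
import io.camunda.zeebe.client.ZeebeClient;

import java.util.Map;

public class BenchmarkStarter {
    public static void main(String[] args) {
        // Connect to a self-managed Zeebe gateway (address is a placeholder)
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("zeebe-gateway:26500")
                .usePlaintext()
                .build()) {

            // Create 500 instances of the deployed benchmark process
            for (int i = 0; i < 500; i++) {
                client.newCreateInstanceCommand()
                        .bpmnProcessId("benchmark-process")   // hypothetical process ID
                        .latestVersion()
                        .variables(Map.of("runId", i))
                        .send()
                        .join();
            }
        }
    }
}
```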

Initial configuration

With our performance test designed, we started our testing configuration with 500 process creations using a minimal amount of resources. From this test, we saw the following results:  

| Zeebe G/W Pods | Zeebe Brokers | Partitions | CPU/Broker | Memory (GB)/Broker | Avg External Task TPS | P99 (Seconds) | Avg (Seconds) |
|---|---|---|---|---|---|---|---|
| 2 | 3 | 3 | 4 | 4 | 200 | 50 | 20 |
| 2 | 3 | 3 | 8 | 8 | 720 | 17 | 8 |
| 2 | 3 | 9 | 8 | 8 | 2100 | 4 | 1.6 |

Quick Observations:

  1. Increasing CPU and memory per pod per broker helps to scale.
  2. Increasing the number of partitions and brokers helps to scale external tasks and reduce the TP99 latency.
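
One quick way to confirm how many brokers and partitions a cluster is actually running with before a test is to query the gateway topology from the Zeebe Java client, as in the illustrative sketch below (the gateway address is a placeholder; this is not part of our benchmarking harness).

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.Topology;

public class TopologyCheck {
    public static void main(String[] args) {
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("zeebe-gateway:26500")   // placeholder address
                .usePlaintext()
                .build()) {

            // Ask the gateway which brokers it knows about and how many
            // partitions each broker currently hosts
            Topology topology = client.newTopologyRequest().send().join();
            topology.getBrokers().forEach(broker ->
                    System.out.printf("Broker %d (%s) hosts %d partitions%n",
                            broker.getNodeId(), broker.getAddress(), broker.getPartitions().size()));
        }
    }
}
```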

Gaining confidence

Since we saw an improvement related to external task TPS, we wanted to check our TP99 overhead latency to confirm it would be less than 500 ms.

Our strategy was to reduce the load to 200 TPS in order to bring down the latency.

We also increased the number of external tasks in the process to four.

Additional high-level changes

We made the following high-level changes to our test:

  1. Enabled streaming in the Zeebe client (see the worker sketch after this list).
  2. Increased the number of brokers and partitions.
  3. Kept the number of partitions odd to obtain better quorums.
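
For reference, enabling job streaming on a worker with the Zeebe Java client (in client versions that support it) looks roughly like the sketch below; the job type and handler body are hypothetical.

```java
import io.camunda.zeebe.client.ZeebeClient;

import java.time.Duration;
import java.util.concurrent.CountDownLatch;

public class StreamingWorker {
    public static void main(String[] args) throws InterruptedException {
        // Connect to a self-managed Zeebe gateway (address is a placeholder)
        ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("zeebe-gateway:26500")
                .usePlaintext()
                .build();

        // Open a worker with job streaming enabled, so jobs are pushed to the
        // client as soon as they are available instead of being long-polled.
        client.newWorker()
                .jobType("send-notification")          // hypothetical job type
                .handler((jobClient, job) -> {
                    // ... call the downstream service here ...
                    jobClient.newCompleteCommand(job.getKey()).send().join();
                })
                .streamEnabled(true)
                .timeout(Duration.ofSeconds(30))
                .open();

        // Keep the worker alive; in a real service the client is closed on shutdown.
        new CountDownLatch(1).await();
    }
}
```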

In our next round of testing, we saw the following results:

| Brokers | Partitions | Total Processes Created | Process Creation TPS (Max) | Process Creation TPS (Avg) | External Task TPS (Max) | External Task TPS (Avg) | External Task Round Trip P99 (ms) | External Task P95 (ms) | External Task Avg (ms) | Perf Test Duration |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 15 | 2.4M | 341 | 197 | 2901 | 790 | 320 | 120 | 110 | 3 hours |

Observations:

  1. We experienced awesome TP99 and average latency.
  2. 2.5M processes were created and the latency was acceptable.
  3. It was best to ensure each Zeebe gateway can handle up to 200-250 TPS. We used the following rule: number of Zeebe gateways = number of Zeebe brokers/2.

Pushing the test a bit further

| Brokers | Partitions | TPS | Total Processes Created | Process Creation TPS (Max) | Process Creation TPS (Avg) | External Task TPS (Max) | External Task TPS (Avg) | External Task Round Trip P99 (ms) | External Task P95 (ms) | External Task Avg (ms) | Perf Test Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 15 | 300 | 900K | 456 | 275 | 2706 | 1096 | 2500 | 1820 | 910 | 1 h |
| 9 | 17 | 300 | 1M | 532 | 276 | 3067 | 1107 | 900 | 850 | 350 | 1 h |
| 9 | 27 | 300 | 800K | 400 | 300 | 3052 | 1057 | 1120 | 0.9 | 0.4 | 45 min |
| 11 | 27 | 300 | 984K | 474 | 275 | 5110 | 1098 | 2300 | 1.75 | 1.4 | 1 h |
| 13 | 25 | 300 | 910K | 470 | 265 | 4120 | 1056 | 3350 | 2.4 | 1.57 | 1 h |
| 13 | 33 | 300 | 737K | 449 | 299 | 2971 | 1192 | 1960 | 1130 | 0.93 | 45 min |
| 9 | 9 | 300 | 300K | 686 | 287 | 3071 | 1300 | 700 | 320 | 100 | 15 min |
| 9 | 9 | 400 | 1.3M | 690 | 381 | 3205 | 1525 | 700 | 380 | 120 | 1 h |

Observations

  1. One partition per broker is optimal to get the best results.
  2. External task TPS increased.
  3. TP99 of external tasks was always above one second whenever the partition-to-broker ratio was greater than one under higher load. To benchmark at a higher scale, we recommend a 1:1 mapping between partitions and brokers.
  4. Increasing partitions per broker did not decrease the latency.

Trying to be greedy

Feeling encouraged by our successful testing, we decided to further push the limits and “get greedy” to see how far we could increase our TPS and maintain acceptable latency.

Using the above configuration (four external tasks and the same broker size), we increased the TPS to 450. The TP99 increased to more than a minute (70 seconds).

To determine whether this issue could be resolved, we set up observability using the Grafana dashboards provided by the Camunda community.

From the Grafana dashboards, we could see that the IOPS demand was too high for the EC2 instances. Based on this, we increased the provisioned IOPS and throughput for the EC2 instances.

Final outcome

Our final testing results were:

| Brokers | Partitions | TPS | Number of External Tasks | Total Processes Created | Process Creation TPS (Max) | Process Creation TPS (Avg) | External Task TPS (Max) | External Task TPS (Avg) | External Task Round Trip P99 (ms) | External Task P95 (ms) | External Task Avg (ms) | Perf Test Duration | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9 | 450 | 4 | 640K | 892 | 423 | 3684 | 1700 | 715 | 281 | 240 | 30 min | Increased the EC2 IOPS/throughput from 125 to 600 |
| 9 | 9 | 500 | 4 | 640K | 714 | 476 | 3608 | 1884 | 420 | 281 | 215 | 30 min | Good |
| 11 | 11 | 500 | 5 | 1.3M | 884 | 445 | 4177 | 2209 | 900 | 590 | 310 | 1 h | Good |
| 13 | 13 | 500 | 5 | 2.5M | 1130 | 483 | 5360 | 2411 | 1230 | - | - | 2 h | First hour TP99 was good, but it started degrading towards the end |
| 13 | 13 | 500 | 5 | 1.6M | 890 | 460 | 4200 | 2222 | 570 | 250 | 150 | 1 h | Good for 1 hour; increased the EC2 throughput from 600 to 1000 and IOPS from 1000 to 3000 |
| 13 | 13 | 500 | 5 | 4.3M | 1130 | 471 | 5250 | 2355 | 900 | 500 | 300 | 2:30 h | Last hour TP99 was around 2 seconds; degradation of TP99 started after the first 75-90 minutes |
| 15 | 15 | 500 | 5 | 5.1M | 782 | 469 | 4000 | 2371 | 420 | 360 | 300 | 3:00 h | Increased EC2 IOPS to 6000 |
| 15 | 15 | 500 | 6 | 5M | 741 | 446 | 5343 | 2543 | 700 | 560 | 350 | 3 h | Increased throughput to 1000 and IOPS to 9000 |

Observations:

  1. We reduced TP99 to approximately 500 ms (this includes a downstream call, which takes up to 100-200 ms).
  2. We concluded that it may be possible to scale infinitely using Camunda 8 by fine-tuning the parameters described above.
  3. We experienced elevated TP99 during the initial load, but performance improved significantly over time.
  4. Moving from GP2 to GP3 volumes further decreased the latency.

Comparison after increasing IOPS and throughput

Before:
Performance-comparison-before

After:

The average latency of all requests was in milliseconds.

Note: Ignore the Activate Jobs Latency metric, since streaming was not enabled in a few pods.

Performance-comparison-after

Tsunami test: Keeping our QuickBooks and TurboTax offerings tax-season ready

The American tax season, which concludes with an annual April 15 deadline, generates periods of extremely high process instance volumes that we need to be sure we can accommodate. During these tax-season peaks, we see extreme use cases where we often sustain a continuous 400-500 TPS for three to four days. It was important for us to conduct a “tsunami test” of at least 15 hours to replicate those conditions and ensure our process instances could handle the demands of tax season.

As a reminder, here is the BPMN used for this process:

Bpmn-diagram-attempt

Attributes:

The BPMN has been adapted, but at a high level it includes the following attributes:

  • 7 external tasks
  • All tasks are asynchronous
  • Approximately 75 variables (a few were maps)

Results:

  • Total processes created ~ 22.74M
  • Total process Creation Average TPS: 395 (Maximum: 498)
  • External task fetch and lock TPS: Maximum: 7007, Average: 2815
  • Overall TP99 latency of external tasks: 450ms (Downstream latency of ~150ms)
  • Total Number of external tasks executed: 114M

Observations

  • We saw external task latency increase after four to five hours. On further inspection, we found that server-side streaming was not enabled for a few pods because of a bug in the community version of spring-zeebe. When we moved from Spring Zeebe to normal streaming, our TP99 improved from 1 second to 450 ms.
  • It is important to fine-tune Zeebe client parameters such as maxTasks, timeout, and thread pools based on your variable size (a tuning sketch follows this list).
  • We also tested with timers (which we have not shown in this BPMN model due to internal sharing policy), and we were pleased with the performance, as it ran without any spikes at the above TPS.
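
As an illustration of that kind of tuning, the sketch below sets the client's worker thread pool and per-worker limits using the Zeebe Java client. The values and the job type are hypothetical, and the maxTasks setting mentioned above roughly corresponds to the client's maxJobsActive limit.

```java
import io.camunda.zeebe.client.ZeebeClient;

import java.time.Duration;

public class TunedWorker {
    public static void main(String[] args) {
        // Size the shared job-worker thread pool on the client itself
        ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("zeebe-gateway:26500")   // placeholder address
                .usePlaintext()
                .numJobWorkerExecutionThreads(8)         // threads shared by all workers
                .build();

        // Per-worker limits: how many jobs may be held at once, how long a job
        // stays locked, and how often/how long the client polls the gateway.
        client.newWorker()
                .jobType("file-tax-return")              // hypothetical job type
                .handler((jobClient, job) ->
                        jobClient.newCompleteCommand(job.getKey()).send().join())
                .maxJobsActive(64)                       // comparable to the maxTasks knob mentioned above
                .timeout(Duration.ofSeconds(30))
                .pollInterval(Duration.ofMillis(100))
                .requestTimeout(Duration.ofSeconds(10))
                .open();
    }
}
```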

Camunda 7 vs Camunda 8 comparison

Performance metrics

| Metrics | Camunda 7 | Camunda 8 |
|---|---|---|
| TPS of workflow process creation | 20 | 375 |
| Total number of workflows created in 10 hours | 577K (0.5M) | 15M |
| External task worker TPS per workflow (1 Camunda topic/workflow) | Max: ~200, Avg: 150 | Max: 9100, Avg: 3200 |
| External task worker TPS across all workflows | Max: 900, Avg: 560 | Max: 9100, Avg: 3600 |
| TPS for asynchronous jobs (e.g., timers, async continuations for tasks) | Max: 600, Avg: 280 | Max: 7000, Avg: 2100 |
| TP99 of round trip of external task workers | ~3.6 seconds | ~250 ms |
| Database scaling | Scaled vertically (current infra: m5.8xlarge) | Horizontal scaling (by adding more brokers/pods) |

Current improvements we are working on with the Camunda team

During our migration and testing process, we encountered some challenges, which we are working with the Camunda team to fix.

  • When variable sizes grow beyond 100-150 KB, we experience slowdowns when exporting data to Elasticsearch. This happens because the processing queue grows, which affects the overall processing speed.
  • Operate lags behind under high TPS and can show slightly stale data.
  • We saw a few duplicated events (100 out of 25 million) where processes were created twice, although we are still determining whether the duplication is caused by Camunda or by our testing software setup.

Conclusion

Overall, Camunda 8 scales well with a TP99 latency of less than 250ms. It definitely outperforms Camunda 7 in terms of scaling. Our team is excited about the possibilities that Camunda 8 offers us to help scale our workflows and deliver on Intuit’s mission to power prosperity by supporting users navigating tax regulations and preparing their filings in our QuickBooks and TurboTax systems.
