This blog post covers a powerful feature in Camunda BPM for operating critical processes in production. When a core process runs under high volume, almost any problem is critical. The more options you have for dealing with such problems, the better. I am going to show you one of those options: job suspension.
The Process
Let’s say you have automated your order processing, i.e. the orders which are generated in your web shop. I will use a very simple example. New orders start a process in the backend, the delivery is scheduled and once the goods are delivered, the payment is scheduled. Let’s say your business is successful and you have many orders per minute.
From a technical perspective, all steps are executed asynchronously, i.e. they use asynchronous continuations, so each step is carried out as a job by the engine.
The Problem
Let’s assume that the payment service is down for an hour. You will immediately see a flood of exceptions from the engine, caused by process instances that fail because they cannot settle the payment.
While the payment service stays down, what do you need?
- You still need to accept incoming orders. No argument about that; that’s your business.
- At the same time, you do not want any process instance that calls the payment service to end up in a failed state that has to be cleaned up afterwards.
The Solution
Since you are using asynchronous continuations, the job suspension feature of Camunda BPM comes into play. With job suspension you can suspend all jobs of process instances that are about to call the payment service. You can still start new process instances for incoming orders, but running instances are prevented from ending up in a failed state. The suspension applies at the job definition level, so it covers all instances of the process at once rather than individual jobs.
Once the payment service is up again, you do the inverse and simply activate the job definition again. Of course it will take some time until all “waiting” jobs are executed, but the process engine now does the work that would otherwise have required manual intervention.
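To make the suspend/activate cycle concrete, here is a minimal sketch that builds the corresponding REST request with plain Java. It assumes the engine's REST API is reachable at http://localhost:8080/engine-rest (adjust to your deployment), and the job definition id is a placeholder that you would first look up, for example via GET /job-definition filtered by process definition key and activity id. The class and method names are made up for this illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class JobDefinitionSuspension {

    // Assumption: base URL of the engine's REST API; adjust to your deployment.
    static final String ENGINE_REST = "http://localhost:8080/engine-rest";

    // JSON body for PUT /job-definition/{id}/suspended.
    // includeJobs=true also suspends (or activates) the jobs that already exist.
    static String body(boolean suspended) {
        return "{\"suspended\": " + suspended + ", \"includeJobs\": true}";
    }

    // Builds the request without sending it.
    // suspended=true suspends the job definition, suspended=false activates it again.
    static HttpRequest request(String jobDefinitionId, boolean suspended) {
        return HttpRequest.newBuilder()
                .uri(URI.create(ENGINE_REST + "/job-definition/" + jobDefinitionId + "/suspended"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body(suspended)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical id; pass the real job definition id as the first argument.
        String id = args.length > 0 ? args[0] : "aJobDefinitionId";

        HttpRequest suspend = request(id, true);
        System.out.println(suspend.method() + " " + suspend.uri());
        System.out.println(body(true));

        // To actually send it:
        // HttpClient.newHttpClient().send(suspend, HttpResponse.BodyHandlers.discarding());
    }
}
```

Calling request(id, false) once the payment service is back produces the activation request. If you work inside the engine's JVM instead, the ManagementService of the Java API offers equivalent operations for suspending and activating job definitions.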
Other Use Cases
- You are calling Java code which has a bug. You can stop all instances from calling the faulty code, deploy a hotfix and continue.
- You need to make changes to a script or business rules before you want processes to continue.
- Any other problem which is related to your process but beyond the control of the engine.
Job Suspension vs. Job Retry
When should you use job suspension, and when the engine’s built-in retry mechanism? In short, I would recommend the retry mechanism for unknown and unexpected problems which occur temporarily. As soon as you know that something is not going to work, I would recommend using job suspension to avoid piling up retries and exceptions for the same problem. Of course this is a general rule of thumb; the right choice depends on the actual problem.
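The difference can be illustrated with a deliberately simplified model. The sketch below is not engine code: Job and the executor loop are toy stand-ins invented for this example. The only engine fact it relies on is that a job starts with three retries by default and is no longer picked up once they are used up, until someone resets them.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryVsSuspension {

    // Simplified stand-in for a job; not a Camunda class.
    static class Job {
        int retries = 3;           // default number of retries per job
        boolean suspended = false;
        boolean done = false;
    }

    // One pass of a toy job executor. Returns the number of exceptions raised.
    static int executeDueJobs(List<Job> jobs, boolean serviceUp) {
        int exceptions = 0;
        for (Job job : jobs) {
            if (job.done || job.suspended || job.retries == 0) {
                continue; // suspended jobs are skipped, they do not fail
            }
            if (serviceUp) {
                job.done = true;
            } else {
                job.retries--; // failed attempt consumes one retry
                exceptions++;
            }
        }
        return exceptions;
    }

    // Simulates an outage lasting `outagePasses` executor passes, then recovery.
    // Returns {exceptions during the outage, jobs completed after recovery}.
    static int[] simulate(int jobCount, int outagePasses, boolean suspendDuringOutage) {
        List<Job> jobs = new ArrayList<>();
        for (int i = 0; i < jobCount; i++) {
            Job job = new Job();
            job.suspended = suspendDuringOutage;
            jobs.add(job);
        }

        int exceptions = 0;
        for (int i = 0; i < outagePasses; i++) {
            exceptions += executeDueJobs(jobs, false);
        }

        // Service is back: activate the jobs again and run one more pass.
        for (Job job : jobs) {
            job.suspended = false;
        }
        executeDueJobs(jobs, true);

        int completed = 0;
        for (Job job : jobs) {
            if (job.done) {
                completed++;
            }
        }
        return new int[] {exceptions, completed};
    }

    public static void main(String[] args) {
        int[] retryOnly = simulate(100, 5, false);
        int[] withSuspension = simulate(100, 5, true);
        System.out.println("retry only:      exceptions=" + retryOnly[0] + ", completed=" + retryOnly[1]);
        System.out.println("with suspension: exceptions=" + withSuspension[0] + ", completed=" + withSuspension[1]);
    }
}
```

In this model, an outage that outlasts the retries leaves every job without retries and therefore stuck, whereas suspended jobs raise no exceptions and simply continue once they are activated again.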
Related Readings
- Cockpit provides tooling for job suspension and bulk retry of failed jobs.
- Read the engine documentation on job suspension
- REST API on job suspension (globally for all instances)
- REST API on job suspension (for single jobs)
- Using Camunda BPM as a shared process engine makes this feature even more powerful, since the engine is more independent of your application logic.