Operating Camunda

To operate Camunda successfully, you need to take operational requirements into account when modeling business processes. Use your existing tools and infrastructure for technical monitoring and alarming. When appropriate, use Camunda Cockpit and consider extending it with plugins instead of writing your own tooling.

Modeling for Easier Operations

First, make sure you understand Dealing With Problems and Exceptions in Camunda, in particular the transaction management discussed there, including transaction borders and the consequences of rolling back a transaction.

Using Additional Save Points Wherever Appropriate

By introducing additional transaction borders (we also call them save points), you can avoid operational difficulties, such as repeating steps that should not be repeated just because subsequent steps fail and cause multiple steps in the process to roll back. Get to know the possibilities and our recommendations by reading about Additional Save Points in the context of Dealing With Problems and Exceptions.

Avoiding Modeled Retry Behavior

A common idea is to model retry behavior into your process models. In general, this should be avoided. The following process model shows a typical example of this anti-pattern:

All operational use cases modeled here can be handled generically by Camunda itself: learn more about the BPMN configuration of Retry Strategies and about Incident Management with Camunda Cockpit. Of course, you can also influence all this via the (Java or REST) APIs, in case you want to leverage scripting for your operations.

Installing Camunda

For a quick start, especially during development, download a Camunda EE full distribution for the container technology of your choice. Such a full distribution includes the container, a preconfigured shared process engine and all Camunda web applications (Cockpit, Tasklist, Admin). It is configured to use a file-based H2 database, which works without any additional installation. See the installation guide for further details. You might want to remove the example application to get rid of the invoice process. Do so by removing the corresponding WAR file from the deployment folder of the container before the first start.

For production usage, we recommend setting up the container of your choice yourself, as we do not guarantee that our distribution always ships the latest stable, patched container version. Also, we cannot ship some containers for licensing reasons. Install Camunda into this container following the installation guide. Add the required JDBC drivers for the database of your choice and configure the datasources accordingly. Make sure to secure Camunda if required.

We recommend scripting the installation process to allow for an automated installation. Typical steps include:

  • Set up (or extract) the container and install Camunda into it. As an alternative, you might use the Camunda distribution and remove the example application.

  • Add JDBC drivers and configure the datasource for Camunda.

  • Configure identity management (e.g. to use LDAP) or add required users and groups to the database-based identity management.

  • Set up a Maven build for the Camunda webapp in case you want to add your own plugins or customizations.

  • Install the Camunda license.

To script the installation, you can also retrieve all required artifacts from our Maven repositories. This makes it easy to switch to new Camunda versions. Integrate all pieces by leveraging a scripted configuration management and server automation tool such as Docker, Puppet, Chef or Ansible.
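As an illustration of retrieving artifacts in a script, the following minimal sketch builds a download URL from Maven coordinates using the standard Maven repository layout. The class name, the repository base URL and the version in the usage note are placeholders chosen for this example, not values prescribed by Camunda.

```java
import java.net.URI;

/** Builds download URLs that follow the standard Maven repository layout. */
final class MavenArtifactUrl {

  private MavenArtifactUrl() {}

  /**
   * Layout: <repo>/<groupId with dots as slashes>/<artifactId>/<version>/
   *         <artifactId>-<version>.<packaging>
   */
  static URI of(String repositoryBaseUrl, String groupId, String artifactId,
                String version, String packaging) {
    // Tolerate a trailing slash on the base URL
    String base = repositoryBaseUrl.endsWith("/")
        ? repositoryBaseUrl.substring(0, repositoryBaseUrl.length() - 1)
        : repositoryBaseUrl;
    String path = String.join("/",
        groupId.replace('.', '/'),
        artifactId,
        version,
        artifactId + "-" + version + "." + packaging);
    return URI.create(base + "/" + path);
  }
}
```

For example, `MavenArtifactUrl.of(repoUrl, "org.camunda.bpm", "camunda-engine", version, "jar")` yields the location of the engine JAR for the given version, so switching to a new Camunda version only means changing one variable in your installation script.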

Setting up Monitoring and Alarming

Certain situations have to be recognized quickly in order to take appropriate action during the runtime of the system. Therefore, consider monitoring and alarming up front when planning for production operations.

Distinguish between process execution related monitoring and basic systems monitoring. Do systems monitoring via your normal Java or container tools; nothing Camunda-specific is needed in that area.

Recognizing and Managing Incidents

If a service call initiated by Camunda fails, a retry strategy is applied. By default, a service task is retried three times; you can also configure your own retry strategy for retrying failed transactions. If the problem persists after those retries, an incident is created and Camunda will not recover without intervention from a human operator. Therefore, make sure somebody is notified whenever there are (new) incidents.

  1. You can build an active solution, where Camunda actively notifies somebody when there is a new incident. For example, you could send an email or create a User Task in Camunda.

    To achieve this, you can hook in your own Incident Handler as shown in this example. The upside is that sending emails like this is very easy; the downside is that you have to implement Camunda-specific classes.

    However, if a crucial system goes down, you might end up spamming people, with thousands of process instances running into the same incident.

  2. Therefore, a passive solution is typically preferred: a tool queries for (new) incidents from the outside via the Camunda (Java or REST) API and takes the desired action.

    The most common way is to query the number of incidents with the tool of your choice using the REST API: GET /incident/count. More information can be found in the REST API documentation. We prefer the REST API over lower-level technologies (like JMX or PMI), as it typically works in any environment.

    Now you can easily batch multiple incidents into one email or delegate alarming to existing tools like Nagios or Icinga. An additional advantage is that you may already have proper alarming groups defined in such a tool.
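The passive approach can be sketched with the JDK's built-in HTTP client (Java 11 or later). This is a minimal illustration under assumptions, not a finished monitoring tool: the class and method names are made up for this example, the base URL of the REST API is passed in by the caller, and the response is assumed to be a small JSON object containing a numeric `count` field.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IncidentCountCheck {

  // Assumption: the response body looks like {"count": 3}
  private static final Pattern COUNT = Pattern.compile("\"count\"\\s*:\\s*(\\d+)");

  private IncidentCountCheck() {}

  /** Builds the GET /incident/count request for the given REST base URL. */
  static HttpRequest countRequest(String restBaseUrl) {
    return HttpRequest.newBuilder(URI.create(restBaseUrl + "/incident/count"))
        .header("Accept", "application/json")
        .GET()
        .build();
  }

  /** Extracts the count without a JSON library; good enough for this flat response. */
  static long parseCount(String json) {
    Matcher matcher = COUNT.matcher(json);
    if (!matcher.find()) {
      throw new IllegalArgumentException("No count field in: " + json);
    }
    return Long.parseLong(matcher.group(1));
  }

  /** Calls the REST API and returns the current number of incidents. */
  static long fetchIncidentCount(HttpClient client, String restBaseUrl)
      throws IOException, InterruptedException {
    HttpResponse<String> response =
        client.send(countRequest(restBaseUrl), HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new IOException("Unexpected status " + response.statusCode());
    }
    return parseCount(response.body());
  }

  /** Alerts once per poll when the count rises, not once per failing process instance. */
  static boolean shouldAlert(long previousCount, long currentCount) {
    return currentCount > previousCount;
  }
}
```

Comparing the current count with the previous one is one simple way to avoid the spamming problem of the active solution: an outage that hits thousands of process instances produces a single notification per polling interval. A real setup would probably also send reminders for incidents that stay unresolved.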

Monitoring Performance Indicators

Monitor the following typical performance indicators over all process definitions at once:

  • Number of open executable jobs: GET /job/count?executable=true (REST API), as these are jobs that should have been executed but have not been yet.

  • Number of open incidents: GET /incident/count (REST API), as somebody has to clear incidents manually and increasing numbers point to problems.

  • Number of running process instances: GET /process-instance/count (REST API). Increasing numbers might be a trigger to check the reasons, even though they can be perfectly fine (e.g. increased business).

In case you want to monitor process definition specific performance indicators, you can either iterate over the process definitions, e.g. by using GET /process-definition/{id}/statistics (REST API), or leverage GET /process-definition/statistics (REST API), which groups overall performance indicators by process definition. Beware that you may need to take older versions of process definitions into account, too. (There is a feature request that would allow querying directly for open incidents by process definition key.)
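Taking older versions into account usually means summing the per-version numbers for each process definition key. The sketch below shows this aggregation on already extracted values. It assumes process definition ids of the form key:version:uniqueId; where your JSON parsing gives you the key of each definition directly, prefer that. All names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

final class StatisticsByKey {

  private StatisticsByKey() {}

  /** Assumes ids like "invoice:2:4711"; an id without ':' is used as the key itself. */
  static String keyOf(String processDefinitionId) {
    int colon = processDefinitionId.indexOf(':');
    return colon < 0 ? processDefinitionId : processDefinitionId.substring(0, colon);
  }

  /** Sums a per-definition indicator (e.g. open incidents) over all versions of a key. */
  static Map<String, Long> sumByKey(Map<String, Long> countsByDefinitionId) {
    Map<String, Long> result = new TreeMap<>();
    countsByDefinitionId.forEach(
        (id, count) -> result.merge(keyOf(id), count, Long::sum));
    return result;
  }
}
```

With this in place, a monitoring job can compare one number per process definition key against its threshold, regardless of how many versions are still running.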

Organizing Dedicated Teams for Monitoring

In general, the performance indicators mentioned above can and should be monitored generically and independently of specific process applications. However, you may want to set up dedicated alarming for different operating teams with more knowledge about specific process application characteristics. For example, one of those teams might already know the typical number of open user tasks for a certain process definition during normal runtime. There are two approaches to achieve this:

  1. The recommended approach is to configure dedicated alarming directly in your monitoring tool by creating separate monitoring jobs querying the performance indicators for specific process definitions.

    This approach does not need any operations-centric adjustments in Camunda and is easy to set up and handle.

  2. An alternative approach is to define team-specific bundles of process definitions in Camunda by leveraging the process definition "category" or even your own BPMN extension elements. However, this information cannot be used directly in the queries mentioned above. Hence, you have to implement additional logic to do so.

    We typically advise that you do not do so unless you have very good reasons to invest the effort.
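For the recommended approach, each team's monitoring jobs are essentially a list of REST queries restricted to the process definitions that team cares about. The sketch below builds the query URLs for the open user task example, using the task count resource filtered by process definition key. The class name and the way keys are handed in are assumptions for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TeamMonitoringJobs {

  private TeamMonitoringJobs() {}

  /** One "open user tasks" query per process definition key a team is responsible for. */
  static List<URI> openTaskCountUris(String restBaseUrl, List<String> processDefinitionKeys) {
    List<URI> uris = new ArrayList<>();
    for (String key : processDefinitionKeys) {
      // Encode the key in case it contains characters that are not URL-safe
      String encodedKey = URLEncoder.encode(key, StandardCharsets.UTF_8);
      uris.add(URI.create(restBaseUrl + "/task/count?processDefinitionKey=" + encodedKey));
    }
    return uris;
  }
}
```

The resulting URLs can be registered as separate checks in your monitoring tool, each with the threshold the responsible team considers normal.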

Bonus: Creating Your Own Alarming Mechanism

In case you do not have a monitoring and alarming tool or cannot create new jobs there, simply build a small alarming scheduler yourself. This could be a Java component that is called every couple of minutes, queries the current performance indicators via the Java API and then generates custom emails.

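import java.util.List;
import org.camunda.bpm.engine.runtime.Incident;

// Belongs in a class that provides processEngine, cockpitBaseUrl and sendEmail(...)
public void scheduledCheck() {
  // Query for all open incidents
  List<Incident> incidents = processEngine.getRuntimeService()
    .createIncidentQuery().list();
  // Nothing to report: do not send an empty mail on every run
  if (incidents.isEmpty()) {
    return;
  }
  // Prepare mailing text with one link per incident
  StringBuilder emailContent = new StringBuilder(
    "There are " + incidents.size() + " incidents:<br>");
  for (Incident incident : incidents) {
    emailContent.append("<a href=\"")
      .append(cockpitBaseUrl)
      .append(incident.getId()).append("\">")
      .append(incident.getIncidentMessage()).append("</a><br>");
  }
  emailContent.append("Please have a look into Camunda Cockpit for details.");
  // Send mailing, e.g. via SimpleMail
  sendEmail(emailContent.toString());
}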
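The check above still needs something that calls it every couple of minutes. If you have no scheduling facility in your container, a minimal sketch based on the JDK's ScheduledExecutorService could look as follows. The class name is made up for this example, and the check is passed in as a plain Runnable.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class AlarmingScheduler implements AutoCloseable {

  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  /** Runs the check immediately and then again after each pause of the given length. */
  void start(Runnable check, long pause, TimeUnit unit) {
    Runnable safeCheck = () -> {
      try {
        check.run();
      } catch (RuntimeException e) {
        // An uncaught exception would silently cancel all further runs
        System.err.println("Alarming check failed: " + e);
      }
    };
    executor.scheduleWithFixedDelay(safeCheck, 0, pause, unit);
  }

  /** Stops the scheduler, e.g. when the application shuts down. */
  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

A call such as `scheduler.start(this::scheduledCheck, 5, TimeUnit.MINUTES)` would then run the incident check every five minutes. Catching exceptions inside the wrapper matters, because the executor suppresses all subsequent executions once a run throws.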

Bonus: Defining Custom Service Level Agreements

Apart from generic monitoring, you might want to define business-oriented service level agreements (SLAs) for very specific aspects of your processes, for instance overdue tasks, missed deadlines or similar. You can achieve that by

  1. Adding custom extension attributes to your BPMN process definition, e.g. for specific tasks, message events, etc., which serve to define your specific business performance indicators.

  2. Reading deployed process definitions and their custom extension attributes, e.g. by means of Camunda’s BPMN Model API, and interpreting their meaning for your business performance indicators, e.g. by calculating deadlines for tasks.

  3. Querying for instances (e.g. tasks) that are inside or outside the limits of your service level agreement.

This is normally implemented similarly to the Java scheduler we described above.
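The deadline calculation in step 2 can be sketched in a few lines, assuming the custom extension attribute holds an ISO-8601 duration such as "PT4H" (four hours) and the creation time of the task is known. Reading the attribute via the BPMN Model API and querying the tasks are left out here; the class name and the choice of an ISO-8601 duration are assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;

final class SlaCheck {

  private SlaCheck() {}

  /** Deadline = creation time plus the allowed duration, e.g. "PT4H" or "P2D". */
  static Instant deadline(Instant createdAt, String isoDuration) {
    return createdAt.plus(Duration.parse(isoDuration));
  }

  /** A task violates its SLA once the current time is past the deadline. */
  static boolean isOverdue(Instant createdAt, String isoDuration, Instant now) {
    return now.isAfter(deadline(createdAt, isoDuration));
  }
}
```

Passing the current time in as a parameter keeps the check easy to test. The scheduled component would call `isOverdue(task.getCreateTime().toInstant(), sla, Instant.now())` for each open task it has queried.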

Intervening with Human Operator Actions

Handling Incidents

Incidents are ultimately failed jobs for which no automatic recovery can take place anymore. Hence, a human operator has to deal with incidents. Check for incidents within Camunda Cockpit and take action there. You might, for example, want to retry the failing step or cancel the affected process instance.

It is worth noting that if you have a failing call activity in your process, you retry "bottom-up" (in the failing sub process instance), but you cancel "top-down" (the parent process instance is the one to be canceled). Consider the following example incident visualised in Camunda Cockpit:

[Figure: insurance application failed, with detail]

You may first see the incident on the call activity "Request documents" of the parent process, but it is actually caused by the failing activity "Request documents" in the sub process. For better comprehensibility, this is visualised directly in the picture above. In Cockpit, you can navigate to the call activity in the "called process instance" pane at the bottom of the screen. There you can retry the failing step of the sub process instance:

[Figure: document request failed]
(1) By clicking on this button, you can retry the failing step of the sub process instance. Note that a successful retry will also resolve the incident you see on the parent process instance.

On the other hand, you might also want to cancel the failing parent process instance:

[Figure: insurance application failed]
(1) By clicking on this button, you can cancel the failing parent process instance. The cancellation will also cancel the sub process instances running in the scope of the parent process instance.

Turning on/off all Job Execution

Sometimes you might want to prevent jobs from being executed at all. For example, when starting up a cluster, you might want to turn off the Job Executor and start it manually later, when everything is up and running.

  1. Set the jobExecutorActivate property to false.

  2. Start the Job Executor manually by writing a piece of Java Code and making it accessible, e.g. via a REST API:

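    // Needs: javax.ws.rs.POST and
    // org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl
    @POST
    public void startJobExecutor() {
      // The Job Executor is only reachable via the internal configuration class
      ((ProcessEngineConfigurationImpl) processEngine
        .getProcessEngineConfiguration())
        .getJobExecutor()
        .start();
    }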

A similar piece of code can be implemented to stop the Job Executor, calling shutdown() instead of start().

Suspending Specific Service Calls

When you want to prevent certain services from being called because they are down or faulty, you can suspend the corresponding Job Definitions, either using Cockpit or using an API (Java or REST).

By using the API, you can even automate suspension, e.g. by monitoring and recognizing when a target system goes down. By using naming conventions and accordingly customized job definition queries, you can then find all job definitions for that target system (e.g. "SAP") and suspend them until the target system is up again.
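A minimal sketch of such an automation via the REST API follows. It assumes a naming convention where the ids of activities calling a target system start with the system name (e.g. "SAP_"), and that you have already queried the job definitions and extracted their ids and activity ids. The request uses the job definition's "suspended" resource with a JSON body; class and method names are illustrative.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class JobDefinitionSuspension {

  private JobDefinitionSuspension() {}

  /** Picks the job definitions whose activity id follows the assumed "<SYSTEM>_..." convention. */
  static List<String> idsForSystem(Map<String, String> activityIdByJobDefinitionId,
                                   String systemPrefix) {
    return activityIdByJobDefinitionId.entrySet().stream()
        .filter(entry -> entry.getValue().startsWith(systemPrefix))
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  /** Builds PUT /job-definition/{id}/suspended, also covering the jobs that already exist. */
  static HttpRequest suspensionRequest(String restBaseUrl, String jobDefinitionId,
                                       boolean suspended) {
    String body = "{\"suspended\": " + suspended + ", \"includeJobs\": true}";
    URI uri = URI.create(restBaseUrl + "/job-definition/" + jobDefinitionId + "/suspended");
    return HttpRequest.newBuilder(uri)
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }
}
```

Sending the same request with `suspended` set to false activates the job definitions again once the target system is back. The requests can be sent with `java.net.http.HttpClient` or triggered from your monitoring tool.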

Suspending Whole Processes

Sometimes you may want an emergency stop for a specific process instance or for all process instances of a specific process definition, because something behaves strangely. Suspend it using Cockpit or using an API (Java or REST) until you have clarified what is going on.

Backing up Camunda

  1. Camunda stores all state information in its database. Therefore, back up your database by means of your database vendor's tools or your favorite tools.

  2. The Camunda container installation as well as the process application deployments are fully static from Camunda's point of view. Instead of backing up this data, we recommend a script-based, automated installation of containers as well as process applications in order to recover easily in case anything goes wrong.

Updating Camunda

For updating Camunda to a new version, please follow the guide for patch level updates or one of the dedicated minor version update guides provided for each minor version release.

A rolling update feature was introduced in version 7.6. It allows users to update Camunda without having to stop the system. Outdated engine versions are able to continue to access an already updated database, which allows clustered application servers to be updated one by one, without any downtime.

Preparation

  1. Before touching the servers, all unit tests should be executed with the desired Camunda version.

  2. Check running processes in Cockpit

    1. Handle open incidents

    2. Cancel undesired process instances if any

  3. Make a backup (see above)

Rollout

  1. Shut down all application server(s) (unless performing a rolling update, in which only one cluster node is taken down at a time after the database has been updated)

  2. Update the database using the SQL scripts provided in the distribution (all distributions contain the same scripts)

    • Ensure you also execute all patch level scripts

    • It is not confirmed whether the update scripts are idempotent, so do not assume that you can simply run all of them.

    • To check which version is in the database, check for missing tables, indexes or columns from the update scripts, for example with these queries against the Oracle data dictionary views:

      SELECT TABLE_NAME, INDEX_NAME FROM SYS.USER_INDEXES WHERE INDEX_NAME LIKE 'ACT_IDX_%' ORDER BY TABLE_NAME, INDEX_NAME;
      SELECT TABLE_NAME FROM SYS.USER_TABLES WHERE TABLE_NAME LIKE 'ACT_%' ORDER BY TABLE_NAME;

  3. Update applications and application server(s) or container(s)

  4. Start application server(s) or container(s)

  5. Check the logfile for exceptions

  6. Check Cockpit for incidents

  7. Test the application using the UI or API

  8. Repeat in all stages

Migrating from Activiti to Camunda

For migrating from Activiti 5.x to Camunda have a look at the user guide for Activiti Migration and at the blog post: How to migrate from Activiti 5.21 to Camunda 7.5.

  • The Activiti parser does not perform XML schema validation and allows attributes without namespaces ⇒ You should perform a schema validation on all deployed models (including historic versions that should be displayed in Cockpit)

  • Activiti designer stores Sub-Processes as collapsed although they should be expanded

  • The following Activiti Extensions are not supported by Camunda

    • <scriptTask activiti:autoStoreVariables="true"

    • <signal activiti:scope="global"

    • <signal activiti:scope="processInstance"

    • <sequenceFlow activiti:skipExpression

    • <sendTask activiti:type="mail" (might still work, but is certainly deprecated)

  • In order to edit the models in Camunda Modeler, you have to switch the namespace. However, the engine will continue to support both namespaces.
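Two of the points above lend themselves to a quick scripted check before migrating a larger number of models: scanning for the unsupported Activiti extensions and switching the namespace. The following sketch works on the BPMN XML as plain text, which is a rough heuristic rather than real XML processing. It assumes the conventional "activiti" prefix, double-quoted attribute values, and the namespace URIs http://activiti.org/bpmn and http://camunda.org/schema/1.0/bpmn; the class name is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ActivitiModelCheck {

  static final String ACTIVITI_NS = "http://activiti.org/bpmn";
  static final String CAMUNDA_NS = "http://camunda.org/schema/1.0/bpmn";

  // Attributes from the list of unsupported extensions (textual match only)
  private static final List<String> UNSUPPORTED = List.of(
      "activiti:autoStoreVariables", "activiti:scope", "activiti:skipExpression");

  private ActivitiModelCheck() {}

  /** Returns the unsupported extension attributes that occur in the model text. */
  static List<String> findUnsupportedExtensions(String bpmnXml) {
    List<String> found = new ArrayList<>();
    for (String attribute : UNSUPPORTED) {
      if (bpmnXml.contains(attribute + "=")) {
        found.add(attribute);
      }
    }
    return found;
  }

  /** Replaces the quoted Activiti namespace URI with the Camunda one; the prefix stays as it is. */
  static String switchNamespace(String bpmnXml) {
    return bpmnXml.replace("\"" + ACTIVITI_NS + "\"", "\"" + CAMUNDA_NS + "\"");
  }
}
```

Because an XML prefix is only a local alias for the namespace URI, replacing the URI is enough for the namespace switch in this sketch; the prefix itself does not have to be renamed. Treat hits from `findUnsupportedExtensions` as pointers for a manual review of the affected models, not as a complete validation.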

No guarantee - The statements made in this publication are recommendations based on the practical experience of the authors. They are not part of Camunda’s official product documentation. Camunda cannot accept any responsibility for the accuracy or timeliness of the statements made. If examples of source code are shown, a total absence of errors in the provided source code cannot be guaranteed. Liability for any damage resulting from the application of the recommendations presented here, is excluded.

Copyright © Camunda Services GmbH - All rights reserved. The disclosure of the information presented here is only permitted with written consent of Camunda Services GmbH.