In October 2021, we migrated our 10-year-old artifact storage to a new cloud-based setup. To prepare our users for the change, we shared our reasoning behind it and what to do as a result.
The migration went quite smoothly from a technical perspective. The process was completed more quickly than anticipated, and the first tests with our Continuous Integration (CI) showed that everything was working correctly.
Despite things going well overall, we did encounter some internal and external issues along the way. In this blog post, you will learn about the challenges our team members and users faced, and about our plans to offer more transparency and collaboration when troubleshooting incidents in the future.
TL;DR: Migrating our 10-year-old artifact storage came with some lessons learned, especially because we had almost zero insight into how our users consume artifacts. Therefore, we’d like to invite you to participate in a short survey so we can learn more about your setup. In addition, we’ve launched a status page where we report on incidents and keep you up to date on the current status and availability of our artifact storage.
Getting to know our users
One can look at our team’s service from two different perspectives as we serve both internal and external users with our CI platform.
The internal users are our colleagues and the Camunda products they work on. We’ve partially built their release processes with them or maintained their products at some point. We understand how our teams use the artifact service internally. Additionally, we can quickly roll out changes and introduce new features that make our lives easier, as we can directly communicate with our teams and iterate with them.
On the other hand, we have very little insight into how those artifacts are being consumed by our external users. Our communication with customers is scattered over multiple internal teams, and the public artifact configuration (Maven) hasn’t changed since its introduction.
We’ve learned from support requests that some users replicate everything in our artifact storage that they have access to, others cache only the artifacts they use, and others consume artifacts without caching at all. We also don’t have insight into the different tools our external users employ, such as Maven, Gradle, or curl.
After the migration
Looking at the outcome after the migration, we had very few internal problems apart from the introduction of a new authentication method. As part of Camunda’s internal security effort, we introduced Okta as the new authentication method for internal users and deprecated Lightweight Directory Access Protocol (LDAP). While the rollout was an initial inconvenience and didn’t work as planned, it had no direct impact on our teams, as they could continue using LDAP for the time being.
Over three months, we had few additional issues except for one of our teams having problems with the redirect proxy and their outdated toolchain. If this applies to you as well, consider upgrading to the latest versions of JDK and Maven to comply with the latest security standards.
In December, our team encountered another issue. To avoid network errors, we switched our internal Nexus CI cache to resolve artifacts directly from Artifactory; previously, it had done so indirectly through our URL rewriting proxy. After the switch, Nexus was no longer able to resolve any artifacts.
It happened at the worst time imaginable, simultaneously with the Log4j incident, which meant our teams needed a working artifact storage to test and push artifacts quickly. We resolved this issue by recreating the proxy repository in Nexus.
While debugging the problem, we noticed that Nexus had internally saved a flag marking the remote artifact storage as a Nexus instance, which can cause Nexus to send additional, Nexus-specific headers. This resulted in a header mismatch between S3 and Nexus that made Nexus unable to retrieve any files.
The solution was to recreate the proxy repository in Nexus, which removes all hidden flags. Be aware that it also removes all cached artifacts.
From the external perspective, we had two straightforward and immediate issues:
1. Many users had problems resolving artifacts from the camunda-bpm-ee repository, which caused some initial confusion. Simply put, some users had enabled the option to always send their authentication, and the resulting failures later turned out to stem from a configuration issue on our end. We then quickly communicated a workaround using a virtual repository, which worked well.
Here is where we have some technical debt. As explained in our previous blog post, the artifact storage has gone through a lot of hands over the years, but how external users accessed enterprise artifacts was never adjusted. Internally, we switched our teams to use virtual repositories a long time ago, while the direct usage approach was still being offered to our users. Even with the introduction of Optimize enterprise artifacts, our users were still given direct access instead of being switched to a virtual repository.
2. We didn’t anticipate that some users were retrieving artifacts directly with Wget and curl. Artifactory uses AWS S3 as its storage backend, so artifacts reside in an S3 bucket while all metadata files live on the Artifactory instance. As a result, retrieving metadata files directly was not an issue. When retrieving artifacts directly, however, a user is redirected to S3 via a temporary, pre-authorized link that grants access to the requested file. While Wget should follow the redirect by itself, curl requires the additional --location option (short form: -L) to follow it; otherwise, the user receives an empty file.
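To make the redirect behavior concrete, here is a minimal, self-contained sketch in Python. It is not our actual setup: the local server, the /repo/demo.jar and /storage/ paths, the token parameter, and the payload are all hypothetical stand-ins for the repository front end and its storage backend. It only illustrates why a client that does not follow redirects ends up with an empty body, while one that follows them receives the file.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ARTIFACT_BYTES = b"fake-jar-contents"  # placeholder payload, not a real artifact


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a repository that redirects to its storage."""

    def do_GET(self):
        if self.path == "/repo/demo.jar":
            # Front end: answer with a redirect and an empty body.
            self.send_response(302)
            self.send_header("Location", "/storage/demo.jar?token=temporary")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path.startswith("/storage/"):
            # Storage backend: serve the actual bytes.
            self.send_response(200)
            self.send_header("Content-Length", str(len(ARTIFACT_BYTES)))
            self.end_headers()
            self.wfile.write(ARTIFACT_BYTES)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Mimics a client that does not follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then reports the 3xx response as an HTTPError


def fetch(url, follow_redirects):
    """Return (status, body) for url, optionally following redirects."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore any system proxy
    if not follow_redirects:
        handlers.append(NoRedirect())
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/repo/demo.jar"
    try:
        return fetch(url, follow_redirects=False), fetch(url, follow_redirects=True)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    not_followed, followed = run_demo()
    print("without following redirects:", not_followed)
    print("following redirects:", followed)
```

In this sketch, the first request returns the redirect status with an empty body, which corresponds to the empty file a curl user sees without --location. The second request follows the redirect and returns the payload.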
From a technical perspective, the migration went very well, with all artifacts remaining available as read-only throughout. Our support team did a great job of quickly responding to user requests and filtering out new issues for the infrastructure team to resolve. We provided workarounds relatively fast, but it was apparent that communication was pull-based: in most cases, users directed questions to our support team and the Camunda forum. Reaching users proactively, in contrast, always required an engineer to submit a message through our support team and, even then, only Enterprise Edition users would see it.
Therefore, we’ve launched a status page for our public services to provide transparent reporting and reach a broader range of our users. You can subscribe to this page for updates on incidents and the current status and availability of our artifact storage.
In addition, we invite you to participate in our survey to share how you are using our artifacts, so we can better support you.