How OpenTelemetry Improved Its Code Integrity for Arm64 by Working With Ampere

Snapshot

Challenge

Software developers and IT managers need instrumentation and metrics to measure software behavior. When developers and DevOps professionals assume that software will run on a single hardware architecture, they may be overlooking architecture-specific behavior. Arm64-based servers, including the Ampere Altra family of processors, offer performance improvements and energy savings over x86, but the underlying architecture is Arm64, which behaves differently to the x86 architecture at a very low level.

At the time, mid-2023, OpenTelemetry did not formally support Arm64 deployments. As the popularity of Arm64 instances increased because of their competitive price-performance, monitoring those systems was critical for observability vendors.

Solution

To help rectify that situation, Ampere Computing donated Ampere Altra-powered servers to the OpenTelemetry team. With these processors, the team could begin retrofitting their telemetry instrumentation for Arm64, and adapting their Node.js, Java, and Python code for the Arm64 architecture.

“Ampere gave us a leg up to understand how to best instrument the code, and run it in that setup,” remarked Antoine Toulmé, who maintains the OpenTelemetry Collector project while serving as a senior engineering manager at Splunk. “It was an interesting experience because it’s really powerful hardware.”

For the OpenTelemetry team to bring their CI/CD support for Arm64 up to parity with x86, they used Actuated. Actuated enabled the OpenTelemetry team to stage a self-hosted GitHub Actions environment within which they could build pipelines that tested code in both architectures for the same conditions.

This way, the project could run their complete test suite for all architectures, without forcing the project’s developers to select different tests for each architecture. As a result, the project’s support for Arm64 is approaching parity with x86.

Results

OpenTelemetry has now given both Arm64 and x86 developers and IT managers the instrumentation and metrics they need. As a result, customers running OpenTelemetry in production are experiencing more reliable, more stable code.

This is true not only for all processor architectures, but all operating systems: Identifying and fixing bugs like race conditions, which can be easier to trigger on Arm64, has the benefit of making the project better for every architecture and operating system. OpenTelemetry’s Toulmé says his team has seen 15 percent cost savings just from reducing the quantity, size, scale, and memory allocation of deployment instances, after moving from x86 to Arm64.

Developer Story

One class of software whose performance characteristics are most likely to differ across processor architectures is the observability platform. Here’s how OpenTelemetry made observability better for everyone by making its integration testing for Arm more robust.

Up until a few short years ago, software developers and IT operators disagreed about which aspects of an application needed to be measured most. It wasn’t called “observability” back then, but rather “application performance management” (APM), which was used interchangeably with “business performance monitoring” (BPM).

Developers wanted detailed traces and logs of transactions and activity in memory. Operators wanted a stopwatch that started when some process appeared to begin and stopped when it appeared to end, so they could measure the interval between the two events.

OpenTelemetry (OTel) has given both groups the instrumentation and metrics they need, or at the very least, the tools with which to devise those metrics. It provides a front-end which can be used with the modern observability and instrumentation systems that have replaced the APM systems of old, including those from long-time vendors such as Dynatrace and New Relic, newer service providers such as Honeycomb, Splunk, and Datadog, and the open source Prometheus monitoring system. OpenTelemetry has become the second-largest project of the Cloud Native Computing Foundation (CNCF) by number of contributors, after Kubernetes.

For OpenTelemetry’s instrumentation to be robust and reliable, CNCF developers must test it on all server platforms capable of running it. Arm64-based servers, including the Ampere Altra family of processors, offer performance improvements and energy savings. But the underlying architecture of these processors is Arm64, which behaves differently to the x86 (AMD64) architecture at a very low level. Testing OpenTelemetry for Arm64 has the additional benefit of revealing potential problems which had not shown up in the project’s test suites when tested only on x86.

Balancing the scales

In mid-2023, CNCF contributing developers were facing increasing pressure from users to support the monitoring of Arm64-based servers. As the popularity of Arm64 instances increased because of their competitive price-performance, monitoring those systems was critical for observability vendors. As OpenTelemetry provides a common interface for Kubernetes application developers, there was community pressure to add support to OpenTelemetry for Arm64 processors with up to 128 cores, such as Ampere Altra.

At that time, OpenTelemetry did not formally support Arm64 deployments. To help rectify that situation, Ampere donated Ampere Altra-powered servers to the OpenTelemetry team. With these processors, the team could begin retrofitting their telemetry instrumentation for Arm64, and adapting their Node.js, Java, and Python code for the Arm64 architecture.

“Ampere gave us a leg up to understand how to best instrument the code, and run it in that setup,” remarked Antoine Toulmé, who maintains the OpenTelemetry Collector project while serving as a senior engineering manager at Splunk. “It was an interesting experience because it’s really powerful hardware.”

Toulmé noted that his team had little trouble adopting Arm architecture and ecosystem from the standpoint of code development. Testing presented the biggest challenges, specifically when integrating code with third-party frameworks, applications, and libraries.

“We would see, for example, Docker images that claimed they were Arm-compliant,” Toulmé continued, “and when you run them in a CI/CD environment and you actually mean to run them on an Arm server, you realize they just repackaged amd64 code, and they just made it run as if it were Arm. That was a bit of a letdown.”

*Photo: Antoine Toulmé presenting OpenTelemetry on a System76 Thelio Astra, powered by a 128-core Ampere Altra Max, at KubeCon EU 2025 (Credits: Dave Neary)*

When developers and DevOps professionals assume that software will run on a single hardware architecture, they may be overlooking architecture-specific behavior. They may also miss issues with the code that do not show up frequently on that architecture.

As a result, they may not find certain simple anomalies such as race conditions, because the hardware is behaving in a way that conceals potential issues when two or more processes attempt to access the same resource asynchronously.

OpenTelemetry’s replacement for the APM agents that used to gather in the back of memory like lint on a brush is the Collector component. Written in Go, the Collector is an agent that serves as a destination point for instrumentation libraries to export their telemetry data.

When the Collector was first compiled for Arm64, recalls Toulmé, several race conditions surfaced, owing to differences in how x86 and Arm64 processors handle their pipelines and to the larger number of cores available on the CPU. It was the OTel team’s first indication that race conditions which stay hidden on x86 can behave very differently on Arm architecture.

“We had some early feedback from customers that some of the OpenTelemetry instrumentations were not working well on Arm because there were so many cores. You go from four cores to 128, 256 sometimes.”

The project maintainers tested and resolved these issues using Ampere’s servers for all of their Node.js, Java, and Python code. “In the last two years,” said Toulmé, “we’ve seen a huge improvement in support for Arm.”

The microVM solution

For the OpenTelemetry team to bring their CI/CD support for Arm64 up to parity with x86, they collaborated with Actuated principal developer Alex Ellis. Actuated is a platform that provides hosted runners for one of the most common CI/CD systems, GitHub Actions, using one’s choice of processor architectures. This makes it easier to build and test projects in heterogeneous server environments. Actuated accomplishes this by running processes within microVMs that are isolated from other workloads running on the same host.

“We’ve seen this from customers who’ve tried GitHub’s Kubernetes operator,” noted Ellis, who is also the creator of serverless microservices framework OpenFaaS. “It’s okay until the point you build or run a container, and then you need the privileges elevated so high that you can compromise every node in the entire cluster. And many people just put their head in the sand about it.”

“That’s what Actuated is about,” Ellis continued. “Instead, microVMs are used which have their own Docker instances, that are completely isolated and only exist for the lifetime of the build — then they’re completely destroyed. There is some overhead with using a microVM, but mainly, CI is more about CPU speed and having enough RAM to fit your programs in, than raw I/O.”

Staging all application code components inside virtualized packages separates them from broader networks, especially the public Internet, with at least one layer of abstraction. This results in a safer running environment for software components for all processor architectures, including x86 and Arm64.

Payoff

Now the OpenTelemetry team can spot behavioral issues that were being missed by tests on x86. As a result, customers running OpenTelemetry in production are experiencing more reliable, more stable code. This is true not only for all processor architectures, but all operating systems: Identifying and fixing bugs like race conditions, which can be easier to trigger on Arm64, has the benefit of making the project better for every architecture and operating system.

OpenTelemetry’s Toulmé says his team has seen 15 percent cost savings just from reducing the quantity, size, scale, and memory allocation of deployment instances, after moving from x86 to Arm64. Now, the team can work toward a situation where they can respond to Arm64-based customer issues with the same care and attention they pay to x86-based customer issues. That’s OpenTelemetry’s goal: tier-1 support by the end of 2025.

“We’re very happy about the results,” said Toulmé. “We see the performance on Arm is much higher than what we would get with legacy x86 servers. For our customers, we have published Docker images that support both Linux/AMD64, but also all the Arm64 variants. We are seeing a great uptake in terms of Arm64 downloads. We see a cost reduction of fifteen percent across the board. I can say, without a doubt, I’m a convert.”