Cloud Week 2021: 5 key (and emerging) trends in observability

Now that cloud is here and adoption is growing, we need to stand back and assess our observability capability

In a world of highly abstracted, typically virtualised, often ephemeral and always dynamic cloud computing resources, continuous observability is essential. However, the cloud was not created with observability of internal systems in mind; it was initially sold as a route to IT agility through resource flexibility and cost manageability.

Now that cloud is here and adoption is growing, we need to stand back and assess our observability capability. In addition, as cloud-native implementations now span public, private, hybrid and multi-cloud (multiple vendor) instances, we can start to think about polycloud, where the different parts of application and data service workloads are split across various Cloud Service Providers (CSPs).

APM is everywhere

Many ask what the difference is between cloud observability and APM (Application Performance Monitoring). We used to ‘just simply’ have virtual machines, which meant that blocks or instances of compute could be exposed to observability comparatively easily.

As APM has become almost synonymous with observability, we now see it extend to every tier and structure throughout the IT stack. We need APM for applications, obviously, but we also need infrastructure APM (iAPM, if you will), and it needs to be capable of being directed at any of the stars in the virtualised galaxy we now inhabit.
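
To make this concrete, here is a minimal sketch of what application-level instrumentation can look like, assuming a Go service and the Prometheus client library; the metric name, label and port are illustrative inventions rather than part of any particular APM product.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequests counts requests per path; an APM or observability backend
// can scrape it from /metrics alongside infrastructure-level signals.
var httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "myapp_http_requests_total", // hypothetical metric name
    Help: "Total HTTP requests handled, labelled by path.",
}, []string{"path"})

func handler(w http.ResponseWriter, r *http.Request) {
    httpRequests.WithLabelValues(r.URL.Path).Inc()
    w.Write([]byte("ok"))
}

func main() {
    http.HandleFunc("/", handler)
    http.Handle("/metrics", promhttp.Handler()) // scrape endpoint for the monitoring layer
    http.ListenAndServe(":8080", nil)
}

The same pattern scales down the stack: an exporter running beside a database, node or network device can expose infrastructure metrics in exactly the same way, which is what makes a single observability pipeline workable.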

A federated, centralised, orchestrated view

In a world where we have multiple cloud providers and many different cloud instances from different CSPs, we need an orchestrated, federated level of observability if we want to stay in control: a centralised view with the ability to filter and aggregate across multiple clouds and multiple clusters.

Federating observability data to a centralised place is now a common technique and process. It has proven an effective way to spot cloud overloads, bad provisioning and ‘zombie’ cloud wastage, where instances are left running idle. When we bring all of these signals together, we can drive more efficient cloud resources to serve our Content Delivery Networks (for example) and work at a smarter level all round.
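
As a rough sketch of what that central view might involve, the Go program below queries several (hypothetical) per-cluster Prometheus endpoints over the standard HTTP query API and sums one CPU figure across them; the cluster names, URLs and PromQL expression are assumptions made for the example, not a prescribed architecture.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "strconv"
)

// promResponse mirrors the subset of the Prometheus HTTP API
// response we care about for an instant vector query.
type promResponse struct {
    Status string `json:"status"`
    Data   struct {
        Result []struct {
            Metric map[string]string `json:"metric"`
            Value  [2]interface{}    `json:"value"` // [unix timestamp, string value]
        } `json:"result"`
    } `json:"data"`
}

// clusters maps a friendly name to that cluster's query endpoint (hypothetical URLs).
var clusters = map[string]string{
    "aws-eu":  "http://prom.aws-eu.example.com:9090",
    "gcp-us":  "http://prom.gcp-us.example.com:9090",
    "on-prem": "http://prom.dc1.example.com:9090",
}

func queryCluster(base, query string) (float64, error) {
    resp, err := http.Get(base + "/api/v1/query?query=" + url.QueryEscape(query))
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var pr promResponse
    if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
        return 0, err
    }
    var total float64
    for _, r := range pr.Data.Result {
        v, err := strconv.ParseFloat(r.Value[1].(string), 64)
        if err != nil {
            return 0, err
        }
        total += v
    }
    return total, nil
}

func main() {
    // Sum CPU usage across every cluster to get one federated, central figure.
    const q = `sum(rate(container_cpu_usage_seconds_total[5m]))`
    var grandTotal float64
    for name, base := range clusters {
        v, err := queryCluster(base, q)
        if err != nil {
            fmt.Printf("%s: query failed: %v\n", name, err)
            continue
        }
        fmt.Printf("%s: %.2f cores\n", name, v)
        grandTotal += v
    }
    fmt.Printf("all clusters: %.2f cores\n", grandTotal)
}

In practice this pull-and-aggregate step is usually handled by purpose-built federation tooling rather than ad-hoc scripts, but the principle of bringing signals from every cloud into one queryable place is the same.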

Connected correlation inside the firehose

The amount of data we are consuming and producing right now gives us many more signals against which to track our observability requirements. With the Internet of Things (IoT) exponentially increasing the number of data points, we are drinking from a firehose in terms of data flow… and that can make observability far more difficult.

To address this challenge, we need to think about connected correlation. Because there is so much out there to observe, connected correlation helps provide vital links between the data sources that are actually mission-critical to the IT function’s operation.
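
As a toy illustration of the principle, the snippet below joins two otherwise separate signal streams (hypothetical in-memory log lines and trace spans) on a shared trace ID; real systems do this at far greater scale, but the join on a common identifier is the essence of connected correlation.

package main

import "fmt"

type logLine struct {
    TraceID string
    Message string
}

type span struct {
    TraceID  string
    Service  string
    DurMilli int
}

func main() {
    logs := []logLine{
        {"abc123", "payment declined"},
        {"def456", "cache miss"},
    }
    spans := []span{
        {"abc123", "checkout-api", 840},
        {"def456", "catalog-api", 12},
    }

    // Index spans by trace ID so each log line can be linked back to the
    // request that produced it, across otherwise separate data sources.
    byTrace := map[string][]span{}
    for _, s := range spans {
        byTrace[s.TraceID] = append(byTrace[s.TraceID], s)
    }
    for _, l := range logs {
        for _, s := range byTrace[l.TraceID] {
            fmt.Printf("trace %s: %q correlates with %s (%dms)\n",
                l.TraceID, l.Message, s.Service, s.DurMilli)
        }
    }
}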

Bartłomiej Płotka, principal software engineer, Red Hat

Continuous profiling

Our observability goals see us continually looking for optimisations that will increase performance efficiency. This means we need to look for, track and analyse different observability signals. One of the best ways to do this is profiling. This technique lets us see which part of the application is using how much compute resource (CPU time, memory, disk or network IO), rather than guessing from the total resource usage of our process.

Continuous profiling enables us to look back at an application’s performance characteristics during interesting cases; it is especially useful when the application is about to run out of memory and perhaps crash the whole node. If we can capture application profiles every 60 seconds (or even more frequently), then we can see where a function in the application source code might need optimisation or augmentation.
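
In the Go ecosystem, for example, a minimal first step is simply exposing the runtime’s built-in profiling endpoints, which a continuous-profiling agent or even a basic scheduled scraper can then collect from on a fixed interval; the port below is an arbitrary assumption.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Profiles can then be pulled periodically, for example every 60 seconds:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    // and stored so that past behaviour can be compared with the present.
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}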

A hive of activity with eBPF

Lastly, then, to eBPF, or extended Berkeley Packet Filter to give it its full name. This is a mechanism that allows us to execute additional code in the Linux kernel. When we can look at specific functions inside the kernel using this ‘special spy agency’ technique, we gain new controls over observability. As an additional benefit, eBPF does not require app-level instrumentation to start capturing metrics.

Even though BPF was originally built for packet filtering and eBPF is most closely associated with security, it can now be used more proactively to expose application metrics. We used to think of a service mesh as the way to put proxies around an application, but for observability purposes a service mesh can be replaced with eBPF, which has much lower overhead and more capabilities.

A ‘canary deployment’ might still require a service mesh, however: there remain non-observability use cases for service meshes, such as canary deployments (where traffic is tightly controlled) and authorisation (via mutual TLS). eBPF makes no attempt to adjust traffic at that level; for now, its use cases are security and observability only.

If we can consider some (ideally all) of these factors and functionalities in our quest to achieve observability in modern IT stacks, then we just might be able to pop our head above the clouds and see what’s coming next.
