loyalty.dev

Operations in distributed apps, part 1: metrics

Observability is a topic that appears in multiple disciplines, typically from the engineering, IT Ops or DevOps point of view. Your organisation might also have an SRE team that's engaged with you on this.

We'd like to share some of our learnings from employing a few of these pillars (metrics, tracing, logging) in a distributed environment, beginning with metrics today.

The developer's bread & butter

Often as devs, our focus is on delivery. A common cycle could be described as:

  1. Build & ship.
  2. Move on to the next task.
  3. Fix any issues (then we rinse and repeat).

We focus on our code reviews, making sure the touch-points of our architecture make sense, and working with QAs to test things thoroughly. (Processes & practices here are pretty much industry standard for most organisations.)

Yet reaching Production is only step 1 in our systems' life cycle. For our teams, ongoing servicing of customers & fixing of issues is crucial, so the next step is to figure out what we can do to prime ourselves for sustaining our systems' operations.

System maturity

When teams & systems are just starting out, logging is the first thing we rely on, and in a largely monolithic environment it's usually sufficient.

Mature corporations typically have dedicated functions & departments with well-established processes that engineers can step into as-is. Some aspects include:

  1. Raising the incident.
  2. Ownership & first responder for the incident.
  3. Playbooks for different incident types.

Phases of systems growth

As you can tell, as we distribute our systems, we also multiply the number of places where things can go wrong. So as we grow, we need to start thinking about our services and how to build that process of monitoring & observability into our systems.

Our stakeholders

Our features service everyone in the ecosystem. What you see below is a sample set of our stakeholders. Each party has different uses, expectations & skill sets as a participant in operations. For example:

  • Customer service: Our 1st line of support. Timely data and good visibility are key.
  • Operations: Triaging complicated issues. Expect to have access to aggregated metrics. Help track anomalous situations in our systems.

and so on ...

[Diagram: a sample set of our operational stakeholders]

The 3 pillars (metrics today)

Metrics are the numbers that help us summarise behaviour & performance over time, giving us the insight to establish benchmarks for "normal" operations. We sample, summarise, correlate & aggregate them in different ways during reviews to yield patterns of play in our systems. However, they won't give us the "zoomed in" insights of an event log: we won't know why an issue happened.

If your goal (as one of the above stakeholders) is just black-box monitoring, then this is probably enough (knowing just the symptoms).

[Diagram: the three pillars of observability: metrics, logging & tracing]

(Image courtesy of https://peter.bourgon.org/)

For the full picture of how we use the 3 pillars, check out our upcoming posts; for now we go deeper into our use of metrics.

A look at some of our metrics

We use Datadog as one of our key tools here. As you can see, these dashboards tell us:

  • What's going on right now?
  • Whether something has gone off-tangent, so we can be alerted.

[Screenshot: one of our Datadog dashboards]

For example: an alert has been triggered, notifying you that the number of incoming connections is higher than a specified threshold. Does that mean users are having a poor experience in production? That’s unclear. But this first lens can tell us some very important things we may need to know.

  • At any point in time, is each measure performing within the thresholds we care about?
  • Can we group them in different ways to narrow a performance issue down to a specific data point?

Metrics become even more useful when we can slice our time series further and look at a separate one for each machine, or each endpoint.
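To make the alerting side concrete, here's a minimal sketch of how such a threshold monitor could be codified with Datadog's legacy Python client. The metric name, threshold and notification handle are illustrative placeholders, not our actual configuration:

    from datadog import initialize, api

    # Hypothetical credentials; pull these from your secret store in practice.
    initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

    # Alert when the 5-minute average of incoming connections breaches 500.
    # The metric name and @slack handle below are placeholders.
    api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):sum:app.net.connections.incoming{env:production} > 500",
        name="Incoming connections above threshold",
        message="Incoming connections are unusually high. @slack-ops-alerts",
        options={"thresholds": {"critical": 500}, "notify_no_data": False},
    )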

Types of metrics

There are 3 kinds of metrics you can expect to get with most tools out-of-the-box.

[Diagram: system, application & business metrics]

System metrics: This is an industry standard. E.g. node-level info like container utilisation rates, throughput, etc. Sysadmins have gotten this down to where it's standard practice to access this info for any IaaS/PaaS service you use.

Application metrics: Similarly, this is a standard we can get with most APM tools. These tell us:

  • Where are we slow?
  • When is this happening?

This scope includes anything your app does or interfaces with when executing your app's logic. (E.g. query performance, API latencies and response success rates.)

Business logic: This is customer-level info: the data points that help us figure out the customer/process impact of our different systems' actions. This takes the most targeted effort to build visibility into, and it also requires a good understanding of the product & collaboration with the ops or business stakeholders.
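System and application metrics mostly arrive for free from the agent or APM integration; business metrics are the ones we have to emit ourselves. As a rough sketch (the metric and tag names below are hypothetical, not our real schema), emitting one through DogStatsD could look like this:

    from datadog import initialize, statsd

    # DogStatsD listens on the local Datadog agent by default.
    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    def record_redemption(partner: str, succeeded: bool) -> None:
        # Count every redemption attempt, tagged by partner and outcome so the
        # same series can be sliced by either dimension on a dashboard.
        outcome = "success" if succeeded else "failure"
        statsd.increment(
            "rewards.redemption.attempted",
            tags=[f"partner:{partner}", f"outcome:{outcome}"],
        )

    record_redemption("acme_air", succeeded=True)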

Making sense of our metrics across apps

When it comes to collecting metrics, it's important we follow the event throughout its lifetime. Take a hypothetical situation like adding an item to a shopping cart:

[Diagram: the add-to-cart sequence across services]

To accurately trace this sequence, we should standardise the naming scheme of our metrics up front:

  • What is happening? (cart.add_item).
  • For what entities is this happening? Or what do they represent? (e.g. customer) - these are the data dimensions we can tag our metrics with.
  • When is this happening? (i.e. add_item.initiate, add_item.check_stock, add_item.finalised).

When we first add metrics, it's easy to just track the API request and response, but that's usually too broad to help us drill into the key checkpoints of a process. Hence in a distributed environment, we'd want to ensure our metric conventions are opinionated enough to help monitor interactions across apps too.
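A sketch of what that convention could look like in code, again via DogStatsD; the helper functions, metric names and tags here are purely illustrative:

    from datadog import statsd

    def check_stock(sku: str) -> bool:
        return True  # placeholder for a real inventory lookup

    def persist_to_cart(cart_id: str, sku: str) -> None:
        pass  # placeholder for the real write

    def add_item(cart_id: str, sku: str, customer_tier: str) -> None:
        # "For what entities": tag with bounded dimensions, not raw IDs.
        tags = [f"customer_tier:{customer_tier}"]

        # "When is this happening": one count per checkpoint of the same flow.
        statsd.increment("cart.add_item.initiate", tags=tags)

        if not check_stock(sku):
            statsd.increment("cart.add_item.check_stock.failed", tags=tags)
            return
        statsd.increment("cart.add_item.check_stock", tags=tags)

        persist_to_cart(cart_id, sku)
        statsd.increment("cart.add_item.finalised", tags=tags)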

To get into the white-box part of ops servicing (why the system is behaving the way the metric indicates), we'll need to arm ourselves with logging and tracing.

Final notes on metrics

When it comes to sustaining ops with metrics we find ourselves regularly checking in on these aspects:

  • Not all metrics should have alerts. We aim to focus alerts mostly on symptoms with customer impact.
  • Watch your cardinality. As with most tools, it's generally more expensive (since it's compute intensive) to have very high cardinality data points. It's costly to drill down to very granular slices of your data (it's not impossible, we just have to be prepared to fund it if we're going to tag our metrics and query by, say, user_ids; see the sketch after this list).
  • Real-time querying means trade-offs with retention costs. Metrics are meant to aid with recent/near real-time monitoring. Also, data evolves in systems over time, so we shouldn't expect very consistent data points for long-term trend analysis.
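To illustrate the cardinality point above: the same counter tagged two ways, once with an unbounded dimension and once with a bounded one. All names here are hypothetical:

    from datadog import statsd

    user_id = "usr_19fa42c7"  # unbounded: one time series per user
    plan_tier = "gold"        # bounded: a handful of series in total

    # High cardinality: every distinct user_id spawns its own series.
    statsd.increment("cart.add_item.finalised", tags=[f"user_id:{user_id}"])

    # Bounded cardinality: cheap to aggregate, group and alert on.
    statsd.increment("cart.add_item.finalised", tags=[f"plan_tier:{plan_tier}"])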

We're continuously fine-tuning our efforts to improve systems monitoring and support the daily sustainment activities of our different teams & stakeholders.

If you're interested to find out more about what we do, check us out at https://www.ascendaloyalty.com/