Operations in distributed apps, part 2: logging & tracing

In our previous post, we gave an overview of our Ops practice and the challenges of acquiring the information our teams need for day-to-day support, starting with metrics.

Metrics serve as a first check of health at scale. On their own, however, they're merely a symptomatic indicator: they tell teams that something is wrong without letting them dive into why.

Enter logging & tracing to help us suss out the reasons behind our issues.


(Image courtesy of https://peter.bourgon.org/)

Our challenge with logs

We have plenty of logs (we produce & destroy GBs of them every month) across our different systems.

Given our stakeholder commitments, it is crucial that our logs are:

  • Centralised
  • Searchable
  • Relatable


Because our systems are distributed, these logs must be collected centrally so that they are easily searchable and traceable, helping us make sense of what is happening within and between systems.

Without this, support becomes a very long game of catch-up across systems. From an individual dev's POV, there's the struggle of understanding what's going on in each system before we can even get to the problem.

General idea: gather information through centralised logging, then link it up as best we can so that when something goes wrong we can dig in and find the root cause: what went wrong, where it went wrong & what our customers are experiencing.
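As a rough sketch of what a centralised-logging-friendly log line could look like (the service name and field names here are illustrative, not our actual schema), each record can be emitted as one JSON line that a log shipper can forward to a central store:

```python
import json
import logging
import sys

# Hypothetical structured-log formatter: every record becomes one JSON line,
# which log shippers can collect and forward to a central, searchable store.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname,
            "service": "orders-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Carry along any extra tags attached to the record
        entry.update(getattr(record, "tags", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "service": "orders-api", "message": "order created", "request_id": "req-123"}
logger.info("order created", extra={"tags": {"request_id": "req-123"}})
```

Because every line is machine-parseable and carries the same fields, "searchable" and "relatable" fall out of the format rather than needing regexes over free text.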

Correlation is vital

After step 1, centralisation, comes correlation. Here's a simplified abstraction of logging correlation, where a request_id is used to connect the dots between each application.

This lets us inspect a user's request flow through the various systems by filtering out the noise from the millions of records we hold and nailing down only what's interesting.
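A minimal sketch of the idea (the service names and header name are illustrative): the edge service mints or reuses a request_id, forwards it to each downstream call, and every service includes it in its log lines, so a single filter on the central store reconstructs the whole flow:

```python
import json
import uuid

LOG_LINES = []  # stand-in for the centralised log store

def log(service: str, message: str, request_id: str) -> None:
    # Every line carries the request_id so it can be correlated later
    LOG_LINES.append(json.dumps(
        {"service": service, "message": message, "request_id": request_id}))

def call_downstream(service: str, request_id: str) -> None:
    # The id travels with the call; the downstream service logs it too
    log(service, "processing", request_id)

def handle_at_edge(headers: dict) -> str:
    # Reuse the caller's id if present, otherwise mint a fresh one
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())
    log("gateway", "request received", request_id)
    call_downstream("payments", request_id)
    return request_id

rid = handle_at_edge({})
# Filtering the central store by request_id yields just this request's trail
trail = [l for l in LOG_LINES if json.loads(l)["request_id"] == rid]
```

In practice the id would propagate via HTTP headers or message metadata between real services; the filter at the end is the "connect the dots" step.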

(Diagram: a request_id correlating log entries across applications)

With logging correlation, we can also identify which system an issue stems from. This plays an essential part in helping us (and even other departments in the organisation) understand cause and effect when an abnormal event appears.

Logging successful actions also helps us understand the user's journey through our systems, acting as a kind of audit trail to verify that certain requests succeeded.

As you can imagine, we can correlate traces beyond just request_id; the more contextual tags we include, the wider the variety of contextualised data views we can expose to our different teams.
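To sketch what that might look like (tag names like customer_id, provider and location are hypothetical), the same log stream can be sliced along any tag, so each team gets its own view of the same data:

```python
import json

# A tiny stand-in for the centralised store: JSON lines with contextual tags
logs = [
    json.dumps({"request_id": "req-1", "customer_id": "c-42",
                "provider": "stripe", "location": "SG", "msg": "payment ok"}),
    json.dumps({"request_id": "req-2", "customer_id": "c-42",
                "provider": "adyen", "location": "SG", "msg": "refund ok"}),
    json.dumps({"request_id": "req-3", "customer_id": "c-7",
                "provider": "stripe", "location": "MY", "msg": "payment ok"}),
]

def view(tag: str, value: str) -> list:
    # One customer's journey, one provider's traffic, one region's activity:
    # same data, different tag filter
    return [json.loads(l) for l in logs if json.loads(l).get(tag) == value]

customer_journey = view("customer_id", "c-42")  # what did this customer do?
provider_traffic = view("provider", "stripe")   # how is this provider behaving?
```

In a real system these filters would be queries against the log platform rather than list comprehensions, but the principle is the same: each tag added at write time becomes a free axis to pivot on at read time.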

A peek at how this could look

Across our apps, you can follow a particular customer's journey from start to finish, see which apps were involved in the requests, and even narrow down to the classes or APIs involved in our systems' actions.

(Screenshot: a customer's journey traced across our apps)

Tags are important additional information we include in our logs to aid with on-the-fly correlation in day-to-day Ops.

For example, our Customer Service team could click on the customer tag to view more information about the customer, such as the items they have purchased recently.

So if a customer calls in to report that an item they purchased is not reflected in their account, our Customer Service team can traverse from the customer's journey to our other data views, verify whether the purchase actually succeeded, and rectify the issue quickly.

This is how we make logging useful not just for us as developers to debug, but also for other stakeholders to rectify issues involving our customers.

Going back to the three pillars of observability mentioned previously, what we've just shown is an example of tracing: making sense of a large volume of logs.

High-Cardinality Events

Structured logging and tracing allow us to perform a wide range of actions: day-to-day customer servicing, ops, and developer investigations.

Customer tracing is one example of a high-cardinality event: using tags, we can dive into the logs for any given customer.

Other examples include provider tracing, location tracing and agent tracking, e.g. notifying you when you sign in to our apps from a new device or a new location.
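A toy sketch of the new-device/new-location check (field names are illustrative): compare the current sign-in's logged tags against the (device, location) pairs previously seen for that customer:

```python
# Illustrative fields only: decide whether a sign-in deserves a notification
# by checking whether this (device, location) pair has been logged before.
def should_notify(history: list, current: dict) -> bool:
    seen = {(entry["device"], entry["location"]) for entry in history}
    return (current["device"], current["location"]) not in seen

# Previously logged sign-ins for one customer
history = [{"device": "iPhone 13", "location": "Singapore"}]

should_notify(history, {"device": "Pixel 6", "location": "Singapore"})    # True: new device
should_notify(history, {"device": "iPhone 13", "location": "Singapore"})  # False: seen before
```

The point is that this check only works because device and location were captured as structured tags at sign-in time, not buried in free-text messages.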

(Chart: database size over time, a low-cardinality metric with only a few lines)

In metrics, we can't really have high-cardinality data. If we were to include user_id in our metrics and we have millions of users, the chart wouldn't show a few lines like the database size chart above, but millions of lines, which no one can make sense of.

But in logs, we should include such fields as much as possible, providing the extra information that lets us deep-dive into a particular customer.
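To make the contrast concrete (the numbers and field names are made up): each distinct label value in a metric becomes its own time series, and therefore its own chart line, whereas in logs the same field is just something to filter on:

```python
# One metric time series (one chart line) per distinct label value:
# with user_id as a metric label, a million users means a million lines.
user_ids = [f"user-{i}" for i in range(1_000_000)]
series_count = len(set(user_ids))  # 1,000,000 chart lines: unreadable

# In logs, user_id is just another field; high cardinality costs a filter,
# not a chart. Deep-diving into one customer is a simple query.
logs = [
    {"user_id": "user-42", "msg": "login ok"},
    {"user_id": "user-7", "msg": "login ok"},
    {"user_id": "user-42", "msg": "purchase ok"},
]
one_customer = [e for e in logs if e["user_id"] == "user-42"]
```

This is why the usual guidance is to keep metric labels low-cardinality (status, region, service) and push high-cardinality identifiers (user_id, request_id) into logs and traces instead.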

Final Notes

We're still improving the visibility and resolution offered across our systems, and our teams are already reaping the benefits:

  • Can we allow CS to respond in a timely manner?
  • Can we optimise escalations to Ops or Developers to deep-dive into the behaviours of our apps?

Tooling is cheap to instrument out of the box, but we should think about well-defined log formats shared across apps.

To make the data easy to explore, we should refine the agreed data points together with our teams and stakeholders.

Observability that is fast & accessible requires continuous investment by our different teams, for our different stakeholders.

This takes us beyond instinctual decision-making, giving us the metrics and data visibility to justify our decisions.

If you're interested to find out more about what we do, check us out at https://www.ascendaloyalty.com/