Risk management for live partner projects

Find out how our team managed risks in an existing live partner project using various tools and processes.

Building a product from scratch is hard work, but building on top of an existing live product is even harder. How do you deliver additional features without breaking existing functionality, enable them only on the planned launch date a few months into the future, and still make sure development doesn’t drag on until the week before launch?

Luckily for us in the Loyalty Data (LD) team, a few tools and processes make this much easier for engineers. With them in place, we have been able to build new features for one of our client banks without letting the changes sit in our Staging environment for months, and to enable them in Production only on the planned launch date, without fear of breaking any existing Production functionality.

This post shares what we learned from a project that introduced new processes into an existing workflow, and discusses a few strategies for risk management.

Feature Flags

In the LD team, we rely heavily on feature flags to toggle features in any environment. For example, we had to deliver a new feature that immediately credits all pending transactions when an account becomes ineligible for further earn because its assigned earn rule has been removed prematurely. We can’t release this feature in Production before it has been tested and signed off by the Quality Assurance (QA) team. On the other hand, we do not want the code changes for this feature to sit in an open Pull Request (PR) for months, as long-lived PRs quickly become unmanageable.

Repository with 700+ PRs spanning 32 pages

Instead, we use a feature flag to indicate whether the feature is allowed to run in a particular environment. The flag is disabled in Production and enabled in all other environments, so we can take our time testing the feature in Staging. And because the feature is disabled in Production, we can safely deploy the code changes without waiting for QA regression testing.
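
As a rough illustration of the idea, here is a minimal sketch of a per-environment feature flag in plain Ruby. The flag table, the `feature_enabled?` helper, and the `credit_pending_transactions` method are hypothetical names chosen for this example; they are not the LD team's actual implementation, which is not shown in this post.

```ruby
# Hypothetical flag table: one entry per feature, one value per environment.
FEATURE_FLAGS = {
  immediate_crediting: {
    "production"  => false, # stays off until the planned launch date
    "staging"     => true,
    "development" => true
  }
}.freeze

# Returns true only when the flag is explicitly enabled for the environment.
# Unknown flags and unknown environments default to disabled.
def feature_enabled?(flag, env)
  FEATURE_FLAGS.fetch(flag, {}).fetch(env, false)
end

# The guarded code path can be deployed everywhere, but it only runs
# where the flag is on.
def credit_pending_transactions(_account, env)
  return :skipped unless feature_enabled?(:immediate_crediting, env)

  # ... the actual crediting logic would go here ...
  :credited
end
```

Defaulting to disabled for anything the table does not list is a deliberate choice in this sketch: a typo in a flag or environment name then fails safe rather than switching a feature on by accident.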

Configurable Batch Processing

On top of feature flags, we have another mechanism that lets us configure which steps run in a batch job. For some background, the Loyalty Engine (LE) application operates mainly as a scheduled batch job: the job runs at a certain time of day, processes any incoming files from the client banks, and delivers handbacks in return. The batch job consists of multiple small, self-contained steps that each execute a part of the business logic.

Batch job template for ANZ Production

Going back to the example above, other than using a feature flag to disable immediate points crediting, we also make sure this step is not configured in the Production batch job. This acts as an additional layer of protection against us triggering this logic unintentionally before it is meant to be live.
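
To make the mechanism concrete, here is a minimal sketch of a batch job whose steps are chosen per environment. The step names, the `STEP_REGISTRY` and `JOB_TEMPLATES` tables, and `run_batch_job` are all assumptions made for illustration; the real template format appears only in the screenshot above.

```ruby
# Hypothetical registry of steps. Each step is a callable that receives a
# shared context hash; here each one just records that it ran.
STEP_REGISTRY = {
  import_files:         ->(ctx) { ctx[:ran] << :import_files },
  accelerate_crediting: ->(ctx) { ctx[:ran] << :accelerate_crediting },
  credit_points:        ->(ctx) { ctx[:ran] << :credit_points },
  deliver_handbacks:    ->(ctx) { ctx[:ran] << :deliver_handbacks }
}.freeze

# Hypothetical templates: the new step is simply absent from Production.
JOB_TEMPLATES = {
  "production" => %i[import_files credit_points deliver_handbacks],
  "staging"    => %i[import_files accelerate_crediting credit_points deliver_handbacks]
}.freeze

# Runs the configured steps in order and returns the names of the steps run.
def run_batch_job(env)
  ctx = { ran: [] }
  JOB_TEMPLATES.fetch(env).each { |name| STEP_REGISTRY.fetch(name).call(ctx) }
  ctx[:ran]
end
```

With this layout, a step that is not listed in an environment's template cannot run there, regardless of how its feature flag is set.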

Single Unit of Operation

As described in the previous section, we compose multiple small, self-contained steps into a batch job to complete the processing for each client bank. This lets us mix and match steps according to the needs of each client bank.

In the case of immediate points crediting, we already have steps in place for querying transactions that are due for points crediting and executing the crediting immediately. What we need to support immediate crediting due to an ineligible earn rule can be contained in a few additional steps:

Determine which accounts have an earn rule that is no longer applicable, and update each account’s earn rule flag to mark it as inapplicable.
Query points transactions belonging to accounts whose earn rule flag is inapplicable, and set the points crediting date of those transactions to the current date.

That’s all we need to implement the immediate crediting feature. We just plug these two additional steps into the batch job, right before the points crediting step, and everything works seamlessly.
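
The two additional steps, followed by the existing crediting step, can be sketched roughly as below. The `Account` and `Transaction` structs, their fields, and the three method names are hypothetical stand-ins for illustration only; the real steps operate on the LE application's own data model.

```ruby
require "date"

# Hypothetical, simplified records.
Account     = Struct.new(:id, :earn_rule_id, :earn_rule_applicable)
Transaction = Struct.new(:account_id, :points, :credit_on, :credited)

# New step 1: mark accounts whose earn rule is no longer active.
def flag_inapplicable_earn_rules(accounts, active_rule_ids)
  accounts.each do |account|
    account.earn_rule_applicable = false unless active_rule_ids.include?(account.earn_rule_id)
  end
end

# New step 2: move the crediting date of pending transactions on flagged
# accounts to today.
def bring_forward_crediting(transactions, accounts, today)
  flagged_ids = accounts.reject(&:earn_rule_applicable).map(&:id)
  transactions.each do |txn|
    next if txn.credited

    txn.credit_on = today if flagged_ids.include?(txn.account_id)
  end
end

# Existing step (simplified): credit everything due on or before today.
def credit_due_transactions(transactions, today)
  transactions.each do |txn|
    txn.credited = true if !txn.credited && txn.credit_on <= today
  end
end
```

Because the new steps only change dates and flags, the existing crediting step needs no modification in this sketch: it credits whatever is due, without knowing why a transaction became due.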

Batch job template including new steps

Integration Tests

If you have read this far, you might have a lingering question: how do we know that the deployed features actually work if they haven’t yet received the QA team’s blessing? The answer is automated integration tests.

In the LD team, we make a habit of writing integration tests whenever a substantial feature is developed. As we have a weekly release cycle, it is important that each release does not break any existing functionality. These automated integration tests give us peace of mind that the existing processing flow still works, so we only have to focus on manually testing the new features or fixes.

Using the same example again, to ensure that the immediate crediting logic works, we write integration tests that simulate the batch jobs that would normally run in Production. The assertions ensure that when an earn rule becomes inapplicable, we immediately credit all transactions for the account.
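
The shape of such a test can be sketched in plain Ruby with a scenario described in YAML. This is not the team's actual RSpec setup: the scenario keys, the `run_batch` stand-in for the batch job, and `scenario_passes?` are all assumptions made for this example.

```ruby
require "yaml"

# Hypothetical scenario: account 1 has lost its earn rule, account 2 has not.
# Both have a transaction that would normally be credited in 30 days.
SCENARIO = YAML.safe_load(<<~SCENARIO_YAML)
  accounts:
    - { id: 1, earn_rule_removed: true }
    - { id: 2, earn_rule_removed: false }
  transactions:
    - { id: 10, account_id: 1, days_until_credit: 30 }
    - { id: 11, account_id: 2, days_until_credit: 30 }
  expected_credited_ids: [10]
SCENARIO_YAML

# Simplified stand-in for the batch job: returns the IDs of the
# transactions that would be credited today.
def run_batch(scenario)
  flagged = scenario["accounts"].select { |a| a["earn_rule_removed"] }.map { |a| a["id"] }
  credited = scenario["transactions"].select do |txn|
    txn["days_until_credit"] <= 0 || flagged.include?(txn["account_id"])
  end
  credited.map { |txn| txn["id"] }
end

# The assertion: what the batch credited must match what the scenario expects.
def scenario_passes?(scenario)
  run_batch(scenario).sort == scenario["expected_credited_ids"].sort
end

puts scenario_passes?(SCENARIO) ? "PASS" : "FAIL"
```

Keeping the expected outcome in data rather than in code means a new scenario can be added without touching the runner.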

Automated test setup in RSpec

Automated test assertion in YAML

These integration tests also help us plan the QA roadmap better: our QA engineers do not need to split their time across multiple concurrent projects, and can focus on end-to-end testing for this particular client bank launch during the User Acceptance Testing (UAT) period determined by the bank.

Sneak peek: The LD team is embarking on a new project to let QA engineers write their test plans in a format similar to our RSpec tests, so that the test runner can execute their scenarios automatically, saving manual QA effort on each project. Stay tuned!

Conclusion

With these tools and processes in place, the project team has been able to work on a 4-month-long project that introduces many additional features to our existing LD suite, without the nightmare of PRs going stale because they can’t be deployed to Production. Project Managers are happy too, because they can see the “DONE” status on the JIRA (project tracking software) tickets.

More importantly, none of the code changes deployed to Production have broken any existing functionality so far, thanks to feature flags, configurable batch jobs built from self-contained steps, and automated integration tests.

Disclaimer: Views and opinions expressed in this post are the author’s own and do not necessarily represent the company’s.