Most applications today involve some form of background processing (such as sending a notification to a customer) in the form of jobs. Sidekiq is one of the most popular background job processors in the community, and we use it extensively at Ascenda.
Jobs vary by many factors: a job could be compute intensive, or it could be very high-throughput. One of the common issues applications face early in their business' growth is the ability to scale job processing in tandem with customer utilisation rates. We faced similar challenges recently while preparing our application to meet big spikes in customer activity for some of our largest customers.
In this post, we share some of our learnings in optimising our applications' ability to process jobs and to scale in line with the application containers we deploy.
Default Sidekiq behavior
We begin with the most basic queuing setup: Sidekiq uses a single queue named "default" in Redis and processes jobs in First In, First Out (FIFO) order.
What's the downside of default behavior?
For example, suppose "Job A" processes payments and "Job B" sends customer marketing emails. "Job A" is more important than "Job B" because it is a time-sensitive action.
Say that during a marketing campaign there are plenty of "Job B"s (marketing emails) in the queue ahead of "Job A" (a payment). "Job A", which is more critical, has to wait in the queue, causing delays.
From a queuing standpoint, the ability to process "Job A" quickly and in a timely manner gets even worse if "Job B" takes a long time to service.
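To make the effect concrete, here's a minimal pure-Ruby sketch (a simulation, not Sidekiq itself; the job counts and durations are made-up numbers) of a single FIFO queue where many "Job B"s enqueued first delay "Job A":

```ruby
# Simulate a single FIFO queue: jobs are processed strictly in arrival order.
queue = []

# A marketing campaign enqueues 1,000 "Job B"s (emails), each taking ~50ms...
1_000.times { |i| queue << { name: "Job B ##{i}", duration_ms: 50 } }

# ...then a payment "Job A" arrives at the back of the queue.
queue << { name: "Job A", duration_ms: 10 }

# With FIFO processing, "Job A" waits for every "Job B" ahead of it.
wait_ms = 0
wait_ms += queue.shift[:duration_ms] until queue.first[:name] == "Job A"

puts "Job A waited #{wait_ms / 1000.0}s behind 1000 marketing emails"
# => Job A waited 50.0s behind 1000 marketing emails
```

A 10ms payment job ends up waiting 50 seconds purely because of what happened to be enqueued before it.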
How can we solve this?
Scale the infrastructure
The initial (naive, or expensive) option is to scale vertically by adding CPU and concurrency, or horizontally by adding more containers to process more jobs.
Unfortunately, this only kicks the problem down the road by increasing our throughput at any moment in time. We aren't solving the problem of prioritising jobs:
once another, bigger spike of "Job B"s hits the system, we end up facing the same problem of "Job A" being blocked by "Job B". Do we scale the infrastructure again then?
Since we know "Job A" is more important, we can do better by looking at how to process critical jobs first.
Setting priorities allows our applications to direct resources to the jobs of highest importance and pick up lower-priority tasks later.
We can prioritise jobs by defining queues and assigning weights to them. We'll shortly look at the different facets we consider when prioritising said jobs.
A sample config for multiple queues
See https://github.com/mperham/sidekiq/wiki/Advanced-Options#queues for detailed options for configuring multiple queues. In the Sidekiq configuration file:

```yaml
:queues:
  - critical
  - default
```
The configuration above uses strict ordering, which means the critical queue is processed first, and the default queue is processed only when the critical queue is empty.
If we put "Job A" (payments) in the critical queue and "Job B" (marketing emails) in the default queue, "Job A" will always take precedence over "Job B".
In other words, "Job B" will be ignored until the critical queue is empty (i.e. all "Job A"s are completed).
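For completeness, assigning a job to a queue is done on the worker class itself via `sidekiq_options`. A sketch, assuming the Sidekiq gem is loaded (the class names here are illustrative):

```ruby
class ProcessPaymentJob
  include Sidekiq::Worker
  sidekiq_options queue: "critical"

  def perform(payment_id)
    # time-sensitive payment processing ("Job A")
  end
end

class MarketingEmailJob
  include Sidekiq::Worker
  sidekiq_options queue: "default"

  def perform(customer_id)
    # marketing email delivery ("Job B")
  end
end
```

Jobs are then enqueued as usual with `ProcessPaymentJob.perform_async(payment_id)`, and each lands in its configured queue.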
This configuration is good as long as there are no default-queue jobs with very long service times, because a worker busy on a long-running default job cannot pick up critical jobs in the meantime.
Whilst strict ordering ensures that critical jobs ("Job A") are handled first, default jobs ("Job B") could be unreasonably delayed if there are a lot of jobs in the critical queue.
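A strict-ordering fetch can be sketched in a few lines of plain Ruby (a simplification for illustration, not Sidekiq's actual fetch code): queues are checked in their listed order, and a job is taken from the first non-empty one.

```ruby
# Strictly ordered fetch: always drain "critical" before touching "default".
queues = {
  "critical" => ["Job A1", "Job A2"],
  "default"  => ["Job B1", "Job B2"]
}

def next_job(queues)
  # Check queues in priority order; take from the first non-empty one.
  queues.each_value do |jobs|
    return jobs.shift unless jobs.empty?
  end
  nil
end

order = []
while (job = next_job(queues))
  order << job
end

puts order.inspect
# => ["Job A1", "Job A2", "Job B1", "Job B2"]
```

Every "Job A" is serviced before any "Job B" is even considered.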
Could we have a better solution where both types of jobs can be handled in a more balanced manner? Enter queue weights.
Sidekiq allows us to define a priority (an optional weight) for each queue. In the configuration file:

```yaml
:queues:
  - [critical, 2]
  - [default, 1]
```
To understand the weighting configuration above: add up all the weights (2 + 1 = 3), then divide an individual queue's weight by the total to get a percentage.
That's the percentage chance that the next job pulled by a process will come from that queue.
So there's a 66.7% chance the next job comes from the critical queue and a 33.3% chance it comes from the default queue.
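The weighted pick can be illustrated with a small simulation (plain Ruby, not Sidekiq's internals): draw a random number under the total weight and walk the queues until it falls within one. Over many draws, each queue's share of picks converges on its weight ratio.

```ruby
# Pick a queue at random, proportionally to its weight (critical:default = 2:1).
WEIGHTS = { "critical" => 2, "default" => 1 }.freeze

def pick_queue(rng)
  total = WEIGHTS.values.sum          # 3
  roll  = rng.rand(total)             # 0, 1 or 2
  WEIGHTS.each do |queue, weight|
    return queue if roll < weight     # rolls 0..1 -> critical, roll 2 -> default
    roll -= weight
  end
end

rng    = Random.new(42)               # seeded for reproducibility
draws  = 30_000
counts = Hash.new(0)
draws.times { counts[pick_queue(rng)] += 1 }

share = counts["critical"].fdiv(draws)
puts format("critical picked %.1f%% of the time", share * 100)
# ~66.7% of draws come from the critical queue
```

The default queue still gets roughly a third of the picks, so it keeps draining even while critical work dominates.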
Weighting is primarily useful for preventing one job type from “blocking out” another.
By applying weights, we balance the types of jobs being run and focus our existing container resources on critical jobs without significantly compromising our system's latency when processing day-to-day jobs.
Essentially, more heavily weighted queues are proportionally more likely to be processed, while the rest are not ignored entirely either.
Evaluating what to prioritise
We have 3 major types of jobs:
- Long running & relatively compute intensive operations
- High-volume but generally quick to process calls
- Auxiliary jobs
Long running & compute intensive operations
These are definitely the kind of processes that are time sensitive. At Ascenda, this typically includes reporting-based processes with our partners, which involve files or bulk data processing.
These are also relatively more compute intensive individually compared to other jobs. In combination with the need to process them in a timely manner, we give these kinds of processes the highest priority.
Another example is state changes to our site settings, which are time-restricted and also given this priority.
High-volume but quick to process calls
Most of our business-centric operations fall into this group. Our 3rd party API actions tend to have very high throughput in line with customer traffic (such as processing redemption transactions).
Given that such actions are also customer facing, it is easy to classify them as high-priority. The catch is that we would then end up treating most actions as high-priority, which defeats the goal of prioritisation.
Essentially, this reduces the collective processing of jobs back to the same queue "clogging" mentioned at the start of this post.
On a related note, we also find that performance issues at high concurrency are usually not a result of Sidekiq overhead; they are almost always due to intensive database querying/writes that fail under heavy load. In our experience, the optimisation we'd strive for is to analyse whether there's a good reason for having a high throughput of data-intensive jobs (or whether it might be worth batching those into fewer jobs instead).
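As an illustration of that batching idea: instead of enqueuing one job per record, group record IDs so each job handles a slice. A sketch in plain Ruby (the batch size of 100 and the record counts are arbitrary assumptions):

```ruby
# Instead of 1,000 single-record jobs, enqueue 10 jobs of 100 records each.
# Each batched job can then load and write its records in one database round trip.
record_ids = (1..1_000).to_a
BATCH_SIZE = 100

batches = record_ids.each_slice(BATCH_SIZE).to_a
jobs_enqueued = batches.size

puts "#{jobs_enqueued} jobs instead of #{record_ids.size}"
# => 10 jobs instead of 1000
```

Fewer, chunkier jobs mean fewer concurrent database connections contending for the same tables, at the cost of slightly coarser retry granularity.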
Auxiliary jobs
As described, this is everything else that can wait.
We started by introducing prioritisation in Sidekiq. You can quickly tell that in practice, CPU and memory utilisation plus the IO overhead from customer traffic introduce complexity into deciding what to prioritise.
Another area we've spent effort on (to be covered in another post, lest this one grows too long) is optimising for higher concurrency. A big part of the work here involves measuring and improving database utilisation in the jobs we run, such as batching high-frequency job actions or aligning our database concurrency with our jobs' concurrency.
If you're interested in finding out more about what we do, check out our other blog posts or visit our company site at https://www.ascendaloyalty.com/