Google Cloud Workflows polling: Backoff! Don't sleep 😁
How to use exponential backoff when retrying / polling a long-running operation
In this post I assume that you have already started playing with Google Cloud Workflows, and that you liked it so much that its reference documentation has no more secrets for you.
Note that every sentence quoted below is copied verbatim from that documentation.
A typical example of a long running operation
One of Google Cloud Workflows' useful architecture patterns is handling long-running jobs and polling for their status. It is well explained, along with two other patterns, by the Workflows Product Manager on the Google Cloud Blog, here.
A typical use case for this pattern is polling a BigQuery job status, where we:
- Submit a BigQuery job (jobs.insert) and get its unique jobId
- Use the jobId to poll the job status through either:
  - jobs.get, checking status.state == "DONE", for other job types (LOAD, EXTRACT, …)
  - jobs.getQueryResults, checking jobComplete == true, for QUERY jobs
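Concretely, a single polling iteration can be sketched as a Workflows step calling the BigQuery REST API directly. This is a minimal, hypothetical sketch: the step names, and the project and job_id variables (assumed to hold values from the jobs.insert response), are illustrative placeholders, not the exact snippet from this post.

```yaml
# Hypothetical sketch: one polling iteration against the BigQuery REST API.
# `project` and `job_id` are assumed to hold values from the jobs.insert response.
- get_job:
    call: http.get
    args:
      url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + project + "/jobs/" + job_id}
      auth:
        type: OAuth2
    result: job
- check_state:
    switch:
      - condition: ${job.body.status.state == "DONE"}
        next: job_finished
```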
I have seen many examples of handling this pattern with Google Cloud Workflows. Every one of them used a simple sys.sleep between retries, keeping a constant wait interval before rechecking the transient status of the job. The bq_polling_through_sleep.yaml snippet below shows a complete working example polling a BigQuery job status. The check_job_stats step (L49) is responsible for the active waiting, using a constant wait interval of 10 seconds while status.state != "DONE" (L49-L55).
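The sleep-based loop described above boils down to something like this. Again a condensed, hypothetical sketch rather than the exact snippet; step names are illustrative:

```yaml
# Sketch of constant-interval polling: wait 10 seconds, then re-check.
- check_job_stats:
    switch:
      - condition: ${job.body.status.state != "DONE"}
        next: wait_10_seconds
    next: job_done
- wait_10_seconds:
    call: sys.sleep
    args:
      seconds: 10
    next: get_job
```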
But how does Google Cloud Workflows handle retries for long-running operations?
They provide a “built-in behaviour for handling retries and long-running operations”. The retry policy uses “an exponential backoff of 1.25 when polling, starting with 1 second and slowing to 60 seconds between polls.” Unfortunately, at the time of writing, there is no BigQuery connector providing such a retry policy while polling for a BigQuery job status.
In concrete terms, to retry one or many steps, we need to enclose them in a try block with its retry block, setting a retry policy that defines:
- “The maximum number of retry attempts”
- “A backoff model to increase the likelihood of success”
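As a sketch, such a try/retry enclosure looks like this, here using the built-in http.default_retry_predicate and illustrative backoff values (job_url and the step names are placeholders I introduce for illustration):

```yaml
# Sketch of a try block with an attached retry policy.
- polling:
    try:
      steps:
        - get_job:
            call: http.get
            args:
              url: ${job_url}   # illustrative variable
              auth:
                type: OAuth2
            result: job
    retry:
      predicate: ${http.default_retry_predicate}  # retries on HTTP errors only
      max_retries: 5
      backoff:
        initial_delay: 1
        max_delay: 60
        multiplier: 1.25
```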
Contrary to what you might think, retries are not only used to “retry steps that return a specific error code, for example, particular HTTP status codes.”
To prove it, I rewrote the previous example, this time using exponential backoff while polling for the BigQuery job status, as you can see in the bq_polling_through_exponential_backoff.yaml snippet below.
Remember, at any step we can raise our own exception, which “can be either a string or a dictionary.” So if we consider a transient state (a BigQuery job status.state != "DONE" in our case) as an error, we can raise the status.state value as an error inside the try block (L58). This will cause the steps inside this try block (L33-L58) to be retried if a proper retry policy is defined.
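Raising the transient state as an error can be sketched like this (a hypothetical sketch: step names are illustrative, and I assume here that the state is raised wrapped in a dictionary with a message key):

```yaml
# Inside the try block: if the job is not DONE yet, raise its state as an error.
- check_state:
    switch:
      - condition: ${job.body.status.state != "DONE"}
        next: raise_not_done
    next: job_done
- raise_not_done:
    raise:
      message: ${job.body.status.state}  # e.g. "PENDING" or "RUNNING"
```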
Fortunately, Cloud Workflows provides the ability to define a custom retry policy to handle our transient state / error.
To do so:
- Use a subworkflow to define a predicate:
job_state_predicate (L79-L87), This predicate checks the transient error — the
status.statevalue in our case — . If
status.state != “DONE", return
truemeaning the retry policy will be called. Else, return
falsethus stopping the retries.
- Reference the predicate subworkflow in the retry policy's predicate field
- Set the retry configuration (L62-L66) with an initial delay of 1 second, the delay doubling on each attempt, and a maximum delay of 45 seconds. In this case, the delays between subsequent attempts are: 1, 2, 4, 8, 16, 32, 45, 45, 45, 45 seconds.
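Putting the last two points together, the custom retry policy and its predicate subworkflow can be sketched like this. This is a hypothetical sketch, assuming the raised error reaches the predicate as a map with a message key; the subworkflow name matches the post, but max_retries and the step names are illustrative:

```yaml
# Fragment 1: the retry policy attached to the polling try block.
retry:
  predicate: ${job_state_predicate}
  max_retries: 10
  backoff:
    initial_delay: 1    # first delay: 1 second
    max_delay: 45       # delays are capped at 45 seconds
    multiplier: 2       # each delay doubles: 1, 2, 4, 8, 16, 32, 45, 45, ...

# Fragment 2: the predicate subworkflow deciding whether to retry.
job_state_predicate:
  params: [e]
  steps:
    - check_transient:
        switch:
          - condition: ${e.message != "DONE"}
            return: true    # still a transient state: retry
    - stop_retries:
        return: false       # stop retrying
```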
In this post I focused on showing how, with Google Cloud Workflows, we can use the retry mechanism to implement exponential backoff while polling long-running operations that do not return errors. Real errors can still happen, so do not forget to handle them too 😉
Thanks for reading!