Google Cloud Workflows: Using exponential backoff for long-running operations
How to use exponential backoff when retrying / polling a long-running operation
In this post I will assume that you have already started playing with Google Cloud Workflows, and that you liked it so much that its reference documentation has no more secrets for you.
Please note that every sentence quoted below is a copy-paste from that documentation.
A typical example of a long-running operation
One of Google Cloud Workflows' useful architecture patterns is handling long-running jobs and polling for their status. It is well explained, along with two other patterns, on the Google Cloud Blog by the Workflows Product Manager, here.
A typical use case for this pattern is polling a BigQuery job status, where we:
- Submit a BigQuery job (jobs.insert) and get the unique jobId
- Use the jobId to poll the job status, either through:
  - jobs.getQueryResults and checking jobComplete == true for QUERY jobs
  - jobs.get and checking status.state == "DONE" for other job types (LOAD, EXTRACT, …)
I saw many examples of handling this pattern with Google Cloud Workflows. Every example used a simple sys.sleep between retries, keeping a constant wait interval before rechecking the transient job status. The bq_polling_through_sleep.yaml snippet below shows a complete working example polling a BigQuery job status. The step check_job_stats (L49) is responsible for the active waiting, calling sys.sleep with a constant wait interval of 10 seconds as long as status.state != "DONE" (L49-L55).
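If the embedded snippet does not render for you, the following minimal, hand-written sketch shows the same sleep-based idea. It is not the original gist, so the line numbers quoted above do not apply to it: the step names, the args input and the query payload are illustrative, only the constant 10-second sys.sleep follows the description.

```yaml
main:
  params: [args]
  steps:
    - submit_bq_job:
        call: http.post
        args:
          url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + args.project + "/jobs"}
          auth:
            type: OAuth2
          body:
            configuration:
              query:
                query: ${args.query}
                useLegacySql: false
        result: insert_response
    - get_job_status:
        call: http.get
        args:
          url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + args.project + "/jobs/" + insert_response.body.jobReference.jobId}
          auth:
            type: OAuth2
        result: job_status
    - check_job_status:
        switch:
          # transient state: wait a constant 10 seconds, then poll again
          - condition: ${job_status.body.status.state != "DONE"}
            next: wait_before_polling_again
        next: job_done
    - wait_before_polling_again:
        call: sys.sleep
        args:
          seconds: 10
        next: get_job_status
    - job_done:
        return: ${job_status.body.status}
```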
But how does Google Cloud Workflows handle retries for long-running operations?
Using Connectors
Connectors provide “built-in behaviour for handling retries and long-running operations”. Their retry policy uses “an exponential backoff of 1.25 when polling, starting with 1 second and slowing to 60 seconds between polls.” Unfortunately, at the time of writing, there is no BigQuery connector to provide such a retry policy while polling for a BigQuery job status.
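To get a feeling of what that built-in behaviour looks like, here is a minimal sketch using one of the connectors that does exist, the Compute Engine connector; the project, zone and instance values are placeholders. A single connector call submits the request and then waits for the underlying long-running operation, polling it with the backoff described above:

```yaml
main:
  steps:
    # One connector call: it submits the stop request and then polls the
    # resulting zonal operation until completion, using the built-in backoff.
    - stop_instance:
        call: googleapis.compute.v1.instances.stop
        args:
          project: my-project     # placeholder
          zone: europe-west1-b    # placeholder
          instance: my-instance   # placeholder
        result: stop_result
    - done:
        return: ${stop_result}
```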
Using retry
In concrete terms, to retry one or many steps, we need to enclose them in a try block with its retry block, setting a retry policy (sketched in the example below) that defines:
- “The maximum number of retry attempts”
- “A backoff model to increase the likelihood of success”
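As a minimal sketch of that syntax (the URL is a placeholder and the numbers are arbitrary), a classic error-based retry with the built-in default predicate looks like this:

```yaml
main:
  steps:
    - read_item:
        try:
          call: http.get
          args:
            url: https://example.com/some-flaky-api   # placeholder endpoint
          result: api_response
        retry:
          predicate: ${http.default_retry_predicate}  # retries transient HTTP errors (429, 502, 503, ...)
          max_retries: 5
          backoff:
            initial_delay: 1   # seconds
            max_delay: 60      # seconds
            multiplier: 2      # exponential backoff
    - done:
        return: ${api_response.body}
```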
Contrary to what you might think, retries are not used only to “retry steps that return a specific error code, for example, particular HTTP status codes.”
To prove it, I rewrote the previous example, this time using exponential backoff while retrying the polling of the BigQuery job status, as you can see in the bq_polling_through_exponential_backoff.yaml snippet below.
Remember, at any step we can raise our own exception, which “can be either a string or a dictionary.”
So if we consider a transient state (a BigQuery job's status.state != "DONE" in our case) as an error, we can raise the status.state value as an error inside the try/retry block (L58). This will cause the steps inside this try block (L33-L58) to be retried if a proper retry policy is defined.
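Inside the try block, such a raise could look like the following sketch, where job_status is an illustrative variable assumed to hold the jobs.get response:

```yaml
- check_job_status:
    switch:
      - condition: ${job_status.body.status.state != "DONE"}
        steps:
          # treat the transient state as an error so that the retry policy kicks in
          - raise_transient_state:
              raise: ${job_status.body.status.state}
```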
Fortunately, Cloud Workflows provides the ability to define a custom retry policy to handle our transient state / error.
To do so:
- Use a subworkflow to define a predicate: job_state_predicate (L79-L87). This predicate checks the transient error, the status.state value in our case. If status.state != "DONE", return true, meaning the retry policy will be called. Else, return false, thus stopping the retries.
- Call the predicate subworkflow (L60-L61).
- Set the retry configuration (L62-L66), as sketched below, with an initial delay of 1 second; the delay is then doubled on each attempt, with a maximum delay of 45 seconds. In this case, the delays between subsequent attempts are: 1, 2, 4, 8, 16, 32, 45, 45, 45, 45 (time given in seconds).
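If the embedded snippet does not render for you either, here is a stripped-down sketch of the whole approach. Again, it is not the original gist and the quoted line numbers do not apply to it; step and variable names are mine, and I assume the raised state string is surfaced to the predicate through the error's message field:

```yaml
main:
  params: [args]
  steps:
    - submit_bq_job:
        call: http.post
        args:
          url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + args.project + "/jobs"}
          auth:
            type: OAuth2
          body:
            configuration:
              query:
                query: ${args.query}
                useLegacySql: false
        result: insert_response
    - poll_job_with_backoff:
        try:
          steps:
            - get_job_status:
                call: http.get
                args:
                  url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + args.project + "/jobs/" + insert_response.body.jobReference.jobId}
                  auth:
                    type: OAuth2
                result: job_status
            - check_job_status:
                switch:
                  - condition: ${job_status.body.status.state != "DONE"}
                    steps:
                      # raise the transient state so that the retry policy takes over
                      - raise_transient_state:
                          raise: ${job_status.body.status.state}
        retry:
          predicate: ${job_state_predicate}
          max_retries: 10
          backoff:
            initial_delay: 1   # delays: 1, 2, 4, 8, 16, 32, 45, 45, 45, 45 seconds
            max_delay: 45
            multiplier: 2
    - job_done:
        return: ${job_status.body.status}

# Predicate: retry while the raised error is a transient (non-DONE) job state.
job_state_predicate:
  params: [e]
  steps:
    - check_transient_state:
        switch:
          - condition: ${e.message != "DONE"}
            return: true
    - stop_retrying:
        return: false
```

The max_retries value of 10 matches the ten delays listed above; once they are exhausted, the raised state propagates as a real error and the workflow fails, which is probably what you want for a job that is stuck.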
Final thoughts
In this post I focused on showing how, with Google Cloud Workflows, we can use the retry mechanism to implement exponential backoff polling for long-running operations that do not return errors. So do not forget to also handle the actual errors 😉
Thanks for reading!