Google Cloud Workflows: Using exponential backoff for long-running operations

How to use exponential backoff when retrying / polling a long-running operation

Mehdi BHA
3 min read · Mar 18, 2021

In this post I will assume that you have already started playing with Google Cloud Workflows, and that you liked it so much that its reference documentation no longer holds any secrets for you.

Please note that every sentence quoted below is a copy-paste from that documentation.

A typical example of a long-running operation

One of Google Cloud Workflows' useful architecture patterns is handling long-running jobs and polling for their status. It is well explained, along with two other patterns, by the Workflows Product Manager on the Google Cloud Blog, here.

A typical use case for this pattern is polling a BigQuery job's status, where we:

  1. Submit a BigQuery job (jobs.insert) and get its unique jobId (a minimal sketch of this step follows the list)
  2. Use the jobId to poll the job status through either:
  • jobs.get, checking status.state == "DONE", for job types other than QUERY (LOAD, EXTRACT, …)
  • jobs.getQueryResults, checking jobComplete == true, for QUERY jobs
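
Before looking at the polling itself, here is a minimal sketch of step 1 as a workflow step: submitting a query job through the jobs.insert REST endpoint and capturing the returned jobId. The project ID and the query are placeholders, and error handling is omitted.

```yaml
# Sketch: submit a BigQuery query job (jobs.insert) and capture its jobId.
main:
  steps:
    - init:
        assign:
          - project: "my-project"    # placeholder project ID
    - submit_bq_job:
        call: http.post
        args:
          url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + project + "/jobs"}
          auth:
            type: OAuth2
          body:
            configuration:
              query:
                query: "SELECT 1"    # placeholder query
                useLegacySql: false
        result: insert_response
    - keep_job_id:
        assign:
          # The polling steps shown later continue from here.
          - job_id: ${insert_response.body.jobReference.jobId}
```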

I have seen many examples of handling this pattern with Google Cloud Workflows. Every example used a simple sys.sleep between retries, keeping a constant wait interval before re-checking the transient job status. The bq_polling_through_sleep.yaml snippet below shows a complete working example of polling a BigQuery job's status. The step check_job_stats (L49) is responsible for the active waiting, using sys.sleep with a constant interval of 10 seconds as long as status.state != "DONE" (L49-L55).

Polling with a constant wait interval of 10 seconds
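
In case the embedded snippet does not render for you, the loop boils down to something like the following sketch, continuing the steps of main above (the step names are mine, not necessarily those of the gist):

```yaml
    # Sketch: poll jobs.get every 10 seconds until status.state == "DONE".
    - get_job_status:
        call: http.get
        args:
          url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + project + "/jobs/" + job_id}
          auth:
            type: OAuth2
        result: job
    - check_job_status:
        switch:
          - condition: ${job.body.status.state == "DONE"}
            next: job_done
    - wait_10_seconds:
        call: sys.sleep
        args:
          seconds: 10
        next: get_job_status
    - job_done:
        return: ${job.body.status}
```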

But how does Google Cloud Workflows handle retries for long-running operations?

Using Connectors

Connectors provide "built-in behaviour for handling retries and long-running operations". Their retry policy uses "an exponential backoff of 1.25 when polling, starting with 1 second and slowing to 60 seconds between polls." Unfortunately, at the time of writing, there is no BigQuery connector to provide such a retry policy while polling for a BigQuery job's status.
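
To see what that built-in behaviour buys you, here is what a connector call looks like for an API that does have one. The Compute Engine connector below is only an illustration (unrelated to BigQuery, and the project, zone and instance names are placeholders): the call blocks until the underlying long-running operation completes, with polling, retries and backoff handled by the connector itself.

```yaml
    # Illustration: a connector call that waits for the underlying
    # long-running operation, with retries/backoff handled by the connector.
    - stop_vm:
        call: googleapis.compute.v1.instances.stop
        args:
          project: "my-project"      # placeholder
          zone: "europe-west1-b"     # placeholder
          instance: "my-instance"    # placeholder
        result: stop_operation
```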

Using retry

In concrete terms, to retry one or more steps, we need to enclose them in a try block together with a retry block, setting a retry policy that defines (a minimal sketch follows the list):

  • “The maximum number of retry attempts”
  • “A backoff model to increase the likelihood of success”
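
Here is a minimal sketch of that structure, retrying a plain HTTP call with the built-in http.default_retry_predicate and an explicit exponential backoff (the URL is a placeholder):

```yaml
    # Sketch: retry an HTTP call on transient HTTP errors with exponential backoff.
    - get_with_retry:
        try:
          call: http.get
          args:
            url: "https://example.com/api/resource"    # placeholder URL
          result: api_response
        retry:
          predicate: ${http.default_retry_predicate}   # true for typical transient errors (429, 502, 503, …)
          max_retries: 5
          backoff:
            initial_delay: 1    # seconds
            max_delay: 60       # seconds
            multiplier: 2
```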

Contrary to what you might think, retries are not used only to “retry steps that return a specific error code, for example, particular HTTP status codes.”

To prove it, I rewrote the previous example, this time using exponential backoff while retrying the polling of the BigQuery job status, as you can see in the bq_polling_through_exponential_backoff.yaml snippet below.

Remember, at any step we can raise our own exception that “can be either a string or a dictionary.”
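
For instance, a step can raise a dictionary carrying whatever fields will be useful later; the field names below are arbitrary, and the job variable is assumed to hold the jobs.get response:

```yaml
    # Sketch: raise a custom error as a dictionary (field names are arbitrary).
    - raise_still_running:
        raise:
          code: 55
          message: "BigQuery job is not DONE yet"
          state: ${job.body.status.state}    # e.g. PENDING or RUNNING
```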

So if we consider a transient state (a BigQuery job with status.state != "DONE" in our case) as an error, we can raise the status.state value as an error inside the try/retry block (L58). This will cause the steps inside this try block (L33-L58) to be retried if a proper retry policy is defined.

Fortunately, Cloud Workflows provides the ability to define a custom retry policy to handle our transient state / error.

To do so:

  1. Use a subworkflow to define a predicate: job_state_predicate (L79-L87). This predicate checks the transient error (the status.state value in our case): if status.state != "DONE", it returns true, meaning the retry policy is applied; otherwise it returns false, which stops the retries.
  2. Call the predicate subworkflow from the retry block (L60-L61)
  3. Set the retry configuration (L62-L66): an initial delay of 1 second, doubled on each attempt, up to a maximum delay of 45 seconds. In this case, the delays between subsequent attempts are 1, 2, 4, 8, 16, 32, 45, 45, 45, 45 seconds (all three pieces are sketched together below)
Polling with exponential backoff, with delays of 1, 2, 4, 8, 16, 32 and 45 seconds
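
Putting the three pieces together, here is a minimal sketch of the retry-based polling. The subworkflow name job_state_predicate matches the snippet, but the step names, the HTTP details and the shape of the raised error (a dictionary carrying the state, instead of the raw state string) are my own simplifications:

```yaml
# Sketch: poll a BigQuery job with a custom retry predicate and exponential backoff.
main:
  steps:
    - init:
        assign:
          - project: "my-project"    # placeholder
          - job_id: "my-job-id"      # placeholder: normally returned by jobs.insert
    - poll_job:
        try:
          steps:
            - get_job:
                call: http.get
                args:
                  url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/" + project + "/jobs/" + job_id}
                  auth:
                    type: OAuth2
                result: job
            - check_state:
                switch:
                  - condition: ${job.body.status.state == "DONE"}
                    next: state_is_done
            - state_not_done:
                # Treat the transient state as an error so the retry policy kicks in.
                raise:
                  message: "BigQuery job has not finished yet"
                  state: ${job.body.status.state}
            - state_is_done:
                assign:
                  - final_state: ${job.body.status.state}
        retry:
          predicate: ${job_state_predicate}
          max_retries: 10
          backoff:
            initial_delay: 1    # delays of 1, 2, 4, 8, 16, 32, 45, 45, … seconds
            max_delay: 45
            multiplier: 2
    - job_is_done:
        return: ${job.body.status}

# Predicate: return true to keep retrying, false to stop and propagate the error.
job_state_predicate:
  params: [e]
  steps:
    - check_error:
        switch:
          # Our custom error carries the job state; real errors (no "state" field) stop the retries.
          - condition: ${map.get(e, "state") != null and map.get(e, "state") != "DONE"}
            next: keep_retrying
    - stop_retrying:
        return: false
    - keep_retrying:
        return: true
```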

Final thoughts

In this post I focused on showing how, with Google Cloud Workflows, we can use the retry mechanism to implement exponential backoff polling for long-running operations that do not return errors. Real errors can still happen, though, so do not forget to handle them 😉

Thanks for reading!
