A workflow can fail for many reasons. Some of them are:
- Dependent resources (tables, for example) are not present, expired during the run, or had not yet been created at the time of the run.
- On rare occasions, resources are exceeded (for example, by badly written legacy queries).
- A random Google internal error occurs (for example, message: "Error encountered during execution. Retrying may solve the problem.", reason: backendError).
- Concurrent limits are exceeded.
- Queries time out (for example, because not enough slots are available after the recent introduction of reserved slots).
The resume workflow feature lets users easily resume a workflow execution after the workflow itself, or its environment or conditions, have been fixed. This is helpful for workflows with more than a handful of tasks, and especially for chained workflows that make cascading calls to other workflows.
Overview
After a workflow fails, the workflow owner, delegates, or admins have up to 48 hours from the execution start time to resume it.
During a resume run:
- Execution starts from the failed task.
- Execution is performed under the original run-user, regardless of who initiates the resume.
- All built-in, custom, and magic parameters retain their values from the time of failure.
- The most current version of the workflow definition is used.
So, if you have modified the workflow after the failure and right before the resume run, the modified version of the workflow is used in the resume run.
You can see the status of the resume run in History Browser.
What happens in a resume run?
Magnus loads the most current version of the workflow definition
For example, if your workflow failed at a BigQuery Task because of a misspelling in the query, you can correct the query mistake, save, and then resume the workflow. Since the most current version of the workflow is used for the resume run, the corrected query will take effect.
Magnus loads the context of the workflow at the time of failure
The context includes
Run-User
This means that the resume run will be executed under the original run-user from the time of failure. For example, you and your colleague collaborated on a workflow. Your colleague is the owner of the workflow, and you are the delegate. During a scheduled run, which is always executed under the owner, the workflow has failed. Your colleague is out of the office, so as a delegate, you resume the workflow. Since this is a resume run, it will be executed under the original run-user, which is your colleague. It will not be executed as you. You simply initiate the resume.
If the workflow contains any tasks (such as a BigQuery Task or GS-Export Task) that require a Google OAuth credential, and the original run-user has not yet granted offline access, the resume run will fail at the first task that requires the credential.
All built-in parameters
During a resume run, all built-in parameters retain their values from the time of failure. For example, suppose you have a workflow that uses the built-in parameter <var_daydate> in a query:
SELECT partnerId, customerId FROM orders WHERE orderDate='<var_daydate>'
The workflow failed during a scheduled run on 2019-12-13, but you did not resume it until the next day, 2019-12-14. In the resume run, the built-in parameter <var_daydate> will still have the value 2019-12-13, not 2019-12-14. So, during the resume run, the query will be evaluated to:
SELECT partnerId, customerId FROM orders WHERE orderDate='2019-12-13'
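The substitution described above can be sketched in Python. This is only an illustration, not how Magnus is implemented: the `render_query` helper, the dictionary-based context, and the query template (inferred from the evaluated query) are all assumptions.

```python
from datetime import date

def render_query(template: str, context: dict[str, str]) -> str:
    """Replace each <name> placeholder with its value from the context."""
    for name, value in context.items():
        template = template.replace(f"<{name}>", value)
    return template

# Hypothetical template; the placeholder form is assumed from the evaluated query.
template = "SELECT partnerId, customerId FROM orders WHERE orderDate='<var_daydate>'"

# Context captured when the scheduled run failed on 2019-12-13.
failure_context = {"var_daydate": date(2019, 12, 13).isoformat()}

# What a brand-new run started on the resume day would use instead.
fresh_context = {"var_daydate": date(2019, 12, 14).isoformat()}

# A resume run reuses the saved context, so the failure date is kept.
query = render_query(template, failure_context)
print(query)
```

Rendering the same template with `fresh_context` would produce the 2019-12-14 date, which is what a new execution, not a resume run, would see.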
All custom parameters
During a resume run, all custom parameters retain their values from the time of failure.
That means that if your workflow has failed and, before resuming, you change the value of an existing custom parameter in the Parameter Panel, the change will not take effect during the resume run.
If you need to change the value of an existing parameter and want that change to take effect in a resume run, the following options are recommended:
- If the failed task uses the custom parameter and the new value must take effect in the failed task itself, add a new custom parameter, assign the correct value to it, and use the new parameter in the failed task instead.
- If the failed task does not use the custom parameter, add a Script Task or BigQuery Task right after the failed task and assign the new value to the custom parameter there. During the resume run, once the failed task and the added Script Task or BigQuery Task have executed, the custom parameter will hold the new value.
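One way to picture why the first option works is to treat the values saved at failure as taking precedence over the Parameter Panel for every name they already contain. The sketch below is a mental model under that assumption, not documented Magnus behavior; `effective_parameters` and the parameter names are hypothetical.

```python
def effective_parameters(saved: dict[str, str], panel: dict[str, str]) -> dict[str, str]:
    """Values saved at failure win; names that are new in the panel pass through."""
    merged = dict(panel)
    merged.update(saved)  # saved values override panel edits to existing names
    return merged

# Hypothetical custom parameter as it was when the run failed.
saved_at_failure = {"var_region": "emea"}

# Parameter Panel after the failure: one edited value, one newly added parameter.
panel_now = {
    "var_region": "apac",        # edit to an existing parameter: ignored on resume
    "var_region_fixed": "apac",  # new parameter: takes effect on resume
}

params = effective_parameters(saved_at_failure, panel_now)
print(params["var_region"])        # emea
print(params["var_region_fixed"])  # apac
```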
All magic parameters
During a resume run, all magic parameters retain their values from the time of failure. This includes the table ID of the anonymous table produced by a BigQuery Task, available via <var_taskId_output>, and the return value of a Script Task, available via <var_taskId_return>.
Note that a BQ anonymous table is a temporary table that is deleted approximately 24 hours after the query is run. So, if you resume more than 24 hours later, the anonymous table behind <var_taskId_output> has likely already been deleted by BigQuery, and the resume run will fail because the table is no longer available.
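Before resuming, it can help to check whether a referenced anonymous table is likely to be gone. The following is a rough sketch only: the 24-hour lifetime is approximate, and the function name and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Approximate lifetime of a BigQuery anonymous table, per the note above.
ANONYMOUS_TABLE_TTL = timedelta(hours=24)

def output_table_likely_expired(query_ran_at: datetime, resume_at: datetime) -> bool:
    """Return True if the anonymous result table has probably been deleted."""
    return resume_at - query_ran_at > ANONYMOUS_TABLE_TTL

ran = datetime(2019, 12, 13, 10, 0)
print(output_table_likely_expired(ran, datetime(2019, 12, 13, 20, 0)))  # False
print(output_table_likely_expired(ran, datetime(2019, 12, 14, 11, 0)))  # True
```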
Magnus skips to the failed task, and executes the workflow starting from this task onward
During a resume run, Magnus first locates the failed task and resumes the workflow from there. How Magnus executes the failed task depends on the task type:
Failed Task Type | On resume |
BigQuery | The task will be executed again. The query will be submitted again as a new BQ job. |
Go-To, GS-Export, Misc, MySQL, FTP, FTP-GS | The task will be executed again. |
Loop | Magnus skips to the failed iteration, then skips to the failed nested task and starts execution from there. |
Workflow | The failed child workflow is resumed. If the failure is caused by “Permission denied to run workflow” or “Another instance is already running”, then the child workflow will be executed as a brand new execution, not a resume run. That means, context from the time of failure will not be loaded. For example, <var_daydate> will reflect the current date, not the failure date. |
Hub Task for Magnus Workflow | The failed child workflows are resumed. |
Hub Task for BQ Job | The status of the BQ jobs will be polled again. If there are any failed BQ jobs, the Hub Task will fail again. Hub Task for BQ Job does not re-submit the failed BQ jobs. It simply polls the BQ job status again. |
Note: During a resume run, Magnus skips to the failed task, which is its starting point. The failed task is identified by its Task Id and Task Type. If you remove the failed task from the workflow definition before resuming, the resume run will fail.
If, for whatever reason, you do not want the failed task to execute during the resume run, you can simply disable that task before resuming.
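The lookup described in the note can be sketched as follows. This is an illustrative model, not Magnus code: the `Task` class, the function name, and the idea that disabled tasks are simply filtered out are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    task_type: str
    enabled: bool = True

def tasks_to_run_on_resume(tasks: list[Task], failed_id: str, failed_type: str) -> list[Task]:
    """Return the enabled tasks from the failed task onward."""
    for index, task in enumerate(tasks):
        # The failed task is matched on both Task Id and Task Type.
        if task.task_id == failed_id and task.task_type == failed_type:
            return [t for t in tasks[index:] if t.enabled]
    # The failed task was removed from the definition: the resume cannot start.
    raise LookupError(f"failed task {failed_id!r} ({failed_type}) is not in the workflow")

workflow = [
    Task("load", "BigQuery"),
    Task("export", "GS-Export", enabled=False),  # disabled before resuming
    Task("notify", "Script"),
]
print([t.task_id for t in tasks_to_run_on_resume(workflow, "export", "GS-Export")])  # ['notify']
```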
Considerations
Failed task is a Hub Task for Magnus Workflows
When a Hub Task fails because a child workflow failed, changes made to the respective API Task that makes the Remote Workflow Execution call will have no effect in the resume run.
For example, say you have a workflow w1 that does Remote Workflow Execution to child workflows w2 and w3 via API Tasks. And you have a Hub Task that waits for w2 and w3 to complete.
Say w2 completed successfully, and w3 failed.
This will cause the Hub Task to fail, and thus the parent workflow w1 to fail. Say the workflow owner then initiates resume on the parent workflow w1.
On resume run, Magnus sees that w2 ran successfully and w3 failed, thus Magnus will resume w3. During the resume run for w3, all built-in, custom, and magic parameters retain their values from the time of failure. Thus, if there were any input parameters to w3, they will retain the same values as the time of failure. If you make changes to the API Task that remotely calls w3, it will not have any effect in the resume run.
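The w1/w2/w3 example can be modeled in a few lines of Python. The status strings, function names, and the `var_country` input are hypothetical and serve only to illustrate the behavior described above.

```python
def children_to_resume(child_status: dict[str, str]) -> list[str]:
    """Return the child workflows that a resumed Hub Task would resume."""
    return [name for name, status in child_status.items() if status == "failed"]

def inputs_for_resume(saved_inputs: dict[str, str],
                      edited_api_task_inputs: dict[str, str]) -> dict[str, str]:
    """Inputs come from the context saved at failure; later API Task edits are ignored."""
    return dict(saved_inputs)

status_at_failure = {"w2": "success", "w3": "failed"}
print(children_to_resume(status_at_failure))  # ['w3']

saved = {"var_country": "US"}   # hypothetical input passed to w3 in the failed run
edited = {"var_country": "CA"}  # API Task edited after the failure
print(inputs_for_resume(saved, edited))  # {'var_country': 'US'}
```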
Resuming a workflow chain
When child workflows within a workflow chain fail, the user can initiate a resume only at the root workflow level, not at the child workflow level.
During the resume run, Magnus implicitly resumes the failed child workflows. Any child workflows that were not executed before are run as brand-new executions.
For example, say you have a workflow chain w1 -> w2 -> w3 -> w4 -> w5, where w1 calls w2, w2 calls w3, w3 calls w4, and w4 calls w5.
During a regular scheduled run, w1, w2, and w3 failed. The workflows w4 and w5 were not even attempted.
The user then initiates the resume at the root workflow level, which is w1.
During the resume run, Magnus executes w1, w2, and w3 as resume runs, so the context from the time of failure is loaded. That means built-in parameters such as <var_daydate> will have the date from the time of failure.
Magnus executes w4 and w5 as brand-new executions, not resume runs. Thus, built-in parameters such as <var_daydate> will have the current date.
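The chain example can be summarized with a small planning function. This is a sketch under assumptions: the status labels, the function name, and the two dates are invented for illustration, and the real scheduling logic is not described in this document.

```python
from datetime import date

def plan_chain_resume(status: dict[str, str], failure_date: date,
                      today: date) -> dict[str, tuple[str, str]]:
    """Map each workflow to (run mode, value of <var_daydate>) for a resume started at the root."""
    plan = {}
    for name, state in status.items():
        if state == "failed":
            # Resume run: context from the time of failure is loaded.
            plan[name] = ("resume", failure_date.isoformat())
        elif state == "not_run":
            # Never attempted: executed as a brand-new run with the current date.
            plan[name] = ("new", today.isoformat())
    return plan

status = {"w1": "failed", "w2": "failed", "w3": "failed", "w4": "not_run", "w5": "not_run"}
plan = plan_chain_resume(status, failure_date=date(2019, 12, 13), today=date(2019, 12, 14))
for name, (mode, daydate) in plan.items():
    print(name, mode, daydate)
```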
Explicit resume vs implicit resume
Explicit resume
An explicit resume is a resume that is initiated by a user, for example by clicking the [Resume from failure] button.
Explicit resume restrictions
A child workflow cannot be resumed explicitly. You must start the resume at the root workflow; Magnus will then resume the child workflow implicitly for you.
A workflow that failed with the error “permission denied to run workflow” or “another instance is already running” cannot be resumed explicitly. It can only be resumed implicitly by Magnus in a workflow chain. During the resume run, Magnus will execute that workflow as a brand-new execution, not as a resume run.
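The two restrictions can be expressed as a simple check. The function below is a hypothetical sketch; in particular, matching on the exact error text is an assumption made for illustration.

```python
# Failure reasons that rule out an explicit resume, per the restrictions above.
NON_RESUMABLE_ERRORS = {
    "permission denied to run workflow",
    "another instance is already running",
}

def can_resume_explicitly(is_root: bool, failure_reason: str) -> bool:
    """Only a root workflow with a resumable failure can be resumed by a user."""
    return is_root and failure_reason.lower() not in NON_RESUMABLE_ERRORS

print(can_resume_explicitly(True, "backendError"))                         # True
print(can_resume_explicitly(False, "backendError"))                        # False
print(can_resume_explicitly(True, "Another instance is already running"))  # False
```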
Implicit resume
An implicit resume is a resume that is triggered by Magnus because a parent workflow is being resumed.
For example, suppose you have a workflow chain w1 -> w2 -> w3, where w1 calls w2 and w2 calls w3. Say w3 failed, thus causing w2 and w1 to fail as well. To resume, you explicitly resume w1 by clicking the [Resume from failure] button. Magnus then resumes w2 and w3 implicitly.
Frequently Asked Questions
Who can resume a workflow?
The workflow owner, workflow delegates, or admins can resume a workflow. Regardless of who initiates the resume, the resume run will be executed under the original run-user.
When can a workflow be resumed?
After a workflow fails, you have up to 48 hours from the execution start time to resume it. For example, the workflow started at 2019-12-16 10:00 am, then 15 minutes into the execution, it failed at 2019-12-16 10:15 am. Then you have up until 2019-12-18 10:00 am to resume this execution.
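The deadline arithmetic in this example can be checked with a short Python sketch. The function names are hypothetical, time zones are ignored, and treating the deadline itself as still resumable is an assumption.

```python
from datetime import datetime, timedelta

# The window is measured from the execution start time, not the failure time.
RESUME_WINDOW = timedelta(hours=48)

def resume_deadline(execution_start: datetime) -> datetime:
    """Latest moment at which the failed execution can still be resumed."""
    return execution_start + RESUME_WINDOW

def can_still_resume(execution_start: datetime, now: datetime) -> bool:
    return now <= resume_deadline(execution_start)

start = datetime(2019, 12, 16, 10, 0)
print(resume_deadline(start))                                  # 2019-12-18 10:00:00
print(can_still_resume(start, datetime(2019, 12, 17, 9, 0)))   # True
print(can_still_resume(start, datetime(2019, 12, 18, 10, 1)))  # False
```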
If I have a workflow chain and a child workflow failed, can I resume starting at the child workflow?
No, the resume must start at the root workflow level. For example, suppose you have a workflow chain w1 -> w2 -> w3, where w1 calls w2 and w2 calls w3. Say w3 failed, thus causing w2 and w1 to fail. To resume, you have to start the resume at w1. Magnus will first start the resume on w1, which in turn will trigger the resume on w2 and w3.
Can I resume the same failed execution more than once?
A failed execution can be resumed only once. However, if the resumed run itself fails, the user can resume that run. In practice, it can take several rounds of fixing and resuming before the workflow completes successfully, but each time it is the most recent failed execution that must be resumed.