# flyte-support
c
Are retries for user errors supported for raw container tasks? The behavior we see seems to suggest no and I can understand since its not using
pyflyte-execute
but wanted to confirm since I don't think this is documented
g
If you write an error file, that will be considered a user error,
and it should retry.
c
Thank you!
Looking at this code, it doesn't seem to be run anywhere now, so it seems like we might have to modify this file so that it decides to add the upload command, right?
Oh wait it looks like it might run as part of the sidecar cmd if the task has outputs declared
d
cc @full-toddler-99513 do you want to help?
f
Sorry, I have other tasks to handle at the moment, so I’m unavailable for now.
w
@glamorous-carpet-83516: I am a little confused by the suggestion. Are you suggesting writing the error to the output_data_dir path? There could be cases like an exception or OOM where we cannot write to output_data_dir. How do we handle cases like this?
g
do you want to retry on OOM?
w
Yes, OOM or some random Exception or timeouts.
g
IIRC, OOM is a system error, and propeller will retry.
> Yes, OOM or some random Exception or timeouts.
Yes, in these cases you don't need to write an error file. Propeller should look up the error code and handle it.
c
In our case we are sending the compute to Armada, so we might have to get a little more specific about which failures are retryable and which ones are not. However, I'm not sure retrying on an OOM makes sense. Example code from our Armada plugin:
```go
func mapExecutionStateToPhaseInfo(state ExecutionState, cfg *Config, clock clock.Clock) core.PhaseInfo {
	t := clock.Now()
	taskInfo := constructTaskInfo(state, cfg, clock)
	reason := fmt.Sprintf("Armada job in state %s", state.JobState)

	switch state.JobState {
	case api.JobState_UNKNOWN, api.JobState_QUEUED, api.JobState_SUBMITTED:
		return core.PhaseInfoQueuedWithTaskInfo(t, core.DefaultPhaseVersion, reason, taskInfo)

	case api.JobState_PENDING, api.JobState_LEASED:
		return core.PhaseInfoInitializing(t, core.DefaultPhaseVersion, reason, taskInfo)

	case api.JobState_RUNNING:
		return core.PhaseInfoRunningWithReason(core.DefaultPhaseVersion, taskInfo, reason)

	case api.JobState_SUCCEEDED:
		return core.PhaseInfoSuccessWithReason(taskInfo, reason)

	case api.JobState_PREEMPTED, api.JobState_CANCELLED:
		// This likely means Armada or someone cancelled these jobs. Marking them as retryable.
		return core.PhaseInfoRetryableFailure(errors.DownstreamSystemError, reason, taskInfo)

	case api.JobState_FAILED, api.JobState_REJECTED:
		return core.PhaseInfoFailure(errors.TaskFailedUnknownError, "Job failed", taskInfo)
	default:
		reason = fmt.Sprintf("unhandled job state %s", state.JobState)
		return core.PhaseInfoFailure(errors.RuntimeFailure, reason, taskInfo)
	}
}
```
An OOM probably ends up as job state failed, which would turn into PhaseInfoFailure.
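If the goal is for the task's declared retries to apply to some of those failures, one possible direction (a sketch only, not a tested change) is to split the FAILED/REJECTED branch so that failures classified as user-caused come back as retryable rather than permanent. The failedBecauseOfUserCode predicate below is a hypothetical placeholder, and the *core.TaskInfo parameter type is assumed to match what constructTaskInfo returns; the PhaseInfo constructors and error codes are the ones already used in the snippet above.
```go
// Hypothetical refinement of the FAILED/REJECTED branch: report user-caused
// failures as retryable so they can be re-attempted, and keep anything we decide
// not to retry (e.g. OOM) as a permanent failure.
// Assumes the same imports as mapExecutionStateToPhaseInfo above;
// failedBecauseOfUserCode is illustrative, not an existing helper.
func mapFailedState(state ExecutionState, reason string, taskInfo *core.TaskInfo) core.PhaseInfo {
	if failedBecauseOfUserCode(state) {
		// Retryable failure: attempts are bounded by the retries declared on the task.
		return core.PhaseInfoRetryableFailure(errors.TaskFailedUnknownError, reason, taskInfo)
	}
	// Terminal failure: no further attempts.
	return core.PhaseInfoFailure(errors.TaskFailedUnknownError, reason, taskInfo)
}
```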
w
@glamorous-carpet-83516: We are observing that if a ContainerTask returns a non-zero error code, flyte is not honoring TaskMetadata.retries set in ContainerTask.
c
@worried-iron-1001 is that with the outputs enabled, which we discussed internally would need to be set?
w
Yes. I added a dummy output, but now flyte expects all container task images to emit the output, or else it throws Outputs not generated by task execution.
I am a little confused why flyte is not able to catch the ContainerTask return code and retry, though it is documented clearly in flytekit.
g
> We are observing that if a ContainerTask returns a non-zero error code, flyte is not honoring TaskMetadata.retries set in ContainerTask
That's because a non-zero error becomes a system error for some reason. TaskMetadata.retries is only used for user errors.
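A minimal sketch of that distinction, assuming the flyteplugins pluginmachinery API (the kind each constructor reports is my reading, not something confirmed in this thread): PhaseInfoRetryableFailure surfaces a user-kind failure, so the attempt counts against TaskMetadata.retries, while PhaseInfoSystemRetryableFailure surfaces a system-kind failure that propeller retries from its own system retry budget.
```go
// Sketch only: choose which retry budget a terminal failure consumes.
// Assumes the same core and errors imports as the Armada plugin snippet above,
// and that PhaseInfoSystemRetryableFailure exists alongside PhaseInfoRetryableFailure.
func failurePhase(userCaused bool, reason string, taskInfo *core.TaskInfo) core.PhaseInfo {
	if userCaused {
		// User-kind retryable failure: bounded by TaskMetadata.retries.
		return core.PhaseInfoRetryableFailure(errors.TaskFailedUnknownError, reason, taskInfo)
	}
	// System-kind retryable failure: handled by propeller's system retries instead.
	return core.PhaseInfoSystemRetryableFailure(errors.DownstreamSystemError, reason, taskInfo)
}
```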