# flyte-support
c
Are retries for user errors supported for raw container tasks? The behavior we see seems to suggest no and I can understand since its not using
pyflyte-execute
but wanted to confirm since I don't think this is documented
g
If you write an error file, that will be considered a user error,
and it should retry.
c
Thank you!
Looking at this code, it doesn't seem to be run anywhere now, so it seems like we might have to modify this file so that it decides to add the upload command, right?
Oh wait it looks like it might run as part of the sidecar cmd if the task has outputs declared
d
cc @full-toddler-99513 do you want to help?
f
Sorry, I have other tasks to handle at the moment, so I’m unavailable for now.
w
@glamorous-carpet-83516: I am a little confused by the suggestion. Are you suggesting writing the error to the output_data_dir path? There could be cases like an exception or OOM where we cannot write to output_data_dir. How do we handle cases like this?
g
do you want to retry on OOM?
w
Yes, OOM or some random Exception or timeouts.
g
IIRC, OOM is a system error, and propeller will retry.
> Yes, OOM or some random Exception or timeouts.
Yes, in these cases you don't need to write an error file. Propeller should look up the error code and handle it.
c
In our case we are sending the compute to Armada, so we might have to get a little more specific about which failures are retryable and which ones are not. However, I'm not sure retrying on an OOM makes sense. Example code from our Armada plugin:
```go
func mapExecutionStateToPhaseInfo(state ExecutionState, cfg *Config, clock clock.Clock) core.PhaseInfo {
	t := clock.Now()
	taskInfo := constructTaskInfo(state, cfg, clock)
	reason := fmt.Sprintf("Armada job in state %s", state.JobState)

	switch state.JobState {
	case api.JobState_UNKNOWN, api.JobState_QUEUED, api.JobState_SUBMITTED:
		return core.PhaseInfoQueuedWithTaskInfo(t, core.DefaultPhaseVersion, reason, taskInfo)

	case api.JobState_PENDING, api.JobState_LEASED:
		return core.PhaseInfoInitializing(t, core.DefaultPhaseVersion, reason, taskInfo)

	case api.JobState_RUNNING:
		return core.PhaseInfoRunningWithReason(core.DefaultPhaseVersion, taskInfo, reason)

	case api.JobState_SUCCEEDED:
		return core.PhaseInfoSuccessWithReason(taskInfo, reason)

	case api.JobState_PREEMPTED, api.JobState_CANCELLED:
		// This likely means Armada or someone cancelled these jobs. Marking them as retryable.
		return core.PhaseInfoRetryableFailure(errors.DownstreamSystemError, reason, taskInfo)

	case api.JobState_FAILED, api.JobState_REJECTED:
		return core.PhaseInfoFailure(errors.TaskFailedUnknownError, "Job failed", taskInfo)
	default:
		reason = fmt.Sprintf("unhandled job state %s", state.JobState)
		return core.PhaseInfoFailure(errors.RuntimeFailure, reason, taskInfo)
	}
}
```
An OOM probably ends up as job state failed, which would turn into PhaseInfoFailure.
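If the goal is for the task's declared retries to apply to some of those failures, one possible direction (a sketch only, not a tested change) is to split the FAILED/REJECTED branch so that failures classified as user-caused come back as retryable rather than permanent. The failedBecauseOfUserCode predicate below is a hypothetical placeholder, and the *core.TaskInfo parameter type is assumed to match what constructTaskInfo returns; the PhaseInfo constructors and error codes are the ones already used in the snippet above.
```go
// Hypothetical refinement of the FAILED/REJECTED branch: report user-caused
// failures as retryable so they can be re-attempted, and keep anything we decide
// not to retry (e.g. OOM) as a permanent failure.
// Assumes the same imports as mapExecutionStateToPhaseInfo above;
// failedBecauseOfUserCode is illustrative, not an existing helper.
func mapFailedState(state ExecutionState, reason string, taskInfo *core.TaskInfo) core.PhaseInfo {
	if failedBecauseOfUserCode(state) {
		// Retryable failure: attempts are bounded by the retries declared on the task.
		return core.PhaseInfoRetryableFailure(errors.TaskFailedUnknownError, reason, taskInfo)
	}
	// Terminal failure: no further attempts.
	return core.PhaseInfoFailure(errors.TaskFailedUnknownError, reason, taskInfo)
}
```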
w
@glamorous-carpet-83516: We are observing that if a ContainerTask returns a non-zero error code, flyte is not honoring TaskMetadata.retries set in ContainerTask.
c
@worried-iron-1001 is that with the outputs enabled, which we discussed internally would need to be set?
w
Yes. I added a dummy output, but now flyte expects all container task images to emit the output, or else it throws Outputs not generated by task execution.
I am a little confused why flyte is not able to catch the ContainerTask return code and retry, though it is documented clearly in flytekit.
g
> We are observing that if a ContainerTask returns a non-zero error code, flyte is not honoring TaskMetadata.retries set in ContainerTask
That's because a non-zero error becomes a system error for some reason. TaskMetadata.retries is only used for user errors.
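A minimal sketch of that distinction, assuming the flyteplugins pluginmachinery API (the kind each constructor reports is my reading, not something confirmed in this thread): PhaseInfoRetryableFailure surfaces a user-kind failure, so the attempt counts against TaskMetadata.retries, while PhaseInfoSystemRetryableFailure surfaces a system-kind failure that propeller retries from its own system retry budget.
```go
// Sketch only: choose which retry budget a terminal failure consumes.
// Assumes the same core and errors imports as the Armada plugin snippet above,
// and that PhaseInfoSystemRetryableFailure exists alongside PhaseInfoRetryableFailure.
func failurePhase(userCaused bool, reason string, taskInfo *core.TaskInfo) core.PhaseInfo {
	if userCaused {
		// User-kind retryable failure: bounded by TaskMetadata.retries.
		return core.PhaseInfoRetryableFailure(errors.TaskFailedUnknownError, reason, taskInfo)
	}
	// System-kind retryable failure: handled by propeller's system retries instead.
	return core.PhaseInfoSystemRetryableFailure(errors.DownstreamSystemError, reason, taskInfo)
}
```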