clean-glass-36808 (11/27/2024, 6:16 PM):
`pyflyte-execute` … but wanted to confirm since I don't think this is documented

glamorous-carpet-83516 (11/27/2024, 8:18 PM):

glamorous-carpet-83516 (11/27/2024, 8:18 PM):

glamorous-carpet-83516 (11/27/2024, 8:19 PM):

clean-glass-36808 (11/27/2024, 8:31 PM):

clean-glass-36808 (11/27/2024, 8:49 PM):
`upload` command right?

clean-glass-36808 (11/27/2024, 8:50 PM):

damp-lion-88352 (11/28/2024, 12:44 AM):

full-toddler-99513 (11/28/2024, 7:11 PM):

worried-iron-1001 (12/02/2024, 5:55 PM):
`output_data_dir` path? There could be cases like an exception or OOM where we cannot write to `output_data_dir`. How do we handle cases like this?
glamorous-carpet-83516 (12/02/2024, 5:57 PM):

worried-iron-1001 (12/02/2024, 5:57 PM):

glamorous-carpet-83516 (12/02/2024, 6:02 PM):
`MaxNodeRetriesOnSystemFailures` in your settings? https://github.com/flyteorg/flyte/blob/92f8abb2f34b648a3430dc6e2262b4fdf625ad39/flytepropeller/pkg/controller/config/config.go#L102
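For reference, a sketch of where that knob might live in a flytepropeller configuration. The key name is inferred from the json tag on `MaxNodeRetriesOnSystemFailures` in the linked config.go and should be verified against your deployment's actual config/helm values:

```yaml
# Sketch of a flytepropeller config fragment (key name assumed from
# the json tag in config.go; verify against your deployment).
propeller:
  # Retries allowed for failures that propeller classifies as
  # system errors, independent of TaskMetadata.retries.
  max-node-retries-system-failures: 3
```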
glamorous-carpet-83516 (12/02/2024, 6:02 PM):

glamorous-carpet-83516 (12/02/2024, 6:03 PM):
> Yes, OOM or some random Exception or timeouts.
yes, in these cases you don't need to write an error file. propeller should look up the error code and handle it.
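For the cases where the task *can* report its own failure before exiting, a raw container can drop an error file into the output prefix. A minimal illustrative sketch follows; note that Flyte's actual convention is a serialized `ErrorDocument` protobuf (typically `error.pb`) in the output prefix, so the plain-text file and the `write_error_file` helper here are hypothetical stand-ins showing the flow, not the real wire format:

```python
import os


def write_error_file(output_data_dir: str, message: str) -> str:
    """Illustrative only: drop an error marker into the task's output
    directory so the platform can tell the task failed.

    Flyte's real convention is a serialized ErrorDocument proto
    (usually named error.pb); this plain-text stand-in just shows
    the write-before-exit flow.
    """
    os.makedirs(output_data_dir, exist_ok=True)
    path = os.path.join(output_data_dir, "error.txt")  # hypothetical name
    with open(path, "w") as f:
        f.write(message)
    return path
```

When the process dies without writing anything (OOM kill, hard timeout), no such file exists, which is exactly the case above where propeller falls back to the exit/error code.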
clean-glass-36808 (12/02/2024, 6:07 PM):
func mapExecutionStateToPhaseInfo(state ExecutionState, cfg *Config, clock clock.Clock) core.PhaseInfo {
t := clock.Now()
taskInfo := constructTaskInfo(state, cfg, clock)
reason := fmt.Sprintf("Armada job in state %s", state.JobState)
switch state.JobState {
case api.JobState_UNKNOWN, api.JobState_QUEUED, api.JobState_SUBMITTED:
return core.PhaseInfoQueuedWithTaskInfo(t, core.DefaultPhaseVersion, reason, taskInfo)
case api.JobState_PENDING, api.JobState_LEASED:
return core.PhaseInfoInitializing(t, core.DefaultPhaseVersion, reason, taskInfo)
case api.JobState_RUNNING:
return core.PhaseInfoRunningWithReason(core.DefaultPhaseVersion, taskInfo, reason)
case api.JobState_SUCCEEDED:
return core.PhaseInfoSuccessWithReason(taskInfo, reason)
case api.JobState_PREEMPTED, api.JobState_CANCELLED:
// This likely means Armada or someone cancelled these jobs. Marking them as retryable.
return core.PhaseInfoRetryableFailure(errors.DownstreamSystemError, reason, taskInfo)
case api.JobState_FAILED, api.JobState_REJECTED:
return core.PhaseInfoFailure(errors.TaskFailedUnknownError, "Job failed", taskInfo)
default:
reason = fmt.Sprintf("unhandled job state %s", state.JobState)
return core.PhaseInfoFailure(errors.RuntimeFailure, reason, taskInfo)
}
}
clean-glass-36808 (12/02/2024, 6:08 PM):
`PhaseInfoFailure`
worried-iron-1001 (12/02/2024, 6:29 PM):
We are observing that if a `ContainerTask` returns a non-zero error code, flyte is not honoring `TaskMetadata.retries` set in `ContainerTask`.

clean-glass-36808 (12/02/2024, 6:33 PM):

worried-iron-1001 (12/02/2024, 6:52 PM):
`Outputs not generated by task execution`.

worried-iron-1001 (12/02/2024, 6:59 PM):

glamorous-carpet-83516 (12/02/2024, 10:38 PM):
> We are observing that if a `ContainerTask` returns a non-zero error code, flyte is not honoring `TaskMetadata.retries` set in `ContainerTask`.
that's because a non-zero error becomes a system error for some reason. `TaskMetadata.retries` is used for the user error.
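The distinction described above can be sketched as two independent retry budgets: user-error retries from `TaskMetadata.retries`, and system-error retries capped by propeller's `MaxNodeRetriesOnSystemFailures`. The class and method names below are hypothetical and purely illustrative of the accounting, not Flyte's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class RetryBudgets:
    """Hypothetical sketch of the two retry budgets discussed above."""
    user_retries: int       # e.g. TaskMetadata(retries=3)
    system_retries: int     # e.g. max-node-retries-system-failures
    user_failures: int = 0
    system_failures: int = 0

    def record_failure(self, kind: str) -> bool:
        """Record one failure; return True if another attempt is allowed."""
        if kind == "user":
            self.user_failures += 1
            return self.user_failures <= self.user_retries
        self.system_failures += 1
        return self.system_failures <= self.system_retries


budgets = RetryBudgets(user_retries=3, system_retries=1)
# A non-zero exit misclassified as a *system* error burns through the
# small system budget even though user_retries=3 is never touched:
assert budgets.record_failure("system") is True   # one system retry allowed
assert budgets.record_failure("system") is False  # system budget exhausted
```

This is why a task with `retries=3` can still fail after a single attempt if the failure is being classified as a system error rather than a user error.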