Nicholas LoFaso
09/30/2022, 4:14 PM
Failed to find the Resource with name: dpp-default/g20210730154015-yjww-n0-0-dn4-0-dn108-0. Error: pods "g20210730154015-yjww-n0-0-dn4-0-dn108-0" not found
Flyte restarts the task and it succeeds on the 2nd or 3rd try, but this is obviously wasted work. I'm curious whether FlytePropeller needs more CPU/memory to keep up, or whether we are overwhelming the k8s metadata server. Any thoughts would be appreciated.

{"json":{"exec_id":"g20210730154015-yjww","node":"n0/dn4/dn108","ns":"dpp-default","res_ver":"100219704","routine":"worker-17","src":"plugin_manager.go:267","tasktype":"sidecar","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"Failed to find the Resource with name: dpp-default/g20210730154015-yjww-n0-0-dn4-0-dn108-0. Error: pods \"g20210730154015-yjww-n0-0-dn4-0-dn108-0\" not found","ts":"2022-09-28T01:44:03Z"}
{"json":{"exec_id":"g20210730154015-yjww","node":"n0/dn4/dn108","ns":"dpp-default","res_ver":"100221328","routine":"worker-17","src":"task_event_recorder.go:27","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"dpp\" domain:\"default\" name:\"msat.level2.proxy.run_splat\" version:\"dpp-918327b\" node_id:\"n0-0-dn4-0-dn108\" execution_id:\u003cproject:\"dpp\" domain:\"default\" name:\"g20210730154015-yjww\" \u003e 0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2022-09-28T01:44:07Z"}
Dan Rammer (hamersaw)
09/30/2022, 4:17 PM
Is inject-finalizer configured on FlytePropeller? Often what happens in scenarios like your stress test here is that a Pod completes and, before FlytePropeller has time to detect the successful completion, k8s garbage collects the Pod. So when FlytePropeller checks the Pod status and it's missing, the only thing it can do is restart it. The finalizer tells k8s not to delete the Pod until FlytePropeller removes the finalizer as part of the task finalize steps.

Nicholas LoFaso
09/30/2022, 4:18 PM

Dan Rammer (hamersaw)
09/30/2022, 4:20 PM
plugins:
k8s:
inject-finalizer: true
Nicholas LoFaso
09/30/2022, 4:24 PM
"No plugin found for Handler-type [python-task], defaulting to [container]"
This doesn't seem to be a big deal, but it's all over the logs, so it would be nice if we could remove it.

We also see "Stale" statements in the log:
{
"json": {
"exec_id": "g00000000000003-4ezh",
"ns": "dpp-default",
"routine": "worker-40",
"src": "handler.go:181"
},
"level": "warning",
"msg": "Workflow namespace[dpp-default]/name[g00000000000003-4ezh] Stale.",
"ts": "2022-09-30T13:47:26Z"
}
Dan Rammer (hamersaw)
09/30/2022, 4:27 PM

Nicholas LoFaso
09/30/2022, 4:29 PM
Regarding "Stale": I found this code in FlytePropeller
if r.isResourceVersionSameAsPrevious(ctx, namespace, name, w.ResourceVersion) {
    return nil, ErrStaleWorkflowError
}

func (r *resourceVersionCaching) isResourceVersionSameAsPrevious(ctx context.Context, namespace, name, resourceVersion string) bool {
    if v, ok := r.lastUpdatedResourceVersionCache.Load(resourceVersionKey(namespace, name)); ok {
        strV := v.(string)
        if strV == resourceVersion {
            r.metrics.workflowStaleCount.Inc(ctx)
            return true
        }
    }
    return false
}
But I'm not really sure what resource is the same as a previous version, or why it's a problem 😄

Dan Rammer (hamersaw)
09/30/2022, 4:31 PMNicholas LoFaso
09/30/2022, 4:34 PMDan Rammer (hamersaw)
09/30/2022, 4:34 PMNicholas LoFaso
09/30/2022, 4:34 PMDan Rammer (hamersaw)
09/30/2022, 4:34 PM