# ask-the-community
n
Hi, we’re doing some performance testing, and when we start a large number of tasks at once it seems that FlytePropeller loses track of some of the running pods. For example, a pod will be successful, but FlytePropeller logs the following. Full log in thread.
Failed to find the Resource with name: dpp-default/g20210730154015-yjww-n0-0-dn4-0-dn108-0. Error: pods \"g20210730154015-yjww-n0-0-dn4-0-dn108-0\" not found
Flyte restarts the task and it succeeds on the 2nd or 3rd try, but this is obviously wasted work. I’m curious whether FlytePropeller needs more CPU/memory to keep up, or whether we are overwhelming the k8s metadata server. Any thoughts would be appreciated.
{"json":{"exec_id":"g20210730154015-yjww","node":"n0/dn4/dn108","ns":"dpp-default","res_ver":"100219704","routine":"worker-17","src":"plugin_manager.go:267","tasktype":"sidecar","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"Failed to find the Resource with name: dpp-default/g20210730154015-yjww-n0-0-dn4-0-dn108-0. Error: pods \"g20210730154015-yjww-n0-0-dn4-0-dn108-0\" not found","ts":"2022-09-28T01:44:03Z"}
{"json":{"exec_id":"g20210730154015-yjww","node":"n0/dn4/dn108","ns":"dpp-default","res_ver":"100221328","routine":"worker-17","src":"task_event_recorder.go:27","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"dpp\" domain:\"default\" name:\"msat.level2.proxy.run_splat\" version:\"dpp-918327b\"  node_id:\"n0-0-dn4-0-dn108\" execution_id:\u003cproject:\"dpp\" domain:\"default\" name:\"g20210730154015-yjww\" \u003e  0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2022-09-28T01:44:07Z"}
d
Hey Nicholas! Great to hear from you again. Do you have inject-finalizer configured on FlytePropeller? Often what happens in scenarios like your stress test is that a Pod completes and, before FlytePropeller has a chance to detect the successful completion, k8s garbage collects the Pod. So when FlytePropeller checks the Pod status and it’s missing, the only thing it can do is restart it. The finalizer tells k8s not to delete the Pod until FlytePropeller removes it as part of the task finalize steps.
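For illustration, here is a minimal client-go sketch of the mechanism being described: a finalizer in the Pod's metadata keeps k8s from deleting the object until the controller strips the finalizer off. The finalizer name, namespace, pod name, and client setup below are assumptions for the example, not FlytePropeller's actual code.

```go
// Sketch only: add a finalizer to a Pod so it survives garbage collection
// until a controller has recorded its terminal state, then remove it.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const finalizer = "example.com/block-gc" // hypothetical name, for illustration

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pods := kubernetes.NewForConfigOrDie(cfg).CoreV1().Pods("dpp-default")
	ctx := context.Background()

	pod, err := pods.Get(ctx, "some-task-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// While this finalizer is present, a delete only marks the Pod for
	// deletion; the object stays readable, so its final phase is not lost.
	pod.Finalizers = append(pod.Finalizers, finalizer)
	if pod, err = pods.Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Later, once the controller has recorded the result, strip the
	// finalizer so k8s can actually garbage collect the Pod.
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	if _, err = pods.Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("finalizer removed; Pod can now be garbage collected")
}
```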
n
Hmm let me check
d
Yeah, so here it is in the code. It should be in the k8s plugin config, something like:
plugins:
  k8s:
    inject-finalizer: true
n
I think we used to have it enabled, but it was turned off at some point. Let me re-enable it and try again. Thanks!
While I have you, two other questions about logs I’ve seen in FlytePropeller. (1)
"No plugin found for Handler-type [python-task], defaulting to [container]"
This doesn’t seem to be a big deal, but it’s all over the logs, so it would be nice if we could remove it.
(2) I reran a workflow that was 95% complete. Flyte successfully used the cached results and it eventually completed, but it seemed to hang for a while and I saw a bunch of these Stale statements in the log:
{
  "json": {
    "exec_id": "g00000000000003-4ezh",
    "ns": "dpp-default",
    "routine": "worker-40",
    "src": "handler.go:181"
  },
  "level": "warning",
  "msg": "Workflow namespace[dpp-default]/name[g00000000000003-4ezh] Stale.",
  "ts": "2022-09-30T13:47:26Z"
}
d
😄 Yeah, the way plugins are registered allows for a "default" plugin, so if the task type doesn't match any registered plugin it will fall back to the "container" plugin. At some point this fallback was relied on, hence the messages you're seeing every time the plugin is resolved for a python task. Agreed, let's get this removed. (A rough sketch of that fallback is below.)
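Purely as an illustration of that registration pattern (not Flyte's actual plugin registry code), a hedged Go sketch of a handler lookup that falls back to a default "container" plugin and emits the same kind of warning when it does:

```go
// Illustrative sketch: a plugin registry with a default fallback.
package main

import "fmt"

// Plugin is a stand-in for a task-type handler.
type Plugin interface{ Name() string }

type containerPlugin struct{}

func (containerPlugin) Name() string { return "container" }

type registry struct {
	byTaskType map[string]Plugin
	fallback   Plugin
}

// Resolve returns the plugin registered for the task type, or the fallback.
// The fallback branch is what produces the noisy
// "No plugin found for Handler-type [...], defaulting to [container]" line.
func (r *registry) Resolve(taskType string) Plugin {
	if p, ok := r.byTaskType[taskType]; ok {
		return p
	}
	fmt.Printf("No plugin found for Handler-type [%s], defaulting to [%s]\n",
		taskType, r.fallback.Name())
	return r.fallback
}

func main() {
	r := &registry{byTaskType: map[string]Plugin{}, fallback: containerPlugin{}}
	_ = r.Resolve("python-task") // no handler registered, so it falls back and logs
}
```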
n
Cool, thanks. Obviously not critical.
The other one was a bit more pertinent. Not sure why it was saying the workflow was Stale. I found this code in FlytePropeller:
if r.isResourceVersionSameAsPrevious(ctx, namespace, name, w.ResourceVersion) {
	return nil, ErrStaleWorkflowError
}

func (r *resourceVersionCaching) isResourceVersionSameAsPrevious(ctx context.Context, namespace, name, resourceVersion string) bool {
	if v, ok := r.lastUpdatedResourceVersionCache.Load(resourceVersionKey(namespace, name)); ok {
		strV := v.(string)
		if strV == resourceVersion {
			r.metrics.workflowStaleCount.Inc(ctx)
			return true
		}
	}

	return false
}
But I’m not really sure what resource is the same as a previous version, or why that’s a problem 😄
d
Checking in on Stale quickly here. We use a k8s informer to keep track of the FlyteWorkflow CRDs internally; it basically maintains a cache that is informed of changes. So when FlytePropeller updates a node status, etc., it updates the CRD in etcd. We keep the resource version of the CRD to ensure that the next time FlytePropeller processes the workflow (read from the informer cache), we don't accidentally process a stale copy.
^^^ exactly what you linked to.
The informer cache is only updated by k8s (via the watch API). So the "Stale" message is saying: "we previously updated the FlyteWorkflow CRD, but we have not yet been notified by k8s that the CRD was updated".
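To make the round trip concrete, here is a minimal sketch of the write side that pairs with the check quoted earlier. The shape and method names (rememberProcessed, isStale) are made up for the example; they are not FlytePropeller's exact code. The idea: remember the resource version of the copy we just processed and updated, and if the informer serves that same version again, skip it as Stale until the watch delivers the newer copy.

```go
// Sketch: track the last-processed resource version so reads from the
// informer cache can be recognized as stale until the watch catches up.
package main

import (
	"fmt"
	"sync"
)

type resourceVersionCaching struct {
	lastUpdatedResourceVersionCache sync.Map
}

func resourceVersionKey(namespace, name string) string {
	return namespace + "/" + name
}

// rememberProcessed records the resource version of the workflow copy we just
// processed and wrote an update for. (Hypothetical name.)
func (r *resourceVersionCaching) rememberProcessed(namespace, name, resourceVersion string) {
	r.lastUpdatedResourceVersionCache.Store(resourceVersionKey(namespace, name), resourceVersion)
}

// isStale mirrors the isResourceVersionSameAsPrevious check above: true means
// the informer still serves the version we already processed, so our write has
// not yet been reflected in the cache and this round should be skipped.
func (r *resourceVersionCaching) isStale(namespace, name, resourceVersion string) bool {
	v, ok := r.lastUpdatedResourceVersionCache.Load(resourceVersionKey(namespace, name))
	return ok && v.(string) == resourceVersion
}

func main() {
	r := &resourceVersionCaching{}
	r.rememberProcessed("dpp-default", "example-wf", "1001")

	// Watch has not caught up: the informer returns the version we already processed.
	fmt.Println(r.isStale("dpp-default", "example-wf", "1001")) // true -> Stale, skip

	// Watch delivered the updated CRD with a newer version: process normally.
	fmt.Println(r.isStale("dpp-default", "example-wf", "1002")) // false
}
```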
n
OK, interesting, so for some reason k8s was slow to update the informer cache.
d
That's what it sounds like.
n
But it looks like it eventually did, and the workflow resumed successfully.
That’s all my questions for now. Going to retest with inject-finalizer on and see what happens. Thanks for your help!
d
Yeah, there will not be any issues with the correctness of the workflow execution; the Stale check is actually a safeguard in place to ensure correctness.