# ask-the-community
n
Hi, we’re doing some performance testing, and when we start a large number of tasks at once it seems that FlytePropeller loses track of some of the running pods. For example, a pod will be successful, but FlytePropeller logs the following. Full log in thread.
Failed to find the Resource with name: dpp-default/g20210730154015-yjww-n0-0-dn4-0-dn108-0. Error: pods \"g20210730154015-yjww-n0-0-dn4-0-dn108-0\" not found
Flyte restarts the task and it succeeds on the 2nd or 3rd try, but this is obviously wasted work. I’m curious whether FlytePropeller needs more CPU/memory to keep up, or whether we are overwhelming the k8s metadata server. Any thoughts would be appreciated.
{"json":{"exec_id":"g20210730154015-yjww","node":"n0/dn4/dn108","ns":"dpp-default","res_ver":"100219704","routine":"worker-17","src":"plugin_manager.go:267","tasktype":"sidecar","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"Failed to find the Resource with name: dpp-default/g20210730154015-yjww-n0-0-dn4-0-dn108-0. Error: pods \"g20210730154015-yjww-n0-0-dn4-0-dn108-0\" not found","ts":"2022-09-28T01:44:03Z"}
{"json":{"exec_id":"g20210730154015-yjww","node":"n0/dn4/dn108","ns":"dpp-default","res_ver":"100221328","routine":"worker-17","src":"task_event_recorder.go:27","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"dpp\" domain:\"default\" name:\"msat.level2.proxy.run_splat\" version:\"dpp-918327b\"  node_id:\"n0-0-dn4-0-dn108\" execution_id:\u003cproject:\"dpp\" domain:\"default\" name:\"g20210730154015-yjww\" \u003e  0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2022-09-28T01:44:07Z"}
d
Hey Nicholas! Great to hear from you again. Do you have inject-finalizer configured on FlytePropeller? Often what happens in scenarios like your stress test is that a Pod completes and, before FlytePropeller has a chance to detect the successful completion, k8s garbage collects the Pod. So when FlytePropeller checks the Pod status and it’s missing, the only thing it can do is restart it. The finalizer tells k8s not to delete the Pod until FlytePropeller removes it as part of the task finalize steps.
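For illustration, here is a minimal client-go sketch of the mechanism being described: a finalizer in the Pod's metadata keeps k8s from deleting the object until the controller strips the finalizer off. The finalizer name, namespace, pod name, and client setup below are assumptions for the example, not FlytePropeller's actual code.

```go
// Sketch only: add a finalizer to a Pod so it survives garbage collection
// until a controller has recorded its terminal state, then remove it.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const finalizer = "example.com/block-gc" // hypothetical name, for illustration

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pods := kubernetes.NewForConfigOrDie(cfg).CoreV1().Pods("dpp-default")
	ctx := context.Background()

	pod, err := pods.Get(ctx, "some-task-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// While this finalizer is present, a delete only marks the Pod for
	// deletion; the object stays readable, so its final phase is not lost.
	pod.Finalizers = append(pod.Finalizers, finalizer)
	if pod, err = pods.Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Later, once the controller has recorded the result, strip the
	// finalizer so k8s can actually garbage collect the Pod.
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	if _, err = pods.Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("finalizer removed; Pod can now be garbage collected")
}
```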
n
Hmm let me check
d
Yeah, so here it is in the code. It should be in the k8s plugin config, something like:
plugins:
  k8s:
    inject-finalizer: true
n
I think we used to have it enabled, but it was turned off at some point. Let me re-enable it and try again. Thanks!
While I have you, two other questions about logs I’ve seen in FlytePropeller. (1)
"No plugin found for Handler-type [python-task], defaulting to [container]"
This doesn’t seem to be a big deal, but it’s all over the logs, so it would be nice if we could remove it.
(2) I reran a workflow that was 95% complete. Flyte successfully used the cached results and it eventually completed, but it seemed to hang for a while and I saw a bunch of these Stale statements in the log:
{
  "json": {
    "exec_id": "g00000000000003-4ezh",
    "ns": "dpp-default",
    "routine": "worker-40",
    "src": "handler.go:181"
  },
  "level": "warning",
  "msg": "Workflow namespace[dpp-default]/name[g00000000000003-4ezh] Stale.",
  "ts": "2022-09-30T13:47:26Z"
}
d
😄 Yeah, the way plugins are registered allows for a "default" plugin, so if the task type doesn't match any registered plugin it will fall back to the "container" plugin. At some point this fallback was relied on, hence the messages you're seeing every time the plugin is resolved for a python task. Agreed, let's get this removed. (A rough sketch of that fallback is below.)
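Purely as an illustration of that registration pattern (not Flyte's actual plugin registry code), a hedged Go sketch of a handler lookup that falls back to a default "container" plugin and emits the same kind of warning when it does:

```go
// Illustrative sketch: a plugin registry with a default fallback.
package main

import "fmt"

// Plugin is a stand-in for a task-type handler.
type Plugin interface{ Name() string }

type containerPlugin struct{}

func (containerPlugin) Name() string { return "container" }

type registry struct {
	byTaskType map[string]Plugin
	fallback   Plugin
}

// Resolve returns the plugin registered for the task type, or the fallback.
// The fallback branch is what produces the noisy
// "No plugin found for Handler-type [...], defaulting to [container]" line.
func (r *registry) Resolve(taskType string) Plugin {
	if p, ok := r.byTaskType[taskType]; ok {
		return p
	}
	fmt.Printf("No plugin found for Handler-type [%s], defaulting to [%s]\n",
		taskType, r.fallback.Name())
	return r.fallback
}

func main() {
	r := &registry{byTaskType: map[string]Plugin{}, fallback: containerPlugin{}}
	_ = r.Resolve("python-task") // no handler registered, so it falls back and logs
}
```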
n
Cool, thanks. Obviously not critical.
The other one was a bit more pertinent. Not sure why it was saying the workflow was Stale. I found this code in FlytePropeller:
if r.isResourceVersionSameAsPrevious(ctx, namespace, name, w.ResourceVersion) {
	return nil, ErrStaleWorkflowError
}

func (r *resourceVersionCaching) isResourceVersionSameAsPrevious(ctx context.Context, namespace, name, resourceVersion string) bool {
	if v, ok := r.lastUpdatedResourceVersionCache.Load(resourceVersionKey(namespace, name)); ok {
		strV := v.(string)
		if strV == resourceVersion {
			r.metrics.workflowStaleCount.Inc(ctx)
			return true
		}
	}

	return false
}
But I’m not really sure what resource is the same as a previous version, or why that’s a problem 😄
d
Checking in on Stale quickly here. We use a k8s informer to keep track of the FlyteWorkflow CRDs internally; it basically maintains a cache that is informed of changes. So when FlytePropeller updates a node status, etc., it updates the CRD in etcd. We keep the resource version of the CRD to ensure that the next time FlytePropeller processes the workflow (read from the informer cache), we don't accidentally process a stale copy.
^^^ exactly what you linked to.
The informer cache is only updated by k8s (via the watch API). So the "Stale" message is saying: "we previously updated the FlyteWorkflow CRD, but we have not yet been notified by k8s that the CRD was updated".
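To make the round trip concrete, here is a minimal sketch of the write side that pairs with the check quoted earlier. The shape and method names (rememberProcessed, isStale) are made up for the example; they are not FlytePropeller's exact code. The idea: remember the resource version of the copy we just processed and updated, and if the informer serves that same version again, skip it as Stale until the watch delivers the newer copy.

```go
// Sketch: track the last-processed resource version so reads from the
// informer cache can be recognized as stale until the watch catches up.
package main

import (
	"fmt"
	"sync"
)

type resourceVersionCaching struct {
	lastUpdatedResourceVersionCache sync.Map
}

func resourceVersionKey(namespace, name string) string {
	return namespace + "/" + name
}

// rememberProcessed records the resource version of the workflow copy we just
// processed and wrote an update for. (Hypothetical name.)
func (r *resourceVersionCaching) rememberProcessed(namespace, name, resourceVersion string) {
	r.lastUpdatedResourceVersionCache.Store(resourceVersionKey(namespace, name), resourceVersion)
}

// isStale mirrors the isResourceVersionSameAsPrevious check above: true means
// the informer still serves the version we already processed, so our write has
// not yet been reflected in the cache and this round should be skipped.
func (r *resourceVersionCaching) isStale(namespace, name, resourceVersion string) bool {
	v, ok := r.lastUpdatedResourceVersionCache.Load(resourceVersionKey(namespace, name))
	return ok && v.(string) == resourceVersion
}

func main() {
	r := &resourceVersionCaching{}
	r.rememberProcessed("dpp-default", "example-wf", "1001")

	// Watch has not caught up: the informer returns the version we already processed.
	fmt.Println(r.isStale("dpp-default", "example-wf", "1001")) // true -> Stale, skip

	// Watch delivered the updated CRD with a newer version: process normally.
	fmt.Println(r.isStale("dpp-default", "example-wf", "1002")) // false
}
```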
n
OK, interesting, so for some reason k8s was slow to update the informer cache.
d
That's what it sounds like.
n
But it looks like it eventually did, and the workflow resumed successfully.
That’s all my questions for now. Going to retest with inject-finalizer on and see what happens. Thanks for your help!
d
Yeah, there will not be any issues with the correctness of the workflow execution; the Stale check is actually a safeguard in place to ensure correctness.