Hi team I'm seeing this error in a slightly long running fly Flyte #flyte-support

little-cricket-84530

09/23/2022, 4:41 PM

Hi team… I’m seeing this error in a slightly long running flyte task, which results in the task getting triggered again and again.

object [my_domainctw5x2vdnkvd-n0-0] terminated in the background, manually

little-cricket-84530

09/23/2022, 4:42 PM

Looks like it was discussed here but there’s no conclusion

little-cricket-84530

09/23/2022, 4:47 PM

I’m going to try with

interruptible=False

to see if it alleviates the problem since it’s running on spot instances

hallowed-mouse-14616

09/23/2022, 4:48 PM

@little-cricket-84530 thanks for diving into past conversations! So yeah, as mentioned, this happens when Flyte attempts to check a Pod's status and the Pod is missing. This means that some external system (typically a resource manager) has deleted the Pod. Flyte has no way to figure out what happened, so relaunching the task is our best effort.

hallowed-mouse-14616

09/23/2022, 4:49 PM

Oh if it's using spot instances that explains a lot. the long running task is probably being evicted.

freezing-airport-6809

09/23/2022, 5:15 PM

ya you need to enable

finalizer

freezing-airport-6809

09/23/2022, 5:15 PM

this is a flag in flytepropeller

freezing-airport-6809

09/23/2022, 5:16 PM

https://docs.flyte.org/en/latest/deployment/cluster_config/scheduler_config.html#inject-finalizer-bool

little-cricket-84530

09/23/2022, 5:40 PM

Even with

interruptible=False

it got evicted… will look into this

little-cricket-84530

09/26/2022, 9:40 PM

The job is still getting evicted 😞

hallowed-mouse-14616

09/26/2022, 9:59 PM

@little-cricket-84530 just to recap. you guys are running Flyte tasks on spot instances right?

little-cricket-84530

09/26/2022, 9:59 PM

correct

hallowed-mouse-14616

09/26/2022, 10:00 PM

how long running are the tasks?

hallowed-mouse-14616

09/26/2022, 10:16 PM

So basically, there is no way for Flyte to tell the spot instances that they may not evict the pods. Spot instances give a warning (i.e. mark the pod as deleted), and if the pod controller does not clean up the resource within a predefined period, the spot instance force-deletes the pod. This is what you're seeing: Flyte is looking for a pod that already started and can't find it. There are a few mechanisms within Flyte that we use to combat this:

(1) Injecting finalizers. This is a way to tell k8s not to delete the resource until the finalizer is removed. However, in the case of spot instances these are ignored and the pod is deleted anyway.

(2) System retries. The idea of using spot instances is typically to reduce the cost of compute. A simple mechanism to enable tasks here is to retry many times, which is configurable. So if you have system retries set to something like 50, Flyte can retry 50 times. Additionally, there is configuration for an 'interruptibleThreshold': when using interruptible, you can mark the first N retries as 'interruptible', and once it exceeds N retries the remainder are no longer marked as 'interruptible'. As you mentioned, this will likely not work either, as the spot instance is deleting the pods.

(3) Intra-task checkpointing. If the task is running some iterative operation, intra-task checkpointing allows the task to periodically save its state. So even if it is interrupted, the next instance picks up at some mid-point rather than redoing all of the work.

AFAIK the only other fallback is to evaluate this workload on a non-spot instance.
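To make the threshold in (2) concrete, here is a rough model of the decision it describes. This is only a sketch, not FlytePropeller's actual code: the function and parameter names are hypothetical, and the exact boundary (whether attempt N itself is still interruptible) is an assumption.

```python
def attempt_is_interruptible(task_interruptible: bool,
                             system_failures: int,
                             failure_threshold: int) -> bool:
    """Decide whether the next attempt is still scheduled as interruptible.

    Hypothetical model: the task stays interruptible until the number of
    system failures so far reaches the threshold, then falls back to
    non-interruptible scheduling for the remaining retries.
    """
    if not task_interruptible:
        # A task that never asked for interruptible nodes is unaffected.
        return False
    # Assumed boundary: strictly fewer failures than the threshold.
    return system_failures < failure_threshold


if __name__ == "__main__":
    # Show the decision for the first few attempts with a threshold of 2.
    for failures in range(4):
        print(failures, attempt_is_interruptible(True, failures, 2))
```

Under this model a long-running task would eventually land on non-interruptible capacity, provided such capacity exists in the cluster.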

hallowed-mouse-14616

09/26/2022, 10:37 PM

happy to hop on a call and discuss options if you want!

little-cricket-84530

09/27/2022, 6:22 AM

System retries: I do have those already. The problem is that it's a task that triggers a job running somewhere else and monitors it. Now every time the task restarts, it retriggers the job. For now I've gotten around it by splitting the triggering and the monitoring to completion into 2 separate steps.
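The two-step split might look roughly like the sketch below. It is plain Python with hypothetical names: `start` and `get_status` stand in for whatever calls the external system, and the status strings are invented for illustration. In Flyte each function would presumably be its own task, so that a retry of the monitoring step never retriggers the job.

```python
import time
from typing import Callable


def trigger_job(start: Callable[[], str]) -> str:
    """Step 1: start the external job once and return its identifier (e.g. a URL)."""
    return start()


def monitor_job(job_id: str,
                get_status: Callable[[str], str],
                poll_seconds: float = 30.0) -> str:
    """Step 2: poll until the job reaches a terminal state.

    This step only reads the job's status, so rerunning it after an
    eviction does not start a second job.
    """
    while True:
        status = get_status(job_id)
        # "SUCCESS" and "FAILURE" are placeholder terminal states.
        if status in ("SUCCESS", "FAILURE"):
            return status
        time.sleep(poll_seconds)
```

The trade-off is that the triggering step can still be evicted before it returns the identifier, but it is short enough that this should be rare.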

little-cricket-84530

09/27/2022, 6:24 AM

intra task checkpointing may be a good approach for me instead of splitting the task into 2 separate tasks. I can try incorporating it into my code

hallowed-mouse-14616

09/27/2022, 12:09 PM

it's a task that triggers a job running somewhere else and monitors it

sounds like Flyte is just missing a plugin here? so you're using spot instances because they are not actually doing any work, just starting and monitoring an external job?

little-cricket-84530

09/27/2022, 3:52 PM

Flyte is triggering a Jenkins job in this case

little-cricket-84530

09/27/2022, 3:53 PM

and the orchestration and some data manipulation is what resides in Flyte

little-cricket-84530

09/27/2022, 3:55 PM

So I tried using the checkpoint and ran into this error

Expected exactly one checkpoint - found 0

little-cricket-84530

09/27/2022, 4:04 PM

What I did: try reading the checkpoint; if it's empty, trigger the Jenkins job and save the URL of the Jenkins job (checkpoint.write()). Then wait for the job to finish.

little-cricket-84530

09/27/2022, 4:26 PM

Question: Once a checkpoint is read, can the same data be “re-read”?

hallowed-mouse-14616

09/27/2022, 11:08 PM

Is the Jenkins job a webAPI call? It really sounds like implementing a backend plugin for Jenkins is the correct way to handle this scenario. Essentially, right now we're starting a k8s pod to start a jenkins job, and then using the pod to monitor the jenkins job. With a backend plugin this could all be managed by FlytePropeller and the eviction of pods on spot instances here would not be an issue.

hallowed-mouse-14616

09/27/2022, 11:10 PM

Also, the intra-task checkpointing functionality can be a little unintuitive sometimes. Basically, each attempt of the task is given the previous attempt's checkpoint path, and that is what gets read. So on the first attempt it should write the external job id to the checkpoint, and then in subsequent attempts you should read from the old checkpoint and immediately write to the new one; this way it will be available in later attempts. This functionality is slated to be cleaned up in this issue. However, I will reiterate that this is pretty hacky; a plugin would be a better path.
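The read-then-write-back pattern could be sketched as follows. The code depends only on an object with `read()` and `write()` methods, which is how flytekit's checkpoint object is used in this thread; `CheckpointLike`, `resume_or_trigger` and `trigger` are hypothetical names. Treating a `ValueError` on read as "no previous checkpoint" is an assumption based on the "Expected exactly one checkpoint - found 0" error quoted earlier.

```python
from typing import Callable, Optional, Protocol


class CheckpointLike(Protocol):
    """Minimal interface assumed here: read previous bytes, write new bytes."""

    def read(self) -> Optional[bytes]: ...

    def write(self, b: bytes) -> None: ...


def resume_or_trigger(cp: CheckpointLike, trigger: Callable[[], str]) -> str:
    """Return the external job id, triggering the job only if none was saved."""
    try:
        previous = cp.read()
    except ValueError:
        # Assumption: a missing previous checkpoint may surface as an error.
        previous = None

    if previous:
        # A previous attempt already started the job; reuse its id.
        job_id = previous.decode("utf-8")
    else:
        # First attempt: start the job exactly once.
        job_id = trigger()

    # Write the id to the current attempt's checkpoint straight away so the
    # next attempt can read it, even if this one is evicted.
    cp.write(job_id.encode("utf-8"))
    return job_id
```

Inside a Flyte task, `cp` would be the object returned by `flytekit.current_context().checkpoint`, and the task would go on to monitor `job_id` after this call.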

little-cricket-84530

09/28/2022, 1:55 AM

Great, that's what I figured: I need to read it and write it back

little-cricket-84530

09/28/2022, 1:56 AM

and yes.. triggering jenkins is a web api call

little-cricket-84530

09/28/2022, 1:57 AM

I suppose I can look into writing the Jenkins plugin

freezing-airport-6809

09/28/2022, 2:23 AM

@little-cricket-84530 / @hallowed-mouse-14616 can we meet tomorrow? I want to see if there is a faster route

little-cricket-84530

09/28/2022, 2:53 AM

for now the checkpoint method works.. but yes definitely inefficient. We can meet tomorrow

👍 1
