# ask-the-community
Hi team… I’m seeing this error in a slightly long running flyte task, which results in the task getting triggered again and again.
object [my_domainctw5x2vdnkvd-n0-0] terminated in the background, manually
Looks like it was discussed here but there’s no conclusion
I’m going to try with
to see if it alleviates the problem since it’s running on spot instances
@Rupsha Chaudhuri thanks for diving into past conversations! So yeah, as mentioned this happens when Flyte attempts to check a Pod status and the Pod is missing. This means that some external system (typically resource manager) has deleted the Pod. Therefore, there is nothing that Flyte can do to figure out what happened and relaunching the task is our best effort.
Oh if it's using spot instances that explains a lot. the long running task is probably being evicted.
ya you need to enable
this is a flag in flytepropeller
Even with
it got evicted… will look into this
The job is still getting evicted 😞
@Rupsha Chaudhuri just to recap. you guys are running Flyte tasks on spot instances right?
how long running are the tasks?
so basically, there is no way for flyte to tell the spot instance that it must not evict the pods. spot instances give a warning (i.e. mark the pod as deleted), and if the pod controller does not clean up the resource within a predefined period, the spot instance force-deletes the pod. this is what you're seeing: flyte is looking for a pod that already started and can't find it. there are a few mechanisms within flyte that we use to combat this:
(1) injecting finalizers. this is a way to tell k8s not to delete the resource until the finalizer is removed. however, in the case of spot instances these are ignored and the pod is deleted anyway.
(2) system retries. the idea of using spot instances is typically to reduce the cost of compute. a simple mechanism to enable tasks here is to retry many times, which is configurable. so if you have system retries set to something like 50, flyte can retry 50 times. additionally, there is configuration for an 'interruptibleThreshold'. this means that when using interruptible you can mark the first N retries as 'interruptible', and once it exceeds N retries the remainder are no longer marked as 'interruptible'. as you mentioned, this will likely not work either, as the spot instance is deleting the pods.
(3) intra-task checkpointing. if the task is running some iterative operation, intra-task checkpointing allows the task to periodically save its state. so even if it is interrupted, the next attempt picks up at some mid-point rather than redoing all of the work.
afaik the only other fallback is to evaluate this workload on a non-spot instance.
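To make the 'interruptibleThreshold' idea in point (2) concrete, here is a minimal sketch of the decision as described in the message above. It is not FlytePropeller's implementation: the function names are hypothetical, and the exact boundary (whether attempt N itself is still interruptible) is an assumption.

```python
from typing import List


def is_interruptible(system_failures: int, threshold: int) -> bool:
    """Hypothetical helper: an attempt stays interruptible while the number
    of earlier system failures is below the threshold (assumed boundary)."""
    return system_failures < threshold


def plan_attempts(max_system_retries: int, threshold: int) -> List[str]:
    """List how each attempt would be scheduled: the initial run plus
    up to max_system_retries retries."""
    return [
        "interruptible" if is_interruptible(n, threshold) else "non-interruptible"
        for n in range(max_system_retries + 1)
    ]


if __name__ == "__main__":
    # With 3 system retries and a threshold of 2, the first two attempts
    # run as interruptible and the remaining two do not.
    for attempt, kind in enumerate(plan_attempts(3, 2)):
        print(attempt, kind)
```

The point of the threshold is that a task repeatedly evicted from spot capacity eventually gets scheduled as non-interruptible instead of burning through all of its retries.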
happy to hop on a call and discuss options if you want!
System retries: I do have those already. Problem is it’s a task that triggers a job running somewhere else and monitors it. Now every time the task restarts, it retriggers the job. For now I’ve gotten around it by splitting the triggering and the monitoring to completion into 2 separate steps.
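The split described above could look roughly like the sketch below. It uses plain Python functions; in Flyte each one would be its own task, so the job URL returned by the first step is stored as a task output and a retry of the second step does not retrigger the job. The callables `start` and `get_status`, the status strings, and the example URL are all hypothetical placeholders, not a real Jenkins client.

```python
import time
from typing import Callable


def trigger_job(start: Callable[[], str]) -> str:
    """Step 1: start the external job exactly once and return its URL."""
    return start()


def monitor_job(
    job_url: str,
    get_status: Callable[[str], str],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Step 2: poll until the job reaches a terminal state.

    This step only reads status, so re-running it after an eviction
    is safe: it never starts a new job.
    """
    while True:
        status = get_status(job_url)
        if status in ("SUCCESS", "FAILURE"):  # assumed terminal states
            return status
        sleep(poll_seconds)


if __name__ == "__main__":
    # Fake external system, for illustration only.
    statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
    url = trigger_job(lambda: "http://jenkins.example/job/1")
    print(monitor_job(url, lambda _url: next(statuses), poll_seconds=0))
```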
intra task checkpointing may be a good approach for me instead of splitting the task into 2 separate tasks. I can try incorporating it into my code
it's a task that triggers a job running somewhere else and monitors it
sounds like Flyte is just missing a plugin here? so you're using spot instances because they are not actually doing any work, just starting and monitoring an external job?
Flyte is triggering a Jenkins job in this case
and the orchestration and some data manipulation is what resides in Flyte
So I tried using the checkpoint and ran into this error
Expected exactly one checkpoint - found 0
What I did: try reading the checkpoint; if it’s empty, trigger the Jenkins job and save the URL of the Jenkins job (checkpoint.write()); then wait for the job to finish.
Question: Once a checkpoint is read, can the same data be “re-read”?
Is the Jenkins job a webAPI call? It really sounds like implementing a backend plugin for Jenkins is the correct way to handle this scenario. Essentially, right now we're starting a k8s pod to start a jenkins job, and then using the pod to monitor the jenkins job. With a backend plugin this could all be managed by FlytePropeller and the eviction of pods on spot instances here would not be an issue.
Also the intra-task checkpointing functionality can be a little unintuitive sometimes. Basically, each attempt of the task is given the previous attempt's checkpoint path to read from, and writes to a new path of its own. So the first attempt should write the external job id to the checkpoint, and each subsequent attempt should read from the old checkpoint and write to the new one immediately. That way the id will be available to later attempts. This functionality is slated to be cleaned up in this issue. However, I will reiterate that this is pretty hacky; a plugin would be a better path.
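The read-then-write-back pattern described above can be sketched as follows. To keep the example self-contained it uses an in-memory stand-in (`InMemoryCheckpoint`, hypothetical) instead of the real object, which in flytekit is obtained via `flytekit.current_context().checkpoint`. The sketch assumes `read()` returns `None` when nothing was written before; the error reported earlier in the thread suggests that in some setups the read raises instead, in which case that call would need a guard. `get_or_trigger_job` and `trigger` are illustrative names.

```python
from typing import Callable, Optional


class InMemoryCheckpoint:
    """Stand-in for one attempt's checkpoint: reads what the previous
    attempt wrote, and writes to its own separate slot."""

    def __init__(self, previous: Optional[bytes] = None) -> None:
        self._previous = previous
        self.written: Optional[bytes] = None

    def read(self) -> Optional[bytes]:
        return self._previous

    def write(self, data: bytes) -> None:
        self.written = data


def get_or_trigger_job(cp: InMemoryCheckpoint, trigger: Callable[[], str]) -> str:
    """Return the external job URL, triggering the job only on the first attempt."""
    previous = cp.read()
    if previous:
        # Carry the value forward immediately so the next attempt can
        # still find it if this attempt is evicted too.
        cp.write(previous)
        return previous.decode()
    job_url = trigger()
    cp.write(job_url.encode())
    return job_url


if __name__ == "__main__":
    first = InMemoryCheckpoint()
    print(get_or_trigger_job(first, lambda: "http://jenkins.example/job/42"))
    # Simulate a retry: the new attempt sees what the previous one wrote.
    second = InMemoryCheckpoint(previous=first.written)
    print(get_or_trigger_job(second, lambda: "should-not-be-called"))
```

Writing the value back before doing anything else is what makes the data "re-readable": each attempt only sees the checkpoint of the attempt directly before it.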
Great.. that’s what I figured that I need to read it and write it back
and yes.. triggering jenkins is a web api call
I suppose I can look into writing the Jenkins plugin
@Rupsha Chaudhuri / @Dan Rammer (hamersaw) can we meet tomorrow? I want to see if there is a faster route
for now the checkpoint method works.. but yes definitely inefficient. We can meet tomorrow