Hi, when is Flyte deleting Pods? I just had a fail...
# ask-the-community
f
Hi, when does Flyte delete Pods? I just had a task fail (image pull backoff, because some pull credentials were missing in that namespace), but the Pod was not removed. Once I added the pull credentials, my colleague relaunched the task, and then there were two Pods running: the "old" one, which was marked as failed in Flyte, and the newly relaunched one. I had to manually remove the Pod that initially failed, because in k8s it started running once the pull secret was there...
I guess I'll open an issue. In the flytepropeller log I can see:
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393559","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  0 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:26:57Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393628","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  1 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:02Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393690","routine":"worker-0","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution {resource_type:TASK project:\"cadmatch-training\" domain:\"development\" name:\"train_aae\" version:\"2fbd809\"  node_id:\"trainaae\" execution_id:\u003cproject:\"cadmatch-training\" domain:\"development\" name:\"af89l2cwzfjfnjpbxqfh\" \u003e  2 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!","ts":"2023-01-17T14:27:08Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","node":"trainaae","ns":"cadmatch-training-development","res_ver":"44393750","routine":"worker-1","wf":"cadmatch-training:development:.flytegen.train_aae"},"level":"warning","msg":"Trying to abort a node in state [Failed]","ts":"2023-01-17T14:27:14Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-0"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:27:48Z"}
{"json":{"exec_id":"af89l2cwzfjfnjpbxqfh","ns":"cadmatch-training-development","routine":"worker-1"},"level":"warning","msg":"Workflow namespace[cadmatch-training-development]/name[af89l2cwzfjfnjpbxqfh] has already been terminated.","ts":"2023-01-17T14:28:02Z"}
b
Just joined this Slack because of this problem. ^_^ A wrong image name caused the pod to get stuck: Flyte marked it as failed, but it still held its resource requests, so nothing else could run.
Manually removing the pods resolved the issue.
(as in your case)
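A minimal sketch of the manual cleanup described above, in case it helps someone hitting the same thing. It assumes you have saved or piped the output of `kubectl get pods -n <namespace> -o json` and only want the names of Pods whose containers are waiting on an image pull. The function name `stuck_image_pull_pods` and the set of reasons are illustrative choices, not anything Flyte provides.

```python
import json

# Waiting reasons Kubernetes reports when an image cannot be pulled.
STUCK_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def stuck_image_pull_pods(pod_list):
    """Return the names of Pods that have a container waiting on an image pull.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`
    (a dict with an "items" list). Hypothetical helper, for illustration only.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container_status in statuses:
            waiting = container_status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in STUCK_REASONS:
                stuck.append(pod["metadata"]["name"])
                break  # one stuck container is enough to flag the Pod
    return stuck


def load_pod_list(path):
    """Load a saved `kubectl get pods -o json` dump from disk."""
    with open(path) as handle:
        return json.load(handle)
```

Each returned name can then be removed with `kubectl delete pod <name> -n <namespace>`. Check the list before deleting, since a freshly relaunched Pod can briefly show the same waiting reason.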
f
Yes, that is another aspect of the problem that I encountered before but did not fully grok
b
Totally new to Flyte, so it took me some time as well, but great that you created a ticket - I filled in my info as well 🙂
f
Yeah, I'm fairly new to Flyte as well and stumbled across a few issues... Thankfully @Eduardo Apolinario (eapolinario) and the team are fairly responsive, so I'm hoping we can get these things fixed 🤞 and run this in prod
d
Thanks for looking into this, both of you! From the log in the issue it seems that Flyte is attempting to abort the task (which would delete the Pod), but it is unable to because Flyte has already marked the state as terminal. This should be a relatively quick fix (I hope 🙏). Since it affects the correctness of the system, it is high priority; I will attempt to get to this in the next few days.
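To make that failure mode concrete, here is a toy model of it. This is not Flyte's code: the `Phase` enum, `record_phase`, and `naive_abort` are hypothetical names, and the only thing taken from the log is the shape of the error ("invalid phase change from FAILED to ABORTED"). It shows how an abort that records its event before cleaning up can leave the Pod behind when the task is already terminal.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"


# Phases from which no further transition is accepted (assumed for this sketch).
TERMINAL = {Phase.SUCCEEDED, Phase.FAILED, Phase.ABORTED}


class InvalidPhaseChange(Exception):
    """Raised when an event would move a task out of a terminal phase."""


def record_phase(current, new):
    """Accept `new` unless `current` is already a different terminal phase."""
    if current in TERMINAL and new != current:
        raise InvalidPhaseChange(
            f"invalid phase change from {current.value} to {new.value}"
        )
    return new


def naive_abort(current, delete_pod):
    """Toy abort: record the ABORTED event first, then delete the Pod.

    If recording raises, `delete_pod` is never called and the Pod leaks.
    """
    new_phase = record_phase(current, Phase.ABORTED)
    delete_pod()
    return new_phase
```

In this model, calling `naive_abort(Phase.FAILED, ...)` raises before the delete callback runs, which matches the symptom in this thread: the task is FAILED in Flyte while the Pod stays in the cluster.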