# announcements
Harshit Sharma
Hi everyone, I am trying to update the node group instance type through env.yaml while deploying with Opta. Currently it's at the default t3.medium. However, adding node_instance_type with the desired instance type seems to have no effect. I am adding this under - type: k8s-cluster. Do I need to create a separate node group using opta env to achieve this?
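For context, the override being described would look roughly like this in env.yaml. This is a hypothetical sketch: every value other than the `node_instance_type` key under `- type: k8s-cluster` is a placeholder, and the exact layout depends on your Opta version.

```yaml
# Hypothetical env.yaml sketch -- verify field names against your Opta version.
name: staging              # placeholder environment name
org_name: my-org           # placeholder
providers:
  aws:
    region: us-east-1      # placeholder
    account_id: "123456789012"  # placeholder
modules:
  - type: k8s-cluster
    node_instance_type: t3.xlarge   # overrides the t3.medium default
```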
Samhita Alla
@Yuvraj, can you help?
Yuvraj
@Harshit Sharma Yes, that is correct. Can you post the output of `opta apply`?
FYI, Opta is deprecated and no longer supported; we don't recommend Opta for production deployments.
Harshit Sharma
Hi @Yuvraj, I was able to resolve the issue. Thanks for the information though.
@Yuvraj We are running into a new issue though, where your help would be greatly appreciated. We were able to change the instance type for our operations. We have a workflow X with 4 tasks to be completed. Task 1 basically fetches a CSV from S3 and converts it to a list. However, when we run the workflow, the task keeps running indefinitely in the console while the pod/container status changes to complete. Also, it does not move on to the next task and remains stuck. This situation occurs when we load a 25 MB CSV into a list in Python. The workflow works perfectly for a sample CSV of 5 MB. Any reason why the system would behave differently while passing different-sized data between tasks?
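For reference, a minimal sketch of what such a CSV-to-list step might look like, using only the standard library. The S3 fetch is elided, and the function name and sample columns are hypothetical, not from the thread's actual code.

```python
import csv
import io

def csv_to_rows(csv_text: str) -> list:
    """Parse CSV text into a list of rows (the S3 download step is elided).

    Note: a large list like this is serialized and passed between Flyte
    tasks by value, which is heavier for multi-MB payloads than passing a
    DataFrame, as the thread's later workaround does.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [row for row in reader]

# Usage with a tiny inline CSV (hypothetical columns):
sample = "domain,count\nexample.com,3\nexample.org,7\n"
rows = csv_to_rows(sample)
```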
Yuvraj
I don’t think data size matters. Can you post your example here? cc: @Samhita Alla
Harshit Sharma
Sure, let me get a screenshot of the code. Let me know if you need anything else.
Samhita Alla
@Harshit Sharma, can you set resource requests on your first task just in case it’s getting stuck due to resource constraints?
```python
from flytekit import Resources, task

@task(requests=Resources(cpu="2", mem="1000Mi"))
def fetch_domain_list_df(...):
    ...
```
Harshit Sharma
@Samhita Alla The pod being generated to undertake the task shows a requested resource of 7000Mi already, which happens to also be the limit for task resources.
cc: @Ekku Jokinen
Samhita Alla
Can you also check the pod logs, even if the pod reports success?
Harshit Sharma
Yes, that's showing success. For now, we were able to work around this by using the pandas DataFrame schema provided by Flyte and combining tasks 1 and 2.
Thanks for the discussion. Really helped us explore new options.
Hi @Samhita Alla @Yuvraj, we are running into another issue. We are trying to run a task with a large input list (~1 million items) using a map task and splitting. One of the pods in the task split goes down with a random error saying "[8]: code:"UnexpectedObjectDeletion" message:"object [searchco-data-pipeline-development/ak9ht2qn2wtrsrznc2nt-n1-0-8] terminated in the background, manually". However, when we run it with a separate smaller list (~200K) it seems to work fine. Any reason why the pod would terminate randomly? How can we avoid this?
Samhita Alla
@Dan Rammer (hamersaw), can you help? @Harshit Sharma, can you share with us the full pod log?
Harshit Sharma
From the event logs, it seems like the pod is getting terminated due to preemption.
@Samhita Alla is there a way to disable preemption for task pods when deploying with flyte?
Dan Rammer (hamersaw)
@Harshit Sharma I think there are a few things we can work with here: (1) Are you using a map task with a list of 1 million inputs? That would result in 1 million pods being executed, which is beyond the capacity of k8s. Are you using some sort of batching mechanism? (2) Flyte has an `inject-finalizer` configuration option which uses k8s finalizers to stop external systems from deleting the pods, but some external systems do not respect this scheme and delete them anyway. So this is a "best effort" by Flyte to keep pods around; if they are still being deleted, then unfortunately there is not much we can do to stop it. A few ideas here: look at the k8s logs to determine what is deleting the pods (typically it's some sort of resource manager), and reduce the concurrency configuration on the map task so fewer pods are spawned. (3) The issue above (2788) has a submitted fix, but it resulted in a panic on externally deleted pods during interruptible failures. Are you designating the task as "interruptible"?
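For reference, the `inject-finalizer` option mentioned above lives in FlytePropeller's configuration. A rough sketch of where it might go, assuming a standard deployment; the exact file and nesting vary by Flyte/Helm chart version, so treat this as an assumption to verify against your setup:

```yaml
# FlytePropeller config (e.g. the flyte-propeller-config ConfigMap).
# Exact nesting is version-dependent -- verify against your deployment.
propeller:
  inject-finalizer: true   # ask k8s to keep task pods until Flyte removes the finalizer
```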
Harshit Sharma
We are actually passing split lists as input, not individual elements, so the number of map subtasks matches the number of splits.
Dan Rammer (hamersaw)
Oh OK, do you know how many splits there are?
Harshit Sharma
Yeah, actually we are splitting them into 15 parts.
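The splitting described above can be sketched as a plain helper. The function name `split_into_parts` is hypothetical, not from the thread's actual code:

```python
def split_into_parts(items: list, n_parts: int) -> list:
    """Split a list into n_parts roughly equal chunks for a map task.

    Each chunk becomes one map-task input, so n_parts controls how many
    pods are spawned (15 in the thread above) instead of one per element.
    """
    size, rem = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # the first `rem` chunks absorb one extra element each
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```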
Samhita Alla
@Harshit Sharma, have you been able to resolve the issue?
Harshit Sharma
@Samhita Alla, we will be trying out the solution soon.
@Dan Rammer (hamersaw) which file and section do we add `inject-finalizer` to?