# announcements
Harshit Sharma
Hi everyone, I am trying to update the node group instance type through env.yaml while deploying with Opta. Currently it's at the default t3.medium. However, adding node_instance_type with the desired instance type seems to have no effect. I am adding this under - type: k8s-cluster. Do I need to create a separate node group using opta env to achieve this?
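For context, the override being described would look roughly like this in env.yaml. This is a hypothetical sketch: every value other than the `node_instance_type` key under `- type: k8s-cluster` is a placeholder, and the exact layout depends on your Opta version.

```yaml
# Hypothetical env.yaml sketch -- verify field names against your Opta version.
name: staging              # placeholder environment name
org_name: my-org           # placeholder
providers:
  aws:
    region: us-east-1      # placeholder
    account_id: "123456789012"  # placeholder
modules:
  - type: k8s-cluster
    node_instance_type: t3.xlarge   # overrides the t3.medium default
```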
Samhita Alla
@Yuvraj, can you help?
Yuvraj
@Harshit Sharma Yes, that is correct. Can you post the output of `opta apply`?
FYI, Opta is deprecated and no longer supported; we don't recommend Opta for production deployments.
Harshit Sharma
Hi @Yuvraj, I was able to resolve the issue. Thanks for the information though.
@Yuvraj We are running into a new issue though, where your help would be greatly appreciated. We were able to change the instance type for our operations. We have a workflow X with 4 tasks to be completed. Task 1 basically fetches a CSV from S3 and converts it to a list. However, when we run the workflow, the task keeps running indefinitely in the console while the pod/container status changes to complete. Also, it does not move on to the next task and remains stuck. This situation occurs when we load a 25 MB CSV into a list in Python. The workflow works perfectly for a sample CSV of 5 MB. Any reason why the system would behave differently while passing different-sized data between tasks?
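For reference, a minimal sketch of what such a CSV-to-list step might look like, using only the standard library. The S3 fetch is elided, and the function name and sample columns are hypothetical, not from the thread's actual code.

```python
import csv
import io

def csv_to_rows(csv_text: str) -> list:
    """Parse CSV text into a list of rows (the S3 download step is elided).

    Note: a large list like this is serialized and passed between Flyte
    tasks by value, which is heavier for multi-MB payloads than passing a
    DataFrame, as the thread's later workaround does.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [row for row in reader]

# Usage with a tiny inline CSV (hypothetical columns):
sample = "domain,count\nexample.com,3\nexample.org,7\n"
rows = csv_to_rows(sample)
```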
Yuvraj
I don’t think data size matters. Can you post your example here? cc: @Samhita Alla
Harshit Sharma
Sure, let me get a screenshot of the code. Let me know if you need anything else.
Samhita Alla
@Harshit Sharma, can you set resource requests on your first task just in case it’s getting stuck due to resource constraints?
```python
from flytekit import Resources, task

@task(requests=Resources(cpu="2", mem="1000Mi"))
def fetch_domain_list_df(...):
    ...
```
Harshit Sharma
@Samhita Alla The pod being generated to undertake the task shows a requested resource of 7000Mi already, which happens to also be the limit for task resources.
cc: @Ekku Jokinen
Samhita Alla
Can you also check the pod logs, even if the pod reports success?
Harshit Sharma
Yes, that's showing success. For now, we were able to work around this by using the pandas DataFrame schema provided by Flyte and combining tasks 1 and 2.
Thanks for the discussion. Really helped us explore new options.
Hi @Samhita Alla @Yuvraj, we are running into another issue. We are trying to run a task with a large input list (~1 million items) using a map task and splitting. One of the pods in the task split goes down with a random error saying "[8]: code:"UnexpectedObjectDeletion" message:"object [searchco-data-pipeline-development/ak9ht2qn2wtrsrznc2nt-n1-0-8] terminated in the background, manually". However, when we run it with a separate smaller list (~200K) it seems to work fine. Any reason why the pod would terminate randomly? How can we avoid this?
Samhita Alla
@Dan Rammer (hamersaw), can you help? @Harshit Sharma, can you share with us the full pod log?
Harshit Sharma
From the event logs, it seems like the pod is getting terminated due to preemption.
@Samhita Alla is there a way to disable preemption for task pods when deploying with flyte?
Dan Rammer (hamersaw)
@Harshit Sharma I think there are a few things we can work with here: (1) Are you using a map task with a list of 1 million inputs? That would result in 1 million pods being executed, which is beyond the capacity of k8s. Are you using some sort of batching mechanism? (2) Flyte has an `inject-finalizer` configuration option which uses k8s finalizers to stop external systems from deleting the pods, but some external systems do not respect this scheme and delete them anyway. So this is a "best effort" by Flyte to keep pods around; if they are still being deleted, then unfortunately there is not much we can do to stop it. A few ideas here: look at the k8s logs to determine what is deleting the pods (typically it's some sort of resource manager), and reduce the concurrency configuration on the map task so fewer pods are spawned. (3) The issue above (2788) has a submitted fix, but it resulted in a panic on externally deleted pods during interruptible failures. Are you designating the task as "interruptible"?
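For reference, the `inject-finalizer` option mentioned above lives in FlytePropeller's configuration. A rough sketch of where it might go, assuming a standard deployment; the exact file and nesting vary by Flyte/Helm chart version, so treat this as an assumption to verify against your setup:

```yaml
# FlytePropeller config (e.g. the flyte-propeller-config ConfigMap).
# Exact nesting is version-dependent -- verify against your deployment.
propeller:
  inject-finalizer: true   # ask k8s to keep task pods until Flyte removes the finalizer
```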
Harshit Sharma
We are actually passing split lists as input, not individual elements, so the number of map subtasks matches the number of splits.
Dan Rammer (hamersaw)
Oh OK, do you know how many splits there are?
Harshit Sharma
Yeah, actually we are splitting them into 15 parts.
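The splitting described above can be sketched as a plain helper. The function name `split_into_parts` is hypothetical, not from the thread's actual code:

```python
def split_into_parts(items: list, n_parts: int) -> list:
    """Split a list into n_parts roughly equal chunks for a map task.

    Each chunk becomes one map-task input, so n_parts controls how many
    pods are spawned (15 in the thread above) instead of one per element.
    """
    size, rem = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # the first `rem` chunks absorb one extra element each
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```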
Samhita Alla
@Harshit Sharma, have you been able to resolve the issue?
Harshit Sharma
@Samhita Alla, we will be trying out the solution soon.
@Dan Rammer (hamersaw) which file and section do we add `inject-finalizer` to?