Hello I just setup a Flyte server running in EKS for the fir Flyte #flyte-support

salmon-refrigerator-32115

02/20/2024, 6:44 PM

Hello, I just set up a Flyte server running in EKS for the first time. I can run a remote Flyte workflow fine. However, if I try to run another Flyte workflow at about the same time, the following message appears during pod initialization. The second workflow pod's initialization is then delayed, and my custom image is pulled again instead of being reported as already present, which causes more delay. Have you experienced this? What do you suggest I change? Thanks!

Warning  FailedScheduling  40s   default-scheduler  0/18 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 2 node(s) had untolerated taint {node-group: istio-ingress}, 5 node(s) had untolerated taint {CriticalAddonsOnly: true}, 8 node(s) had untolerated taint {node-group: mem-intense}. preemption: 0/18 nodes are available: 15 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.

My config:

task_resource_defaults:
              # -- Task default resources parameters
              task_resources:
                defaults:
                  cpu: 2
                  memory: 1Gi
                  storage: 1Gi
                limits:
                  cpu: 20
                  memory: 500Gi
                  storage: 100Gi
                  gpu: 1
....
              cluster_resources:
                refreshInterval: 5m
                customData:
                  - development:
                      - projectQuotaCpu:
                          value: "100"
                      - projectQuotaMemory:
                          value: "1800Gi"

And my task’s resources:

@task(
    requests=Resources(cpu="5", mem="50Gi"), 
)

average-finland-92144

02/20/2024, 9:45 PM

Hi Frank! So a number of issues here:
1. ImagePull with every execution: what tag are you using for your custom image? If set to latest or empty, the imagePullPolicy will be set to Always.
2. K8s scheduler errors: considering that a good part of your node group seems to be tainted, are you setting up the corresponding tolerations on the pods?
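The tag rule in point 1 can be sketched as a small function. This is only an illustration of how Kubernetes defaults imagePullPolicy when the field is left unset; the function name and the example image names are hypothetical, and whether Flyte sets the field explicitly for task pods is not covered here.

```python
def default_image_pull_policy(image: str) -> str:
    """Sketch of the policy Kubernetes picks when imagePullPolicy is unset."""
    # A digest reference (name@sha256:...) pins exact content, so a cached
    # copy is acceptable.
    if "@" in image:
        return "IfNotPresent"
    # Only the last path segment can carry a tag; a colon earlier in the
    # string belongs to a registry port such as localhost:5000.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "Always"  # tag omitted
    tag = last_segment.rsplit(":", 1)[1]
    return "Always" if tag == "latest" else "IfNotPresent"


# Hypothetical image references, for illustration only.
for ref in (
    "registry.example.com/team/app:1.2.3",
    "registry.example.com/team/app:latest",
    "localhost:5000/app",
):
    print(ref, "->", default_image_pull_policy(ref))
```

With a fixed, non-latest tag the default is IfNotPresent, so a repeated pull on a fresh node points to the image simply not being cached on that node yet.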

salmon-refrigerator-32115

02/20/2024, 9:55 PM

Hi @average-finland-92144, Image: 876262748715.dkr.ecr.us-east-1.amazonaws.com/mlforge/flyte:0.4.0-pr-70-5c334549 I think I know why. If a new workflow is assigned to an EC2 node that has previously run a workflow, it reuses the cached image instead of pulling it again.

average-finland-92144

02/21/2024, 4:10 PM

ok, that's an expected behavior I guess

salmon-refrigerator-32115

02/21/2024, 6:18 PM

Hi David, I still don’t know how to set up the tolerations after reading https://docs.flyte.org/en/latest/deployment/configuration/generated/flyteadmin_config.html#resource-tolerations-map-v1-resourcename-v1-toleration Could you share some example code or a repo? Thanks!

salmon-refrigerator-32115

02/21/2024, 6:21 PM

Right now I have:

k8s:
    plugins:
      # -- Configuration section for all K8s specific plugins [Configuration structure](<https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config>)
      k8s:
        default-env-vars: []
        #  DEFAULT_ENV_VAR: VALUE
        default-cpus: 100m
        default-memory: 100Mi

average-finland-92144

02/21/2024, 6:33 PM

I think it should be something like this example:

plugins:
  k8s:
    resource-tolerations:
      - key: "nodetype"
        operator: "Equal"
        value: "Standard_B8ms"
        effect: "NoExecute"

The map has to match what your nodes have configured. This is a platform-wide config. Nevertheless, I think I remember you have different taints throughout your node group, so you'd probably need PodTemplates.

average-finland-92144

02/21/2024, 6:34 PM

This is assuming your use case has to do with running tasks on different nodes in your node group depending on matching taints and tolerations
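To make the "untolerated taint" entries in the scheduler message concrete, here is a minimal sketch of how a toleration is matched against a node taint. It follows the Kubernetes matching rules as I understand them and is not the scheduler's actual code. The class and function names are hypothetical, and the NoSchedule effect is an assumption, because the event message in this thread shows only the taint keys and values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"  # assumed; the event message omits the effect


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # empty matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration covers this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def untolerated(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Taints on a node that none of the pod's tolerations cover."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# Taint keys and values taken from the scheduler message in this thread.
mem_intense = Taint("node-group", "mem-intense")
istio_ingress = Taint("node-group", "istio-ingress")

# A pod that tolerates only the mem-intense node group.
pod_tolerations = [
    Toleration(key="node-group", value="mem-intense", effect="NoSchedule")
]

print(untolerated([mem_intense], pod_tolerations))
print(untolerated([istio_ingress], pod_tolerations))
```

A node stays a scheduling candidate only when the untolerated list is empty, which is why a single platform-wide toleration cannot cover node groups that carry different taints.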
