Hello, I just setup a Flyte server running in EKS ...
# ask-the-community
f
Hello, I just set up a Flyte server running in EKS for the first time. I can run a remote Flyte workflow fine. However, if I try to run another Flyte workflow at about the same time, the following message appears during pod initialization. The second workflow pod’s initialization is then delayed, and my custom image is pulled again instead of being reported as already present, which causes more delay. Have you experienced this? What do you suggest I change? Thanks!
Warning  FailedScheduling  40s   default-scheduler  0/18 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 2 node(s) had untolerated taint {node-group: istio-ingress}, 5 node(s) had untolerated taint {CriticalAddonsOnly: true}, 8 node(s) had untolerated taint {node-group: mem-intense}. preemption: 0/18 nodes are available: 15 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
My config:
task_resource_defaults:
              # -- Task default resources parameters
              task_resources:
                defaults:
                  cpu: 2
                  memory: 1Gi
                  storage: 1Gi
                limits:
                  cpu: 20
                  memory: 500Gi
                  storage: 100Gi
                  gpu: 1
....
              cluster_resources:
                refreshInterval: 5m
                customData:
                  - development:
                      - projectQuotaCpu:
                          value: "100"
                      - projectQuotaMemory:
                          value: "1800Gi"
And my task’s resources:
@task(
    requests=Resources(cpu="5", mem="50Gi"), 
)
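For readers following along, the scheduler event can be tallied with a short script. This is only a sketch: the `EVENT` string is the first sentence of the message quoted above, `tally_reasons` is a hypothetical helper, and the reading that tainted nodes report only their taint (so the cpu/memory counts refer to the remaining nodes) is an assumption about how the scheduler reports filter failures.

```python
import re

# First sentence of the FailedScheduling event quoted above.
EVENT = (
    "0/18 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, "
    "2 node(s) had untolerated taint {node-group: istio-ingress}, "
    "5 node(s) had untolerated taint {CriticalAddonsOnly: true}, "
    "8 node(s) had untolerated taint {node-group: mem-intense}."
)


def tally_reasons(event: str) -> dict[str, int]:
    """Return {reason: node_count} parsed from a FailedScheduling message."""
    # Everything after the first ": " is the comma-separated reason list.
    _, _, reasons = event.partition(": ")
    counts = {}
    for part in reasons.rstrip(".").split(", "):
        match = re.fullmatch(r"(\d+) (.+)", part)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = tally_reasons(EVENT)
total_nodes = int(re.match(r"\d+/(\d+)", EVENT).group(1))
tainted = sum(n for reason, n in counts.items() if "untolerated taint" in reason)
untainted = total_nodes - tainted

print(f"nodes rejected because of taints: {tainted}")
print(f"nodes the pod could land on without tolerations: {untainted}")
```

Under that assumption, only 3 of the 18 nodes accept the pod without tolerations, and those are the ones reporting insufficient cpu or memory for the 5 CPU / 50Gi request while another workflow is running.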
d
Hi Frank! So a number of issues here:
1. ImagePull with every execution: what tag are you using for your custom image? If set to `latest` or empty, the `imagePullPolicy` will be set to `Always`.
2. K8s scheduler errors: considering that a good part of your node group seems to be tainted, are you setting up the corresponding tolerations on the pods?
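The tag rule in point 1 can be sketched in a few lines of Python. This mirrors the Kubernetes default that applies when a pod spec leaves `imagePullPolicy` unset; it assumes the task pod does not set the field explicitly, and `default_image_pull_policy` is a hypothetical helper, not part of Flyte or Kubernetes.

```python
def default_image_pull_policy(image: str) -> str:
    """Mimic the Kubernetes default used when imagePullPolicy is omitted."""
    # A digest reference (name@sha256:...) defaults to IfNotPresent.
    if "@" in image:
        return "IfNotPresent"
    # Only a ':' in the last path segment is a tag separator;
    # an earlier ':' belongs to a registry port such as host:5000.
    last_segment = image.rsplit("/", 1)[-1]
    _, _, tag = last_segment.partition(":")
    if not tag or tag == "latest":
        return "Always"
    return "IfNotPresent"


for image in ("myrepo/flyte", "myrepo/flyte:latest", "myrepo/flyte:0.4.0"):
    print(image, "->", default_image_pull_policy(image))
```

With a pinned tag, the image is pulled only on nodes that do not already have it cached.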
f
Hi @David Espejo (he/him), Image: 876262748715.dkr.ecr.us-east-1.amazonaws.com/mlforge/flyte:0.4.0-pr-70-5c334549
I think I know why. If a new workflow is assigned to an EC2 node that has previously run a workflow, it will not pull the image again but reuse the cached one.
d
ok, that's an expected behavior I guess
f
Hi David, I still don’t know how to set up the tolerations after reading https://docs.flyte.org/en/latest/deployment/configuration/generated/flyteadmin_config.html#resource-tolerations-map-v1-resourcename-v1-toleration Could you share some example code or a repo? Thanks!
Right now I have:
k8s:
    plugins:
      # -- Configuration section for all K8s specific plugins [Configuration structure](<https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config>)
      k8s:
        default-env-vars: []
        #  DEFAULT_ENV_VAR: VALUE
        default-cpus: 100m
        default-memory: 100Mi
d
I think it should be something like this example:
plugins:
  k8s:
    resource-tolerations:
      - key: "nodetype"
        operator: "Equal"
        value: "Standard_B8ms"
        effect: "NoExecute"
The map has to match what your nodes have configured. This is a platform-wide config. Nevertheless, I think I remember you have different taints throughout your node group, so you'd probably need PodTemplates.
This is assuming your use case involves running tasks on different nodes in your node group depending on matching taints and tolerations.
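To make "matching taints and tolerations" concrete, here is a minimal Python sketch of the matching rule Kubernetes applies. The `Taint` and `Toleration` classes and both helper functions are hypothetical stand-ins for the Kubernetes objects, and the `NoSchedule` effect on the `mem-intense` and `istio-ingress` taints is an assumption, since the scheduler message does not show the effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key (with Exists) matches every key
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def pod_fits_taints(tolerations: list, taints: list) -> bool:
    """A pod passes only if every scheduling-relevant taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


# Taints taken from the scheduler message; the effect is assumed.
mem_intense = Taint("node-group", "mem-intense", "NoSchedule")
istio_ingress = Taint("node-group", "istio-ingress", "NoSchedule")

# The example toleration from the config above does not match these taints.
example = Toleration("nodetype", "Equal", "Standard_B8ms", "NoExecute")
# A toleration written for the mem-intense node group.
for_mem_intense = Toleration("node-group", "Equal", "mem-intense", "NoSchedule")

print(pod_fits_taints([example], [mem_intense]))            # False
print(pod_fits_taints([for_mem_intense], [mem_intense]))    # True
print(pod_fits_taints([for_mem_intense], [istio_ingress]))  # False
```

The last line shows why a single platform-wide toleration may not be enough when node groups carry different taints: a toleration written for one group does nothing for the others. Note that a toleration only permits scheduling onto a tainted node; it does not force the pod to land there.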