salmon-refrigerator-32115
02/20/2024, 6:44 PM
Warning FailedScheduling 40s default-scheduler 0/18 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 2 node(s) had untolerated taint {node-group: istio-ingress}, 5 node(s) had untolerated taint {CriticalAddonsOnly: true}, 8 node(s) had untolerated taint {node-group: mem-intense}. preemption: 0/18 nodes are available: 15 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
My config:
task_resource_defaults:
  # -- Task default resources parameters
  task_resources:
    defaults:
      cpu: 2
      memory: 1Gi
      storage: 1Gi
    limits:
      cpu: 20
      memory: 500Gi
      storage: 100Gi
      gpu: 1
....
cluster_resources:
  refreshInterval: 5m
  customData:
    - development:
        - projectQuotaCpu:
            value: "100"
        - projectQuotaMemory:
            value: "1800Gi"
And my task’s resources:
@task(
    requests=Resources(cpu="5", mem="50Gi"),
)
average-finland-92144
02/20/2024, 9:45 PM
1. What tag are you using for your custom image? If it's set to latest or empty, the imagePullPolicy will be set to Always (see the sketch below).
2. K8s scheduler errors. So, considering that a good part of your node groups seems to be tainted, are you setting up the corresponding tolerations on the pods?
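For 1., a minimal sketch of pinning the tag on the task itself (the registry, image name, and tag below are placeholders, not your actual image):

from flytekit import Resources, task

# Pinning a concrete tag (anything other than "latest" or empty) keeps
# Kubernetes from defaulting imagePullPolicy to Always.
@task(
    container_image="ghcr.io/myorg/flyte-tasks:v1.2.3",  # placeholder image reference
    requests=Resources(cpu="5", mem="50Gi"),
)
def my_task() -> None:
    ...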
salmon-refrigerator-32115
02/20/2024, 9:55 PM
average-finland-92144
02/21/2024, 4:10 PM
salmon-refrigerator-32115
02/21/2024, 6:18 PM
salmon-refrigerator-32115
02/21/2024, 6:21 PM
k8s:
  plugins:
    # -- Configuration section for all K8s specific plugins [Configuration structure](https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config)
    k8s:
      default-env-vars: []
      # DEFAULT_ENV_VAR: VALUE
      default-cpus: 100m
      default-memory: 100Mi
average-finland-92144
02/21/2024, 6:33 PM
plugins:
  k8s:
    resource-tolerations:
      - key: "nodetype"
        operator: "Equal"
        value: "Standard_B8ms"
        effect: "NoExecute"
The map has to match what your nodes have configured. This is a platform-wide config.
Nevertheless, I think I remember you have different taints throughout your node groups, so you'd probably need PodTemplates (see the sketch below).
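A minimal sketch of that approach, assuming the default-pod-template-name option in the k8s plugin config; the template name, namespace, and taint effect below are placeholders you'd have to adjust, using the node-group: mem-intense taint from your scheduler event as the example:

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-default-template          # placeholder name
  namespace: flyte                      # placeholder; create it where your task pods run
template:
  metadata: {}
  spec:
    containers:
      - name: default                   # Flyte uses a container named "default" as the base for task containers
        image: docker.io/rwgrim/docker-noop   # placeholder; never actually executed
    tolerations:
      - key: "node-group"
        operator: "Equal"
        value: "mem-intense"
        effect: "NoSchedule"            # assumed effect; must match the taint on those nodes

Then point the platform config at it:

plugins:
  k8s:
    default-pod-template-name: flyte-default-template

If you only want the toleration on specific tasks rather than platform-wide, flytekit also accepts a pod_template argument on @task, which would let you attach it per task instead.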
average-finland-92144
02/21/2024, 6:34 PM