# announcements
s
Hey quick question, how would you add a `nodeSelector` for a specific workflow/task? I’m not sure if taints are enough to trigger our cluster-autoscaler. I wanted to edit the pods on the fly but I can’t 😕
k
Ohh you mean tolerations or default node selectors?
Cc @jeev do you guys do something similar
s
Default node selector, we tried tolerations but we’re not sure that it’s triggering the autoscaler so we wanna try nodeSelector
k
Don't love it, but the only other option is to use a pod task
s
So I can’t do it for now correct?
Also I have a question about taints. Our Propeller config is the same as the one defined at https://github.com/flyteorg/flytepropeller/blob/master/propeller-config.yaml#L51-L56 and we defined our taints in Terraform as
taints = {
        dedicated = {
          key    = "flyte/gpu"
          value  = "dedicated"
          effect = "NO_SCHEDULE"
        }
      }
but I’m not sure if it’s correct or it should be
taints = {
        "nvidia.com/gpu" = {
          key    = "flyte/gpu"
          value  = "dedicated"
          effect = "NO_SCHEDULE"
        }
      }
🤔 Terraform plan has the same output
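For reference, a sketch of the toleration that would match the taint above, written in the shape of the `resource-tolerations` block that the linked Propeller config appears to use (the exact layout is an assumption, so check it against your own file). A Kubernetes taint only carries `key`, `value` and `effect`, so the toleration matches on those and not on the name of the Terraform map entry. The EKS effect `NO_SCHEDULE` corresponds to `NoSchedule` in Kubernetes.

```yaml
# Assumed layout of the Propeller k8s plugin config; verify against your deployment.
plugins:
  k8s:
    resource-tolerations:
      # Applied to pods that request this resource
      nvidia.com/gpu:
        - key: "flyte/gpu"       # same key as in the Terraform taint
          operator: "Equal"
          value: "dedicated"     # same value as in the Terraform taint
          effect: "NoSchedule"   # Kubernetes spelling of NO_SCHEDULE
```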
k
Cc @Haytham Abuelfutuh
h
The propeller config looks correct.. Which cluster autoscaler do you use?
s
FYI, using a NodeSelector does trigger a scale-up on a dummy deployment
Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Warning  FailedScheduling  27s (x2 over 28s)  default-scheduler   0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector.
  Normal   TriggeredScaleUp  27s                cluster-autoscaler  pod triggered scale-up: [{eks-nodes-g4dn-xlarge-9cbf749a-540f-373d-21cb-800bbc70bea0 0->1 (max: 1)}]
NodeSelector:
nodeSelector:
    eks.amazonaws.com/nodegroup: nodes-name-GPU
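A minimal sketch of a standalone test pod that combines this node selector with a toleration for the `flyte/gpu` taint from the Terraform snippet earlier in the thread. The pod name and the image are placeholders chosen for illustration, not something from our setup. Once the node group is tainted, the selector alone is not enough for the pod to land on the node, so the toleration is included as well.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-selector-test            # hypothetical name
spec:
  # Pin the pod to the GPU node group by its EKS node group label
  nodeSelector:
    eks.amazonaws.com/nodegroup: nodes-name-GPU
  # Tolerate the taint flyte/gpu=dedicated:NoSchedule on those nodes
  tolerations:
    - key: "flyte/gpu"
      operator: "Equal"
      value: "dedicated"
      effect: "NoSchedule"
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9   # placeholder image (assumption)
```

If the pod stays pending, `kubectl describe pod gpu-selector-test` should show events similar to the ones above, which tells you whether the autoscaler reacted.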
j
@Stephen: is this resolved now?
tolerations should suffice to scale up a node pool assuming all other requirements match. is it scaling up from 0 by any chance?
s
Not yet, but we are still investigating. We are checking this issue and trying to fix a couple of things: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1558
h
@Stephen just got the chance to try out tolerations with gpu nodes and I see them scaling up and down just fine… have you managed to get to the root cause here?
s
Hey, sorry I forgot to reply. I stopped working on that for now because I had to focus on something else, but I should come back to it soon(ish). I realised we also had to install the Nvidia drivers etc.; I naively assumed that we didn’t need to do that 😅