  • Stephen

    6 months ago
    Hey, quick question: how would you add a nodeSelector for a specific workflow/task? I’m not sure if taints are enough to trigger our cluster-autoscaler. I wanted to edit the pods on the fly but I can’t 😕
  • Ketan (kumare3)

    6 months ago
    Ohh, you mean tolerations or default node selectors?
    Cc @jeev do you guys do something similar?
  • Stephen

    6 months ago
    Default node selector. We tried tolerations, but we’re not sure they’re triggering the autoscaler, so we wanna try nodeSelector.
  • Ketan (kumare3)

    6 months ago
    Don't love it, but the only other option is to use a pod task.
  • Stephen

    6 months ago
    So I can’t do it for now, correct?
    Also, I have a question about taints. We have the same config for Propeller as the one defined at https://github.com/flyteorg/flytepropeller/blob/master/propeller-config.yaml#L51-L56. We defined our taints in Terraform as
    taints = {
      dedicated = {
        key    = "flyte/gpu"
        value  = "dedicated"
        effect = "NO_SCHEDULE"
      }
    }
    but I’m not sure if it’s correct or it should be
    taints = {
      "nvidia.com/gpu" = {
        key    = "flyte/gpu"
        value  = "dedicated"
        effect = "NO_SCHEDULE"
      }
    }
    🤔 Terraform plan has the same output
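For reference, a minimal sketch of how a taint like the one above pairs with a pod toleration. The EKS node-group API spells taint effects in upper snake case (`NO_SCHEDULE`), while Kubernetes pod specs use camel case (`NoSchedule`). The helper name `toleration_for` is hypothetical; the key and value are the ones quoted in this thread.

```python
# Map EKS node-group taint effects to the spellings Kubernetes uses in pod specs.
EKS_TO_K8S_EFFECT = {
    "NO_SCHEDULE": "NoSchedule",
    "NO_EXECUTE": "NoExecute",
    "PREFER_NO_SCHEDULE": "PreferNoSchedule",
}


def toleration_for(taint: dict) -> dict:
    """Build a pod toleration that matches one EKS-style taint exactly.

    Hypothetical helper for illustration; not part of Flyte or Terraform.
    """
    return {
        "key": taint["key"],
        "operator": "Equal",
        "value": taint["value"],
        "effect": EKS_TO_K8S_EFFECT[taint["effect"]],
    }


# Taint values taken from the Terraform snippet in this thread.
gpu_taint = {"key": "flyte/gpu", "value": "dedicated", "effect": "NO_SCHEDULE"}
print(toleration_for(gpu_taint))
```

A pod only tolerates the taint when key, value, and effect all line up, so comparing this output against the tolerations Propeller injects is one way to rule out a spelling mismatch.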
  • Ketan (kumare3)

    6 months ago
    Cc @Haytham Abuelfutuh
  • Haytham Abuelfutuh

    6 months ago
    The propeller config looks correct. Which cluster autoscaler do you use?
  • Stephen

    5 months ago
    FYI, using a nodeSelector does trigger a scale-up on a dummy deployment:
    Events:
      Type     Reason            Age                From                Message
      ----     ------            ----               ----                -------
      Warning  FailedScheduling  27s (x2 over 28s)  default-scheduler   0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector.
      Normal   TriggeredScaleUp  27s                cluster-autoscaler  pod triggered scale-up: [{eks-nodes-g4dn-xlarge-9cbf749a-540f-373d-21cb-800bbc70bea0 0->1 (max: 1)}]
    NodeSelector:
    nodeSelector:
      eks.amazonaws.com/nodegroup: nodes-name-GPU
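A rough sketch of what such a dummy deployment could look like, built as a plain dictionary and printed as JSON. The deployment name, the placeholder image, and the included toleration are assumptions for illustration; only the nodeSelector label and the taint key/value come from this thread.

```python
import json


def dummy_deployment(name: str, node_selector: dict, tolerations: list) -> dict:
    """Return an apps/v1 Deployment manifest pinned to nodes matching node_selector."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "nodeSelector": node_selector,
                    "tolerations": tolerations,
                    "containers": [
                        # Placeholder container; the image is an assumption.
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                    ],
                },
            },
        },
    }


manifest = dummy_deployment(
    name="gpu-scale-test",  # hypothetical name
    node_selector={"eks.amazonaws.com/nodegroup": "nodes-name-GPU"},
    # Assumed toleration, only needed if the node group carries the flyte/gpu taint.
    tolerations=[
        {"key": "flyte/gpu", "operator": "Equal", "value": "dedicated", "effect": "NoSchedule"}
    ],
)
print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f -` accepts JSON as well as YAML, the printed manifest can be piped straight in; the pod should then sit in Pending until the autoscaler adds a node to that group.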
  • jeev

    5 months ago
    @Stephen: is this resolved now?
    Tolerations should suffice to scale up a node pool, assuming all other requirements match. Is it scaling up from 0 by any chance?
  • Stephen

    5 months ago
    Not yet, but we are still investigating. We are looking at this issue and trying to fix a couple of things: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1558
  • Haytham Abuelfutuh

    5 months ago
    @Stephen just got the chance to try out tolerations with gpu nodes and I see them scaling up and down just fine… have you managed to get to the root cause here?
  • Stephen

    5 months ago
    Hey, sorry I forgot to reply. I stopped working on that for now because I had to focus on something else, but I should come back to it soon(ish). I realised we also had to install the various Nvidia drivers etc.; I naively assumed that we didn’t need to do that 😅