Thread
#flyte-deployment
    Anna Cunningham

    5 months ago
    Hi! I’m trying to optimize the FlytePropeller configuration to avoid the worst-case scenario described in this doc. I’m still running into the issue: there are too many nodes running at once, so by the time the propeller gets around to checking whether a task finished, the pod has already completed and been cleaned up long ago. I added more workers, increased the kube-client-config qps, and reduced maxParallelism, but the problem is still happening. I was wondering if I could get some more advice?
    Ketan (kumare3)

    5 months ago
    the best way would be to set inject-finalizer or lower maxParallelism
    inject-finalizer will prevent K8s from claiming back the pod
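    For reference, inject-finalizer lives under the k8s plugin section of the FlytePropeller config. A minimal sketch (the exact nesting can differ depending on how you deploy, e.g. helm values vs. a raw configmap):

```yaml
plugins:
  k8s:
    # keep pods around until propeller has observed their final status;
    # propeller removes the finalizer once it records completion
    inject-finalizer: true
```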
    Anna Cunningham

    5 months ago
    thanks, I’ll look into inject-finalizer more! I set maxParallelism to 3 (from 25) and it did not help. am I right in my understanding that setting it lower would help, as it would limit the number of nodes per workflow the propeller launches?
    @Ketan (kumare3) thanks for the inject-finalizer suggestion, I tried it out and my original issue seems solved, but now I have a backlog of pods that are stuck in Terminating (for 4+ hours so far) and still have the finalizer on them. have you seen this before?
    Ketan (kumare3)

    5 months ago
    Hmm, that is odd, cc @Haytham Abuelfutuh / @Dan Rammer (hamersaw) do you guys know
    Would love to dive deeper
    Anna Cunningham

    5 months ago
    It actually looks like the pods do eventually terminate, it just takes a really long time. Some pods from yesterday have now been Terminating for 16+ hours, and almost 10,000 Terminating pods have accumulated in total. Any tips for how I could speed this up? Currently I have workers set to 1000.
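    A quick way to size a backlog like this. count_terminating is a hypothetical helper; it only filters on the STATUS column of kubectl output, so it can be sanity-checked on captured output (the namespace in the usage comment comes from the pod dump later in this thread):

```shell
# count_terminating: counts lines whose STATUS column
# (field 3 of `kubectl get pods` output) is "Terminating"
count_terminating() {
  awk '$3 == "Terminating"' | wc -l
}

# against a live cluster:
#   kubectl get pods -n sunflower-main --no-headers | count_terminating
printf 'pod-a 1/1 Running 0 5m\npod-b 0/1 Terminating 0 4h\n' | count_terminating
```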
    Dan Rammer (hamersaw)

    5 months ago
    Hi @Anna Cunningham, this is odd. Just so I'm sure that I understand: you were having issues where the pods were being cleaned up too quickly, you set the inject-finalizer configuration and that fixed it, only now the Flyte pods are taking a very long time to terminate? And the pods still have the "flyte" finalizer set on them?
    Anna Cunningham

    5 months ago
    correct!
    Dan Rammer (hamersaw)

    5 months ago
    Interesting, are there errors in any of the pod logs? did the tasks which launched the pods themselves fail?
    Also, what version of FlytePropeller are you using?
    Anna Cunningham

    5 months ago
    v0.16.19
    Dan Rammer (hamersaw)

    5 months ago
    Can you dump the FlytePropeller configuration? It sounds like the finalizer is doing its job of keeping the pod around - but FlytePropeller is supposed to remove it on completion, and obviously that is not happening for some reason.
    Also, is the deletion timestamp set on the pods?
    Anna Cunningham

    5 months ago
    what is the best way for me to dump the configuration?
    the deletion timestamp is set on the pods, here’s an example:
    apiVersion: v1
    items:
    - apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        creationTimestamp: "2022-04-19T07:30:49Z"
        deletionGracePeriodSeconds: 0
        deletionTimestamp: "2022-04-19T10:36:03Z"
        finalizers:
        - flyte/flytek8s
        labels:
          domain: main
          execution-id: f11qb62y-n1-0-dn0-0
          interruptible: "true"
          node-id: n2
          project: sunflower
          shard-key: "4"
          task-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
          workflow-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
        name: f11qb62y-n1-0-dn0-0-n2-0
        namespace: sunflower-main
        ownerReferences:
        - apiVersion: flyte.lyft.com/v1alpha1
          blockOwnerDeletion: true
          controller: true
          kind: flyteworkflow
          name: f11qb62y-n1-0-dn0-0
          uid: e0525cc8-17c2-463b-a5dd-342776c69bbd
        resourceVersion: "266395947"
        uid: 393c4d2b-42ad-4061-8c95-a5d4f6eabb67
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: cloud.google.com/gke-preemptible
                  operator: Exists
              weight: 1
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: k8s.freenome.net/node-role
                  operator: In
                  values:
                  - flyte-worker
                - key: cloud.google.com/gke-preemptible
                  operator: In
                  values:
                  - "true"
        containers:
        - args:
          - pyflyte-execute
          - --inputs
          - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/inputs.pb
          - --output-prefix
          - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/0
          - --raw-output-data-prefix
          - gs://freenome-orchid-staging-flyte-data/ak/f11qb62y-n1-0-dn0-0-n2-0
          - --resolver
          - flytekit.core.python_auto_container.default_task_resolver
          - --
          - task-module
          - sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow
          - task-name
          - align
          env:
          - name: FLYTE_INTERNAL_CONFIGURATION_PATH
            value: /usr/src/app/sunflower/workflows/config/workflows.config
          - name: FLYTE_INTERNAL_IMAGE
            value: gcr.io/freenome-build/ap/sunflower:20220414.2
          - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
            value: sunflower:main:sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.trim_fastqs_and_align_wf
          - name: FLYTE_INTERNAL_EXECUTION_ID
            value: f11qb62y-n1-0-dn0-0
          - name: FLYTE_INTERNAL_EXECUTION_PROJECT
            value: sunflower
          - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
            value: main
          - name: FLYTE_ATTEMPT_NUMBER
            value: "0"
          - name: FLYTE_INTERNAL_TASK_PROJECT
            value: sunflower
          - name: FLYTE_INTERNAL_TASK_DOMAIN
            value: main
          - name: FLYTE_INTERNAL_TASK_NAME
            value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
          - name: FLYTE_INTERNAL_TASK_VERSION
            value: "20220414.2"
          - name: FLYTE_INTERNAL_PROJECT
            value: sunflower
          - name: FLYTE_INTERNAL_DOMAIN
            value: main
          - name: FLYTE_INTERNAL_NAME
            value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
          - name: FLYTE_INTERNAL_VERSION
            value: "20220414.2"
          - name: SUNFLOWER_STATIC_DATA_GCS_PATH
            value: gs://freenome-orchid-staging-static-data
          image: gcr.io/freenome-build/ap/sunflower:20220414.2
          imagePullPolicy: IfNotPresent
          name: f11qb62y-n1-0-dn0-0-n2-0
          resources:
            limits:
              cpu: "16"
              ephemeral-storage: 52Gi
              memory: 57Gi
            requests:
              cpu: "16"
              ephemeral-storage: 26Gi
              memory: 57Gi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-6h4gh
            readOnly: true
        dnsPolicy: ClusterFirst
        enableServiceLinks: true
        nodeName: gke-orchid-west1-flyte-worker-a3b955b5-pccx
        preemptionPolicy: PreemptLowerPriority
        priority: 0
        restartPolicy: Never
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: default
        serviceAccountName: default
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: k8s.freenome.net/node-role
          operator: Equal
          value: flyte-worker
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: 300
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 300
        volumes:
        - name: kube-api-access-6h4gh
          projected:
            defaultMode: 420
            sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T07:30:49Z"
          reason: PodCompleted
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T08:24:46Z"
          reason: PodCompleted
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T08:24:46Z"
          reason: PodCompleted
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T07:30:49Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
          image: gcr.io/freenome-build/ap/sunflower:20220414.2
          imageID: gcr.io/freenome-build/ap/sunflower@sha256:c1029318c0902a1a7c69cc9252c11c0c745d6cc5433857bc06d83b98885742a5
          lastState: {}
          name: f11qb62y-n1-0-dn0-0-n2-0
          ready: false
          restartCount: 0
          started: false
          state:
            terminated:
              containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
              exitCode: 0
              finishedAt: "2022-04-19T08:24:45Z"
              reason: Completed
              startedAt: "2022-04-19T07:30:49Z"
        hostIP: 172.31.0.27
        phase: Succeeded
        podIP: 172.20.53.23
        podIPs:
        - ip: 172.20.53.23
        qosClass: Guaranteed
        startTime: "2022-04-19T07:30:49Z"
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
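    The dump above already shows the problematic combination: deletionTimestamp set, the flyte/flytek8s finalizer still present, and phase Succeeded. A small sketch of a check for that state - is_stuck is a hypothetical helper, fed a trimmed-down version of the pod above; against a live cluster you would feed it entries from `kubectl get pods -o json`:

```python
def is_stuck(pod: dict) -> bool:
    """Deletion requested, Flyte finalizer still present, work already done."""
    meta = pod.get("metadata", {})
    deleting = "deletionTimestamp" in meta
    has_flyte_finalizer = "flyte/flytek8s" in meta.get("finalizers", [])
    done = pod.get("status", {}).get("phase") in ("Succeeded", "Failed")
    return deleting and has_flyte_finalizer and done

# trimmed-down version of the pod dump above
pod = {
    "metadata": {
        "deletionTimestamp": "2022-04-19T10:36:03Z",
        "finalizers": ["flyte/flytek8s"],
    },
    "status": {"phase": "Succeeded"},
}
print(is_stuck(pod))  # True: deletion requested but the finalizer blocks it
```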
    I exec’d into the flytepropeller pod and here are the contents of all our config yamls:
    Dan Rammer (hamersaw)

    5 months ago
    This is very odd. Do you have access to the propeller logs? I'd be happy to jump on a call and run through them quickly to see if there is anything interesting there. It would come up with warnings like "Failed to clear finalizers for Resource with name" or "Failed in finalizing get Resource with name".
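    A sketch of that log check - the warning strings are the ones quoted above, the sample line piped in is fabricated, and the deployment/namespace names in the usage comment are assumptions:

```shell
# filter propeller logs for the finalizer warnings mentioned above
finalizer_warnings() {
  grep -E 'Failed to clear finalizers for Resource with name|Failed in finalizing get Resource with name'
}

# live usage (names are assumptions):
#   kubectl logs -n flyte deploy/flytepropeller | finalizer_warnings
printf 'some info line\nFailed to clear finalizers for Resource with name [sunflower-main/f11qb62y-n1-0-dn0-0-n2-0]\n' | finalizer_warnings
```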
    Anna Cunningham

    5 months ago
    that would be so helpful. when are you free? I’m free now til 1 and then after 2 (PST)
    Dan Rammer (hamersaw)

    5 months ago
    Now works for me.
    To join the video meeting, click this link: https://meet.google.com/yvu-kffs-gqh Otherwise, to join by phone, dial +1 720-598-2817 and enter this PIN: 193 537 382# To view more phone numbers, click this link: https://tel.meet/yvu-kffs-gqh?hs=5
    Dan Rammer (hamersaw)

    5 months ago
    cc @Haytham Abuelfutuh - We're not seeing anything interesting in the logs unfortunately, was wondering if you might have any insight? Just to scope this a little bit, launching about 800 workflows which consist of a dynamic task which executes ~10 launch plans each. So a large number of concurrent workflows (8k ish) but a relatively small number of tasks.
    Haytham Abuelfutuh

    5 months ago
    @Dan Rammer (hamersaw), are you seeing the workflow getting marked as terminated but the Pods still have the finalizer on them?
    Dan Rammer (hamersaw)

    5 months ago
    The one workflow we looked at was still running, with the pod completed but the Flyte node still in the running phase.
    So intuitively propeller is just not processing the nodes, and in turn not removing the finalizers. It's most likely not a bug, just a bottleneck. It would probably be worth looking at some prometheus metrics on the etcd query times and workflow round latencies in flytepropeller to see whether the bottleneck is k8s or Flyte and what we can do to reduce it.
    @Anna Cunningham is there a specific reason you are using launchplans instead of just subworkflows (beyond stress testing)? I ask because I don't believe the max parallelism configuration from the parent workflow is adopted by launchplans. Therefore, each workflow may execute 3 launchplans in parallel, each of which executes 3 tasks. When you extrapolate that over the 800 workflows it's something like 7200 pods simultaneously. With the resource requests on the sample pod (16 cpu / 57Gi memory) that adds up quickly.
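    A back-of-the-envelope check of that fan-out, using the request sizes from the sample pod dump above:

```python
# 800 workflows, each running ~3 launch plans in parallel,
# each launch plan running ~3 tasks concurrently
workflows = 800
launchplans_per_wf = 3
tasks_per_lp = 3

pods = workflows * launchplans_per_wf * tasks_per_lp
print(pods)  # 7200 simultaneous pods

# per-pod requests from the sample pod: 16 CPU, 57Gi memory
print(pods * 16)  # 115200 CPUs requested across the cluster
print(pods * 57)  # 410400 Gi of memory requested
```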
    Anna Cunningham

    5 months ago
    I believe when we originally wrote these workflows, launchplans were the only way to invoke sub-workflows, and we never updated them when we upgraded flytekit. We currently call our sub-workflows by creating a launchplan with launch_plan = LaunchPlan.create("name", workflow) and then calling launch_plan(). Should we instead just be calling workflow() directly?
    Dan Rammer (hamersaw)

    5 months ago
    I think it depends. If the goal is to run 800+ of these workflows in parallel, then using subworkflows will give a little better control over the parallelism and may help reduce some of the overhead of managing additional workflows.
    Anna Cunningham

    5 months ago
    I see. I’ll read a bit more about it, found this helpful documentation 🙂
    in terms of the prometheus metrics, I found the prometheus dashboard for flytepropeller, here are some screenshots of round latency and etcd in the last 24 hours:
    I think maybe I was just overloading the Kubernetes API. I reduced workers drastically (to 10 instead of 1000) and also reduced the kube-client-config qps, and suddenly the pods stuck in Terminating started getting cleaned up. Do you have tips on how to balance FlytePropeller performance with Kubernetes constraints?
    Ketan (kumare3)

    5 months ago
    hmm, this is interesting, 10 sounds too low. I think you should be able to increase the workers and increase the qps. Make the workers like 40-100
    Anna Cunningham

    5 months ago
    👍 I increased the workers little by little to 75 and things looked good. I’ll also experiment with increasing qps a little tomorrow!
    Ketan (kumare3)

    5 months ago
    i can try to work with Dan to understand this
    Haytham Abuelfutuh

    5 months ago
    I’ve noticed that when the kube-api-client gets throttled, you don’t get helpful logs - it just gets backed up… you should be able to get metrics from the api server for writes… You should check the Update calls for both Pods (propeller calls to clear the finalizers) and Workflows… This may be helpful.
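    If the metrics flow through prometheus, the standard kube-apiserver instrumentation exposes per-verb request counts. A sketch of the queries Haytham describes (metric and label names follow upstream apiserver conventions; verify against your setup):

```promql
# UPDATE rate for pods (propeller clearing finalizers)
sum(rate(apiserver_request_total{verb="UPDATE", resource="pods"}[5m]))

# UPDATE rate for the flyteworkflow CRD
sum(rate(apiserver_request_total{verb="UPDATE", resource="flyteworkflows"}[5m]))
```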
    Anna Cunningham

    5 months ago
    thank you all for the help and advice! I was able to settle on a number of workers and a qps that didn’t overload the kubernetes api. not seeing the issue with pods stuck in Terminating anymore 🙂
    Dan Rammer (hamersaw)

    5 months ago
    That's great to hear! Can you say a little more about it? i.e. did you look at the kubeapi metrics and just run experiments with varying numbers until it looked good? We should certainly update the documentation to include this info, and it would help to have something more solid than “don't overload the kube api server”.
    Anna Cunningham

    5 months ago
    @Dan Rammer (hamersaw) sorry I missed this until now! I started by dialing down the workers and qps super low (10 workers, 50qps) and slowly ramped them up. it was pretty non-quantitative. when the kube-api was getting overloaded, we’d see all the workers free (I assume because they were waiting on timed-out api calls?). I upped workers and qps until I saw a regular fluctuation of sometimes all the workers working and sometimes most of them idle.
    I ended up landing at 75 workers and 100qps
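    For anyone landing here later, a sketch of where those knobs live, assuming the common flytepropeller configmap layout (the burst value is an assumption; it is often set somewhere above qps):

```yaml
propeller:
  workers: 75              # settled on after ramping up from 10
  kube-client-config:
    qps: 100               # queries per second allowed to the kube-apiserver
    burst: 150             # assumption: short-term burst allowance above qps
```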
    Dan Rammer (hamersaw)

    5 months ago
    @Anna Cunningham thank you so much for the info! I'll work on updating the documentation.