# flyte-deployment
a
Hi! I’m trying to optimize the FlytePropeller configuration to avoid the worst-case scenario described in this doc. I’m still running into the issue - there are too many nodes running at once, so by the time the propeller gets around to checking whether a task finished, the pod has already completed and been cleaned up long ago. I added more workers, increased the kube-client-config qps, and reduced maxParallelism, but the problem is still happening. I was wondering if I could get some more advice?
k
best way would be to `inject-finalizer` or set `maxParallelism`
`inject-finalizer` will prevent K8s from reclaiming the pod
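Roughly, that flag lives in the k8s plugin section of the FlytePropeller config. A minimal sketch, assuming the standard config layout (your ConfigMap may organize it differently):
```yaml
# Illustrative snippet of the FlytePropeller config, not a complete file.
plugins:
  k8s:
    # Adds a finalizer to every pod FlytePropeller launches, so Kubernetes
    # cannot garbage-collect the pod before Propeller has read its final status.
    inject-finalizer: true
```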
a
thanks, I’ll look into `inject-finalizer` more! I set `maxParallelism` to 3 (from 25) and it did not help. am I right in my understanding that setting it lower would help, as it would limit the number of nodes per workflow the propeller launches?
@Ketan (kumare3) thanks for the `inject-finalizer` suggestion, I tried it out and my original issue seems solved, but now I have a backlog of pods that are stuck in Terminating (for 4+ hours so far) and still have the finalizer on them. have you seen this before?
k
Hmm, that is odd, cc @Haytham Abuelfutuh / @Dan Rammer (hamersaw) do you guys know
Would love to dive deeper
🙏 1
a
It looks like the pods actually do eventually terminate, it just takes a really long time. There are still some pods that have been Terminating for 16+ hours from yesterday, and almost 10,000 Terminating pods have accumulated in total. Any tips for how I could speed it up? Currently I have workers set to 1000.
d
Hi @Anna Cunningham, this is odd. Just so I'm sure that I understand: you were having issues where the pods were being cleaned up too quickly, you set the `inject-finalizer` configuration and that fixed it, but now the Flyte pods are taking a very long time to terminate? And the pods still have the "flyte" finalizer set on them?
a
correct!
d
Interesting, are there errors in any of the pod logs? did the tasks which launched the pods themselves fail?
Also, what version of FlytePropeller are you using?
a
v0.16.19
d
Can you dump the FlytePropeller configuration? It sounds like the finalizer is doing its job of keeping the pod around, but FlytePropeller is supposed to remove it on completion, and that is clearly not happening for some reason.
Also, is the deletion timestamp set on the pods?
a
what is the best way for me to dump the configuration?
the deletion timestamp is set on the pods, here’s an example:
```yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    creationTimestamp: "2022-04-19T07:30:49Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2022-04-19T10:36:03Z"
    finalizers:
    - flyte/flytek8s
    labels:
      domain: main
      execution-id: f11qb62y-n1-0-dn0-0
      interruptible: "true"
      node-id: n2
      project: sunflower
      shard-key: "4"
      task-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
      workflow-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
    name: f11qb62y-n1-0-dn0-0-n2-0
    namespace: sunflower-main
    ownerReferences:
    - apiVersion: flyte.lyft.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: flyteworkflow
      name: f11qb62y-n1-0-dn0-0
      uid: e0525cc8-17c2-463b-a5dd-342776c69bbd
    resourceVersion: "266395947"
    uid: 393c4d2b-42ad-4061-8c95-a5d4f6eabb67
  spec:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: cloud.google.com/gke-preemptible
              operator: Exists
          weight: 1
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: k8s.freenome.net/node-role
              operator: In
              values:
              - flyte-worker
            - key: cloud.google.com/gke-preemptible
              operator: In
              values:
              - "true"
    containers:
    - args:
      - pyflyte-execute
      - --inputs
      - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/inputs.pb
      - --output-prefix
      - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/0
      - --raw-output-data-prefix
      - gs://freenome-orchid-staging-flyte-data/ak/f11qb62y-n1-0-dn0-0-n2-0
      - --resolver
      - flytekit.core.python_auto_container.default_task_resolver
      - --
      - task-module
      - sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow
      - task-name
      - align
      env:
      - name: FLYTE_INTERNAL_CONFIGURATION_PATH
        value: /usr/src/app/sunflower/workflows/config/workflows.config
      - name: FLYTE_INTERNAL_IMAGE
        value: gcr.io/freenome-build/ap/sunflower:20220414.2
      - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
        value: sunflower:main:sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.trim_fastqs_and_align_wf
      - name: FLYTE_INTERNAL_EXECUTION_ID
        value: f11qb62y-n1-0-dn0-0
      - name: FLYTE_INTERNAL_EXECUTION_PROJECT
        value: sunflower
      - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
        value: main
      - name: FLYTE_ATTEMPT_NUMBER
        value: "0"
      - name: FLYTE_INTERNAL_TASK_PROJECT
        value: sunflower
      - name: FLYTE_INTERNAL_TASK_DOMAIN
        value: main
      - name: FLYTE_INTERNAL_TASK_NAME
        value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
      - name: FLYTE_INTERNAL_TASK_VERSION
        value: "20220414.2"
      - name: FLYTE_INTERNAL_PROJECT
        value: sunflower
      - name: FLYTE_INTERNAL_DOMAIN
        value: main
      - name: FLYTE_INTERNAL_NAME
        value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
      - name: FLYTE_INTERNAL_VERSION
        value: "20220414.2"
      - name: SUNFLOWER_STATIC_DATA_GCS_PATH
        value: gs://freenome-orchid-staging-static-data
      image: gcr.io/freenome-build/ap/sunflower:20220414.2
      imagePullPolicy: IfNotPresent
      name: f11qb62y-n1-0-dn0-0-n2-0
      resources:
        limits:
          cpu: "16"
          ephemeral-storage: 52Gi
          memory: 57Gi
        requests:
          cpu: "16"
          ephemeral-storage: 26Gi
          memory: 57Gi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: FallbackToLogsOnError
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6h4gh
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: gke-orchid-west1-flyte-worker-a3b955b5-pccx
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoSchedule
      key: k8s.freenome.net/node-role
      operator: Equal
      value: flyte-worker
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-6h4gh
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T07:30:49Z"
      reason: PodCompleted
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T08:24:46Z"
      reason: PodCompleted
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T08:24:46Z"
      reason: PodCompleted
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T07:30:49Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
      image: gcr.io/freenome-build/ap/sunflower:20220414.2
      imageID: gcr.io/freenome-build/ap/sunflower@sha256:c1029318c0902a1a7c69cc9252c11c0c745d6cc5433857bc06d83b98885742a5
      lastState: {}
      name: f11qb62y-n1-0-dn0-0-n2-0
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
          exitCode: 0
          finishedAt: "2022-04-19T08:24:45Z"
          reason: Completed
          startedAt: "2022-04-19T07:30:49Z"
    hostIP: 172.31.0.27
    phase: Succeeded
    podIP: 172.20.53.23
    podIPs:
    - ip: 172.20.53.23
    qosClass: Guaranteed
    startTime: "2022-04-19T07:30:49Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
I exec’d into the flytepropeller pod and here are the contents of all our config yamls:
d
This is very odd. Do you have access to the propeller logs? I'd be happy to jump on a call and run through them quickly to see if there is anything interesting there. It would show up as warnings like "Failed to clear finalizers for Resource with name" or "Failed in finalizing get Resource with name".
a
that would be so helpful. when are you free? I’m free now til 1 and then after 2 (PST)
d
Now works for me.
To join the video meeting, click this link: https://meet.google.com/yvu-kffs-gqh
d
cc @Haytham Abuelfutuh - We're not seeing anything interesting in the logs unfortunately, was wondering if you might have any insight? Just to scope this a little bit: they're launching about 800 workflows, each consisting of a dynamic task which executes ~10 launch plans. So a large number of concurrent workflows (8k ish) but a relatively small number of tasks.
h
@Dan Rammer (hamersaw), are you seeing the workflow getting marked as terminated but the Pods still have the finalizer on them?
d
The one workflow we looked at was still running, with the pod completed but the Flyte node still in the running phase.
So then intuitively propeller is just not processing the nodes, and in turn not removing the finalizers. It's most likely not a bug, just a bottleneck. It would probably help to look at some prometheus metrics on the etcd query times and workflow round latencies in flytepropeller to see whether the bottleneck is k8s or Flyte and what we can do to reduce it.
@Anna Cunningham is there a specific reason you are using launchplans instead of just subworkflows (beyond stress testing)? I ask because I don't believe the max parallelism configuration from the parent workflow is adopted by launchplans. Therefore, each workflow may execute 3 launchplans in parallel, each of which executes 3 tasks. When you extrapolate that across the 800 workflows it's something like 7200 pods simultaneously. With the resource requests on the sample pod (16 cpu / 57Gi memory) that adds up quickly.
a
I believe when we originally wrote these workflows, launchplans were the only way to invoke sub-workflows, and we never updated it when we upgraded flytekit. We currently call our sub-workflows by creating a launchplan using `launch_plan = LaunchPlan.create("name", workflow)` and then calling `launch_plan()`. Instead should we just be calling `workflow()` directly?
d
I think it depends. If the goal is to run 800+ of these workflows in parallel then using subworkflows will give a little better control over the parallelism and may help reduce some of the overhead in managing additional workflows.
a
I see. I’ll read a bit more about it, found this helpful documentation 🙂
in terms of the prometheus metrics, I found the prometheus dashboard for flytepropeller, here are some screenshots of round latency and etcd in the last 24 hours:
I think maybe I was just overloading the Kubernetes API. I reduced workers drastically (to 10 instead of 1000) and also reduced the kube-client-config qps, and suddenly the pods stuck in Terminating started getting cleaned up. Do you have tips on how to balance FlytePropeller performance with Kubernetes constraints?
k
hmm, this is interesting, 10 sounds too low. I think you should be able to increase the workers and increase the qps. Make the workers like 40-100
a
👍 I increased the workers little by little to 75 and things looked good. I’ll also experiment with increasing qps a little tomorrow!
k
I can try and work with Dan to see if we can understand what's going on.
h
I’ve noticed that when the kube-api client gets throttled, you don’t get helpful logs - it just gets backed up… you should be able to get metrics from the api server for writes… You should check the Update calls for both Pods (propeller calls these to clear the finalizers) and Workflows… This may be helpful.
a
thank you all for the help and advice! I was able to settle on a number of workers and a qps value that didn’t overload the kubernetes api. not seeing the issue with pods stuck in Terminating anymore 🙂
d
That's great to hear! Can you say a little more about it? i.e. did you look at the kube-api metrics and just run experiments with varying numbers until it looked good? We should certainly update the documentation to include this info, and it would help to have something more solid than “don't overload the kube api server”.
a
@Dan Rammer (hamersaw) sorry I missed this until now! I started by dialing down the workers and qps super low (10 workers, 50qps) and slowly ramped them up. it was pretty non-quantitative. when the kube-api was getting overloaded, we’d see all the workers free (I assume because they were waiting on timed-out api calls?). I upped workers and qps until I saw a regular fluctuation of sometimes all the workers working and sometimes most of them idle.
I ended up landing at 75 workers and 100 qps
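For reference, in the propeller config those two settings would look roughly like this - just a sketch assuming the standard FlytePropeller config layout, with burst/timeout shown purely as illustrative values:
```yaml
# Illustrative FlytePropeller snippet reflecting the values settled on above.
propeller:
  workers: 75              # number of parallel workflow-evaluation workers
  kube-client-config:
    qps: 100               # steady-state queries per second allowed to the Kubernetes API
    burst: 25              # illustrative: short burst allowance above qps
    timeout: 30s           # illustrative: per-request timeout against the API server
```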
d
@Anna Cunningham thank you so much for the info! I'll work on updating the documentation.