Thread
#flyte-deployment
    Anna Cunningham

    5 months ago
    Hi! I’m trying to optimize the FlytePropeller configuration to avoid the worst-case scenario described in this doc. I’m still running into the issue: there are too many nodes running at once, so by the time the propeller gets around to checking whether a task finished, the pod has already completed and been cleaned up long ago. I added more workers, increased the kube-client-config qps, and reduced maxParallelism, but the problem is still happening. I was wondering if I could get some more advice?
    Ketan (kumare3)

    5 months ago
    the best way would be to set inject-finalizer or lower maxParallelism
    inject-finalizer will prevent K8s from claiming back the pod
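    For reference, inject-finalizer lives under the k8s plugin section of the FlytePropeller config. A minimal sketch (the exact nesting can differ depending on how you deploy, e.g. helm values vs. a raw configmap):

```yaml
plugins:
  k8s:
    # keep pods around until propeller has observed their final status;
    # propeller removes the finalizer once it records completion
    inject-finalizer: true
```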
    Anna Cunningham

    5 months ago
    thanks, I’ll look into inject-finalizer more! I set maxParallelism to 3 (from 25) and it did not help. am I right in my understanding that setting it lower would help, as it would limit the number of nodes per workflow the propeller launches?
    @Ketan (kumare3) thanks for the inject-finalizer suggestion, I tried it out and my original issue seems solved, but now I have a backlog of pods that are stuck in Terminating (for 4+ hours so far) and still have the finalizer on them. have you seen this before?
    Ketan (kumare3)

    5 months ago
    Hmm, that is odd, cc @Haytham Abuelfutuh / @Dan Rammer (hamersaw) do you guys know
    Would love to dive deeper
    Anna Cunningham

    5 months ago
    It actually looks like the pods do eventually terminate, it just takes a really long time. Some pods from yesterday have now been Terminating for 16+ hours, and almost 10,000 Terminating pods have accumulated in total. Any tips for how I could speed this up? Currently I have workers set to 1000.
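    A quick way to size a backlog like this. count_terminating is a hypothetical helper; it only filters on the STATUS column of kubectl output, so it can be sanity-checked on captured output (the namespace in the usage comment comes from the pod dump later in this thread):

```shell
# count_terminating: counts lines whose STATUS column
# (field 3 of `kubectl get pods` output) is "Terminating"
count_terminating() {
  awk '$3 == "Terminating"' | wc -l
}

# against a live cluster:
#   kubectl get pods -n sunflower-main --no-headers | count_terminating
printf 'pod-a 1/1 Running 0 5m\npod-b 0/1 Terminating 0 4h\n' | count_terminating
```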
    Dan Rammer (hamersaw)

    5 months ago
    Hi @Anna Cunningham, this is odd. Just so I'm sure that I understand: you were having issues where the pods were being cleaned up too quickly, you set the inject-finalizer configuration and that fixed it, only now the Flyte pods are taking a very long time to terminate? And the pods still have the "flyte" finalizer set on them?
    Anna Cunningham

    5 months ago
    correct!
    Dan Rammer (hamersaw)

    5 months ago
    Interesting, are there errors in any of the pod logs? did the tasks which launched the pods themselves fail?
    Also, what version of FlytePropeller are you using?
    Anna Cunningham

    5 months ago
    v0.16.19
    Dan Rammer (hamersaw)

    5 months ago
    Can you dump the FlytePropeller configuration? It sounds like the finalizer is doing its job of keeping the pod around - but FlytePropeller is supposed to remove it on completion, and obviously that is not happening for some reason.
    Also, is the deletion timestamp set on the pods?
    Anna Cunningham

    5 months ago
    what is the best way for me to dump the configuration?
    the deletion timestamp is set on the pods, here’s an example:
    apiVersion: v1
    items:
    - apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        creationTimestamp: "2022-04-19T07:30:49Z"
        deletionGracePeriodSeconds: 0
        deletionTimestamp: "2022-04-19T10:36:03Z"
        finalizers:
        - flyte/flytek8s
        labels:
          domain: main
          execution-id: f11qb62y-n1-0-dn0-0
          interruptible: "true"
          node-id: n2
          project: sunflower
          shard-key: "4"
          task-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
          workflow-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
        name: f11qb62y-n1-0-dn0-0-n2-0
        namespace: sunflower-main
        ownerReferences:
        - apiVersion: flyte.lyft.com/v1alpha1
          blockOwnerDeletion: true
          controller: true
          kind: flyteworkflow
          name: f11qb62y-n1-0-dn0-0
          uid: e0525cc8-17c2-463b-a5dd-342776c69bbd
        resourceVersion: "266395947"
        uid: 393c4d2b-42ad-4061-8c95-a5d4f6eabb67
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: cloud.google.com/gke-preemptible
                  operator: Exists
              weight: 1
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: k8s.freenome.net/node-role
                  operator: In
                  values:
                  - flyte-worker
                - key: cloud.google.com/gke-preemptible
                  operator: In
                  values:
                  - "true"
        containers:
        - args:
          - pyflyte-execute
          - --inputs
          - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/inputs.pb
          - --output-prefix
          - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/0
          - --raw-output-data-prefix
          - gs://freenome-orchid-staging-flyte-data/ak/f11qb62y-n1-0-dn0-0-n2-0
          - --resolver
          - flytekit.core.python_auto_container.default_task_resolver
          - --
          - task-module
          - sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow
          - task-name
          - align
          env:
          - name: FLYTE_INTERNAL_CONFIGURATION_PATH
            value: /usr/src/app/sunflower/workflows/config/workflows.config
          - name: FLYTE_INTERNAL_IMAGE
            value: gcr.io/freenome-build/ap/sunflower:20220414.2
          - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
            value: sunflower:main:sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.trim_fastqs_and_align_wf
          - name: FLYTE_INTERNAL_EXECUTION_ID
            value: f11qb62y-n1-0-dn0-0
          - name: FLYTE_INTERNAL_EXECUTION_PROJECT
            value: sunflower
          - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
            value: main
          - name: FLYTE_ATTEMPT_NUMBER
            value: "0"
          - name: FLYTE_INTERNAL_TASK_PROJECT
            value: sunflower
          - name: FLYTE_INTERNAL_TASK_DOMAIN
            value: main
          - name: FLYTE_INTERNAL_TASK_NAME
            value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
          - name: FLYTE_INTERNAL_TASK_VERSION
            value: "20220414.2"
          - name: FLYTE_INTERNAL_PROJECT
            value: sunflower
          - name: FLYTE_INTERNAL_DOMAIN
            value: main
          - name: FLYTE_INTERNAL_NAME
            value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
          - name: FLYTE_INTERNAL_VERSION
            value: "20220414.2"
          - name: SUNFLOWER_STATIC_DATA_GCS_PATH
            value: gs://freenome-orchid-staging-static-data
          image: gcr.io/freenome-build/ap/sunflower:20220414.2
          imagePullPolicy: IfNotPresent
          name: f11qb62y-n1-0-dn0-0-n2-0
          resources:
            limits:
              cpu: "16"
              ephemeral-storage: 52Gi
              memory: 57Gi
            requests:
              cpu: "16"
              ephemeral-storage: 26Gi
              memory: 57Gi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-6h4gh
            readOnly: true
        dnsPolicy: ClusterFirst
        enableServiceLinks: true
        nodeName: gke-orchid-west1-flyte-worker-a3b955b5-pccx
        preemptionPolicy: PreemptLowerPriority
        priority: 0
        restartPolicy: Never
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: default
        serviceAccountName: default
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: k8s.freenome.net/node-role
          operator: Equal
          value: flyte-worker
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: 300
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 300
        volumes:
        - name: kube-api-access-6h4gh
          projected:
            defaultMode: 420
            sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T07:30:49Z"
          reason: PodCompleted
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T08:24:46Z"
          reason: PodCompleted
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T08:24:46Z"
          reason: PodCompleted
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2022-04-19T07:30:49Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
          image: gcr.io/freenome-build/ap/sunflower:20220414.2
          imageID: gcr.io/freenome-build/ap/sunflower@sha256:c1029318c0902a1a7c69cc9252c11c0c745d6cc5433857bc06d83b98885742a5
          lastState: {}
          name: f11qb62y-n1-0-dn0-0-n2-0
          ready: false
          restartCount: 0
          started: false
          state:
            terminated:
              containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
              exitCode: 0
              finishedAt: "2022-04-19T08:24:45Z"
              reason: Completed
              startedAt: "2022-04-19T07:30:49Z"
        hostIP: 172.31.0.27
        phase: Succeeded
        podIP: 172.20.53.23
        podIPs:
        - ip: 172.20.53.23
        qosClass: Guaranteed
        startTime: "2022-04-19T07:30:49Z"
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
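    The dump above already shows the problematic combination: deletionTimestamp set, the flyte/flytek8s finalizer still present, and phase Succeeded. A small sketch of a check for that state - is_stuck is a hypothetical helper, fed a trimmed-down version of the pod above; against a live cluster you would feed it entries from `kubectl get pods -o json`:

```python
def is_stuck(pod: dict) -> bool:
    """Deletion requested, Flyte finalizer still present, work already done."""
    meta = pod.get("metadata", {})
    deleting = "deletionTimestamp" in meta
    has_flyte_finalizer = "flyte/flytek8s" in meta.get("finalizers", [])
    done = pod.get("status", {}).get("phase") in ("Succeeded", "Failed")
    return deleting and has_flyte_finalizer and done

# trimmed-down version of the pod dump above
pod = {
    "metadata": {
        "deletionTimestamp": "2022-04-19T10:36:03Z",
        "finalizers": ["flyte/flytek8s"],
    },
    "status": {"phase": "Succeeded"},
}
print(is_stuck(pod))  # True: deletion requested but the finalizer blocks it
```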
    I exec’d into the flytepropeller pod and here are the contents of all our config yamls:
    Dan Rammer (hamersaw)

    5 months ago
    This is very odd. Do you have access to the propeller logs? I'd be happy to jump on a call and run through them quickly to see if there is anything interesting there. It would come up with warnings like "Failed to clear finalizers for Resource with name" or "Failed in finalizing get Resource with name".
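    A sketch of that log check - the warning strings are the ones quoted above, the sample line piped in is fabricated, and the deployment/namespace names in the usage comment are assumptions:

```shell
# filter propeller logs for the finalizer warnings mentioned above
finalizer_warnings() {
  grep -E 'Failed to clear finalizers for Resource with name|Failed in finalizing get Resource with name'
}

# live usage (names are assumptions):
#   kubectl logs -n flyte deploy/flytepropeller | finalizer_warnings
printf 'some info line\nFailed to clear finalizers for Resource with name [sunflower-main/f11qb62y-n1-0-dn0-0-n2-0]\n' | finalizer_warnings
```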
    Anna Cunningham

    5 months ago
    that would be so helpful. when are you free? I’m free now til 1 and then after 2 (PST)
    Dan Rammer (hamersaw)

    5 months ago
    Now works for me.
    To join the video meeting, click this link: https://meet.google.com/yvu-kffs-gqh Otherwise, to join by phone, dial +1 720-598-2817 and enter this PIN: 193 537 382# To view more phone numbers, click this link: https://tel.meet/yvu-kffs-gqh?hs=5
    Dan Rammer (hamersaw)

    5 months ago
    cc @Haytham Abuelfutuh - We're not seeing anything interesting in the logs unfortunately, was wondering if you might have any insight? Just to scope this a little bit, launching about 800 workflows which consist of a dynamic task which executes ~10 launch plans each. So a large number of concurrent workflows (8k ish) but a relatively small number of tasks.
    Haytham Abuelfutuh

    5 months ago
    @Dan Rammer (hamersaw), are you seeing the workflow getting marked as terminated but the Pods still have the finalizer on them?
    Dan Rammer (hamersaw)

    5 months ago
    The one workflow we looked at was still running, with the pod completed but the Flyte node still in the running phase.
    So intuitively propeller is just not processing the nodes, and in turn not removing the finalizers. It's most likely not a bug, just a bottleneck. It would probably be worth looking at some prometheus metrics on the etcd query times and workflow round latencies in flytepropeller to see whether the bottleneck is k8s or Flyte and what we can do to reduce it.
    @Anna Cunningham is there a specific reason you are using launchplans instead of just subworkflows (beyond stress testing)? I ask because I don't believe the max parallelism configuration from the parent workflow is adopted by launchplans. Therefore, each workflow may execute 3 launchplans in parallel, each of which executes 3 tasks. When you extrapolate that over the 800 workflows it's something like 7200 pods simultaneously. With the resource requests on the sample pod (16 cpu / 57Gi memory) that adds up quickly.
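    A back-of-the-envelope check of that fan-out, using the request sizes from the sample pod dump above:

```python
# 800 workflows, each running ~3 launch plans in parallel,
# each launch plan running ~3 tasks concurrently
workflows = 800
launchplans_per_wf = 3
tasks_per_lp = 3

pods = workflows * launchplans_per_wf * tasks_per_lp
print(pods)  # 7200 simultaneous pods

# per-pod requests from the sample pod: 16 CPU, 57Gi memory
print(pods * 16)  # 115200 CPUs requested across the cluster
print(pods * 57)  # 410400 Gi of memory requested
```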
    Anna Cunningham

    5 months ago
    I believe when we originally wrote these workflows, launchplans were the only way to invoke sub-workflows, and we never updated them when we upgraded flytekit. We currently call our sub-workflows by creating a launchplan with launch_plan = LaunchPlan.create("name", workflow) and then calling launch_plan(). Should we instead just be calling workflow() directly?
    Dan Rammer (hamersaw)

    5 months ago
    I think it depends. If the goal is to run 800+ of these workflows in parallel, then using subworkflows will give a little better control over the parallelism and may help reduce some of the overhead of managing additional workflows.
    Anna Cunningham

    5 months ago
    I see. I’ll read a bit more about it, found this helpful documentation 🙂
    in terms of the prometheus metrics, I found the prometheus dashboard for flytepropeller, here are some screenshots of round latency and etcd in the last 24 hours:
    I think maybe I was just overloading the Kubernetes API. I reduced workers drastically (to 10 instead of 1000) and also reduced the kube-client-config qps, and suddenly the pods stuck in Terminating started getting cleaned up. Do you have tips on how to balance FlytePropeller performance with Kubernetes constraints?
    Ketan (kumare3)

    5 months ago
    hmm, this is interesting, 10 sounds too low. I think you should be able to increase the workers and increase the qps. Make the workers like 40-100
    Anna Cunningham

    5 months ago
    👍 I increased the workers little by little to 75 and things looked good. I’ll also experiment with increasing qps a little tomorrow!
    Ketan (kumare3)

    5 months ago
    i can try to work with Dan to understand this
    Haytham Abuelfutuh

    5 months ago
    I’ve noticed that when the kube-api-client gets throttled, you don’t get helpful logs - it just gets backed up… you should be able to get metrics from the api server for writes… You should check the Update calls for both Pods (propeller calls to clear the finalizers) and Workflows… This may be helpful.
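    If the metrics flow through prometheus, the standard kube-apiserver instrumentation exposes per-verb request counts. A sketch of the queries Haytham describes (metric and label names follow upstream apiserver conventions; verify against your setup):

```promql
# UPDATE rate for pods (propeller clearing finalizers)
sum(rate(apiserver_request_total{verb="UPDATE", resource="pods"}[5m]))

# UPDATE rate for the flyteworkflow CRD
sum(rate(apiserver_request_total{verb="UPDATE", resource="flyteworkflows"}[5m]))
```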
    Anna Cunningham

    5 months ago
    thank you all for the help and advice! I was able to settle on a number of workers and a qps that didn’t overload the kubernetes api. not seeing the issue with pods stuck in Terminating anymore 🙂
    Dan Rammer (hamersaw)

    5 months ago
    That's great to hear! Can you say a little more about it? i.e. did you look at the kubeapi metrics and just run experiments with varying numbers until it looked good? We should certainly update the documentation to include this info, and it would help to have something more solid than “don't overload the kube api server”.
    Anna Cunningham

    5 months ago
    @Dan Rammer (hamersaw) sorry I missed this until now! I started by dialing down the workers and qps super low (10 workers, 50qps) and slowly ramped them up. it was pretty non-quantitative. when the kube-api was getting overloaded, we’d see all the workers free (I assume because they were waiting on timed-out api calls?). I upped workers and qps until I saw a regular fluctuation of sometimes all the workers working and sometimes most of them idle.
    I ended up landing at 75 workers and 100qps
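    For anyone landing here later, a sketch of where those knobs live, assuming the common flytepropeller configmap layout (the burst value is an assumption; it is often set somewhere above qps):

```yaml
propeller:
  workers: 75              # settled on after ramping up from 10
  kube-client-config:
    qps: 100               # queries per second allowed to the kube-apiserver
    burst: 150             # assumption: short-term burst allowance above qps
```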
    Dan Rammer (hamersaw)

    5 months ago
    @Anna Cunningham thank you so much for the info! I'll work on updating the documentation.