Anna Cunningham
04/13/2022, 7:41 PM
Ketan (kumare3)
inject-finalizer or set maxParallelism
Anna Cunningham
04/13/2022, 8:12 PM
inject-finalizer more!
I set maxParallelism to 3 (from 25) and it did not help. Am I right in my understanding that setting it lower would help, as it would limit the number of nodes per workflow that propeller launches?
inject-finalizer suggestion, I tried it out and my original issue seems solved, but now I have a backlog of pods that are stuck in Terminating (for 4+ hours so far) and still have the finalizer on them. Have you seen this before?
Ketan (kumare3)
Anna Cunningham
04/19/2022, 5:06 PM
Dan Rammer (hamersaw)
04/19/2022, 6:18 PM
inject-finalizer configuration and that fixed it, only now the Flyte pods are taking a very long time to terminate? And the pods still have the "flyte" finalizer set on them?
Anna Cunningham
04/19/2022, 6:19 PM
Dan Rammer (hamersaw)
04/19/2022, 6:20 PM
Anna Cunningham
04/19/2022, 6:23 PM
Dan Rammer (hamersaw)
04/19/2022, 6:27 PM
Anna Cunningham
04/19/2022, 6:32 PM
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    creationTimestamp: "2022-04-19T07:30:49Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2022-04-19T10:36:03Z"
    finalizers:
    - flyte/flytek8s
    labels:
      domain: main
      execution-id: f11qb62y-n1-0-dn0-0
      interruptible: "true"
      node-id: n2
      project: sunflower
      shard-key: "4"
      task-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
      workflow-name: sunflower-workflows-flyte-workflows-trim-fastqs-and-align-workf
    name: f11qb62y-n1-0-dn0-0-n2-0
    namespace: sunflower-main
    ownerReferences:
    - apiVersion: flyte.lyft.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: flyteworkflow
      name: f11qb62y-n1-0-dn0-0
      uid: e0525cc8-17c2-463b-a5dd-342776c69bbd
    resourceVersion: "266395947"
    uid: 393c4d2b-42ad-4061-8c95-a5d4f6eabb67
  spec:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: cloud.google.com/gke-preemptible
              operator: Exists
          weight: 1
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: k8s.freenome.net/node-role
              operator: In
              values:
              - flyte-worker
            - key: cloud.google.com/gke-preemptible
              operator: In
              values:
              - "true"
    containers:
    - args:
      - pyflyte-execute
      - --inputs
      - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/inputs.pb
      - --output-prefix
      - gs://freenome-orchid-staging-flyte-data/metadata/propeller/sunflower-main-f11qb62y-n1-0-dn0-0/n2/data/0
      - --raw-output-data-prefix
      - gs://freenome-orchid-staging-flyte-data/ak/f11qb62y-n1-0-dn0-0-n2-0
      - --resolver
      - flytekit.core.python_auto_container.default_task_resolver
      - --
      - task-module
      - sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow
      - task-name
      - align
      env:
      - name: FLYTE_INTERNAL_CONFIGURATION_PATH
        value: /usr/src/app/sunflower/workflows/config/workflows.config
      - name: FLYTE_INTERNAL_IMAGE
        value: gcr.io/freenome-build/ap/sunflower:20220414.2
      - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
        value: sunflower:main:sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.trim_fastqs_and_align_wf
      - name: FLYTE_INTERNAL_EXECUTION_ID
        value: f11qb62y-n1-0-dn0-0
      - name: FLYTE_INTERNAL_EXECUTION_PROJECT
        value: sunflower
      - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
        value: main
      - name: FLYTE_ATTEMPT_NUMBER
        value: "0"
      - name: FLYTE_INTERNAL_TASK_PROJECT
        value: sunflower
      - name: FLYTE_INTERNAL_TASK_DOMAIN
        value: main
      - name: FLYTE_INTERNAL_TASK_NAME
        value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
      - name: FLYTE_INTERNAL_TASK_VERSION
        value: "20220414.2"
      - name: FLYTE_INTERNAL_PROJECT
        value: sunflower
      - name: FLYTE_INTERNAL_DOMAIN
        value: main
      - name: FLYTE_INTERNAL_NAME
        value: sunflower.workflows.flyte_workflows.trim_fastqs_and_align_workflow.align
      - name: FLYTE_INTERNAL_VERSION
        value: "20220414.2"
      - name: SUNFLOWER_STATIC_DATA_GCS_PATH
        value: gs://freenome-orchid-staging-static-data
      image: gcr.io/freenome-build/ap/sunflower:20220414.2
      imagePullPolicy: IfNotPresent
      name: f11qb62y-n1-0-dn0-0-n2-0
      resources:
        limits:
          cpu: "16"
          ephemeral-storage: 52Gi
          memory: 57Gi
        requests:
          cpu: "16"
          ephemeral-storage: 26Gi
          memory: 57Gi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: FallbackToLogsOnError
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6h4gh
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: gke-orchid-west1-flyte-worker-a3b955b5-pccx
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoSchedule
      key: k8s.freenome.net/node-role
      operator: Equal
      value: flyte-worker
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-6h4gh
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T07:30:49Z"
      reason: PodCompleted
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T08:24:46Z"
      reason: PodCompleted
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T08:24:46Z"
      reason: PodCompleted
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-04-19T07:30:49Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
      image: gcr.io/freenome-build/ap/sunflower:20220414.2
      imageID: gcr.io/freenome-build/ap/sunflower@sha256:c1029318c0902a1a7c69cc9252c11c0c745d6cc5433857bc06d83b98885742a5
      lastState: {}
      name: f11qb62y-n1-0-dn0-0-n2-0
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://7a10fb36a46f0c245ee20736b470f2dd271755c0d50eb5b9c0e38e0a801b6d20
          exitCode: 0
          finishedAt: "2022-04-19T08:24:45Z"
          reason: Completed
          startedAt: "2022-04-19T07:30:49Z"
    hostIP: 172.31.0.27
    phase: Succeeded
    podIP: 172.20.53.23
    podIPs:
    - ip: 172.20.53.23
    qosClass: Guaranteed
    startTime: "2022-04-19T07:30:49Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
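For anyone reading this dump later: the fields that show the symptom are metadata.deletionTimestamp (the pod has been asked to go away), metadata.finalizers (something still has to release it), and status.phase (the container itself already finished). A minimal sketch of that check in Python follows, assuming the manifest has already been loaded into a dict. The helper names, the one-hour threshold, and the "now" value are illustrative assumptions, not part of Flyte or Kubernetes.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes UTC timestamp such as 2022-04-19T10:36:03Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def stuck_terminating(pod, now, threshold=timedelta(hours=1)):
    """Return True if the pod was marked for deletion at least `threshold`
    ago and still carries a finalizer (hypothetical helper)."""
    meta = pod.get("metadata", {})
    deleted_at = meta.get("deletionTimestamp")
    finalizers = meta.get("finalizers") or []
    if not deleted_at or not finalizers:
        return False
    return now - parse_k8s_time(deleted_at) >= threshold


# Only the fields relevant to the check, copied from the manifest above.
pod = {
    "metadata": {
        "name": "f11qb62y-n1-0-dn0-0-n2-0",
        "deletionTimestamp": "2022-04-19T10:36:03Z",
        "finalizers": ["flyte/flytek8s"],
    },
    "status": {"phase": "Succeeded"},
}

# Assumed observation time, chosen for illustration only.
now = parse_k8s_time("2022-04-19T18:32:00Z")

if stuck_terminating(pod, now):
    meta = pod["metadata"]
    print(meta["name"], "still held by", meta["finalizers"],
          "with phase", pod["status"]["phase"])
```

The check is deliberately conservative: a pod with a deletionTimestamp but no finalizers is left to the kubelet and is not reported.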
Dan Rammer (hamersaw)
04/19/2022, 7:17 PM
Anna Cunningham
04/19/2022, 7:22 PM
Dan Rammer (hamersaw)
04/19/2022, 7:26 PM
Anna Cunningham
04/19/2022, 7:30 PM
Dan Rammer (hamersaw)
04/19/2022, 7:46 PM
Haytham Abuelfutuh
Dan Rammer (hamersaw)
04/19/2022, 9:41 PM
Anna Cunningham
04/19/2022, 9:46 PM
launch_plan = LaunchPlan.create("name", workflow) and then calling launch_plan(). Instead should we just be calling workflow() directly?
Dan Rammer (hamersaw)
04/19/2022, 10:02 PM
Anna Cunningham
04/19/2022, 10:04 PM
Ketan (kumare3)
Anna Cunningham
04/20/2022, 4:11 AM
Ketan (kumare3)
Haytham Abuelfutuh
Anna Cunningham
04/21/2022, 2:12 PM
Dan Rammer (hamersaw)
04/21/2022, 2:16 PM
Anna Cunningham
04/26/2022, 8:26 PM
Dan Rammer (hamersaw)
04/27/2022, 8:52 AM