# announcements
v
Hello Flyte team, while running map tasks, a few tasks completed successfully but a few stayed in a running state in the Flyte console for a long time. I checked the GCP logs and they showed that the map task completed. kubectl also showed that all pods had finished executing except one, which was `terminating`. What could be the reason for this? Did the map task not complete because of the one pod?
👋 1
k
It looks as if 22 succeeded? Cc @Dan Rammer (hamersaw), can we look into this?
s
Hey, I feel like we have a similar issue with map tasks. All our tasks are `completed`, but the UI still shows some with a `Queued` status, so I guess it's a visual bug?
```
❯ k get pods | grep ajbv88z56dwcpksp4cs8-n7-0-
ajbv88z56dwcpksp4cs8-n7-0-0     0/1     Completed   0          78m
ajbv88z56dwcpksp4cs8-n7-0-1     0/1     Completed   0          78m
ajbv88z56dwcpksp4cs8-n7-0-10    0/1     Completed   0          78m
ajbv88z56dwcpksp4cs8-n7-0-100   0/1     Completed   0          60m
ajbv88z56dwcpksp4cs8-n7-0-101   0/1     Completed   0          65m
ajbv88z56dwcpksp4cs8-n7-0-102   0/1     Completed   0          60m
ajbv88z56dwcpksp4cs8-n7-0-103   0/1     Completed   0          70m
ajbv88z56dwcpksp4cs8-n7-0-104   0/1     Completed   0          70m
ajbv88z56dwcpksp4cs8-n7-0-105   0/1     Completed   0          65m
ajbv88z56dwcpksp4cs8-n7-0-106   0/1     Completed   0          65m
ajbv88z56dwcpksp4cs8-n7-0-107   0/1     Completed   0          65m
ajbv88z56dwcpksp4cs8-n7-0-108   0/1     Completed   0          65m
ajbv88z56dwcpksp4cs8-n7-0-109   0/1     Completed   0          60m
ajbv88z56dwcpksp4cs8-n7-0-11    0/1     Completed   0          78m
ajbv88z56dwcpksp4cs8-n7-0-110   0/1     Completed   0          59m
ajbv88z56dwcpksp4cs8-n7-0-111   0/1     Completed   0          59m
ajbv88z56dwcpksp4cs8-n7-0-112   0/1     Completed   0          59m
ajbv88z56dwcpksp4cs8-n7-0-113   0/1     Completed   0          59m
ajbv88z56dwcpksp4cs8-n7-0-114   0/1     Completed   0          63m
ajbv88z56dwcpksp4cs8-n7-0-115   0/1     Completed   0          67m
ajbv88z56dwcpksp4cs8-n7-0-116   0/1     Completed   0          58m
ajbv88z56dwcpksp4cs8-n7-0-117   0/1     Completed   0          58m
ajbv88z56dwcpksp4cs8-n7-0-118   0/1     Completed   0          58m
ajbv88z56dwcpksp4cs8-n7-0-119   0/1     Completed   0          57m
ajbv88z56dwcpksp4cs8-n7-0-12    0/1     Completed   0          78m
```
We also have tons in a `waiting` state, but they are all completed.
d
@Vijay Saravana it sounds like in @Stephen's case the issue is the UI updating rather than paused or halted task execution. Can you confirm this? Are downstream tasks being executed / completed? Regardless, this is an issue with the backend; I'm just trying to scope whether it has to do with actual execution or with status reporting. Thanks!
Also, to both of you: what version of FlytePropeller are you running?
s
On our side, we're on our fork of `v1.1.0`.
d
Oh perfect, this would not be a versioning issue then.
k
Cc @Jason Porter / @eugene jahn, can we track this as a UI issue?
๐Ÿ‘ 1
d
I suspect this is an issue with map task reporting / recording subtask phase transitions between FlytePropeller and FlyteAdmin.
j
Okay we'll look into that - tracking here: https://github.com/flyteorg/flyteconsole/issues/544
v
@Dan Rammer (hamersaw) We are using `flytepropeller-v1.1.12`.
When the node dies, the pod stays in a "Terminating" state forever (it never actually finishes terminating or errors out). This causes the map task's subtask to think it is still "running" forever (instead of retrying).
cc: @Alex Bain @varsha Parthasarathy @Alex Pozimenko
d
@Vijay Saravana can you do a `kubectl get pod FOO -o yaml` on the pod? I'm interested in why the pod is not terminating: is there a finalizer? What is the error?
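For reference, a quicker check than reading the whole YAML might be something like this (a sketch; `FOO` and the `dev` namespace are placeholders for the actual pod and namespace):
```bash
# Print any finalizers still attached to the pod (FOO and dev are placeholders).
kubectl get pod FOO -n dev -o jsonpath='{.metadata.finalizers}'

# Print the deletion timestamp; if it is set, k8s is waiting on finalizers before deleting the pod.
kubectl get pod FOO -n dev -o jsonpath='{.metadata.deletionTimestamp}'
```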
v
I manually aborted the map workflow from the UI since it was taking too much time, so I am unable to run this command now.
I will be on the lookout for this to happen again and will share the output here.
@Dan Rammer (hamersaw) Note: this is observed when we run the subtasks on SPOT instances. We have not encountered it with on-demand instances. Does it have something to do with SPOT instances timing out / becoming unavailable for long, beefy tasks?
d
Oh interesting. Please post again if you see this issue; any more information would certainly help. In the meantime I will look into this. Given the way things work internally, nothing in map tasks should be handled differently for spot instances than for regular k8s tasks.
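If you catch it again, the pod's recent events around the preemption could also help narrow this down; a minimal sketch, again with `FOO` and `dev` as placeholders:
```bash
# List events for the stuck pod, ordered by most recent timestamp.
kubectl get events -n dev --field-selector involvedObject.name=FOO --sort-by=.lastTimestamp
```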
๐Ÿ‘ 1
v
@Dan Rammer (hamersaw) I was able to reproduce the issue. Please see the output of `kubectl get pod FOO -o yaml` below:
```
vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pod a8mwjq5z94p55fxhk9zl-n2-0-27 -n dev -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    flyte.lyft.com/deployment: flyte-l5
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2022-07-19T05:47:08Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-07-19T05:59:45Z"
  finalizers:
  - flyte/array
  labels:
    domain: dev
    execution-id: a8mwjq5z94p55fxhk9zl
    interruptible: "false"
    manager: avora
    node-id: n2
    owner-email: mtoledo
    owner-name: mtoledo
    platform: flyte
    project: avdelorean
    shard-key: "21"
    task-name: src-backend-delorean-delorean-map-base-mapper-run-map-task-0
    team: compute-infra
    workflow-name: src-planning-lib-prediction-metrics-prediction-metrics-processo
  name: a8mwjq5z94p55fxhk9zl-n2-0-27
  namespace: dev
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a8mwjq5z94p55fxhk9zl
    uid: 0d0a8d7c-f935-437c-b339-d003c7643827
  resourceVersion: "9478565284"
  uid: 78bb2022-c7d6-4f47-9832-d12656cbdb2c
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: l5.lyft.com/pool
            operator: In
            values:
            - eks-pdx-pool-gpu
  containers:
  - args:
    - pyflyte-map-execute
    - --inputs
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/inputs.pb
    - --output-prefix
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/0
    - --raw-output-data-prefix
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0
    - --checkpoint-path
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - src.backend.delorean.delorean_map_base
    - task-name
    - run_map_task
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: avdelorean:dev:src.planning.lib.prediction.metrics.prediction_metrics_processor_wfe_map_class.PredictionMetricsProcessorMapWorkflowPerfTest
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: a8mwjq5z94p55fxhk9zl
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: dev
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_TASK_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: FLYTE_INTERNAL_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: KUBERNETES_REQUEST_TIMEOUT
      value: "100000"
    - name: L5_BASE_DOMAIN
      value: l5.woven-planet.tech
    - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
      value: "20"
    - name: AWS_METADATA_SERVICE_TIMEOUT
      value: "5"
    - name: FLYTE_STATSD_HOST
      value: flyte-telegraf.infrastructure
    - name: KUBERNETES_CLUSTER_NAME
      value: pdx
    - name: FLYTE_K8S_ARRAY_INDEX
      value: "27"
    - name: BATCH_JOB_ARRAY_INDEX_VAR_NAME
      value: FLYTE_K8S_ARRAY_INDEX
    - name: L5_DATACENTER
      value: pdx
    - name: L5_ENVIRONMENT
      value: pdx
    - name: RUNTIME_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: RUNTIME_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: RUNTIME_NODE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: L5_NAMESPACE
      value: dev
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imagePullPolicy: IfNotPresent
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    resources:
      limits:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c459j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-162-107-6.us-west-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: flyte-scheduler
  securityContext:
    fsGroup: 65534
  serviceAccount: avdelorean-dev
  serviceAccountName: avdelorean-dev
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: lyft.com/gpu
    operator: Equal
    value: dedicated
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  volumes:
  - name: kube-api-access-c459j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:54:40Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:37Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://e07e4f0cec8265cd15c57833782c305e5581c9d463a97af8307ce3c40bd2c324
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imageID: docker-pullable://ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test@sha256:172ecf248838b1ec88e520528f0125451043769fb31c26d0bfc55057c98afabf
    lastState: {}
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-19T05:51:37Z"
  hostIP: 10.162.107.6
  phase: Running
  podIP: 10.162.72.241
  podIPs:
  - ip: 10.162.72.241
  qosClass: Guaranteed
  startTime: "2022-07-19T05:51:35Z"
```
List of pods:
```
vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pods -n dev | grep a8mwjq5z94p55fxhk9zl
a8mwjq5z94p55fxhk9zl-n2-0-0              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-1              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-11             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-13             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-19             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-2              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-24             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-27             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-3              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-30             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-31             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-34             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-36             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-37             0/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-4              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-43             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-49             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-5              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-50             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-53             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-54             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-56             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-57             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-6              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-7              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-72             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-76             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-8              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-82             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-86             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-87             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-9              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-91             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-92             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-94             0/1     Completed          0          9h
```
d
Perfect, thanks so much. My first intuition is that this is an issue with finalizers. It looks like the `flyte/array` finalizer we inject still exists, which would stop the Pod from being deleted. This would explain why the issue does not occur in regular k8s tasks (where the array finalizer is not injected). However, it was my understanding that preemptible instances (i.e. SPOT) didn't really care about finalizers and just deleted the Pod if they wanted to, which is how we are currently detecting them.
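As a stopgap, clearing the lingering finalizer should let k8s finish deleting a stuck Pod. This is a workaround sketch only, using the stuck pod from the output above; it unblocks the Pod but does not make FlytePropeller record the subtask as failed or retried:
```bash
# Remove the lingering flyte/array finalizer so the Pod deletion can complete.
# Workaround only; the subtask phase-reporting issue still needs a real fix.
kubectl patch pod a8mwjq5z94p55fxhk9zl-n2-0-27 -n dev --type=merge -p '{"metadata":{"finalizers":null}}'
```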
v
Great, thanks! A fix here would be great because this is a blocker as we are replacing our Spark workflows with Flyte map tasks. If there is a GH issue for this, please share the link here.
d
I don't think there is an issue yet. Would you mind filing one? Otherwise I can; regardless, I will pick it up right away.
v
🙌 1
d
Hey @Vijay Saravana, taking a little deeper look into this. In the issue you mentioned changing the AWS autoscaling from on-demand instances to SPOT instances during task execution. I don't have a ton of experience with this, so some clarification would help. Specifically, when you make this transition in the k8s cluster, it sounds like the autoscaler will attempt to delete all of the pods (currently running on on-demand instances) and then restart them on SPOT instances. Am I understanding this correctly? You also mentioned that this seems to work with other kinds of tasks in Flyte?
v
@Dan Rammer (hamersaw) That is one way I was able to reproduce the issue. I have also seen this happen just on SPOT instances for meaty GPU tasks. From my observations, it happens when a failing/straggler pod needs to restart: the pod state does not get updated and the retry does not happen (for the retry value specified in the map task).
d
Hi @Vijay Saravana, I dove into this a little bit and left some comments on the GitHub issue to help maintain history. If you would be willing, we may need to meet up to discuss this further.
v
Thanks for your comments. Regarding the last comment, do you mean we need to set the `interruptible` flag for the map task? Also, yes, we can meet to discuss this.
d
Hey @Alex Bain @Vijay Saravana, this should be fixed in FlytePropeller v1.1.21. Let me know if you run into more issues!
v
Thank you @Dan Rammer (hamersaw). Should we upgrade any other components, or just FlytePropeller?
d
There shouldn't be any backwards-compatibility issues if you are running v1+ of Flyte everywhere.
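Depending on how you deploy, bumping just the propeller image could look something like the sketch below; the `flyte` namespace and the deployment/container names are assumptions, so adjust to your Helm values or manifests:
```bash
# Point the FlytePropeller deployment at the fixed release (names are assumed defaults).
kubectl -n flyte set image deployment/flytepropeller flytepropeller=cr.flyte.org/flyteorg/flytepropeller:v1.1.21
```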
v
Ok! Could you please confirm that FlytePropeller v1.1.21 is prod-ready? @Dan Rammer (hamersaw)
a
Hi @Dan Rammer (hamersaw) gentle ping ^^^
d
Sorry, this somehow got buried. It should be production ready. What version are you currently running? I can take a quick look at the delta changes and make sure nothing is problematic.
v
We are running FlytePropeller v1.1.12
d
@Vijay Saravana from the GitHub issue it sounds like you already updated FlytePropeller and this is resolved?