Vijay Saravana
07/16/2022, 1:26 AM
terminating. What could be the reason for this? Did the map task not complete because of the one pod?
Ketan (kumare3)
Stephen
07/18/2022, 10:57 AM
completed but the UI still shows some with a Queued status, so I guess it's a visual bug?
❯ k get pods | grep ajbv88z56dwcpksp4cs8-n7-0-
ajbv88z56dwcpksp4cs8-n7-0-0 0/1 Completed 0 78m
ajbv88z56dwcpksp4cs8-n7-0-1 0/1 Completed 0 78m
ajbv88z56dwcpksp4cs8-n7-0-10 0/1 Completed 0 78m
ajbv88z56dwcpksp4cs8-n7-0-100 0/1 Completed 0 60m
ajbv88z56dwcpksp4cs8-n7-0-101 0/1 Completed 0 65m
ajbv88z56dwcpksp4cs8-n7-0-102 0/1 Completed 0 60m
ajbv88z56dwcpksp4cs8-n7-0-103 0/1 Completed 0 70m
ajbv88z56dwcpksp4cs8-n7-0-104 0/1 Completed 0 70m
ajbv88z56dwcpksp4cs8-n7-0-105 0/1 Completed 0 65m
ajbv88z56dwcpksp4cs8-n7-0-106 0/1 Completed 0 65m
ajbv88z56dwcpksp4cs8-n7-0-107 0/1 Completed 0 65m
ajbv88z56dwcpksp4cs8-n7-0-108 0/1 Completed 0 65m
ajbv88z56dwcpksp4cs8-n7-0-109 0/1 Completed 0 60m
ajbv88z56dwcpksp4cs8-n7-0-11 0/1 Completed 0 78m
ajbv88z56dwcpksp4cs8-n7-0-110 0/1 Completed 0 59m
ajbv88z56dwcpksp4cs8-n7-0-111 0/1 Completed 0 59m
ajbv88z56dwcpksp4cs8-n7-0-112 0/1 Completed 0 59m
ajbv88z56dwcpksp4cs8-n7-0-113 0/1 Completed 0 59m
ajbv88z56dwcpksp4cs8-n7-0-114 0/1 Completed 0 63m
ajbv88z56dwcpksp4cs8-n7-0-115 0/1 Completed 0 67m
ajbv88z56dwcpksp4cs8-n7-0-116 0/1 Completed 0 58m
ajbv88z56dwcpksp4cs8-n7-0-117 0/1 Completed 0 58m
ajbv88z56dwcpksp4cs8-n7-0-118 0/1 Completed 0 58m
ajbv88z56dwcpksp4cs8-n7-0-119 0/1 Completed 0 57m
ajbv88z56dwcpksp4cs8-n7-0-12 0/1 Completed 0 78m
waiting state but they are all completed
Dan Rammer (hamersaw)
07/18/2022, 1:09 PM
Stephen
07/18/2022, 1:14 PM
v1.1.0
Dan Rammer (hamersaw)
07/18/2022, 1:15 PM
Ketan (kumare3)
Dan Rammer (hamersaw)
07/18/2022, 4:04 PM
Jason Porter
07/18/2022, 6:19 PM
Vijay Saravana
07/18/2022, 7:26 PM
flytepropeller-v1.1.12
Dan Rammer (hamersaw)
07/18/2022, 8:11 PM
kubectl get pod FOO -o yaml on the pod? I'm interested in why the pod is not terminating, is there a finalizer? what is the error?
Vijay Saravana
07/18/2022, 8:52 PM
Dan Rammer (hamersaw)
07/19/2022, 12:54 AM
Vijay Saravana
07/19/2022, 3:30 PM
kubectl get pod FOO -o yaml below:
vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pod a8mwjq5z94p55fxhk9zl-n2-0-27 -n dev -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    flyte.lyft.com/deployment: flyte-l5
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2022-07-19T05:47:08Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-07-19T05:59:45Z"
  finalizers:
  - flyte/array
  labels:
    domain: dev
    execution-id: a8mwjq5z94p55fxhk9zl
    interruptible: "false"
    manager: avora
    node-id: n2
    owner-email: mtoledo
    owner-name: mtoledo
    platform: flyte
    project: avdelorean
    shard-key: "21"
    task-name: src-backend-delorean-delorean-map-base-mapper-run-map-task-0
    team: compute-infra
    workflow-name: src-planning-lib-prediction-metrics-prediction-metrics-processo
  name: a8mwjq5z94p55fxhk9zl-n2-0-27
  namespace: dev
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a8mwjq5z94p55fxhk9zl
    uid: 0d0a8d7c-f935-437c-b339-d003c7643827
  resourceVersion: "9478565284"
  uid: 78bb2022-c7d6-4f47-9832-d12656cbdb2c
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: l5.lyft.com/pool
            operator: In
            values:
            - eks-pdx-pool-gpu
  containers:
  - args:
    - pyflyte-map-execute
    - --inputs
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/inputs.pb
    - --output-prefix
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/0
    - --raw-output-data-prefix
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0
    - --checkpoint-path
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - src.backend.delorean.delorean_map_base
    - task-name
    - run_map_task
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: avdelorean:dev:src.planning.lib.prediction.metrics.prediction_metrics_processor_wfe_map_class.PredictionMetricsProcessorMapWorkflowPerfTest
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: a8mwjq5z94p55fxhk9zl
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: dev
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_TASK_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: FLYTE_INTERNAL_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: KUBERNETES_REQUEST_TIMEOUT
      value: "100000"
    - name: L5_BASE_DOMAIN
      value: l5.woven-planet.tech
    - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
      value: "20"
    - name: AWS_METADATA_SERVICE_TIMEOUT
      value: "5"
    - name: FLYTE_STATSD_HOST
      value: flyte-telegraf.infrastructure
    - name: KUBERNETES_CLUSTER_NAME
      value: pdx
    - name: FLYTE_K8S_ARRAY_INDEX
      value: "27"
    - name: BATCH_JOB_ARRAY_INDEX_VAR_NAME
      value: FLYTE_K8S_ARRAY_INDEX
    - name: L5_DATACENTER
      value: pdx
    - name: L5_ENVIRONMENT
      value: pdx
    - name: RUNTIME_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: RUNTIME_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: RUNTIME_NODE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: L5_NAMESPACE
      value: dev
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imagePullPolicy: IfNotPresent
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    resources:
      limits:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c459j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-162-107-6.us-west-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: flyte-scheduler
  securityContext:
    fsGroup: 65534
  serviceAccount: avdelorean-dev
  serviceAccountName: avdelorean-dev
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: lyft.com/gpu
    operator: Equal
    value: dedicated
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  volumes:
  - name: kube-api-access-c459j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:54:40Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:37Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://e07e4f0cec8265cd15c57833782c305e5581c9d463a97af8307ce3c40bd2c324
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imageID: docker-pullable://ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test@sha256:172ecf248838b1ec88e520528f0125451043769fb31c26d0bfc55057c98afabf
    lastState: {}
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-19T05:51:37Z"
  hostIP: 10.162.107.6
  phase: Running
  podIP: 10.162.72.241
  podIPs:
  - ip: 10.162.72.241
  qosClass: Guaranteed
  startTime: "2022-07-19T05:51:35Z"
vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pods -n dev | grep a8mwjq5z94p55fxhk9zl
a8mwjq5z94p55fxhk9zl-n2-0-0 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-1 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-11 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-13 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-19 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-2 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-24 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-27 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-3 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-30 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-31 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-34 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-36 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-37 0/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-4 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-43 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-49 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-5 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-50 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-53 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-54 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-56 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-57 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-6 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-7 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-72 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-76 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-8 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-82 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-86 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-87 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-9 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-91 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-92 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-94 0/1 Completed 0 9h
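The pod YAML above has both a deletionTimestamp and a remaining flyte/array finalizer, and Kubernetes does not remove an object marked for deletion until its finalizers list is empty. A minimal Python sketch for spotting that combination across a namespace, assuming the input is the JSON list returned by kubectl get pods -n dev -o json. The function name stuck_terminating and the sample data are hypothetical, and the second sample pod's metadata is made up for illustration rather than taken from the cluster.

```python
import json


def stuck_terminating(pod_list: dict) -> list:
    """Return (name, finalizers) for pods that are marked for deletion
    (deletionTimestamp set) but still carry at least one finalizer."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        finalizers = meta.get("finalizers") or []
        if meta.get("deletionTimestamp") and finalizers:
            stuck.append((meta.get("name", ""), list(finalizers)))
    return stuck


# Trimmed sample: the first pod mirrors fields from the YAML above,
# the second pod's metadata is hypothetical.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "a8mwjq5z94p55fxhk9zl-n2-0-27",
                  "deletionTimestamp": "2022-07-19T05:59:45Z",
                  "finalizers": ["flyte/array"]}},
    {"metadata": {"name": "a8mwjq5z94p55fxhk9zl-n2-0-0"}}
  ]
}
""")

print(stuck_terminating(sample))
# prints [('a8mwjq5z94p55fxhk9zl-n2-0-27', ['flyte/array'])]
```

Against a live cluster, the same function could be fed json.load(sys.stdin) with the kubectl output piped in. It only reports which finalizers are still present, not why they were never removed.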
Dan Rammer (hamersaw)
07/19/2022, 3:34 PM
flyte/array finalizer we inject that still exists, which would stop the Pod from being deleted. This would explain why this issue is not occurring in regular k8s tasks (where the array finalizer would not be injected). However, it was my understanding that preemptible instances (i.e. SPOT) didn't really care about finalizers and just deleted the Pod if they wanted, which is how we are currently detecting them.
Vijay Saravana
07/19/2022, 3:42 PM
Dan Rammer (hamersaw)
07/19/2022, 3:42 PM
Vijay Saravana
07/19/2022, 4:27 PM
Dan Rammer (hamersaw)
07/20/2022, 2:06 AM
Vijay Saravana
07/20/2022, 4:47 AM
Dan Rammer (hamersaw)
07/21/2022, 3:35 PM
Vijay Saravana
07/21/2022, 3:56 PM
interruptible flag for the map task? Also, yes we can meet to discuss this.
Dan Rammer (hamersaw)
07/26/2022, 6:45 PM
Vijay Saravana
07/26/2022, 8:18 PM
Dan Rammer (hamersaw)
07/26/2022, 9:33 PM
Vijay Saravana
07/26/2022, 9:55 PM
Alex Pozimenko
07/27/2022, 6:22 PM
Dan Rammer (hamersaw)
07/28/2022, 3:39 PM
Vijay Saravana
07/28/2022, 6:54 PM
Dan Rammer (hamersaw)
07/28/2022, 6:59 PM