swift-animal-75798
03/15/2022, 1:26 PM
freezing-airport-6809
swift-animal-75798
03/15/2022, 1:49 PM
freezing-airport-6809
swift-animal-75798
03/15/2022, 3:57 PM
freezing-airport-6809
swift-animal-75798
03/16/2022, 1:05 PM
freezing-airport-6809
swift-animal-75798
03/16/2022, 1:44 PM
freezing-airport-6809
swift-animal-75798
03/16/2022, 1:50 PM
freezing-airport-6809
swift-animal-75798
03/16/2022, 1:52 PM
swift-animal-75798
03/16/2022, 2:29 PM
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 0
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 0
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 0
flyte:propeller:all:workflow:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 0
flyte:propeller:all:workflow:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
Does 4376 make sense?
swift-animal-75798
03/16/2022, 2:35 PM
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 4
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 6
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 6
flyte:propeller:all:workflow:event_recording:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 54815
flyte:propeller:all:workflow:event_recording:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
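A quick sanity check of the numbers quoted above (a rough sketch; the interpretation that every failed workflow attempt also produced one failed event-recording call is an inference from the two matching `_count` values, not something the metrics prove on their own):

```python
# Values copied from the Prometheus summaries above.
failure_count = 4376            # failure_duration_ms_count
event_recording_sum_ms = 54815  # event_recording:failure_duration_ms_sum

# Both summaries report 4376 samples, suggesting each workflow failure
# attempt also produced exactly one failed event-recording call.
avg_ms = event_recording_sum_ms / failure_count
print(round(avg_ms, 1))  # prints 12.5
```

The ~12.5 ms average is above the quoted p50/p99 of 4-6 ms, which would be consistent with a handful of much slower outlier calls dominating the sum.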
swift-animal-75798
03/16/2022, 2:45 PM
swift-animal-75798
03/16/2022, 3:11 PM
swift-animal-75798
03/16/2022, 3:14 PM
kubectl -n flyte logs flytepropeller | grep com.spotify.data.flytecanary.FlyteCanaryWorkflow
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:1012","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Node not yet started, will not finalize","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"workflow_event_recorder.go:69","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Failed to record workflow event [execution_id:\u003cproject:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647441372 nanos:95751514 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [15069/50] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\\\"flyte-canary\\\" domain:\\\"production\\\" name:\\\"tvy2szcxi7bcnzg36oef\\\" ]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:351","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]]","severity":"WARNING","timestamp":"2022-03-16T14:36:12Z"}
Notice the 15069/50 retries: the failed-attempt counter has blown far past the configured cap of 50.
We then checked the resources for tvy2szcxi7bcnzg36oef and found it, along with some other equally old ones:
kubectl -n flyte-canary-production get flyteworkflows
NAME AGE
br5o45dnhdjfmfbbr7ts 6m30s
brrq4doftnbbiflu4zvf 26m
fkpnyoqmal5aejlfbsgr 6m16s
hckqfx2umxjabzcvoqgf 54d
holnhktw7fvhy3lkpsrh 16m
mueszlcebofasnowxmwx 36m
nstoy3mlfnffkth7y5np 16m
ojpzr2ctr6fc5neeir5k 36m
p6lvppusayncj5c7b33i 26m
qrivhl7rztrcqviw6py7 54d
tq2j6kkisvrfgroyt4sr 20d
tvy2szcxi7bcnzg36oef 20d
vbjdy2wshxngbdlrlahz 34m
vnlhr26tdt5da3pqcqwn 16m
ydkgy4krdnbgajo4y3ba 54d
Fetching the specific resource:
kubectl -n flyte-canary-production get flyteworkflows tvy2szcxi7bcnzg36oef -o yaml
failedAttempts: 15086
message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
  by: ExecutionNotFound: The execution that the event belongs to does not exist,
  caused by [rpc error: code = NotFound desc = missing entity of type execution
  with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
  ]'
phase: 0
So we now believe that these flyteworkflow resources were applied on the k8s cluster during a short incident we had about three weeks ago, and they have been stuck like that ever since. We now realise there are similar errors on other workflows/namespaces that we missed before, and they keep retrying.
One of them
RuntimeExecutionError: max number of system retry attempts [224267/50] e
...
Are we expecting flyte to somehow recycle these CRDs?
swift-animal-75798
03/16/2022, 3:20 PM
freezing-airport-6809
freezing-airport-6809
hallowed-mouse-14616
03/16/2022, 4:55 PM
hallowed-mouse-14616
03/16/2022, 4:56 PM
hallowed-mouse-14616
03/16/2022, 4:57 PM
swift-animal-75798
03/16/2022, 6:35 PM
acceptedAt: "2022-02-23T17:08:16Z"
apiVersion: flyte.lyft.com/v1alpha1
executionConfig:
  MaxParallelism: 0
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: "8"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: 40Gi
      Storage: 4Gi
    Requests:
      CPU: "1"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: 1Gi
      Storage: 2Gi
executionId:
  domain: production
  name: tvy2szcxi7bcnzg36oef
  project: flyte-canary
inputs:
  literals:
    styx_parameter:
      scalar:
        primitive:
          datetime: "2022-02-23T17:00:00Z"
kind: FlyteWorkflow
metadata:
  annotations:
    ...
  creationTimestamp: "2022-02-23T17:08:17Z"
  generation: 15189
  labels:
    ...
  name: tvy2szcxi7bcnzg36oef
  namespace: flyte-canary-production
  resourceVersion: "796306095"
  uid: b5b58ac2-3565-4896-8c56-5c6cea4d3d7b
node-defaults: {}
rawOutputDataConfig: {}
securityContext: {}
spec:
  connections:
    say-hello:
    - end-node
    start-node:
    - say-hello
  edges:
    downstream:
      say-hello:
      - end-node
      start-node:
      - say-hello
    upstream:
      end-node:
      - say-hello
      say-hello:
      - start-node
  id: flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow
  nodes:
    end-node:
      id: end-node
      inputBindings:
      - binding:
          promise:
            nodeId: say-hello
            var: greet
        var: greet
      kind: end
      resources: {}
    say-hello:
      id: say-hello
      inputBindings:
      - binding:
          scalar:
            primitive:
              stringValue: World
        var: name
      kind: task
      resources:
        limits:
          cpu: "1"
          memory: 1Gi
        requests:
          cpu: "1"
          memory: 1Gi
      retry:
        minAttempts: 1
      task: 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
        version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
    start-node:
      id: start-node
      kind: start
      resources: {}
  outputBindings:
  - binding:
      promise:
        nodeId: say-hello
        var: greet
    var: greet
  outputs:
    variables:
      greet:
        type:
          simple: STRING
status:
  failedAttempts: 15188
  message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
    by: ExecutionNotFound: The execution that the event belongs to does not exist,
    caused by [rpc error: code = NotFound desc = missing entity of type execution
    with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
    ]'
  phase: 0
tasks:
  ? 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
    version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
  : container:
      args:
      - jflyte
      - execute
      - --task
      - com.spotify.data.flytecanary.FlyteCanaryTask
      - --inputs
      - '{{.input}}'
      - --outputPrefix
      - '{{.outputPrefix}}'
      - --taskTemplatePath
      - '{{.taskTemplatePath}}'
      image: ...
      resources:
        limits:
        - name: CPU
          value: "1"
        - name: MEMORY
          value: 1Gi
        requests:
        - name: CPU
          value: "1"
        - name: MEMORY
          value: 1Gi
    custom:
      jflyte:
        artifacts:
        - location: ... a bunch of jars
    id:
      domain: production
      name: com.spotify.data.flytecanary.FlyteCanaryTask
      project: flyte-canary
      resourceType: TASK
      version: 6345c4e4-6fa2-4403-81a5-cd6010f3510a
    interface:
      inputs:
        variables:
          name:
            type:
              simple: STRING
      outputs:
        variables:
          greet:
            type:
              simple: STRING
    metadata:
      retries: {}
      runtime:
        flavor: java
        type: FLYTE_SDK
        version: 0.0.1
    type: java-task
workflowMeta:
  eventVersion: 1
hallowed-mouse-14616
03/16/2022, 6:48 PM
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:1012","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Node not yet started, will not finalize","ts":"2022-03-16T13:47:43-05:00"}
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"workflow_event_recorder.go:69","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Failed to record workflow event [execution_id:\u003cproject:\"flytesnacks\" domain:\"development\" name:\"rf3qjeb4w1\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647456463 nanos:821780010 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flytesnacks:development:core.flyte_basics.hello_world.my_wf] failed. RuntimeExecutionError: max number of system retry attempts [483/10] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]","ts":"2022-03-16T13:47:43-05:00"}
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:351","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"warning","msg":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]]","ts":"2022-03-16T13:47:43-05:00"}
hallowed-mouse-14616
03/16/2022, 6:49 PM
hallowed-mouse-14616
03/16/2022, 6:50 PM
hallowed-mouse-14616
03/16/2022, 6:51 PM
swift-animal-75798
03/16/2022, 6:57 PM
select * from executions where execution_name ='tvy2szcxi7bcnzg36oef';
id | created_at | updated_at | deleted_at | execution_project | execution_domain | execution_name | launch_plan_id | workflow_id | task_id | phase | closure | spec | started_at | execution_created_at | execution_updated_at | duration | abort_cause | mode | source_execution_id | parent_node_execution_id | cluster | inputs_uri | user_inputs_uri | error_kind | error_code | user | state
----+------------+------------+------------+-------------------+------------------+----------------+----------------+-------------+---------+-------+---------+------+------------+----------------------+----------------------+----------+-------------+------+---------------------+--------------------------+---------+------------+-----------------+------------+------------+------+-------
(0 rows)
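The empty result matches flyteadmin's ExecutionNotFound error: there is no row for this execution, so every event propeller publishes is rejected. A minimal in-memory sketch of that lookup (table and columns abridged from the query above; this is an illustration, not flyteadmin's actual schema or code):

```python
import sqlite3

# Simplified stand-in for flyteadmin's executions table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE executions (
        id INTEGER PRIMARY KEY,
        execution_project TEXT,
        execution_domain TEXT,
        execution_name TEXT,
        phase TEXT
    )
    """
)

# No row was ever inserted for tvy2szcxi7bcnzg36oef: the CR reached the
# cluster during the incident, but admin never registered the execution.
rows = conn.execute(
    "SELECT * FROM executions WHERE execution_name = ?",
    ("tvy2szcxi7bcnzg36oef",),
).fetchall()
print(len(rows))  # prints 0 -> admin answers NotFound / ExecutionNotFound
```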
hallowed-mouse-14616
03/16/2022, 6:58 PM
swift-animal-75798
03/16/2022, 6:58 PM
hallowed-mouse-14616
03/16/2022, 7:05 PM
swift-animal-75798
03/16/2022, 7:09 PM
freezing-airport-6809
swift-animal-75798
03/17/2022, 5:18 PM
kubectl get flyteworkflows --all-namespaces -o json > /tmp/all-namespaces
cat /tmp/all-namespaces | jq -r '.items[] | select(.status.phase==0 and .status.failedAttempts > 50) | "kubectl delete flyteworkflow -n \(.executionId.project)-\(.executionId.domain) \(.executionId.name)"' > /tmp/kubectl-delete-stuck-workflows
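The same filter can be sanity-checked offline. A hypothetical Python equivalent of the jq predicate, run against the /tmp/all-namespaces dump (field names taken from the CR shown earlier; the function name and threshold default are made up for illustration):

```python
import json

def stuck_workflow_deletes(dump_path, max_failed=50):
    """Emit kubectl delete commands for FlyteWorkflow CRs still in phase 0
    that have blown past the system retry cap, mirroring the jq filter."""
    with open(dump_path) as f:
        items = json.load(f)["items"]
    cmds = []
    for wf in items:
        status = wf.get("status", {})
        if status.get("phase") == 0 and status.get("failedAttempts", 0) > max_failed:
            eid = wf["executionId"]
            cmds.append(
                f"kubectl delete flyteworkflow "
                f"-n {eid['project']}-{eid['domain']} {eid['name']}"
            )
    return cmds
```

Reviewing the generated file before executing it keeps a typo in the predicate from deleting healthy workflows.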
hallowed-mouse-14616
03/17/2022, 5:34 PM
swift-animal-75798
03/17/2022, 5:41 PM
hallowed-mouse-14616
03/17/2022, 5:42 PM
swift-animal-75798
03/17/2022, 5:51 PM
hallowed-mouse-14616
03/17/2022, 5:53 PM
swift-animal-75798
03/17/2022, 6:15 PM
hallowed-mouse-14616
03/17/2022, 6:18 PM
swift-animal-75798
03/18/2022, 7:38 AM