# flyte-deployment
b
Looking at the propeller prometheus metrics, what's the best way to capture failures/errors by plugin node? Is there a metric for that?
k
This should work right?
b
did you mean to share a link?
b
i will look at this tomorrow, thanks! Not sure if we forward these currently (maybe we missed them)
k
ya
b
@Ketan (kumare3) This workflow has no failures but we are seeing some failures on the metric, is there some bug or did we misinterpret the metric?
k
Seems weird, can you do a rate on the counter
b
message has been deleted
k
PromQL is weird. Check how to interpret rate. This seems to be zero
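something like this is what I mean, as a rough sketch (the exact series name and labels depend on how you scrape propeller):
Copy code
# per-workflow failure rate over the last 5m, using the summary's _count series
sum by (wf) (
  rate(flyte:propeller:all:workflow:failure_duration_ms_count[5m])
)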
b
hm ok
k
Also, it's morning here.. will talk later in the day
👍 1
b
sure. for some other wfs it's 0 which is a bit weird
we see this when scraping flytepropeller
Copy code
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 0
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 0
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 0
flyte:propeller:all:workflow:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 0
flyte:propeller:all:workflow:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
Does 4376 make sense?
also seeing some slightly different numbers on the event recording metrics
Copy code
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 4
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 6
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 6
flyte:propeller:all:workflow:event_recording:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 54815
flyte:propeller:all:workflow:event_recording:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
we found something, i will DM you
maybe it's better here
so we found something here that might explain the wrong metrics: First we checked the logs for the specific wf and found something fishy
Copy code
kubectl -n flyte logs  flytepropeller | grep com.spotify.data.flytecanary.FlyteCanaryWorkflow
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:1012","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Node not yet started, will not finalize","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"workflow_event_recorder.go:69","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Failed to record workflow event [execution_id:\u003cproject:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647441372 nanos:95751514 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [15069/50] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\\\"flyte-canary\\\" domain:\\\"production\\\" name:\\\"tvy2szcxi7bcnzg36oef\\\" ]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:351","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]]","severity":"WARNING","timestamp":"2022-03-16T14:36:12Z"}
notice the 15069/50 retries. Then we checked the resources for
tvy2szcxi7bcnzg36oef
and found it (together with some other ones that are old as well)
Copy code
kubectl -n flyte-canary-production get flyteworkflows
NAME                   AGE
br5o45dnhdjfmfbbr7ts   6m30s
brrq4doftnbbiflu4zvf   26m
fkpnyoqmal5aejlfbsgr   6m16s
hckqfx2umxjabzcvoqgf   54d
holnhktw7fvhy3lkpsrh   16m
mueszlcebofasnowxmwx   36m
nstoy3mlfnffkth7y5np   16m
ojpzr2ctr6fc5neeir5k   36m
p6lvppusayncj5c7b33i   26m
qrivhl7rztrcqviw6py7   54d
tq2j6kkisvrfgroyt4sr   20d
tvy2szcxi7bcnzg36oef   20d
vbjdy2wshxngbdlrlahz   34m
vnlhr26tdt5da3pqcqwn   16m
ydkgy4krdnbgajo4y3ba   54d
fetching the specific resource
Copy code
kubectl -n flyte-canary-production get flyteworkflows tvy2szcxi7bcnzg36oef -o yaml


  failedAttempts: 15086
  message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
    by: ExecutionNotFound: The execution that the event belongs to does not exist,
    caused by [rpc error: code = NotFound desc = missing entity of type execution
    with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
    ]'
  phase: 0
So we now believe that these flyteworkflow resources were applied to the k8s cluster during a short incident that we had, and they've been stuck like that since then (3w ago). We realise now that there are other errors on workflows/namespaces that we missed before and that keep retrying. One of them:
RuntimeExecutionError: max number of system retry attempts [224267/50] exhausted
... Are we expecting that flyte should somehow recycle these crds?
it seems that during the incident some CRDs got created but the executions did not start.
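a quick (untested) way to surface the other affected workflows should be grepping the propeller logs for the retry-exhausted message and counting per workflow, something like:
Copy code
kubectl -n flyte logs flytepropeller \
  | grep 'max number of system retry attempts' \
  | jq -r '.data.wf' | sort | uniq -c | sort -rn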
k
hmm this is interesting that they are still hanging around
cc @Dan Rammer (hamersaw) / @Haytham Abuelfutuh
👀 1
d
Hey @Babis Kiosidis, I think propeller is trying to report an event that the workflow has failed, the flyteadmin can't find the workflow, that causes a failure, and we get an endless loop.
We had put in this fix which should handle this case. I think we may have missed a corner case here.
Can you confirm that flyteadmin doesn't have the execution? should be able to just check the db. Also, if it's not too large can you dump the flyteworkflow crd? having phases, etc would be helpful in debugging this.
b
Copy code
acceptedAt: "2022-02-23T17:08:16Z"
apiVersion: flyte.lyft.com/v1alpha1
executionConfig:
  MaxParallelism: 0
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: "8"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: 40Gi
      Storage: 4Gi
    Requests:
      CPU: "1"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: 1Gi
      Storage: 2Gi
executionId:
  domain: production
  name: tvy2szcxi7bcnzg36oef
  project: flyte-canary
inputs:
  literals:
    styx_parameter:
      scalar:
        primitive:
          datetime: "2022-02-23T17:00:00Z"
kind: FlyteWorkflow
metadata:
  annotations:
    ...
  creationTimestamp: "2022-02-23T17:08:17Z"
  generation: 15189
  labels:
    ...
  name: tvy2szcxi7bcnzg36oef
  namespace: flyte-canary-production
  resourceVersion: "796306095"
  uid: b5b58ac2-3565-4896-8c56-5c6cea4d3d7b
node-defaults: {}
rawOutputDataConfig: {}
securityContext: {}
spec:
  connections:
    say-hello:
    - end-node
    start-node:
    - say-hello
  edges:
    downstream:
      say-hello:
      - end-node
      start-node:
      - say-hello
    upstream:
      end-node:
      - say-hello
      say-hello:
      - start-node
  id: flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow
  nodes:
    end-node:
      id: end-node
      inputBindings:
      - binding:
          promise:
            nodeId: say-hello
            var: greet
        var: greet
      kind: end
      resources: {}
    say-hello:
      id: say-hello
      inputBindings:
      - binding:
          scalar:
            primitive:
              stringValue: World
        var: name
      kind: task
      resources:
        limits:
          cpu: "1"
          memory: 1Gi
        requests:
          cpu: "1"
          memory: 1Gi
      retry:
        minAttempts: 1
      task: 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
        version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
    start-node:
      id: start-node
      kind: start
      resources: {}
  outputBindings:
  - binding:
      promise:
        nodeId: say-hello
        var: greet
    var: greet
  outputs:
    variables:
      greet:
        type:
          simple: STRING
status:
  failedAttempts: 15188
  message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
    by: ExecutionNotFound: The execution that the event belongs to does not exist,
    caused by [rpc error: code = NotFound desc = missing entity of type execution
    with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
    ]'
  phase: 0
tasks:
  ? 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
    version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
  : container:
      args:
      - jflyte
      - execute
      - --task
      - com.spotify.data.flytecanary.FlyteCanaryTask
      - --inputs
      - '{{.input}}'
      - --outputPrefix
      - '{{.outputPrefix}}'
      - --taskTemplatePath
      - '{{.taskTemplatePath}}'
      image: ...
      resources:
        limits:
        - name: CPU
          value: "1"
        - name: MEMORY
          value: 1Gi
        requests:
        - name: CPU
          value: "1"
        - name: MEMORY
          value: 1Gi
    custom:
      jflyte:
        artifacts:
        - location: ... a bunch of jars
    id:
      domain: production
      name: com.spotify.data.flytecanary.FlyteCanaryTask
      project: flyte-canary
      resourceType: TASK
      version: 6345c4e4-6fa2-4403-81a5-cd6010f3510a
    interface:
      inputs:
        variables:
          name:
            type:
              simple: STRING
      outputs:
        variables:
          greet:
            type:
              simple: STRING
    metadata:
      retries: {}
      runtime:
        flavor: java
        type: FLYTE_SDK
        version: 0.0.1
    type: java-task
workflowMeta:
  eventVersion: 1
d
OK, I can reproduce this exactly:
Copy code
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:1012","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Node not yet started, will not finalize","ts":"2022-03-16T13:47:43-05:00"}
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"workflow_event_recorder.go:69","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Failed to record workflow event [execution_id:\u003cproject:\"flytesnacks\" domain:\"development\" name:\"rf3qjeb4w1\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647456463 nanos:821780010 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flytesnacks:development:core.flyte_basics.hello_world.my_wf] failed. RuntimeExecutionError: max number of system retry attempts [483/10] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]","ts":"2022-03-16T13:47:43-05:00"}
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:351","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"warning","msg":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]]","ts":"2022-03-16T13:47:43-05:00"}
here is what I'm guessing happened:
1. flyteworkflow crd was created (admin launched a new workflow)
2. flytepropeller didn't pick it up yet (or flytepropeller wasn't running)
3. deleted from the admin db
4. started flytepropeller
so effectively we orphaned a CRD in phase 0 (WorkflowReady) that will never be in admin DB. this is troublesome because we're specifically allowing ExecutionNotFound on the WorkflowReady phase to handle the race condition between when admin creates the CRD and writes to the DB.
for a short-term fix, you can delete the related CRDs; this is safe because they don't exist in admin either. we should probably figure out a more elegant fix.
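for the one you pasted, that should be something along the lines of:
Copy code
kubectl -n flyte-canary-production delete flyteworkflow tvy2szcxi7bcnzg36oef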
b
from the executions table:
Copy code
select * from executions where execution_name ='tvy2szcxi7bcnzg36oef';        

id | created_at | updated_at | deleted_at | execution_project | execution_domain | execution_name | launch_plan_id | workflow_id | task_id | phase | closure | spec | started_at | execution_created_at | execution_updated_at | duration | abort_cause | mode | source_execution_id | parent_node_execution_id | cluster | inputs_uri | user_inputs_uri | error_kind | error_code | user | state
----+------------+------------+------------+-------------------+------------------+----------------+----------------+-------------+---------+-------+---------+------+------------+----------------------+----------------------+----------+-------------+------+---------------------+--------------------------+---------+------------+-----------------+------------+------------+------+-------
(0 rows)
d
perfect, that's exactly what we were expecting to see.
b
I can look into cleaning them up tomorrow, thanks for looking into this
d
no problem, let me know if you run into issues. do you want to file an issue? i'd be happy to pick it up to track the flytepropeller fix so we can automatically handle this case.
b
Great I can do it tomorrow
🙌 1
k
interesting find
b
I came up with this, do you think that the failedAttempts > 50 and phase == 0 checks make sense, or is there any risk of removing something useful?
Copy code
kubectl get flyteworkflows --all-namespaces -o json > /tmp/all-namespaces
cat /tmp/all-namespaces | jq -r '.items[] | select(.status.phase==0 and .status.failedAttempts > 50) | "kubectl delete flyteworkflow -n \(.executionId.project)-\(.executionId.domain) \(.executionId.name)"' > /tmp/kubectl-delete-stuck-workflows
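and then it should just be a matter of reviewing the generated file and running it, e.g.:
Copy code
sh /tmp/kubectl-delete-stuck-workflows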
🙌 1
d
I think that could be a great idea, I was actually thinking the same thing. Just need to run some tests to make sure there is adequate backoff between each attempt.
b
I actually ran it since it wasn't that many resources, approx. 150, that were stuck. I am going to monitor the metrics and logs and get back to you if something looks 🐟
👍 1
d
Oh yeah, I think for removing the CRDs from k8s that looks great. I was thinking we could make that addition to FlytePropeller as well so that these cases are automatically detected and handled there.
b
gotcha! i was looking at opening a ticket now. Is it still necessary?
d
Yeah, I think it is still a good idea. If it ends up being something that we think is a little too specific to implement at least there can be a record of the discussion.
b
d
Perfect, thanks so much!
b
It's interesting how much these ghost workflows polluted our metrics