# flyte-deployment
b
Looking at the propeller prometheus metrics, what's the best way to capture failures/errors by plugin node? Is there a metric for that?
k
This should work right?
b
did you mean to share a link?
b
i will look at this tomorrow, thanks! Not sure if we forward these currently (maybe we missed them)
k
ya
b
@Ketan (kumare3) This workflow has no failures but we are seeing some failures on the metric, is there some bug or did we misinterpret the metric?
k
Seems weird, can you do a rate on the counter
b
message has been deleted
k
PromQL is weird. Check how to interpret rate. This seems to be zero
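something like this is what I mean, as a rough sketch (the exact series name and labels depend on how you scrape propeller):
Copy code
# per-workflow failure rate over the last 5m, using the summary's _count series
sum by (wf) (
  rate(flyte:propeller:all:workflow:failure_duration_ms_count[5m])
)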
b
hm ok
k
Also, it's morning here.. will talk later in the day
👍 1
b
sure. for some other wfs it's 0 which is a bit weird
we see this when scraping flytepropeller
Copy code
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 0
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 0
flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 0
flyte:propeller:all:workflow:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 0
flyte:propeller:all:workflow:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
Does 4376 make sense?
also seeing some slightly different numbers on the event recording metrics
Copy code
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 4
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 6
flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 6
flyte:propeller:all:workflow:event_recording:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 54815
flyte:propeller:all:workflow:event_recording:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
we found something, i will DM you
maybe it's better here
so we found something here that might explain the wrong metrics: First we checked the logs for the specific wf and found something fishy
Copy code
kubectl -n flyte logs  flytepropeller | grep com.spotify.data.flytecanary.FlyteCanaryWorkflow
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:1012","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Node not yet started, will not finalize","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"workflow_event_recorder.go:69","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Failed to record workflow event [execution_id:\u003cproject:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647441372 nanos:95751514 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [15069/50] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\\\"flyte-canary\\\" domain:\\\"production\\\" name:\\\"tvy2szcxi7bcnzg36oef\\\" ]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
{"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:351","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]]","severity":"WARNING","timestamp":"2022-03-16T14:36:12Z"}
notice the 15069/50 retries. Then we checked the resources for
tvy2szcxi7bcnzg36oef
and found it (together with some other ones that are old as well)
Copy code
kubectl -n flyte-canary-production get flyteworkflows
NAME                   AGE
br5o45dnhdjfmfbbr7ts   6m30s
brrq4doftnbbiflu4zvf   26m
fkpnyoqmal5aejlfbsgr   6m16s
hckqfx2umxjabzcvoqgf   54d
holnhktw7fvhy3lkpsrh   16m
mueszlcebofasnowxmwx   36m
nstoy3mlfnffkth7y5np   16m
ojpzr2ctr6fc5neeir5k   36m
p6lvppusayncj5c7b33i   26m
qrivhl7rztrcqviw6py7   54d
tq2j6kkisvrfgroyt4sr   20d
tvy2szcxi7bcnzg36oef   20d
vbjdy2wshxngbdlrlahz   34m
vnlhr26tdt5da3pqcqwn   16m
ydkgy4krdnbgajo4y3ba   54d
fetching the specific resource
Copy code
kubectl -n flyte-canary-production get flyteworkflows tvy2szcxi7bcnzg36oef -o yaml


  failedAttempts: 15086
  message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
    by: ExecutionNotFound: The execution that the event belongs to does not exist,
    caused by [rpc error: code = NotFound desc = missing entity of type execution
    with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
    ]'
  phase: 0
So we now believe that these flyteworkflow resources were applied to the k8s cluster during a short incident that we had, and they've been stuck like that since then (3w ago). We realise now that there are other errors on workflows/namespaces that we missed before and that keep retrying. One of them:
RuntimeExecutionError: max number of system retry attempts [224267/50] exhausted
... Are we expecting that flyte should somehow recycle these crds?
it seems that during the incident some CRDs got created but the executions did not start.
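a quick (untested) way to surface the other affected workflows should be grepping the propeller logs for the retry-exhausted message and counting per workflow, something like:
Copy code
kubectl -n flyte logs flytepropeller \
  | grep 'max number of system retry attempts' \
  | jq -r '.data.wf' | sort | uniq -c | sort -rn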
k
hmm this is interesting that they are still hanging around
cc @Dan Rammer (hamersaw) / @Haytham Abuelfutuh
👀 1
d
Hey @Babis Kiosidis, I think propeller is trying to report an event that the workflow has failed, the flyteadmin can't find the workflow, that causes a failure, and we get an endless loop.
We had put in this fix which should handle this case. I think we may have missed a corner case here.
Can you confirm that flyteadmin doesn't have the execution? should be able to just check the db. Also, if it's not too large can you dump the flyteworkflow crd? having phases, etc would be helpful in debugging this.
b
Copy code
acceptedAt: "2022-02-23T17:08:16Z"
apiVersion: flyte.lyft.com/v1alpha1
executionConfig:
  MaxParallelism: 0
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: "8"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: 40Gi
      Storage: 4Gi
    Requests:
      CPU: "1"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: 1Gi
      Storage: 2Gi
executionId:
  domain: production
  name: tvy2szcxi7bcnzg36oef
  project: flyte-canary
inputs:
  literals:
    styx_parameter:
      scalar:
        primitive:
          datetime: "2022-02-23T17:00:00Z"
kind: FlyteWorkflow
metadata:
  annotations:
    ...
  creationTimestamp: "2022-02-23T17:08:17Z"
  generation: 15189
  labels:
    ...
  name: tvy2szcxi7bcnzg36oef
  namespace: flyte-canary-production
  resourceVersion: "796306095"
  uid: b5b58ac2-3565-4896-8c56-5c6cea4d3d7b
node-defaults: {}
rawOutputDataConfig: {}
securityContext: {}
spec:
  connections:
    say-hello:
    - end-node
    start-node:
    - say-hello
  edges:
    downstream:
      say-hello:
      - end-node
      start-node:
      - say-hello
    upstream:
      end-node:
      - say-hello
      say-hello:
      - start-node
  id: flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow
  nodes:
    end-node:
      id: end-node
      inputBindings:
      - binding:
          promise:
            nodeId: say-hello
            var: greet
        var: greet
      kind: end
      resources: {}
    say-hello:
      id: say-hello
      inputBindings:
      - binding:
          scalar:
            primitive:
              stringValue: World
        var: name
      kind: task
      resources:
        limits:
          cpu: "1"
          memory: 1Gi
        requests:
          cpu: "1"
          memory: 1Gi
      retry:
        minAttempts: 1
      task: 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
        version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
    start-node:
      id: start-node
      kind: start
      resources: {}
  outputBindings:
  - binding:
      promise:
        nodeId: say-hello
        var: greet
    var: greet
  outputs:
    variables:
      greet:
        type:
          simple: STRING
status:
  failedAttempts: 15188
  message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
    by: ExecutionNotFound: The execution that the event belongs to does not exist,
    caused by [rpc error: code = NotFound desc = missing entity of type execution
    with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
    ]'
  phase: 0
tasks:
  ? 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
    version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
  : container:
      args:
      - jflyte
      - execute
      - --task
      - com.spotify.data.flytecanary.FlyteCanaryTask
      - --inputs
      - '{{.input}}'
      - --outputPrefix
      - '{{.outputPrefix}}'
      - --taskTemplatePath
      - '{{.taskTemplatePath}}'
      image: ...
      resources:
        limits:
        - name: CPU
          value: "1"
        - name: MEMORY
          value: 1Gi
        requests:
        - name: CPU
          value: "1"
        - name: MEMORY
          value: 1Gi
    custom:
      jflyte:
        artifacts:
        - location: ... a bunch of jars
    id:
      domain: production
      name: com.spotify.data.flytecanary.FlyteCanaryTask
      project: flyte-canary
      resourceType: TASK
      version: 6345c4e4-6fa2-4403-81a5-cd6010f3510a
    interface:
      inputs:
        variables:
          name:
            type:
              simple: STRING
      outputs:
        variables:
          greet:
            type:
              simple: STRING
    metadata:
      retries: {}
      runtime:
        flavor: java
        type: FLYTE_SDK
        version: 0.0.1
    type: java-task
workflowMeta:
  eventVersion: 1
d
OK, I can reproduce this exactly:
Copy code
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:1012","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Node not yet started, will not finalize","ts":"2022-03-16T13:47:43-05:00"}
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"workflow_event_recorder.go:69","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Failed to record workflow event [execution_id:\u003cproject:\"flytesnacks\" domain:\"development\" name:\"rf3qjeb4w1\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647456463 nanos:821780010 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flytesnacks:development:core.flyte_basics.hello_world.my_wf] failed. RuntimeExecutionError: max number of system retry attempts [483/10] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]","ts":"2022-03-16T13:47:43-05:00"}
{"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:351","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"warning","msg":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]]","ts":"2022-03-16T13:47:43-05:00"}
here is what I'm guessing happened:
1. flyteworkflow crd was created (admin launched a new workflow)
2. flytepropeller didn't pick it up yet (or flytepropeller wasn't running)
3. deleted from the admin db
4. started flytepropeller
so effectively we orphaned a CRD in phase 0 (WorkflowReady) that will never be in admin DB. this is troublesome because we're specifically allowing ExecutionNotFound on the WorkflowReady phase to handle the race condition between when admin creates the CRD and writes to the DB.
for a short-term fix, you can delete the related CRDs; this is safe because they don't exist in admin either. we should probably figure out a more elegant fix.
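for the one you pasted, that should be something along the lines of:
Copy code
kubectl -n flyte-canary-production delete flyteworkflow tvy2szcxi7bcnzg36oef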
b
from the executions table:
Copy code
select * from executions where execution_name ='tvy2szcxi7bcnzg36oef';        

id | created_at | updated_at | deleted_at | execution_project | execution_domain | execution_name | launch_plan_id | workflow_id | task_id | phase | closure | spec | started_at | execution_created_at | execution_updated_at | duration | abort_cause | mode | source_execution_id | parent_node_execution_id | cluster | inputs_uri | user_inputs_uri | error_kind | error_code | user | state
----+------------+------------+------------+-------------------+------------------+----------------+----------------+-------------+---------+-------+---------+------+------------+----------------------+----------------------+----------+-------------+------+---------------------+--------------------------+---------+------------+-----------------+------------+------------+------+-------
(0 rows)
d
perfect, that's exactly what we were expecting to see.
b
I can look into cleaning them up tomorrow, thanks for looking into this
d
no problem, let me know if you run into issues. do you want to file an issue? i'd be happy to pick it up to track the flytepropeller fix so we can automatically handle this case.
b
Great I can do it tomorrow
🙌 1
k
interesting find
b
I came up with this, do you think that the failedAttempts > 50 and phase == 0 checks make sense, or is there any risk of removing something useful?
Copy code
kubectl get flyteworkflows --all-namespaces -o json > /tmp/all-namespaces
cat /tmp/all-namespaces | jq -r '.items[] | select(.status.phase==0 and .status.failedAttempts > 50) | "kubectl delete flyteworkflow -n \(.executionId.project)-\(.executionId.domain) \(.executionId.name)"' > /tmp/kubectl-delete-stuck-workflows
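and then it should just be a matter of reviewing the generated file and running it, e.g.:
Copy code
sh /tmp/kubectl-delete-stuck-workflows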
🙌 1
d
I think that could be a great idea, I was actually thinking the same thing. Just need to run some tests to make sure there is adequate backoff between each attempt.
b
I actually ran it since it wasn't that many resources, approx. 150, that were stuck. I am going to monitor the metrics and logs and get back to you if something looks 🐟
👍 1
d
Oh yeah, I think for removing the CRDs from k8s that looks great. I was thinking we could make that addition to FlytePropeller as well so that these cases are automatically detected and handled there.
b
gotcha! i was looking at opening a ticket now. Is it still necessary?
d
Yeah, I think it is still a good idea. If it ends up being something that we think is a little too specific to implement at least there can be a record of the discussion.
b
d
Perfect, thanks so much!
b
It's interesting how much these ghost workflows polluted our metrics