    Babis Kiosidis

    6 months ago
    Looking at the propeller prometheus metrics, what's the best way to capture failures/errors by plugin node? Is there a metric for that?
    Ketan (kumare3)

    6 months ago
    This should work right?
    Babis Kiosidis

    6 months ago
    did you mean to share a link?
    Babis Kiosidis

    6 months ago
    i will look at this tomorrow, thanks! Not sure if we forward these currently (maybe we missed them)
    Ketan (kumare3)

    6 months ago
    ya
    Babis Kiosidis

    6 months ago
    @Ketan (kumare3) This workflow has no failures, but we are seeing some failures on the metric. Is there a bug, or did we misinterpret the metric?
    Ketan (kumare3)

    6 months ago
    Seems weird, can you do a rate on the counter
    Babis Kiosidis

    6 months ago
    Ketan (kumare3)

    6 months ago
    Prometheus QL is weird. Check up on how to interpret rate. This seems to be zero
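    Something like this against the Prometheus HTTP API would show whether the counter is actually increasing (just a sketch; the Prometheus URL is an assumption for your setup, substitute whichever failure counter you're looking at):
    # per-second rate of the workflow failure counter over the last 5m
    curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(flyte:propeller:all:workflow:failure_duration_ms_count{project="flyte-canary",domain="production"}[5m])'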
    Babis Kiosidis

    6 months ago
    hm ok
    Ketan (kumare3)

    6 months ago
    Also, it's morning here.. will talk later in the day
    Babis Kiosidis

    6 months ago
    sure. for some other wfs it's 0, which is a bit weird
    this is what we see when scraping flytepropeller:
    flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 0
    flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 0
    flyte:propeller:all:workflow:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 0
    flyte:propeller:all:workflow:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 0
    flyte:propeller:all:workflow:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
    Does 4376 make sense?
    also seeing some slightly different numbers on the event recording metrics
    flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.5"} 4
    flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.9"} 6
    flyte:propeller:all:workflow:event_recording:failure_duration_ms{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow",quantile="0.99"} 6
    flyte:propeller:all:workflow:event_recording:failure_duration_ms_sum{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 54815
    flyte:propeller:all:workflow:event_recording:failure_duration_ms_count{domain="production",project="flyte-canary",task="",wf="flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"} 4376
    we found something, i will DM you
    maybe it's better here
    so we found something here that might explain the wrong metrics: First we checked the logs for the specific wf and found something fishy
    kubectl -n flyte logs  flytepropeller | grep com.spotify.data.flytecanary.FlyteCanaryWorkflow
    {"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:1012","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Node not yet started, will not finalize","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
    {"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"workflow_event_recorder.go:69","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Failed to record workflow event [execution_id:\u003cproject:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647441372 nanos:95751514 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [15069/50] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\\\"flyte-canary\\\" domain:\\\"production\\\" name:\\\"tvy2szcxi7bcnzg36oef\\\" ]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]","severity":"INFO","timestamp":"2022-03-16T14:36:12Z"}
    {"data":{"exec_id":"tvy2szcxi7bcnzg36oef","ns":"flyte-canary-production","res_ver":"795880409","routine":"worker-135","src":"executor.go:351","wf":"flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow"},"message":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = missing entity of type execution with identifier project:\"flyte-canary\" domain:\"production\" name:\"tvy2szcxi7bcnzg36oef\" ]]","severity":"WARNING","timestamp":"2022-03-16T14:36:12Z"}
    notice the 15069/50 retries. Then we checked the resources for
    tvy2szcxi7bcnzg36oef
    and found it (together with some other ones that are old as well)
    kubectl -n flyte-canary-production get flyteworkflows
    NAME                   AGE
    br5o45dnhdjfmfbbr7ts   6m30s
    brrq4doftnbbiflu4zvf   26m
    fkpnyoqmal5aejlfbsgr   6m16s
    hckqfx2umxjabzcvoqgf   54d
    holnhktw7fvhy3lkpsrh   16m
    mueszlcebofasnowxmwx   36m
    nstoy3mlfnffkth7y5np   16m
    ojpzr2ctr6fc5neeir5k   36m
    p6lvppusayncj5c7b33i   26m
    qrivhl7rztrcqviw6py7   54d
    tq2j6kkisvrfgroyt4sr   20d
    tvy2szcxi7bcnzg36oef   20d
    vbjdy2wshxngbdlrlahz   34m
    vnlhr26tdt5da3pqcqwn   16m
    ydkgy4krdnbgajo4y3ba   54d
    fetching the specific resource
    kubectl -n flyte-canary-production get flyteworkflows tvy2szcxi7bcnzg36oef -o yaml
    
    
      failedAttempts: 15086
      message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
        by: ExecutionNotFound: The execution that the event belongs to does not exist,
        caused by [rpc error: code = NotFound desc = missing entity of type execution
        with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
        ]'
      phase: 0
    So we now believe that these flyteworkflow resources were applied on the k8s cluster during a short incident that we had, and they've been stuck like that ever since (3w ago). We realise now that there are other errors on workflows/namespaces that we missed before and that keep retrying. One of them:
    RuntimeExecutionError: max number of system retry attempts [224267/50] e
    ... Are we expecting that flyte should somehow recycle these crds?
    it seems that during the incidents some CRDs got created but the executions did not start.
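    (for reference, a rough way to spot the rest of them is grepping the propeller logs for the retry-exhausted message; just a sketch:)
    kubectl -n flyte logs flytepropeller | grep 'max number of system retry attempts'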
    Ketan (kumare3)

    6 months ago
    hmm this is interesting that they are still hanging around
    cc @Dan Rammer (hamersaw) / @Haytham Abuelfutuh
    Dan Rammer (hamersaw)

    6 months ago
    Hey @Babis Kiosidis, I think propeller is trying to report an event that the workflow has failed, flyteadmin can't find the workflow, that causes a failure, and we end up in an endless loop.
    We had put in this fix which should handle this case. I think we may have missed a corner case here.
    Can you confirm that flyteadmin doesn't have the execution? You should be able to just check the db. Also, if it's not too large, can you dump the flyteworkflow crd? Having the phases, etc. would be helpful in debugging this.
    Babis Kiosidis

    6 months ago
    acceptedAt: "2022-02-23T17:08:16Z"
    apiVersion: flyte.lyft.com/v1alpha1
    executionConfig:
      MaxParallelism: 0
      RecoveryExecution: {}
      TaskPluginImpls: {}
      TaskResources:
        Limits:
          CPU: "8"
          EphemeralStorage: "0"
          GPU: "0"
          Memory: 40Gi
          Storage: 4Gi
        Requests:
          CPU: "1"
          EphemeralStorage: "0"
          GPU: "0"
          Memory: 1Gi
          Storage: 2Gi
    executionId:
      domain: production
      name: tvy2szcxi7bcnzg36oef
      project: flyte-canary
    inputs:
      literals:
        styx_parameter:
          scalar:
            primitive:
              datetime: "2022-02-23T17:00:00Z"
    kind: FlyteWorkflow
    metadata:
      annotations:
        ...
      creationTimestamp: "2022-02-23T17:08:17Z"
      generation: 15189
      labels:
        ...
      name: tvy2szcxi7bcnzg36oef
      namespace: flyte-canary-production
      resourceVersion: "796306095"
      uid: b5b58ac2-3565-4896-8c56-5c6cea4d3d7b
    node-defaults: {}
    rawOutputDataConfig: {}
    securityContext: {}
    spec:
      connections:
        say-hello:
        - end-node
        start-node:
        - say-hello
      edges:
        downstream:
          say-hello:
          - end-node
          start-node:
          - say-hello
        upstream:
          end-node:
          - say-hello
          say-hello:
          - start-node
      id: flyte-canary:production:com.spotify.data.flytecanary.FlyteCanaryWorkflow
      nodes:
        end-node:
          id: end-node
          inputBindings:
          - binding:
              promise:
                nodeId: say-hello
                var: greet
            var: greet
          kind: end
          resources: {}
        say-hello:
          id: say-hello
          inputBindings:
          - binding:
              scalar:
                primitive:
                  stringValue: World
            var: name
          kind: task
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: "1"
              memory: 1Gi
          retry:
            minAttempts: 1
          task: 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
            version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
        start-node:
          id: start-node
          kind: start
          resources: {}
      outputBindings:
      - binding:
          promise:
            nodeId: say-hello
            var: greet
        var: greet
      outputs:
        variables:
          greet:
            type:
              simple: STRING
    status:
      failedAttempts: 15188
      message: 'Workflow[] failed. ErrorRecordingError: failed to publish event, caused
        by: ExecutionNotFound: The execution that the event belongs to does not exist,
        caused by [rpc error: code = NotFound desc = missing entity of type execution
        with identifier project:"flyte-canary" domain:"production" name:"tvy2szcxi7bcnzg36oef"
        ]'
      phase: 0
    tasks:
      ? 'resource_type:TASK project:"flyte-canary" domain:"production" name:"com.spotify.data.flytecanary.FlyteCanaryTask"
        version:"6345c4e4-6fa2-4403-81a5-cd6010f3510a" '
      : container:
          args:
          - jflyte
          - execute
          - --task
          - com.spotify.data.flytecanary.FlyteCanaryTask
          - --inputs
          - '{{.input}}'
          - --outputPrefix
          - '{{.outputPrefix}}'
          - --taskTemplatePath
          - '{{.taskTemplatePath}}'
          image: ...
          resources:
            limits:
            - name: CPU
              value: "1"
            - name: MEMORY
              value: 1Gi
            requests:
            - name: CPU
              value: "1"
            - name: MEMORY
              value: 1Gi
        custom:
          jflyte:
            artifacts:
            - location: ... a bunch of jars
        id:
          domain: production
          name: com.spotify.data.flytecanary.FlyteCanaryTask
          project: flyte-canary
          resourceType: TASK
          version: 6345c4e4-6fa2-4403-81a5-cd6010f3510a
        interface:
          inputs:
            variables:
              name:
                type:
                  simple: STRING
          outputs:
            variables:
              greet:
                type:
                  simple: STRING
        metadata:
          retries: {}
          runtime:
            flavor: java
            type: FLYTE_SDK
            version: 0.0.1
        type: java-task
    workflowMeta:
      eventVersion: 1
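    (side note: if the full dump is too big, something like this pulls just the relevant status fields; the jsonpath expression is a sketch:)
    kubectl -n flyte-canary-production get flyteworkflows tvy2szcxi7bcnzg36oef \
      -o jsonpath='phase={.status.phase} failedAttempts={.status.failedAttempts}{"\n"}'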
    Dan Rammer (hamersaw)

    6 months ago
    OK, I can reproduce this exactly:
    {"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:1012","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Node not yet started, will not finalize","ts":"2022-03-16T13:47:43-05:00"}
    {"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"workflow_event_recorder.go:69","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"info","msg":"Failed to record workflow event [execution_id:\u003cproject:\"flytesnacks\" domain:\"development\" name:\"rf3qjeb4w1\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1647456463 nanos:821780010 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flytesnacks:development:core.flyte_basics.hello_world.my_wf] failed. RuntimeExecutionError: max number of system retry attempts [483/10] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]\" kind:SYSTEM \u003e ] with err: ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]","ts":"2022-03-16T13:47:43-05:00"}
    {"json":{"exec_id":"rf3qjeb4w1","ns":"flytesnacks-development","res_ver":"2076","routine":"worker-0","src":"executor.go:351","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"warning","msg":"Event recording failed. Error [ExecutionNotFound: The execution that the event belongs to does not exist, caused by [rpc error: code = NotFound desc = entry not found]]","ts":"2022-03-16T13:47:43-05:00"}
    here is what I'm guessing happened:
    1. flyteworkflow crd was created (admin launched a new workflow)
    2. flytepropeller didn't pick it up yet (or flytepropeller wasn't running)
    3. deleted from the admin db
    4. started flytepropeller
    so effectively we orphaned a CRD in phase 0 (WorkflowReady) that will never be in admin DB. this is troublesome because we're specifically allowing ExecutionNotFound on the WorkflowReady phase to handle the race condition between when admin creates the CRD and writes to the DB.
    for a short-term fix, you can delete the related CRDs; this is safe because they don't exist in admin either. we should probably figure out a more elegant fix.
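    Something like this should do it for the one from this thread (a sketch; same pattern for the others, just double-check the execution really is missing from admin first):
    kubectl -n flyte-canary-production delete flyteworkflow tvy2szcxi7bcnzg36oef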
    Babis Kiosidis

    6 months ago
    from the executions table:
    select * from executions where execution_name ='tvy2szcxi7bcnzg36oef';        
    
    id | created_at | updated_at | deleted_at | execution_project | execution_domain | execution_name | launch_plan_id | workflow_id | task_id | phase | closure | spec | started_at | execution_created_at | execution_updated_at | duration | abort_cause | mode | source_execution_id | parent_node_execution_id | cluster | inputs_uri | user_inputs_uri | error_kind | error_code | user | state
    ----+------------+------------+------------+-------------------+------------------+----------------+----------------+-------------+---------+-------+---------+------+------------+----------------------+----------------------+----------+-------------+------+---------------------+--------------------------+---------+------------+-----------------+------------+------------+------+-------
    (0 rows)
    Dan Rammer (hamersaw)

    6 months ago
    perfect, that's exactly what we were expecting to see.
    Babis Kiosidis

    6 months ago
    I can look into cleaning them up tomorrow, thanks for looking into this
    Dan Rammer (hamersaw)

    6 months ago
    no problem, let me know if you run into issues. do you want to file an issue? i'd be happy to pick it up; it would track the flytepropeller fix so we can automatically handle this case.
    Babis Kiosidis

    6 months ago
    Great I can do it tomorrow
    Ketan (kumare3)

    6 months ago
    interesting find
    Babis Kiosidis

    6 months ago
    I came up with this, do you think the failedAttempts > 50 and phase == 0 checks make sense, or is there any risk of removing something useful?
    # dump all flyteworkflow CRs, then generate a delete command for each one stuck in phase 0 with more than 50 failed attempts
    kubectl get flyteworkflows --all-namespaces -o json > /tmp/all-namespaces
    cat /tmp/all-namespaces | jq -r '.items[] | select(.status.phase==0 and .status.failedAttempts > 50) | "kubectl delete flyteworkflow -n \(.executionId.project)-\(.executionId.domain) \(.executionId.name)"' > /tmp/kubectl-delete-stuck-workflows
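    (the generated file should end up with one kubectl delete per stuck workflow; the plan is to review it first and then run it, roughly:)
    less /tmp/kubectl-delete-stuck-workflows
    bash /tmp/kubectl-delete-stuck-workflows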
    Dan Rammer (hamersaw)

    6 months ago
    I think that could be a great idea, I was actually thinking the same thing. Just need to run some tests to make sure there is adequate backoff between each attempt.
    Babis Kiosidis

    6 months ago
    I actually ran it, since it wasn't that many resources that were stuck, approx 150. I am going to monitor the metrics and logs and get back to you if something looks 🐟
    Dan Rammer (hamersaw)

    6 months ago
    Oh yeah, I think for removing the CRDs from k8s that looks great. I was thinking we could make that addition to FlytePropeller as well so that these cases are automatically detected and handled there.
    Babis Kiosidis

    6 months ago
    gotcha! i was looking at opening a ticket now. Is it still necessary?
    Dan Rammer (hamersaw)

    6 months ago
    Yeah, I think it is still a good idea. If it ends up being something that we think is a little too specific to implement at least there can be a record of the discussion.
    Babis Kiosidis

    6 months ago
    Dan Rammer (hamersaw)

    6 months ago
    Perfect, thanks so much!
    Babis Kiosidis

    6 months ago
    It's interesting how much these ghost workflows polluted our metrics
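    (once they're gone I'll sanity-check the scrape directly; a sketch, assuming the flytepropeller deployment name and its profiling/metrics port 10254, both of which are assumptions for our setup:)
    kubectl -n flyte port-forward deploy/flytepropeller 10254:10254 &
    curl -s localhost:10254/metrics | grep 'flyte:propeller:all:workflow:failure_duration_ms_count'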