Thread
#announcements
    Sonja Ericsson

    7 months ago
    Hi! We are comparing how often a workflow succeeds and gets accepted, i.e.,
    flyte:propeller:all:workflow:success_duration_ms_count
      vs 
    flyte:propeller:all:workflow:accepted
    and get some unexpected numbers. Successes is equal to 4345 for the last 30 days which is bigger than the number of accepted which is 4322. There are 0 failures also. Should they not be the same? Is there a better metric to use to be able to compare these numbers? I don’t find
    flyte:propeller:all:workflow:success
    for example
    I also see a note from a meeting I did not attend where we have written down that you have an SLO for Flyte Execution Success rate. Do you have a query to share on how you calculate this and what metrics to use?
    We also see no values for
    flyte:propeller:all:workflow:completion_latency_ms_count
    is there a way to get number of completions?
    Dan Rammer (hamersaw)

    7 months ago
    @Sonja Ericsson all great questions. For the missing workflows, there may have been a number of aborted workflows. There should be a metric like `flyte:propeller:all:workflow:workflow_aborted` to denote these instances.
    For each success the duration (in ms) is reported. I think using a count over these values is the correct way to count the total number of successes. It shouldn't be necessary to emit two separate metrics.
    The completion latency metric refers to the time it takes to transition from the workflow end node start to workflow success. I don't think this is what you're looking for. You're looking for a total count of completed workflows, correct?
    Not sure on SLO - perhaps @Ketan (kumare3) / @Haytham Abuelfutuh can touch on this?
    Ketan (kumare3)

    7 months ago
    @Sonja Ericsson I do not think you can use the stats to proxy real numbers
    A better way would be to use the db for these metrics
    @Julien Bisconti created this issue to capture these numeric metrics https://github.com/flyteorg/flyte/issues/2079
    Sonja Ericsson

    7 months ago
    @Dan Rammer (hamersaw) the workflow aborted seems to be 0 or empty for this particular workflow, and yes exactly, we were looking for the total count of completions for a particular workflow.
    @Ketan (kumare3) Ok, why does it not work to use these metrics we are using? We are currently doing this to figure out how often our canary workflow is succeeding
    Ketan (kumare3)

    7 months ago
    So @Sonja Ericsson stats collectors are approx by design. It is ok for them to be lossy or sometimes have duplicates
    Sonja Ericsson

    7 months ago
    but as we haven’t found a metric for completions, we divide by the number of successes we expect
    Ketan (kumare3)

    7 months ago
    That being said they should not be too lossy
    But seeing your dashboard I understand: you want an indicator and not an actual number
    To create a percentage
    Let me think and compose an answer
    Sonja Ericsson

    7 months ago
    yes we don’t care about being super exact. Thank you!
    We are also interested in the number of internal server errors for create_task, we only find this for create_workflow
    flyte:admin:create_workflow:codes:Internal
    . For create_task the errors don’t seem to distinguish between invalid_arg codes and errors
    flyte:admin:create_task:errors
    (also create_launch_plan)
    To give some more information on how we calculate the values. This is our current SLO dashboard. We calculate the metrics like:
    1. Workflow execution success rate > 99.5% of the month (Canary success rate)
       a. sum of flyte:propeller:all:workflow:accepted occurrences for our canary workflow the past 30 days - sum of flyte:propeller:all:workflow:failure_duration_ms_count occurrences for our canary workflow the past 30 days / a constant number on how many times we expect the workflow to run in 30 days
       b. TODO: Would be nice to exchange this constant number to sum of completions
    2. Workflow execution success rate > 99.5% of the month (Create Execution RPC success rate)
       a. sum of flyte:admin:create_execution:codes:Internal the past 30 days / flyte:admin:create_execution:codes:OK the past 30 days + flyte:admin:create_execution:codes:Internal the past 30 days + flyte:admin:create_execution:codes:InvalidArgument the past 30 days
       b. Haven’t decided which one of metric 1 or 2 to use
    3. Register workflow success rate > 99.5% of the month
       a. Same as 2 but with flyte:admin:create_workflow*
       b. TODO: We want to do the same for create_task and create_launch_plan but there is no create_task:codes:Internal
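    The two ratios described in this list could be sketched in Python roughly as follows. This is only an illustration: the function names and any input values are hypothetical, and the grouping (accepted - failures) / expected runs is an assumption about how formula 1a is meant to be read. Note that formula 2a as written gives the share of Internal responses, so a success rate would be one minus that value.

    ```python
    def canary_success_rate(accepted: int, failures: int, expected_runs: int) -> float:
        """Formula 1a, read as (accepted - failures) / expected_runs.

        All three counts are assumed to cover the same 30-day window.
        """
        if expected_runs <= 0:
            raise ValueError("expected_runs must be positive")
        return (accepted - failures) / expected_runs


    def rpc_internal_error_ratio(ok: int, internal: int, invalid_argument: int) -> float:
        """Formula 2a, read as Internal / (OK + Internal + InvalidArgument)."""
        total = ok + internal + invalid_argument
        if total == 0:
            # No requests in the window: report no errors rather than divide by zero.
            return 0.0
        return internal / total


    def rpc_success_rate(ok: int, internal: int, invalid_argument: int) -> float:
        """One minus the Internal share, so it is comparable with a 99.5% target."""
        return 1.0 - rpc_internal_error_ratio(ok, internal, invalid_argument)
    ```

    Because the counters are approximate, the first ratio can come out slightly above 1.0 when more runs are accepted than the constant expects.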
    Hi again and good morning! 👋 This is our final dashboard, would be interested to hear your thoughts on these metrics and approach or if you think we need to take some other approach to it. Also had a last question about our last SLO. We want to calculate the time flyte spends on executing user code vs the time flyte spends on “overhead”, e.g., how much the workflow is delayed. I thought that overhead time should vary depending on how many nodes a workflow has so used
    (1-(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum) / flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum)*100
    which I hope is calculating how much time a node spends in queuing and transition out of its total execution time. Does that make sense or am I doing something wrong?
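    One thing that may be worth double-checking in this formula is units. Going only by the metric name suffixes, the queueing and transition sums look like milliseconds (`_ms`) while the node execution sum looks like microseconds (`_us`). A minimal Python sketch of the same arithmetic with the units aligned, assuming that reading of the suffixes is right and that all three sums cover the same nodes and time window (the function name is hypothetical):

    ```python
    def non_overhead_percent(queueing_ms_sum: float,
                             transition_ms_sum: float,
                             node_exec_us_sum: float) -> float:
        """(1 - (queueing + transition) / node_exec) * 100, with every term in ms."""
        # Assumption: the "_us" suffix means microseconds, so convert to milliseconds.
        node_exec_ms_sum = node_exec_us_sum / 1000.0
        if node_exec_ms_sum <= 0:
            raise ValueError("node execution time must be positive")
        overhead_ms = queueing_ms_sum + transition_ms_sum
        return (1.0 - overhead_ms / node_exec_ms_sum) * 100.0
    ```

    As written, the expression yields the share of node time that is not queueing or transition; the overhead share itself would be 100 minus that.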
    Ketan (kumare3)

    7 months ago
    This dashboard looks fantastic 👏 I should have answered your questions yes, will get to them today. Sorry for the delay
    Sonja Ericsson

    7 months ago
    Thank you no problem, looking forward to your answers.
    Ketan (kumare3)

    7 months ago
    @Sonja Ericsson For Part 1 - FlytePropeller runs in an event loop and some of the events are eventually consistent. so it is possible that the same loop may be run more than one time and so relying on stats like
    flyte:propeller:all:workflow:workflow_aborted
    to inform the actual number of runs may be inaccurate. To really get the actual number the only source of truth is the flyteAdmin database.
    for Part 2:
    (1-(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum  ) /flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum)
    In the above query this metric math is potentially incorrect
    flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum
    as you are not using
    avg
    or
    sum
    or some aggregation. So i am not sure if this is for one workflow or random workflows.
    Now let's try to see how we can get good metrics
    I think we should add a new metric here - For Node Overhead
    • https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/nodes/executor.go#L702
    Which can be computed as follows as a percentage
    Queue Overhead: (nodeStatus.GetStartedAt() - nodeStatus.GetQueuedAt())
    Maybe your query itself can be improved if you perform avg?
    (1-(avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum) + avg(flyte:propeller:all:node:transition_latency_unlabeled_ms_sum)  ) /avg(flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum))
    Also all these times are recorded in flyteadmin database
    Sonja Ericsson

    7 months ago
    Got it, thank you! For 1: We might try using the metrics we have for now as an SLO indicator, and once we dump the database we can change it. The metric where we count internal server errors on different RPCs, that is not in the database I assume?
    About part 2: I thought “_sum” was a sum of all latencies of all nodes. I get similar values when I do
    avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum)
    in promql vs
    flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum
    778654651 vs 779608963
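    For context on `_sum`: in Prometheus, a summary or histogram exposes a `_sum` series (running total of all observed values) next to a `_count` series (number of observations), and `avg()` aggregates across series at one point in time rather than over time, which may be why the two numbers above are so close. A mean latency over a window is usually derived from the increase in `_sum` divided by the increase in `_count`. A minimal Python sketch of that arithmetic, with a hypothetical function name and without handling counter resets:

    ```python
    from typing import Optional


    def mean_latency(sum_start: float, sum_end: float,
                     count_start: float, count_end: float) -> Optional[float]:
        """Mean observed value over a window, from *_sum and *_count samples.

        The inputs are the values of the two series at the start and end of the
        window. Counter resets (e.g. process restarts) are not handled here.
        """
        observations = count_end - count_start
        if observations <= 0:
            # Nothing was observed in the window, so the mean is undefined.
            return None
        return (sum_end - sum_start) / observations
    ```

    The result is in whatever unit the `_sum` series uses.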