# announcements
Hi! We are comparing how often a workflow succeeds and gets accepted, i.e.,
and get some unexpected numbers. The number of successes is 4345 for the last 30 days, which is bigger than the number of accepted, which is 4322. There are also 0 failures. Should they not be the same? Is there a better metric to use to compare these numbers? I don’t find
for example
I also see a note from a meeting I did not attend where we have written down that you have an SLO for Flyte Execution Success rate. Do you have a query to share on how you calculate this and what metrics to use?
We also see no values for
is there a way to get number of completions?
@Sonja Ericsson all great questions. For the missing workflows, there may have been a number of aborted workflows. There should be a metric like `flytepropellerallworkflowworkflow_aborted` to denote these instances.
For each success the duration (in ms) is reported. I think using a count over these values is the correct way to count the total number of successes. It shouldn't be necessary to emit two separate metrics.
The completion latency metric refers to the time it takes to transition from the workflow end node start to workflow success. I don't think this is what you're looking for. You're looking for a total count of completed workflows, correct?
Not sure on SLO - perhaps @Ketan (kumare3) / @Haytham Abuelfutuh can touch on this?
@Sonja Ericsson I do not think you can use the stats to proxy real numbers
A better way would be to use the db for these metrics
@Julien Bisconti created this issue to capture these numeric metrics https://github.com/flyteorg/flyte/issues/2079
@Dan Rammer (hamersaw) the workflow aborted seems to be 0 or empty for this particular workflow, and yes exactly, we were looking for the total count of completions for a particular workflow.
@Ketan (kumare3) Ok, why does it not work to use these metrics we are using? We are currently doing this to figure out how often our canary workflow is succeeding
So @Sonja Ericsson stats collectors are approximate by design. It is ok for them to be lossy or to sometimes have duplicates
but as we haven’t found a metric for completions, we divide by the number of successes we expect
That being said they should not be too lossy
But seeing your dashboard, I understand: you want an indicator and not an actual number
To create a percentage
Let me think and compose an answer
yes we don’t care about being super exact. Thank you!
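As a rough illustration of using the counters as an indicator rather than an exact count, here is a minimal Python sketch. It plugs in the two 30-day figures quoted earlier in the thread (4345 successes, 4322 accepted). The function name `success_indicator` and the choice to clamp at 100% are assumptions made for illustration; nothing here is provided by Flyte.

```python
def success_indicator(successes: int, accepted: int) -> float:
    """Return successes / accepted, clamped to the range [0, 1].

    Stats counters may drop or duplicate samples, so the raw ratio can
    land slightly above 1; clamping keeps the dashboard value a sane
    percentage.
    """
    if accepted <= 0:
        raise ValueError("accepted must be positive")
    raw = successes / accepted
    return max(0.0, min(1.0, raw))


# Figures from the thread: successes exceed accepted, so the raw ratio is > 1.
raw_ratio = 4345 / 4322
indicator = success_indicator(4345, 4322)
print(f"raw={raw_ratio:.4f} clamped={indicator:.4f}")
```

Clamping hides over-counting but not under-counting, so this is only reasonable when an approximate indicator is acceptable.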
We are also interested in the number of internal server errors for create_task; we only find this for create_workflow. For create_task, the errors don’t seem to distinguish between invalid_arg codes and errors
(also create_launch_plan)
To give some more information on how we calculate the values. This is our current SLO dashboard. We calculate the metrics like:
1. Workflow execution success rate > 99.5% of the month (Canary success rate)
   a. sum of  occurrences for our canary workflow the past 30 days - sum of  occurrences for our canary workflow the past 30 days / a constant number on how many times we expect the workflow to run in 30 days
   b. TODO: Would be nice to exchange this constant number to sum of completions
2. Workflow execution success rate > 99.5% of the month (Create Execution RPC success rate)
   a. sum of  the past 30 days /  the past 30 days +  the past 30 days +  the past 30 days
   b. Haven’t decided which one of metric 1 or 2 to use
3. Register workflow success rate > 99.5% of the month
   a. Same as 2 but with
   b. TODO: We want to do the same for create_task and create_launch_plan but there is no
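One possible reading of item 2 above, sketched in Python: successful RPC calls divided by all calls (successes plus every error code), compared against the 99.5% target. The function name, the error-code labels, and the counts are hypothetical placeholders, not real Flyte metric names or data.

```python
from typing import Mapping

SLO_TARGET = 0.995  # the 99.5% target from the list above


def rpc_success_rate(ok: int, errors_by_code: Mapping[str, int]) -> float:
    """Successful calls divided by all calls (successes plus every error code)."""
    total = ok + sum(errors_by_code.values())
    if total == 0:
        return 1.0  # no traffic: treating the SLO as met is a policy choice
    return ok / total


# Hypothetical 30-day counts; the code labels are placeholders.
errors = {"invalid_argument": 3, "internal": 1, "already_exists": 1}
rate = rpc_success_rate(995, errors)
meets_slo = rate >= SLO_TARGET
print(f"success rate {rate:.2%}, meets SLO: {meets_slo}")
```

Whether client-side errors such as invalid arguments belong in the denominator is a design decision; excluding them would measure only server-side reliability.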
Hi again and good morning! 👋 This is our final dashboard, would be interested to hear your thoughts on these metrics and approach or if you think we need to take some other approach to it. Also had a last question about our last SLO. We want to calculate the time flyte spends on executing user code vs the time flyte spends on “overhead”, e.g., how much the workflow is delayed. I thought that overhead time should vary depending on how many nodes a workflow has so used (1-(
) / `flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum`) * 100, which I hope is calculating how much time a node spends in queuing and transition out of its total execution time. Does that make sense or am I doing something wrong?
This dashboard looks fantastic 👏 I should have answered your questions, yes. I will get to them today. Sorry for the delay
Thank you no problem, looking forward to your answers.
@Sonja Ericsson For Part 1 - FlytePropeller runs in an event loop and some of the events are eventually consistent. So it is possible that the same loop may be run more than once, and so relying on stats like
to inform the actual number of runs may be inaccurate. To really get the actual number the only source of truth is the flyteAdmin database.
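To illustrate why a counter can overshoot when the same loop runs more than once, here is a small Python sketch comparing a counter-style tally with a count keyed by execution id, which is closer to what a database query would return. The event tuples and ids are made up for illustration and do not reflect Flyte's actual event schema.

```python
from collections import Counter

# Hypothetical stream of (execution_id, phase) events; "exec-2" is reported twice.
events = [
    ("exec-1", "SUCCEEDED"),
    ("exec-2", "SUCCEEDED"),
    ("exec-2", "SUCCEEDED"),  # duplicate, as if the same loop ran again
    ("exec-3", "FAILED"),
]

# Counter-style metric: every emitted event increments the tally.
naive_counts = Counter(phase for _, phase in events)

# Database-style count: keep one final phase per unique execution id.
final_phase = {}
for execution_id, phase in events:
    final_phase[execution_id] = phase
exact_counts = Counter(final_phase.values())

print(naive_counts["SUCCEEDED"], exact_counts["SUCCEEDED"])
```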
For Part 2:
(1-(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum  ) /flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum)
In the above query this metric math is potentially incorrect
flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum
as you are not using
or some aggregation. So I am not sure if this is for one workflow or random workflows.
Now let's try to see how we can get good metrics
I think we should add a new metric here for node overhead:
• https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/nodes/executor.go#L702
It can be computed as a percentage as follows. Queue overhead: (nodeStatus.GetStartedAt() - nodeStatus.GetQueuedAt())
Maybe your query itself can be improved if you perform avg?
(1-(avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum) + avg(flyte:propeller:all:node:transition_latency_unlabeled_ms_sum)  ) /avg(flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum))
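One thing worth checking in the query above: the metric name suffixes suggest the queueing and transition sums are in milliseconds (`_ms_sum`) while the execution sum is in microseconds (`_us_sum`). A minimal Python sketch of the same arithmetic with the units aligned follows. The function name and sample values are hypothetical, and it assumes the execution latency includes queueing and transition time, which the thread does not confirm.

```python
def overhead_fraction(queueing_ms_sum: float,
                      transition_ms_sum: float,
                      exec_us_sum: float) -> float:
    """Share of node execution time spent queueing or transitioning.

    Assumes the two overhead sums are in milliseconds and the execution
    sum is in microseconds, as the metric name suffixes suggest.
    """
    if exec_us_sum <= 0:
        raise ValueError("exec_us_sum must be positive")
    overhead_us = (queueing_ms_sum + transition_ms_sum) * 1000.0
    return overhead_us / exec_us_sum


# Hypothetical sums: 2 s queueing, 1 s transition, 60 s total execution.
frac = overhead_fraction(2_000, 1_000, 60_000_000)
overhead_pct = frac * 100          # share spent on overhead
user_code_pct = (1 - frac) * 100   # remaining share, the (1 - ...) form of the query
print(f"overhead {overhead_pct:.1f}%, user code {user_code_pct:.1f}%")
```

Note that the `(1 - ...)` form yields the share left after overhead, so the overhead percentage itself is the ratio without the subtraction.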
Also, all these times are recorded in the flyteadmin database
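If the timestamps are read from the database, the queue overhead (started minus queued) could be turned into a percentage along these lines. This is a sketch under assumptions: the function name is hypothetical, the timestamps are invented, and using queued-to-stopped time as the denominator is one possible choice that the thread does not specify.

```python
from datetime import datetime, timezone


def queue_overhead_pct(queued_at: datetime, started_at: datetime,
                       stopped_at: datetime) -> float:
    """Queue wait (started - queued) as a percentage of queued-to-stopped time."""
    total = (stopped_at - queued_at).total_seconds()
    if total <= 0:
        raise ValueError("stopped_at must be after queued_at")
    waited = (started_at - queued_at).total_seconds()
    return 100.0 * waited / total


# Hypothetical node: queued for 30 s, finished 5 minutes after being queued.
queued = datetime(2022, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
started = datetime(2022, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
stopped = datetime(2022, 1, 1, 12, 5, 0, tzinfo=timezone.utc)
pct = queue_overhead_pct(queued, started, stopped)
print(f"queue overhead: {pct:.1f}%")
```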
Got it, thank you! For 1: We might try using the metrics we have for now as an SLO indicator, and once we dump the database we can change it. The metric where we count internal server errors on different RPCs is not in the database, I assume? About part 2: I thought “_sum” was a sum of all latencies of all nodes. I get similar values when I do
in promql vs
778654651 vs 779608963
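For a sense of how close those two values are, a quick relative-difference check in Python (the variable names are just for illustration):

```python
a, b = 778654651, 779608963

# Relative difference with respect to the larger value.
relative_diff = abs(a - b) / max(a, b)
print(f"{relative_diff:.4%}")
```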