Sonja Ericsson
01/31/2022, 3:07 PM
I'm comparing `flyte:propeller:all:workflow:success_duration_ms_count` vs `flyte:propeller:all:workflow:accepted` and get some unexpected numbers. Successes equal 4345 for the last 30 days, which is bigger than the number of accepted, which is 4322. There are also 0 failures. Should they not be the same? Is there a better metric to use to compare these numbers? I don’t find `flyte:propeller:all:workflow:success`, for example.
`flyte:propeller:all:workflow:completion_latency_ms_count`
Is there a way to get the number of completions?
Dan Rammer (hamersaw)
01/31/2022, 3:38 PM
Ketan (kumare3)
Sonja Ericsson
01/31/2022, 3:58 PM
Ketan (kumare3)
Sonja Ericsson
01/31/2022, 4:03 PM
Ketan (kumare3)
Sonja Ericsson
01/31/2022, 4:06 PM
`flyte:admin:create_workflow:codes:Internal`. For create_task, the errors metric doesn’t seem to distinguish between invalid_arg codes and errors: `flyte:admin:create_task:errors` (also create_launch_plan).
`flyte:propeller:all:workflow:accepted` occurrences for our canary workflow the past 30 days - sum of `flyte:propeller:all:workflow:failure_duration_ms_count` occurrences for our canary workflow the past 30 days / a constant number for how many times we expect the workflow to run in 30 days
b. TODO: Would be nice to replace this constant number with the sum of completions
2. Workflow execution success rate > 99.5% of the month (Create Execution RPC success rate)
a. sum of `flyte:admin:create_execution:codes:Internal` the past 30 days / (`flyte:admin:create_execution:codes:OK` the past 30 days + `flyte:admin:create_execution:codes:Internal` the past 30 days + `flyte:admin:create_execution:codes:InvalidArgument` the past 30 days)
b. Haven’t decided which one of metric 1 or 2 to use
3. Register workflow success rate > 99.5% of the month
a. Same as 2 but with flyte:admin:create_workflow*
b. TODO: We want to do the same for create_task and create_launch_plan but there is no create_task:codes:Internal
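The ratio in 2a is the share of requests that ended in an Internal error, so the success rate compared against 99.5% would be one minus that ratio. A minimal sketch of that arithmetic, assuming the three inputs are the 30-day increases of the `OK`, `Internal` and `InvalidArgument` counters; the function name and the sample values are hypothetical and not taken from the thread:

```python
def rpc_success_rate(ok: float, internal: float, invalid_argument: float) -> float:
    """Return the fraction of requests that did not end in an Internal error."""
    total = ok + internal + invalid_argument
    if total == 0:
        # Assumption: a window with no traffic counts as meeting the SLO.
        return 1.0
    return 1.0 - internal / total


# Hypothetical 30-day counter increases, for illustration only.
rate = rpc_success_rate(ok=9950, internal=5, invalid_argument=45)
meets_slo = rate > 0.995
```

InvalidArgument responses stay in the denominator here but are not counted as failures, which matches the formula in 2a; whether client-side errors should count against the SLO is a separate decision.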
`(1 - (flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum) / flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum) * 100`, which I hope is calculating how much time a node spends in queueing and transition out of its total execution time. Does that make sense or am I doing something wrong?
Ketan (kumare3)
Sonja Ericsson
02/01/2022, 7:00 PM
Ketan (kumare3)
`flyte:propeller:all:workflow:workflow_aborted` to inform the actual number of runs may be inaccurate.
To really get the actual number, the only source of truth is the flyteAdmin database.
`(1 - (flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum) / flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum)`
In the above query, this metric math is potentially incorrect: `flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum`, as you are not using `avg` or `sum` or some aggregation. So I am not sure if this is for one workflow or random workflows.
`(1 - (avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum) + avg(flyte:propeller:all:node:transition_latency_unlabeled_ms_sum)) / avg(flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum))`
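Going only by the metric name suffixes, the two sums in the numerator look like milliseconds (`_ms_sum`) while the denominator looks like microseconds (`_us_sum`), so the ratio may need a unit conversion before it is meaningful. A minimal sketch of the arithmetic under that assumption; the function name and the sample values are hypothetical, and the inputs stand for the already aggregated values of the three metrics:

```python
def overhead_percent(queueing_ms_sum: float,
                     transition_ms_sum: float,
                     exec_us_sum: float) -> float:
    """Percentage of node execution time spent in queueing plus transition."""
    # Assumption: the `_us` suffix means microseconds, so convert to milliseconds.
    exec_ms_sum = exec_us_sum / 1000.0
    if exec_ms_sum == 0:
        raise ValueError("execution latency sum is zero; ratio is undefined")
    return (queueing_ms_sum + transition_ms_sum) / exec_ms_sum * 100.0


# Made-up values: 200 ms queueing, 50 ms transition, 1,000,000 us (1000 ms) execution.
overhead = overhead_percent(200.0, 50.0, 1_000_000.0)
remaining = 100.0 - overhead  # the "1 - ratio" form used in the queries above
```

The `1 - ratio` form gives the share of time not spent in queueing and transition, so which of the two numbers to chart depends on what the dashboard is meant to show.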
Sonja Ericsson
02/02/2022, 11:51 AM
`avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum)` in promql vs `flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum`
778654651 vs 779608963