Hi We are comparing how often a workflow succeeds and gets a Flyte #announcements


colossal-solstice-11091

01/31/2022, 3:07 PM

Hi! We are comparing how often a workflow succeeds and gets accepted, i.e.,

flyte:propeller:all:workflow:success_duration_ms_count

flyte:propeller:all:workflow:accepted

and get some unexpected numbers. Successes is equal to 4345 for the last 30 days which is bigger than the number of accepted which is 4322. There are 0 failures also. Should they not be the same? Is there a better metric to use to be able to compare these numbers? I don’t find

flyte:propeller:all:workflow:success

for example

colossal-solstice-11091

01/31/2022, 3:10 PM

I also see a note from a meeting I did not attend where we have written down that you have an SLO for Flyte Execution Success rate. Do you have a query to share on how you calculate this and what metrics to use?

colossal-solstice-11091

01/31/2022, 3:30 PM

We also see no values for


flyte:propeller:all:workflow:completion_latency_ms_count

is there a way to get number of completions?

hallowed-mouse-14616

01/31/2022, 3:38 PM

@colossal-solstice-11091 all great questions. For the missing workflows, there may have been a number of aborted workflows. There should be a metric like `flyte:propeller:all:workflow:workflow_aborted` to denote these instances.

hallowed-mouse-14616

01/31/2022, 3:39 PM

For each success the duration (in ms) is reported. I think using a count over these values is the correct way to count the total number of successes. It shouldn't be necessary to emit two separate metrics.

hallowed-mouse-14616

01/31/2022, 3:41 PM

The completion latency metric refers to the time it takes to transition from the workflow end node start to workflow success. I don't think this is what you're looking for; you're looking for a total count of completed workflows, correct?

hallowed-mouse-14616

01/31/2022, 3:42 PM

Not sure on SLO - perhaps @freezing-airport-6809 / @high-park-82026 can touch on this?

freezing-airport-6809

01/31/2022, 3:51 PM

@colossal-solstice-11091 I do not think you can use the stats to proxy real numbers

freezing-airport-6809

01/31/2022, 3:51 PM

A better way would be to use the db for these metrics

freezing-airport-6809

01/31/2022, 3:53 PM

@acoustic-cpu-86019 created this issue to capture these numeric metrics https://github.com/flyteorg/flyte/issues/2079

colossal-solstice-11091

01/31/2022, 3:58 PM

@hallowed-mouse-14616 the workflow aborted seems to be 0 or empty for this particular workflow, and yes exactly, we were looking for the total count of completions for a particular workflow.

colossal-solstice-11091

01/31/2022, 4:02 PM

@freezing-airport-6809 Ok, why does it not work to use these metrics we are using? We are currently doing this to figure out how often our canary workflow is succeeding

freezing-airport-6809

01/31/2022, 4:03 PM

So @colossal-solstice-11091 stats collectors are approx by design. It is ok for them to be lossy or sometimes have duplicates

colossal-solstice-11091

01/31/2022, 4:03 PM

but as we haven’t found a metric for completions, we divide by the number of successes we expect

freezing-airport-6809

01/31/2022, 4:04 PM

That being said they should not be too lossy
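
Since the counters can be slightly lossy or contain duplicates, a ratio built from them can land just above 100%, as with the 4345 successes versus 4322 accepted reported earlier in the thread. A minimal sketch of treating such a ratio as an indicator rather than an exact number follows; the function name `success_ratio` and the choice to clamp at 1.0 are illustrative assumptions, not something Flyte provides.

```python
def success_ratio(successes: int, accepted: int) -> float:
    """Approximate success ratio, clamped to the range [0, 1].

    Hypothetical helper: the counters are approximate, so a raw ratio
    slightly above 1.0 is treated as "everything succeeded".
    """
    if accepted <= 0:
        return 0.0  # nothing accepted in the window, so no ratio to report
    return min(successes / accepted, 1.0)


# Counts quoted earlier in the thread (last 30 days).
raw_ratio = 4345 / 4322            # slightly above 1 because of duplicates or loss
indicator = success_ratio(4345, 4322)
print(f"raw={raw_ratio:.4f} clamped={indicator:.4f}")
```

Clamping hides small overcounts only; a large gap between the two counters would still be worth investigating.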

freezing-airport-6809

01/31/2022, 4:05 PM

But seeing your dashboard, I think I understand: you want an indicator and not an actual number

freezing-airport-6809

01/31/2022, 4:05 PM

To create a percentage

freezing-airport-6809

01/31/2022, 4:05 PM

Let me think and compose an answer

colossal-solstice-11091

01/31/2022, 4:06 PM

yes we don’t care about being super exact. Thank you!

colossal-solstice-11091

01/31/2022, 4:45 PM

We are also interested in the number of internal server errors for create_task, we only find this for create_workflow

flyte:admin:create_workflow:codes:Internal

. For create_task, the errors metric doesn't seem to distinguish between invalid_arg codes and errors

flyte:admin:create_task:errors

(also create_launch_plan)

colossal-solstice-11091

01/31/2022, 5:23 PM

To give some more information on how we calculate the values. This is our current SLO dashboard. We calculate the metrics like:

1. Workflow execution success rate > 99.5% of the month (Canary success rate)
   a. (sum of `flyte:propeller:all:workflow:accepted` occurrences for our canary workflow the past 30 days - sum of `flyte:propeller:all:workflow:failure_duration_ms_count` occurrences for our canary workflow the past 30 days) / a constant number for how many times we expect the workflow to run in 30 days
   b. TODO: Would be nice to exchange this constant number for the sum of completions
2. Workflow execution success rate > 99.5% of the month (Create Execution RPC success rate)
   a. sum of `flyte:admin:create_execution:codes:Internal` the past 30 days / (`flyte:admin:create_execution:codes:OK` the past 30 days + `flyte:admin:create_execution:codes:Internal` the past 30 days + `flyte:admin:create_execution:codes:InvalidArgument` the past 30 days)
   b. Haven’t decided which one of metric 1 or 2 to use
3. Register workflow success rate > 99.5% of the month
   a. Same as 2 but with `flyte:admin:create_workflow*`
   b. TODO: We want to do the same for create_task and create_launch_plan but there is no `create_task:codes:Internal`
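
The SLO arithmetic described in this message can be sketched in a few lines of Python. This is only an illustration of the formulas as written: the function names are made up, the inputs stand for 30-day counter totals that would come from the metrics backend, and the expected run count of 4320 (one run every 10 minutes for 30 days) is a hypothetical value. Formula 2a as stated yields the fraction of Internal errors, so subtracting it from 1 to get a success rate is an assumption about the intent.

```python
def canary_success_rate(accepted: int, failures: int, expected_runs: int) -> float:
    """Formula 1a: (accepted - failures) / expected number of runs."""
    if expected_runs <= 0:
        raise ValueError("expected_runs must be positive")
    return (accepted - failures) / expected_runs


def internal_error_fraction(ok: int, internal: int, invalid_argument: int) -> float:
    """Formula 2a: Internal / (OK + Internal + InvalidArgument)."""
    total = ok + internal + invalid_argument
    if total == 0:
        return 0.0  # no RPCs observed in the window
    return internal / total


# 4322 accepted and 0 failures are the counts quoted in the thread;
# 4320 expected runs and the RPC counts are hypothetical.
canary = canary_success_rate(accepted=4322, failures=0, expected_runs=4320)
rpc_success = 1.0 - internal_error_fraction(ok=995, internal=5, invalid_argument=0)
print(f"canary={canary:.4%} rpc_success={rpc_success:.4%}")
```

With a fixed expected-run constant the canary rate can exceed 100%, which is one reason replacing the constant with a real completion count (the TODO in 1b) would make the indicator easier to read.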

colossal-solstice-11091

02/01/2022, 4:37 PM

Hi again and good morning! 👋 This is our final dashboard, would be interested to hear your thoughts on these metrics and approach or if you think we need to take some other approach to it. Also had a last question about our last SLO. We want to calculate the time flyte spends on executing user code vs the time flyte spends on “overhead”, e.g., how much the workflow is delayed. I thought that overhead time should vary depending on how many nodes a workflow has, so I used

(1-(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum) / flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum)*100

which I hope is calculating how much time a node spends in queuing and transition out of its total execution time. Does that make sense or am I doing something wrong?
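
One thing worth checking in this formula is units. Going only by the metric-name suffixes, the two latency sums end in `_ms_sum` while the execution latency ends in `_us_sum`; if those suffixes do mean milliseconds and microseconds (an assumption, not confirmed in this thread), the values need to be brought to the same unit before dividing. A minimal sketch with a made-up function name and made-up sample values:

```python
def non_overhead_percent(queueing_ms: float, transition_ms: float, exec_us: float) -> float:
    """(1 - (queueing + transition) / execution) * 100, in consistent units.

    Assumes, from the metric-name suffixes only, that the first two
    inputs are in milliseconds and the third is in microseconds.
    """
    exec_ms = exec_us / 1000.0  # convert microseconds to milliseconds
    if exec_ms <= 0:
        raise ValueError("execution latency must be positive")
    overhead_ms = queueing_ms + transition_ms
    return (1.0 - overhead_ms / exec_ms) * 100.0


# Hypothetical totals: 100 ms queueing, 50 ms transition, 1,000,000 us execution.
share = non_overhead_percent(queueing_ms=100, transition_ms=50, exec_us=1_000_000)
print(f"{share:.1f}% of execution time is not queueing or transition")
```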

freezing-airport-6809

02/01/2022, 4:41 PM

This dashboard looks fantastic 👏 I should have answered your questions yes, will get to them today. Sorry for the delay

colossal-solstice-11091

02/01/2022, 7:00 PM

Thank you ✨ no problem, looking forward to your answers.

freezing-airport-6809

02/02/2022, 5:16 AM

@colossal-solstice-11091 For Part 1 - FlytePropeller runs in an event loop and some of the events are eventually consistent, so it is possible that the same loop may be run more than once. Relying on stats like

flyte:propeller:all:workflow:workflow_aborted

to inform the actual number of runs may be inaccurate. To really get the actual number the only source of truth is the flyteAdmin database.
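
Counting from a database could look roughly like the sketch below. It uses an in-memory SQLite table so it runs on its own; the table name `executions`, the columns `workflow_name` and `phase`, and the phase values are all hypothetical stand-ins and are not taken from the real flyteAdmin schema, which should be checked before writing an actual query.

```python
import sqlite3

# Hypothetical stand-in for the real database: in-memory SQLite with an
# illustrative table. The actual flyteAdmin schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE executions (workflow_name TEXT, phase TEXT)")
conn.executemany(
    "INSERT INTO executions VALUES (?, ?)",
    [
        ("canary", "SUCCEEDED"),
        ("canary", "SUCCEEDED"),
        ("canary", "SUCCEEDED"),
        ("canary", "FAILED"),
        ("other", "SUCCEEDED"),
    ],
)


def phase_counts(connection: sqlite3.Connection, workflow_name: str) -> dict:
    """Return {phase: number of executions} for one workflow."""
    rows = connection.execute(
        "SELECT phase, COUNT(*) FROM executions "
        "WHERE workflow_name = ? GROUP BY phase",
        (workflow_name,),
    ).fetchall()
    return dict(rows)


counts = phase_counts(conn, "canary")
total = sum(counts.values())
success_rate = counts.get("SUCCEEDED", 0) / total if total else 0.0
print(counts, f"success_rate={success_rate:.2%}")
```

Each execution is stored once per row here, so the counts do not suffer from the duplicate or lost events that affect the stats counters.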

freezing-airport-6809

02/02/2022, 5:19 AM

for Part 2:

(1-(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum  ) /flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum)

In the above query this metric math is potentially incorrect

flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum + flyte:propeller:all:node:transition_latency_unlabeled_ms_sum

as you are not using `avg`, `sum`, or some other aggregation. So I am not sure if this is for one workflow or random workflows.

freezing-airport-6809

02/02/2022, 5:19 AM

Now let's try to see how we can get good metrics

freezing-airport-6809

02/02/2022, 5:29 AM

I think we should add a new metric here, for Node Overhead:
• https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/nodes/executor.go#L702
It can be computed as a percentage as follows:
Queue Overhead: (nodeStatus.GetStartedAt() - nodeStatus.GetQueuedAt())
Maybe your query itself can be improved if you perform avg?

(1-(avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum) + avg(flyte:propeller:all:node:transition_latency_unlabeled_ms_sum)  ) /avg(flyte:propeller:all:node:node_exec_latency_unlabeled_us_sum))

freezing-airport-6809

02/02/2022, 5:29 AM

Also all these times are recorded in flyteadmin database

colossal-solstice-11091

02/02/2022, 11:51 AM

Got it, thank you! For 1: We might try using the metrics we have for now as an SLO indicator, and once we dump the database we can change it. The metric where we count internal server errors on different RPCs is not in the database, I assume? About part 2: I thought “_sum” was a sum of all latencies of all nodes. I get similar values when I do

avg(flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum)

in promql vs

flyte:propeller:all:node:queueing_latency_unlabeled_ms_sum

778654651 vs 779608963
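
The two values being close is what one would expect: in PromQL, `avg(...)` averages across the series that match at one instant, not across individual observations, so with one or a few series it returns roughly the series value itself. If these metrics are Prometheus summaries or histograms (an assumption based on the `_sum` and `_count` suffixes), `_sum` is a running total, and a mean latency over a window comes from dividing the increase in `_sum` by the increase in `_count`. A minimal sketch of that arithmetic, with a made-up function name and made-up sample readings:

```python
from typing import Optional


def windowed_mean_latency(
    sum_start: float, sum_end: float, count_start: int, count_end: int
) -> Optional[float]:
    """Mean observation value over a window, from cumulative sum and count.

    Returns None when no observations were recorded in the window.
    """
    observations = count_end - count_start
    if observations <= 0:
        return None
    return (sum_end - sum_start) / observations


# Hypothetical readings of a cumulative `_sum` and `_count` pair at the
# start and end of a window: 3 new observations adding 600 ms in total.
mean_ms = windowed_mean_latency(sum_start=1000.0, sum_end=1600.0, count_start=10, count_end=13)
print(mean_ms)
```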
