# announcements
h
@Prafulla Mahindrakar, mind helping @Alex Pozimenko with this? He's trying to get metrics on terminated tasks/executions annotated with the phase they ended as. We merged this PR https://github.com/flyteorg/flyteadmin/pull/386/files and he's trying with admin version 0.6.131 (should include this PR) but can't see any phases under:
flyte-admin-task.execution.manager-task.executions.terminated.counter
p
Sure 👍
There is a bug in this path and the new metric is not published. These are the ones published with that image:
flyte:admin:admin:execution_manager:acceptance_delay_count 0
flyte:admin:admin:execution_manager:acceptance_delay_sum 0
flyte:admin:admin:execution_manager:acceptance_delay{quantile="0.5"} NaN
flyte:admin:admin:execution_manager:acceptance_delay{quantile="0.9"} NaN
flyte:admin:admin:execution_manager:acceptance_delay{quantile="0.99"} NaN
flyte:admin:admin:execution_manager:active_executions 0
flyte:admin:admin:execution_manager:closure_size_bytes_count 0
flyte:admin:admin:execution_manager:closure_size_bytes_sum 0
flyte:admin:admin:execution_manager:closure_size_bytes{quantile="0.5"} NaN
flyte:admin:admin:execution_manager:closure_size_bytes{quantile="0.9"} NaN
flyte:admin:admin:execution_manager:closure_size_bytes{quantile="0.99"} NaN
flyte:admin:admin:execution_manager:execution_events_created 0
flyte:admin:admin:execution_manager:execution_termination_failure 0
flyte:admin:admin:execution_manager:executions_created 0
flyte:admin:admin:execution_manager:propeller_failures 0
flyte:admin:admin:execution_manager:publish_error 0
flyte:admin:admin:execution_manager:publish_event_error 0
flyte:admin:admin:execution_manager:spec_size_bytes_count 0
flyte:admin:admin:execution_manager:spec_size_bytes_sum 0
flyte:admin:admin:execution_manager:spec_size_bytes{quantile="0.5"} NaN
flyte:admin:admin:execution_manager:spec_size_bytes{quantile="0.9"} NaN
flyte:admin:admin:execution_manager:spec_size_bytes{quantile="0.99"} NaN
flyte:admin:admin:execution_manager:transformer_error 0
flyte:admin:admin:execution_manager:unexpected_data_error 0
Will send out the fix
I take that back. It works as expected. The counter shows up only after executions reach a terminal state: succeeded, failed, timed out, or aborted. E.g.:
# TYPE flyte:admin:admin:execution_manager:executions_terminated counter
flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
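The absence of the counter before any execution terminates is consistent with how labeled counters generally behave: a series is exported only once it has been created for a concrete set of label values. The following is a toy model of that behavior using only the standard library, not flyteadmin's actual metrics code; `labeledCounter`, `Inc`, and `Expose` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labeledCounter is a toy stand-in for a labeled counter vector: a child
// series exists only once Inc has been called for that label value.
type labeledCounter struct {
	name   string
	label  string
	series map[string]int
}

func newLabeledCounter(name, label string) *labeledCounter {
	return &labeledCounter{name: name, label: label, series: map[string]int{}}
}

// Inc creates the series for value on first use, then increments it.
func (c *labeledCounter) Inc(value string) { c.series[value]++ }

// Expose renders the existing series in a Prometheus-like text format.
// With no series yet it returns the empty string: nothing to scrape.
func (c *labeledCounter) Expose() string {
	values := make([]string, 0, len(c.series))
	for v := range c.series {
		values = append(values, v)
	}
	sort.Strings(values) // deterministic output order
	var b strings.Builder
	for _, v := range values {
		fmt.Fprintf(&b, "%s{%s=%q} %d\n", c.name, c.label, v, c.series[v])
	}
	return b.String()
}

func main() {
	c := newLabeledCounter("executions_terminated", "exec_id")
	fmt.Printf("before any termination: %q\n", c.Expose())
	c.Inc("fe533aa1e880546b59db")
	fmt.Print(c.Expose())
}
```

Unlabeled metrics such as `executions_created` appear with a value of 0 from startup, which is why the first listing shows them but not the terminated counter.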
@Alex Pozimenko let us know what behavior you are seeing
h
I don't see the phase in the metrics you posted... Can you also check the task and node terminations' metrics?
p
Task execution metrics
# HELP flyte:admin:admin:task_execution_manager:task_executions_terminated overall count of terminated workflow executions
# TYPE flyte:admin:admin:task_execution_manager:task_executions_terminated counter
flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.mapper_a_mappable_task_0-0",tasktype="",wf=""} 1
flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.coalesce-0",tasktype="",wf=""} 1
flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.mapper_a_mappable_task_0-0",tasktype="",wf=""} 1
flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.coalesce-0",tasktype="",wf=""} 1
Node execution metrics
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="end-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="start-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="end-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="start-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""}
I will check why the phase is not being emitted
h
p
Ok, that seems to have worked @Haytham Abuelfutuh
Node executions
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="end-node",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n0",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n1",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="start-node",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
Task executions
flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n0",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.mapper_a_mappable_task_0-0",tasktype="",wf=""} 1
flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n1",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.coalesce-0",tasktype="",wf=""} 1
Executions
flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
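To check quickly whether the `phase` label is present on a scrape of the metrics endpoint, the text output can be tallied per phase. This is a minimal sketch using only the standard library; `countByPhase` is a hypothetical helper, the sample lines are shortened versions of the ones in this thread, and it counts series rather than summing their values.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// phaseLabel matches the phase="..." label inside a metric's label set.
var phaseLabel = regexp.MustCompile(`[{,]phase="([^"]*)"`)

// countByPhase scans Prometheus text-format output and, for every sample of
// the named metric, tallies how many series carry each phase label value.
// Series without a phase label are tallied under the empty string.
func countByPhase(exposition, metric string) map[string]int {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip comments (# HELP / # TYPE) and samples of other metrics.
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric+"{") {
			continue
		}
		phase := ""
		if m := phaseLabel.FindStringSubmatch(line); m != nil {
			phase = m[1]
		}
		counts[phase]++
	}
	return counts
}

func main() {
	const metric = "flyte:admin:admin:execution_manager:executions_terminated"
	const sample = `# TYPE flyte:admin:admin:execution_manager:executions_terminated counter
flyte:admin:admin:execution_manager:executions_terminated{domain="development",exec_id="f2282f588a3b940cd99d",phase="SUCCEEDED",project="flytesnacks"} 1
flyte:admin:admin:execution_manager:executions_terminated{domain="development",exec_id="fe533aa1e880546b59db",project="flytesnacks"} 1
`
	fmt.Println(countByPhase(sample, metric))
}
```

A non-zero count under the empty key means some series are still being emitted without the phase label.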
a
@Prafulla Mahindrakar, thanks for the fix. Shall I try flyteadmin v0.6.145 ?
I've upgraded flyteadmin but now don't see
flyte-admin-task.execution.manager-task.executions.terminated.counter
at all
also v0.6.145 appears to have auth issues. My workflows kept failing with this error after the upgrade: "service account name not authorized". I downgraded to v0.6.131 and the same workflows work fine. Trying 0.6.145 again just to be sure
confirming - the auth issue is back after upgrading to 0.6.145
h
Sorry, you seem to have hit a regression! Where do you see that error? Does the execution start and the task fail with that? When it fails, can you check the pod created to see what service account was set on it? How do you launch the workflow? Is it through flytectl and you pass a k8s service account?
@Prafulla Mahindrakar @katrina can you please help investigate this? We can't ship with that regression in a release.
@Eduardo Apolinario (eapolinario) @Yuvraj can we see how to add this scenario to the end-to-end tests?
You don't have to read the entire thread, just the last two messages...
p
Hi Alex, knowing how you are launching these workflows would be helpful for further investigation.
Also, regarding the missing metric, can you verify you are checking for this metric name? In my testing I am seeing this metric being emitted:
flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
k
@Alex Pozimenko are you setting the auth role?
a
Where do you see that error? Does the execution start and the task fail with that? When it fails, can you check the pod created to see what service account was set on it? How do you launch the workflow? Is it through flytectl and you pass a k8s service account?
1. I launch from the console.
2. The task starts; the error comes from the container when it tries to access other AWS resources.
3. I'll check the service account (need to upgrade my environment again 🙂).
(@Haytham Abuelfutuh @Prafulla Mahindrakar)
are you setting the auth role?
I'm not sure what you mean by that. I only changed the version of the admin, no other changes to our deployment. We use OIDC auth and a K8s service account, if that helps
k
hi @Alex Pozimenko thank you for reporting this! we've updated the v0.19.4 release with a fix for reading the deprecated auth field in admin. do you mind updating your deployment to unblock?
a
cool, what version of the admin shall I use?
k
v0.6.147
👍 1
a
is there a matrix that maps release versions to containers?
v0.6.147 is looking good! My workflow completed successfully. Thanks @katrina for fixing this
k
this was all @Prafulla Mahindrakar 🎉
but glad to hear it! thanks for updating us
a
thank you @Prafulla Mahindrakar!
k
is there a matrix that maps release versions to containers?
we should publish an image per release: https://github.com/flyteorg/flyteadmin/pkgs/container/flyteadmin/19159524?tag=v0.6.147
a
right, but how do I know that v0.6.147 is part of the v0.19.4 release?
k
a
got it!
p
Glad it worked out for you, Alex. And hopefully the metrics issue is resolved as well?
k
also we use the non-release "flyteadmin" package in the helm chart: https://github.com/flyteorg/flyte/blob/v0.19.4/charts/flyte-core/values.yaml#L19
a
now about the missing metric. Looks like the name changed from
flyte-admin-task.execution.manager-task.executions.terminated
to
flyte-admin-admin-execution.manager-executions.terminated
i see the phase tag on the new metric now
🎉 1
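The dashed name above and the colon-separated name flyteadmin exposes look like the same metric seen through two systems. Here is a sketch of the apparent mapping, assuming the monitoring backend rewrites `:` to `-` and `_` to `.`; that rule is inferred only from the names in this thread and is not documented behavior of any particular backend. `dashboardName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// dashboardName applies the rewrite rule inferred from this thread
// (an assumption, not documented behavior): ':' becomes '-' and '_' becomes '.'.
func dashboardName(promName string) string {
	return strings.NewReplacer(":", "-", "_", ".").Replace(promName)
}

func main() {
	for _, name := range []string{
		"flyte:admin:admin:execution_manager:executions_terminated",
		"flyte:admin:admin:task_execution_manager:task_executions_terminated",
		"flyte:admin:admin:node_execution_manager:node_executions_terminated",
	} {
		fmt.Println(dashboardName(name))
	}
}
```

Under that assumption the task-level counter would now appear as `flyte-admin-admin-task.execution.manager-task.executions.terminated`, with an extra `admin` segment compared to the old name, which may be why the old name no longer matches anything.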
I'm thinking of 3 scenarios here:
• succeeded - all good
• failed - likely (but not necessarily) user error
• everything else - likely infra error
Does that make sense?
p
I think failed can mean user errors as well as infra errors like permission issues
a
can a workflow be terminated in a phase other than failed/succeeded? or is it always one of the two?
p
Node execution phases
const (
	NodeExecution_UNDEFINED       NodeExecution_Phase = 0
	NodeExecution_QUEUED          NodeExecution_Phase = 1
	NodeExecution_RUNNING         NodeExecution_Phase = 2
	NodeExecution_SUCCEEDED       NodeExecution_Phase = 3
	NodeExecution_FAILING         NodeExecution_Phase = 4
	NodeExecution_FAILED          NodeExecution_Phase = 5
	NodeExecution_ABORTED         NodeExecution_Phase = 6
	NodeExecution_SKIPPED         NodeExecution_Phase = 7
	NodeExecution_TIMED_OUT       NodeExecution_Phase = 8
	NodeExecution_DYNAMIC_RUNNING NodeExecution_Phase = 9
	NodeExecution_RECOVERED       NodeExecution_Phase = 10
)
Task execution phases
const (
	TaskExecution_UNDEFINED TaskExecution_Phase = 0
	TaskExecution_QUEUED    TaskExecution_Phase = 1
	TaskExecution_RUNNING   TaskExecution_Phase = 2
	TaskExecution_SUCCEEDED TaskExecution_Phase = 3
	TaskExecution_ABORTED   TaskExecution_Phase = 4
	TaskExecution_FAILED    TaskExecution_Phase = 5
	// To indicate cases where task is initializing, like: ErrImagePull, ContainerCreating, PodInitializing
	TaskExecution_INITIALIZING TaskExecution_Phase = 6
	// To address cases, where underlying resource is not available: Backoff error, Resource quota exceeded
	TaskExecution_WAITING_FOR_RESOURCES TaskExecution_Phase = 7
)
Workflow execution phases
const (
	WorkflowExecution_UNDEFINED  WorkflowExecution_Phase = 0
	WorkflowExecution_QUEUED     WorkflowExecution_Phase = 1
	WorkflowExecution_RUNNING    WorkflowExecution_Phase = 2
	WorkflowExecution_SUCCEEDING WorkflowExecution_Phase = 3
	WorkflowExecution_SUCCEEDED  WorkflowExecution_Phase = 4
	WorkflowExecution_FAILING    WorkflowExecution_Phase = 5
	WorkflowExecution_FAILED     WorkflowExecution_Phase = 6
	WorkflowExecution_ABORTED    WorkflowExecution_Phase = 7
	WorkflowExecution_TIMED_OUT  WorkflowExecution_Phase = 8
	WorkflowExecution_ABORTING   WorkflowExecution_Phase = 9
)
Terminal Phases for all
var terminalExecutionPhases = map[core.WorkflowExecution_Phase]bool{
	core.WorkflowExecution_SUCCEEDED: true,
	core.WorkflowExecution_FAILED:    true,
	core.WorkflowExecution_TIMED_OUT: true,
	core.WorkflowExecution_ABORTED:   true,
}

var terminalNodeExecutionPhases = map[core.NodeExecution_Phase]bool{
	core.NodeExecution_SUCCEEDED: true,
	core.NodeExecution_FAILED:    true,
	core.NodeExecution_TIMED_OUT: true,
	core.NodeExecution_ABORTED:   true,
	core.NodeExecution_SKIPPED:   true,
	core.NodeExecution_RECOVERED: true,
}

var terminalTaskExecutionPhases = map[core.TaskExecution_Phase]bool{
	core.TaskExecution_SUCCEEDED: true,
	core.TaskExecution_FAILED:    true,
	core.TaskExecution_ABORTED:   true,
}
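Read together, the listings above say which workflow phases can be terminal: a phase is terminal exactly when it is a key in the corresponding map. Here is a self-contained sketch of that lookup for workflow executions. `WorkflowPhase` and its constants are local stand-ins that mirror the values listed above so the example compiles without the flyteidl dependency; `isTerminal` is a hypothetical helper.

```go
package main

import "fmt"

// WorkflowPhase is a local stand-in for core.WorkflowExecution_Phase.
type WorkflowPhase int

// The order mirrors the numeric values in the listing above (0 through 9).
const (
	Undefined WorkflowPhase = iota
	Queued
	Running
	Succeeding
	Succeeded
	Failing
	Failed
	Aborted
	TimedOut
	Aborting
)

var phaseNames = []string{
	"UNDEFINED", "QUEUED", "RUNNING", "SUCCEEDING", "SUCCEEDED",
	"FAILING", "FAILED", "ABORTED", "TIMED_OUT", "ABORTING",
}

// String returns the phase name, or "UNKNOWN" for out-of-range values.
func (p WorkflowPhase) String() string {
	if p < 0 || int(p) >= len(phaseNames) {
		return "UNKNOWN"
	}
	return phaseNames[p]
}

// terminalPhases mirrors terminalExecutionPhases from the listing above.
var terminalPhases = map[WorkflowPhase]bool{
	Succeeded: true,
	Failed:    true,
	TimedOut:  true,
	Aborted:   true,
}

// isTerminal reports whether p is a terminal phase; phases missing from the
// map yield the zero value, false.
func isTerminal(p WorkflowPhase) bool { return terminalPhases[p] }

func main() {
	for p := Undefined; p <= Aborting; p++ {
		fmt.Printf("%-10s terminal=%t\n", p, isTerminal(p))
	}
}
```

Intermediate phases such as FAILING or ABORTING are not in the map, so they report as non-terminal.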
a
thanks, but this doesn't answer the question
k
@Alex Pozimenko a terminated workflow has a phase in: [succeeded, failed, aborted, timed out]
a
can we use any of these as an indication of infra failures vs user errors?
or maybe there're other ways?
k
not from these metrics i believe. we distinguish between system and user failures in the individual errors output by failed workflows using the kind. we could modify flyteadmin to tag the failures by kind but propeller should also output system errors
(sorry stealth updated the link for kind above)
a
so just to be clear, are you saying there's a metric output by propeller that has a kind tag?
k
the propeller metric just has a total count of system errors. but if you want to open a pr to tag failures by system vs. user errors in flyteadmin that would be awesome!
a
i can try if you point me to the code
k
awesome! here is where we increment terminated executions with the phase. so for failed executions you can parse the error in the event and add the error kind
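A rough sketch of the idea, not flyteadmin's actual code: derive a second label from the error kind when the terminal phase is FAILED, and leave it empty otherwise so every series keeps the same label set. Everything here is hypothetical and stands in for the real types: `ErrorKind` and its three values, `terminationEvent`, `labelsFor`, and the toy `terminated` counter.

```go
package main

import "fmt"

// ErrorKind is a local stand-in for the kind attached to an execution error
// (assumed values: UNKNOWN, USER, SYSTEM).
type ErrorKind string

const (
	KindUnknown ErrorKind = "UNKNOWN"
	KindUser    ErrorKind = "USER"
	KindSystem  ErrorKind = "SYSTEM"
)

// terminationEvent is a hypothetical, trimmed-down view of an execution event.
type terminationEvent struct {
	Phase string     // e.g. "SUCCEEDED", "FAILED"
	Kind  *ErrorKind // nil when the event carries no error
}

// labelsFor returns the phase and error-kind label values for an event.
// Only FAILED events get a kind; a FAILED event without an error is UNKNOWN.
func labelsFor(ev terminationEvent) (phase, kind string) {
	if ev.Phase != "FAILED" {
		return ev.Phase, ""
	}
	if ev.Kind == nil {
		return ev.Phase, string(KindUnknown)
	}
	return ev.Phase, string(*ev.Kind)
}

// terminated is a toy counter keyed by the (phase, kind) label pair.
type terminated map[[2]string]int

// record increments the series matching the event's labels.
func (t terminated) record(ev terminationEvent) {
	phase, kind := labelsFor(ev)
	t[[2]string{phase, kind}]++
}

func main() {
	system, user := KindSystem, KindUser
	counts := terminated{}
	for _, ev := range []terminationEvent{
		{Phase: "SUCCEEDED"},
		{Phase: "FAILED", Kind: &user},
		{Phase: "FAILED", Kind: &system},
		{Phase: "FAILED", Kind: &system},
		{Phase: "ABORTED"},
	} {
		counts.record(ev)
	}
	fmt.Println(counts)
}
```

With labels like these, a dashboard could split failed executions into user and system errors without changing the existing phase breakdown.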