• Haytham Abuelfutuh

    3 months ago
    @Prafulla Mahindrakar, mind helping @Alex Pozimenko with this? He's trying to get metrics on terminated tasks/executions annotated with the phase they ended as. We merged this PR https://github.com/flyteorg/flyteadmin/pull/386/files and he's trying with admin version 0.6.131 (should include this PR) but can't see any phases under:
    flyte-admin-task.execution.manager-task.executions.terminated.counter
  • Prafulla Mahindrakar

    3 months ago
    Sure 👍
  • There is a bug in this path and the new metric is not published. These are the ones published with that image:
    flyte:admin:admin:execution_manager:acceptance_delay_count 0
    flyte:admin:admin:execution_manager:acceptance_delay_sum 0
    flyte:admin:admin:execution_manager:acceptance_delay{quantile="0.5"} NaN
    flyte:admin:admin:execution_manager:acceptance_delay{quantile="0.9"} NaN
    flyte:admin:admin:execution_manager:acceptance_delay{quantile="0.99"} NaN
    flyte:admin:admin:execution_manager:active_executions 0
    flyte:admin:admin:execution_manager:closure_size_bytes_count 0
    flyte:admin:admin:execution_manager:closure_size_bytes_sum 0
    flyte:admin:admin:execution_manager:closure_size_bytes{quantile="0.5"} NaN
    flyte:admin:admin:execution_manager:closure_size_bytes{quantile="0.9"} NaN
    flyte:admin:admin:execution_manager:closure_size_bytes{quantile="0.99"} NaN
    flyte:admin:admin:execution_manager:execution_events_created 0
    flyte:admin:admin:execution_manager:execution_termination_failure 0
    flyte:admin:admin:execution_manager:executions_created 0
    flyte:admin:admin:execution_manager:propeller_failures 0
    flyte:admin:admin:execution_manager:publish_error 0
    flyte:admin:admin:execution_manager:publish_event_error 0
    flyte:admin:admin:execution_manager:spec_size_bytes_count 0
    flyte:admin:admin:execution_manager:spec_size_bytes_sum 0
    flyte:admin:admin:execution_manager:spec_size_bytes{quantile="0.5"} NaN
    flyte:admin:admin:execution_manager:spec_size_bytes{quantile="0.9"} NaN
    flyte:admin:admin:execution_manager:spec_size_bytes{quantile="0.99"} NaN
    flyte:admin:admin:execution_manager:transformer_error 0
    flyte:admin:admin:execution_manager:unexpected_data_error 0
    Will send out the fix
  • I take that back. It works as expected. The counter shows up only after executions reach a terminal state (succeeded, failed, timed out, or aborted), e.g.:
    # TYPE flyte:admin:admin:execution_manager:executions_terminated counter
    flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
  • @Alex Pozimenko let us know what behavior you are seeing
  • Haytham Abuelfutuh

    3 months ago
    I don't see the phase in the metrics you posted... Can you also check the task and node terminations' metrics?
  • Prafulla Mahindrakar

    3 months ago
    Task execution metrics
    # HELP flyte:admin:admin:task_execution_manager:task_executions_terminated overall count of terminated workflow executions
    # TYPE flyte:admin:admin:task_execution_manager:task_executions_terminated counter
    flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.mapper_a_mappable_task_0-0",tasktype="",wf=""} 1
    flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.coalesce-0",tasktype="",wf=""} 1
    flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.mapper_a_mappable_task_0-0",tasktype="",wf=""} 1
    flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.coalesce-0",tasktype="",wf=""} 1
    Node execution metrics
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="end-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="fe533aa1e880546b59db",node="start-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="end-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n0",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="n1",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="ffd9b80a79ca34186889",node="start-node",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""}
    I will check why the phase is not being emitted
  • Haytham Abuelfutuh

    3 months ago
  • Prafulla Mahindrakar

    3 months ago
    OK, that seems to have worked, @Haytham Abuelfutuh.
    Node executions
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="end-node",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n0",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n1",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    flyte:admin:admin:node_execution_manager:node_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="start-node",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
    Task executions
    flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n0",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.mapper_a_mappable_task_0-0",tasktype="",wf=""} 1
    flyte:admin:admin:task_execution_manager:task_executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="n1",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="core.control_flow.map_task.coalesce-0",tasktype="",wf=""} 1
    Executions
    flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
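To check whether the phase label is present without reading the output by eye, the sample lines above can be parsed and totalled per phase. This is a minimal sketch, not part of flyteadmin: `parseSample` and `totalsByPhase` are hypothetical helpers, the label sets in `main` are abbreviated versions of the lines above, and the parser ignores escaped quotes inside label values and samples without a label set.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// labelPattern matches one key="value" pair inside the braces of a sample line.
var labelPattern = regexp.MustCompile(`(\w+)="([^"]*)"`)

// sample is one parsed line of Prometheus text-format output.
type sample struct {
	name   string
	labels map[string]string
	value  float64
}

// parseSample parses a labelled sample line; it returns false for comments,
// blank lines, lines without a label set, and lines without a numeric value.
func parseSample(line string) (sample, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return sample{}, false
	}
	open := strings.Index(line, "{")
	closeIdx := strings.LastIndex(line, "}")
	if open < 0 || closeIdx < open {
		return sample{}, false
	}
	fields := strings.Fields(line[closeIdx+1:])
	if len(fields) == 0 {
		return sample{}, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return sample{}, false
	}
	s := sample{name: line[:open], labels: map[string]string{}, value: v}
	for _, m := range labelPattern.FindAllStringSubmatch(line[open+1:closeIdx], -1) {
		s.labels[m[1]] = m[2]
	}
	return s, true
}

// totalsByPhase sums the values of all samples of the given metric, grouped by
// the "phase" label. Samples without that label are grouped under "".
func totalsByPhase(text, metric string) map[string]float64 {
	totals := map[string]float64{}
	for _, line := range strings.Split(text, "\n") {
		if s, ok := parseSample(line); ok && s.name == metric {
			totals[s.labels["phase"]] += s.value
		}
	}
	return totals
}

func main() {
	const metric = "flyte:admin:admin:execution_manager:executions_terminated"
	// Abbreviated label sets, for illustration only.
	text := metric + `{domain="development",exec_id="f2282f588a3b940cd99d",phase="SUCCEEDED",project="flytesnacks"} 1
` + metric + `{domain="development",exec_id="fe533aa1e880546b59db",project="flytesnacks"} 1`
	fmt.Println(totalsByPhase(text, metric))
}
```

An entry under the empty key means some series of that metric were emitted without a phase label, which is the symptom discussed earlier in the thread.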
  • Alex Pozimenko

    3 months ago
    @Prafulla Mahindrakar, thanks for the fix. Shall I try flyteadmin v0.6.145?
  • I've upgraded flyteadmin but now don't see
    flyte-admin-task.execution.manager-task.executions.terminated.counter
    at all
  • also v0.6.145 appears to have auth issues. My workflows kept failing with this error after the upgrade: "service account name not authorized". I downgraded to v0.6.131 and the same workflows work fine. Trying 0.6.145 again just to be sure
  • confirming - the auth issue is back after upgrading to 0.6.145
  • Haytham Abuelfutuh

    3 months ago
    Sorry, you seem to have hit a regression!! Where do you see that error? Does the execution start and then the task fails with that? When it fails, can you check the pod created to see what service account was set on it? How do you launch the workflow? Is it through flytectl, passing a k8s service account? @Prafulla Mahindrakar @katrina can you please help investigate this? We can't ship with that regression in a release. @Eduardo Apolinario (eapolinario) @Yuvraj can we see how to add this scenario to the end-to-end tests?
  • You don't have to read the entire thread, just the last two messages...
  • Prafulla Mahindrakar

    3 months ago
    Hi Alex, knowing how you are launching these workflows would be helpful for further investigation
  • Also, regarding the missing metric, can you verify you are checking for the metric with this name? In my testing I am seeing this metric being emitted:
    flyte:admin:admin:execution_manager:executions_terminated{app_name="",domain="development",exec_id="f2282f588a3b940cd99d",node="",phase="SUCCEEDED",project="flytesnacks",runtime_type="",runtime_version="",task="",tasktype="",wf=""} 1
  • Ketan (kumare3)

    3 months ago
    @Alex Pozimenko are you setting the auth role?
  • Alex Pozimenko

    3 months ago
    Where do you see that error? Does the execution start and the task fails with that? When it fails, can you check the pod created to see what service account was set on it? How do you launch the workflow? Is it through flytectl and you pass a k8s service account?
    1. i launch from the console
    2. the task starts, the error is coming from the container when it's trying to access other aws resources
    3. i'll check the service account (need to upgrade my environment again 🙂) (@Haytham Abuelfutuh @Prafulla Mahindrakar)
  • are you setting the auth role? I'm not sure what you mean by that. I only changed version of the admin, no other changes to our deployment. We use OIDC auth and K8s service account if that helps
  • katrina

    3 months ago
    hi @Alex Pozimenko thank you for reporting this! we've updated the v0.19.4 release with a fix for reading the deprecated auth field in admin. do you mind updating your deployment to unblock?
  • Alex Pozimenko

    3 months ago
    cool, what version of the admin shall I use?
  • katrina

    3 months ago
    v0.6.147
  • Alex Pozimenko

    3 months ago
    is there a matrix that maps release versions to containers?
  • v0.6.147 is looking good! My workflow completed successfully. Thanks @katrina for fixing this
  • katrina

    3 months ago
    this was all @Prafulla Mahindrakar 🎉
  • but glad to hear it! thanks for updating us
  • Alex Pozimenko

    3 months ago
    thank you @Prafulla Mahindrakar!
  • katrina

    3 months ago
    is there a matrix that maps release versions to containers? we should publish an image per release:
    https://github.com/flyteorg/flyteadmin/pkgs/container/flyteadmin/19159524?tag=v0.6.147
  • Alex Pozimenko

    3 months ago
    right, but how do I know that v0.19.4 is part of v0.6.147 release?
  • katrina

    3 months ago
  • Alex Pozimenko

    3 months ago
    got it!
  • Prafulla Mahindrakar

    3 months ago
    Glad it worked out for you, Alex. And hopefully the metrics issue is resolved as well?
  • katrina

    3 months ago
    also we use the non-release "flyteadmin" package in the helm chart: https://github.com/flyteorg/flyte/blob/v0.19.4/charts/flyte-core/values.yaml#L19
  • Alex Pozimenko

    3 months ago
    now about the missing metric. Looks like the name changed from
    flyte-admin-task.execution.manager-task.executions.terminated
    to
    flyte-admin-admin-execution.manager-executions.terminated
  • i see the phase tag on the new metric now
  • I'm thinking of 3 scenarios here:
    • succeeded - all good
    • failed - likely (but not necessarily) user error
    • everything else - likely infra error
    does that make sense?
  • Prafulla Mahindrakar

    3 months ago
    I think failed can mean user errors as well as infra errors like permission issues
  • Alex Pozimenko

    3 months ago
    can a workflow be terminated in a phase other than failed/succeeded? or is it always one of the two?
  • Prafulla Mahindrakar

    3 months ago
    Node execution phases
    const (
    	NodeExecution_UNDEFINED       NodeExecution_Phase = 0
    	NodeExecution_QUEUED          NodeExecution_Phase = 1
    	NodeExecution_RUNNING         NodeExecution_Phase = 2
    	NodeExecution_SUCCEEDED       NodeExecution_Phase = 3
    	NodeExecution_FAILING         NodeExecution_Phase = 4
    	NodeExecution_FAILED          NodeExecution_Phase = 5
    	NodeExecution_ABORTED         NodeExecution_Phase = 6
    	NodeExecution_SKIPPED         NodeExecution_Phase = 7
    	NodeExecution_TIMED_OUT       NodeExecution_Phase = 8
    	NodeExecution_DYNAMIC_RUNNING NodeExecution_Phase = 9
    	NodeExecution_RECOVERED       NodeExecution_Phase = 10
    )
    Task execution phases
    const (
    	TaskExecution_UNDEFINED TaskExecution_Phase = 0
    	TaskExecution_QUEUED    TaskExecution_Phase = 1
    	TaskExecution_RUNNING   TaskExecution_Phase = 2
    	TaskExecution_SUCCEEDED TaskExecution_Phase = 3
    	TaskExecution_ABORTED   TaskExecution_Phase = 4
    	TaskExecution_FAILED    TaskExecution_Phase = 5
    	// To indicate cases where task is initializing, like: ErrImagePull, ContainerCreating, PodInitializing
    	TaskExecution_INITIALIZING TaskExecution_Phase = 6
    	// To address cases, where underlying resource is not available: Backoff error, Resource quota exceeded
    	TaskExecution_WAITING_FOR_RESOURCES TaskExecution_Phase = 7
    )
    Workflow execution phases
    const (
    	WorkflowExecution_UNDEFINED  WorkflowExecution_Phase = 0
    	WorkflowExecution_QUEUED     WorkflowExecution_Phase = 1
    	WorkflowExecution_RUNNING    WorkflowExecution_Phase = 2
    	WorkflowExecution_SUCCEEDING WorkflowExecution_Phase = 3
    	WorkflowExecution_SUCCEEDED  WorkflowExecution_Phase = 4
    	WorkflowExecution_FAILING    WorkflowExecution_Phase = 5
    	WorkflowExecution_FAILED     WorkflowExecution_Phase = 6
    	WorkflowExecution_ABORTED    WorkflowExecution_Phase = 7
    	WorkflowExecution_TIMED_OUT  WorkflowExecution_Phase = 8
    	WorkflowExecution_ABORTING   WorkflowExecution_Phase = 9
    )
    Terminal Phases for all
    var terminalExecutionPhases = map[core.WorkflowExecution_Phase]bool{
    	core.WorkflowExecution_SUCCEEDED: true,
    	core.WorkflowExecution_FAILED:    true,
    	core.WorkflowExecution_TIMED_OUT: true,
    	core.WorkflowExecution_ABORTED:   true,
    }
    
    var terminalNodeExecutionPhases = map[core.NodeExecution_Phase]bool{
    	core.NodeExecution_SUCCEEDED: true,
    	core.NodeExecution_FAILED:    true,
    	core.NodeExecution_TIMED_OUT: true,
    	core.NodeExecution_ABORTED:   true,
    	core.NodeExecution_SKIPPED:   true,
    	core.NodeExecution_RECOVERED: true,
    }
    
    var terminalTaskExecutionPhases = map[core.TaskExecution_Phase]bool{
    	core.TaskExecution_SUCCEEDED: true,
    	core.TaskExecution_FAILED:    true,
    	core.TaskExecution_ABORTED:   true,
    }
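Read against the question, the maps above say a workflow can end in four phases, not two. A minimal, self-contained sketch of that lookup follows; `WorkflowPhase` and its constants are local stand-ins that copy the numeric values quoted above rather than importing the real flyteidl `core` package, and `isTerminal` is a hypothetical helper name.

```go
package main

import "fmt"

// WorkflowPhase is a local stand-in for core.WorkflowExecution_Phase,
// using the same numeric values as the enum quoted above.
type WorkflowPhase int32

const (
	Undefined  WorkflowPhase = iota // 0
	Queued                          // 1
	Running                         // 2
	Succeeding                      // 3
	Succeeded                       // 4
	Failing                         // 5
	Failed                          // 6
	Aborted                         // 7
	TimedOut                        // 8
	Aborting                        // 9
)

// terminalWorkflowPhases mirrors the terminalExecutionPhases map above.
var terminalWorkflowPhases = map[WorkflowPhase]bool{
	Succeeded: true,
	Failed:    true,
	TimedOut:  true,
	Aborted:   true,
}

// isTerminal reports whether a workflow in phase p has finished.
// Phases missing from the map yield the zero value, false.
func isTerminal(p WorkflowPhase) bool {
	return terminalWorkflowPhases[p]
}

func main() {
	for _, p := range []WorkflowPhase{Running, Failing, Failed, Aborting, Aborted, TimedOut} {
		fmt.Printf("phase %d terminal: %t\n", p, isTerminal(p))
	}
}
```

Transitional phases such as FAILING and ABORTING are not terminal, so under this reading the terminated counter would only ever carry SUCCEEDED, FAILED, ABORTED, or TIMED_OUT as the workflow phase.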
  • Alex Pozimenko

    3 months ago
    thanks, but this doesn't answer the question
  • katrina

    3 months ago
    @Alex Pozimenko a terminated workflow has a phase in: [succeeded, failed, aborted, timed out]
  • Alex Pozimenko

    3 months ago
    can we use any of these as an indication of infra failures vs user errors?
  • or maybe there're other ways?
  • katrina

    3 months ago
    not from these metrics i believe. we distinguish between system and user failures in the individual errors output by failed workflows using the kind. we could modify flyteadmin to tag the failures by kind but propeller should also output system errors
  • (sorry stealth updated the link for kind above)
  • Alex Pozimenko

    3 months ago
    so just to be clear, are you saying there's a metric output by propeller that has kind tag?
  • katrina

    3 months ago
    the propeller metric just has a total count of system errors. but if you want to open a pr to tag failures by system vs. user errors in flyteadmin that would be awesome!
  • Alex Pozimenko

    3 months ago
    i can try if you point me to the code
  • katrina

    3 months ago
    awesome! here is where we increment terminated executions with the phase. so for failed executions you can parse the error in the event and add the error kind
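
The suggestion above could look roughly like the following. This is only a sketch of the idea using the standard library: it does not reproduce flyteadmin's actual metrics code, and `terminationEvent`, `terminatedCounter`, `observe`, and the string values "USER", "SYSTEM", and "UNKNOWN" are all hypothetical stand-ins for whatever the real event and error-kind types provide.

```go
package main

import (
	"fmt"
	"sync"
)

// terminationEvent is a hypothetical, simplified view of a terminal execution event.
type terminationEvent struct {
	Phase     string // e.g. "SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"
	ErrorKind string // assumed to be "USER" or "SYSTEM"; empty if the event carries no error
}

// labelSet is the label combination a termination is counted under.
type labelSet struct {
	Phase string
	Kind  string
}

// terminatedCounter is a stand-in for a labelled counter metric.
type terminatedCounter struct {
	mu     sync.Mutex
	counts map[labelSet]int
}

func newTerminatedCounter() *terminatedCounter {
	return &terminatedCounter{counts: map[labelSet]int{}}
}

// observe increments the counter for the event's phase. Only FAILED events get
// an error kind; a failed event without one is counted as "UNKNOWN".
func (c *terminatedCounter) observe(ev terminationEvent) {
	kind := ""
	if ev.Phase == "FAILED" {
		kind = ev.ErrorKind
		if kind == "" {
			kind = "UNKNOWN"
		}
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labelSet{Phase: ev.Phase, Kind: kind}]++
}

func main() {
	c := newTerminatedCounter()
	c.observe(terminationEvent{Phase: "SUCCEEDED"})
	c.observe(terminationEvent{Phase: "FAILED", ErrorKind: "USER"})
	c.observe(terminationEvent{Phase: "FAILED", ErrorKind: "SYSTEM"})
	c.observe(terminationEvent{Phase: "FAILED"})
	for labels, n := range c.counts {
		fmt.Printf("phase=%q kind=%q count=%d\n", labels.Phase, labels.Kind, n)
	}
}
```

Keeping the kind empty for non-failed phases means the existing phase-only series stay comparable, and a dashboard can separate likely infra failures from user errors by filtering on the kind for FAILED.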