# ask-the-community
y
Hi all 👋 I'm trying to monitor a stale workflow - a workflow that is stuck in one of the following states:
• There are not enough k8s resources for this workflow/tasks
• Something is wrong with the workflow definition - bottlenecks, a complex or poorly designed WF, etc.
I'm not up to speed on the Flyte metrics. What is the best way to monitor this?
Also, I thought of using a timeout alert via the launch plan's Slack notification.
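Something along these lines (a rough sketch only; wf, long_step, and the recipient address are placeholders, and I'm assuming flytekit's per-task timeout plus launch plan notifications, where the Slack notification is delivered to an email address that the channel listens on):
import typing
from datetime import timedelta

from flytekit import LaunchPlan, Slack, task, workflow
from flytekit.models.core.execution import WorkflowExecutionPhase


# Per-task timeout: if the task runs longer than 2h, Flyte should fail it,
# which makes the execution end in a phase we can alert on.
@task(timeout=timedelta(hours=2))
def long_step(x: typing.List[int]) -> int:
    return len(set(x))


@workflow
def wf(x: typing.List[int]) -> int:
    return long_step(x=x)


# Launch plan that notifies Slack when an execution ends badly
# (failed, timed out, or aborted).
alerting_lp = LaunchPlan.get_or_create(
    workflow=wf,
    name="wf_with_slack_alert",
    notifications=[
        Slack(
            phases=[
                WorkflowExecutionPhase.FAILED,
                WorkflowExecutionPhase.TIMED_OUT,
                WorkflowExecutionPhase.ABORTED,
            ],
            recipients_email=["my-slack-channel@example.com"],  # placeholder address
        )
    ],
)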
r
I saw here that some Flyte metrics can be monitored with Grafana: https://docs.flyte.org/en/latest/deployment/configuration/monitoring.html
y
Yeah, I'm familiar with that; there are too many metrics there. I'm looking for suggestions on which ones to use.
r
Are you using the Flyte binary deployment, @Yaniv Ben Zvi?
y
No, I use the following chart: {"chartName": "flyte", "repoName": "flyte", "repoUrl": "https://flyteorg.github.io/flyte"}
r
@Yaniv Ben Zvi There are 3 Flyte deployment paths here. If you deploy with the Flyte chart, I think it's Flyte Binary or Flyte Core now. About your question on customizing the resources of the K8s task pod, you have to check 2 things:
• The default resource limits of the task pod (this can be configured in the Helm chart)
• Your defined Python task:
import typing

from flytekit import Resources, task


# The limits set here must stay within the maximums configured in the Helm chart values.
@task(requests=Resources(cpu="1", mem="100Mi"), limits=Resources(cpu="2", mem="150Mi"))
def count_unique_numbers(x: typing.List[int]) -> int:
    s = set()
    for i in x:
        s.add(i)
    return len(s)
@Yaniv Ben Zvi If you don't want to re-deploy your Flyte Core, you can update the task-resource-attribute directly here.
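For example, something like this with flytectl (a sketch; the project/domain and resource values are placeholders, and the attrFile schema is worth double-checking against the flytectl docs):
# tra.yaml
project: flytesnacks
domain: development
defaults:
  cpu: "150m"
  memory: "250Mi"
limits:
  cpu: "2"
  memory: "1Gi"

flytectl update task-resource-attribute --attrFile tra.yaml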
y
This doesn't help me monitor a stale Flyte WF.
r
Oh, so you want to monitor the WF via metrics that contain info about resource usage, right?
d
@Yaniv Ben Zvi I think for the conditions you describe, logging would be more helpful for you.
y
@Ryuu I want to know when there is some WF with an execution duration above 2h, for example.
e
@Ryuu @David Espejo (he/him) What we are trying to do is set a "timeout" warning at the workflow level, similar to a task timeout but only to alert on long-running WF executions, where of course
wf_execution = sum(task_executions)
For example, we have one WF with 100 tasks/nodes that we would like to schedule multiple times during the day. What we don't want is for one scheduled run to overlap another run. Since Flyte doesn't support this option, we want to at least get an alert when, in extreme cases, one run takes much longer than the normal duration to finish. The most convenient way would be to rely on Prometheus metrics, but we got lost there and are not sure how to find the right one. Is there any metric (or combination of metrics) that gives us the duration of a specific execution, including tasks that are failed/finished or still running? Other solutions to the problem are also welcome 🙂
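If nothing metric-based fits, a small watchdog on top of flytekit's FlyteRemote could be a fallback (rough sketch; the project/domain, the 2h threshold, and details like is_done / closure.started_at are assumptions to verify against your flytekit version):
from datetime import datetime, timedelta, timezone

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

THRESHOLD = timedelta(hours=2)  # placeholder threshold

# Reads the endpoint from the usual flytectl config (FLYTECTL_CONFIG / ~/.flyte/config.yaml).
remote = FlyteRemote(Config.auto())

for ex in remote.recent_executions(project="flytesnacks", domain="development", limit=100):
    if ex.is_done:  # only look at executions that are still running
        continue
    started = ex.closure.started_at  # assumption: UTC timestamp of the execution start
    if started is None:
        continue
    if started.tzinfo is None:
        started = started.replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - started
    if age > THRESHOLD:
        # Hook your own alerting (Slack webhook, PagerDuty, ...) in here instead of print.
        print(f"Execution {ex.id.name} has been running for {age}, above the {THRESHOLD} threshold")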