# ask-the-community
y
Hi all 👋 I'm trying to monitor a stale workflow - a workflow that is stuck in one of the following states:
• There are not enough k8s resources for this workflow/tasks
• Something is wrong with the workflow definition - bottlenecks, a complex or poorly designed WF, etc.
I'm not up to speed on the Flyte metrics. What is the best way to monitor this?
Also, I thought of using a timeout alert via the launch plan's Slack notification.
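Something along these lines (a rough sketch only; wf, long_step, and the recipient address are placeholders, and I'm assuming flytekit's per-task timeout plus launch plan notifications, where the Slack notification is delivered to an email address that the channel listens on):
import typing
from datetime import timedelta

from flytekit import LaunchPlan, Slack, task, workflow
from flytekit.models.core.execution import WorkflowExecutionPhase


# Per-task timeout: if the task runs longer than 2h, Flyte should fail it,
# which makes the execution end in a phase we can alert on.
@task(timeout=timedelta(hours=2))
def long_step(x: typing.List[int]) -> int:
    return len(set(x))


@workflow
def wf(x: typing.List[int]) -> int:
    return long_step(x=x)


# Launch plan that notifies Slack when an execution ends badly
# (failed, timed out, or aborted).
alerting_lp = LaunchPlan.get_or_create(
    workflow=wf,
    name="wf_with_slack_alert",
    notifications=[
        Slack(
            phases=[
                WorkflowExecutionPhase.FAILED,
                WorkflowExecutionPhase.TIMED_OUT,
                WorkflowExecutionPhase.ABORTED,
            ],
            recipients_email=["my-slack-channel@example.com"],  # placeholder address
        )
    ],
)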
r
I saw here that some Flyte metrics can be monitored with Grafana: https://docs.flyte.org/en/latest/deployment/configuration/monitoring.html
y
Yeah, I'm familiar with that; there are too many metrics there. I'm looking for suggestions on which ones to use.
r
Are you using the Flyte binary deployment, @Yaniv Ben Zvi?
y
No, I use the following chart: {"chartName": "flyte", "repoName": "flyte", "repoUrl": "https://flyteorg.github.io/flyte"}
r
@Yaniv Ben Zvi There are 3 Flyte deployment paths here. If you deploy with the Flyte chart, I think it's Flyte Binary or Flyte Core now. About your question on customizing the resources of the K8s task pod, you have to check 2 things:
• The default resource limits of the task pod (this can be configured in the Helm chart)
• Your defined Python task:
import typing

from flytekit import Resources, task


# The limits set here must stay within the maximums configured in the Helm chart values.
@task(requests=Resources(cpu="1", mem="100Mi"), limits=Resources(cpu="2", mem="150Mi"))
def count_unique_numbers(x: typing.List[int]) -> int:
    s = set()
    for i in x:
        s.add(i)
    return len(s)
@Yaniv Ben Zvi If you don't want to re-deploy your Flyte Core, you can update the task-resource-attribute directly here.
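For example, something like this with flytectl (a sketch; the project/domain and resource values are placeholders, and the attrFile schema is worth double-checking against the flytectl docs):
# tra.yaml
project: flytesnacks
domain: development
defaults:
  cpu: "150m"
  memory: "250Mi"
limits:
  cpu: "2"
  memory: "1Gi"

flytectl update task-resource-attribute --attrFile tra.yaml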
y
This doesn't help me monitor a stale Flyte WF.
r
Oh, so you want to monitor the WF via metrics that contain info about resource usage, right?
d
@Yaniv Ben Zvi I think for the conditions you describe, logging would be more helpful for you.
y
@Ryuu I want to know when there is some WF with an execution duration above 2h, for example.
e
@Ryuu @David Espejo (he/him) What we are trying to do is set a "timeout" warning at the workflow level, similar to a task timeout but only to alert on long-running WF executions, where of course
wf_execution = sum(task_executions)
For example, we have one WF with 100 tasks/nodes that we would like to schedule multiple times during the day. What we don't want is for one scheduled run to overlap another run. Since Flyte doesn't support this option, we want to at least get an alert when, in extreme cases, one run takes much longer than the normal duration to finish. The most convenient way would be to rely on Prometheus metrics, but we got lost there and are not sure how to find the right one. Is there any metric (or combination of metrics) that gives us the duration of a specific execution, including tasks that are failed/finished or still running? Other solutions to the problem are also welcome 🙂
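If nothing metric-based fits, a small watchdog on top of flytekit's FlyteRemote could be a fallback (rough sketch; the project/domain, the 2h threshold, and details like is_done / closure.started_at are assumptions to verify against your flytekit version):
from datetime import datetime, timedelta, timezone

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

THRESHOLD = timedelta(hours=2)  # placeholder threshold

# Reads the endpoint from the usual flytectl config (FLYTECTL_CONFIG / ~/.flyte/config.yaml).
remote = FlyteRemote(Config.auto())

for ex in remote.recent_executions(project="flytesnacks", domain="development", limit=100):
    if ex.is_done:  # only look at executions that are still running
        continue
    started = ex.closure.started_at  # assumption: UTC timestamp of the execution start
    if started is None:
        continue
    if started.tzinfo is None:
        started = started.replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - started
    if age > THRESHOLD:
        # Hook your own alerting (Slack webhook, PagerDuty, ...) in here instead of print.
        print(f"Execution {ex.id.name} has been running for {age}, above the {THRESHOLD} threshold")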