# ask-the-community
v
Hi people! I have not found much documentation about the Prometheus metrics made available by Flyte. Are there metrics about tasks, such as individual task completion times, durations, etc.? Much like we have for workflows, with metrics like "flytepropellerallworkflowsuccess_duration_ms_count". This information is shown in the Flyte console, so I wonder whether it is also available as metrics, or whether there is at least a plan to implement that. Also, are there metrics about CPU and memory usage by workflow/task?
k
Ya it’s not all documented
s
Who can document this, @Ketan (kumare3)?
k
@Samhita Alla - @Dan Rammer (hamersaw) has some new work. But this is hard to document. We would appreciate community help
CPU and memory usage is your own thing
As all tasks are tagged with the execution ID
d
@Vinícius Sosnowski can we understand a little more about your objective? Are you trying to set up monitoring around task completion times, with alerts based on Prometheus-reported values? IMO this is difficult, and the high cardinality of metric IDs associated with this use case can cause massive memory usage in Prometheus.
If, alternatively, your goal is just better observability into workflow runtime executions, please take a look at this RFC. We have made great progress on this and hope to release portions of the runtime and orchestration metric reporting components within a few weeks.
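To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. It assumes each task gets its own Prometheus histogram, which exposes one series per bucket plus a `_sum` and a `_count` series. The function name and the example numbers are hypothetical and not taken from Flyte.

```python
def estimated_series(num_tasks: int, label_combinations_per_task: int, buckets: int) -> int:
    """Estimate how many time series per-task duration histograms would create.

    Assumption: one histogram per (task, label combination). A Prometheus
    histogram exposes one series per bucket, plus one _sum and one _count.
    """
    series_per_histogram = buckets + 2
    return num_tasks * label_combinations_per_task * series_per_histogram


# Hypothetical deployment: 1000 tasks, 5 label combinations each
# (for example project/domain/version), 12 buckets per histogram.
total = estimated_series(num_tasks=1000, label_combinations_per_task=5, buckets=12)
print(total)
```

The count grows linearly with every factor, so adding one more label with a handful of values multiplies the total again.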
v
Hi! @Dan Rammer (hamersaw) Our objective is basically to monitor these aspects of Flyte:
• Workflow and task executions, successes, failures, average completion times
• Computational resources (CPU, memory, network, etc.) consumed by workflow and task runs (i.e. their pods in our K8s cluster), along with Flyte components
The idea was to use Prometheus metrics for all of that, but we have concluded that this approach currently cannot monitor workflow and task computational resources, nor task executions and completion times. About the RFC, it does seem to suit some of these needs. If I understood it correctly, we could obtain task completion times from the metrics it provides, right?
d
So the RFC describes an approach for better understanding workflow execution through the use of orchestration and runtime metrics. It's not really meant for monitoring anything at scale. The workflow / node / task completion times are available right now through the admin DB: each execution has an associated duration that is reported by Flyte. However, I fail to see how this would make sense to monitor at scale: all workflows / nodes / tasks are different and therefore have different expected durations. If you have, say, 10 tasks, the average task runtime doesn't really mean anything unless you have a separate metric for each task, which then becomes unmanageable as the number of tasks scales into the 100s or 1000s. Am I understanding this correctly?
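A small sketch of why one average across all tasks says little. The task names and durations below are made up for illustration and are not real Flyte output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (task_name, duration_seconds) samples.
samples = [
    ("resize_image", 2.0),
    ("resize_image", 4.0),
    ("train_model", 3600.0),
    ("train_model", 3000.0),
]


def global_average(rows):
    """Average duration over every sample, ignoring which task it belongs to."""
    return mean(duration for _, duration in rows)


def per_task_average(rows):
    """Average duration computed separately for each task name."""
    grouped = defaultdict(list)
    for name, duration in rows:
        grouped[name].append(duration)
    return {name: mean(durations) for name, durations in grouped.items()}


overall = global_average(samples)
by_task = per_task_average(samples)
print(overall)
print(by_task)
```

The global figure describes neither task, while the per-task figures are meaningful but require one value (and, in Prometheus, at least one series) per task.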
v
Yeah, actually we are seeking to have separate metrics for each task. The idea is that we could at least have a sorted list of tasks that fail more often than a certain threshold, or a list of task executions that have exceeded the average duration by more than X times. We would want similar lists for tasks/workflows that are very resource-demanding, so that, for example, we could investigate the most demanding ones and see whether refactoring or changing something is a good idea. I confess I have not been using Prometheus for long, so I don't know whether having so many metrics like this would be feasible in terms of memory, etc.
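The two lists described here could be computed offline from execution records rather than from Prometheus. Below is a minimal sketch under that assumption. The `TaskRun` record, its fields, and the sample data are hypothetical and would need to be filled from wherever the executions are stored.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """Hypothetical record of one task execution."""
    task: str
    succeeded: bool
    duration_s: float


def failing_tasks(runs, threshold):
    """Return (task, failure_rate) pairs above the threshold, worst first."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        totals[run.task] += 1
        if not run.succeeded:
            failures[run.task] += 1
    rates = {task: failures[task] / totals[task] for task in totals}
    flagged = [(task, rate) for task, rate in rates.items() if rate > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


def slow_runs(runs, factor):
    """Return runs longer than `factor` times their own task's average.

    Note: the average includes the slow run itself, which is a simplification.
    """
    durations = defaultdict(list)
    for run in runs:
        durations[run.task].append(run.duration_s)
    averages = {task: mean(values) for task, values in durations.items()}
    return [run for run in runs if run.duration_s > factor * averages[run.task]]


runs = [
    TaskRun("a", True, 10.0), TaskRun("a", False, 10.0),
    TaskRun("a", False, 10.0), TaskRun("a", True, 90.0),
    TaskRun("b", True, 5.0), TaskRun("b", True, 5.0),
    TaskRun("b", True, 5.0), TaskRun("b", False, 5.0),
]
print(failing_tasks(runs, threshold=0.3))
print(slow_runs(runs, factor=2.0))
```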
d
This sounds like a very useful effort. I think all the metrics are available right now, just not put together. For example, all of the task durations are available on the TaskExecution proto message, which can be retrieved through the gRPC endpoint on flyteadmin. Resource utilization of different pods can be retrieved using the k8s metrics-server (e.g. kubectl top pods). In Flyte, we try to be relatively unopinionated about monitoring, because the scope is so large and diverse that supporting every scenario is unmanageable. Thoughts?
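As a rough illustration of the metrics-server route, the sketch below parses text in the usual `kubectl top pods` column layout (NAME, CPU(cores), MEMORY(bytes)) and ranks pods by memory. The pod names and values are made up, the exact output format may differ between kubectl versions, and only the `m`, `Ki`, `Mi` and `Gi` suffixes are handled.

```python
# Sample text in the shape printed by `kubectl top pods`; values are hypothetical.
SAMPLE_OUTPUT = """
NAME                  CPU(cores)   MEMORY(bytes)
example-exec-n0-0     250m         512Mi
example-exec-n1-0     1            2Gi
"""

# Conversion factors from a memory suffix to MiB.
MEMORY_UNITS_MIB = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}


def parse_cpu_millicores(value: str) -> int:
    """Convert '250m' or '1' (whole cores) to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def parse_memory_mib(value: str) -> float:
    """Convert '512Mi', '2Gi' or '1024Ki' to MiB."""
    for suffix, factor in MEMORY_UNITS_MIB.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory value: {value!r}")


def parse_top_pods(text: str):
    """Parse the table into a list of dicts, skipping the header row."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:
        name, cpu, memory = line.split()
        rows.append({
            "pod": name,
            "cpu_millicores": parse_cpu_millicores(cpu),
            "memory_mib": parse_memory_mib(memory),
        })
    return rows


rows = parse_top_pods(SAMPLE_OUTPUT)
heaviest = max(rows, key=lambda row: row["memory_mib"])
print(heaviest["pod"])
```

In practice the text would come from running the command, and the pod names would still need to be mapped back to workflow and task executions.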
v
Great! I'm taking a look at this, I see also that Flytekit offers the SynchronousFlyteClient class to directly communicate with FlyteAdmin so I'm using it to test all this stuff. Thank you very much, this was very helpful! =)