# flyte-support
a
Hi all 👋 I'm having some trouble getting the Grafana User Dashboard up and running - all my stats come back as empty values. Is it possible to confirm if this dashboard is still expected to work out-of-the-box (and it's an issue with my prometheus config), or is the dashboard no longer maintained?
a
@abundant-judge-84756 what Helm chart and version are you using to deploy Flyte?
a
@average-finland-92144 We're currently running flyte-binary, v1.12.0. I've followed the instructions from that guide, as well as this thread for the flyte-binary specific instructions. Grafana has successfully detected our projects/domains as variables, but the data all looks empty (except for one plot) - even though we've been running tests with both successful and failed workflows.
⌛ 1
g
Compare the stats/fields you're collecting in prometheus vs what your grafana is trying to load. Might be a missing prefix or something
gratitude thank you 1
👀 1
a
Thanks @gentle-tomato-480 - browsing prometheus stats using the explore view in grafana has definitely helped with seeing what's available. It's a little tricky interpreting the stats and seeing which ones correspond to what might be intended to be visualised in this User Dashboard, but it at least gives me a starting point 👍 I guess my basic question is, does this User Dashboard work for anyone else or is it known to be broken?
g
It used to work for me back in April. I've since taken down the cluster that prom/grafana/flyte were running on, so I don't have a working example at the moment. IIRC, the grafana templates have had some updates. Not sure if they've also been updated on the grafana marketplace, or if there have been any changes to the prometheus metrics from flyte. https://github.com/flyteorg/flyte/pull/5255 This seems to be the most recent grafana-related PR, and its merge date is later than what's on the grafana website, so there may have been some changes.
a
Thank you! I spied this MR, although it looks like it updates the propeller + admin dashboards but not the user dashboard. Still helpful to know about though, particularly with the extra context in the descriptions 👍
g
I recommend looking at the grafana dashboard jsons btw. There you'll see the exact prometheus metric per graph. This helps for better comparison to what's in prometheus
👍 1
But yeah, hopefully it's just a mismatch in names that is easily fixable on your side 🙏
🤞 1
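The comparison suggested above can be scripted. Here is a minimal Python sketch, assuming the dashboard has been exported as JSON with PromQL strings stored under `expr` keys (the usual Grafana layout) and that the list of metric names available in prometheus has been collected separately. The helper names, the `flyte:` prefix pattern, and the sample data are made up for illustration:

```python
import re

# Assumption: the Flyte metrics of interest all start with "flyte:".
METRIC_RE = re.compile(r"flyte:[A-Za-z0-9_:]+")


def iter_exprs(node):
    """Yield every "expr" string found anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from iter_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_exprs(item)


def missing_metrics(dashboard, available):
    """Return dashboard metric names that are absent from `available`."""
    used = set()
    for expr in iter_exprs(dashboard):
        used.update(METRIC_RE.findall(expr))
    return sorted(used - set(available))


# Hypothetical, cut-down dashboard with a single panel.
dashboard = {
    "panels": [
        {
            "targets": [
                {
                    "expr": "sum(rate(flyte:propeller:all:workflow:"
                    'failure_duration_ms_count{project=~"$project"}[5m]))'
                }
            ]
        }
    ]
}
# Hypothetical list of metric names seen in prometheus.
available = ["flyte:propeller:all:workflow:failure_duration_unlabeled_ms_count"]

print(missing_metrics(dashboard, available))
```

With a real export, `dashboard` would come from `json.load()` on the dashboard file, and anything the function returns is a candidate name mismatch worth checking by hand.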
a
I've found a good example of a name mismatch - failed workflows: the Grafana marketplace user dashboard .json uses:
`sum(rate(flyte:propeller:all:workflow:failure_duration_ms_count{project=~"$project", domain=~"$domain", wf=~"$workflow"}[5m]))`
Browsing prometheus metrics shows no data in `failure_duration_ms_count`, but there is data in `failure_duration_unlabeled_ms_count`. Also, applying the project/domain filter filters out this data. So perhaps something about the labelling isn't working as expected? Which is strange because these workflows have clearly run in the expected project-domain 🤔
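One possible explanation for the filter hiding the data, offered as a guess rather than a confirmed diagnosis: in PromQL a label that is absent from a series is treated as an empty string, and `=~` matchers are fully anchored. A series without `project`/`domain` labels therefore only survives the filter when the pattern also matches an empty string. The Python sketch below imitates that rule. The `matches` helper and the sample label values are made up for illustration, and it is an assumption that the "unlabeled" series carry no `project`/`domain` labels at all:

```python
import re


def matches(series_labels, matchers):
    """Imitate PromQL `=~` matching: a missing label counts as "" and the
    regex has to match the whole label value."""
    return all(
        re.fullmatch(pattern, series_labels.get(label, "")) is not None
        for label, pattern in matchers.items()
    )


# Hypothetical label sets.
labeled = {"project": "flytesnacks", "domain": "development"}
unlabeled = {}  # assumption: stands in for an *_unlabeled_ms_count series

selected = {"project": "flytesnacks", "domain": "development"}

print(matches(labeled, selected))    # True
print(matches(unlabeled, selected))  # False
print(matches(unlabeled, {"project": ".*", "domain": ".*"}))  # True
```

If that is what is happening, the unlabeled series would only show up while the dashboard variables are set to an "all" style pattern.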
g
Might be related to https://github.com/flyteorg/flyte/issues/3758 btw, but not sure
a
Really sorry about the experience here with those dashboards. I'm taking point on updating the propeller and admin dashboards on Grafana marketplace. For the user dashboard, if you could share all the mismatches you find that'd be useful. If you want to go ahead and update them, even better but in any case, this is an area where we need to improve
a
Thanks @average-finland-92144 - much appreciated! I would love to be able to help contribute to some of these issues I've found as I've been working more heavily with flyte this year - but so far time has been a bit short 😓 For the user dashboard, the issues we've had were:
• `flyte:propeller:all:workflow:accepted` - no data
• `flyte:propeller:all:workflow:success_duration_ms_count` - needed to be `flyte:propeller:all:workflow:event_recording:success_duration_ms_count`
• `flyte:propeller:all:workflow:failure_duration_ms_count` - no data. I'm able to visualise failed tasks instead of workflows using `flyte:propeller:all:task:event_recording:failure_duration_ms_count`
• `flyte:propeller:all:workflow:workflow_aborted` - no data
• success/failure/queueing time by quantile, and User VS System errors - no data unless I use the 'unlabeled_ms' version of the metrics, which doesn't allow us to filter by project/domain/workflow
• CPU/Memory limits VS quota - no 'kube_resourcequota' metric found in our prometheus setup, but maybe this is unique to our setup. I was able to more-or-less recreate these visualisations using our own cluster prometheus metrics
• Pending tasks - not clear if this works, only one data point visualised (but we've been testing across multiple workflows)
• CPU/Memory Usage Percentage - infinite loading ⏳
gratitude thank you 1
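For anyone patching a local copy of the dashboard in the meantime, the two substitutions reported above could be applied to the query strings with something like this Python sketch. The mapping contains only the replacements from that message, and the second one swaps workflow failures for task failures, so it changes what the panel shows. `patch_expr` is a hypothetical helper, not part of Flyte or Grafana:

```python
# Replacements reported in this thread. The second entry is a workaround
# (task failures instead of workflow failures), not a like-for-like rename.
RENAMES = {
    "flyte:propeller:all:workflow:success_duration_ms_count":
        "flyte:propeller:all:workflow:event_recording:success_duration_ms_count",
    "flyte:propeller:all:workflow:failure_duration_ms_count":
        "flyte:propeller:all:task:event_recording:failure_duration_ms_count",
}


def patch_expr(expr):
    """Return `expr` with each old metric name swapped for its replacement."""
    for old, new in RENAMES.items():
        expr = expr.replace(old, new)
    return expr


original = (
    "sum(rate(flyte:propeller:all:workflow:failure_duration_ms_count"
    '{project=~"$project", domain=~"$domain"}[5m]))'
)
print(patch_expr(original))
```

Running the helper twice gives the same result as running it once, because neither old name appears inside its replacement.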
b
@abundant-judge-84756 @average-finland-92144 we have (almost) the same issues with the User dashboard as mentioned above, and would love to know approximately how long it will take to fix them. Thanks!
a
I'm actively working on it and tracking progress on the issue
🙏 1
@abundant-judge-84756 @bright-vr-24100 and everyone else interested: would you be able to try this update to the Grafana user dashboard and share your feedback here? Please keep in mind some metrics there come from `kube-state-metrics` (e.g. CPU/memory usage and ResourceQuota), which I installed using the kube-prometheus-stack (not the only way to enable it). Thanks!
a
Thanks for the work on this @average-finland-92144! I've tested out the updates and left a few small comments on the pull request 🙌