Hey all I have Flyte in a K8s cluster with Prometheus scrapi Flyte #flyte-support


steep-parrot-14561

02/15/2023, 1:12 AM

Hey all! I have Flyte in a K8s cluster with Prometheus scraping Kubelet/cAdvisor to get metrics such as container_cpu_usage_seconds_total and kube_pod_container_resource_limits_memory_bytes. I am trying to monitor these metrics for all pods that are generated by Flyte executions. It is kind of working, but sometimes an execution is simply not scraped by Prometheus: the metric is empty. I have a workflow that runs daily, and today's execution's metrics are there, but yesterday's are not. As it's the same workflow and it has not changed since yesterday, I really don't know what could be happening. Can anyone help me? This is the Prometheus config file regarding kubelet:


kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    interval: ""
    proxyUrl: ""
    https: true
    cAdvisor: true
    probes: true
    resource: false
    resourcePath: "/metrics/resource/v1alpha1"
    cAdvisorMetricRelabelings: []
    probesMetricRelabelings: []
    cAdvisorRelabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path
    probesRelabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path
    resourceRelabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path
    metricRelabelings: []
    relabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path
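
(Editor's note: one way to pin down when an execution went unscraped is to look at the raw sample timestamps for its pods and list the windows with no data. The sketch below is only an illustration, not something from the thread. It assumes a matrix-shaped JSON body of the kind Prometheus's HTTP API returns for range vectors, where each series carries a `values` list of `[timestamp, "value"]` pairs. The function names and the "more than two scrape intervals" threshold are hypothetical choices.)

```python
from typing import Dict, List, Tuple


def sample_times(response: Dict) -> List[float]:
    """Collect every sample timestamp from a matrix-shaped response body."""
    return [
        float(ts)
        for series in response["data"]["result"]
        for ts, _value in series["values"]
    ]


def find_gaps(
    timestamps: List[float], start: float, end: float, step: float = 30.0
) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) windows longer than 2 * step without samples.

    `start` and `end` bound the period the pod was expected to be running;
    `step` is the assumed scrape interval in seconds.
    """
    gaps = []
    previous = start
    # Appending `end` lets the loop also catch a gap at the tail of the window.
    for ts in sorted(timestamps) + [end]:
        if ts - previous > 2 * step:
            gaps.append((previous, ts))
        previous = ts
    return gaps
```

An execution whose metric is completely empty shows up as a single gap covering the whole window, while a pod that was only picked up late shows a gap at the start.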

tall-lock-23197

02/15/2023, 5:19 AM

cc @hallowed-mouse-14616

quaint-diamond-37493

02/15/2023, 8:05 AM

how long are your tasks running and what is your scrape interval?

steep-parrot-14561

02/15/2023, 3:37 PM

scrape interval is 30s. this workflow specifically has 3 tasks: the execution that was recorded had durations 1m28s -> 25m15s -> 1m17s, and the one that was not had durations 1m48s -> 25m23s -> 1m15s, so they were pretty similar

quaint-diamond-37493

02/15/2023, 6:32 PM

so at least they are not too short to be scraped... but not sure how fast new pods are picked up by the service monitor...
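
(Editor's note: the "not too short to be scraped" point can be checked with a little arithmetic. The sketch below is a rough illustration using the durations and the 30s interval quoted above. It computes an idealized lower bound on the number of scrapes that fall inside each task's lifetime, and it ignores target discovery delay and any collection lag on the kubelet side, so real counts can differ. The helper names are hypothetical.)

```python
def parse_duration(text: str) -> int:
    """Convert a duration written like '25m15s' into seconds."""
    minutes, _, rest = text.partition("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


def expected_samples(duration_s: int, interval_s: int = 30) -> int:
    """Minimum number of scrapes landing inside a window of this length."""
    return duration_s // interval_s


executions = {
    "recorded": ["1m28s", "25m15s", "1m17s"],
    "missing": ["1m48s", "25m23s", "1m15s"],
}

for name, durations in executions.items():
    counts = [expected_samples(parse_duration(d)) for d in durations]
    print(name, counts)
```

Under these assumptions even the shortest task spans at least two scrape intervals in both executions, so task duration alone does not explain why one day's metrics are empty.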

hallowed-mouse-14616

02/16/2023, 2:39 PM

@steep-parrot-14561 IIUC this is an issue with Prometheus scraping the kubelet for pod resource utilization, right? tbh I don't have experience setting something like this up. Have you made any progress? Happy to help where I can, but I fear it may not be as in-depth as you're looking for 😅

steep-parrot-14561

02/17/2023, 8:21 PM

Hey! I have not worked too much on that since my last reply, but I guess I will just have to mess with the Prometheus configuration to see if something fixes it. We will also be changing the Prometheus deployment we use in the next few weeks, so maybe the problem will fix itself. I will keep the thread updated if something comes up, though, so it can be of help to others 😃 Thank you!

