Hey all! I have Flyte in a K8s cluster with Promet...
# ask-the-community
v
Hey all! I have Flyte in a K8s cluster with Prometheus scraping Kubelet/cAdvisor to get metrics such as `container_cpu_usage_seconds_total` and `kube_pod_container_resource_limits_memory_bytes`. I am trying to monitor these metrics for all pods that are generated by Flyte executions. It is kind of working, but sometimes an execution is simply not scraped by Prometheus: the metric is empty. I have a workflow that runs daily, and today's execution's metrics are there, but yesterday's are not. As it's the same workflow and it has not changed since yesterday, I really don't know what could be happening. Can anyone help me? This is the Prometheus config file regarding kubelet:
kubelet:
  enabled: true
  namespace: kube-system

  serviceMonitor:

    interval: ""

    proxyUrl: ""

    https: true

    cAdvisor: true

    probes: true

    resource: false

    resourcePath: "/metrics/resource/v1alpha1"

    cAdvisorMetricRelabelings: []

    probesMetricRelabelings: []

    cAdvisorRelabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path

    probesRelabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path

    resourceRelabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path

    metricRelabelings: []

    relabelings:
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path
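
To tell whether yesterday's samples are really missing (as opposed to being hidden by a dashboard filter), one option is to ask Prometheus directly over its HTTP API for the time window of that execution. The sketch below only builds the `/api/v1/query_range` URL. The server address, the `pod` label name, and the pod-name pattern are assumptions for illustration and would need to match the actual setup.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: Prometheus is reachable here (e.g. through a port-forward).
PROMETHEUS_URL = "http://localhost:9090"


def range_query_url(pod_regex: str, start: datetime, end: datetime, step_s: int = 30) -> str:
    """Build a query_range URL for the CPU counter of pods matching pod_regex.

    Assumption: the cAdvisor series carry a `pod` label; older clusters
    may expose `pod_name` instead.
    """
    query = f'container_cpu_usage_seconds_total{{pod=~"{pod_regex}"}}'
    params = {
        "query": query,
        "start": start.timestamp(),  # Unix seconds
        "end": end.timestamp(),
        "step": f"{step_s}s",        # match the 30s scrape interval
    }
    return f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical pod-name pattern and time window for one execution.
    print(range_query_url(
        "my-execution-id.*",
        datetime(2024, 1, 1, tzinfo=timezone.utc),
        datetime(2024, 1, 2, tzinfo=timezone.utc),
    ))
```

Opening the printed URL (or fetching it with any HTTP client) returns JSON; an empty `result` list for the window of yesterday's run would confirm that no samples were stored for those pods.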
s
cc @Dan Rammer (hamersaw)
f
how long are your tasks running and what is your scrape interval?
v
The scrape interval is 30s. This workflow specifically has 3 tasks: the execution that was recorded had durations 1m28s -> 25m15s -> 1m17s, and the one that was not had durations 1m48s -> 25m23s -> 1m15s, so they were pretty similar.
f
so at least they are not too short to be scraped... but not sure how fast new pods are picked up by the service monitor...
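
As a rough sanity check on "not too short to be scraped", the snippet below estimates how many 30s scrapes should fall inside each task of the missing execution. It is only a back-of-the-envelope sketch: it assumes scrapes happen at a fixed interval and ignores any delay before a new container shows up in the kubelet's cAdvisor output.

```python
def expected_samples(duration_s: int, scrape_interval_s: int = 30) -> int:
    """Lower bound on the number of scrapes that land inside a pod's lifetime."""
    return duration_s // scrape_interval_s


# Task durations of the execution that was NOT recorded, in seconds.
durations = {"1m48s": 108, "25m23s": 1523, "1m15s": 75}

for label, seconds in durations.items():
    print(f"{label}: at least {expected_samples(seconds)} samples")
```

Even the shortest task should overlap with at least two scrapes, and the 25-minute task with about fifty, so task duration alone is unlikely to explain an entirely empty series.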
d
@Vinícius Sosnowski IIUC this is an issue with Prometheus scraping the kubelet for pod resource utilization, right? tbh I don't have experience setting something like this up. Have you made any progress? Happy to help where I can, but I fear it may not be as in-depth as you're looking for 😅
v
Hey! I have not worked too much on that since my last reply, but I guess I will just have to mess with Prometheus configurations to see if something fixes it. We will also be changing the Prometheus deployment we use in the next weeks so maybe the problem will fix itself. I will keep the thread updated though if something comes up so it can be of help to others 😃 Thank you!