# flyte-deployment
j
I installed flyte v0.19.1 on aws using opta, but I can't seem to get logging to work. I created a log group in AWS named flyte-prod-tasks-logs and updated the values-eks.yaml file accordingly. In opta/aws/flyte.yaml I added the following:
task_logs:
    plugins:
        logs:
            cloudwatch-enabled: true
            cloudwatch-log-group: flyte-prod-tasks-logs
            cloudwatch-region: "{vars.region}"
My workflow runs successfully but I don't see anything in the log group. My ShellTask prints to standard out and I have a PythonTask that uses logging. Is there anything special I need to do? Or some configuration I missed?
> opta version
v0.22.1
> terraform --version
Terraform v1.0.11
on linux_amd64
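For context, the tasks in question probably look something like this (a minimal, hypothetical sketch with made-up names, not the actual workflow; the ShellTask import path is from flytekit.extras.tasks.shell and may differ by flytekit version):
import logging
import uuid

from flytekit import task, workflow
from flytekit.extras.tasks.shell import ShellTask

# A ShellTask whose script output goes to the pod's stdout
echo = ShellTask(
    name="echo",
    script="echo 'hello from a ShellTask'",
)

@task
def make_id() -> str:
    new_id = str(uuid.uuid4())
    logging.info("generated id %s", new_id)  # standard Python logging
    return new_id

@workflow
def wf() -> str:
    echo()
    return make_id()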
Can you paste the contents of:
kubectl get -n flyte cm flyte-propeller-config -o yaml
cc @Haytham Abuelfutuh 🙂
j
apiVersion: v1
data:
  admin.yaml: |
    admin:
      clientId: 'flytepropeller'
      clientSecretLocation: /etc/secrets/client_secret
      endpoint: flyteadmin:81
      insecure: true
    event:
      capacity: 1000
      rate: 500
      type: admin
  cache.yaml: |
    cache:
      max_size_mbs: 1024
      target_gc_percent: 70
  catalog.yaml: |
    catalog-cache:
      endpoint: datacatalog:89
      insecure: true
      type: datacatalog
  copilot.yaml: |
    plugins:
      k8s:
        co-pilot:
          image: cr.flyte.org/flyteorg/flytecopilot:v0.0.24
          name: flyte-copilot-
          start-timeout: 30s
  core.yaml: |
    manager:
      pod-application: flytepropeller
      pod-template-container-name: flytepropeller
      pod-template-name: flytepropeller-template
    propeller:
      downstream-eval-duration: 30s
      enable-admin-launcher: true
      gc-interval: 12h
      kube-client-config:
        burst: 25
        qps: 100
        timeout: 30s
      leader-election:
        enabled: true
        lease-duration: 15s
        lock-config-map:
          name: propeller-leader
          namespace: flyte
        renew-deadline: 10s
        retry-period: 2s
      limit-namespace: all
      max-workflow-retries: 50
      metadata-prefix: metadata/propeller
      metrics-prefix: flyte
      prof-port: 10254
      queue:
        batch-size: -1
        batching-interval: 2s
        queue:
          base-delay: 5s
          capacity: 1000
          max-delay: 120s
          rate: 100
          type: maxof
        sub-queue:
          capacity: 1000
          rate: 100
          type: bucket
        type: batch
      rawoutput-prefix: s3://flyte-prod-service-flyte
      workers: 40
      workflow-reeval-duration: 30s
    webhook:
      certDir: /etc/webhook/certs
      serviceName: flyte-pod-webhook
  enabled_plugins.yaml: |
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
  k8s.yaml: |
    plugins:
      k8s:
        default-cpus: 100m
        default-env-vars: []
        default-memory: 100Mi
  logger.yaml: |
    logger:
      level: 5
      show-source: true
  resource_manager.yaml: |
    propeller:
      resourcemanager:
        type: noop
  storage.yaml: |
    storage:
      type: s3
      container: "flyte-prod-service-flyte"
      connection:
        auth-type: iam
        region: us-west-2
      limits:
        maxDownloadMBs: 10
  task_logs.yaml: |
    plugins:
      logs:
        cloudwatch-enabled: true
        cloudwatch-log-group: flyte-prod-tasks-logs
        cloudwatch-region: us-west-2
        kubernetes-enabled: false
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: service-flyte-helmchart
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2022-01-26T19:57:43Z"
  labels:
    app.kubernetes.io/instance: service-flyte-helmchart
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v0.1.10
  name: flyte-propeller-config
  namespace: flyte
  resourceVersion: "8963"
  selfLink: /api/v1/namespaces/flyte/configmaps/flyte-propeller-config
  uid: 3fdd593b-ee78-476b-84fc-8c8a8c9b3502
k
cc @Haytham Abuelfutuh
h
Hey @JP Kosymna, are these array tasks? Because array tasks can run in a different environment (e.g. AWS Batch), their logs need to be configured separately… TL;DR: you can modify the task_logs.yaml like this:
task_logs.yaml: |
    plugins:
      logs:
        cloudwatch-enabled: true
        cloudwatch-log-group: flyte-prod-tasks-logs
        cloudwatch-region: us-west-2
        kubernetes-enabled: false
      k8s-array:
        logs:
          config:
            cloudwatch-enabled: true
            cloudwatch-log-group: flyte-prod-tasks-logs
            cloudwatch-region: us-west-2
            kubernetes-enabled: false
y
cc: @Eugene Cha Is it the same issue?
j
I've updated my config as @Haytham Abuelfutuh suggested, but I still see no logs in the log group. I'm now running a super simple task for testing:
import logging
import uuid

from flytekit import task

@task
def get_uuid() -> str:
    id = str(uuid.uuid4())
    logging.info(f'got new id {id}')
    print(f'got new id {id}')
    return id
k
Do you have a CloudWatch agent installed?
j
Only if the opta install did it for me. How would I check?
k
Should be a daemonset?
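One quick way to check (assuming the AWS Container Insights setup, which installs its agents into the amazon-cloudwatch namespace; adjust if your agent lives elsewhere):
# look for a CloudWatch / Fluent Bit / fluentd log agent running as a DaemonSet
kubectl get daemonsets -n amazon-cloudwatch
# or search every namespace for anything log-agent-like
kubectl get daemonsets -A | grep -iE 'cloudwatch|fluent'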
j
I don't see anything.
k
hmm cc @Yuvraj / @Nitin Aggarwal do you guys know what can be off?
cc @JD Palomino
k
it seems after using opta, they are not seeing any cloudwatch daemon
h
Opta doesn’t set up cloudwatch logs (fluentd, etc.) AFAIK… these steps will have to be followed manually/separately afterwards…
k
ohh, we need to add this to the docs then?
i.e. that you should set up your preferred logs provider
j
er…. yeah we only setup cloudwatch logs for the vpc and eks control plane
not for the individual containers as that’s a personal decision
but I can help you setting it up!
h
we have this one paragraph here https://docs.flyte.org/projects/cookbook/en/latest/auto/deployment/configure_logging_links.html#sphx-glr-auto-deployment-configure-logging-links-py:
Every organization potentially uses different log aggregators, making it hard to create a one-size-fits-all solution. Some examples of the log aggregators include cloud-hosted solutions like AWS CloudWatch, GCP Stackdriver, Splunk, Datadog, etc.
Flyte does not have an opinion here and provides a simplified interface to configure your log provider. Flyte-sandbox ships with the Kubernetes dashboard to visualize the logs. This may not be safe for production; hence we recommend users explore other log aggregators.
I think if we can document how to do that in our guide here: https://docs.flyte.org/en/latest/deployment/aws/manual.html that would be fantastic!
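For reference, the “simplified interface” mentioned above is the same plugins.logs block shown earlier: once a log agent is actually shipping container logs to CloudWatch, the link shown in the UI can be templated. A rough sketch only — the cloudwatch-template-uri key and the {{ .podName }} / {{ .namespace }} / {{ .containerName }} variables come from the flyteplugins log config, but the exact console URL and stream naming depend on your agent and Flyte version:
task_logs.yaml: |
    plugins:
      logs:
        cloudwatch-enabled: true
        cloudwatch-region: us-west-2
        cloudwatch-log-group: flyte-prod-tasks-logs
        # illustrative only: point the UI link at whatever stream pattern your agent writes
        cloudwatch-template-uri: "https://console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=flyte-prod-tasks-logs;stream=var.log.containers.{{ .podName }}_{{ .namespace }}_{{ .containerName }}"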
j
Ok, this makes sense
@JD Palomino I would appreciate your help to get this setup
j
cool, lmk when you’re free
j
@JD Palomino showed me how to view the pods logs using kubectl and that will work for now. I'll look into https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-EKS-logs.html or another log aggregator in the future
k
ya
h
yeah… we have used Datadog, Stackdriver, CloudWatch, Kibana… there are many of those, as you can tell, and most organizations have already settled on one or another, which is why Flyte only lets you configure how the log link should be formulated without getting too deep into setting these up…
e
@Yuvraj I think it’s the same issue
oh you can just view the logs in kubectl? I should try that
h
kubectl get pods -n <namespace>
That should list all the pods in, say, the flytesnacks namespace… Unless nodes are deeply nested, the pod names will have the structure:
<execId>-n0-0
Where <execId> is the Flyte execution id, n0 is the node id (unless you explicitly name it), and 0 is the retry attempt. Then:
kubectl logs -n <namespace> <pod name>
Or to watch logs as they come:
kubectl logs -n <namespace> <pod name> --follow
j
this tool is super useful https://github.com/wercker/stern
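For example (placeholders as above; stern matches pods by a name regex and tails all of their containers):
# follow logs from every pod belonging to a single execution
stern -n <namespace> <execId>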
👍 1
j
Somewhat related: I have some nested nodes via a map_task whose pods seem to get cleaned up almost immediately upon completion. Does flyte have any option to control how long pods are left around? Something like https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
k
@JP Kosymna yes, it does have an option
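For reference, this sits next to the gc-interval already visible in the propeller ConfigMap above. A minimal sketch, assuming the max-ttl-hours key from FlytePropeller's config (default 23 hours); verify the key against your Flyte version:
core.yaml: |
    propeller:
      gc-interval: 12h    # how often completed workflows are scanned for cleanup
      max-ttl-hours: 23   # hours a completed workflow (and its pods) is kept before GC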
j
Thanks for the links. The behavior I'm seeing is that only the pods from the map_task are getting removed immediately; the other pods do seem to stick around for 23h.
If that's just how map_tasks work, that's fine. I should probably set up a real logging solution going forward.
k
ohh interesting