Harry

over 2 years ago
Hey folks, apologies if I missed the solution to this in the documentation somewhere. I'm trying to deploy Flyte onto AWS EKS and enable the AWS Batch plugin. So far I'm using Helm with the `values.yaml` listed below, and I can't figure out how to get the right configuration into the FlyteAdmin config.
# Helm command
helm install flyteorg/flyte-binary \
    --generate-name \
    --kube-context=<context> \
    --namespace flyte \
    --values flyte-binary/flyte-binary-eks-values.yaml
# flyte-binary-eks-values.yaml
configuration:
  database:
    password: <RDS Password>
    host: <DB Host URI>
    dbname: app

  storage:
    metadataContainer: <bucket>
    userDataContainer: <bucket>
    provider: s3
    providerConfig:
      s3:
        region: "us-west-2"
        authType: "iam"

  logging:
    level: 1
    plugins:
      cloudwatch:
        enabled: true
        templateUri: |-
          https://console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/eks/opta-development/cluster;stream=var.log.containers.{{ .podName }}_{{ .namespace }}_{{ .containerName }}-{{ .containerId }}.log

  inline:
    plugins:
      aws:
        batch:
          roleAnnotationKey: <Redacted>
        region: us-west-2
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - aws_array
        default-for-task-types:
          container_array: aws_array
          aws-batch: aws_array
          container: container

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: <Redacted>

# Where should this go?
configMaps:
  adminServer:
    flyteadmin:
      roleNameKey: <Redacted>
      queues:
        executionQueues:
          - dynamic: <JobQueueName>
            attributes:
              - default
        workflowConfigs:
          - tags:
              - default
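For what it's worth, rendering the chart locally makes it easy to see which generated config these values actually land in; a quick sketch using plain Helm (chart reference as in the install command above):
# Render the chart locally and search the output for the queues section
helm template flyteorg/flyte-binary \
    --values flyte-binary/flyte-binary-eks-values.yaml \
    | grep -n -A 4 "queues"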
And when I try to run a workflow with Batch tasks, I get this error:
Workflow[flytesnacks:development:workflows.example.wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [aws_array]: [BadTaskSpecification] config[dynamic_queue] is missing
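A hedged guess at the last block's placement: the flyte-binary chart merges everything under `configuration.inline` into its single unified config file, so the `flyteadmin` and `queues` sections may belong there rather than in a separate `configMaps` entry, along these lines:
configuration:
  inline:
    flyteadmin:
      roleNameKey: <Redacted>
    queues:
      executionQueues:
        - dynamic: <JobQueueName>
          attributes:
            - default
      workflowConfigs:
        - tags:
            - default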
Thanks for reading this far 🙂 🙏
Sebastian

about 3 years ago
Hello, still trying to deploy workflows with CI/CD. I am surprised by how many rough edges there seem to be. Workflow:
jobs:
  register-flyte-workflows:
    name: Register Flyte workflows
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      # flytekit needs a newer Python than the 3.8 that ships with ubuntu-latest
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - run: pip install flytekit==1.1.*

      - name: Setup flytectl
        uses: unionai-oss/flytectl-setup-action@v0.0.1

      - name: Package workflows
        shell: bash
        run: |
          pyflyte \
          --pkgs flyte.workflows package \
          --image ${{ env.DOCKER_IMAGE }} \
          --output ${{ env.FLYTE_PACKAGE }}

      - name: Register workflows
        uses: unionai-oss/flyte-register-action@v0.0.2
        with:
          project: ${{ env.FLYTE_PROJECT }}
          version: ${{ env.VERSION }}
          proto: ${{ env.FLYTE_PACKAGE }}
          domain: ${{ env.FLYTE_DOMAIN }}
          config: ${{ env.FLYTE_CONFIG }}
      # OR
      # - name: Register workflows 
      #   shell: bash
      #   run: |
      #     flytectl register files \
      #     --archive ${{ env.FLYTE_ARCHIVE }} \
      #     --project ${{ env.FLYTE_PROJECT }} \
      #     --domain ${{ env.FLYTE_DOMAIN }} \
      #     --config ${{ env.FLYTE_CONFIG }} \
      #     --version ${{ env.VERSION }}
`Package workflows` reports success, but `Register workflows` using the action fails with:
Error: input package have some invalid files. try to run pyflyte package again [flyte-package.tgz]
Running `Register workflows` with `flytectl` directly is even worse. It fails with a bunch of errors like:
Failed to unmarshal file /tmp/register789499772/00_flyte.workflows.workflow_name.pb
but it fails silently and still reports that the resources were registered successfully. A workflow IS indeed registered on the Flyte server, but it is broken and cannot be run. Packaging and registering both work when I run them locally. Please advise on how to proceed with debugging this.
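One hedged way to narrow this down, assuming the archive name from the workflow above (project, domain, and version values are placeholders): unpack the package CI produced and confirm it actually contains the serialized protobufs, then try a registration dry run with flytectl's `--dryRun` flag, which validates without modifying the server. The unmarshal errors suggest the .pb files and flytectl disagree on the proto format, so a flytekit/flytectl version mismatch is also worth ruling out.
# List the package contents; a valid package holds serialized 00_*.pb files
tar -tzf flyte-package.tgz
# Validate registration without writing anything to the server
flytectl register files --archive flyte-package.tgz \
    --project my-project --domain development \
    --version v1 --dryRun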
Katrina P

about 3 years ago
Any documentation we can take a look at to diagnose the following error the propeller is logging on restart:
"msg":"failed to load plugin - spark: no matches for kind \"SparkApplication\" in version \"<http://sparkoperator.k8s.io/v1beta2|sparkoperator.k8s.io/v1beta2>\"
My propeller config map looks like this:
# Source: flyte-core/templates/propeller/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: flyte
  labels: 
    app.kubernetes.io/name: flyteadmin
data:
  admin.yaml: | 
    admin:
      clientId: 'flytepropeller'
      clientSecretLocation: /etc/secrets/client_secret
      endpoint: flyteadmin:81
      insecure: true
    event:
      capacity: 1000
      rate: 500
      type: admin
  catalog.yaml: | 
    catalog-cache:
      endpoint: datacatalog:89
      insecure: true
      type: datacatalog
  copilot.yaml: | 
    plugins:
      k8s:
        co-pilot:
          image: cr.flyte.org/flyteorg/flytecopilot:v0.0.24
          name: flyte-copilot-
          start-timeout: 30s
  core.yaml: | 
    manager:
      pod-application: flytepropeller
      pod-template-container-name: flytepropeller
      pod-template-name: flytepropeller-template
    propeller:
      downstream-eval-duration: 30s
      enable-admin-launcher: true
      gc-interval: 12h
      kube-client-config:
        burst: 25
        qps: 100
        timeout: 30s
      leader-election:
        enabled: true
        lease-duration: 15s
        lock-config-map:
          name: propeller-leader
          namespace: flyte
        renew-deadline: 10s
        retry-period: 2s
      limit-namespace: all
      max-workflow-retries: 50
      metadata-prefix: metadata/propeller
      metrics-prefix: flyte
      prof-port: 10254
      queue:
        batch-size: -1
        batching-interval: 2s
        queue:
          base-delay: 5s
          capacity: 1000
          max-delay: 120s
          rate: 100
          type: maxof
        sub-queue:
          capacity: 1000
          rate: 100
          type: bucket
        type: batch
      rawoutput-prefix: s3://${ parameters.s3_bucket_name }/
      workers: 40
      workflow-reeval-duration: 30s
    webhook:
      certDir: /etc/webhook/certs
      serviceName: flyte-pod-webhook
  enabled_plugins.yaml: | 
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
          spark: spark
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - spark
  k8s.yaml: | 
    plugins:
      k8s:
        default-cpus: 100m
        default-env-vars: []
        default-memory: 100Mi
  resource_manager.yaml: | 
    propeller:
      resourcemanager:
        type: noop
  storage.yaml: | 
    storage:
      type: s3
      container: "${ parameters.s3_bucket_name }"
      connection:
        auth-type: iam
        region: ${ parameters.aws_region }
      limits:
        maxDownloadMBs: 10
  cache.yaml: |
    cache:
      max_size_mbs: 1024
      target_gc_percent: 70
  task_logs.yaml: | 
    plugins:
      logs:
        cloudwatch-enabled: false
        kubernetes-enabled: false
  spark.yaml: |
    plugins:
      spark:
        spark-config-default:
          - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
          - spark.kubernetes.allocation.batch.size: "50"
          - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
          - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
          - spark.blacklist.enabled: "true"
          - spark.blacklist.timeout: "5m"
          - spark.task.maxfailures: "8"
And my cluster resource template
# Source: flyte-core/templates/clusterresourcesync/cluster_resource_configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clusterresource-template
  namespace: flyte
  labels: 
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v0.1.10
data:
  aa_namespace.yaml: | 
    apiVersion: v1
    kind: Namespace
    metadata:
      name: {{ namespace }}
    spec:
      finalizers:
      - kubernetes
    
  aab_default_service_account.yaml: | 
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: default
      namespace: {{ namespace }}
    
  ab_project_resource_quota.yaml: | 
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: project-quota
      namespace: {{ namespace }}
    spec:
      hard:
        limits.cpu: {{ projectQuotaCpu }}
        limits.memory: {{ projectQuotaMemory }}
    
  ac_spark_role.yaml: | 
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-role
      namespace: {{ namespace }}
    rules:
    - apiGroups: ["*"]
      resources:
      - pods
      verbs:
      - '*'
    - apiGroups: ["*"]
      resources:
      - services
      verbs:
      - '*'
    - apiGroups: ["*"]
      resources:
      - configmaps
      verbs:
      - '*'
    
  ad_spark_service_account.yaml: | 
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: {{ namespace }}
    
  ae_spark_role_binding.yaml: | 
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-role-binding
      namespace: {{ namespace }}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: spark-role
    subjects:
    - kind: ServiceAccount
      name: spark
      namespace: {{ namespace }}
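For the record, the `no matches for kind "SparkApplication"` error is the API server saying the SparkApplication CRD doesn't exist in the cluster, which points at a missing Spark operator rather than anything in these ConfigMaps. A sketch of installing it, assuming the spark-on-k8s-operator Helm chart (release name and namespace here are assumptions, adjust as needed):
# Install the Spark operator, which registers the SparkApplication CRD
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator \
    --namespace spark-operator --create-namespace
# Confirm the CRD the propeller plugin needs is now present
kubectl get crd sparkapplications.sparkoperator.k8s.io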