# flyte-deployment
n
Hi Team, we recently deployed the flyte-core Helm chart on our Kubernetes cluster. We are using Azure AD for user authentication and the internal auth server for app authentication. We were able to deploy all components successfully except for the `flytescheduler` component. This deployment is failing at the init container step with the following error:
{"json":{},"level":"warning","msg":"failed to get token: %!w(*url.Error=&{Post http://flyteadmin:81/oauth2/token 0xc0002b8280})","ts":"2024-07-02T12:12:04Z"}
Error: rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: failed to get token: Post "http://flyteadmin:81/oauth2/token": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"
The following is the Kubernetes ConfigMap for the scheduler deployment:
apiVersion: v1
data:
  admin.yaml: |
    admin:
      clientId: 'flytepropeller'
      clientSecretLocation: /etc/secrets/client_secret
      endpoint: flyteadmin:81
      insecure: true
    event:
      capacity: 1000
      rate: 500
      type: admin
  db.yaml: |
    database:
      dbname: postgres
      host: postgresql
      passwordPath: /etc/db/pass.txt
      port: 5432
      username: postgres
  server.yaml: |
    scheduler:
      metricsScope: 'flyte:'
      profilerPort: 10254
kind: ConfigMap
metadata:
  name: flyte-scheduler-config
Can someone please help us with this issue?
a
Could you try with `admin.insecure: false`?
n
Hi @average-finland-92144, I see the following error when using this option:
panic: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: first record does not look like a TLS handshake"
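For context on the handshake error above: a TLS client expects the first bytes from the server to be a TLS record header (content type 22 for a handshake, or 21 for an alert, followed by a version whose major byte is 3). The sketch below is a simplified illustration of that check, not the Go TLS library's actual logic; the function name and the sample byte strings are made up for illustration. It shows why a server answering in plaintext fails the check.

```python
TLS_ALERT = 0x15      # TLS record content type 21
TLS_HANDSHAKE = 0x16  # TLS record content type 22


def looks_like_tls_record(first_bytes: bytes) -> bool:
    """Return True if the bytes start with a plausible TLS handshake or alert record.

    A TLS record header is 5 bytes: content type (1), version (2), length (2).
    This is a simplified check for illustration only.
    """
    if len(first_bytes) < 5:
        return False
    content_type = first_bytes[0]
    version_major = first_bytes[1]
    return content_type in (TLS_HANDSHAKE, TLS_ALERT) and version_major == 0x03


# Hypothetical first bytes a client might read from three kinds of servers.
tls_server = b"\x16\x03\x03\x00\x7a"            # TLS handshake record header
plain_http_server = b"HTTP/1.1 400 Bad Request"  # plaintext HTTP/1.1 reply
plain_http2_server = b"\x00\x00\x06\x04\x00\x00\x00\x00\x00"  # plaintext HTTP/2 frame header

for name, data in [
    ("tls", tls_server),
    ("plain http", plain_http_server),
    ("plain http2", plain_http2_server),
]:
    print(name, looks_like_tls_record(data))
```

Only the first sample passes. If flyteadmin serves plaintext (`secure: false` in its config), a client with TLS enabled would see bytes like the second or third sample, which would be consistent with the panic above.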
a
what about the flyte admin configmap? That init container starts an authenticated session to flyteadmin
n
Hi @average-finland-92144, here is the admin configmap configuration
apiVersion: v1
data:
  cluster_resources.yaml: |
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "5"
        - projectQuotaMemory:
            value: 4000Mi
        - defaultIamRole:
            value: <AWS_ROLE_SUBSTITUTED_HERE_>
      - staging:
        - projectQuotaCpu:
            value: "2"
        - projectQuotaMemory:
            value: 3000Mi
        - defaultIamRole:
            value: <AWS_ROLE_SUBSTITUTED_HERE_>
      - development:
        - projectQuotaCpu:
            value: "4"
        - projectQuotaMemory:
            value: 3000Mi
        - defaultIamRole:
            value: <AWS_ROLE_SUBSTITUTED_HERE_>
      refreshInterval: 5m
      standaloneDeployment: false
      templatePath: /etc/flyte/clusterresource/templates
  db.yaml: |
    database:
      dbname: flyteadmin
      host: '<AWS_POSTGRES_HOST_SUBSTITUTED_HERE_>'
      passwordPath: /etc/db/pass.txt
      port: 5432
      username: 'dbadmin'
  domain.yaml: |
    domains:
    - id: development
      name: development
    - id: staging
      name: staging
    - id: production
      name: production
  remoteData.yaml: |
    remoteData:
      region: us-east-1
      scheme: local
      signedUrls:
        durationMinutes: 3
  server.yaml: |
    auth:
      appAuth:
        selfAuthServer:
          staticClients:
            flyte-cli:
              grant_types:
              - refresh_token
              - authorization_code
              id: flyte-cli
              public: true
              redirect_uris:
              - http://localhost:53593/callback
              - http://localhost:12345/callback
              response_types:
              - code
              - token
              scopes:
              - all
              - offline
              - access_token
            flytectl:
              grant_types:
              - refresh_token
              - authorization_code
              id: flytectl
              public: true
              redirect_uris:
              - http://localhost:53593/callback
              - http://localhost:12345/callback
              response_types:
              - code
              - token
              scopes:
              - all
              - offline
              - access_token
            flytepropeller:
              client_secret: ''
              grant_types:
              - refresh_token
              - client_credentials
              id: flytepropeller
              public: false
              redirect_uris:
              - http://localhost:3846/callback
              response_types:
              - token
              scopes:
              - all
              - offline
              - access_token
        thirdPartyConfig:
          flyteClient:
            clientId: flytectl
            redirectUri: http://localhost:53593/callback
            scopes:
            - offline
            - all
      authorizedUris:
      - https://localhost:30081
      - http://flyteadmin:80
      - http://flyteadmin:81
      - http://flyteadmin.flyte.svc.cluster.local:80
      - http://flyteadmin.flyte.svc.cluster.local:81
      - https://flyte.in.cloud.uniphoredev.com
      - https://flyte.in.cloud.uniphoredev.com/console
      userAuth:
        openId:
          baseUrl: 'https://login.microsoftonline.com/<AZ_TENANT_ID_SUBSTITUTED_HERE_>/v2.0'
          clientId: '<AZ_CLIENT_ID_SUBSTITUTED_HERE_>'
          scopes:
          - profile
          - openid
    flyteadmin:
      eventVersion: 2
      metadataStoragePrefix:
      - metadata
      - admin
      metricsScope: 'flyte:'
      profilerPort: 10254
      roleNameKey: iam.amazonaws.com/role
      testing:
        host: http://flyteadmin
    server:
      grpc:
        port: 8089
      httpPort: 8088
      security:
        allowCors: true
        allowedHeaders:
        - Content-Type
        - flyte-authorization
        allowedOrigins:
        - '*'
        secure: false
        useAuth: true
  storage.yaml: |
    storage:
      type: s3
      container: "uniphore-flyte-dev-in"
      connection:
        auth-type: iam
        region: ap-south-1
      enable-multicontainer: false
      limits:
        maxDownloadMBs: 10
  task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 1000m
        memory: 1000Mi
        storage: 1000Mi
      limits:
        cpu: 2
        gpu: 1
        memory: 1Gi
        storage: 2000Mi
kind: ConfigMap
metadata:
  name: flyte-admin-base-config
  namespace: flyte
Hi Team, as David Espejo is not available, can someone please help me with this issue?
a
@narrow-king-98655 I'm back and working to reproduce your issue
n
Thank you @average-finland-92144. Please let me know if you have any updates here.
a
I'm following the steps in this section of the docs. There's a caveat: the indentation of the `configmap.adminServer.server.security` section is wrong. That section should be under `server`, not at the same level. The rest is fine. Completing the steps there leads to no errors on the Pods, but the NGINX Ingress controller I'm using doesn't seem to handle the redirect especially well. The Flyte UI and CLI prompt for authentication, but once the redirect is invoked it throws a 502 error. I'm working on it.
Done. If you're using NGINX, you'll need these two annotations on `common.ingress.annotations`:
nginx.ingress.kubernetes.io/proxy-buffer-size: "256k"
nginx.ingress.kubernetes.io/proxy-buffers: "4"
I just filed an issue to fix that docs page
n
Thank you very much @average-finland-92144! We are no longer seeing this issue.
We are now seeing a different issue when creating workflows from the Flyte console. When describing the flyteworkflows.flyte.lyft.com custom resource, we see the following error:
failedAttempts:4 message:failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/flyteexamples-development-a8n7ndp5xgj4qp67b6qq/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
                       status code: 403, request id: 9e8a3654-cea5-4536-a10a-0a26c3d46f78 phase:0]
I have also added the AWS IAM role ARN to the annotations section of the Kubernetes service accounts. Not sure what we are missing here.
a
So it seems like propeller is failing to write the serialized outputs to the S3 bucket. There may be something wrong with the IRSA setup. I guess you already checked this?
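One common cause of `AccessDenied` on `sts:AssumeRoleWithWebIdentity` with IRSA is an IAM role trust policy whose `sub` condition does not match the service account of the Pod making the call. The following is a minimal sketch of that comparison, under several assumptions: the helper name, the OIDC provider ID, the account ID, and the service account names are placeholders, and the check ignores the `Principal` and the `aud` condition. It is not how AWS evaluates policies, only an illustration of what to compare by eye.

```python
from fnmatch import fnmatchcase


def trust_policy_allows(policy: dict, namespace: str, service_account: str) -> bool:
    """Check whether any Allow statement's `sub` condition matches the service account.

    Simplified on purpose: ignores Principal and the `aud` condition, and uses
    shell-style wildcards to approximate IAM's StringLike operator.
    """
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        patterns = []
        for operator in ("StringEquals", "StringLike"):
            for key, value in stmt.get("Condition", {}).get(operator, {}).items():
                if key.endswith(":sub"):
                    values = [value] if isinstance(value, str) else value
                    patterns.extend((operator, v) for v in values)
        if not patterns:
            return True  # no restriction on the subject in this statement
        for operator, pattern in patterns:
            if operator == "StringEquals" and pattern == subject:
                return True
            if operator == "StringLike" and fnmatchcase(subject, pattern):
                return True
    return False


# Hypothetical trust policy that only trusts the flyteadmin service account.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLE"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:flyte:flyteadmin"
                }
            },
        }
    ],
}

print(trust_policy_allows(example_policy, "flyte", "flyteadmin"))
print(trust_policy_allows(example_policy, "flyte", "flytepropeller"))
```

In this made-up policy, a Pod running as a `flytepropeller` service account would not match the `sub` condition even though the role ARN annotation is present on it, which would produce the kind of 403 shown in the error.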
n
We have done this setup and also have the IAM policy as described in the documentation. Could the absence of the Spark operator lead to this?
There is also another issue: after making the changes you mentioned here, OIDC user authentication is not working. It used to work before I made this change. Can you please let me know why this is happening?
{
  "code": 5,
  "message": "Not Found",
  "details": []
}
a
I think if you want to use the Spark operator, there is additional config needed: https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html
Regarding "The OIDC user authentication is not working": do you have more verbose logs?
n
Is there an option to enable verbose logging? And for which pods do you need the logs?
a
it should be flyteadmin
at least the default logs
n
I don't see any errors in the default logs.
When I try to log in, I see this on the web page:
{
  "code": 5,
  "message": "Not Found",
  "details": []
}
Currently we are using the `flyte-core` Helm chart, version v1.12.0. Is this a stable version? If not, can you suggest a stable one?
a
oh that's stable
could you share your anonymized values file?
n
Yes sure
Please find the helm values file
a
I see `security` at the same level as `server`.
adminServer:
    server:
      grpc:
        port: 8089
      httpPort: 8088
      security:
        allowCors: true
        allowedHeaders:
        - Content-Type
        allowedOrigins:
        - '*'
        secure: false
        useAuth: true
Based on the base values file, it should be under `server`, like in the above snippet.
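A quick way to spot this kind of misplacement is to load the values file into a dictionary and check where the `security` key sits. The sketch below assumes the values have already been parsed (for example with a YAML library) and that the section lives at `configmap.adminServer`, as discussed earlier in the thread. The function name and the sample dictionaries are illustrative only.

```python
def security_location(values: dict) -> str:
    """Report where `security` sits relative to `server` under configmap.adminServer.

    Returns "nested" (under server, as the base values file has it),
    "sibling" (same level as server), or "missing".
    """
    admin_server = (values.get("configmap") or {}).get("adminServer") or {}
    server = admin_server.get("server") or {}
    if "security" in server:
        return "nested"
    if "security" in admin_server:
        return "sibling"
    return "missing"


# Two hand-written samples mirroring the layouts discussed in this thread.
correct = {
    "configmap": {
        "adminServer": {
            "server": {
                "httpPort": 8088,
                "grpc": {"port": 8089},
                "security": {"secure": False, "useAuth": True},
            }
        }
    }
}
misplaced = {
    "configmap": {
        "adminServer": {
            "server": {"httpPort": 8088, "grpc": {"port": 8089}},
            "security": {"secure": False, "useAuth": True},
        }
    }
}

print(security_location(correct))
print(security_location(misplaced))
```

The first sample reports `nested` and the second reports `sibling`, which is the layout that needs fixing.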
n
Ahh! My bad. Will check this and confirm once.
Hi @average-finland-92144, we were able to resolve most of our ongoing issues, except for the flytescheduler init container issue, with this error log:
panic: rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: failed to get token: Post "http://flyteadmin:81/oauth2/token": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"
Initially we didn't see this error after fixing the indentation in the values.yaml file, but later it started appearing again in the init containers. Can you please let us know what the issue could be?
a
From the error message, it looks like the scheduler is trying to reach flyteadmin over HTTP. Can you try with `adminserver.security.secure: true`?
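The byte string quoted in the "malformed HTTP response" error can be decoded to see what the server actually sent. The sketch below parses it as an HTTP/2 frame (a 9-byte header followed by the payload, per the HTTP/2 specification). The function name and the lookup tables are written for this example, and the tables list only a few of the defined codes.

```python
import struct

# Raw bytes quoted in the scheduler's "malformed HTTP response" error.
raw = b"\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"

# A few HTTP/2 frame types and SETTINGS identifiers (not an exhaustive list).
FRAME_TYPES = {0x0: "DATA", 0x1: "HEADERS", 0x4: "SETTINGS", 0x6: "PING", 0x7: "GOAWAY"}
SETTINGS_IDS = {
    0x1: "HEADER_TABLE_SIZE",
    0x2: "ENABLE_PUSH",
    0x3: "MAX_CONCURRENT_STREAMS",
    0x4: "INITIAL_WINDOW_SIZE",
    0x5: "MAX_FRAME_SIZE",
    0x6: "MAX_HEADER_LIST_SIZE",
}


def parse_http2_frame(data: bytes) -> dict:
    """Parse one HTTP/2 frame: 24-bit length, type, flags, 31-bit stream ID, payload."""
    if len(data) < 9:
        raise ValueError("an HTTP/2 frame header is 9 bytes")
    length = int.from_bytes(data[0:3], "big")
    frame_type = data[3]
    flags = data[4]
    stream_id = int.from_bytes(data[5:9], "big") & 0x7FFFFFFF  # top bit is reserved
    payload = data[9:9 + length]
    settings = {}
    if frame_type == 0x4:
        # A SETTINGS payload is a sequence of 6-byte entries: 16-bit ID, 32-bit value.
        for offset in range(0, len(payload) - len(payload) % 6, 6):
            ident, value = struct.unpack(">HI", payload[offset:offset + 6])
            settings[SETTINGS_IDS.get(ident, hex(ident))] = value
    return {
        "length": length,
        "type": FRAME_TYPES.get(frame_type, hex(frame_type)),
        "flags": flags,
        "stream_id": stream_id,
        "settings": settings,
    }


frame = parse_http2_frame(raw)
print(frame)
```

The bytes decode to a SETTINGS frame on stream 0 carrying `MAX_FRAME_SIZE = 16384`. A SETTINGS frame is the first thing an HTTP/2 server sends on a new connection, so the HTTP/1.x token request appears to have reached a listener that speaks HTTP/2. That would be consistent with port 81 being flyteadmin's gRPC port, though that mapping is an assumption about this deployment and worth confirming against the flyteadmin Service definition.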