# announcements
m
Hey folks, we recently upgraded copilot and tasks are failing with:
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[flyte-copilotdownloader] terminated with exit code (1). Reason [Error]. Message: 
  --storage.cache.max_size_mbs int             Maximum size of the cache where the Blob store data is cached in-memory. If not specified or set to 0,  cache is not used
      --storage.cache.target_gc_percent int        Sets the garbage collection target percentage.
I took a look at the config in the docs, and it looks like we need to add these. Is there anything else I’m missing, and will these default values work for our deployment?
co-pilot:
  cpu: 500m
  default-input-path: /var/flyte/inputs
  default-output-path: /var/flyte/outputs
  image: cr.flyte.org/flyteorg/flytecopilot:v0.0.15
  input-vol-name: flyte-inputs
  memory: 128Mi
  name: flyte-copilot-
  output-vol-name: flyte-outputs
  start-timeout: 1m40s
  storage: ""
k
cc @Yuvraj?
y
@User can you paste all the logs of the container and the spec?
m
Init Containers:
  flyte-copilotdownloader:
    Container ID:  docker://3adf25af273b66204973e5e05bc2d95f546de53809e9b4e2ade381a91f3a3005
    Image:         library.pdx.l5.woven-planet.tech/application/flyteplugins/flytecopilot:v0.0.1
    Image ID:      docker-pullable://library.pdx.l5.woven-planet.tech/application/flyteplugins/flytecopilot@sha256:5f74032b747e38079cee222a345e8a8d553702fa3702c7c66370024ef2e4288a
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/flyte-copilot
      --storage.limits.maxDownloadMBs=0
      --storage.type=s3
      --storage.enable-multicontainer=false
      --storage.container=lyft-av-prod-pdx-flyte
      --storage.connection.secret-key=
      --storage.connection.access-key=
      --storage.connection.auth-type=iam
      --storage.connection.region=us-west-2
      --storage.connection.endpoint=
    Args:
I don’t believe there are any logs, since it’s failing at the entrypoint.
Is there a way to use OIDC auth through the S3 SDK instead of explicitly passing in the secret key, access key, etc.?
y
Are these secrets empty, or did you just mask them?
m
They’re empty, which is causing the failure. I’m just wondering why we need to pass them in explicitly now.
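A quick way to list which flags in a container command carry an empty value is sketched below. It is only an illustration: the `empty_flags` helper is hypothetical, and the command list is copied from the `flyte-copilotdownloader` spec pasted above. It reports which values are empty, not whether copilot actually requires them.

```python
def empty_flags(command):
    """Return the names of --flag=value arguments whose value is empty."""
    empty = []
    for arg in command:
        # Only consider long flags written in the --name=value form.
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            if value == "":
                empty.append(name)
    return empty


# Command copied from the init container spec in this thread.
command = [
    "/bin/flyte-copilot",
    "--storage.limits.maxDownloadMBs=0",
    "--storage.type=s3",
    "--storage.enable-multicontainer=false",
    "--storage.container=lyft-av-prod-pdx-flyte",
    "--storage.connection.secret-key=",
    "--storage.connection.access-key=",
    "--storage.connection.auth-type=iam",
    "--storage.connection.region=us-west-2",
    "--storage.connection.endpoint=",
]

for name in empty_flags(command):
    print(name)
```

For this spec it flags the secret key, the access key, and the endpoint.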
y
That’s the issue. I just tested the container task on sandbox and AWS, and it is working fine for me.
Please update the copilot image in your config. Your spec has v0.0.15, but the latest is v0.0.24.
m
OK, I’ll update. What I linked to above is actually from the Flyte documentation, so it might be worth updating that too.
y
@Miggy Good point, they are hard-coded values. Ideally, we suggest people use the version that is tagged with the Flyte release: https://github.com/flyteorg/flyte/pkgs/container/flytecopilot-release
k
I think we should update the default in the release
y
Yes, we update the version in the release, but these docs are old.
m
Hey, so we’re still having a couple of issues: the copilot sidecar is now failing to get credentials. We are using OIDC, and the credentials are in the container and are correct (Alex grabbed the credentials from the container and used the AWS CLI to verify). However, the container logs show:
[avcloud-prod-pdx] ➜  flytepropeller git:(master) kubectl -n prod logs -f l34ofjb147-fh3qf1ma-0 -c flyte-copilotdownloader
time="2022-04-06T01:33:39Z" level=info msg="[0] Couldn't find a config file []. Relying on env vars and pflags."
{"json":{},"level":"error","msg":"Failed to Get credentials.","ts":"2022-04-06T01:33:44Z"}
{"json":{},"level":"error","msg":"Failed to Get credentials.","ts":"2022-04-06T01:33:49Z"}
{"json":{},"level":"error","msg":"Failed to Get credentials.","ts":"2022-04-06T01:33:54Z"}
{"json":{},"level":"error","msg":"Failed to Get credentials.","ts":"2022-04-06T01:33:59Z"}
{"json":{},"level":"error","msg":"Failed to Get credentials.","ts":"2022-04-06T01:34:04Z"}
Any ideas or pointers? cc: @Alex Bain
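One thing that can be checked from inside the pod is whether the web identity settings are visible to the process. The sketch below assumes the OIDC setup is the standard AWS web identity flow, in which the SDKs read `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` from the environment. The thread does not confirm which mechanism is in use, and `check_web_identity_env` is a hypothetical helper, not part of Flyte or copilot.

```python
import os


def check_web_identity_env(env):
    """Return a list of problems with the web identity environment variables."""
    problems = []
    # Assumption: the AWS SDK web identity provider reads these two variables.
    for name in ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    # If a token path is configured, the file must exist in this container.
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if token_file and not os.path.isfile(token_file):
        problems.append(f"token file {token_file} does not exist")
    return problems


# Report on the environment of the current process.
for problem in check_web_identity_env(dict(os.environ)):
    print(problem)
```

An empty result only means the variables and token file are present in that container. It does not show that the copilot binary uses them, so comparing the result between the main container and the copilot sidecar may be more telling than either one alone.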
y
@Miggy what is your pod spec?
m
@Yuvraj this is our config:
k8s:
        co-pilot:
          name: "flyte-copilot"
          image: ${ecr_repository}:flytecopilot-${flytecopilot_version}
          start-timeout: "30s"
        scheduler-name: flyte-scheduler
        inject-finalizer: true
The YAML for the pod itself is:
  - args:
    - sidecar
    - --start-timeout
    - 30s
    - --to-raw-output
    - s3://lyft-av-prod-pdx-flyte/49/sh9aeoqd2s-fiaqf4qy-0
    - --to-output-prefix
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avperceptionworkflows-prod-sh9aeoqd2s/n1/data/0/dynamic-run-metrics-workers-n5/0
    - --from-local-dir
    - /var/flyte/outputs
    - --interface
    - CgASAA==
    command:
    - /bin/flyte-copilot
    - --storage.limits.maxDownloadMBs=0
    - --storage.type=s3
    - --storage.enable-multicontainer=false
    - --storage.container=lyft-av-prod-pdx-flyte
    - --storage.connection.secret-key=
    - --storage.connection.access-key=
    - --storage.connection.auth-type=iam
    - --storage.connection.region=us-west-2
    - --storage.connection.endpoint=
    env:
    - name: L5_DATACENTER
      value: pdx
    - name: L5_BASE_DOMAIN
      value: l5.woven-planet.tech
    - name: L5_ENVIRONMENT
      value: pdx
    - name: RUNTIME_POD_NAME
      valueFrom:
        fieldRef:
s
Hello Miggy, were you able to resolve this issue? I am facing the exact same issue.