
Tom Stokes

almost 3 years ago
Hey all, I'm having a little trouble running a workflow in the Flyte sandbox on my local machine - in particular, the workflow that I'm attempting to run is failing to pull the image that I've built within the sandbox. Here you can see the containers that I have running on my host:
$ docker ps
>>>
CONTAINER ID   IMAGE                                                                               COMMAND                  CREATED             STATUS             PORTS                                                                                                                 NAMES

dbf8f5dcb150   cr.flyte.org/flyteorg/flyte-sandbox:dind-bfa1dd4e6057b6fc16272579d61df7b1832b96a7   "tini flyte-entrypoi…"   About an hour ago   Up About an hour   0.0.0.0:30081-30082->30081-30082/tcp, 0.0.0.0:30084->30084/tcp, 2375-2376/tcp, 0.0.0.0:30086-30088->30086-30088/tcp   flyte-sandbox
From which we can then find the images that exist inside the `dbf8f5dcb150` container:
$ docker exec -it dbf8f5dcb150 docker image ls
>>>
REPOSITORY                                     TAG                       IMAGE ID       CREATED          SIZE
papermill-exploration                          latest                    3c40c6deb126   23 minutes ago   948MB
...
I can see my project in there under the tag `papermill-exploration:latest`. I then serialize and submit my workflow spec as follows:
pyflyte --pkgs workflows package -f --image "papermill-exploration:latest"
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version v2
All of which works:
$ flytectl get workflows --project flytesnacks --domain development  
>>>        
 --------- ------------------------------------ ----------------------------- 
| VERSION | NAME                               | CREATED AT                  |
 --------- ------------------------------------ ----------------------------- 
| v2      | workflows.workflow.nb_to_python_wf | 2022-12-12T12:41:53.987960Z |
 --------- ------------------------------------ ----------------------------- 
| v1      | workflows.workflow.nb_to_python_wf | 2022-12-12T12:33:08.295661Z |
 --------- ------------------------------------ ----------------------------- 
2 rows
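(For context, here is a minimal sketch of what the packaged module might look like. The file layout and task body are hypothetical; the real workflow is `workflows.workflow.nb_to_python_wf` and presumably papermill-based. The relevant point is that the tasks don't name an image themselves: the `--image` passed to `pyflyte package` is what gets recorded as each task's container image.)
# workflows/workflow.py (hypothetical sketch)
from flytekit import task, workflow

@task
def nb_to_python() -> str:
    # stand-in for the real notebook-conversion task
    return "converted"

@workflow
def nb_to_python_wf() -> str:
    return nb_to_python()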
I then attempt to invoke the workflow, but the resulting pod cannot pull the image:
$ flytectl get execution --project flytesnacks --domain development azlfqvzfsbz4lr8pbmlt
>>>
 ---------------------- ------------------------------------ ------------- -------- ---------------- -------------------------------- --------------- -------------------- --------------------------------------------------------- 
| NAME                 | LAUNCH PLAN NAME                   | TYPE        | PHASE  | SCHEDULED TIME | STARTED                        | ELAPSED TIME  | ABORT DATA (TRUNC) | ERROR DATA (TRUNC)                                      |
 ---------------------- ------------------------------------ ------------- -------- ---------------- -------------------------------- --------------- -------------------- --------------------------------------------------------- 
| azlfqvzfsbz4lr8pbmlt | workflows.workflow.nb_to_python_wf | LAUNCH_PLAN | FAILED |                | 2022-12-12T13:07:23.548693519Z | 23.161600293s |                    | [1/1] currentAttempt done. Last Error: USER::containers |
|                      |                                    |             |        |                |                                |               |                    | with unready status: [azlfqvzfsbz4lr8pbmlt-n            |
 ---------------------- ------------------------------------ ------------- -------- ---------------- -------------------------------- --------------- -------------------- --------------------------------------------------------- 
1 rows
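(For completeness, the untruncated error can also be pulled as node-level details; assuming a reasonably recent flytectl, `--details` and `-o yaml` are supported on `get execution`:)
flytectl get execution --project flytesnacks --domain development azlfqvzfsbz4lr8pbmlt --details -o yaml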

$ docker exec -it dbf8f5dcb150 kubectl -n flytesnacks-development describe pod azlfqvzfsbz4lr8pbmlt-n0-0
>>>
...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  27m                    default-scheduler  Successfully assigned flytesnacks-development/azlfqvzfsbz4lr8pbmlt-n0-0 to dbf8f5dcb150
  Normal   Pulling    25m (x4 over 27m)      kubelet            Pulling image "papermill-exploration:latest"
  Warning  Failed     25m (x4 over 27m)      kubelet            Failed to pull image "papermill-exploration:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for papermill-exploration, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning  Failed     25m (x4 over 27m)      kubelet            Error: ErrImagePull
  Warning  Failed     25m (x6 over 27m)      kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m22s (x106 over 27m)  kubelet            Back-off pulling image "papermill-exploration:latest"
Have I missed something here? Are the pods not authenticated against the docker repo? Or am I not specifying my images correctly?
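(One Kubernetes detail that can produce exactly this symptom, assuming the sandbox kubelet really is backed by the same Docker daemon the image was built into: a bare `:latest` tag defaults the container's imagePullPolicy to `Always`, so the kubelet always attempts a remote pull, which fails here because `papermill-exploration` doesn't exist in any registry, even though the image is present locally. Any other tag defaults to `IfNotPresent`. A sketch, with `v1` and `v3` as arbitrary tag and registration version:)
docker exec -it dbf8f5dcb150 docker tag papermill-exploration:latest papermill-exploration:v1
pyflyte --pkgs workflows package -f --image "papermill-exploration:v1"
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version v3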

Anthony

about 3 years ago
Hi all 👋 I have a problem with sharing data between tasks. I found a similar issue here in the discussions (link):
Workflow[flyte-anti-fraud-ml:development:app.workflow.main_flow] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: error file @[s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] is too large [28775519] bytes, max allowed [10485760] bytes
I added the `max-output-size-bytes` param to `flyte-propeller-config` and waited for the changes to apply before re-submitting a new execution:
kubectl edit configmap -n flyte flyte-propeller-config
The propeller section of my `flyte-propeller-config` now looks like:
core.yaml: |
    manager:
      pod-application: flytepropeller
      pod-template-container-name: flytepropeller
      pod-template-name: flytepropeller-template
    propeller:
      max-output-size-bytes: 52428800
      downstream-eval-duration: 30s
      enable-admin-launcher: true
      leader-election:
        enabled: true
        lease-duration: 15s
        lock-config-map:
          name: propeller-leader
          namespace: flyte
        renew-deadline: 10s
        retry-period: 2s
      limit-namespace: all
      max-workflow-retries: 3
      metadata-prefix: metadata/propeller
      metrics-prefix: flyte
      prof-port: 10254
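(One thing worth double-checking, assuming the stock sandbox/Helm deployment: the running propeller pod does not necessarily pick up an edited ConfigMap until it is restarted, so the new value may simply not be live yet. The deployment name below is an assumption; `kubectl -n flyte get deploy` shows the real one.)
# confirm the edit actually landed in the ConfigMap
kubectl -n flyte get cm flyte-propeller-config -o yaml | grep max-output-size-bytes
# restart propeller so the process re-reads its mounted config, then watch it come back
kubectl -n flyte rollout restart deployment/flytepropeller
kubectl -n flyte get pods -w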
Storage and task configuration has been set up via:
kubectl -n flyte edit cm flyte-admin-base-config
storage.yaml: |
    storage:
      type: minio
      container: "my-s3-bucket"
      stow:
        kind: s3
        config:
          access_key_id: minio
          auth_type: accesskey
          secret_key: miniostorage
          disable_ssl: true
          endpoint: http://minio.flyte.svc.cluster.local:9000
          region: us-east-1
      signedUrl:
        stowConfigOverride:
          endpoint: http://localhost:30084
      enable-multicontainer: false
      limits:
        maxDownloadMBs: 50
  task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 1
        memory: 3000Mi
        storage: 100Mi
      limits:
        cpu: 4
        gpu: 1
        memory: 3Gi
        storage: 500Mi
Changing `maxDownloadMBs` also didn't change the situation. Changing the cache `max_size_mbs` in `flyte-propeller-config` from 0 to a custom value didn't work either:
cache.yaml: |
    cache:
      max_size_mbs: 100
      target_gc_percent: 70
I tried several times with different params, but the error kept coming up on each new execution. I can see that neither `max-output-size-bytes` nor `max-workflow-retries` (changed from 30 to 3) is being picked up by the workflow execution:
RuntimeExecutionError: max number of system retry attempts [31/30] exhausted...

error file @[s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] is too large [28775519] bytes, max allowed [10485760] bytes...
Here are my CLI steps to create a new execution:
- kubectl -n flyte edit cm flyte-admin-base-config
- kubectl edit configmap -n flyte flyte-propeller-config
- flytectl get task-resource-attribute -p flyteexamples -d development
- flytectl update project -p flyte-anti-fraud-ml -d development --storage.cache.max_size_mbs 100
- flytectl get launchplan --project flyte-anti-fraud-ml --domain development app.workflow.main_flow --latest --execFile exec_spec.yaml
- flytectl create execution --project flyte-anti-fraud-ml --domain development --execFile exec_spec.yaml
What additional steps do I have to take to force my propeller changes to be used, and to solve the problem of the maximum 10 MB size allowed for serialized uploads to Flyte?
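(Separately from the propeller limit, and in case the oversized payload ultimately comes from the data the tasks exchange: the usual Flyte pattern for large artifacts is to return a `FlyteFile` (or `FlyteDirectory` / a structured dataset) rather than a big in-memory value, so the data itself goes through blob storage and only a small reference ends up in the task's outputs. A minimal sketch; the file name and task bodies are hypothetical:)
from flytekit import task, workflow
from flytekit.types.file import FlyteFile

@task
def make_features() -> FlyteFile:
    path = "/tmp/features.parquet"
    with open(path, "wb") as f:
        f.write(b"...large artifact bytes...")  # stand-in for the real payload
    return FlyteFile(path)  # flytekit uploads the file; the output literal stays small

@task
def train(features: FlyteFile) -> str:
    local_path = features.download()  # materialized only where it is needed
    return local_path

@workflow
def main_flow() -> str:
    return train(features=make_features())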

Sam Eckert

over 2 years ago
Hi all, I'm trying to get my implementation of a Bazel/Flyte integration off the ground and I'm running into an issue on the last bit which is stumping me. I created a Bazel macro called `flyte_library` (taking cues from @Rahul Mehta’s talk!). On run, that rule:
1. Creates a py3_image with the workflow file, as well as pulling in the `aws`, `flytectl`, and `pyflyte-*` CLIs. I wanted to keep things hermetic within the Bazel env, so I didn't create a base image with `awscli`/`pyflyte` pre-installed.
2. Adds the `FLYTE_INTERNAL_IMAGE` tag and pushes the image to ECR. I'm still not 100% sure what `FLYTE_INTERNAL_IMAGE` does, but I followed the examples I could find.
3. Runs a genrule that does a `docker run` with the image we just created and calls a custom register script, which wraps `pyflyte register` to register the workflow and uses `flytectl` to enable/optionally execute any launch plans registered alongside the workflow.
Registration works correctly as far as I can tell. The objects are created and viewable in the Console, but all tasks fail with:
[1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution
I can see the pod starting, pulling the correct image, and the `pyflyte-fast-execute` command exiting successfully via `kubectl`. No logs are created before the script exits, so I'm having a bit of trouble identifying the issue. Weirder still, the exact same `pyflyte-fast-execute` command runs fine if I run it in a Docker container locally.
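(Two notes that might help narrow this down, offered as assumptions rather than a diagnosis. As far as I understand, `FLYTE_INTERNAL_IMAGE` is just the environment variable flytekit falls back to for the default container image when serializing/registering from inside the image itself, so it should not change runtime behavior. And "Outputs not generated by task execution" generally means propeller found no outputs file at the output prefix it handed the task, so comparing the args the pod actually ran with what lands in the blob store is a quick check. Namespace, pod name, and bucket below are placeholders:)
kubectl -n <project>-<domain> get pod <execution-id>-n0-0 -o jsonpath='{.spec.containers[0].args}'
aws s3 ls s3://<raw-output-bucket>/<output-prefix>/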