# ask-the-community
t
Hey all, I'm having a little trouble running a workflow in the Flyte sandbox on my local machine - in particular, the workflow that I'm attempting to run is failing to pull the image that I've built within the sandbox. Here you can see the containers that I have running on my host:
$ docker ps
>>>
CONTAINER ID   IMAGE                                                                               COMMAND                  CREATED             STATUS             PORTS                                                                                                                 NAMES
dbf8f5dcb150   cr.flyte.org/flyteorg/flyte-sandbox:dind-bfa1dd4e6057b6fc16272579d61df7b1832b96a7   "tini flyte-entrypoi…"   About an hour ago   Up About an hour   0.0.0.0:30081-30082->30081-30082/tcp, 0.0.0.0:30084->30084/tcp, 2375-2376/tcp, 0.0.0.0:30086-30088->30086-30088/tcp   flyte-sandbox
From this we can list the images that exist inside the dbf8f5dcb150 container:
$ docker exec -it dbf8f5dcb150 docker image ls
>>>
REPOSITORY                                     TAG                       IMAGE ID       CREATED          SIZE
papermill-exploration                          latest                    3c40c6deb126   23 minutes ago   948MB
...
I can see my project in there under the tag papermill-exploration:latest. I then serialize and submit my workflow spec as follows:
pyflyte --pkgs workflows package -f --image "papermill-exploration:latest"
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version v2
All of which works:
$ flytectl get workflows --project flytesnacks --domain development  
>>>        
 --------- ------------------------------------ ----------------------------- 
| VERSION | NAME                               | CREATED AT                  |
 --------- ------------------------------------ ----------------------------- 
| v2      | workflows.workflow.nb_to_python_wf | 2022-12-12T12:41:53.987960Z |
 --------- ------------------------------------ ----------------------------- 
| v1      | workflows.workflow.nb_to_python_wf | 2022-12-12T12:33:08.295661Z |
 --------- ------------------------------------ ----------------------------- 
2 rows
I then attempt to invoke the workflow, but the resulting pod cannot pull the image:
$ flytectl get execution --project flytesnacks --domain development azlfqvzfsbz4lr8pbmlt
>>>
 ---------------------- ------------------------------------ ------------- -------- ---------------- -------------------------------- --------------- -------------------- --------------------------------------------------------- 
| NAME                 | LAUNCH PLAN NAME                   | TYPE        | PHASE  | SCHEDULED TIME | STARTED                        | ELAPSED TIME  | ABORT DATA (TRUNC) | ERROR DATA (TRUNC)                                      |
 ---------------------- ------------------------------------ ------------- -------- ---------------- -------------------------------- --------------- -------------------- --------------------------------------------------------- 
| azlfqvzfsbz4lr8pbmlt | workflows.workflow.nb_to_python_wf | LAUNCH_PLAN | FAILED |                | 2022-12-12T13:07:23.548693519Z | 23.161600293s |                    | [1/1] currentAttempt done. Last Error: USER::containers |
|                      |                                    |             |        |                |                                |               |                    | with unready status: [azlfqvzfsbz4lr8pbmlt-n            |
 ---------------------- ------------------------------------ ------------- -------- ---------------- -------------------------------- --------------- -------------------- --------------------------------------------------------- 
1 rows

$ docker exec -it dbf8f5dcb150 kubectl -n flytesnacks-development describe pod azlfqvzfsbz4lr8pbmlt-n0-0
>>>
...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  27m                    default-scheduler  Successfully assigned flytesnacks-development/azlfqvzfsbz4lr8pbmlt-n0-0 to dbf8f5dcb150
  Normal   Pulling    25m (x4 over 27m)      kubelet            Pulling image "papermill-exploration:latest"
  Warning  Failed     25m (x4 over 27m)      kubelet            Failed to pull image "papermill-exploration:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for papermill-exploration, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning  Failed     25m (x4 over 27m)      kubelet            Error: ErrImagePull
  Warning  Failed     25m (x6 over 27m)      kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m22s (x106 over 27m)  kubelet            Back-off pulling image "papermill-exploration:latest"
Have I missed something here? Are the pods not authenticated against the docker repo? Or am I not specifying my images correctly?
s
Have you built the docker image using the flytectl sandbox exec -- docker build . --tag "papermill-exploration:latest" command?
t
Thanks for the response! Yeah, I did - that's why you can see the papermill-exploration:latest image inside the container running k3s. I think that I've found the issue - this StackOverflow post suggests that, when using the --docker flag with k3s, the host docker is used rather than containerd (you can see this argument in the entrypoint script). In that instance, the image will be found if the imagePullPolicy is set to IfNotPresent. However, the default pods in the sandbox are run using the Always policy. I've fixed this by configuring a default pod template as per the docs, applying that and then restarting the propeller component, which then uses the correct policy and fixes the problem:
$ docker exec -i $K3S_CONTAINER_ID kubectl -n flytesnacks-development get pod ap4xl5hwmwmgnkwm4spz-n0-0 -o yaml | yq '.spec.containers[0].imagePullPolicy'
>>>
IfNotPresent
where the template looks as follows:
apiVersion: v1
kind: PodTemplate
metadata:
  name: default-pod-template
template:
  spec:
    containers:
    - name: default
      image: "overwrite-me"
      imagePullPolicy: IfNotPresent
Whilst I'm happy to have fixed it (thanks for the great documentation!), it seems strange to me that this doesn't work out of the box. Can I raise the issue somewhere to see if this can be fixed? Happy to contribute a PR as well, if that's an option.
s
Of course! Please feel free to file an issue. This used to work but not sure what changed. @Eduardo Apolinario (eapolinario), can Tom create a PR to fix this issue?
y
A teammate and I were looking through the code when he remembered: we don't set it. This is controlled by k8s, and the default k8s behavior is that if your image tag is "latest", it sets the pull policy to Always.
If you just tag it with anything else, it should work (without the pod template).
t
Ah, good catch! Will give this a go instead. Thanks, both!
i
We have a similar problem. I haven't built an image, we just want to try out the "Getting started" example. I guess the problem is caused by our proxy. Is there an option to pass the HTTP_PROXY, NO_PROXY env variables?
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m40s                  default-scheduler  Successfully assigned flytesnacks-development/fca72778d3e6f4c5c8ab-n0-0 to bc1ca7e70b3c
  Normal   Pulling    3m21s (x4 over 4m40s)  kubelet            Pulling image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0"
  Warning  Failed     3m21s (x4 over 4m40s)  kubelet            Failed to pull image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0": failed to resolve reference "ghcr.io/flyteorg/flytekit:py3.9-1.3.0": failed to do request: Head "https://ghcr.io/v2/flyteorg/flytekit/manifests/py3.9-1.3.0": dial tcp: lookup ghcr.io on ...:53: no such host
  Warning  Failed     3m21s (x4 over 4m40s)  kubelet            Error: ErrImagePull
  Warning  Failed     2m56s (x6 over 4m39s)  kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m43s (x7 over 4m39s)  kubelet            Back-off pulling image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0"
s
When you are spinning up a demo cluster, you can send env variables:
flytectl demo start --env HTTP_PROXY=...
Let me know if this works!
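As a rough sketch of the suggestion above, the --env flags could be assembled from whichever proxy variables are set on the host. The helper name proxy_env_flags is made up for this example, and it assumes that --env may be given more than once; only a single --env HTTP_PROXY=... appears in this thread, so check flytectl demo start --help before relying on it.

```shell
# Build "--env NAME=value" flags for each proxy variable that is set on the host.
proxy_env_flags() {
  flags=""
  for name in HTTP_PROXY NO_PROXY; do
    # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      flags="$flags --env $name=$value"
    fi
  done
  # Drop the leading space left by the first append.
  echo "${flags# }"
}

# Print the command instead of running it, so it can be reviewed first.
echo "flytectl demo start $(proxy_env_flags)"
```

Afterwards, docker exec flyte-sandbox env shows whether the variables reached the sandbox container.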
i
Thank you, Samhita. Unfortunately, it doesn't solve my problem. The env variables are passed to the docker container, as shown by
docker exec flyte-sandbox env
But now I'm getting a different error message, which shows that the certificate of ghcr.io is not accepted.
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  18m                   default-scheduler  Successfully assigned flytesnacks-development/axpsj4sv5xsj6z2krc57-n0-0 to d3482e31a88c
  Normal   Pulling    16m (x4 over 18m)     kubelet            Pulling image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0"
  Warning  Failed     16m (x4 over 18m)     kubelet            Failed to pull image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0": failed to resolve reference "ghcr.io/flyteorg/flytekit:py3.9-1.3.0": failed to do request: Head "https://ghcr.io/v2/flyteorg/flytekit/manifests/py3.9-1.3.0": x509: certificate signed by unknown authority
  Warning  Failed     16m (x4 over 18m)     kubelet            Error: ErrImagePull
  Warning  Failed     16m (x6 over 17m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    2m59s (x65 over 17m)  kubelet            Back-off pulling image "ghcr.io/flyteorg/flytekit:py3.9-1.3.0"
s
i
The container doesn't include docker or podman, so I'm using crictl instead. I have no problems pulling images from docker.io, but ghcr.io throws the error message shown above. On the host system, I can use podman or crictl to pull images from ghcr.io. Most of the utilities in the SO article are not available in the container.
s
@Yee / @Eduardo Apolinario (eapolinario), any idea how to resolve this issue?
e
@Ingo Kemmerzell, just to confirm, how are you starting the sandbox? Can you share the exact command you used? Setting env vars via --env should be enough for k3s (really containerd) to pick them up.
i
The --env options were successfully passed to k3s which caused a different error message "x509 .. unknown authority" instead of "dial tcp: lookup ghcr.io on ...53 no such host". I could pull images with crictl in the container from docker.io, but received the x509 error message when trying to connect to ghcr.io. In the meantime, I've submitted an internal ticket to our proxy team to check the proxy and IPS settings. I guess they have changed something, because now everything seems to work fine. I can submit the workflow examples without getting error messages. @Samhita Alla, @Eduardo Apolinario (eapolinario), thank you very much for your support, much appreciated.
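The three image pull failures seen in this thread can be told apart by the text of the kubelet event. Here is a minimal sketch that maps each message to the cause that turned out to matter here; the function name classify_pull_error and the wording of the hints are made up for the example, and the causes are the likely ones from this discussion rather than a complete diagnosis.

```shell
# Map a kubelet "Failed to pull image" message to the cause discussed in this thread.
classify_pull_error() {
  case "$1" in
    *"pull access denied"*)
      echo "image is not in a remote registry; check the tag and imagePullPolicy" ;;
    *"no such host"*)
      echo "registry hostname did not resolve; check DNS and proxy env variables" ;;
    *"x509: certificate signed by unknown authority"*)
      echo "registry certificate is not trusted; check the proxy or CA setup" ;;
    *)
      echo "unknown; read the full kubectl describe pod output" ;;
  esac
}

classify_pull_error "dial tcp: lookup ghcr.io on ...:53: no such host"
```

The message can be copied from the Events section of kubectl describe pod, as in the outputs above.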