Hi. I've been reading the Flyte docs and have some...
# ask-the-community
b
Hi. I've been reading the Flyte docs and have some questions about packaging and deployment. The documentation centers on how to package and register workflows and tasks. Assume my project includes a large number of Python modules (some kind of library, e.g. ML model code, data processing logic, etc.), and most of the time I'm changing those rather than the workflows or the tasks. Is the assumption that I should isolate my library as a Python package and declare it as a dependency of the Flyte project? Wouldn't that make iteration very cumbersome? I expect the workflows and tasks to be quite stable; what I iterate on is the library code that those tasks depend on.
k
Why do you think it has to be isolated as a python package?
you can use
pyflyte run --copy-all
Or just use imagespec to always run
b
Let's say I want to have a code structure like this:
repo
- requirements.in/.txt
- foo_lib [python modules]
- orchestration
  - workflows [depends on foo_lib]
pyflyte run --remote orchestration/workflows/foo.py foo_workflow --name "Acme" --copy-all
k
let me try and will get back to you
you can also add the files to the docker image too
cc @Kevin Su can you help here?
k
do you have an __init__.py file in the orchestration folder?
if so, pyflyte run --copy-all will copy the entire <repo> to s3, and your task will download it while running
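The copy-then-download flow described above can be pictured with a minimal sketch. This is not flytekit's implementation: `package_source` and `unpack_source` are hypothetical helpers, the `.py`-only filter is an assumption, and the upload to s3 is left out entirely. It only illustrates the idea of bundling the repo into an archive named after its contents, so that a change in library code produces a new bundle without rebuilding an image.

```python
import hashlib
import tarfile
from pathlib import Path


def package_source(repo_root: Path, out_dir: Path) -> Path:
    """Bundle every .py file under repo_root into a content-addressed tar.gz.

    Hypothetical helper for illustration; the real tool decides what to copy.
    """
    files = sorted(p for p in repo_root.rglob("*.py") if p.is_file())
    digest = hashlib.sha256()
    for path in files:
        # Hash both the relative path and the contents so edits and renames
        # each produce a different archive name.
        digest.update(path.relative_to(repo_root).as_posix().encode())
        digest.update(path.read_bytes())
    archive = out_dir / f"source-{digest.hexdigest()[:12]}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=path.relative_to(repo_root).as_posix())
    return archive


def unpack_source(archive: Path, dest: Path) -> None:
    """Recreate the packaged tree under dest (what a task would do at start)."""
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            target = dest / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(member).read())
```

With the layout from this thread, calling `package_source(Path("repo"), out_dir)` would bundle both `foo_lib` and `orchestration/workflows`, so the workflow module can import `foo_lib` after unpacking.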
b
I’ll follow up here once I try the suggestions, if I still have an issue
k
are you using bazel?
b
@Kevin Su I am not using bazel. I made some progress. I am trying local demo cluster first. I started the demo cluster:
flytectl demo start
Everything is up and running. I did:
export FLYTECTL_CONFIG=~/.flyte/config-sandbox.yaml
My config contents:
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: localhost:30080
  insecure: true
# This is not a needed configuration, only useful if you want to explore the data in sandbox. For non sandbox, please
# do not use this configuration, instead prefer to use aws, gcs, azure sessions. Flytekit, should use fsspec to
# auto select the right backend to pull data as long as the sessions are configured. For Sandbox, this is special, as
# minio is s3 compatible and we ship with minio in sandbox.
storage:
  connection:
    endpoint: http://localhost:30002
    access-key: minio
    secret-key: miniostorage
Netstat shows the port is active:
(ubuntu) ubuntu@ip-10-42-128-255:~/vfm$ netstat -an | grep 30080
tcp        0      0 0.0.0.0:30080           0.0.0.0:*               LISTEN
But when I create a project:
flytectl create project --id "my-hello-world-project" --labels "my-label=my-project" --description "My Flyte Hello World project" --name "My Hello World Project"
I get a connection error:
Error: Connection Info: [Endpoint: localhost:30080, InsecureConnection?: true, AuthMode: ClientSecret]: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure
{"json":{},"level":"error","msg":"Connection Info: [Endpoint: localhost:30080, InsecureConnection?: true, AuthMode: ClientSecret]: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure","ts":"2024-04-30T20:08:33Z"}
I just noticed that one service is pending:
flyteagent-5b49c94c-ggfqj                           | Pending
Checked the k8s logs:
(ubuntu) ubuntu@ip-10-42-128-255:~/vfm$ kubectl logs flyteagent-5b49c94c-rnnl7
Error from server (BadRequest): container "flyteagent" in pod "flyteagent-5b49c94c-rnnl7" is waiting to start: trying and failing to pull image
Looks like I was able to get to the more detailed errors:
Warning  Failed            6m34s                  kubelet            Failed to pull image "ghcr.io/flyteorg/flyteagent:1.10.8b4": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to resolve reference "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to do request: Head "https://ghcr.io/v2/flyteorg/flyteagent/manifests/1.10.8b4": dial tcp: lookup ghcr.io on 10.42.0.2:53: read udp 10.42.0.1:49068->10.42.0.2:53: read: connection refused
Will debug myself from now on. It would be good to have instructions on debugging demo cluster issues though.
Looks like I'll need some help. I can download this image via docker, fine:
docker pull ghcr.io/flyteorg/flyteagent:1.10.8b4
But when Flyte cluster attempts that during deployment of flyteagent, it fails for some reason.
k
seems like a network issue, could you restart the demo cluster
flytectl demo teardown --volume
flytectl demo start
b
@Kevin Su Same issue.
During deployment, it cannot fetch the image for flyteagent, but when using docker on command line, I can fetch it without any issues.
k
@Buğra Gedik I can pull
this is really odd
@Buğra Gedik by Flyte cluster do you mean the demo cluster (flytectl demo start)?
b
Yes.
k
what!
b
The flyte agent is stuck and it is due to image pull failing
But I can pull that image if I do docker pull from command line
k
but all other images are downloaded fine?
b
Yes. Looks like this is the only one from gcr? I just spot checked a few others.
Using kubectl, I see the following error:
Warning  Failed            6m34s                  kubelet            Failed to pull image "ghcr.io/flyteorg/flyteagent:1.10.8b4": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to resolve reference "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to do request: Head "https://ghcr.io/v2/flyteorg/flyteagent/manifests/1.10.8b4": dial tcp: lookup ghcr.io on 10.42.0.2:53: read udp 10.42.0.1:49068->10.42.0.2:53: read: connection refused
k
if you are not using agent, could you disable that first.
flytectl demo start --disable-agent
k
ya most likely you are not using agents @Buğra Gedik just disable it for now - also it is not gcr - it is also ghcr
b
Yeah, sorry for the typo
Yes, I can get to the console now.
k
I think I know: it is probably because you are running out of space for the docker daemon?
can you check that
this agent image is big - cc @Kevin Su?
@Buğra Gedik i was able to run everything fine
b
I'm sure it works for others. Something is off for me. I'll check the space a little later and report back.
k
but your error message is suspicious. But I think it may be a red herring?
sometimes, you have to just restart your docker daemon or even the computer
😞 I am sorry
b
Running a hello world also gave a similar error within the task:
[1/1] currentAttempt done. Last Error: USER::Grace period [3m0s] exceeded|containers with unready status: [f092c13fb2ecf4c4fb31-n0-0]|Back-off pulling image "cr.flyte.org/flyteorg/flytekit:py3.11-1.11.0"
So the space issue is a good theory and I'll check that today
k
+1
b
@Kevin Su, @Ketan (kumare3) Just to close the loop here: the image pull failures were caused by DNS problems in the demo K8s cluster. It looks like K8s inherits the DNS config for kube-dns from /etc/resolv.conf on the host machine. However, on Ubuntu (which I'm using), /etc/resolv.conf is an indirection to a local address, which presumably cannot be reached from inside the cluster. The problem is reported here: https://github.com/kubernetes/kubernetes/issues/23474 I followed this answer (https://askubuntu.com/questions/130452/how-do-i-add-a-dns-server-via-resolv-conf/51332#51332) to add a non-local DNS server (like 8.8.8.8) to my /etc/resolv.conf. After that, the local Flyte demo cluster started and the test jobs passed.
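A quick way to spot the condition described above is to look for loopback nameservers in the host's resolv.conf before starting the demo cluster. The sketch below is only an illustration: `loopback_nameservers` is a hypothetical helper, and the `127.0.0.53` address in the usage note is an assumption about a typical Ubuntu stub resolver, not something stated in the thread.

```python
import ipaddress


def loopback_nameservers(resolv_conf_text: str) -> list[str]:
    """Return the nameserver entries that point at a loopback address."""
    found = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # resolv.conf comments start with '#' or ';'.
        if not stripped or stripped[0] in "#;":
            continue
        parts = stripped.split()
        if len(parts) < 2 or parts[0] != "nameserver":
            continue
        try:
            address = ipaddress.ip_address(parts[1])
        except ValueError:
            # Not a plain IP address; skip it rather than guess.
            continue
        if address.is_loopback:
            found.append(parts[1])
    return found
```

For example, `loopback_nameservers(open("/etc/resolv.conf").read())` returning a non-empty list (say, an entry such as `127.0.0.53`) would suggest that pods inheriting this config may fail DNS lookups in the way shown in the kubelet error earlier in the thread.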
k
thank you for the feedback. It might be good to capture this in a troubleshooting guide
b
@Kevin Su going back to my original question that started this thread:
if so, pyflyte run --copy-all will copy the entire <repo> to s3, and your task will download it while running
How does this 'copy' logic work if I want to use FlyteRemote and run my workflow programmatically?
I want to use FlyteRemote and launch plans to programmatically launch a job based on some light command line processing.
In this case, there is no 'pyflyte run', right? Is there a copy all equivalent for FlyteRemote.execute?
Let me take this to the top level as a separate question