Hi I've been reading the Flyte docs and have some questions Flyte #flyte-support

fierce-oil-47448

04/26/2024, 11:04 PM

Hi. I've been reading the Flyte docs and have some questions around packaging and deployment. The documentation on this centers around how to package and register the workflows and tasks. Assume my project includes a large number of python modules (say some kind of library, e.g. ml model code, data processing logic, etc) and most of the time I'm changing those and not the workflows or the tasks. Is there an assumption that I should be isolating my library as a python package and declaring it as a dependency of the Flyte project? Wouldn't that make iteration very cumbersome? Note that most of the time I'm not iterating on the workflow or the tasks (which I expect will be quite stable), but instead iterating on the library code that those tasks depend on.

freezing-airport-6809

04/26/2024, 11:33 PM

Why do you think it has to be isolated as a python package?

freezing-airport-6809

04/26/2024, 11:33 PM

you can use

pyflyte run --copy-all

Or just use imagespec to always run

fierce-oil-47448

04/26/2024, 11:55 PM

Let's say I want to have a code structure like this:

repo
- requirements.in/.txt
- foo_lib [python modules]
- orchestration
  - workflows [depends on foo_lib]

pyflyte run --remote orchestration/workflows/foo.py foo_workflow --name "Acme" --copy-all

freezing-airport-6809

04/28/2024, 10:03 PM

let me try and will get back to you

freezing-airport-6809

04/28/2024, 10:03 PM

you can also add the files to the docker image too

freezing-airport-6809

04/29/2024, 4:32 AM

cc @glamorous-carpet-83516 can you help here?

👀 1

glamorous-carpet-83516

04/29/2024, 4:39 AM

do you have an __init__.py file in the orchestration folder?

glamorous-carpet-83516

04/29/2024, 4:40 AM

if so, pyflyte run --copy-all will copy the entire <repo> to s3, and your task will download it while running
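Editor's sketch: the `__init__.py` question matters because, as far as I understand it, the packaging step picks the directory to upload by walking up from the workflow file until it reaches a directory with no `__init__.py`. The helper names below (`find_package_root`, `build_demo_repo`) are hypothetical and stdlib-only. This is not flytekit's actual code, only an illustration of that assumed rule applied to the layout described earlier in the thread.

```python
import tempfile
from pathlib import Path


def find_package_root(workflow_file: Path) -> Path:
    """Return the first ancestor directory that has no __init__.py.

    Assumption: this mirrors how the upload root is chosen; check the
    flytekit source for the authoritative behavior.
    """
    current = workflow_file.resolve().parent
    while (current / "__init__.py").exists():
        current = current.parent
    return current


def build_demo_repo(base: Path) -> Path:
    """Create the layout from the thread under `base`; return the path to foo.py."""
    repo = base / "repo"
    workflows = repo / "orchestration" / "workflows"
    workflows.mkdir(parents=True)
    (repo / "foo_lib").mkdir()
    (repo / "foo_lib" / "__init__.py").touch()
    # Both package directories carry an __init__.py; the repo root does not.
    (repo / "orchestration" / "__init__.py").touch()
    (workflows / "__init__.py").touch()
    foo = workflows / "foo.py"
    foo.touch()
    return foo


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        foo = build_demo_repo(Path(tmp))
        print(find_package_root(foo).name)
```

Under that assumption, the root resolves to `repo`, so `foo_lib` sits inside the copied tree. If `orchestration` lacked an `__init__.py`, the walk would stop lower down and `foo_lib` would be left out.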

fierce-oil-47448

04/29/2024, 4:40 AM

I’ll follow up here once I try the suggestions, if I still have an issue

👍 1

freezing-airport-6809

04/29/2024, 4:48 AM

are you using bazel?

fierce-oil-47448

04/30/2024, 8:09 PM

@glamorous-carpet-83516 I am not using bazel. I made some progress. I am trying local demo cluster first. I started the demo cluster:

flytectl demo start

Everything is up and running. I did:

export FLYTECTL_CONFIG=~/.flyte/config-sandbox.yaml

My config contents:

admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: localhost:30080
  insecure: true
# This is not a needed configuration, only useful if you want to explore the data in sandbox. For non sandbox, please
# do not use this configuration, instead prefer to use aws, gcs, azure sessions. Flytekit, should use fsspec to
# auto select the right backend to pull data as long as the sessions are configured. For Sandbox, this is special, as
# minio is s3 compatible and we ship with minio in sandbox.
storage:
  connection:
    endpoint: http://localhost:30002
    access-key: minio
    secret-key: miniostorage

Netstat shows the port is active:

(ubuntu) ubuntu@ip-10-42-128-255:~/vfm$ netstat -an | grep 30080
tcp        0      0 0.0.0.0:30080           0.0.0.0:*               LISTEN

But when I create a project:

flytectl create project --id "my-hello-world-project" --labels "my-label=my-project" --description "My Flyte Hello World project" --name "My Hello World Project"

I get a connection error:

Error: Connection Info: [Endpoint: localhost:30080, InsecureConnection?: true, AuthMode: ClientSecret]: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure
{"json":{},"level":"error","msg":"Connection Info: [Endpoint: localhost:30080, InsecureConnection?: true, AuthMode: ClientSecret]: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure","ts":"2024-04-30T20:08:33Z"}

fierce-oil-47448

04/30/2024, 8:30 PM

I just noticed that one service is pending:

flyteagent-5b49c94c-ggfqj                           | Pending

fierce-oil-47448

04/30/2024, 8:47 PM

Checked the k8 logs:

(ubuntu) ubuntu@ip-10-42-128-255:~/vfm$ kubectl logs flyteagent-5b49c94c-rnnl7
Error from server (BadRequest): container "flyteagent" in pod "flyteagent-5b49c94c-rnnl7" is waiting to start: trying and failing to pull image

fierce-oil-47448

04/30/2024, 8:49 PM

Looks like I was able to get to the more detailed errors:

Warning  Failed            6m34s                  kubelet            Failed to pull image "ghcr.io/flyteorg/flyteagent:1.10.8b4": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to resolve reference "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to do request: Head "https://ghcr.io/v2/flyteorg/flyteagent/manifests/1.10.8b4": dial tcp: lookup ghcr.io on 10.42.0.2:53: read udp 10.42.0.1:49068->10.42.0.2:53: read: connection refused

Will debug myself from now on. It would be good to have instructions on debugging demo cluster issues though.

fierce-oil-47448

04/30/2024, 9:01 PM

Looks like I'll need some help. I can download this image via docker, fine:

docker pull ghcr.io/flyteorg/flyteagent:1.10.8b4

But when Flyte cluster attempts that during deployment of flyteagent, it fails for some reason.

glamorous-carpet-83516

04/30/2024, 9:11 PM

seems like a network issue, could you restart the demo cluster

flytectl demo teardown --volume
flytectl demo start

fierce-oil-47448

04/30/2024, 9:19 PM

@glamorous-carpet-83516 Same issue.

fierce-oil-47448

04/30/2024, 9:20 PM

During deployment, it cannot fetch the image for flyteagent, but when using docker on command line, I can fetch it without any issues.

freezing-airport-6809

04/30/2024, 9:20 PM

@fierce-oil-47448 I can pull

freezing-airport-6809

04/30/2024, 9:21 PM

this is really odd

freezing-airport-6809

04/30/2024, 9:21 PM

@fierce-oil-47448 with flytecluster do you mean the demo cluster

flytectl demo start

fierce-oil-47448

04/30/2024, 9:21 PM

Yes.

freezing-airport-6809

04/30/2024, 9:21 PM

what!

fierce-oil-47448

04/30/2024, 9:21 PM

The flyte agent is stuck and it is due to image pull failing

fierce-oil-47448

04/30/2024, 9:22 PM

But I can pull that image if I do docker pull from command line

freezing-airport-6809

04/30/2024, 9:22 PM

but all other images are downloaded fine?

fierce-oil-47448

04/30/2024, 9:22 PM

Yes. Looks like this is the only one from gcr? I just spot checked a few others.

fierce-oil-47448

04/30/2024, 9:22 PM

Using kubectl, I see the following error:

Warning  Failed            6m34s                  kubelet            Failed to pull image "ghcr.io/flyteorg/flyteagent:1.10.8b4": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to resolve reference "ghcr.io/flyteorg/flyteagent:1.10.8b4": failed to do request: Head "https://ghcr.io/v2/flyteorg/flyteagent/manifests/1.10.8b4": dial tcp: lookup ghcr.io on 10.42.0.2:53: read udp 10.42.0.1:49068->10.42.0.2:53: read: connection refused

glamorous-carpet-83516

04/30/2024, 9:23 PM

if you are not using agent, could you disable that first.

flytectl demo start --disable-agent

freezing-airport-6809

04/30/2024, 9:23 PM

ya most likely you are not using agents @fierce-oil-47448 just disable it for now - also it is not gcr - it is also ghcr

fierce-oil-47448

04/30/2024, 9:24 PM

Yeah, sorry for the typo

fierce-oil-47448

04/30/2024, 9:26 PM

Yes, I can get to the console now.

freezing-airport-6809

04/30/2024, 9:26 PM

I think I know, it is probably because you are running out of space for the docker daemon?

freezing-airport-6809

04/30/2024, 9:26 PM

can you check that

freezing-airport-6809

04/30/2024, 9:26 PM

this agent image is big - cc @glamorous-carpet-83516?

freezing-airport-6809

04/30/2024, 9:27 PM

@fierce-oil-47448 i was able to run everything fine

fierce-oil-47448

04/30/2024, 9:28 PM

I'm sure it works for others. Something is off for me. I'll check the space a little later and report back.

freezing-airport-6809

04/30/2024, 9:28 PM

but your error message is suspicious. But i think it may be red herring?

freezing-airport-6809

04/30/2024, 9:28 PM

sometimes, you have to just restart your docker daemon or even the computer

freezing-airport-6809

04/30/2024, 9:28 PM

😞 I am sorry

fierce-oil-47448

04/30/2024, 9:47 PM

Running a hello world also gave a similar error within the task:

[1/1] currentAttempt done. Last Error: USER::Grace period [3m0s] exceeded|containers with unready status: [f092c13fb2ecf4c4fb31-n0-0]|Back-off pulling image "cr.flyte.org/flyteorg/flytekit:py3.11-1.11.0"

fierce-oil-47448

04/30/2024, 9:48 PM

So the space issue is a good theory and I'll check that today

freezing-airport-6809

04/30/2024, 9:58 PM

fierce-oil-47448

05/02/2024, 12:18 AM

@glamorous-carpet-83516, @freezing-airport-6809 Just to close the loop here, the problem with not being able to pull images had to do with DNS problems in the demo K8 cluster. It looks like K8 inherits DNS configs for kube-dns from /etc/resolv.conf on the host machine. However, on ubuntu (which I'm using), /etc/resolv.conf is an indirection to a local address. The problem is reported here: https://github.com/kubernetes/kubernetes/issues/23474 I had to follow this (https://askubuntu.com/questions/130452/how-do-i-add-a-dns-server-via-resolv-conf/51332#51332) to add a non-local DNS (like 8.8.8.8) to my /etc/resolv.conf. This way I was able to start the local Flyte demo cluster and also the test jobs passed.
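Editor's sketch: a quick way to spot the condition described above is to look for loopback `nameserver` entries in the host's `/etc/resolv.conf`. The function name `loopback_nameservers` is hypothetical. The sample text uses `127.0.0.53`, the usual systemd-resolved stub address on Ubuntu, which is an assumption about the setup and not a value quoted in the thread.

```python
import ipaddress


def loopback_nameservers(resolv_conf_text: str) -> list[str]:
    """Return the nameserver entries that point at a loopback address."""
    flagged = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (resolv.conf allows '#' and ';').
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) < 2 or parts[0] != "nameserver":
            continue
        try:
            address = ipaddress.ip_address(parts[1])
        except ValueError:
            # Not a plain IP literal; ignore rather than guess.
            continue
        if address.is_loopback:
            flagged.append(parts[1])
    return flagged


if __name__ == "__main__":
    # Assumed example content, not copied from the reporter's machine.
    sample = "nameserver 127.0.0.53\noptions edns0 trust-ad\nsearch example.internal\n"
    print(loopback_nameservers(sample))
```

A non-empty result suggests that pods inheriting this file would send DNS queries to an address that only resolves on the host, which is consistent with the `connection refused` on port 53 seen in the kubelet events.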

freezing-airport-6809

05/02/2024, 12:32 AM

thank you for the feedback. It might be good to capture this in a troubleshooting guide

fierce-oil-47448

05/02/2024, 7:12 PM

@glamorous-carpet-83516 going back to my original question that started this thread:

if so, pyflyte run --copy-all will copy the entire <repo> to s3, and your task will download it while running

fierce-oil-47448

05/02/2024, 7:13 PM

How does this 'copy' logic work if I want to use FlyteRemote and run my workflow programmatically?

fierce-oil-47448

05/02/2024, 7:14 PM

I want to use FlyteRemote and launch plans to programmatically launch a job based on some light command line processing.

fierce-oil-47448

05/02/2024, 7:16 PM

In this case, there is no 'pyflyte run', right? Is there a copy all equivalent for FlyteRemote.execute?

fierce-oil-47448

05/03/2024, 12:19 AM

Let me take this to the top level as a separate question
