#ask-the-community

Istiyak H. Siddiquee

01/27/2024, 3:18 PM
Hello everyone, I am having some trouble with my bare-metal deployment. I have two clusters: one is composed of two Geekom mini PCs and the other has six AMD PCs. I am following David Espejo's guide on deploying Flyte the hard way on a bare-metal cluster (https://github.com/davidmirror-ops/flyte-the-hard-way). The codebase works perfectly well in my mini cluster, but when I deploy the same code to the AMD cluster, I see the following error:
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.11/site-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
        return wrapped(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/venv/lib/python3.11/site-packages/flytekit/core/base_task.py", line 603, in dispatch_execute
        raise type(exc)(msg) from exc
    Message:
        Failed to convert inputs of task 'script.fit_logistic_model': [Errno 2] No such file or directory: 'file:///tmp/flyte-vdaei2t9/raw/9dba8aa2d373ba768586e64f6b6f8833/ff80b294119064057ed27c26d3e33483'
    SYSTEM ERROR! Contact platform administrators.
Please note, I have tried deploying Flyte without persistent volumes and the error remains the same. In addition, MinIO has no access issue with the folder, as I can see files being uploaded as soon as I deploy my code. Am I missing something? Thanks in advance for helping me.
Also, I have checked the /tmp folder, and the aforementioned file exists there. Apparently, Flyte created that file as soon as the execution started and then could not access it anymore. Surprising!

Ketan (kumare3)

01/27/2024, 3:45 PM
You cannot access the file from a local folder in the cluster; you need some sort of global store.

Istiyak H. Siddiquee

01/27/2024, 3:48 PM
Could you suggest an approach to setting that up?

Ketan (kumare3)

01/27/2024, 3:48 PM
MinIO, NFS, S3

Istiyak H. Siddiquee

01/27/2024, 3:51 PM
I do have MinIO in the cluster. How can I point pyflyte to that store?

Ketan (kumare3)

01/27/2024, 4:03 PM
I don’t know your code
You have to upload the data and pass the s3://… key as a reference

Istiyak H. Siddiquee

01/27/2024, 5:33 PM
Thanks! I'll store it somewhere else.
Hi @Ketan (kumare3), please correct me if I am wrong: if there is any file access in the codebase, those files will be kept inside the tmp folder, even if I am running Flyte in a cluster. My code does access some files, which might explain the aforementioned error. So, if I put those files somewhere else, I should be good, right?
To test this hypothesis, I wrote the following simple test, where I am not accessing any file but am passing X and Y to a task from my workflow. So, technically, it should run. Unfortunately, I am still getting the same error, saying Flyte could not convert the inputs of the task because it could not read data from the /tmp folder, which is located on the master node. So apparently task parameters are also written to the tmp folder. Could you suggest some way around it? Could it be that the blob store is not set up properly in my installation?
--------------------
the experiment code:
    import numpy as np
    from flytekit import task, workflow
    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression

    @task(container_image="istiyaksiddiquee/dummy-test-for-flyte:test01")
    def fitting_task(X: np.ndarray, Y: np.ndarray) -> None:
        logreg = LogisticRegression(C=1e5)
        logreg.fit(X, Y)
        print("inside task")
        return

    @workflow
    def wf() -> None:
        iris = datasets.load_iris()
        X = iris.data[:, :2]
        Y = iris.target
        fitting_task(X=X, Y=Y)
        print("task returned")
        return
-----------------
the error:
    Message:
        Failed to convert inputs of task 'workflows.example.fitting_task':
          Failed to get data from file:///tmp/flyte-ijh97429/raw/fb171d4d3d0ab819cf0ac3b38093a5c9/48d64e45692d894b249fa58163be483b.npy to /tmp/flytefk5c2_f3/local_flytekit/4380687bfa064af66da9d9657fdd75a4 (recursive=False).
        Original exception: Value error! Received: file:///tmp/flyte-ijh97429/raw/fb171d4d3d0ab819cf0ac3b38093a5c9/48d64e45692d894b249fa58163be483b.npy. File not found
Update: I ran the same code in both my clusters. In the cluster where the code ran successfully, the input locations are shown as s3://my-s3-bucket/...., and in the cluster where the code failed, the location is shown as file:///tmp/..... So clearly there is a settings mismatch, even though I am running the same setup scripts for both clusters. Here are the two images depicting the mismatch. Please help me fix this. Thanks.
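The comparison in the update above can be sketched as a small check. This is only an illustration, not part of Flyte: the helper names and the set of "shared" schemes are assumptions, and the example locations are the elided ones from this thread, copied by hand rather than fetched through any Flyte API.

```python
from urllib.parse import urlparse

# Schemes treated as shared object storage in this sketch (an assumption, not a Flyte-defined list).
REMOTE_SCHEMES = {"s3", "gs", "abfs"}


def is_shared_location(uri: str) -> bool:
    """Return True if the URI scheme suggests a store that every node can reach."""
    return urlparse(uri).scheme in REMOTE_SCHEMES


def find_node_local_inputs(input_uris: dict[str, str]) -> list[str]:
    """Return the names of inputs whose data sits on a single node's filesystem."""
    return sorted(name for name, uri in input_uris.items() if not is_shared_location(uri))


# Input locations modeled on the two clusters in this thread (paths elided as in the messages).
working = {"X": "s3://my-s3-bucket/...", "Y": "s3://my-s3-bucket/..."}
failing = {"X": "file:///tmp/...", "Y": "file:///tmp/..."}

print(find_node_local_inputs(working))
print(find_node_local_inputs(failing))
```

Any input flagged by `find_node_local_inputs` was written under a `file://` location, which a task pod scheduled on another node cannot read.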
Hi @Ketan (kumare3), could you give me some input please?

Ketan (kumare3)

01/28/2024, 6:23 PM
Yes, this looks like the config in propeller is wrong. Alternatively, you can simply set the raw output prefix for workflow registrations. Also, it is a Sunday here, so sorry for the delay, but we cannot be available for open source help 24/7.

Istiyak H. Siddiquee

01/28/2024, 6:51 PM
Thank you. I will set the raw output prefix.

David Espejo (he/him)

01/29/2024, 6:02 PM
@Istiyak H. Siddiquee I guess you're running `flyte-binary`. If that's the case, make sure that your new deployment (AMD) is configured to point to your S3-compatible bucket: https://github.com/flyteorg/flyte/blob/fb9ffd56e81e7f7e4657cd668e53b2f1557e9178/charts/flyte-binary/eks-production.yaml#L9. Also check the env vars with the MinIO credentials. That chart has a template that builds the `s3://...` prefix based on the `configuration.storage.provider` you have passed to the Helm chart. This section covers some of the background needed to see how, even if you're not moving files, Flyte still uploads/downloads content to/from blob storage: https://docs.flyte.org/en/latest/concepts/data_management.html#types-of-data

Istiyak H. Siddiquee

01/29/2024, 6:27 PM
Hi @David Espejo (he/him), thanks for your response. I checked all of the above, set the raw-output-dir for the run and register commands, and edited the ConfigMap inside K8S, but still could not make it work. However, I then tore down the cluster and reset everything, and this time it worked. So I think there could have been some leftover misconfiguration from a previous installation, when I used Helm charts. Now everything is working as expected and I am able to run the workloads. Thanks for all of your patience and cooperation.