#ask-the-community
i

Istiyak H. Siddiquee

12/05/2023, 12:10 PM
Hello everyone, I am having some trouble with my bare-metal Flyte cluster. After a successful deployment, I am able to run my workflow in the cluster. But after a while I get an access denied error from S3. Apparently, Flyte can put objects to S3, but it cannot list objects from it. So I did a manual policy check and created a ListBucket policy by copying an existing policy. Still, I am getting the same error. Could anyone please help me with this?
The following text block is the error I got from the Kubernetes Pod. The screenshots show the policy I created, which did not work.
ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
/usr/local/lib/python3.10/site-packages/flytekit/core/data_persistence.py:299 in get_data
  ❱ 299   self.get(remote_path, to_path=local_path, recursive=is
/usr/local/lib/python3.10/site-packages/flytekit/core/data_persistence.py:203 in get
  ❱ 203   return file_system.get(from_path, to_path, recursive=r
/usr/local/lib/python3.10/site-packages/fsspec/asyn.py:118 in wrapper
  ❱ 118   return sync(self.loop, func, *args, **kwargs)
/usr/local/lib/python3.10/site-packages/fsspec/asyn.py:103 in sync
  ❱ 103   raise return_result
/usr/local/lib/python3.10/site-packages/fsspec/asyn.py:56 in _runner
  ❱ 56   result[0] = await coro
/usr/local/lib/python3.10/site-packages/fsspec/asyn.py:619 in _get
  ❱ 619   rpaths = [
/usr/local/lib/python3.10/site-packages/fsspec/asyn.py:620 in <listcomp>
  ❱ 620   p for p in rpaths if not (trailing_sep(p) or awai
/usr/local/lib/python3.10/site-packages/s3fs/core.py:1449 in _isdir
  ❱ 1449   return bool(await self._lsdir(path))
/usr/local/lib/python3.10/site-packages/s3fs/core.py:757 in _lsdir
  ❱ 757   raise translate_boto_error(e)
PermissionError: Access Denied.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
/usr/local/bin/pyflyte-fast-execute:8 in <module>
  ❱ 8   sys.exit(fast_execute_task_cmd())
/usr/local/lib/python3.10/site-packages/click/core.py:1157 in __call__
  ❱ 1157   return self.main(*args, **kwargs)
/usr/local/lib/python3.10/site-packages/click/core.py:1078 in main
  ❱ 1078   rv = self.invoke(ctx)
/usr/local/lib/python3.10/site-packages/click/core.py:1434 in invoke
  ❱ 1434   return ctx.invoke(self.callback, **ctx.params)
/usr/local/lib/python3.10/site-packages/click/core.py:783 in invoke
  ❱ 783   return __callback(*args, **kwargs)
/usr/local/lib/python3.10/site-packages/flytekit/bin/entrypoint.py:519 in fast_execute_task_cmd
  ❱ 519   _download_distribution(additional_distribution, dest_dir)
/usr/local/lib/python3.10/site-packages/flytekit/core/utils.py:295 in wrapper
  ❱ 295   return func(*args, **kwargs)
/usr/local/lib/python3.10/site-packages/flytekit/tools/fast_registration.py:113 in download_distribution
  ❱ 113   FlyteContextManager.current_context().file_access.get_data(additio
/usr/local/lib/python3.10/site-packages/flytekit/core/data_persistence.py:301 in get_data
  ❱ 301   raise FlyteAssertion(
FlyteAssertion: Failed to get data from s3://my-s3-bucket/fakeray/staging/K4H42DEO2E6GLIFGQW3DZUEH7E======/script_mode.tar.gz to ./ (recursive=False). Original exception: Access Denied.
to add some more context: I have tried accessing the bucket from outside using the NodePort. This works without any issue.
s

Samhita Alla

12/06/2023, 7:31 AM
cc @David Espejo (he/him) i think there's a missing connection between flyte and the s3 bucket to list objects. have you used the flyte binary helm chart?
also, have you attached any iam role to your default service account? if so, could you check the iam role policy?
i

Istiyak H. Siddiquee

12/06/2023, 10:38 AM
@Samhita Alla Thanks for your response. Yes, I am using the flyte-binary helm chart. To be more specific, I am using the flyte-sandbox deployment. Do you want to see my values.yaml file? No, I have not attached any IAM role to my default service account. Do I need IAM for an on-prem deployment with MinIO? The access policy here is password-based and I can easily access files with simple Python code, without IAM. Moreover, I have checked the stack trace and tested the access with the s3fs package as well. The exception happened when the codebase tried to hit the internal method _lsdir of s3fs. As it was not possible to use the internal method directly, I tried the listdir and ls methods of s3fs, which should in turn hit the internal _lsdir method. This test was successful with password-based authentication, without IAM.
d

David Espejo (he/him)

12/06/2023, 11:22 AM
hey @Istiyak H. Siddiquee sorry for the delay. So the access policy you shared is for minio? In any case, the minimum permissions are:
"Action": [
    "s3:DeleteObject*",
    "s3:GetObject*",
    "s3:ListBucket",
    "s3:PutObject*"
],
"Resource": [
    "arn:aws:s3:::<your-S3-bucket>*",
    "arn:aws:s3:::<your-S3-bucket>*/*"
],
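For reference, the fragment above is only the `Action` and `Resource` part of a policy statement. A minimal Python sketch of how it might be wrapped into a complete policy document, and how an existing policy could be checked for missing actions, is shown below. The helper names `build_policy` and `missing_actions` are hypothetical, the bucket name is the one from the error message, and the check is deliberately simplified (it ignores Deny statements, conditions, and resource matching, and compares action names case-sensitively).

```python
import json
from fnmatch import fnmatchcase

# Actions quoted in the thread as the minimum for the Flyte bucket.
MINIMUM_ACTIONS = ["s3:DeleteObject*", "s3:GetObject*", "s3:ListBucket", "s3:PutObject*"]


def build_policy(bucket: str) -> dict:
    """Wrap the actions and resources from the thread in a full policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": MINIMUM_ACTIONS,
                "Resource": [
                    f"arn:aws:s3:::{bucket}*",    # the bucket itself (s3:ListBucket applies here)
                    f"arn:aws:s3:::{bucket}*/*",  # the objects inside the bucket
                ],
            }
        ],
    }


def missing_actions(policy: dict, needed: list) -> list:
    """Return the concrete actions that no Allow statement grants (simplified check)."""
    granted = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        granted.extend([actions] if isinstance(actions, str) else actions)
    return [a for a in needed if not any(fnmatchcase(a, pattern) for pattern in granted)]


if __name__ == "__main__":
    policy = build_policy("my-s3-bucket")
    print(json.dumps(policy, indent=2))
    print(missing_actions(policy, ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]))
```

Note that `s3:ListBucket` is evaluated against the bucket resource while the object actions are evaluated against the object resources, which is why both `Resource` entries appear.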
i

Istiyak H. Siddiquee

12/06/2023, 11:40 AM
Hi @David Espejo (he/him), Yes, the access policy is for MinIO. I am working on a bare-metal cluster. The flyte-sandbox does not create the ListBucket permission, so I added it manually. Still, it did not work.
Could you also tell me, whether this will require ServiceAccount permission?
d

David Espejo (he/him)

12/06/2023, 11:50 AM
well, on a cloud environment it does, but on prem you should be able to use env vars to pass credentials to the Pods by making sure this info is in your values file:
inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - FLYTE_AWS_ENDPOINT: "http://minio.flyte.svc.cluster.local:9000" # change to the full name of your minio service
          - FLYTE_AWS_ACCESS_KEY_ID: "minio" # change to your particular environment
          - FLYTE_AWS_SECRET_ACCESS_KEY: "miniostorage" # use the same value as the MINIO_ROOT_PASSWORD
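To check from inside a task Pod that these variables actually arrived, a small sketch like the following may help. It is not flytekit's own code: `s3fs_kwargs_from_env` and `missing_variables` are hypothetical helpers that read the three `FLYTE_AWS_*` variables from the values file above and translate them into keyword arguments that `s3fs.S3FileSystem` accepts (`key`, `secret`, `client_kwargs`).

```python
import os
from typing import Mapping, Optional

# Variable names taken from the values file in the thread.
EXPECTED_VARS = (
    "FLYTE_AWS_ENDPOINT",
    "FLYTE_AWS_ACCESS_KEY_ID",
    "FLYTE_AWS_SECRET_ACCESS_KEY",
)


def missing_variables(env: Optional[Mapping[str, str]] = None) -> list:
    """List the expected variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


def s3fs_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Translate the FLYTE_AWS_* variables into s3fs.S3FileSystem keyword arguments."""
    env = os.environ if env is None else env
    kwargs: dict = {}
    if env.get("FLYTE_AWS_ACCESS_KEY_ID"):
        kwargs["key"] = env["FLYTE_AWS_ACCESS_KEY_ID"]
    if env.get("FLYTE_AWS_SECRET_ACCESS_KEY"):
        kwargs["secret"] = env["FLYTE_AWS_SECRET_ACCESS_KEY"]
    if env.get("FLYTE_AWS_ENDPOINT"):
        # Without an endpoint override the client would talk to AWS instead of MinIO.
        kwargs["client_kwargs"] = {"endpoint_url": env["FLYTE_AWS_ENDPOINT"]}
    return kwargs


if __name__ == "__main__":
    print("missing:", missing_variables())
    print("s3fs kwargs keys:", sorted(s3fs_kwargs_from_env()))
```

Running it inside the failing Pod (for example via `kubectl exec`) shows whether the credentials and endpoint were injected, without printing the secret itself.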
i

Istiyak H. Siddiquee

12/06/2023, 11:50 AM
I am doing exactly this.
d

David Espejo (he/him)

12/06/2023, 11:51 AM
"I am able to run my workflow in the cluster. But, after a while I am getting an access denied error from S3."
has anything changed?
i

Istiyak H. Siddiquee

12/06/2023, 11:58 AM
Actually there is some progress, but I am not sure whether this progress means I have gotten past that error. That's why I did not mention it. However, I realized that flyte-binary creates two folders inside the S3 bucket: one for the project (with a bunch of other stuff), and the other for metadata. Then, right before spinning up the container, it tries to access these items with ListObjectsV2 and fails. Now I have made the bucket public and it seems to have moved on to a new error, where flyte-binary can't find the script I am trying to run (it is throwing ModuleNotFoundError). But the script is right there in the Dockerfile.
d

David Espejo (he/him)

12/06/2023, 12:03 PM
can you share your Dockerfile? also, make sure your workflow sits in the workflows dir and you have a __init__.py file there
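A quick way to verify that layout before building the image is sketched below. The helper `check_layout` is hypothetical and only checks the two things mentioned above: that a `workflows` directory exists and that it contains an `__init__.py` file.

```python
from pathlib import Path


def check_layout(project_root: str, package: str = "workflows") -> list:
    """Return a list of layout problems; an empty list means the layout looks fine."""
    package_dir = Path(project_root) / package
    problems = []
    if not package_dir.is_dir():
        problems.append(f"missing directory: {package}/")
    elif not (package_dir / "__init__.py").is_file():
        problems.append(f"missing file: {package}/__init__.py")
    return problems


if __name__ == "__main__":
    # Check the current directory, assuming it is the project root.
    for problem in check_layout("."):
        print(problem)
```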
i

Istiyak H. Siddiquee

12/06/2023, 12:05 PM
I definitely missed the instruction about putting the code inside the workflows directory. However, the following is my Dockerfile. Previously, I was using a user flyte, but then I switched to root as I thought it might help with the permission issue.
FROM python:3.10-slim-buster
WORKDIR /work
USER root
ARG VERSION
ARG DOCKER_IMAGE
RUN apt-get update && apt-get install build-essential -y
# Pod tasks should be exposed in the default image
RUN pip install flytekit==1.9.1 \
    flytekitplugins-pod==1.9.1 \
    flytekitplugins-deck-standard==1.9.1
COPY ./requirements.txt .
RUN pip install -r requirements.txt
COPY ./script.py .
COPY ./rq2_vosoughi_False_features.csv .
COPY ./rq2_vosoughi_True_features.csv .
is there a standard Dockerfile that I can follow for this?
I found it. Thanks.
d

David Espejo (he/him)

12/06/2023, 12:39 PM
sorry, I was trying to find an example of using custom scripts on ImageSpec (the no-Dockerfile alternative)
i

Istiyak H. Siddiquee

12/06/2023, 12:43 PM
that's a good idea. Is it like this?
d

David Espejo (he/him)

12/06/2023, 12:50 PM
more or less. According to docs, the base image is determined based on the Python version+flytekit version. So it would be
base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.0"
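As a small illustration of that naming pattern, the sketch below derives the tag from a Python version and a flytekit version. The function `default_base_image` is a hypothetical helper that only mirrors the example string above; it is not taken from flytekit, and the actual default may be computed differently.

```python
def default_base_image(python_version: str, flytekit_version: str) -> str:
    """Build an image reference of the form shown in the thread (py<major>.<minor>-<flytekit>)."""
    major, minor = python_version.split(".")[:2]
    return f"ghcr.io/flyteorg/flytekit:py{major}.{minor}-{flytekit_version}"


if __name__ == "__main__":
    print(default_base_image("3.10", "1.10.0"))
```

The resulting string is what would be passed as `base_image`, as in the message above.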
i

Istiyak H. Siddiquee

12/06/2023, 12:55 PM
okay! let me try with the Dockerfile first, then I am gonna try this, if the first experiment fails
thanks @David Espejo (he/him) 😄
Hi @David Espejo (he/him), quick update: the error is still there. I have followed these instructions (https://docs.flyte.org/projects/cookbook/en/stable/getting_started/creating_flyte_project.html) to build the docker image and then deployed the codebase again. My MinIO bucket is set to have public access, with an additional permission for ListBucket (as you mentioned earlier). Still, the error is there.
d

David Espejo (he/him)

12/06/2023, 3:29 PM
what are the current contents of the S3 policy?
ListBucket by itself is not enough I believe
i

Istiyak H. Siddiquee

12/06/2023, 3:30 PM
the contents of the S3 policy are still the same as in the first image I provided.
Hi @David Espejo (he/him), any suggestion?
s

Samhita Alla

12/07/2023, 9:36 AM
could you share the description of the failing pod?
i

Istiyak H. Siddiquee

12/07/2023, 10:04 AM
this is the description of the failed pod.
s

Samhita Alla

12/07/2023, 11:56 AM
looks like the workdir you set in the dockerfile isn't accessible. can you set the python path env variable like in this dockerfile?
can you also remove the USER and check if that's working?
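The effect of the Python path on a `ModuleNotFoundError` can be demonstrated with a short, self-contained sketch. It uses a temporary directory as a stand-in for the image's working directory and a made-up module name (`flyte_thread_demo_script`); both are assumptions for illustration only. Setting `PYTHONPATH` has the same effect as the `sys.path.insert` call here.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "flyte_thread_demo_script"  # made-up name, stands in for script.py


def is_importable(module_name: str) -> bool:
    """True if Python can currently locate the module on sys.path."""
    return importlib.util.find_spec(module_name) is not None


def demo() -> tuple:
    """Show that a module is found only once its directory is on sys.path."""
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / f"{MODULE_NAME}.py").write_text("VALUE = 1\n")
        before = is_importable(MODULE_NAME)   # directory not on sys.path yet
        sys.path.insert(0, workdir)           # equivalent to adding it to PYTHONPATH
        try:
            importlib.invalidate_caches()
            after = is_importable(MODULE_NAME)
        finally:
            sys.path.remove(workdir)
        return before, after


if __name__ == "__main__":
    print(demo())
```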
d

David Espejo (he/him)

12/07/2023, 2:51 PM
@Istiyak H. Siddiquee I haven't been able to reproduce this. I'm using an on-prem K8s cluster with a private minio bucket and I'm able to run workflows. I haven't attached any particular policy to any user. AFAICT, the ListBucket permission won't be enough for Flyte. The following permissions are required:
"Action": [
    "s3:DeleteObject*",
    "s3:GetObject*",
    "s3:ListBucket",
    "s3:PutObject*"
],
"Resource": [
    "arn:aws:s3:::<your-S3-bucket>*",
    "arn:aws:s3:::<your-S3-bucket>*/*"
],
i

Istiyak H. Siddiquee

12/07/2023, 3:06 PM
@David Espejo (he/him), could you share the Dockerfile of your workflow? The rest of the setup is standard, I guess. About the bucket permission: as I showed in the picture, the MinIO bucket has all the permissions you have listed above. I am yet to try the suggestions put forth by @Samhita Alla. I will update you as soon as I am done.
d

David Espejo (he/him)

12/07/2023, 3:25 PM
I used the default image. I'm not sure this is tied to the contents of the image. Can you try running the hello world example?
i

Istiyak H. Siddiquee

12/07/2023, 11:32 PM
Dear @David Espejo (he/him), I have done some debugging. I think this is an issue with the S3FS library that flyte-core is using as the backbone to communicate with MinIO. Hence, no matter how I create my Dockerfile, it does not help. Below is the approach I adopted for testing this claim:
I deployed another MinIO instance in my cluster, in the same namespace. Then I tried communicating with both instances from a separate namespace, using the FQDN of the MinIO service, to mimic the real-world scenario of a deployed workflow's interaction with the MinIO pod. I tried to access both buckets with two different libraries: S3FS (the one flyte-core is using) and boto3. With boto3, everything works: I can list, put, and get. But S3FS fails in all tasks.
Then I tried creating a new user with all the permissions you have mentioned. Once again, the tests with boto3 are successful with the new user and the tests with S3FS fail on the same tasks. Please remember, both tests took place inside a pod running in a separate namespace. Below is the screenshot of the code and the verification of the newly created user's access permission.
As you have mentioned that you could not reproduce the issue, could you provide me the values.yaml file for your helm installation of the flyte-sandbox? I think using a different version might resolve the issue for me, as I am using the latest version of all the images.
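The original test code was shared as a screenshot and is not reproduced in this archive. A minimal sketch of such a comparison, under stated assumptions, might look like the following. The helper names (`split_s3_uri`, `list_with_boto3`, `list_with_s3fs`, `run_check`) are hypothetical, the endpoint and credentials are placeholders to be filled in, and `boto3` and `s3fs` are imported lazily so the pure helpers work without them.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_with_boto3(uri: str, endpoint: str, key: str, secret: str) -> list:
    """List keys under the URI with boto3 (ListObjectsV2)."""
    import boto3  # third-party, imported lazily

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [item["Key"] for item in response.get("Contents", [])]


def list_with_s3fs(uri: str, endpoint: str, key: str, secret: str) -> list:
    """List the same location with s3fs, the library seen in the traceback."""
    import s3fs  # third-party, imported lazily

    bucket, prefix = split_s3_uri(uri)
    fs = s3fs.S3FileSystem(key=key, secret=secret, client_kwargs={"endpoint_url": endpoint})
    return fs.ls(f"{bucket}/{prefix}".rstrip("/"), detail=False)


def run_check(name: str, func) -> tuple:
    """Run one check and report ('name', 'ok' | 'failed', result or error text)."""
    try:
        return name, "ok", func()
    except Exception as exc:  # report any failure instead of stopping the comparison
        return name, "failed", repr(exc)
```

A comparison run would then call, for example, `run_check("boto3", lambda: list_with_boto3(uri, endpoint, key, secret))` and the same with `list_with_s3fs`, from a Pod in a separate namespace as described above. If only the s3fs call fails, comparing the endpoint and credentials each client actually received is a reasonable next step.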
s

Samhita Alla

12/08/2023, 6:49 AM
what flytekit version are you using?
thanks for going so deep to find the root cause of the issue!
could you also share the fsspec version?
i

Istiyak H. Siddiquee

12/08/2023, 7:51 AM
@Samhita Alla I am using the latest tags for all the images. But I am unable to find the fsspec version from these containers. Could you tell me a different image tag which might work? I am using the helm chart for this installation. Here is the screenshot of my values.yaml file.
s

Samhita Alla

12/08/2023, 8:51 AM
since you mentioned s3fs is the root cause of your issue, it may not have to do with your deployment (just a guess). since you're using a docker image, can you pip install the requirements in your local env and run pip show fsspec?
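The same information can also be read from Python, which is convenient inside a container where running `pip` is awkward. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_versions` and the default package list are assumptions for illustration.

```python
from importlib import metadata


def installed_versions(packages=("flytekit", "fsspec", "s3fs")) -> dict:
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```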
also, are you using image spec?
i

Istiyak H. Siddiquee

12/08/2023, 8:52 AM
no, I am not using image spec. however, in my docker image, I am not using fsspec directly. it is being used by flytekit.
s

Samhita Alla

12/08/2023, 8:53 AM
okay, can you use the latest version of flytekit, please? 1.10.2
i

Istiyak H. Siddiquee

12/08/2023, 8:54 AM
okay. but I am using the sandbox deployment of helm chart. could you please tell me the corresponding version of the flyte-binary?
s

Samhita Alla

12/08/2023, 8:54 AM
the latest is 1.10.6
i

Istiyak H. Siddiquee

12/08/2023, 8:54 AM
great, let me try with 1.10.6. thanks
@Samhita Alla, with 1.10.6 as the image version, I am getting ErrImagePull. I have tried the following repositories for the flyte-binary: cr.flyte.org/flyteorg/flyte-binary and ghcr.io/flyteorg/flyte-binary
s

Samhita Alla

12/08/2023, 9:31 AM
can you try v1.10.6? is that what you tried?
i

Istiyak H. Siddiquee

12/08/2023, 9:32 AM
yes, i tried 1.10.6
s

Samhita Alla

12/08/2023, 9:32 AM
have you set the tag as "v1.10.6"?
i

Istiyak H. Siddiquee

12/08/2023, 9:33 AM
i did not use the v in front of 1.10.6. let me try again
still the same
s

Samhita Alla

12/08/2023, 9:35 AM
can you try "sha-c049865cba017ad826405c7145cd3eccbc553232"?
"latest" should actually pull the latest image. so you shouldn't worry about it.
have you tried using flytekit 1.10.2?
i

Istiyak H. Siddiquee

12/08/2023, 9:44 AM
actually I was using the latest tag and it did not work.
let me try with 1.10.2
@Samhita Alla v1.10.2 gives the same result. is it possible to provide a values.yaml file for the helm chart that will work in the bare-metal case? I think @David Espejo (he/him) deployed something similar in a cluster built with micro-k8s.
s

Samhita Alla

12/12/2023, 6:40 AM
cc @David Espejo (he/him)
d

David Espejo (he/him)

12/12/2023, 12:13 PM
@Istiyak H. Siddiquee this is the values file I've been using to run flyte-binary on a K3s cluster: https://github.com/davidmirror-ops/flyte-the-hard-way/blob/main/docs/on-premises/local-values.yaml
hey @Istiyak H. Siddiquee how did you solve the access issue to the minio buckets? I'm having the same issue on a multi-node bare metal config with NFS