I'm trying to work around the issue in this thread...
# ask-the-community
r
I'm trying to work around the issue in this thread: https://flyte-org.slack.com/archives/CP2HDHKE1/p1685547623758749 I added
insecureSkipVerify: true
to my config.yaml, and when I try to do a
pyflyte run --remote
I get this error: Failed with Exception Code: SYSTEM:Unknown RPC Failed, with Status: StatusCode.UNAVAILABLE details: unavailable Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T10:52:01.368208-04:00", grpc_status:14, grpc_message:"unavailable"} We are accessing through an AWS ALB, and
flytectl get project
works, so I think the gRPC routing is working. I'm guessing the gRPC error message is because Python's gRPC doesn't like invalid SSL? If I'm correct, does the error message I am receiving seem in line with expectations if the SSL cert isn't verified?
d
@Rob Rati not sure, what I've seen for self-signed or invalid SSL is more commonly ``OPENSSL_internal:WRONG_VERSION_NUMBER`` but not ``status:14`` Sorry if you did it already but, could you share the anonymized content of your
config.yaml
?
r
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///<alb>:443
  authType: Pkce
  insecure: false
  #caCertFilePath: /Users/a708083/<cert>.pem
  insecureSkipVerify: true
logger:
  show-source: true
  level: 6
We aren't using a self-signed cert. However, I believe we do have our own signing CA key, which we'll need to pass to verify the cert.
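One way to separate a TLS problem from a gRPC problem is to attempt the handshake with nothing but the Python standard library. The sketch below is only an illustration under assumptions: `build_tls_context` and `probe` are hypothetical helper names, and the two keyword arguments simply mirror the `caCertFilePath` and `insecureSkipVerify` settings discussed above. It is not how flytekit builds its channel.

```python
import socket
import ssl
from typing import Optional


def build_tls_context(ca_cert_path: Optional[str] = None,
                      insecure_skip_verify: bool = False) -> ssl.SSLContext:
    """Build a client TLS context that offers HTTP/2 (what gRPC needs) via ALPN."""
    # cafile=None falls back to the system trust store.
    ctx = ssl.create_default_context(cafile=ca_cert_path)
    ctx.set_alpn_protocols(["h2"])
    if insecure_skip_verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe(host: str, port: int = 443, **context_kwargs) -> Optional[str]:
    """Perform a TLS handshake and return the ALPN protocol the server selected."""
    ctx = build_tls_context(**context_kwargs)
    with socket.create_connection((host, port), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling `probe("<alb>", ca_cert_path="/path/to/ca.pem")` raises `ssl.SSLCertVerificationError` if the chain cannot be verified, and returns `"h2"` when the handshake succeeds and the load balancer agrees to HTTP/2. If that works but the RPC still fails, the certificate is probably not the cause.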
y
talked to rob about it. the original slack thread linked in this one is caused by this. https://github.com/flyteorg/flyte/issues/3715
rob’s tested that and it’s gotten past that error, but is now timing out. though we think the timeout is a different problem
r
The timeout from the above fix results in the same error I got when I used skip. Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T12:37:02.751568-04:00", grpc_status:14, grpc_message:"unavailable"} This makes me wonder if a path is missing in the grpc ingress? WAG
y
can you tell me the version of grpcio you have?
pip show grpcio
r
Name: grpcio Version: 1.54.2
y
yeah not sure. i’ll try to come back to this later.
r
This doesn't appear to be a problem in the sandbox config, but that uses a proxy and different setup. I can look at the proxy routes and compare to the ingress, but not sure where the sandbox proxy config is defined
nm, found it. 😄
Well, scratch that idea. ingress matches the proxy routes for grpc
j
@Rob Rati: can you paste the full traceback from the most recent error? i've definitely had this working (albeit with Kong) with the existing ingress with a self-signed cert.
r
@jeev I didn't get a full traceback. All I got was: Failed with Exception Code: SYSTEM:Unknown RPC Failed, with Status: StatusCode.UNAVAILABLE details: unavailable Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T12:37:02.751568-04:00", grpc_status:14, grpc_message:"unavailable"} Is there a way to get a full bt?
j
thats so not useful 😅
UNAVAILABLE 14 The service is currently unavailable. This is most likely a transient condition, which can be corrected by retrying with a backoff. Note that it is not always safe to retry non-idempotent operations.
thanks......
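For reference, the "retry with a backoff" advice quoted above could look roughly like the sketch below. It is a generic illustration, not flytekit or grpcio code: `TransientError` is a hypothetical stand-in for an UNAVAILABLE failure, and `call_with_backoff` and its parameters are made-up names. As the quoted text warns, retrying is only safe for idempotent calls.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for an RPC that failed with UNAVAILABLE."""


def call_with_backoff(fn: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
                      attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on transient errors with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))  # full jitter up to the ceiling
    raise ValueError("attempts must be at least 1")
```

If the error persists across every attempt, as it does in this thread, the condition is not transient and the cause is more likely in routing or configuration.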
y
this is working in flytectl right?
r
The timeout is a long time though
j
are you sure its the same endpoint?
does it work if you port-forward to the service directly
r
Yes, if we do port-forward it "works", in that we get a completely different error. A sts timeout error (which is also baffling us)
y
and not in pyflyte… but we’ve pretty much stripped all the flyte stuff out of pyflyte. i think you can repro this rob just by calling the protobuf generated flyteadmin client right?
if that’s the case, then this is something deeper in the python grpc library
j
i wish i had a way to repro locally 😞
coz it works fine on a public ALB from flytectl and pyflyte
baffling that it works with flytectl, but not with pyflyte
@Rob Rati: did you try
flytectl
from the same config file?
flytectl get project --config=<PATH_TO_CONFIG>
r
Yes, flytectl works:
% flytectl get project --admin.endpoint=<alb>:443
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [storage] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [root] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:400"},"level":"debug","msg":"Config section [admin] updated. Firing updated event.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [files] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [console] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"client.go:63"},"level":"info","msg":"Initialized Admin client","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"project.go:102"},"level":"debug","msg":"Retrieved 1 projects","ts":"2023-06-01T14:48:45-04:00"}
 ------------- ------------- -------------------------
| ID          | NAME        | DESCRIPTION             |
 ------------- ------------- -------------------------
| flytesnacks | flytesnacks | flytesnacks description |
 ------------- ------------- -------------------------
1 rows
j
i mean, dont specify the
admin.endpoint
use
--config=
instead
r
% flytectl get project --config ~/.flyte/config.yaml
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [storage] updated. No update handler registered.","ts":"2023-06-01T14:49:57-04:00"}
{"json":{"src":"client.go:63"},"level":"info","msg":"Initialized Admin client","ts":"2023-06-01T14:49:57-04:00"}
{"json":{"src":"project.go:102"},"level":"debug","msg":"Retrieved 1 projects","ts":"2023-06-01T14:49:59-04:00"}
 ------------- ------------- -------------------------
| ID          | NAME        | DESCRIPTION             |
 ------------- ------------- -------------------------
| flytesnacks | flytesnacks | flytesnacks description |
 ------------- ------------- -------------------------
1 rows
j
ok
and this?
> flyte-cli -c ~/.flyte/config.yaml list-projects
DeprecationWarning: The command 'flyte-cli' is deprecated.

################################################################################################################################
# flyte-cli is being deprecated in favor of flytectl. More details about flytectl in <https://docs.flyte.org/projects/flytectl/> #
################################################################################################################################

Welcome to Flyte CLI! Version: 1.6.2

Projects Found

	flytesnacks
r
That gives me an error too
TypeError: expected certificate to be bytes, got <class 'OpenSSL.crypto.X509'>
I've got a bt, but it doesn't post well
j
what if you just comment out the cert, and use
insecureSkipVerify
?
or are you already doing that?
flyte-cli
may not have support for
insecureSkipVerify
r
Using skip it works:
% flyte-cli -c ~/.flyte/config.yaml list-projects
DeprecationWarning: The command 'flyte-cli' is deprecated.

################################################################################################################################
# flyte-cli is being deprecated in favor of flytectl. More details about flytectl in https://docs.flyte.org/projects/flytectl/ #
################################################################################################################################

Welcome to Flyte CLI! Version: 1.6.2

Projects Found

	flytesnacks
j
ok good!
this seems to suggest that the GRPC client is fine right @Yee?
@Rob Rati: can you run the pyflyte command, and paste the full anonymized output?
im starting to think its related to the STS timeout
r
pyflyte in which configuration? 😄
j
the skipped: no cert,
insecureSkipVerify: true
r
It's running. Takes a while to timeout
j
any corresponding logs in the flyte-binary pod?
r
Nothing in the logs
Just this over and over: {"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-06-01T190647Z"} {"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-06-01T190647Z"} {"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-06-01T190647Z"}
% FLYTE_SDK_LOGGING_LEVEL=1 ./flyte/bin/pyflyte run --remote cookbook/core/flyte_basics/hello_world.py my_wf 2023-06-01 150610,011657 INFO {"asctime": "2023-06-01 150610,011", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150610,016220 DEBUG {"asctime": "2023-06-01 150610,016", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.endpoint could not be found file.py:222 in yaml config"} 2023-06-01 150610,017337 DEBUG {"asctime": "2023-06-01 150610,017", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.access-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,018172 DEBUG {"asctime": "2023-06-01 150610,018", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,101644 INFO {"asctime": "2023-06-01 150610,101", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150610,103554 DEBUG {"asctime": "2023-06-01 150610,103", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.caCertFilePath could not be found in file.py:222 yaml config"} 2023-06-01 150610,104145 DEBUG {"asctime": "2023-06-01 150610,104", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.command could not be found in yaml file.py:222 config"} 2023-06-01 150610,104687 DEBUG {"asctime": "2023-06-01 150610,104", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientId could not be found in yaml file.py:222 config"} 2023-06-01 150610,105274 DEBUG {"asctime": "2023-06-01 150610,105", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientSecretLocation could not be found file.py:222 in yaml config"} 2023-06-01 150610,105780 DEBUG {"asctime": "2023-06-01 150610,105", "name": "flytekit", 
"levelname": "DEBUG", "message": "Switch admin.scopes could not be found in yaml file.py:222 config"} 2023-06-01 150610,106279 DEBUG {"asctime": "2023-06-01 150610,106", "name": "flytekit", "levelname": "DEBUG", "message": "Switch console.endpoint could not be found in yaml file.py:222 config"} 2023-06-01 150610,106778 DEBUG {"asctime": "2023-06-01 150610,106", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.httpProxyURL could not be found in yaml file.py:222 config"} 2023-06-01 150610,107287 DEBUG {"asctime": "2023-06-01 150610,107", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.endpoint could not be found file.py:222 in yaml config"} 2023-06-01 150610,107775 DEBUG {"asctime": "2023-06-01 150610,107", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.access-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,108259 DEBUG {"asctime": "2023-06-01 150610,108", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,117386 INFO {"asctime": "2023-06-01 150610,117", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150610,762872 DEBUG {"asctime": "2023-06-01 150610,762", "name": "flytekit", "levelname": "DEBUG", "message": "\t\t[2] Pushing context - execute, context_manager.py:781 branch[False], StackOrigin(load_naive_entity, 593, /Users/<user>/repos/github/flyteorg/flytesnacks/flyte/lib/python3.11/site-packages/flytekit/clis/sdk_in_container/run.py)"} 2023-06-01 150610,774167 DEBUG {"asctime": "2023-06-01 150610,774", "name": "flytekit", "levelname": "DEBUG", "message": "Task returns unnamed native tuple <class interface.py:462 'str'>"} 2023-06-01 150611,898595 DEBUG {"asctime": "2023-06-01 150611,898", "name": "flytekit", "levelname": "DEBUG", "message": "Registered 
structured_dataset.py:493 <flytekit.types.structured.basic_dfs.PandasToParquetEncodingHandler object at 0x113b33d50> as handler for <class 'pandas.core.frame.DataFrame'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,899756 DEBUG {"asctime": "2023-06-01 150611,899", "name": "flytekit", "levelname": "DEBUG", "message": "Setting format parquet for dataframes structured_dataset.py:502 of type <class 'pandas.core.frame.DataFrame'> from handler <flytekit.types.structured.basic_dfs.PandasToParquetEncodingHandler object at 0x113b33d50>"} 2023-06-01 150611,900673 DEBUG {"asctime": "2023-06-01 150611,900", "name": "flytekit", "levelname": "DEBUG", "message": "Registered structured_dataset.py:493 <flytekit.types.structured.basic_dfs.ParquetToPandasDecodingHandler object at 0x168268e90> as handler for <class 'pandas.core.frame.DataFrame'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,901409 DEBUG {"asctime": "2023-06-01 150611,901", "name": "flytekit", "levelname": "DEBUG", "message": "Registered structured_dataset.py:493 <flytekit.types.structured.basic_dfs.ArrowToParquetEncodingHandler object at 0x137e377d0> as handler for <class 'pyarrow.lib.Table'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,902102 DEBUG {"asctime": "2023-06-01 150611,902", "name": "flytekit", "levelname": "DEBUG", "message": "Setting format parquet for dataframes structured_dataset.py:502 of type <class 'pyarrow.lib.Table'> from handler <flytekit.types.structured.basic_dfs.ArrowToParquetEncodingHandler object at 0x137e377d0>"} 2023-06-01 150611,902903 DEBUG {"asctime": "2023-06-01 150611,902", "name": "flytekit", "levelname": "DEBUG", "message": "Registered structured_dataset.py:493 <flytekit.types.structured.basic_dfs.ParquetToArrowDecodingHandler object at 0x1683c5fd0> as handler for <class 'pyarrow.lib.Table'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,905907 DEBUG {"asctime": "2023-06-01 150611,905", "name": "flytekit", "levelname": "DEBUG", "message": "Task returns unnamed native 
tuple <class interface.py:462 'str'>"} 2023-06-01 150611,906764 DEBUG {"asctime": "2023-06-01 150611,906", "name": "flytekit", "levelname": "DEBUG", "message": "\t\t[2] Popping context - execute, context_manager.py:792 branch[False], StackOrigin(load_naive_entity, 593, /Users/<user>/repos/github/flyteorg/flytesnacks/flyte/lib/python3.11/site-packages/flytekit/clis/sdk_in_container/run.py)"} 2023-06-01 150611,907511 INFO {"asctime": "2023-06-01 150611,907", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150611,910548 INFO {"asctime": "2023-06-01 150611,910", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150611,913373 DEBUG {"asctime": "2023-06-01 150611,913", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.caCertFilePath could not be found in file.py:222 yaml config"} 2023-06-01 150611,914004 DEBUG {"asctime": "2023-06-01 150611,914", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.command could not be found in yaml file.py:222 config"} 2023-06-01 150611,914536 DEBUG {"asctime": "2023-06-01 150611,914", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientId could not be found in yaml file.py:222 config"} 2023-06-01 150611,915044 DEBUG {"asctime": "2023-06-01 150611,915", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientSecretLocation could not be found file.py:222 in yaml config"} 2023-06-01 150611,915626 DEBUG {"asctime": "2023-06-01 150611,915", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.scopes could not be found in yaml file.py:222 config"} 2023-06-01 150611,916293 DEBUG {"asctime": "2023-06-01 150611,916", "name": "flytekit", "levelname": "DEBUG", "message": "Switch console.endpoint could not be found in yaml file.py:222 config"} 2023-06-01 150611,916820 DEBUG {"asctime": 
"2023-06-01 150611,916", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.httpProxyURL could not be found in yaml file.py:222 config"} 2023-06-01 150611,917341 DEBUG {"asctime": "2023-06-01 150611,917", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.endpoint could not be found file.py:222 in yaml config"} 2023-06-01 150611,917847 DEBUG {"asctime": "2023-06-01 150611,917", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.access-key could not be file.py:222 found in yaml config"} 2023-06-01 150611,918340 DEBUG {"asctime": "2023-06-01 150611,918", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be file.py:222 found in yaml config"} Failed with Exception Code: SYSTEM:Unknown RPC Failed, with Status: StatusCode.UNAVAILABLE details: unavailable Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T151013.599143-04:00", grpc_status:14, grpc_message:"unavailable"}
j
😬
r
The sts error is: Failed with Exception: Reason: SYSTEM:Unknown RPC Failed, with Status: StatusCode.INTERNAL details: failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post "https://sts.us-east-2.amazonaws.com/": dial tcp 52.95.18.19:443: i/o timeout Debug string UNKNOWN:Error received from peer ipv6:%5B::1%5D:8081 {grpc_message:"failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.us-east-2.amazonaws.com/\": dial tcp 52.95.18.19:443: i/o timeout", grpc_status:13, created_time:"2023-06-01T11:33:12.674371-04:00"}
grpc_status 13, vs 14 through the ingress.
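To make the 13-versus-14 comparison easier to read, the numeric code can be pulled out of the debug string and mapped to its canonical gRPC name. This is a small sketch; `parse_grpc_status` is a hypothetical helper, and the regular expression assumes the `grpc_status:<n>` form seen in the messages above.

```python
import re
from typing import Optional

# Canonical gRPC status codes, as defined by the gRPC specification.
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED", 9: "FAILED_PRECONDITION",
    10: "ABORTED", 11: "OUT_OF_RANGE", 12: "UNIMPLEMENTED", 13: "INTERNAL",
    14: "UNAVAILABLE", 15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}


def parse_grpc_status(debug_string: str) -> Optional[str]:
    """Return the status name found in a gRPC debug string, or None if absent."""
    match = re.search(r"grpc_status:\s*(\d+)", debug_string)
    if match is None:
        return None
    code = int(match.group(1))
    return GRPC_STATUS_NAMES.get(code, f"UNRECOGNIZED({code})")
```

Status 13 (INTERNAL) here carries a message written by the server, so the request reached flyteadmin. Status 14 (UNAVAILABLE) with the bare message "unavailable" gives no such evidence.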
j
i think we should dig into the STS issue. looks like flyte-binary is missing some configuration
is the iam role binding set up correctly?
m
I’ve deployed a parallel pod with the AWS client, using the same service account. I am able to create a pre-signed S3 URL with it
r
Quite possibly not. We are running the pod with a SA setup with IRSA. In the
flyte-binary-config
secret we have
010-inline-config.yaml: |
  cluster_resources:
    customData:
    - production:
      - defaultIamRole:
          value: <iam_role>
    - staging:
      - defaultIamRole:
          value: <iam_role>
    - development:
      - defaultIamRole:
          value: <iam_role>
j
in the same namespace as flyte-binary?
m
Yes
j
perhaps worth execing into the flyte-binary pod and installing awscli v2 and trying to run “aws sts get-caller-identity”
m
I’d have to create a new image to do that. Can’t just install it directly, but running the command in the other pod is successful
r
We run with readOnlyRootFS so this is a little challenging. We should be able to re-spin the pod with an emptydir volume mount, and then d/l the awscli tarball and extract there.
j
this is nuts, but have we tried killing the flyte-binary pod 😬
r
Yep
Through various config map changes we've killed the pod and had it restart
j
ok so to recap: flyte-binary still can't get credentials. that is affecting its ability to generate signed URLs. the same KSA bound to an awscli pod works as expected.
can you paste the flyte-binary pod spec:
kubectl get pod flyte-binary-... -o yaml
r
Yes
j
you might need to anonymize some of the envvars
r
I think the awscli test will take a bit. Looks like the container doesn't have python, so d/ling awscli tarball won't work. We would have to spin a container with python to do it.
j
thats ok. the fact that the other pod works is probably good enough evidence that IRSA is working as intended.
m
Ok
j
@Mike Morgan for the sake of my sanity, can you post the anonymized awscli pod spec too along with the flyte-binary pod spec? 🙂
m
Yes. Here you are.
Few minutes for the other one please
And here is the one with working awscli
I believe the error is coming from the pod itself, but there are no logs or traces on this. Really just a bunch of gorm tracing
Is there a way to get more out of that log?
j
i’m gonna try and reproduce from your pod spec. i’ll report back tomorrow :)
m
Thank you very much!! I appreciate it
j
i dont have an update yet unfortunately 😅. will hopefully have one later in the afternoon
r
Anything you would like us to try?
j
@Rob Rati yes: 2 things to try: make sure the awscli test pod is running on the same node as flyte-binary. make sure the awscli test pod also has readonlyrootfs enabled.
m
Thanks @jeev. Second item is correct since that is all we allow. I will check on the first point
j
@Mike Morgan it was set to false in the pod spec above for the airflow utils test pod.
m
Oh never mind me then. Let me see
j
as for the first point, maybe we can use a node selector to target the test pod to the same node as flyte-binary.
r
Or pod affinity
m
So, quick update on testing: changing the file system to read-only broke the working part, but changing the file system to not be read-only didn’t fix the flyte pod
I think next steps might be creating a flyte Docker image with the AWS CLI on it, to make it easier to test
Please let me know if you have any other pointers for me. Thank you very much.
j
that sounds like a good plan @Mike Morgan. did you also test by placing the test pod on the same node as the flyte-binary pod?
m
I haven't run that test yet. I will do that this morning
Co-locating the pods didn’t make a difference. I am creating a new image with flyte and the AWS CLI now and will report back
r
I have a theory on this. I think the STS call needs to leave the AWS network in order to resolve/function properly (or at least leave the network we access in EKS). This is different from, say, RDS or S3, where access appears to stay within the AWS/EKS network. In our environment, you can't reach the outside world without setting a proxy configuration. However, when we set that proxy information, propeller won't start because it can't find a service at port 443 that it is looking for (we just get an IP). To test out this theory I need to know what service name(s) propeller tries to contact. Any idea?
j
afaik, if you dont have any web-based plugins enabled, it should just talk to the k8s api
r
That's it!
It's looking up the kubernetes cluster service
Hrm, it appears that propeller doesn't actually do a lookup of the kubernetes service, but assumes a specific ip
We can get around that by excluding the specific IP, but I think propeller should probably use the service name
So, excluding the IP of the kubernetes service I think fixed it. I can get much further with my pyflyte run.
It broke the UI though. :(
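A rough sketch of the "exclude the IP" workaround follows. It rests on assumptions: that the proxy is configured through the conventional `HTTPS_PROXY`/`NO_PROXY` environment variables (the thread does not say how it was set), and that the API server address is the one Kubernetes injects into every pod as `KUBERNETES_SERVICE_HOST`. `no_proxy_with_k8s_api` is a hypothetical helper name.

```python
from typing import Iterable, List, Mapping


def no_proxy_with_k8s_api(env: Mapping[str, str],
                          extra_hosts: Iterable[str] = ()) -> str:
    """Return a NO_PROXY value that also bypasses the proxy for the Kubernetes API."""
    # Start from whatever is already excluded, dropping blanks.
    entries: List[str] = [
        item.strip() for item in env.get("NO_PROXY", "").split(",") if item.strip()
    ]
    # In-cluster clients usually dial the address in KUBERNETES_SERVICE_HOST directly,
    # so a name-based exclusion alone would not match it.
    candidates = [env.get("KUBERNETES_SERVICE_HOST", ""), *extra_hosts]
    for host in candidates:
        if host and host not in entries:
            entries.append(host)
    return ",".join(entries)
```

Usage inside a pod would be something like `os.environ["NO_PROXY"] = no_proxy_with_k8s_api(os.environ)`. Any other in-cluster names the deployment needs to reach without the proxy can be passed through `extra_hosts`.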
j
hmm that's odd. i think it hits the api server at
kubernetes.default
?
its just using the incluster kubeconfig basically
do you need a VPC endpoint for STS @Rob Rati?
r
Maybe. Can flyte be configured to use that endpoint though? AWS commands by default go to sts.amazonaws.com, so we'd need to be able to configure flyte to use a custom endpoint for STS
j
flyte doesnt talk to STS directly at all. should be just through the AWS SDK. that should just hit sts.amazonaws.com i believe. and the networking will take care of routing it to the internal VPC endpoint
that will allow you to hit the STS endpoint without egress
but how did the test pod work?
r
We were able to reproduce the issue with a custom image that included awscli. From there we were able to figure out the proxy issue
j
ah ok awesome
in that case, i think a STS VPC endpoint will work for you
whats the issue with the UI now?
r
If we have the proxy set, the UI is messed up. I assume we just need to figure out proper additions to ignore proxy. I think the vpc endpoint might be the best path
Setting the vpc endpoint got us past the sts issue. Now we are getting a 403 when trying to u/l data to the s3 bucket. Is the client or admin service attempting to do the s3 data upload?
j
yes. that’s how fast register / pyflyte run work. it needs to upload the source code to s3
r
Makes sense. So, admin service generates a pre-signed s3 url and passes it to pyflyte, and pyflyte does the data upload?
The url comes back with bucket/<project>/<env> path, so who creates the project and env part? I assume pyflyte?
j
it should be admin.
that’s the flyte project and domain that you are registering the wf to
r
Yep
j
you should be able to set that in pyflyte though
r
atm we just have a bucket. Just wanted to make sure I understood who is doing what. I'm pretty sure our IAM access is too restrictive atm. We're opening it up
j
since we’re using signed URLs for registration , only admin needs perms on s3. tasks will need perms as well since they don’t use signed URLs.
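The client half of that flow, uploading to a URL that admin has already signed, can be sketched with the standard library alone. This is an illustration, not flytekit's implementation: `build_upload_request` is a hypothetical name, and sending a `Content-MD5` header is an assumption about what the signer expects. With signed URLs, any header that was part of the signature has to be sent exactly as signed.

```python
import base64
import hashlib
import urllib.request


def build_upload_request(signed_url: str, payload: bytes) -> urllib.request.Request:
    """Prepare an HTTP PUT of payload to a pre-signed URL (nothing is sent yet)."""
    # Content-MD5 is the base64 of the raw digest, not of the hex string.
    md5_b64 = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
    return urllib.request.Request(
        signed_url,
        data=payload,
        method="PUT",
        headers={"Content-MD5": md5_b64, "Content-Length": str(len(payload))},
    )
```

Passing the result to `urllib.request.urlopen` performs the upload. A 403 from that call means S3 rejected the signed request, which points at the permissions of whoever signed the URL or at a mismatch between what was signed and what was sent, rather than at the client's own credentials.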
r
Tasks are pods launched by propeller, right? I saw in the docs we can use PodTemplates to define defaults for things like that, so we should be able to assign the pod to use a KSA with proper IAM access.
j
I think the KSA will be overridden if set in the pod template. it should be set in admin (I think) or specified during registration (attached to launch plan).
m
One thing that would be helpful is if we can get more logs from the admin service. I see debug messages in the code that would be nice to see. I see a lot of other debug messages but not the admin service
j
the log level applies everywhere. the issue is probably that propeller is too verbose.
r
Even loosening the IAM restrictions we still get 403. We also don't see any dirs being created in the s3 bucket.
This is our IAM policy governing s3:
{
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::<unique_prefix>*",
                "arn:aws:s3:::<unique_prefix>*/*"
            ],
            "Effect": "Allow",
            "Sid": "AllowS3BucketCRUD"
        },
j
can you paste the error?
the IAM policy looks reasonable
it also takes a few mins to propagate, so maybe can just retry in a bit.
r
warnings.warn(
Failed with Exception Code: USER:ValueError
Value error!  Received: 403. Request to send data https://<bucket>.s3.us-east-2.amazonaws.com/flytesnacks/development/... failed
We updated the iam role a while ago and just tried now and got the same error
j
so you get a signed url back but using the signed url results in a 403?
r
I guess? I'm running: FLYTE_SDK_LOGGING_LEVEL=1 ./flyte/bin/pyflyte run --remote cookbook/core/flyte_basics/hello_world.py my_wf So, just trying to run an example. The error is coming from pyflyte.
m
Where is the configuration for pre-signing, such as duration etc.? Can’t find that anywhere
@Rob Rati: pyflyte will throw a 403 if the entity that generated the signed url doesnt have permissions
r
So then it sounds like we're back to an issue with the IAM role?
m
We are able to create a pre-signed url to a specific file that works on the same flyte pod.
r
Interesting. We did this for something else and it seemed like it worked. The docs don't show any examples, but wording implies you can wildcard any segment of an arn. We'll try removing the bucket name wildcards and see if that helps.
j
sorry you are right @Rob Rati. you should be able to wildcard a bucket name
if the pre-signed url works, you should be good 🤔
r
When we try to hit that url it complains (before it times out): <Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message>
j
hmm
is it possible its another gotcha with your infra setup? 😅
r
Highly probable
Question is where to look for it
m
@jeev the error message @Rob Rati shared above is why I was looking for flyte signing Config
r
For anyone else hitting this, we use KMS keys to encrypt data in our S3 buckets. We needed to add kms:GenerateDataKey* to our IAM role
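As a sketch of what that extra permission might look like as an IAM statement, generated here from Python so it can be pasted into a policy document. The `Sid`, the helper name `kms_statement`, and the key ARN in the usage note are placeholders, not values from this thread. Only `kms:GenerateDataKey*` comes from the message above; reading encrypted objects back generally also needs `kms:Decrypt`, which can be passed through `actions`.

```python
import json
from typing import Dict, List, Sequence, Union

Statement = Dict[str, Union[str, List[str]]]


def kms_statement(key_arn: str,
                  actions: Sequence[str] = ("kms:GenerateDataKey*",)) -> Statement:
    """Build an IAM Allow statement for the KMS key that encrypts the bucket."""
    return {
        "Sid": "AllowKmsForS3Uploads",  # placeholder Sid
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": [key_arn],
    }
```

For example, `print(json.dumps(kms_statement("arn:aws:kms:us-east-2:<account_id>:key/<key_id>"), indent=4))` prints a block in the same shape as the S3 statement shown earlier.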
j
nice find. all resolved then?
r
Well, we are at the next hurdle. 🙂 We got a workload submitted and it moves to the running state, but doesn't do anything. I suspect this is because we have requirements for workloads to be admitted to our kubernetes cluster, but I don't see any logs indicating a rejection by an admission controller (which is what we would get locally). Maybe our log level isn't verbose enough?
j
try with log level 5
r
Bumping the logging level got us to find this error: {"json":{"exec_id":"fb8d670837f954478b70","ns":"flytesnacks-development","res_ver":"641457952","routine":"worker-1","src":"admin_eventsink.go:44","wf":"flytesnacksdevelopmentcore.flyte_basics.hello_world.my_wf"},"level":"debug","msg":"AdminEventSink received a new event execution_id\u003cproject\"flytesnacks\" domain:\"development\" name:\"fb8d670837f954478b70\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at\u003cseconds1686232511 nanos:640124539 \u003e error\u003ccode\"Workflow abort failed\" message:\"Workflow[flytesnacksdevelopmentcore.flyte_basics.hello_world.my_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/flytesnacks-development-fb8d670837f954478b70/start-node/data/0/outputs.pb].: PutObject, putting object: MissingEndpoint: 'Endpoint' configuration is required for this service\" kind:SYSTEM \u003e ","ts":"2023-06-08T135511Z"}
What endpoint is it talking about and where? VPC endpoint?
j
hmm. that sounds like the s3 endpoint. but shouldn’t need to specify that. unless it thinks it’s using minio or something.
r
This is our storage config. We pointed it at stow:
003-storage.yaml: "propeller:\n  rawoutput-prefix: s3://<bucket>/data\nstorage:\n
    \ type: stow\n  stow:\n    kind: s3\n    config:\n      region: us-east-2\n      disable_ssl:
    true \n      v2_signing: \n      auth_type: iam\n  container: <bucket>\n"
Is this what would impact that?
I notice we have auth_type as iam, but nothing in that config specifying the iam role.
j
it should use IRSA
are y’all using the chart?
r
Yes and no. 😄 While we are debugging what we need to get deployed, we are generating the deployment yaml from the charts and deploying that, making mods to the yaml if needed. ATM we have to do this because the chart wants to create rbac entities, and in our cluster we have to do that a special way. We can't create normal rbac objects directly
j
got it.
maybe render the chart with just region set, and look at the storage config
r
We'll give that a try and report back
No change by regenerating the chart from the latest mainline. In my research, this error seems to be related to an aws service and likely a misconfig. Is there an aws service other than S3 that is involved in starting a job? Do you know what actions/permissions are needed? I'm guessing we have another iam action missing.
j
the error is just about propeller failing to write to what it thinks is the right object storage. can you paste the new storage config?
r
003-storage.yaml: "propeller:\n  rawoutput-prefix: s3://<bucket>/data\nstorage:\n
    \ type: stow\n  stow:\n    kind: s3\n    config:\n      region: us-east-2\n      disable_ssl:
    false \n      v2_signing: false\n      auth_type: iam\n  container: <bucket>\n"
We think we see files in the proper location on s3. What is propeller trying to write? Maybe we can verify if those files exist
m
The perplexing thing is we can see that file in s3 at that location
r
I'm a bit confused about the storage config. In the CM, there is a 003-storage.yaml which maps to this config struct: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/config/config.go#L120
However, it looks like propeller is using this config: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/controller.go#L411
Which maps to this config struct: https://github.com/flyteorg/flytestdlib/blob/master/storage/config.go#L47
Which seems to be consuming this stanza in the CM:
storage:
  cache:
    max_size_mbs: 10
    target_gc_percent: 100
Do we need to configure the S3 options in the storage stanza as well? Is flytestdlib somehow using the propeller config if not defined?
j
they should get merged