#ask-the-community
r

Rob Rati

06/01/2023, 3:04 PM
I'm trying to work around the issue in this thread: https://flyte-org.slack.com/archives/CP2HDHKE1/p1685547623758749 I added
insecureSkipVerify: true
to my config.yaml, and when I try to do a
pyflyte run --remote
I get this error: Failed with Exception Code: SYSTEM:Unknown RPC Failed, with Status: StatusCode.UNAVAILABLE details: unavailable Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T10:52:01.368208-04:00", grpc_status:14, grpc_message:"unavailable"} We are accessing through an AWS ALB, and
flytectl get project
works, so I think the gRPC routing is working. I'm guessing the gRPC error message is because Python's grpc doesn't like invalid SSL? If I'm correct, does the error message I am receiving seem in line with expectations if the SSL cert isn't verified?
d

David Espejo (he/him)

06/01/2023, 3:37 PM
@Rob Rati not sure, what I've seen for self-signed or invalid SSL is more commonly ``OPENSSL_internal:WRONG_VERSION_NUMBER`` but not ``status:14``. Sorry if you did this already, but could you share the anonymized content of your
config.yaml
?
r

Rob Rati

06/01/2023, 3:40 PM
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///<alb>:443
  authType: Pkce
  insecure: false
  #caCertFilePath: /Users/a708083/<cert>.pem
  insecureSkipVerify: true
logger:
  show-source: true
  level: 6
We aren't using a self-signed cert. However, I believe we do have our own signing CA key, which we'll need to pass to verify the cert.
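One way to separate the certificate question from the gRPC routing question is a bare TLS handshake against the ALB, outside of flytekit and grpcio. The following is a minimal sketch using only the Python standard library; `build_tls_context` and `probe` are hypothetical helper names, and mapping `skip_verify` to `insecureSkipVerify` and `ca_cert_file` to `caCertFilePath` is an assumption about intent, not a description of how flytekit wires these options.

```python
import socket
import ssl
from typing import Optional


def build_tls_context(ca_cert_file: Optional[str] = None,
                      skip_verify: bool = False) -> ssl.SSLContext:
    """Build a client TLS context mirroring the two config options in the thread."""
    # With ca_cert_file=None, create_default_context trusts the system CA store.
    ctx = ssl.create_default_context(cafile=ca_cert_file)
    if skip_verify:
        # Rough equivalent of insecureSkipVerify: true (assumption).
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    # gRPC runs over HTTP/2, so offer "h2" via ALPN as a gRPC client would.
    ctx.set_alpn_protocols(["h2"])
    return ctx


def probe(host: str, port: int = 443, **kwargs) -> str:
    """Do one TLS handshake and report the negotiated ALPN protocol."""
    ctx = build_tls_context(**kwargs)
    with socket.create_connection((host, port), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol() or "none"
```

Calling `probe("<alb>")` with the default arguments should raise `ssl.SSLCertVerificationError` if the certificate chain is not trusted by the system store, while `probe("<alb>", ca_cert_file="/path/to/ca.pem")` checks against a private CA. If the handshake succeeds and returns `h2`, the certificate is probably not what produces the `UNAVAILABLE` status.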
y

Yee

06/01/2023, 4:35 PM
talked to rob about it. the original slack thread linked in this one is caused by this. https://github.com/flyteorg/flyte/issues/3715
rob’s tested that and it’s gotten past that error, but is now timing out. though we think the timeout is a different problem
r

Rob Rati

06/01/2023, 4:38 PM
The timeout from the above fix results in the same error I got when I used skip. Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T12:37:02.751568-04:00", grpc_status:14, grpc_message:"unavailable"} This makes me wonder if a path is missing in the grpc ingress? WAG
y

Yee

06/01/2023, 4:57 PM
can you tell me the version of grpcio you have?
pip show grpcio
r

Rob Rati

06/01/2023, 4:57 PM
Name: grpcio Version: 1.54.2
y

Yee

06/01/2023, 5:02 PM
yeah not sure. i’ll try to come back to this later.
r

Rob Rati

06/01/2023, 5:05 PM
This doesn't appear to be a problem in the sandbox config, but that uses a proxy and different setup. I can look at the proxy routes and compare to the ingress, but not sure where the sandbox proxy config is defined
nm, found it. 😄
Well, scratch that idea. ingress matches the proxy routes for grpc
j

jeev

06/01/2023, 6:03 PM
@Rob Rati: can you paste the full traceback from the most recent error? i've definitely had this working (albeit with Kong) with the existing ingress with a self-signed cert.
r

Rob Rati

06/01/2023, 6:38 PM
@jeev I didn't get a full traceback. All I got was: Failed with Exception Code: SYSTEM:Unknown RPC Failed, with Status: StatusCode.UNAVAILABLE details: unavailable Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T12:37:02.751568-04:00", grpc_status:14, grpc_message:"unavailable"} Is there a way to get a full bt?
j

jeev

06/01/2023, 6:40 PM
thats so not useful 😅
UNAVAILABLE 14 The service is currently unavailable. This is most likely a transient condition, which can be corrected by retrying with a backoff. Note that it is not always safe to retry non-idempotent operations.
thanks......
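For reference, the "retrying with a backoff" advice in the status description can be sketched as follows. This is a generic illustration rather than anything flytekit or grpcio does internally; `retry_with_backoff` is a hypothetical helper, and in this thread the failure was persistent, so retrying alone would not have helped.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T],
                       retryable: Tuple[Type[BaseException], ...],
                       attempts: int = 5,
                       base_delay: float = 0.5,
                       max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

As the description itself warns, this is only safe to wrap around idempotent calls such as a project listing.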
y

Yee

06/01/2023, 6:41 PM
this is working in flytectl right?
r

Rob Rati

06/01/2023, 6:41 PM
The timeout is a long time though
j

jeev

06/01/2023, 6:41 PM
are you sure its the same endpoint?
does it work if you port-forward to the service directly
r

Rob Rati

06/01/2023, 6:42 PM
Yes, if we do port-forward it "works", in that we get a completely different error. A sts timeout error (which is also baffling us)
y

Yee

06/01/2023, 6:42 PM
and not in pyflyte… but we’ve pretty much stripped all the flyte stuff out of pyflyte. i think you can repro this rob just by calling the protobuf generated flyteadmin client right?
if that’s the case, then this is something deeper in the python grpc library
j

jeev

06/01/2023, 6:45 PM
i wish i had a way to repro locally 😞
coz it works fine on a public ALB from flytectl and pyflyte
baffling that it works with flytectl, but not with pyflyte
@Rob Rati: did you try
flytectl
from the same config file?
flytectl get project --config=<PATH_TO_CONFIG>
r

Rob Rati

06/01/2023, 6:49 PM
Yes, flytectl works:
% flytectl get project --admin.endpoint=<alb>:443
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [storage] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [root] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:400"},"level":"debug","msg":"Config section [admin] updated. Firing updated event.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [files] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [console] updated. No update handler registered.","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"client.go:63"},"level":"info","msg":"Initialized Admin client","ts":"2023-06-01T14:48:44-04:00"}
{"json":{"src":"project.go:102"},"level":"debug","msg":"Retrieved 1 projects","ts":"2023-06-01T14:48:45-04:00"}
 ------------- ------------- -------------------------
| ID          | NAME        | DESCRIPTION             |
 ------------- ------------- -------------------------
| flytesnacks | flytesnacks | flytesnacks description |
 ------------- ------------- -------------------------
1 rows
j

jeev

06/01/2023, 6:49 PM
i mean, dont specify the
admin.endpoint
use
--config=
instead
r

Rob Rati

06/01/2023, 6:50 PM
% flytectl get project --config ~/.flyte/config.yaml
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [storage] updated. No update handler registered.","ts":"2023-06-01T14:49:57-04:00"}
{"json":{"src":"client.go:63"},"level":"info","msg":"Initialized Admin client","ts":"2023-06-01T14:49:57-04:00"}
{"json":{"src":"project.go:102"},"level":"debug","msg":"Retrieved 1 projects","ts":"2023-06-01T14:49:59-04:00"}
 ------------- ------------- -------------------------
| ID          | NAME        | DESCRIPTION             |
 ------------- ------------- -------------------------
| flytesnacks | flytesnacks | flytesnacks description |
 ------------- ------------- -------------------------
1 rows
j

jeev

06/01/2023, 6:51 PM
ok
and this?
> flyte-cli -c ~/.flyte/config.yaml list-projects
DeprecationWarning: The command 'flyte-cli' is deprecated.

################################################################################################################################
# flyte-cli is being deprecated in favor of flytectl. More details about flytectl in <https://docs.flyte.org/projects/flytectl/> #
################################################################################################################################

Welcome to Flyte CLI! Version: 1.6.2

Projects Found

	flytesnacks
r

Rob Rati

06/01/2023, 6:56 PM
That gives me an error too
TypeError: expected certificate to be bytes, got <class 'OpenSSL.crypto.X509'>
I've got a bt, but it doesn't post well
j

jeev

06/01/2023, 6:57 PM
what if you just comment out the cert, and use
insecureSkipVerify
?
or are you already doing that?
flyte-cli
may not have support for
insecureSkipVerify
r

Rob Rati

06/01/2023, 6:58 PM
Using skip it works:
% flyte-cli -c ~/.flyte/config.yaml list-projects
DeprecationWarning: The command 'flyte-cli' is deprecated.

################################################################################################################################
# flyte-cli is being deprecated in favor of flytectl. More details about flytectl in https://docs.flyte.org/projects/flytectl/ #
################################################################################################################################

Welcome to Flyte CLI! Version: 1.6.2

Projects Found

	flytesnacks
j

jeev

06/01/2023, 6:58 PM
ok good!
this seems to suggest that the GRPC client is fine right @Yee?
@Rob Rati: can you run the pyflyte command, and paste the full anonymized output?
im starting to think its related to the STS timeout
r

Rob Rati

06/01/2023, 7:00 PM
pyflyte in which configuration? 😄
j

jeev

06/01/2023, 7:02 PM
the skipped: no cert,
insecureSkipVerify: true
r

Rob Rati

06/01/2023, 7:03 PM
It's running. Takes a while to timeout
j

jeev

06/01/2023, 7:04 PM
any corresponding logs in the flyte-binary pod?
r

Rob Rati

06/01/2023, 7:06 PM
Nothing in the logs
Just this over and over:
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-06-01T19:06:47Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-06-01T19:06:47Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-06-01T19:06:47Z"}
% FLYTE_SDK_LOGGING_LEVEL=1 ./flyte/bin/pyflyte run --remote cookbook/core/flyte_basics/hello_world.py my_wf 2023-06-01 150610,011657 INFO {"asctime": "2023-06-01 150610,011", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150610,016220 DEBUG {"asctime": "2023-06-01 150610,016", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.endpoint could not be found file.py:222 in yaml config"} 2023-06-01 150610,017337 DEBUG {"asctime": "2023-06-01 150610,017", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.access-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,018172 DEBUG {"asctime": "2023-06-01 150610,018", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,101644 INFO {"asctime": "2023-06-01 150610,101", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150610,103554 DEBUG {"asctime": "2023-06-01 150610,103", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.caCertFilePath could not be found in file.py:222 yaml config"} 2023-06-01 150610,104145 DEBUG {"asctime": "2023-06-01 150610,104", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.command could not be found in yaml file.py:222 config"} 2023-06-01 150610,104687 DEBUG {"asctime": "2023-06-01 150610,104", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientId could not be found in yaml file.py:222 config"} 2023-06-01 150610,105274 DEBUG {"asctime": "2023-06-01 150610,105", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientSecretLocation could not be found file.py:222 in yaml config"} 2023-06-01 150610,105780 DEBUG {"asctime": "2023-06-01 150610,105", "name": "flytekit", 
"levelname": "DEBUG", "message": "Switch admin.scopes could not be found in yaml file.py:222 config"} 2023-06-01 150610,106279 DEBUG {"asctime": "2023-06-01 150610,106", "name": "flytekit", "levelname": "DEBUG", "message": "Switch console.endpoint could not be found in yaml file.py:222 config"} 2023-06-01 150610,106778 DEBUG {"asctime": "2023-06-01 150610,106", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.httpProxyURL could not be found in yaml file.py:222 config"} 2023-06-01 150610,107287 DEBUG {"asctime": "2023-06-01 150610,107", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.endpoint could not be found file.py:222 in yaml config"} 2023-06-01 150610,107775 DEBUG {"asctime": "2023-06-01 150610,107", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.access-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,108259 DEBUG {"asctime": "2023-06-01 150610,108", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be file.py:222 found in yaml config"} 2023-06-01 150610,117386 INFO {"asctime": "2023-06-01 150610,117", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150610,762872 DEBUG {"asctime": "2023-06-01 150610,762", "name": "flytekit", "levelname": "DEBUG", "message": "\t\t[2] Pushing context - execute, context_manager.py:781 branch[False], StackOrigin(load_naive_entity, 593, /Users/<user>/repos/github/flyteorg/flytesnacks/flyte/lib/python3.11/site-packages/flytekit/clis/sdk_in_container/run.py)"} 2023-06-01 150610,774167 DEBUG {"asctime": "2023-06-01 150610,774", "name": "flytekit", "levelname": "DEBUG", "message": "Task returns unnamed native tuple <class interface.py:462 'str'>"} 2023-06-01 150611,898595 DEBUG {"asctime": "2023-06-01 150611,898", "name": "flytekit", "levelname": "DEBUG", "message": "Registered 
structured_dataset.py:493 <flytekit.types.structured.basic_dfs.PandasToParquetEncodingHandler object at 0x113b33d50> as handler for <class 'pandas.core.frame.DataFrame'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,899756 DEBUG {"asctime": "2023-06-01 150611,899", "name": "flytekit", "levelname": "DEBUG", "message": "Setting format parquet for dataframes structured_dataset.py:502 of type <class 'pandas.core.frame.DataFrame'> from handler <flytekit.types.structured.basic_dfs.PandasToParquetEncodingHandler object at 0x113b33d50>"} 2023-06-01 150611,900673 DEBUG {"asctime": "2023-06-01 150611,900", "name": "flytekit", "levelname": "DEBUG", "message": "Registered structured_dataset.py:493 <flytekit.types.structured.basic_dfs.ParquetToPandasDecodingHandler object at 0x168268e90> as handler for <class 'pandas.core.frame.DataFrame'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,901409 DEBUG {"asctime": "2023-06-01 150611,901", "name": "flytekit", "levelname": "DEBUG", "message": "Registered structured_dataset.py:493 <flytekit.types.structured.basic_dfs.ArrowToParquetEncodingHandler object at 0x137e377d0> as handler for <class 'pyarrow.lib.Table'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,902102 DEBUG {"asctime": "2023-06-01 150611,902", "name": "flytekit", "levelname": "DEBUG", "message": "Setting format parquet for dataframes structured_dataset.py:502 of type <class 'pyarrow.lib.Table'> from handler <flytekit.types.structured.basic_dfs.ArrowToParquetEncodingHandler object at 0x137e377d0>"} 2023-06-01 150611,902903 DEBUG {"asctime": "2023-06-01 150611,902", "name": "flytekit", "levelname": "DEBUG", "message": "Registered structured_dataset.py:493 <flytekit.types.structured.basic_dfs.ParquetToArrowDecodingHandler object at 0x1683c5fd0> as handler for <class 'pyarrow.lib.Table'>, protocol fsspec, fmt parquet"} 2023-06-01 150611,905907 DEBUG {"asctime": "2023-06-01 150611,905", "name": "flytekit", "levelname": "DEBUG", "message": "Task returns unnamed native 
tuple <class interface.py:462 'str'>"} 2023-06-01 150611,906764 DEBUG {"asctime": "2023-06-01 150611,906", "name": "flytekit", "levelname": "DEBUG", "message": "\t\t[2] Popping context - execute, context_manager.py:792 branch[False], StackOrigin(load_naive_entity, 593, /Users/<user>/repos/github/flyteorg/flytesnacks/flyte/lib/python3.11/site-packages/flytekit/clis/sdk_in_container/run.py)"} 2023-06-01 150611,907511 INFO {"asctime": "2023-06-01 150611,907", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150611,910548 INFO {"asctime": "2023-06-01 150611,910", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config file.py:272 /Users/<user>/.flyte/config.yaml"} 2023-06-01 150611,913373 DEBUG {"asctime": "2023-06-01 150611,913", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.caCertFilePath could not be found in file.py:222 yaml config"} 2023-06-01 150611,914004 DEBUG {"asctime": "2023-06-01 150611,914", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.command could not be found in yaml file.py:222 config"} 2023-06-01 150611,914536 DEBUG {"asctime": "2023-06-01 150611,914", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientId could not be found in yaml file.py:222 config"} 2023-06-01 150611,915044 DEBUG {"asctime": "2023-06-01 150611,915", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.clientSecretLocation could not be found file.py:222 in yaml config"} 2023-06-01 150611,915626 DEBUG {"asctime": "2023-06-01 150611,915", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.scopes could not be found in yaml file.py:222 config"} 2023-06-01 150611,916293 DEBUG {"asctime": "2023-06-01 150611,916", "name": "flytekit", "levelname": "DEBUG", "message": "Switch console.endpoint could not be found in yaml file.py:222 config"} 2023-06-01 150611,916820 DEBUG {"asctime": 
"2023-06-01 150611,916", "name": "flytekit", "levelname": "DEBUG", "message": "Switch admin.httpProxyURL could not be found in yaml file.py:222 config"} 2023-06-01 150611,917341 DEBUG {"asctime": "2023-06-01 150611,917", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.endpoint could not be found file.py:222 in yaml config"} 2023-06-01 150611,917847 DEBUG {"asctime": "2023-06-01 150611,917", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.access-key could not be file.py:222 found in yaml config"} 2023-06-01 150611,918340 DEBUG {"asctime": "2023-06-01 150611,918", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be file.py:222 found in yaml config"} Failed with Exception Code: SYSTEM:Unknown RPC Failed, with Status: StatusCode.UNAVAILABLE details: unavailable Debug string UNKNOWN:Error received from peer {created_time:"2023-06-01T151013.599143-04:00", grpc_status:14, grpc_message:"unavailable"}
j

jeev

06/01/2023, 7:23 PM
😬
r

Rob Rati

06/01/2023, 7:24 PM
The sts error is: Failed with Exception: Reason: SYSTEM:Unknown RPC Failed, with Status: StatusCode.INTERNAL details: failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post "https://sts.us-east-2.amazonaws.com/": dial tcp 52.95.18.19:443: i/o timeout Debug string UNKNOWN:Error received from peer ipv6:%5B::1%5D:8081 {grpc_message:"failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.us-east-2.amazonaws.com/\": dial tcp 52.95.18.19:443: i/o timeout", grpc_status:13, created_time:"2023-06-01T11:33:12.674371-04:00"}
grpc_status 13, vs 14 through the ingress.
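The two numbers correspond to the canonical gRPC status codes INTERNAL (13) and UNAVAILABLE (14). When the error only surfaces as a debug string, the code can be pulled out with a small helper such as this sketch; `grpc_status_from_debug_string` is a hypothetical name, and the lookup table deliberately covers only the two codes seen in this thread.

```python
import re
from typing import Optional

# Subset of the canonical gRPC status codes that appear in this thread.
GRPC_STATUS_NAMES = {
    13: "INTERNAL",
    14: "UNAVAILABLE",
}

_STATUS_RE = re.compile(r"grpc_status:(\d+)")


def grpc_status_from_debug_string(debug: str) -> Optional[str]:
    """Extract the status name from a gRPC debug string, if one is present."""
    match = _STATUS_RE.search(debug)
    if match is None:
        return None
    code = int(match.group(1))
    # Fall back to the raw number for codes outside the table.
    return GRPC_STATUS_NAMES.get(code, f"code {code}")
```

The distinction matters for triage here: with port-forwarding the request reaches flyteadmin and fails inside the server (INTERNAL), while through the ingress the client reports the service as UNAVAILABLE.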
j

jeev

06/01/2023, 8:33 PM
i think we should dig into the STS issue. looks like flyte-binary is missing some configuration
is the iam role binding set up correctly?
m

Mike Morgan

06/01/2023, 8:38 PM
I’ve deployed a parallel pod with the AWS client, using the same service account. I am able to create a pre-signed S3 URL with it
r

Rob Rati

06/01/2023, 8:38 PM
Quite possibly not. We are running the pod with a SA setup with IRSA. In the
flyte-binary-config
secret we have:
010-inline-config.yaml: |
  cluster_resources:
    customData:
    - production:
      - defaultIamRole:
          value: <iam_role>
    - staging:
      - defaultIamRole:
          value: <iam_role>
    - development:
      - defaultIamRole:
          value: <iam_role>
j

jeev

06/01/2023, 8:39 PM
in the same namespace as flyte-binary?
m

Mike Morgan

06/01/2023, 8:39 PM
Yes
j

jeev

06/01/2023, 8:40 PM
perhaps worth execing into the flyte-binary pod and installing awscli v2 and trying to run “aws sts get-caller-identity”
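Since the reported failure is a `dial tcp ... i/o timeout`, a plain TCP reachability check from inside the cluster network is a cheaper first step than installing the AWS CLI. The following is a minimal sketch with the standard library only; `can_connect` is a hypothetical helper, and it would have to run in a pod that ships Python.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_connect("sts.us-east-2.amazonaws.com")` uses the host from the error message. A `False` result from the flyte-binary pod's network context would point at egress or proxy configuration rather than at IAM.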
m

Mike Morgan

06/01/2023, 8:46 PM
I’d have to create a new image to do that. Can’t just install it directly, but running the command in the other pod is successful
r

Rob Rati

06/01/2023, 8:48 PM
We run with readOnlyRootFS so this is a little challenging. We should be able to re-spin the pod with an emptydir volume mount, and then d/l the awscli tarball and extract there.
j

jeev

06/01/2023, 8:50 PM
this is nuts, but have we tried killing the flyte-binary pod 😬
r

Rob Rati

06/01/2023, 8:50 PM
Yep
Through various config map changes we've killed the pod and had it restart
j

jeev

06/01/2023, 8:52 PM
ok so to recap: flyte-binary still can't get credentials. that is affecting its ability to generate signed URLs. the same KSA bound to an awscli pod works as expected.
can you paste the flyte-binary pod spec:
kubectl get pod flyte-binary-... -o yaml
r

Rob Rati

06/01/2023, 8:52 PM
Yes
j

jeev

06/01/2023, 8:52 PM
you might need to anonymize some of the envvars
r

Rob Rati

06/01/2023, 8:53 PM
I think the awscli test will take a bit. Looks like the container doesn't have python, so d/ling awscli tarball won't work. We would have to spin a container with python to do it.
j

jeev

06/01/2023, 8:55 PM
thats ok. the fact that the other pod works is probably good enough evidence that IRSA is working as intended.
m

Mike Morgan

06/01/2023, 8:56 PM
Ok
j

jeev

06/01/2023, 8:57 PM
@Mike Morgan for the sake of my sanity, can you post the anonymized awscli pod spec too along with the flyte-binary pod spec? 🙂
m

Mike Morgan

06/01/2023, 9:00 PM
Yes. Here you are.
Few minutes for the other one please
And here is the one with working awscli
I believe the error is coming from the pod itself, but there are no logs or traces on this. Really just a bunch of gorm tracing
Is there a way to get more out of that log?
j

jeev

06/01/2023, 10:10 PM
i’m gonna try and reproduce from your pod spec. i’ll report back tomorrow :)
m

Mike Morgan

06/01/2023, 10:11 PM
Thank you very much!! I appreciate it
j

jeev

06/02/2023, 3:24 PM
i dont have an update yet unfortunately 😅. will hopefully have one later in the afternoon
r

Rob Rati

06/02/2023, 3:25 PM
Anything you would like us to try?
j

jeev

06/02/2023, 9:28 PM
@Rob Rati yes: 2 things to try: make sure the awscli test pod is running on the same node as flyte-binary. make sure the awscli test pod also has readonlyrootfs enabled.
m

Mike Morgan

06/02/2023, 9:31 PM
Thanks @jeev. Second item is correct since that is all we allow. I will check on the first point
j

jeev

06/02/2023, 10:51 PM
@Mike Morgan it was set to false in the pod spec above for the airflow utils test pod.
m

Mike Morgan

06/02/2023, 10:51 PM
Oh never mind me then. Let me see
j

jeev

06/02/2023, 10:53 PM
as for the first point, maybe we can use a node selector to target the test pod to the same node as flyte-binary.
r

Rob Rati

06/02/2023, 10:53 PM
Or pod affinity
m

Mike Morgan

06/05/2023, 3:47 AM
Quick update on testing: changing the file system to read-only broke the part that was working, but changing the file system to not be read-only didn’t fix the flyte pod
I think the next step might be creating a flyte Docker image with the AWS CLI on it, to make it easier to test
Please let me know if you have any other pointers for me. Thank you very much.
j

jeev

06/05/2023, 1:40 PM
that sounds like a good plan @Mike Morgan. did you also test by placing the test pod on the same node as the flyte-binary pod?
m

Mike Morgan

06/05/2023, 2:10 PM
I haven't run that test yet. I will do that this morning
Co-locating the pods didn’t make a difference. I am creating a new image with flyte and the AWS CLI now and will report back
r

Rob Rati

06/05/2023, 9:03 PM
I have a theory on this. I think calls to the STS service need to leave the AWS network (or at least the network we can reach from EKS) in order to resolve and function properly. This is different from, say, RDS or S3, where access appears to stay within the AWS/EKS network. In our environment, you can't reach the outside world without setting a proxy configuration. However, when we set that proxy information, propeller won't start because it can't find a service at port 443 that it is looking for (we just get an IP). To test this theory, I need to know what service name(s) propeller tries to contact. Any idea?
j

jeev

06/05/2023, 9:09 PM
afaik, if you dont have any web-based plugins enabled, it should just talk to the k8s api
r

Rob Rati

06/05/2023, 9:11 PM
That's it!
It's looking up the kubernetes cluster service
Hrm, it appears that propeller doesn't actually do a lookup of the kubernetes service, but assumes a specific ip
We can get around that by excluding the specific IP, but I think propeller should probably use the service name
So, excluding the IP of the kubernetes service I think fixed it. I can get much further with my pyflyte run.
It broke the UI though. :(
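The behaviour Rob describes is consistent with how in-cluster Kubernetes clients work: client-go's in-cluster config reads the API server address from the `KUBERNETES_SERVICE_HOST` environment variable, which is normally the service's cluster IP rather than a DNS name. A proxy exclusion list that only contains hostnames therefore does not cover it. The sketch below is a deliberately simplified matcher (exact and domain-suffix matches only); real clients apply their own rules, `bypasses_proxy` is a hypothetical name, and `172.20.0.1` is a made-up service IP.

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Simplified NO_PROXY check: exact match or domain-suffix match only."""
    host = host.lower()
    for entry in no_proxy.split(","):
        # Normalise each entry: ignore blanks, case, and a leading dot.
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if host == entry or host.endswith("." + entry):
            return True
    return False


if __name__ == "__main__":
    names_only = "kubernetes.default.svc,.cluster.local"
    api_server_ip = "172.20.0.1"  # hypothetical value of KUBERNETES_SERVICE_HOST
    print(bypasses_proxy(api_server_ip, names_only))
    print(bypasses_proxy(api_server_ip, names_only + "," + api_server_ip))
```

Under these assumptions the first check fails and the second succeeds, which matches the observation that excluding the service IP let propeller start.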
j

jeev

06/05/2023, 10:51 PM
hmm that's odd. i think it hits the api server at
kubernetes.default
?
its just using the incluster kubeconfig basically
do you need a VPC endpoint for STS @Rob Rati?
r

Rob Rati

06/05/2023, 10:58 PM
Maybe. Can flyte be configured to use that endpoint though? AWS commands by default go to sts.amazonaws.com, so we'd need to be able to configure flyte to use a custom endpoint for STS
j

jeev

06/05/2023, 11:02 PM
flyte doesnt talk to STS directly at all. should be just through the AWS SDK. that should just hit sts.amazonaws.com i believe. and the networking will take care of routing it to the internal VPC endpoint
that will allow you to hit the STS endpoint without egress
but how did the test pod work?
r

Rob Rati

06/05/2023, 11:05 PM
We were able to reproduce the issue with a custom image that included awscli. From there we were able to figure out the proxy issue
j

jeev

06/05/2023, 11:05 PM
ah ok awesome
in that case, i think a STS VPC endpoint will work for you
whats the issue with the UI now?
r

Rob Rati

06/05/2023, 11:07 PM
If we have the proxy set, the UI is messed up. I assume we just need to figure out proper additions to ignore proxy. I think the vpc endpoint might be the best path
Setting the vpc endpoint got us past the sts issue. Now we are getting a 403 when trying to u/l data to the s3 bucket. Is the client or admin service attempting to do the s3 data upload?
j

jeev

06/06/2023, 3:50 PM
yes. that’s how fast register / pyflyte run work. it needs to upload the source code to s3
r

Rob Rati

06/06/2023, 3:52 PM
Makes sense. So, admin service generates a pre-signed s3 url and passes it to pyflyte, and pyflyte does the data upload?
The url comes back with bucket/<project>/<env> path, so who creates the project and env part? I assume pyflyte?
j

jeev

06/06/2023, 3:54 PM
it should be admin.
that’s the flyte project and domain that you are registering the wf to
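The client side of that flow is an ordinary HTTP PUT against the URL that admin signed. The sketch below builds such a request with the standard library without sending it; `build_signed_upload` is a hypothetical helper, not flytekit's function, and the `Content-MD5` header is an assumption that the URL was signed with that digest.

```python
import base64
import hashlib
import urllib.request


def build_signed_upload(signed_url: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request for a pre-signed URL."""
    md5 = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
    return urllib.request.Request(
        signed_url,
        data=payload,
        method="PUT",
        headers={
            # Assumption: the URL was signed with this Content-MD5 value.
            "Content-MD5": md5,
            "Content-Length": str(len(payload)),
        },
    )
```

Sending it with `urllib.request.urlopen(request)` raises `urllib.error.HTTPError` on a 403. Because the signature is produced with the signer's credentials, a rejection at this step usually says more about the role that generated the URL than about the machine doing the upload.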
r

Rob Rati

06/06/2023, 3:55 PM
Yep
j

jeev

06/06/2023, 3:55 PM
you should be able to set that in pyflyte though
r

Rob Rati

06/06/2023, 3:55 PM
atm we just have a bucket. Just wanted to make sure I understood who is doing what. I'm pretty sure our IAM access is too restrictive atm. We're opening it up
j

jeev

06/06/2023, 3:56 PM
since we’re using signed URLs for registration , only admin needs perms on s3. tasks will need perms as well since they don’t use signed URLs.
r

Rob Rati

06/06/2023, 4:04 PM
Tasks are pods launched by propeller, right? I saw in the docs we can use PodTemplates to define defaults for things like that, so we should be able to assign the pod to use a KSA with proper IAM access.
j

jeev

06/06/2023, 4:14 PM
I think the KSA will be overridden if set in the pod template. it should be set in admin (I think) or specified during registration (attached to launch plan).
m

Mike Morgan

06/06/2023, 4:31 PM
One thing that would be helpful is if we can get more logs from the admin service. I see debug messages in the code that would be nice to see. I see a lot of other debug messages but not the admin service
j

jeev

06/06/2023, 4:40 PM
the log level applies everywhere. the issue is probably that propeller is too verbose.
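If propeller's output is drowning out the admin messages, one workaround is to filter the JSON log stream after the fact. This is a minimal sketch assuming the line format shown earlier in the thread (a `json.src` field holding `file.go:line`); `filter_logs` is a hypothetical helper and the default prefix is just the noisy source seen in the pod logs here.

```python
import json
from typing import Iterable, Iterator, Tuple


def filter_logs(lines: Iterable[str],
                drop_src_prefixes: Tuple[str, ...] = ("composite_workqueue.go",),
                ) -> Iterator[dict]:
    """Yield parsed log records whose source file is not in the drop list."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        src = record.get("json", {}).get("src", "")
        if src.startswith(drop_src_prefixes):
            continue
        yield record
```

It can be fed from `kubectl logs` output piped into a script that passes `sys.stdin` as `lines`.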
r

Rob Rati

06/06/2023, 4:46 PM
Even loosening the IAM restrictions we still get 403. We also don't see any dirs being created in the s3 bucket.
This is our IAM policy governing s3:
{
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::<unique_prefix>*",
                "arn:aws:s3:::<unique_prefix>*/*"
            ],
            "Effect": "Allow",
            "Sid": "AllowS3BucketCRUD"
        },
j

jeev

06/06/2023, 4:48 PM
can you paste the error?
the IAM policy looks reasonable
it also takes a few mins to propagate, so maybe can just retry in a bit.
r

Rob Rati

06/06/2023, 4:54 PM
warnings.warn(
Failed with Exception Code: USER:ValueError
Value error!  Received: 403. Request to send data https://<bucket>.s3.us-east-2.amazonaws.com/flytesnacks/development/... failed
We updated the iam role a while ago and just tried now and got the same error
j

jeev

06/06/2023, 5:35 PM
so you get a signed url back but using the signed url results in a 403?
r

Rob Rati

06/06/2023, 5:37 PM
I guess? I'm running: FLYTE_SDK_LOGGING_LEVEL=1 ./flyte/bin/pyflyte run --remote cookbook/core/flyte_basics/hello_world.py my_wf So, just trying to run an example. The error is coming from pyflyte.
m

Mike Morgan

06/06/2023, 6:38 PM
Where is the configuration for pre-signing, such as duration etc.? Can’t find that anywhere
@Rob Rati: pyflyte will throw a 403 if the entity that generated the signed url doesnt have permissions
r

Rob Rati

06/06/2023, 8:17 PM
So then it sounds like we're back to an issue with the IAM role?
m

Mike Morgan

06/06/2023, 8:25 PM
We are able to create a pre-signed url to a specific file that works on the same flyte pod.
r

Rob Rati

06/06/2023, 8:25 PM
Interesting. We did this for something else and it seemed like it worked. The docs don't show any examples, but wording implies you can wildcard any segment of an arn. We'll try removing the bucket name wildcards and see if that helps.
j

jeev

06/06/2023, 8:25 PM
sorry you are right @Rob Rati. you should be able to wildcard a bucket name
if the pre-signed url works, you should be good 🤔
r

Rob Rati

06/06/2023, 8:26 PM
When we try to hit that url it complains (before it times out): <Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message>
j

jeev

06/06/2023, 8:27 PM
hmm
is it possible its another gotcha with your infra setup? 😅
r

Rob Rati

06/06/2023, 8:28 PM
Highly probable
Question is where to look for it
m

Mike Morgan

06/06/2023, 8:29 PM
@jeev the error message @Rob Rati shared above is why I was looking for flyte signing Config
r

Rob Rati

06/08/2023, 1:38 PM
For anyone else hitting this, we use KMS keys to encrypt data in our S3 buckets. We needed to add kms:GenerateDataKey* to our IAM role
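Put together with the S3 statement shared earlier in the thread, the resulting policy might look like the sketch below. The S3 statement is copied from that message; the second statement's `Sid`, the key ARN parameter, and the `build_policy` helper are all hypothetical, and the thread only confirms the `kms:GenerateDataKey*` action, so other KMS actions may be needed depending on the workload.

```python
import json
from typing import Any, Dict


def build_policy(bucket_prefix: str, kms_key_arn: str) -> Dict[str, Any]:
    """Return an IAM policy document with S3 access plus KMS data-key generation."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowS3BucketCRUD",
                "Effect": "Allow",
                "Action": ["s3:*"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_prefix}*",
                    f"arn:aws:s3:::{bucket_prefix}*/*",
                ],
            },
            {
                "Sid": "AllowKmsDataKeys",  # hypothetical statement id
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey*"],
                "Resource": [kms_key_arn],  # ARN of the bucket's KMS key
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy("<unique_prefix>", "<kms_key_arn>"), indent=2))
```

This also explains why the S3 statement alone looked reasonable yet uploads still returned 403: writing to a KMS-encrypted bucket needs permission on the key as well as on the bucket.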
j

jeev

06/08/2023, 1:39 PM
nice find. all resolved then?
r

Rob Rati

06/08/2023, 1:42 PM
Well, we are at the next hurdle. 🙂 We got a workload submitted and it moves to the running state, but doesn't do anything. I suspect this is because we have requirements for workloads to be admitted to our kubernetes cluster, but I don't see any logs indicating a rejection by an admission controller (which is what we would get locally). Maybe our log level isn't verbose enough?
j

jeev

06/08/2023, 1:43 PM
try with log level 5
r

Rob Rati

06/08/2023, 2:10 PM
Bumping the logging level got us to find this error:
{"json":{"exec_id":"fb8d670837f954478b70","ns":"flytesnacks-development","res_ver":"641457952","routine":"worker-1","src":"admin_eventsink.go:44","wf":"flytesnacks:development:core.flyte_basics.hello_world.my_wf"},"level":"debug","msg":"AdminEventSink received a new event execution_id:\u003cproject:\"flytesnacks\" domain:\"development\" name:\"fb8d670837f954478b70\" \u003e producer_id:\"propeller\" phase:FAILED occurred_at:\u003cseconds:1686232511 nanos:640124539 \u003e error:\u003ccode:\"Workflow abort failed\" message:\"Workflow[flytesnacks:development:core.flyte_basics.hello_world.my_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/flytesnacks-development-fb8d670837f954478b70/start-node/data/0/outputs.pb].: PutObject, putting object: MissingEndpoint: 'Endpoint' configuration is required for this service\" kind:SYSTEM \u003e ","ts":"2023-06-08T13:55:11Z"}
What endpoint is it talking about and where? VPC endpoint?
j

jeev

06/08/2023, 2:13 PM
hmm. that sounds like the s3 endpoint. but shouldn’t need to specify that. unless it thinks it’s using minio or something.
r

Rob Rati

06/08/2023, 2:15 PM
This is our storage config. We pointed it at stow:
003-storage.yaml: "propeller:\n  rawoutput-prefix: s3://<bucket>/data\nstorage:\n
    \ type: stow\n  stow:\n    kind: s3\n    config:\n      region: us-east-2\n      disable_ssl:
    true \n      v2_signing: \n      auth_type: iam\n  container: <bucket>\n"
Is this what would impact that?
I notice we have auth_type as iam, but nothing in that config specifying the iam role.
j

jeev

06/08/2023, 2:17 PM
it should use IRSA
are y’all using the chart?
r

Rob Rati

06/08/2023, 2:28 PM
Yes and no. 😄 While we are debugging what we need to get deployed, we are generating the deployment yaml from the charts and deploying that, making mods to the yaml if needed. ATM we have to do this because the chart wants to create rbac entities, and in our cluster we have to do that a special way. We can't create normal rbac objects directly
j

jeev

06/08/2023, 2:30 PM
got it.
maybe render the chart with just region set, and look at the storage config
r

Rob Rati

06/08/2023, 2:32 PM
We'll give that a try and report back
No change by regenerating the chart from the latest mainline. In my research, this error seems to be related to an aws service and likely a misconfig. Is there an aws service other than S3 that is involved in starting a job? Do you know what actions/permissions are needed? I'm guessing we have another iam action missing.
j

jeev

06/08/2023, 9:01 PM
the error is just about propeller failing to write to what it thinks is the right object storage. can you paste the new storage config?
r

Rob Rati

06/08/2023, 9:04 PM
003-storage.yaml: "propeller:\n  rawoutput-prefix: s3://<bucket>/data\nstorage:\n
    \ type: stow\n  stow:\n    kind: s3\n    config:\n      region: us-east-2\n      disable_ssl:
    false \n      v2_signing: false\n      auth_type: iam\n  container: <bucket>\n"
We think we see files in the proper location on s3. What is propeller trying to write? Maybe we can verify if those files exist
m

Mike Morgan

06/08/2023, 9:12 PM
The perplexing thing is we can see that file in s3 at that location
r

Rob Rati

06/09/2023, 7:35 PM
I'm a bit confused about the storage config. In the CM, there is a 003-storage.yaml which maps to this config struct: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/config/config.go#L120
However, it looks like propeller is using this config: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/controller.go#L411
Which maps to this config struct: https://github.com/flyteorg/flytestdlib/blob/master/storage/config.go#L47
Which seems to be consuming this stanza in the CM:
storage:
  cache:
    max_size_mbs: 10
    target_gc_percent: 100
Do we need to configure the s3 options in the storage stanza as well? Is flytestdlib somehow using the propeller config if not defined?
j

jeev

06/10/2023, 5:09 AM
they should get merged
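What "merged" would mean for two files that both define a `storage` stanza can be illustrated with a recursive dictionary merge. This is only a conceptual sketch, not Flyte's actual config loader; `deep_merge` is a hypothetical helper, and the two inputs stand in for the cache stanza and the 003-storage.yaml content quoted in the thread.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested sections are combined key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides define this section: merge it recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and new sections: the later file wins.
            merged[key] = value
    return merged


if __name__ == "__main__":
    cache_stanza = {"storage": {"cache": {"max_size_mbs": 10, "target_gc_percent": 100}}}
    s3_stanza = {"storage": {"type": "stow", "container": "<bucket>"}}
    print(deep_merge(cache_stanza, s3_stanza))
```

Under this model the `cache` settings and the `stow` settings end up side by side in a single `storage` section, so the S3 options would not need to be repeated next to the cache block.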