
Panos Strouth

almost 3 years ago
Hi everyone, I am new to K8s and Flyte, but I managed to install Flyte on EKS by following this guide: https://docs.flyte.org/en/latest/deployment/aws/manual.html I tried to access Flyte using flytectl and it worked. Unfortunately, when I try to use pyflyte to execute a workflow remotely, I get the following error:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
status code: 403, request id: 88d09420-d2e3-4772-8767-83cff32d91af"
debug_error_string = "UNKNOWN:Error received from peer ipv4:xx.xx.xx.xx:443 {grpc_message:"failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403
This looks like an error in IRSA (IAM Roles for Service Accounts). The installation guide suggests attaching IAM roles to the whole EC2 node. Personally, I decided to use IRSA because I think this is the correct way to provide permissions to applications: EC2-wide roles mean that every application running on the instance gets the role's permissions, whereas IRSA lets IAM roles be assumed only by applications running under specific service accounts in specific namespaces, which gives more fine-grained control. But as I said, I am still a K8s beginner, so no strong opinion. My IAM setup has 2 roles: flyte-user-role and iam-role-flyte. Both roles have full S3 permissions. The most important part is the trust policy. Since I use IRSA, both roles have the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::xxxxxxxx:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/yyyyyy"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/yyyyyy:aud": "sts.amazonaws.com",
          "oidc.eks.eu-central-1.amazonaws.com/id/yyyyyy:sub": "system:serviceaccount:flyte:default"
        }
      }
    }
  ]
}
Note the “flyte” namespace in the Condition. My Flyte services run in the “flyte” namespace, and they should be able to assume the above roles. I think the problem is related to the IAM trust policies: the Flyte service does not have the required permissions to assume the IAM role. Has anyone faced a similar issue? Any help is appreciated!
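(A note that may help anyone hitting the same 403: with IRSA, the trust policy is only half of the wiring. The Kubernetes service account the pod actually runs under must also carry the eks.amazonaws.com/role-arn annotation, and its namespace and name must match the :sub condition exactly. Since the signed-URL error is produced by flyteadmin, the service account to check is the one flyteadmin's pod uses, which may not be flyte:default. A minimal sketch, reusing the iam-role-flyte role name from the post; the names here are assumptions:)

```yaml
# Sketch only: the service account the failing pod runs under must be
# annotated with the role ARN, and namespace:name must match the trust
# policy's ":sub" value ("system:serviceaccount:flyte:default" above).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default       # must match the last segment of the ":sub" condition
  namespace: flyte    # must match the namespace segment of the ":sub" condition
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxx:role/iam-role-flyte
```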

Anthony

about 3 years ago
Hey everyone again 🙌 I'm seeing a hang when triggering the next task. My main workflow is depicted in the attached pic. The first preproc_and_split step was executed successfully:
pyflyte-execute \
  --inputs s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-a27rchl5z9ndpw297nk8/n0/data/inputs.pb \
  --output-prefix s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-a27rchl5z9ndpw297nk8/n0/data/0 \
  --raw-output-data-prefix s3://my-s3-bucket/vo/a27rchl5z9ndpw297nk8-n0-0 \
  --checkpoint-path s3://my-s3-bucket/vo/a27rchl5z9ndpw297nk8-n0-0/_flytecheckpoints \
  --prev-checkpoint "" \
  --resolver flytekit.core.python_auto_container.default_task_resolver \
  -- \
  task-module app.workflow \
  task-name preproc_and_split
The output should be a small train dataset with 50k records, and the node allocation shows sufficient memory available. But after the first task succeeds, this step hangs forever and Flyte doesn't launch the next executions of the workflow.
My task_resource_defaults config is the following:
task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 1
        memory: 3000Mi
        storage: 200Mi
      limits:
        cpu: 5
        gpu: 1
        memory: 8Gi
        storage: 500Mi
I have one task that produces dataclass instances as output, and another task that should take these instances as input params:
@workflow
def main_flow() -> Forecast:
    """
    Main Flyte WorkFlow consisting of three tasks:
        -  @preproc_and_split
        -  @train_xgboost_clf
        -  @get_predictions
    """
    logger.info(log="#START -- START Raw Preprocessing and Splitting", timestamp=None)
    train_cls, target_cls = preproc_and_split()

    logger.info(log="#START -- START Initialize Boosting Params", timestamp=None)
    saved_mpath = train_xgboost_clf(
                            feat_cls=train_cls,
                            target_cls=target_cls,
                            xgb_params=xgb_params,
                            cust_metric=BoostingCustMetric
                         )
Where
def preproc_and_split() -> Tuple[Fraud_Raw_PostProc_Data_Class, Fraud_Raw_Target_Data_Class]:
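(Not sure if this is the cause, but one common reason for a hang right at a task boundary is that the dataclass outputs can't be serialized for the next task. In flytekit of that era, a dataclass used as task input/output generally had to be decorated with @dataclass_json from the dataclasses_json package so flytekit could marshal it to JSON; a plain @dataclass could fail at the transition. A stdlib-only sketch of the underlying requirement, using a hypothetical stand-in for Fraud_Raw_PostProc_Data_Class:)

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for Fraud_Raw_PostProc_Data_Class; field names are
# assumptions. In flytekit, this class would additionally carry the
# @dataclass_json decorator (from the dataclasses_json package) on top of
# @dataclass so flytekit can serialize it between tasks.
@dataclass
class TrainSplit:
    dataset_path: str
    n_rows: int

# What flytekit effectively needs to do with a dataclass output: serialize
# it to JSON after the producing task, and reconstruct it for the consumer.
split = TrainSplit(dataset_path="s3://my-s3-bucket/train.parquet", n_rows=50_000)
blob = json.dumps(asdict(split))           # producing task's side
restored = TrainSplit(**json.loads(blob))  # consuming task's side
print(restored.n_rows)  # → 50000
```

If any field of the dataclass is not JSON-serializable (e.g. a raw DataFrame), this round trip is where things break, so it's worth checking what Fraud_Raw_PostProc_Data_Class actually contains.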
Any advice on why I'm facing this behaviour?