
Panos Strouth

almost 3 years ago
Hi everyone, I am new to K8s and Flyte, but I managed to install Flyte on EKS by following this guide: https://docs.flyte.org/en/latest/deployment/aws/manual.html I tried to access Flyte using flytectl and it worked. Unfortunately, when I try to use pyflyte to execute a workflow remotely, I get the following error:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
status code: 403, request id: 88d09420-d2e3-4772-8767-83cff32d91af"
debug_error_string = "UNKNOWN:Error received from peer ipv4:xx.xx.xx.xx:443 {grpc_message:"failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403
This looks like an error in IRSA (IAM Roles for Service Accounts). The installation guide suggests attaching IAM roles to the whole EC2 node. Personally, I decided to use IRSA because I think this is the correct way to provide permissions to applications: EC2-wide roles mean that every application running on the instance gets the role's permissions, whereas IRSA lets IAM roles be assumed only by applications running under specific service accounts in specific namespaces, which gives more fine-grained control. But as I said, I am still a K8s beginner, so no strong opinion. My IAM setup has 2 roles: flyte-user-role and iam-role-flyte. Both roles have full S3 permissions. The most important part is the trust policy. Since I use IRSA, both roles have the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::xxxxxxxx:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/yyyyyy"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/yyyyyy:aud": "sts.amazonaws.com",
          "oidc.eks.eu-central-1.amazonaws.com/id/yyyyyy:sub": "system:serviceaccount:flyte:default"
        }
      }
    }
  ]
}
Note the “flyte” namespace in the Condition. My Flyte services run in the “flyte” namespace, and they should be able to assume the above roles. I think the problem is related to the IAM trust policies: the Flyte service does not have the required permissions to assume the IAM role. Has anyone faced a similar issue? Any help is appreciated!
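(A note that may help anyone hitting the same 403: with IRSA, the trust policy is only half of the wiring. The Kubernetes service account the pod actually runs under must also carry the eks.amazonaws.com/role-arn annotation, and its namespace and name must match the :sub condition exactly. Since the signed-URL error is produced by flyteadmin, the service account to check is the one flyteadmin's pod uses, which may not be flyte:default. A minimal sketch, reusing the iam-role-flyte role name from the post; the names here are assumptions:)

```yaml
# Sketch only: the service account the failing pod runs under must be
# annotated with the role ARN, and namespace:name must match the trust
# policy's ":sub" value ("system:serviceaccount:flyte:default" above).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default       # must match the last segment of the ":sub" condition
  namespace: flyte    # must match the namespace segment of the ":sub" condition
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxx:role/iam-role-flyte
```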

Anthony

about 3 years ago
Hey everyone again 🙌 I'm seeing a hang when triggering the next task. My main workflow is depicted in the attached pic. The first preproc_and_split step was executed successfully:
pyflyte-execute \
  --inputs s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-a27rchl5z9ndpw297nk8/n0/data/inputs.pb \
  --output-prefix s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-a27rchl5z9ndpw297nk8/n0/data/0 \
  --raw-output-data-prefix s3://my-s3-bucket/vo/a27rchl5z9ndpw297nk8-n0-0 \
  --checkpoint-path s3://my-s3-bucket/vo/a27rchl5z9ndpw297nk8-n0-0/_flytecheckpoints \
  --prev-checkpoint "" \
  --resolver flytekit.core.python_auto_container.default_task_resolver \
  -- \
  task-module app.workflow \
  task-name preproc_and_split
The output should be a small train dataset with 50k records, and the node allocation shows sufficient memory available. But after the first task succeeds, this step hangs forever and Flyte doesn't launch the next executions of the workflow.
My task_resource_defaults config is the following:
task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 1
        memory: 3000Mi
        storage: 200Mi
      limits:
        cpu: 5
        gpu: 1
        memory: 8Gi
        storage: 500Mi
I have one task that produces dataclass instances as output, and another task that should take these instances as input params:
@workflow
def main_flow() -> Forecast:
    """
    Main Flyte WorkFlow consisting of three tasks:
        -  @preproc_and_split
        -  @train_xgboost_clf
        -  @get_predictions
    """
    logger.info(log="#START -- START Raw Preprocessing and Splitting", timestamp=None)
    train_cls, target_cls = preproc_and_split()

    logger.info(log="#START -- START Initialize Boosting Params", timestamp=None)
    saved_mpath = train_xgboost_clf(
                            feat_cls=train_cls,
                            target_cls=target_cls,
                            xgb_params=xgb_params,
                            cust_metric=BoostingCustMetric
                         )
Where
def preproc_and_split() -> Tuple[Fraud_Raw_PostProc_Data_Class, Fraud_Raw_Target_Data_Class]:
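(Not sure if this is the cause, but one common reason for a hang right at a task boundary is that the dataclass outputs can't be serialized for the next task. In flytekit of that era, a dataclass used as task input/output generally had to be decorated with @dataclass_json from the dataclasses_json package so flytekit could marshal it to JSON; a plain @dataclass could fail at the transition. A stdlib-only sketch of the underlying requirement, using a hypothetical stand-in for Fraud_Raw_PostProc_Data_Class:)

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for Fraud_Raw_PostProc_Data_Class; field names are
# assumptions. In flytekit, this class would additionally carry the
# @dataclass_json decorator (from the dataclasses_json package) on top of
# @dataclass so flytekit can serialize it between tasks.
@dataclass
class TrainSplit:
    dataset_path: str
    n_rows: int

# What flytekit effectively needs to do with a dataclass output: serialize
# it to JSON after the producing task, and reconstruct it for the consumer.
split = TrainSplit(dataset_path="s3://my-s3-bucket/train.parquet", n_rows=50_000)
blob = json.dumps(asdict(split))           # producing task's side
restored = TrainSplit(**json.loads(blob))  # consuming task's side
print(restored.n_rows)  # → 50000
```

If any field of the dataclass is not JSON-serializable (e.g. a raw DataFrame), this round trip is where things break, so it's worth checking what Fraud_Raw_PostProc_Data_Class actually contains.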
Any advice on why I'm facing this behaviour?