Help please. My first `pyflyte run --remote` comma...
# ask-the-community
b
Help please. My first `pyflyte run --remote` command fails with "Handshake failed with fatal error SSL_ERROR_SSL".
Copy code
$ FLYTE_SDK_LOGGING_LEVEL=20 pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'

{"asctime": "2023-04-15 16:54:36,923", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 16:54:36,950", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 16:54:36,954", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 16:54:37,937", "name": "flytekit", "levelname": "INFO", "message": "We won't register PyTorchCheckpointTransformer, PyTorchTensorTransformer, and PyTorchModuleTransformer because torch is not installed."}
{"asctime": "2023-04-15 16:54:38,379", "name": "flytekit", "levelname": "INFO", "message": "We won't register TensorFlowRecordFileTransformer, TensorFlowRecordsDirTransformer and TensorFlowModelTransformerbecause tensorflow is not installed."}
{"asctime": "2023-04-15 16:54:38,408", "name": "flytekit", "levelname": "INFO", "message": "We won't register bigquery handler for structured dataset because we can't find the packages google-cloud-bigquery-storage and google-cloud-bigquery"}
{"asctime": "2023-04-15 16:54:38,696", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 16:54:38,697", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
E0415 16:54:39.685038207  177107 ssl_transport_security.cc:1420]       Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
E0415 16:54:40.191239374  177107 ssl_transport_security.cc:1420]       Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
Failed with Exception: Reason: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
	details: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8088: Ssl handshake failed: SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
	Debug string UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8088: Ssl handshake failed: SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER {created_time:"2023-04-15T16:54:40.193233866+09:00", grpc_status:14}
I understand this error usually occurs when the `.flyte/config.yaml` and env variable config are not correct. I have checked that, but I must be missing something obvious. Here is my setup: the remote cluster is AWS EKS running in a VPC, and Flyte was installed following the instructions in https://docs.flyte.org/en/latest/deployment/deployment/cloud_simple.html. Local ports are proxied to these Flyte services...
Copy code
kubectl -n flyte port-forward service/flyte-backend-flyte-binary-grpc 8089:8089 &
kubectl -n flyte port-forward service/flyte-backend-flyte-binary-http 8088:8088 &
Env vars...
Copy code
$ echo $FLYTECTL_CONFIG
/home/blair/.flyte/config.yaml

$ echo $KUBECONFIG
:/home/blair/.kube/config
.flyte/config.yaml
Copy code
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///localhost:8088
  authType: Pkce
  insecure: false
logger:
  show-source: true
  level: 0
If I set `insecure: true` I get the following error
Copy code
$ FLYTE_SDK_LOGGING_LEVEL=20 pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
{"asctime": "2023-04-15 17:00:10,016", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 17:00:10,032", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 17:00:10,034", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 17:00:11,255", "name": "flytekit", "levelname": "INFO", "message": "We won't register PyTorchCheckpointTransformer, PyTorchTensorTransformer, and PyTorchModuleTransformer because torch is not installed."}
{"asctime": "2023-04-15 17:00:11,875", "name": "flytekit", "levelname": "INFO", "message": "We won't register TensorFlowRecordFileTransformer, TensorFlowRecordsDirTransformer and TensorFlowModelTransformerbecause tensorflow is not installed."}
{"asctime": "2023-04-15 17:00:11,897", "name": "flytekit", "levelname": "INFO", "message": "We won't register bigquery handler for structured dataset because we can't find the packages google-cloud-bigquery-storage and google-cloud-bigquery"}
{"asctime": "2023-04-15 17:00:12,160", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-15 17:00:12,162", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
Failed with Exception: Reason: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
	details: failed to connect to all addresses; last error: INTERNAL: ipv4:127.0.0.1:8088: Trying to connect an http1.x server
	Debug string UNKNOWN:failed to connect to all addresses; last error: INTERNAL: ipv4:127.0.0.1:8088: Trying to connect an http1.x server {created_time:"2023-04-15T17:00:13.277402386+09:00", grpc_status:14}
It looks like gRPC cannot connect, but the port proxy seems to be working fine, as I can open the web console in a browser at http://localhost:8088/console
k
Hmm, cc @jeev have you seen this?
j
port in flyte config should be 8089 (grpc) not 8088 (http)
but then when it returns a link, change that port to 8088 before opening in browser 😅
b
@jeev thank you for the suggestion. I updated `.flyte/config.yaml` but see the same error with port 8089
Copy code
$ FLYTE_SDK_LOGGING_LEVEL=20 pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
{"asctime": "2023-04-16 10:37:01,444", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:37:01,467", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:37:01,470", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:37:02,739", "name": "flytekit", "levelname": "INFO", "message": "We won't register PyTorchCheckpointTransformer, PyTorchTensorTransformer, and PyTorchModuleTransformer because torch is not installed."}
{"asctime": "2023-04-16 10:37:03,340", "name": "flytekit", "levelname": "INFO", "message": "We won't register TensorFlowRecordFileTransformer, TensorFlowRecordsDirTransformer and TensorFlowModelTransformerbecause tensorflow is not installed."}
{"asctime": "2023-04-16 10:37:03,382", "name": "flytekit", "levelname": "INFO", "message": "We won't register bigquery handler for structured dataset because we can't find the packages google-cloud-bigquery-storage and google-cloud-bigquery"}
{"asctime": "2023-04-16 10:37:03,707", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:37:03,709", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
E0416 10:37:04.581003542  260714 ssl_transport_security.cc:1420]       Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
E0416 10:37:04.878786218  260714 ssl_transport_security.cc:1420]       Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
Failed with Exception: Reason: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
	details: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8089: Ssl handshake failed: SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
	Debug string UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8089: Ssl handshake failed: SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER {created_time:"2023-04-16T10:37:04.880480722+09:00", grpc_status:14}
j
what value are you using for `insecure`?
b
false
Copy code
$ cat ~/.flyte/config.yaml
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///localhost:8089
  authType: Pkce
  insecure: false
logger:
  show-source: true
  level: 0
j
try `insecure: true`
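For reference, a sketch of the config that lines up with the suggestions so far (gRPC port 8089 plus insecure mode; everything else unchanged from the earlier example):
Copy code
admin:
  endpoint: dns:///localhost:8089
  authType: Pkce
  insecure: true
logger:
  show-source: true
  level: 0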
b
Oh progress, now I get an S3 error.
Copy code
$ FLYTE_SDK_LOGGING_LEVEL=20 pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
{"asctime": "2023-04-16 10:53:40,037", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:53:40,058", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:53:40,060", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:53:41,334", "name": "flytekit", "levelname": "INFO", "message": "We won't register PyTorchCheckpointTransformer, PyTorchTensorTransformer, and PyTorchModuleTransformer because torch is not installed."}
{"asctime": "2023-04-16 10:53:41,924", "name": "flytekit", "levelname": "INFO", "message": "We won't register TensorFlowRecordFileTransformer, TensorFlowRecordsDirTransformer and TensorFlowModelTransformerbecause tensorflow is not installed."}
{"asctime": "2023-04-16 10:53:41,971", "name": "flytekit", "levelname": "INFO", "message": "We won't register bigquery handler for structured dataset because we can't find the packages google-cloud-bigquery-storage and google-cloud-bigquery"}
{"asctime": "2023-04-16 10:53:42,300", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 10:53:42,302", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
Failed with Exception: Reason: USER:ValueError
Value error!  Received: 403. Request to send data https://meta-bucket.s3.us-west-2.amazonaws.com/flytesnacks/development/4FSJB7UHCV36ICGGUNFBJCEDZU%3D%3D%3D%3D%3D%3D/script_mode.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=xxxxx(redacted)xxxx failed
Hmm, eks-starter.yaml does have a role with S3 permissions for that meta bucket
Copy code
serviceAccount:
  create: true
  annotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::xxx(redacted)xxx:role/flyte-role"
If I look into the service accounts, I would have expected to see an annotation like this...
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::xxx(redacted)xxx:role/flyte-role
...but it is not there, so I have to assume the service account creation is not adding the IAM role
Copy code
$ kubectl describe serviceaccount default -n flyte
Name:                default
Namespace:           flyte
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   default-token-t29lf
Tokens:              default-token-t29lf
Events:              <none>

$ kubectl describe serviceaccount flyte-backend-flyte-binary -n flyte
Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.5.0
Annotations:         meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   flyte-backend-flyte-binary-token-4pxfl
Tokens:              flyte-backend-flyte-binary-token-4pxfl
Events:              <none>
j
might need to indent the annotation so it’s nested under “annotations”
also, the helm chart doesn’t create the iam role. it assumes that it already exists
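Concretely, the annotation needs to be nested one level deeper under annotations in the Helm values, roughly like this (a sketch based on the eks-starter.yaml snippet above; the role ARN stays the redacted placeholder):
Copy code
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::xxx(redacted)xxx:role/flyte-role"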
b
🤯 yes the yaml indent fixed the missing role annotation
Copy code
$ kubectl describe serviceaccount flyte-backend-flyte-binary -n flyte
Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.5.0
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::xxxx:role/poc-eks-flyte3_iamserviceaccount_role
                     meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   flyte-backend-flyte-binary-token-4pxfl
Tokens:              flyte-backend-flyte-binary-token-4pxfl
Events:              <none>
I tried pyflyte again but still get an S3 error
Copy code
$ FLYTE_SDK_LOGGING_LEVEL=20 pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
{"asctime": "2023-04-16 13:04:29,645", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 13:04:29,663", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 13:04:29,666", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 13:04:30,941", "name": "flytekit", "levelname": "INFO", "message": "We won't register PyTorchCheckpointTransformer, PyTorchTensorTransformer, and PyTorchModuleTransformer because torch is not installed."}
{"asctime": "2023-04-16 13:04:31,587", "name": "flytekit", "levelname": "INFO", "message": "We won't register TensorFlowRecordFileTransformer, TensorFlowRecordsDirTransformer and TensorFlowModelTransformerbecause tensorflow is not installed."}
{"asctime": "2023-04-16 13:04:31,631", "name": "flytekit", "levelname": "INFO", "message": "We won't register bigquery handler for structured dataset because we can't find the packages google-cloud-bigquery-storage and google-cloud-bigquery"}
{"asctime": "2023-04-16 13:04:31,994", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
{"asctime": "2023-04-16 13:04:31,996", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/blair/.flyte/config.yaml"}
Failed with Exception: Reason: USER:ValueError
Value error!  Received: 403. Request to send data https://meta-bucket.s3.us-west-2.amazonaws.com/flytesnacks/development/4FSJB7UHCV36ICGGUNFBJCEDZU%3D%3D%3D%3D%3D%3D/script_mode.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=xxxxx(redacted)xxxx failed
The IAM service account looks like this...
Copy code
$ kubectl describe serviceaccount flyte-backend-flyte-binary -n flyte
Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.5.0
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::xxxx:role/poc-eks-flyte3_iamserviceaccount_role
                     meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   flyte-backend-flyte-binary-token-4pxfl
Tokens:              flyte-backend-flyte-binary-token-4pxfl
Events:              <none>
The trust relationship for the role looks like this...
Copy code
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::xxxx:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXX"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<http://oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXX:sub|oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXX:sub>": "system:serviceaccount:flyte:flyte-backend-flyte-binary",
          "<http://oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXX:aud|oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXX:aud>": "<http://sts.amazonaws.com|sts.amazonaws.com>"
        }
      }
    }
  ]
}
j
does the `poc-eks-flyte3_iamserviceaccount_role` IAM role have permissions on the `meta-bucket` S3 bucket?
b
yes, this is the policy attached to that role
Copy code
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:ListStorageLensConfigurations",
                "s3:ListAccessPointsForObjectLambda",
                "s3:GetAccessPoint",
                "s3:PutAccountPublicAccessBlock",
                "s3:GetAccountPublicAccessBlock",
                "s3:ListAllMyBuckets",
                "s3:ListAccessPoints",
                "s3:PutAccessPointPublicAccessBlock",
                "s3:ListJobs",
                "s3:PutStorageLensConfiguration",
                "s3:ListMultiRegionAccessPoints",
                "s3:CreateJob"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::meta-bucket",
                "arn:aws:s3:::user-bucket"
            ]
        }
...
j
try this?
Copy code
{
  "Sid": "VisualEditor1",
  "Effect": "Allow",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::meta-bucket",
    "arn:aws:s3:::meta-bucket/*",
    "arn:aws:s3:::user-bucket",
    "arn:aws:s3:::user-bucket/*"
  ]
}
b
success! The workflow is submitted and runs on the remote cluster, thank you 🙂 Now I get an error on the first step (get_data) of the workflow
Copy code
Pod failed. No message received from kubernetes.
[feccea785a4c046ee848-n0-0] terminated with exit code (137). Reason [OOMKilled]. Message: 
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-hk2xuhf_ because the default path (/home/flytekit/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
I fixed the OOM with a resource override, but it would be nice to know if there was somewhere where I could set the default resources for tasks
Copy code
@task(limits=Resources(mem="256Mi"))
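For project/domain-wide defaults rather than per-task overrides, flytectl's task resource attributes may be what you want; a hedged sketch, assuming the attribute file name and values are placeholders:
Copy code
$ cat tra.yaml
project: flytesnacks
domain: development
defaults:
  cpu: "500m"
  memory: 256Mi
limits:
  cpu: "2"
  memory: 1Gi

$ flytectl update task-resource-attribute --attrFile tra.yaml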
Next error in the workflow appears to be an S3 permission error
Copy code
[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[f9ecd5cb543bf4a68b9d-n0-0] terminated with exit code (1). Reason [Error]. Message: 
on3.8/asyncio/tasks.py", line 455, in wait_for
    return await fut
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 1171, in _get_file
    body, content_length = await _open_file(range=0)
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 1162, in _open_file
    resp = await self._call_s3(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 347, in _call_s3
    return await _error_wrapper(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 139, in _error_wrapper
    raise err
PermissionError: Access Denied
I guess the workflow does not have access to the iamserviceaccount role, as that role only trusts "system:serviceaccount:flyte:flyte-backend-flyte-binary"?
Copy code
$ kubectl get pods --all-namespaces
NAMESPACE                 NAME                                          READY   STATUS      RESTARTS   AGE
flyte                     flyte-backend-flyte-binary-74b4dbb9ff-ktmns   1/1     Running     0          11m
flytesnacks-development   f9ecd5cb543bf4a68b9d-n0-0                     0/1     Error       0          5m29s
I tried creating an iamserviceaccount for the namespace flytesnacks-development, but the Access Denied error still occurs
Copy code
eksctl create iamserviceaccount \
  --name flytesnacks-development-role-sa \
  --namespace flytesnacks-development \
  --cluster poc-eks-flyte3 \
  --attach-policy-arn arn:aws:iam::xxxx:policy/flyte-policy \
  --approve \
  --role-name poc-eks-flyte3_flytesnacks-development_iamserviceaccount_role
I thought the issue might be related to one discussed in this thread... but the proposed solutions of downgrading flytekit (I downgraded to 1.4.1) and adding a file to the s3 bucket did not work. In the end I had to manually assign the S3 policy to the kubernetes node group... not ideal, it is just a temporary workaround https://flyte-org.slack.com/archives/CP2HDHKE1/p1681312166751079
The remote workflow now runs successfully with the S3 policy workaround
j
hmm yea it should work with IRSA. does the flyte task pod’s KSA have the appropriate annotation for the right IAM role?
b
The KSA I created looks to have the IAM role...
Copy code
$ kubectl describe sa flytesnacks-development-role-sa -n flytesnacks-development
Name:                flytesnacks-development-role-sa
Namespace:           flytesnacks-development
Labels:              app.kubernetes.io/managed-by=eksctl
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::xxxx:role/poc-eks-flyte3_flytesnacks-development_iamserviceaccount_role
Image pull secrets:  <none>
Mountable secrets:   flytesnacks-development-role-sa-token-s4ltw
Tokens:              flytesnacks-development-role-sa-token-s4ltw
Events:              <none>
Here is the description of the pod with the error
Copy code
kubectl describe pod fd049603a9d0f4a58a91-n0-0 -n flytesnacks-development
Name:         fd049603a9d0f4a58a91-n0-0
Namespace:    flytesnacks-development
Priority:     0
Node:         ip-10-1-1-49.us-west-2.compute.internal/10.1.1.49
Start Time:   Sun, 16 Apr 2023 15:24:47 +0900
Labels:       domain=development
              execution-id=fd049603a9d0f4a58a91
              interruptible=false
              node-id=n0
              project=flytesnacks
              shard-key=6
              task-name=example-get-data
              workflow-name=example-training-workflow
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: false
              kubernetes.io/psp: eks.privileged
....
btw, I am also trying to use `flytekitplugins.papermill` in the example here: https://docs.flyte.org/projects/cookbook/en/latest/auto/case_studies/feature_engineering/eda/notebook.html It works locally, but when run remotely it fails with `ModuleNotFoundError: No module named 'flytekitplugins.papermill'`. I don't see any instructions for installing the papermill plugin on the cluster in the link below... so how is that module meant to be installed? https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html
j
the pod is probably just using the default KSA instead of your new one.
there is extra config, either in flyteadmin, at registration time or launch time for specifying a custom KSA
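If the intent is for task pods to run under the new KSA, one option that may work is passing the service account at launch time; a sketch using the account created above (flag support depends on your flytekit version):
Copy code
pyflyte run --remote \
  --service-account flytesnacks-development-role-sa \
  example.py training_workflow --hyperparameters '{"C": 0.1}'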
b
Thanks, I will look into the custom KSA settings. Do you have any suggestion for the papermill plugin issue?
j
b
Yes that is already installed in the conda environment I am running pyflyte from.
Copy code
$ pip freeze | grep flyte
flyteidl==1.3.17
flytekit==1.4.1
flytekitplugins-papermill==1.5.0
j
it’ll need to be installed in the image that the task is running on as well. you can probably use the flytekit image as a base and install over it.
then when calling “pyflyte run” specify this new image
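For example, something along these lines (the image URI and workflow names are placeholders for the image you push and the workflow you want to run):
Copy code
pyflyte run --remote \
  --image xxxx.dkr.ecr.us-west-2.amazonaws.com/flyteorg/flytekit-papermill:py3.8 \
  workflow.py my_workflow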
b
Ok, thanks, I see `pyflyte --image xxxx run xxxx`. I also see `@task(container_image="xxxx")`, but I don't see the same option for `NotebookTask`. Is it possible to pass the image in to the `NotebookTask`? Also, is there a way to change the default task image, so I don't have to override the image being used?
j
will have to see the code to confirm. not sure atm.
i’d make sure it’s working as intended first, and then we can polish
b
Ok, the Dockerfile for the new image
Copy code
FROM ghcr.io/flyteorg/flytekit:py3.8-latest

USER root
RUN pip install -U flytekitplugins-papermill
USER flytekit
That fixed the module not found error. The workflow runs. However I now have this error
Copy code
[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[f1cedfad3740243a8801-n0-0] terminated with exit code (1). Reason [Error]. Message: 
r(InstanceTrackingMeta, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/flytekitplugins/papermill/task.py", line 140, in __init__
    raise ValueError(f"Illegal notebook path passed in {self._notebook_path}")
ValueError: Illegal notebook path passed in /root/supermarket_regression.ipynb
j
i'm not sure if "pyflyte run" packages up the notebook. you can inspect the tarball that's uploaded.
b
In the task I see this tar ball
Copy code
s3://meta-bucket/flytesnacks/development/BXAKRFPKUSL47MC36AP7GTZZDA======/scriptmode.tar.gz
Inside it is only the workflow.py file
j
ok yea. "pyflyte run" probably doesn't work for this use case. might need a fast-register
b
What is a fast-register?
j
take a look at “pyflyte register”
b
So I ran pyflyte register to generate a tar ball
Copy code
pyflyte register -d development -p flytesnacks -i xxxxxx.dkr.ecr.us-west-2.amazonaws.com/flyteorg/flytekit:py3.8-latest ./
How do I run using that tar ball? I don't see an option in `pyflyte run`.
j
you can trigger from the UI, flyte remote or flytectl now since it has been registered
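A minimal flytekit.remote sketch for the programmatic route (workflow name, version, and inputs are placeholders; assumes the same config file used by pyflyte):
Copy code
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Connect using the same config file that pyflyte/flytectl use
remote = FlyteRemote(
    Config.auto(config_file="/home/blair/.flyte/config.yaml"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch the workflow registered by `pyflyte register` and launch it
wf = remote.fetch_workflow(name="example.training_workflow", version="<registered-version>")
execution = remote.execute(wf, inputs={"hyperparameters": {"C": 0.1}})
print(remote.generate_console_url(execution))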
maybe others have a better idea of how to get pyflyte run working with additional files
we can def make this better. some ideas:
• a more complete image with all(?) plugins
• pyflyte run packages like pyflyte register
@Yee @Eduardo Apolinario (eapolinario) cc
k
Imagespec is landing in 1.6. And pyflyte run now supports multiple files
Imagespec will auto build images
b
pyflyte run now supports multiple files
what is the syntax for multiple files? I tried this but get an error
Copy code
$ pyflyte run --remote --image xxxx.dkr.ecr.us-west-2.amazonaws.com/flyteorg/flytekit:py3.8-latest workflow1.py notebook_wf --n_estimators 100 supermarket_regression.ipynb
k
What is the ipynb?
b
It is a jupyter notebook that I want to include in the package sent to the remote server. The workflow runs a papermill task which runs that notebook
k
Is it an input?
We do not auto-package a notebook today in pyflyte run
The only ways supported today are building an image with it or doing a fast register via pyflyte register
b
Ok thanks, understood. A few messages above I did a `pyflyte register` but could not see how to then run the registered file. Do you have an example command you could share? https://flyte-org.slack.com/archives/CP2HDHKE1/p1681662219041209?thread_ts=1681545516.802469&cid=CP2HDHKE1
k
You can run it using flytectl create execution, the UI, or programmatic flytekit.remote
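A hedged flytectl sketch of that flow (the workflow/launch plan name is a placeholder; check flytectl --help for the exact flags in your version):
Copy code
# Write an execution spec for the latest registered version
flytectl get launchplan -p flytesnacks -d development example.training_workflow --latest --execFile exec_spec.yaml

# Edit the inputs in exec_spec.yaml, then launch
flytectl create execution -p flytesnacks -d development --execFile exec_spec.yaml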
b
Thanks, I got it working now. pyflyte register was creating an archive based off the root of the git project, which was a little unexpected but workable. Hopefully the last little issue: when I set the NotebookTask to render, it does not render the notebook in the Flyte web console like I see in the examples
Copy code
NotebookTask(render_deck=True, ....
I was expecting this... but got this...
k
That is not a Flyte Deck, that's just inputs and outputs
In the top
b
I clicked the "View Inputs & Outputs" link in the top right... is there a different link to click to see the notebook render?
k
yes
cc @Samhita Alla can you please help @Blair Anson understand where he can find the FlyteDecks link, sorry i am heading to bed now
s
Can you click "View inputs & outputs" of a node? You should be able to view FlyteDecks there.
b
Thanks. If I click on a node I don't see a "Flyte Deck" button next to the "RERUN" button
s
Can you add `disable_deck=False` to your `@task` decorator? Forgot to mention that you need to enable it.
Copy code
@task(disable_deck=False)
def t1() -> str:
    ...
b
My code was like this, with just a NotebookTask in addition to a workflow. I had a `render_deck=True` parameter enabled as per the web link. When I changed it to also include `disable_deck=False` it now displays the rendered notebook. Thank you! https://docs.flyte.org/projects/cookbook/en/latest/auto/integrations/flytekit_plugins/papermilltasks/simple.html
Copy code
nb = NotebookTask(
    name="pipeline-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression.ipynb"
    ),
    inputs=kwtypes(
        n_estimators=int,
        max_depth=int,
        max_features=str,
        min_samples_split=int,
        random_state=int,
    ),
    outputs=kwtypes(mae_score=float),
    requests=Resources(cpu="2", mem="1Gi"),
    render_deck=True
)
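For completeness, the combination that ended up working in this thread keeps render_deck=True and also passes disable_deck=False on the NotebookTask; a sketch of just the relevant lines, with the other arguments elided:
Copy code
import os
import pathlib

from flytekitplugins.papermill import NotebookTask

nb = NotebookTask(
    name="pipeline-nb",
    notebook_path=os.path.join(
        pathlib.Path(__file__).parent.absolute(), "supermarket_regression.ipynb"
    ),
    # inputs=..., outputs=..., requests=... as in the block above
    render_deck=True,
    disable_deck=False,  # per the suggestion above; without this the deck did not appear
)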
k
Hmm, we should make it so that setting render_deck=True is enough and disable_deck is handled automatically
b
@jeev Hi, thank you for your help so far. Can we discuss how to change the default docker image for a `NotebookTask`? I have been using a `PodTemplate` to set the default docker image for a normal `@task()`, as per the link below. However, the `NotebookTask` ignores the `image` setting in the `PodTemplate`, although it does apply other settings such as `VolumeMount`. How do I change the default docker image for a `NotebookTask` without using `pyflyte --image xxxx run xxxx`? https://docs.flyte.org/en/latest/deployment/configuration/general.html#using-default-k8s-podtemplates