Laura Lin
06/02/2023, 10:35 PM
remote.sync or remote.sync_execution
This is a workflow containing a dynamic workflow that calls a map_task
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.NOT_FOUND
details = "missing entity of type TASK with identifier project:"flytetester" domain:"development" name:"MAP_TASK_NAME" version:"VERSION" "
debug_error_string = "UNKNOWN:Error received from peer ipv4:10.34.2.243:443 {grpc_message:"missing entity of type TASK with identifier project:\"flytetester\" domain:\"development\" name:\"MAP_TASK_NAME\" version:\"VERSION\" ", grpc_status:5, created_time:"2023-06-02T15:32:48.383186-07:00"}"
>
seunggs
06/03/2023, 6:05 AM
ModuleNotFoundError: No module named
'HR-Analytics--Predicting-Employee-Promotion'
HR-Analytics--Predicting-Employee-Promotion
is the project root dir name and I’m packaging the workflow via pyflyte package
and registering via flytectl register files
(both commands are showing success messages).
Here’s the dir structure:
/HR-Analytics--Predicting-Employee-Promotion   # project root dir and cwd
  /src
    main.py    # tasks here
    wf_87.py   # with wf_87 as the workflow fn name
My cwd is the project root - it’s a bit strange that pyflyte --pkgs src package …
shows that the workflow name includes the project folder itself.
packageFlyteWorkflowRes Loading packages ['src'] under source root /userRepoData/__sidetrek__/seunggs/HR-Analytics--Predicting-Employee-Promotion
Successfully serialized 8 flyte objects
Packaging HR-Analytics--Predicting-Employee-Promotion.src.main.create_df -> 0_HR-Analytics--Predicting-Employee-Promotion.src.main.create_df_1.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.main.clean_ds -> 1_HR-Analytics--Predicting-Employee-Promotion.src.main.clean_ds_1.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.main.handle_cat_cols -> 2_HR-Analytics--Predicting-Employee-Promotion.src.main.handle_cat_cols_1.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.main.split_train_test -> 3_HR-Analytics--Predicting-Employee-Promotion.src.main.split_train_test_1.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.main.train_model -> 4_HR-Analytics--Predicting-Employee-Promotion.src.main.train_model_1.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.wf_87.dataset_tylo_hr_analytics -> 5_HR-Analytics--Predicting-Employee-Promotion.src.wf_87.dataset_tylo_hr_analytics_1.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.wf_87.wf_87 -> 6_HR-Analytics--Predicting-Employee-Promotion.src.wf_87.wf_87_2.pb
Packaging HR-Analytics--Predicting-Employee-Promotion.src.wf_87.wf_87 -> 7_HR-Analytics--Predicting-Employee-Promotion.src.wf_87.wf_87_3.pb
Successfully packaged 8 flyte objects into /userRepoData/__sidetrek__/seunggs/HR-Analytics--Predicting-Employee-Promotion/flyte-workflow-package.tgz
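For context on the packaging output above: pyflyte derives each entity’s name from its file path relative to the source root, so when the source root resolves to the directory above the project folder, the folder name becomes the first module component — and that module won’t exist on the Python path inside the container, which produces exactly this ModuleNotFoundError. A simplified sketch of the path-to-module mapping (a hypothetical helper, not flytekit’s actual implementation):

```python
import os

def dotted_module_name(file_path: str, source_root: str) -> str:
    # Path relative to the source root, ".py" stripped,
    # separators replaced with dots.
    rel = os.path.relpath(file_path, source_root)
    return rel[: -len(".py")].replace(os.sep, ".")

# Source root one level too high: the project folder leaks into the name.
print(dotted_module_name(
    "/repo/HR-Analytics--Predicting-Employee-Promotion/src/main.py",
    "/repo",
))  # -> HR-Analytics--Predicting-Employee-Promotion.src.main

# Source root at the project dir: the name starts at "src",
# which is importable inside the container.
print(dotted_module_name(
    "/repo/HR-Analytics--Predicting-Employee-Promotion/src/main.py",
    "/repo/HR-Analytics--Predicting-Employee-Promotion",
))  # -> src.main
```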
I think I’m basically following the getting started guide with the same dir structure, so I’m not sure why execution is failing. Any help would be appreciated!

Sebastian Büttner
06/03/2023, 11:46 AM

Victor Churikov
06/04/2023, 8:32 AM
ModuleNotFoundError
saying my code’s module doesn’t exist.
I found that when the workflow is executed, the Python files are not present inside the Kubernetes container. I tested this by capturing the kubectl get pod <pod-name> -o yaml
output of the task pod, editing its entrypoint to sleep 99999
, then running kubectl exec -it <pod-name> -- bash
, and observed that the container starts in the /root folder, which is totally empty. I could not find the files anywhere else inside the container using tools like find
and grep
; they seem to be missing.
Should I use register_script instead? What is the difference?
Code example attached as a comment in this thread

Ariel Kaspit
06/04/2023, 9:40 AM
I set up cluster_resource_manager
as documented, but am still getting permissions errors. I followed the documentation, specifically this page: https://docs.flyte.org/en/latest/deployment/configuration/general.html#cluster-resources
This is my configuration in `values.yaml`:
configmap:
  domain:
    domains:
      - id: development
        name: development
      - id: staging
        name: staging
  namespace_config:
    namespace_mapping:
      template: "{{ domain }}"
cluster_resource_manager:
  config:
    cluster_resources:
      customData:
        - development:
            - projectQuotaCpu:
                value: "5"
            - projectQuotaMemory:
                value: 4000Mi
            - gsa:
                value: flyte@projectid.iam.gserviceaccount.com
        - staging:
            - projectQuotaCpu:
                value: "5"
            - projectQuotaMemory:
                value: 4000Mi
            - gsa:
                value: flyte@projectid.iam.gserviceaccount.com
  templates:
    - key: aa_namespace
      value: |
        apiVersion: v1
        kind: Namespace
        metadata:
          name: {{ namespace }}
        spec:
          finalizers:
            - kubernetes
    - key: aab_default_service_account
      value: |
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: default
          namespace: {{ namespace }}
          annotations:
            iam.gke.io/gcp-service-account: flyte@projectid.iam.gserviceaccount.com
    - key: ab_project_resource_quota
      value: |
        apiVersion: v1
        kind: ResourceQuota
        metadata:
          name: project-quota
          namespace: {{ namespace }}
        spec:
          hard:
            limits.cpu: {{ projectQuotaCpu }}
            limits.memory: {{ projectQuotaMemory }}
In the console, I don’t see any IAM service accounts assigned to the project (screenshot attached).
Using pyflyte
, I’m trying to run the hello-world
workflow (I use the basic workflow just for testing; it’s from flytesnacks/cookbook/core/flyte_basics/hello_world.py
) - and I get 403 Permission denied. Is there something I need to configure in the workflow itself / in ./flyte/config.yaml
?

Тигран Григорян
06/04/2023, 10:08 AM

Kevin Blanchette
06/04/2023, 2:25 PM

seunggs
06/04/2023, 8:33 PM

Faisal Anees
06/05/2023, 7:10 AM
1. The get_data
task seemed stuck in Running. I noticed that the node group had nodes with 2 CPUs each, so I ended up updating the nodes to run on 4 CPUs each. This, I think, got me out of the queued state, but now the task was failing - leading to the next issue:
2. Task failing with a died with <Signals.SIGKILL: 9>
error: This was the log for the failed task. I searched some Slack threads and someone mentioned that this error might be happening due to OOM, but I’m not sure if that’s the case here since each node had 16GB of memory. Isn’t that sufficient?
[1/1] currentAttempt done. Last Error: USER:: │
│ ❱ 760 │ │ │ │ return __callback(*args, **kwargs) │
│ │
│ /usr/local/lib/python3.10/site-packages/flytekit/bin/entrypoint.py:508 in │
│ fast_execute_task_cmd │
│ │
│ ❱ 508 │ subprocess.run(cmd, check=True) │
│ │
│ /usr/local/lib/python3.10/subprocess.py:526 in run │
│ │
│ ❱ 526 │ │ │ raise CalledProcessError(retcode, process.args, │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['pyflyte-execute', '--inputs',
's3://flyte-cluster-bucket-2023/metadata/propeller/flytesnacks-development-ff067d646b0684b76a94/n0/data/inputs.pb',
'--output-prefix',
's3://flyte-cluster-bucket-2023/metadata/propeller/flytesnacks-development-ff067d646b0684b76a94/n0/data/0',
'--raw-output-data-prefix',
's3://flyte-cluster-bucket-2023/data/2b/ff067d646b0684b76a94-n0-0',
'--checkpoint-path',
's3://flyte-cluster-bucket-2023/data/2b/ff067d646b0684b76a94-n0-0/_flytecheckpoints',
'--prev-checkpoint', '""', '--dynamic-addl-distro',
's3://flyte-cluster-bucket-2023/flytesnacks/development/4MOWXYYMZXUPWCJJKGSQ6EOI24======/script_mode.tar.gz',
'--dynamic-dest-dir', '/root', '--resolver',
'flytekit.core.python_auto_container.default_task_resolver', '--',
'task-module', 'example', 'task-name', 'get_data']' died with <Signals.SIGKILL: 9>.
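A note for readers hitting the same SIGKILL: the kill is normally the cgroup OOM killer enforcing the pod’s memory *limit*, not the node running out of its 16GB, so the relevant knobs are the task’s declared resources (or the platform-wide task defaults), not node size. A hedged sketch of the flyteadmin-side defaults (key names follow Flyte’s task-resource configuration; the values are purely illustrative):

```yaml
# Illustrative values only: tasks that declare no resources fall back to
# these platform defaults, and the container is SIGKILLed once it crosses
# the memory limit, however large the node is.
task_resources:
  defaults:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "4"
    memory: 8Gi
```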
Can someone please help me out?

Stephen
06/05/2023, 7:52 AM
I want to list recent executions (flytekit.remote.FlyteRemote.recent_executions
) and then choose an execution, and get/download the output of a particular task in that workflow. What is the recommended way?

Albert Wibowo
06/05/2023, 10:05 AM
I ran flytectl demo start
But I encountered the following error:
Error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
My docker engine is running though, and I can build images if I want to. Has anyone experienced this before?

Mücahit
06/05/2023, 4:39 PM

George D. Torres
06/05/2023, 6:00 PM
I know about with_overrides
to change resources, but that isn’t quite what I’m looking for. I’d like to assign a certain task in my workflow a different amount of memory depending on what type of inputs I give it at execution time.

Melody Lui
06/06/2023, 1:51 AM
I load a model with model=transformers.AutoModelForCausalLM.from_pretrained(...)
, but I cannot directly return the model and pass it to the next tasks because it cannot be pickled (cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object)
. Do I need to save the model with transformers’ save_pretrained
to NFS and load it when I need to use it in the next task?

Tommy Nam
06/06/2023, 12:31 PM
I have a package that is available to the pyflyte --remote
command but not included in the fast-execute runtime, whereas every other package gets included just fine.

Lev Udaltsov
06/06/2023, 1:34 PM
My execution is stuck in Running
with the following error:
Exceeded resourcequota: [BackOffError] The operation was attempted
but failed, caused by: pods "fefe21c746a0b4964b45-n0-0" is forbidden: exceeded
quota: project-quota, requested: limits.cpu=600m, used: limits.cpu=64, limited:
limits.cpu=64
In flyte_values.yaml
I have
- projectQuotaCpu:
    value: '64'
yet on GKE I have far fewer than 64 vCPUs requested.
How does Flyte determine how many CPUs you are using in your project? Is it based on Limits for Tasks rather than Requests? Has anyone had a similar issue before?

Laura Lin
06/06/2023, 4:29 PM
@dynamic
def rerun():
    relative_func = "flyte.tasks.example_task"
    relative_mod, input_func = relative_func.rsplit(".", 1)
    import_mod = importlib.import_module(relative_mod)
    rerun_task = getattr(import_mod, input_func)
    inputs = [SOME STUFF]
    map_task(rerun_task, concurrency=8)(input=inputs)
I’m getting errors like:
File "/usr/local/lib/python3.9/site-packages/flytekit/core/tracker.py", line 229, in extract_task_module
name = f.lhs
File "/usr/local/lib/python3.9/site-packages/flytekit/core/tracker.py", line 70, in lhs
return self.find_lhs()
File "/usr/local/lib/python3.9/site-packages/flytekit/core/tracker.py", line 96, in find_lhs
raise _system_exceptions.FlyteSystemException(f"Error looking for LHS in {self._instantiated_in}")
Message:
Error looking for LHS in __main__
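For context, the dynamic-import half of the snippet is standard Python and works fine on its own; the FlyteSystemException appears to come from flytekit’s tracker trying to find the module-level variable a resolved task was assigned to, which fails when resolution happens inside a function running as __main__. The import pattern in isolation (self-contained sketch, with posixpath.join standing in for the task — any importable dotted path works the same way):

```python
import importlib

def resolve(dotted_path: str):
    # "pkg.mod.attr" -> import pkg.mod, then return its "attr" attribute.
    module_path, attr_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)

# Stand-in target instead of a Flyte task, to keep this runnable anywhere.
join = resolve("posixpath.join")
print(join("flyte", "tasks"))  # -> flyte/tasks
```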
Xinzhou Liu
06/06/2023, 6:10 PM

Nickolas da Rocha Machado
06/06/2023, 7:08 PM
Request failed with status code 500 failed to create workflow in propeller create not allowed while custom resource definition is terminating
Nicholas Roberson
06/06/2023, 9:08 PM
existing_execution = flyte_remote.fetch_execution(
    project=project, name=execution_name
)
return existing_execution
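A generic guard for a lookup like the one above, as a plain-Python sketch — the stub below is hypothetical, and the exact not-found exception type flytekit raises should be confirmed for your version rather than catching Exception broadly:

```python
def fetch_or_none(fetch, **kwargs):
    # Wrap a remote lookup so a missing entity yields None instead of
    # propagating; pass something like flyte_remote.fetch_execution as
    # `fetch`.
    try:
        return fetch(**kwargs)
    except Exception:  # flytekit raises a not-found error here
        return None

# Hypothetical stub standing in for a remote whose execution is missing.
def missing_fetch(**kwargs):
    raise RuntimeError("not found")

print(fetch_or_none(missing_fetch, project="p", name="x"))  # -> None
```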
If it doesn't exist it will throw an error; however, if it does exist but is ~1 year old, will there be any issues?

Frank Shen
06/06/2023, 10:13 PM
I ran:
pyflyte register --project examples --image xxx.dkr.ecr.us-east-1.amazonaws.com/yyy:latest train/train_wf.py
It eventually failed with the error:
OverflowError: string longer than 2147483647 bytes
...
flytekit.exceptions.user.FlyteAssertion: Failed to put data from /var/folders/xn/j7gcmr5j12b7jy0nm2kfykhm0000gp/T/tmp58m18z2p/fast53e0b9edd2101668d22c8cd5fe99d0b8.tar.gz to https://dev-wm-max-flyte-us-east-1.s3.amazonaws.com/examples/development/....... (recursive=False).
Original exception: string longer than 2147483647 bytes
2147483647 bytes is 2000+ MB.
I did du -h .
and the total size of my project folder is only 380+ KB.
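One hedged way to reconcile a 380 KB du report with a 2 GB upload: fast registration tars everything under the source root, including dot-directories (a local .git, a virtualenv, data caches) that a quick glance at du output can miss. A self-contained sketch that lists the largest individual files, hidden ones included:

```python
import os

def largest_entries(root: str, top: int = 20):
    # Walk everything under root (dot-directories included) and return
    # the `top` largest files -- roughly what a source-root tarball
    # would actually pick up.
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # skip files that vanish or are unreadable
    return sorted(sizes, reverse=True)[:top]

for size, path in largest_entries("."):
    print(f"{size:>12}  {path}")
```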
Why does flyte fast register need to send 2000+ MB of data to S3?

Vrinda Vasavada
06/06/2023, 11:03 PM
Repository A
installs the package produced by repository B
and uses the tasks written in repository B
, and when I register Flyte workflows in repo A
I want it to register all the tasks from both repos.

Slackbot
06/07/2023, 8:06 AM

Mücahit
06/07/2023, 8:47 AM
In kubectl get sparkapplication ...
we see
FLYTE_MAX_ATTEMPTS: "1"
FLYTE_ATTEMPT_NUMBER: "3"
Is this some default behavior with Spark tasks or a potential bug?

Edvard Majakari
06/07/2023, 10:49 AM

Ingo Kemmerzell
06/07/2023, 11:25 AM
configuration:
  database:
    host: postgresql.mlops.svc.cluster.local
    dbname: flyteadmindb
    username: flyteuser
    password: "..."
    options: sslmode=disable
  storage:
    type: minio
    metadataContainer: "flyte-container"
    userDataContainer: "flyte-container"
    provider: s3
    providerConfig:
      # s3 Provider configuration for S3 object store
      s3:
        # disableSSL Switch to disable SSL for communicating with S3-compatible service
        disableSSL: true
        # v2Signing Flag to sign requests with v2 signature
        # Useful for s3-compatible blob stores (e.g. minio)
        v2Signing: false
        # endpoint URL of S3-compatible service
        endpoint: http://mls3api.corp.intern:9000/
        # authType Type of authentication to use for connecting to S3-compatible service (Supported values: iam, accesskey)
        authType: accesskey
        # accessKey Access key for authenticating with S3-compatible service
        accessKey: "..."
        # secretKey Secret key for authenticating with S3-compatible service
        secretKey: "..."
  logging:
    level: 5
  auth:
    enabled: true
    oidc:
      baseUrl: "https://login.microsoftonline.com/tenant_id/v2.0"
      clientId: "..."
      clientSecret: "..."
      scopes:
        - openid
        - email
        - profile
    internal:
      clientSecret: "..."
      clientSecretHash: ".."
    flyteClient:
      # clientId Client ID for Flyte client authentication
      clientId: "..."
      # redirectUri Redirect URI for Flyte client authentication
      redirectUri: "http://localhost:53593/callback"
      # scopes Scopes for Flyte client authentication
      scopes:
        - all
    authorizedUris:
      - https://login.microsoftonline.com/tenant_id/oauth2/v2.0
      - https://mlflyte.corp.intern
  inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
    storage:
      cache:
        max_size_mbs: 100
        target_gc_percent: 100
serviceAccount:
  create: true
  annotations: {}
ingress:
  create: true
  commonAnnotations:
    kubernetes.io/ingress.class: nginx
  httpAnnotations:
    nginx.ingress.kubernetes.io/app-root: /console
  grpcAnnotations:
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
deployment:
  extraEnvVars:
    - name: HTTP_PROXY
      value: "..."
    - name: HTTPS_PROXY
      value: "..."
    - name: NO_PROXY
      value: "..."
    - name: no_proxy
      value: "..."
Drew Yang
06/07/2023, 6:58 PM

zeyu pan
06/08/2023, 6:20 AM

Albert Wibowo
06/08/2023, 10:44 AM

Erik Dao
06/08/2023, 11:13 AM
My project structure is:
my_package
|-- __init__.py
|-- data.py
|-- workflow.py
my_notebook.ipynb
My workflow.py
is basically like this:
import pandas as pd
from flytekit import task, workflow

from data import generate_data, normalize_data

@task
def load_data() -> pd.DataFrame:
    return generate_data()

@task
def preprocess_data(data: pd.DataFrame) -> pd.DataFrame:
    return normalize_data(data)

@workflow
def simple_workflow():
    data = load_data()
    preprocess_data(data)
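A hedged aside on the from data import line: it only resolves because the package directory itself gets put on sys.path, at which point the same source file can be imported under two different module names, and tools that identify functions by their defining module (as Flyte’s registration machinery does) can end up seeing a mismatched or empty workflow. A self-contained illustration of the dual identity, using a throwaway package (my_pkg is a hypothetical name) built in a temp dir:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package with one module to demonstrate the issue.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "my_pkg")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "data.py"), "w") as f:
    f.write("def generate_data():\n    return [1, 2, 3]\n")

# Put both the root and the package dir on sys.path, mirroring the
# notebook's two sys.path.append calls.
sys.path[:0] = [root, pkg_dir]

flat = importlib.import_module("data")           # top-level "data"
nested = importlib.import_module("my_pkg.data")  # same file via the package

# Same source file, two distinct module objects with different names.
print(flat.__name__, nested.__name__)  # -> data my_pkg.data
print(flat is nested)                  # -> False
```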
In my notebook, I first add the path to my local package to my system path, then create a FlyteRemote instance and try to register the workflow
import os
import sys

sys.path.append(os.getcwd())
sys.path.append(os.path.join(os.getcwd(), "my_package"))

from flytekit.remote import FlyteRemote
from flytekit.configuration import Config, PlatformConfig, ImageConfig, SerializationSettings
from flytekit.configuration import DataConfig, S3Config

remote = FlyteRemote(
    config=Config(
        platform=PlatformConfig(
            endpoint=f"dns:///{os.environ['FLYTE_ENDPOINT']}",
            insecure=True,
            insecure_skip_verify=True,
        ),
        data_config=DataConfig(s3=S3Config(
            endpoint=os.environ['AWS_S3_ENDPOINT'],
            enable_debug=True,
            access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
            secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
        ))
    ),
    default_project="my_project",
    default_domain="development",
    data_upload_location=os.environ['FLYTE_S3_BUCKET'],
)
from my_package.workflow import simple_workflow

flyte_workflow = remote.register_script(
    simple_workflow,
    image_config=ImageConfig.auto_default_image(),
    version="v1",
    module_name="my_package",
    source_path="./",
)
remote.execute(flyte_workflow, inputs={})
The error I've been facing is
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "failed to compile workflow for [resource_type:WORKFLOW
project:"7b33c23287ae4d2481160448171b307c" domain:"development"
name:"my_package.worklow.simple_workflow" version:"v1" ] with err failed to compile workflow with
err Collected Errors: 1
Error 0: Code: NoNodesFound, Node Id: resource_type:WORKFLOW project:"7b33c23287ae4d2481160448171b307c"
domain:"development" name:"my_package.worklow.simple_workflow" version:"v1" , Description: Can't
find any nodes in workflow [resource_type:WORKFLOW project:"7b33c23287ae4d2481160448171b307c" domain:"development"
name:"my_package.worklow.simple_workflow" version:"v1" ].
"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to compile workflow for
[resource_type:WORKFLOW project:\"7b33c23287ae4d2481160448171b307c\" domain:\"development\"
name:\"my_package.worklow.simple_workflow\" version:\"v1\" ] with err failed to compile workflow
with err Collected Errors: 1\n\tError 0: Code: NoNodesFound, Node Id: resource_type:WORKFLOW
project:\"7b33c23287ae4d2481160448171b307c\" domain:\"development\"
name:\"my_package.worklow.simple_workflow\" version:\"v1\" , Description: Can\'t find any nodes in
workflow [resource_type:WORKFLOW project:\"7b33c23287ae4d2481160448171b307c\" domain:\"development\"
name:\"my_package.worklow.simple_workflow\" version:\"v1\" ].\n", grpc_status:13,
created_time:"2023-06-08T10:45:49.81363907+00:00"}"
Any idea on the cause of this problem and how to resolve it?
Flyte seems to require a proper Python module structure, which might not be the case in a Jupyter notebook.
Thanks,