Nan Qin
03/14/2023, 8:42 PM
Evan Sadler
03/14/2023, 10:01 PM
The Databricks Flyte tasks launch successfully, but hang indefinitely after the DB job finishes successfully. So far I have tried looking at flytepropeller, but I haven't found any logs relating to the execution id in question. Any tips on ways to debug are much appreciated 🙏
Jimmy Du
03/15/2023, 1:24 AM
`Killed`. New files aren't generated at the specified output prefix, raw output prefix, or checkpoint path locations.
honnix
03/15/2023, 11:35 AM
sigs.k8s.io/yaml in flytectl: https://github.com/flyteorg/flytectl/blob/cea39d9bdf2476f9a5313d1bf19bf08b3923237a/cmd/create/execution_util.go#L13. I'm asking because it depends on go-yaml v2, which does not support YAML 1.2. More specifically, it has weird handling of `n`, `y`, `no`, `yes`, etc. when they are used as key names; see https://github.com/go-yaml/yaml/issues/214. If an input argument is named `n`, then when creating an execution via an execution spec file, that argument must be quoted as `"n"`; otherwise flytectl fails, because `n` is parsed as `false` when the argument name is unmarshalled.
Emanuel Hasselberg
03/15/2023, 2:12 PM
import subprocess

from flytekit import task
from flytekit.types.file import FlyteFile


@task(cache=True, cache_version="2.0")
def process_video() -> FlyteFile:
    """
    Run clip_extract on video
    """
    input_path = FlyteFile("s3://my-s3-bucket/dataset/test.asf")
    output_file = FlyteFile("output.mjpg")
    command = ['ffmpeg', '-i', str(input_path), '-c:v', 'mjpeg', '-q:v', '3', '-an', str(output_file)]
    subprocess.check_call(command)
    return output_file
justin hallquist
03/15/2023, 3:35 PM
flytekit.exceptions.user.FlyteAssertion: Failed to put data from /tmp/tmp030l19_7/script_mode.tar.gz to http://localhost:30084/my-s3-bucket/a/b/PJPRTCSJDCALQRYI44IFNWNQ5M%3D%3D%3D%3D%3D%3D/scriptmode.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20230315%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230315T152958Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-md5%3Bhost&X-Amz-Signature=9b449916e2b9ca077d45255d05f86fec463c3e999e14c2c38a0075928906afa0 (recursive=False).
Original exception: HTTPConnectionPool(host='localhost', port=30084): Max retries exceeded with url: /my-s3-bucket/a/b/PJPRTCSJDCALQRYI44IFNWNQ5M%3D%3D%3D%3D%3D%3D/scriptmode.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20230315%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230315T152958Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-md5%3Bhost&X-Amz-Signature=9b449916e2b9ca077d45255d05f86fec463c3e999e14c2c38a0075928906afa0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f205d87edc0>: Failed to establish a new connection: [Errno 111] Connection refused'))
I noticed it was using a signed URL override (not sure why it's doing that all of a sudden), and pretty much anything I change or update in the charts just doesn't apply. For example:
storage:
  # -- Sets the storage type. Supported values are sandbox, s3, gcs and custom.
  type: sandbox
  # -- bucketName defines the storage bucket flyte will use. Required for all types except for sandbox.
  bucketName: my-s3-bucket
  # -- settings for storage type s3
  signedUrl:
    stowConfigOverride:
      endpoint: http://minio.flyte.svc.cluster.local:9000
anyone have any thoughts?
Greg Gydush
03/15/2023, 4:37 PM
Frank Shen
03/15/2023, 6:02 PM
from typing import Tuple

from flytekit import Secret, current_context, task

FEATHR_SECRET_GROUP = 'aws'

@task(secret_requests=[Secret(group=FEATHR_SECRET_GROUP, key='s3_access_key'), Secret(group=FEATHR_SECRET_GROUP, key='s3_secret_key')])
def get_feathr_s3_secrets() -> Tuple[str, str]:
    context = current_context()
    s3_access_key = context.secrets.get(FEATHR_SECRET_GROUP, 's3_access_key')
    s3_secret_key = context.secrets.get(FEATHR_SECRET_GROUP, 's3_secret_key')
    return s3_access_key, s3_secret_key
Frank Shen
03/15/2023, 6:08 PM
@task(secret_requests=[Secret(group=FEATHR_SECRET_GROUP, key='xxx'), Secret(group=FEATHR_SECRET_GROUP, key='yyy')])
def get_feathr_s3_secrets() -> Tuple[str, str]:
    context = current_context()
    s3_access_key = context.secrets.get(FEATHR_SECRET_GROUP, 'xxx')
    s3_secret_key = context.secrets.get(FEATHR_SECRET_GROUP, 'yyy')
    return s3_access_key, s3_secret_key
Ahmed Laadraoui
03/16/2023, 1:40 PM
Fhuad Balogun
03/16/2023, 3:04 PM
Message:
Failed to convert return value for var o0 for function flyte.workflows.challenge_task_cpu with error <class 'AttributeError'>: __args__
SYSTEM ERROR! Contact platform administrators.
Fhuad Balogun
03/16/2023, 3:13 PM
TypeTransformerFailedError: Type of Val 'typing.Dict' is not an instance of typing.Dict[str, str]
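Both errors above would be consistent with a task whose return annotation is a bare `typing.Dict` instead of a parameterized one such as `Dict[str, str]`. That is an assumption, since the task signature isn't shown in the thread. A minimal stdlib-only sketch of the difference a type transformer would see (the helper name `describe_dict_hint` is hypothetical and not part of flytekit):

```python
import typing
from typing import Dict


def describe_dict_hint(hint: object) -> str:
    """Report the key/value types carried by a Dict annotation, if any."""
    args = typing.get_args(hint)
    if len(args) != 2:
        # A bare typing.Dict carries no key/value type arguments.
        return "unparameterized: no key/value types to inspect"
    key_type, value_type = args
    return f"keys={key_type.__name__}, values={value_type.__name__}"


if __name__ == "__main__":
    print(describe_dict_hint(Dict))            # bare annotation
    print(describe_dict_hint(Dict[str, str]))  # parameterized annotation
```

If that assumption holds, annotating the task's return type as `Dict[str, str]` (and returning a dict that matches it) may be worth trying first.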
Leiqing
03/16/2023, 4:32 PM
`n` tasks, each returning an `int` value, within a dynamic workflow, is there an easy way for me to sum up / aggregate all the results, potentially in a single task? Currently, I have to do this in a cascading manner, and it’s not that efficient given a reasonably large `n`
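One way this is commonly handled is to collect the task outputs in a Python list inside the dynamic workflow and pass that list to a single aggregating task that takes `List[int]`. The sketch below assumes that pattern fits the use case; the names `square`, `sum_all`, and `fan_out_and_sum` are hypothetical, and the `ImportError` fallback exists only so the snippet also runs where flytekit isn't installed.

```python
from typing import List

try:
    from flytekit import dynamic, task
except ImportError:  # fallback: plain functions when flytekit is unavailable
    def task(fn):
        return fn

    def dynamic(fn):
        return fn


@task
def square(x: int) -> int:
    # Stand-in for any task that returns an int.
    return x * x


@task
def sum_all(values: List[int]) -> int:
    # Single aggregation step over all collected results.
    return sum(values)


@dynamic
def fan_out_and_sum(n: int) -> int:
    # Launch n tasks, collect their outputs, then aggregate once.
    results = [square(x=i) for i in range(n)]
    return sum_all(values=results)
```

This replaces a chain of pairwise additions with one fan-in node, so the graph depth no longer grows with `n`.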
seunggs
03/16/2023, 5:33 PM
`pyflyte package` to work without causing module import issues - any help would be greatly appreciated. The project is in `/project` (root dir) and inside is the `/wf` dir with `wf_10.py`, where the workflow code lives and where I import tasks from `main.py` in the same dir. I’m running pyflyte from the `/project` dir with this command: `pyflyte --pkgs wf packages ...` and the packaging fails with:
Loading packages ['wf'] under source root /project
Failed with Unknown Exception <class 'ModuleNotFoundError'> Reason: No module named 'main'
No module named 'main'
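One possible reading of this error, assuming `wf_10.py` contains something like `from main import ...`: with `/project` as the source root, `main.py` is only importable as `wf.main`, not as a top-level `main`. The stdlib-only sketch below recreates that layout in a temporary directory to show the difference; `helper`, `build_project`, and `try_import` are hypothetical names used only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_project(root: Path, import_line: str) -> None:
    """Create root/wf/{__init__,main,wf_10}.py with the given import line."""
    pkg = root / "wf"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "main.py").write_text("def helper() -> int:\n    return 1\n")
    (pkg / "wf_10.py").write_text(import_line + "\n")


def try_import(import_line: str) -> str:
    """Import wf.wf_10 with only the project root on sys.path."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_project(root, import_line)
        sys.path.insert(0, str(root))
        try:
            importlib.invalidate_caches()
            importlib.import_module("wf.wf_10")
            return "ok"
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            return f"{type(exc).__name__}: {exc}"
        finally:
            sys.path.remove(str(root))
            # Drop cached modules so each call starts from a clean state.
            for name in list(sys.modules):
                if name == "wf" or name.startswith("wf."):
                    del sys.modules[name]


if __name__ == "__main__":
    for line in ("from main import helper",
                 "from wf.main import helper",
                 "from .main import helper"):
        print(f"{line!r}: {try_import(line)}")
```

If this matches the setup, switching the import in `wf_10.py` to `from wf.main import ...` or `from .main import ...` may resolve the packaging failure.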
seunggs
03/16/2023, 5:33 PM
`--pkgs` flag is given, shouldn’t that put the path in PYTHONPATH and be able to import Python modules in that directory (in case you’re wondering, there is an empty `__init__.py` in the `/wf` dir)?
seunggs
03/16/2023, 5:34 PM
karthikraj
03/16/2023, 6:05 PM
propeller logs:
---------------
{"json":{"src":"controller.go:157"},"level":"info","msg":"==\u003e Enqueueing workflow [examples-hbomax/f77eeb5cdbf1d432da25]","ts":"2023-03-16T18:00:23Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-03-16T18:00:23Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-03-16T18:00:23Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-03-16T18:00:23Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-03-16T18:00:25Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-03-16T18:00:25Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-03-16T18:00:25Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-03-16T18:00:27Z"}
Victor Gustavo da Silva Oliveira
03/16/2023, 6:16 PM
{
  "json": {
    "exec_id": "a67t5ktm7f6fhksmtcfr",
    "ns": "machine-learning",
    "routine": "worker-32"
  },
  "level": "warning",
  "msg": "Workflow namespace[machine-learning]/name[a67t5ktm7f6fhksmtcfr] has already been terminated.",
  "ts": "2023-03-16T18:10:39Z"
}
Can anyone help me with this? It would be much appreciated.
Nan Qin
03/16/2023, 6:21 PM
`pyflyte register` and got the following output. But nothing shows up as registered in the console. Any idea?
Successfully serialized 12 flyte objects
[✔] Registration experiment.flyte.workflows.example.train_base_model type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.example.train_cloak1 type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.example.train_cloak2 type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.example.score type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.example.sgf_training_wf type WORKFLOW successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.example.sgf_training_wf type LAUNCH_PLAN successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.workflow.train_base_model type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.workflow.train_cloak1 type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.workflow.train_cloak2 type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.workflow.test type TASK successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.workflow.baby_training_wf type WORKFLOW successful with version 895hlD5tQnf8V_eJ_G4Dkg==
[✔] Registration experiment.flyte.workflows.workflow.baby_training_wf type LAUNCH_PLAN successful with version 895hlD5tQnf8V_eJ_G4Dkg==
Successfully registered 12 entities
Choenden Kyirong
03/16/2023, 6:31 PM
`flytectl demo start` on a VSI (virtual server instance) and be able to reach the UI from the public internet using the public IP of the server? For example: <public_ip>:30080/console. I’ve tried this and the demo is running from within the server, but I can’t seem to reach it from my own browser. Do I need to set up a reverse proxy or do something else?
I try to navigate to <public_ip>:30080/console and nothing ends up loading despite the demo/sandbox running properly. I also curled localhost:30080/console from inside the server and I get the HTML response back.
Any sort of help or feedback would be greatly appreciated, thanks!! 🙏
Paul Lee
03/16/2023, 8:36 PM
Sabrina Lui
03/16/2023, 9:05 PM
`map<string, struct>` input in the console. Specifically, when an entry is deleted from the map, the "Launch" button becomes grayed out as if the input is invalid, even though the other entries are formatted correctly. Re-adding the deleted entry doesn't fix the issue. Is there something I'm missing or should we file a bug?
Jimmy Du
03/16/2023, 10:08 PM
message ExecutionCreateRequest {
  // Name of the project the execution belongs to.
  // +required
  string project = 1;
  // Name of the domain the execution belongs to.
  // A domain can be considered as a subset within a specific project.
  // +required
  string domain = 2;
  // User provided value for the resource.
  // If none is provided the system will generate a unique string.
  // +optional
  string name = 3;
  // Additional fields necessary to launch the execution.
  // +optional
  ExecutionSpec spec = 4;
  // The inputs required to start the execution. All required inputs must be
  // included in this map. If not required and not provided, defaults apply.
  // +optional
  core.LiteralMap inputs = 5;
}
Nan Qin
03/17/2023, 12:11 AM
`Error: docker sandbox doesn't have sufficient memory available. Please run docker system prune -a --volumes` when starting the sandbox cluster. But there is enough memory according to the `docker info` output below. Any ideas?
...
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.5GiB
Name: Mercury
ID: fa8608e9-e110-482e-8c5f-908edce3debb
Docker Root Dir: /var/lib/docker
...
seunggs
03/17/2023, 2:52 AM
`1.4.1` changed for scikit-learn models? For the return type hint, I used `RandomForestClassifier` in `1.2.7` and it was pickled, but now it seems to be `joblib`?
Mohd Shahid Khan Afridi
03/17/2023, 9:57 AM
`resolver` in the task container arguments changed from the usual `flytekit.core.python_auto_container.default_task_resolver` to `pyflyte.pypi_flytekit.site-packages.flytekit.core.python_auto_container.default_task_resolver`. Can anyone help me understand where flyte-cluster gets this resolver value from? Is there a way to control this?
Michael Tinsley
03/17/2023, 4:08 PM
`flyte-backend-flyte-binary-config` ConfigMap and everything looks correct
plugins:
  k8s:
    default-env-vars:
      AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
      AWS_METADATA_SERVICE_TIMEOUT: 5
      MLFLOW_TRACKING_UI: http://mlflow.mlflow.svc.cluster.local
However, looking at the manifest of a task using the Flyte MLFlow plugin, the env vars look like
- name: aws_metadata_service_timeout
  value: '5'
- name: mlflow_tracking_ui
  value: http://mlflow.mlflow.svc.cluster.local
- name: aws_metadata_service_num_attempts
  value: '20'
This is causing the MLFlow plugin to track locally as it can’t find `MLFLOW_TRACKING_UI`.
Is this a bug or am I doing something wrong?
Stephen
03/17/2023, 5:52 PM
`fail` even though the workflow failed? I think it’s only a UI error, but we have subworkflows that are still in the `Queued` stage even though the main workflow has now failed. I don’t see any pods in the cluster either.
Frank Shen
03/17/2023, 6:12 PM
Nan Qin
03/17/2023, 7:30 PM