Adrian Rumpold
05/04/2023, 9:52 AM
`all` or a single namespace?
Based on how the information is used in `controller.go`, line 516, I fear that only a single namespace is allowed, right? If so, would you consider using something like xns-informer (GitHub) to allow a list instead?
My use case: I'm trying to deploy multiple instances of the Flyte control plane in a single cluster for review during CI runs of a deployment configuration. Having multiple propellers configured with `limit-namespace=all` causes a race condition between them. Limiting each propeller instance to a single workflow namespace is too narrow OTOH, since in my understanding it precludes the use of k8s namespaces to separate Flyte projects (please correct me if I'm wrong here).
Thanks for your help! 🙏
Klemens Kasseroller
05/04/2023, 9:58 AM
Derek Yu
05/04/2023, 2:18 PM
`templates` section here? I tried adding a `rolebinding` resource, but it didn't work. Not sure if there's a special naming convention for the key, or some missing permissions perhaps.
Nan Qin
05/04/2023, 4:51 PM
`remote.register_workflow` the tasks within the workflow are also registered. However, if one of them is a dynamic task that contains other tasks/subworkflows, those tasks/subworkflows are not registered. What is the best way to register all the tasks/subworkflows within a dynamic task using the remote Python API?
Victor Gustavo da Silva Oliveira
05/04/2023, 5:13 PM
Thomas Blom
05/04/2023, 5:35 PM
@task(requests=Resources(cpu="48", mem="24Gi"), limits=Resources(cpu="64", mem="64Gi"))
Normally I use limits to not leave available resources unused -- above I think of this as saying "I definitely need 48 cores to get my job done, but I can use up to 64". But in practice, because scheduling only looks at requests, apparently without regard for actual node utilization, this is problematic. My task above got scheduled onto a node with 64 cores, and even though my task has all 64 cores pegged at 100%, other tasks are getting scheduled onto the same node (and failing to start, waiting in the "initializing" state). K8s thinks there is room on the node, since I only requested 48 cores, and the node has 64.
In fact, even "system level" pods seem to have difficulty running well because my task is pegging all 64 cores at 100%, and no capacity appears to have been reserved for other admin-type pods on the node.
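The behavior described above can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: the `Node` class and its method names are hypothetical, memory is ignored, and real nodes usually expose somewhat less allocatable CPU than their core count. It shows why a scheduler that compares only CPU requests against node capacity still sees room next to a 48-core request on a 64-core node, and why setting the request equal to the limit closes that gap.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy node model: placement decisions look only at CPU requests."""
    allocatable_cpu: float
    requested_cpu: List[float] = field(default_factory=list)

    def free_by_requests(self) -> float:
        # Capacity considered free: allocatable minus the sum of requests.
        return self.allocatable_cpu - sum(self.requested_cpu)

    def try_schedule(self, cpu_request: float) -> bool:
        # Actual CPU usage is deliberately not consulted here.
        if cpu_request <= self.free_by_requests():
            self.requested_cpu.append(cpu_request)
            return True
        return False


# Task from the message: request 48 cores (limit 64) on a 64-core node.
node = Node(allocatable_cpu=64)
first = node.try_schedule(48)
second = node.try_schedule(8)   # hypothetical second pod, fits "on paper"
print(first, second, node.free_by_requests())

# Same node, but with the request raised to equal the limit.
strict = Node(allocatable_cpu=64)
strict.try_schedule(64)
third = strict.try_schedule(8)  # no longer fits by requests
print(third)
```

In this model the second pod is admitted even if the first task is using all 64 cores, because only the 48-core request counts against the node.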
In case it's not obvious, know that I'm fairly new to K8s scheduling, and in fact I'm not the person who set up the cluster or manages the scheduling configuration (we use Karpenter for some of this), but I'm trying to understand this myself so I can be useful to my team in solving it. I've done some googling but haven't gained much insight beyond the link posted above.
Thanks!
Thomas
Viljem Skornik
05/04/2023, 7:40 PM
er Ksy
05/05/2023, 9:24 AM
Nicholas Roberson
05/05/2023, 6:05 PM
`@task` using `remote.execute` with the arg `options=Options(labels=Labels({"test_label": "test_value"}))` and I cannot seem to get that label to attach to the pod in Kubernetes. However, this same setup works when I call it on a `@workflow`, and I can see my `"test_label"` label added to all pods spun up by the workflow. Is this to be expected? If so, is there a way to attach pod labels when running `remote.execute` on just a `@task`?
Gerry Meixiong
05/05/2023, 6:11 PM
`remote.sync(execution, sync_nodes=True)`
It seems that a gRPC call is made to sync the status of every node, resulting in a few hundred calls. This makes the workflow execution take a very long time to sync. Any suggestions on speeding this up, e.g. batch-fetching the node executions or limiting the recursive depth of syncing nodes?
Biao He
05/05/2023, 7:37 PM
Souheil Inati
05/06/2023, 6:28 PM
seunggs
05/08/2023, 4:47 AM
Failed to convert return value for var o0 for function project.wf_58_366.wf_58.dataset_tylo_credit_card_fraud with error <class 'TypeError'>: TypedDict does not support instance and class checks
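The `TypeError` in the message above is raised by Python itself whenever `isinstance()` is called against a `TypedDict` class, and can be reproduced without Flyte. The sketch below shows that, followed by a plain dataclass with the same fields, which does support `isinstance()`. The class and field names are hypothetical, and whether a dataclass is accepted as a task return type should be checked against the flytekit version in use.

```python
from dataclasses import dataclass
from typing import TypedDict


class FraudDataset(TypedDict):
    # Hypothetical keys, for illustration only.
    name: str
    rows: int


value = {"name": "credit_card_fraud", "rows": 100}

try:
    isinstance(value, FraudDataset)  # runtime checks are not supported
    message = ""
except TypeError as exc:
    message = str(exc)
print(message)


@dataclass
class FraudDatasetRecord:
    # Same hypothetical fields, as a regular class with fixed attributes.
    name: str
    rows: int


record = FraudDatasetRecord(**value)
print(isinstance(record, FraudDatasetRecord))
```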
What should I use instead if I want an alias type that is a dictionary with specific keys?
Slackbot
05/08/2023, 11:39 AM
Mathias Andersen
05/08/2023, 1:34 PM
Sebastian Büttner
05/08/2023, 3:11 PM
seunggs
05/08/2023, 5:00 PM
`--fast` flag to my `pyflyte package ...` command, but leave the `flytectl register files …` command the same and then forego the Docker build process?
Cody Scandore
05/08/2023, 6:05 PM
`pyflyte init` to initialize the directory structure below, which runs using `pyflyte run --remote ./workflows/example.py` just fine. However, when adding a custom image, I hit a `ModuleNotFoundError`, and Flyte / the image cannot seem to access my code.
Directory
.
├── docker_build.sh
├── Dockerfile
├── LICENSE
├── README.md
├── requirements.txt
└── workflows
    ├── example.py
    ├── __init__.py
example.py
Laura Lin
05/09/2023, 3:55 AM
Nandakumar Raghu
05/09/2023, 12:24 PM
error converting YAML to JSON: yaml: line 32: mapping values are not allowed in this context
dynamic "set" {
  for_each = module.labels.tags
  content {
    name  = "commonLabels.${set.key}"
    value = set.value
  }
}
Adrian Rumpold
05/09/2023, 12:39 PM
`HashMethod`, which for consistency's sake I'd like to also use as the input parameter for a second task. However, doing so results in an error message that looks quite similar to the behavior I described in my original post above (and probably has the same cause; if that's the case, I'm happy to provide a PR).
I can see why allowing annotations on both a task output and a task input may lead to weird semantics questions (what function would the `HashMethod` on the input argument serve?), but I like the consistent typing between outputs and inputs and could see the data scientists in our org stumbling across a similar problem in the future.
Thanks!
Brian O'Donovan
05/09/2023, 4:12 PM
my_shelltask = ShellTask(
    name="the_shelltask",
    debug=True,
    script="""
runs some shell code with {inputs.input1} and {inputs.input2}
""",
    inputs=kwtypes(
        input1=some_value,
        input2=some_other_value
    ),
    output_locs=[
        OutputLocation(var='output1', var_type=FlyteFile, location="/tmp/output1.file"),
        OutputLocation(var='output2', var_type=FlyteFile, location="/tmp/output2.file")
    ]
)
@task
def run_shelltask(MyDataClass, input1 = input1, input2 = input2) -> Tuple[MyDataClass, FlyteFile, FlyteFile]:
    output1, output2 = my_shelltask(input1=input1)
    return MyDataClass, output1, output2
Ophir Yoktan
05/09/2023, 4:25 PM
Cody Scandore
05/09/2023, 6:23 PM
py task --> R task --> py task
What is the recommendation for passing more complex types into R tasks? Will raw containers work for my use case, or is there another way to do this more effectively?
Frank Shen
05/09/2023, 7:42 PM
launch_plan.LaunchPlan.get_or_create(
    workflow=wf, name="your_lp_name_1", default_inputs={"a": 3}, fixed_inputs={"c": "4"}
)
joe
05/10/2023, 12:28 AM
`flytekit.approve`. The workflow runs fine natively on my local machine, but when I try to register the workflow I get the following error:
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "failed to compile workflow for [resource_type:WORKFLOW project:"myproject" domain:"workflows" name:"myproject.workflows.pytorch_training.model_training_and_approval_workflow" version:"DwDJxBGH4EDglOXa-ZZwOA==" ] with err failed to compile workflow with err Collected Errors: 1
Error 0: Code: VariableNameNotFound, Node Id: n1, Description: Variable [o0] not found on node [n1].
"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-05-09T17:24:07.713904-07:00", grpc_status:13, grpc_message:"failed to compile workflow for [resource_type:WORKFLOW project:\"myproject\" domain:\"workflows\" name:\"myproject.workflows.pytorch_training.model_training_and_approval_workflow\" version:\"DwDJxBGH4EDglOXa-ZZwOA==\" ] with err failed to compile workflow with err Collected Errors: 1\n\tError 0: Code: VariableNameNotFound, Node Id: n1, Description: Variable [o0] not found on node [n1].\n"}"
>
Any idea what's causing this?
Sebastian Büttner
05/10/2023, 9:59 AM
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/botocore/httpsession.py", line 465, in send
urllib_response = conn.urlopen(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
raise value
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 363, in connect
self.sock = conn = self._new_conn()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7f6941b61150>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/flytekit/core/data_persistence.py", line 301, in get_data
self.get(remote_path, to_path=local_path, recursive=is_multipart)
File "/usr/local/lib/python3.10/site-packages/flytekit/core/data_persistence.py", line 199, in get
return file_system.get(from_path, to_path, recursive=recursive)
File "/usr/local/lib/python3.10/site-packages/fsspec/spec.py", line 893, in get
rpaths = [p for p in rpaths if not (trailing_sep(p) or self.isdir(p))]
File "/usr/local/lib/python3.10/site-packages/fsspec/spec.py", line 893, in <listcomp>
rpaths = [p for p in rpaths if not (trailing_sep(p) or self.isdir(p))]
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 601, in isdir
return bool(self._lsdir(path))
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 394, in _lsdir
for i in it:
File "/usr/local/lib/python3.10/site-packages/botocore/paginate.py", line 269, in __iter__
response = self._make_request(current_kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/paginate.py", line 357, in _make_request
return self._method(**current_kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 943, in _make_api_call
http, parsed_response = self._make_request(
File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 966, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 202, in _send_request
while self._needs_retry(
File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 354, in _needs_retry
responses = self._event_emitter.emit(
File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/retryhandler.py", line 207, in __call__
if self._checker(**checker_kwargs):
File "/usr/local/lib/python3.10/site-packages/botocore/retryhandler.py", line 284, in __call__
should_retry = self._should_retry(
File "/usr/local/lib/python3.10/site-packages/botocore/retryhandler.py", line 320, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/local/lib/python3.10/site-packages/botocore/retryhandler.py", line 363, in __call__
checker_response = checker(
File "/usr/local/lib/python3.10/site-packages/botocore/retryhandler.py", line 247, in __call__
return self._check_caught_exception(
File "/usr/local/lib/python3.10/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
raise caught_exception
File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 281, in _do_get_response
http_response = self._send(request)
File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 377, in _send
return self.http_session.send(request)
File "/usr/local/lib/python3.10/site-packages/botocore/httpsession.py", line 494, in send
raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://flyte-exaris.s3.amazonaws.com/?list-type=2&prefix=exaris-dev%2Fdevelopment%2FAFHNPRZZHP5G2Z6PQZPJZ5BTQI%3D%3D%3D%3D%3D%3D%2Ffast0d58d290717094dfcabe1d0e8ccbfb63.tar.gz%2F&delimiter=%2F&encoding-type=url"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/pyflyte-fast-execute", line 8, in <module>
sys.exit(fast_execute_task_cmd())
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/flytekit/bin/entrypoint.py", line 497, in fast_execute_task_cmd
_download_distribution(additional_distribution, dest_dir)
File "/usr/local/lib/python3.10/site-packages/flytekit/tools/fast_registration.py", line 111, in download_distribution
FlyteContextManager.current_context().file_access.get_data(additional_distribution, os.path.join(destination, ""))
File "/usr/local/lib/python3.10/site-packages/flytekit/core/data_persistence.py", line 303, in get_data
raise FlyteAssertion(
flytekit.exceptions.user.FlyteAssertion: Failed to get data from s3://flyte-exaris/exaris-dev/development/AFHNPRZZHP5G2Z6PQZPJZ5BTQI======/fast0d58d290717094dfcabe1d0e8ccbfb63.tar.gz to /root/ (recursive=False).
Original exception: Could not connect to the endpoint URL: "https://flyte-exaris.s3.amazonaws.com/?list-type=2&prefix=exaris-dev%2Fdevelopment%2FAFHNPRZZHP5G2Z6PQZPJZ5BTQI%3D%3D%3D%3D%3D%3D%2Ffast0d58d290717094dfcabe1d0e8ccbfb63.tar.gz%2F&delimiter=%2F&encoding-type=url"
What is really weird is that many tasks have no problem reaching S3, but some do. All task pods run in the same SG, subnet, and IAM role.
This happens with larger numbers of tasks, but also when fewer tasks run in parallel and therefore make fewer calls to S3.
CoreDNS pods are not failing.
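Since the traceback bottoms out in `socket.getaddrinfo`, one way to narrow an intermittent failure like this down is to resolve the S3 hostname repeatedly from inside an affected pod and count the failures. The following is a minimal diagnostic sketch using only the standard library. The helper names `probe_dns` and `failure_rate` are hypothetical, and the `resolver` parameter exists only so the probe can be exercised without network access.

```python
import socket
import time
from typing import Callable, List, Tuple


def probe_dns(
    host: str,
    attempts: int = 5,
    delay_s: float = 0.0,
    resolver: Callable[..., object] = socket.getaddrinfo,
) -> List[Tuple[int, bool, str]]:
    """Resolve `host` repeatedly and record (attempt, ok, error text)."""
    results = []
    for attempt in range(1, attempts + 1):
        try:
            resolver(host, 443)
            results.append((attempt, True, ""))
        except socket.gaierror as exc:
            # Same exception type as at the bottom of the traceback above.
            results.append((attempt, False, str(exc)))
        if delay_s:
            time.sleep(delay_s)
    return results


def failure_rate(results: List[Tuple[int, bool, str]]) -> float:
    """Fraction of attempts that failed to resolve."""
    failed = sum(1 for _, ok, _ in results if not ok)
    return failed / len(results) if results else 0.0
```

For example, `failure_rate(probe_dns("flyte-exaris.s3.amazonaws.com", attempts=50, delay_s=0.1))` could be compared between a pod that fails and one that does not.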
I have already hunted for the source of the fault for a couple of hours; maybe somebody has a hint or an idea for me.
Kishore Vikram
05/10/2023, 10:50 AM
Fabio Grätz
05/10/2023, 11:04 AM
`helm repo update` this is not included in the installed chart. Does the helm chart need to be released again? What does the process for this look like? Thanks 🙂
Tommy Nam
05/10/2023, 11:10 AM
`pyflyte run` command with an external auth server? I can see that it is part of the flyteIdl repository but have not worked with Go or gRPC in the past, though I am open to trying nonetheless.
We are receiving 403 Forbidden errors due to the `flyte-binary` pod/deployment being unable to send the `audience` parameter. I am assuming that this is between FlyteAdmin and FlytePropeller, though I could be wrong.
So essentially, the flow is like this:
• Localhost/client/machine sends `pyflyte run --remote` command to gRPC backend - AWS ALB w/ SSL/TLS
• Auth request is successful for pyflyte/flytekit - using Auth0 as external auth server
• Web console registers and then displays workflow with UNKNOWN status
• No Pods that were requested in the pyflyte command are scheduled
• Inspection of flyte-binary deployment w/ kubectl shows that a missing `audience` parameter is needed
• Tons of requests with 403 - seems retry logic never stops - necessitates killing deployment
• Auth0 logs show that all the requests fail due to `audience`
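For context on the flow above, this is roughly what a client-credentials token request that carries an `audience` looks like for Auth0's `/oauth/token` endpoint. It is a sketch only: the domain, client ID, secret, and audience values are placeholders, `build_token_request` is a hypothetical helper, and nothing here shows how flyte-binary itself is configured. It just builds the form body so it can be compared against what the deployment actually sends.

```python
from urllib.parse import parse_qs, urlencode


def build_token_request(domain, client_id, client_secret, audience):
    """Return (url, form-encoded body) for a client-credentials token request."""
    url = f"https://{domain}/oauth/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # The field the Auth0 logs in the list above report as missing.
        "audience": audience,
    })
    return url, body


# Placeholder values, not taken from the thread.
url, body = build_token_request(
    "example.auth0.com", "my-client-id", "my-client-secret", "https://flyte.example.com"
)
print(url)
print(sorted(parse_qs(body)))
```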
Relevant GitHub issue is here with further logs/details: https://github.com/flyteorg/flyte/issues/3662
Any assistance in the matter would be greatly appreciated. We believe that this is the final step in getting Auth0 working with a flyte-binary deployment and would be more than glad to provide supporting documentation/code from forks if need be.