clever-exabyte-82294
10/29/2024, 6:54 AM
FlyteInvalidInputException. What we understand is that the signal gate node is not ready to receive the signal, hence the failure.
square-carpet-13590
10/29/2024, 5:19 PM
Error: [1/1] currentAttempt done. Last Error: USER::
[f50afe3993cc347818cc-n8-0-dn1-0-dn1-0] terminated with exit code (1). Reason [Error]. Message:
7818cc} google_bid_models}] exists, err: [signal does not exist]"}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/root/flyte/pipeline/optimization_engine/utils/on_success_utility.py", line 30, in <module>
    on_success_task(execution_name, signal_name, msg)
  File "/root/flyte/pipeline/optimization_engine/utils/on_success_utility.py", line 22, in on_success_task
    return send_signal(execution_name, signal_name, msg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/flyte/pipeline/optimization_engine/utils/utils.py", line 61, in send_signal
    remote.set_signal(signal_name, execution_name, signal_value)
  File "/usr/local/lib/python3.11/site-packages/flytekit/remote/remote.py", line 524, in set_signal
    self.client.set_signal(req)
  File "/usr/local/lib/python3.11/site-packages/flytekit/clients/raw.py", line 159, in set_signal
    return self._signal.SetSignal(signal_set_request, metadata=self._metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/grpc/_interceptor.py", line 277, in __call__
    response, ignored_call = self._with_call(
                             ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/grpc/_interceptor.py", line 329, in _with_call
    call = self._interceptor.intercept_unary_unary(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flytekit/clients/grpc_utils/wrap_exception_interceptor.py", line 44, in intercept_unary_unary
    raise e
  File "/usr/local/lib/python3.11/site-packages/flytekit/clients/grpc_utils/wrap_exception_interceptor.py", line 40, in intercept_unary_unary
    self._raise_if_exc(request, e)
  File "/usr/local/lib/python3.11/site-packages/flytekit/clients/grpc_utils/wrap_exception_interceptor.py", line 30, in _raise_if_exc
    raise FlyteInvalidInputException(request) from e
flytekit.exceptions.user.FlyteInvalidInputException
How did we discover that the gate node is not ready or the signal is not registered?
• We print list_signals by execution ID, and it returns an empty list at that time.
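The check in the bullet above could be turned into a rough measurement of how long the gate node takes to register its signal. This is only a sketch: `seconds_until_registered` and its `list_signal_ids` argument are hypothetical names, and the caller is assumed to supply a function that returns the signal IDs currently registered for the execution (for example by wrapping the `list_signals` call mentioned above; how IDs are extracted from its result is not shown here).

```python
import time
from typing import Callable, Iterable, Optional


def seconds_until_registered(
    list_signal_ids: Callable[[], Iterable[str]],  # hypothetical: returns registered signal IDs
    signal_name: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until signal_name is listed; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        # An empty listing means the gate node has not registered the signal yet.
        if signal_name in set(list_signal_ids()):
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Logging the returned delay per execution would give a number to compare between light and heavy load, which may help confirm whether the failures line up with slow gate node startup.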
Has anyone faced a similar issue, or are there any metrics we can check to debug this further?
cc @glamorous-rainbow-77959 @clever-exabyte-82294
Attached workflow screenshot: the on-success task tried sending the signal to the gate node but failed; when the on-error task later tried sending the fail signal, it succeeded.
thankful-minister-83577
>> operator?
clever-exabyte-82294
10/31/2024, 5:40 AM
>> operator, but it doesn't work.
A simple example:
Task A: wait_for_input(signal_name)
Task B: send_signal(signal_name)
A >> B ❌ (blocked, as B sends the signal and A will never receive it)
B >> A ❌ (failure, as the receiver is not ready)
So Task A and Task B have to run in parallel, and the ideal sequence is:
• Task A starts
• Task B starts
• Task B ends
• Task A ends
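Since neither ordering with >> works and the two tasks have to run in parallel, one option is to make the sending task tolerate a receiver that is not registered yet. The following is a minimal sketch of that idea, not a confirmed fix: `send_with_retry` is a hypothetical helper that retries a caller-supplied `send` function with exponential backoff when it raises one of the exception types in `retry_on`.

```python
import time
from typing import Callable, Tuple, Type


def send_with_retry(
    send: Callable[[], None],                      # hypothetical: performs one send attempt
    retry_on: Tuple[Type[BaseException], ...],     # exception types treated as "not ready yet"
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Back off 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

In the setup from this thread, `send` would presumably wrap the `remote.set_signal(signal_name, execution_name, signal_value)` call from the traceback, with `retry_on` set to `FlyteInvalidInputException`. That exception may also be raised for genuinely invalid input, so the attempt cap matters.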
But under heavy load there is no guarantee that Task A will start before Task B starts.
square-carpet-13590
10/31/2024, 9:11 AM
square-carpet-13590
10/31/2024, 9:12 AM
glamorous-rainbow-77959
11/08/2024, 12:19 PM
thankful-minister-83577
thankful-minister-83577
square-carpet-13590
11/13/2024, 4:27 AM
square-carpet-13590
11/13/2024, 4:30 AM
thankful-minister-83577
thankful-minister-83577
square-carpet-13590
11/13/2024, 5:46 AM
square-carpet-13590
11/13/2024, 5:47 AM
square-carpet-13590
11/13/2024, 5:49 AM
square-carpet-13590
11/13/2024, 5:51 AM
wait_model_signals pod might be getting scheduled on a different node, which takes its own time to start executing wait_model_signals, which spins up the gate nodes. By that point the model has completed its execution and tries to set a signal that does not exist or is not registered yet.
thankful-minister-83577
square-carpet-13590
11/13/2024, 6:09 AM
thankful-minister-83577
      ┌───────────────┐
      │               │
      │    parent     │
      │               ├──┐
      └──┬────────────┘  │
         │               │
┌────────▼─┐         ┌───▼────────┐
│  signal  │◄────────┤ task that  │
│   node   │         │ sets signal│
└──────────┘         └────────────┘
square-carpet-13590
11/13/2024, 6:20 AM
square-carpet-13590
11/13/2024, 6:23 AM
google_bid_model is the gate node. It was spawned after the second highlighted task failed, because that task tried to set the signal before the gate node was ready. The task below it (on_failure, cut off in the screenshot) was then able to set the signal, as the gate node was available by that point on the timeline.
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
square-carpet-13590
11/13/2024, 6:43 AM
square-carpet-13590
11/13/2024, 6:43 AM
glamorous-rainbow-77959
12/16/2024, 6:55 AM
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
freezing-airport-6809
freezing-airport-6809