How can ensure if a particular plugin is actually ...
# flyte-support
g
How can ensure if a particular plugin is actually installed properly. I followed this doc to install tensorflow-operator but after installation when I tried to launch the example workflow mentioned in the doc, I see below error in my flyteworkflow deployment.
Copy code
message: >-
    failed at Node[n0]. RuntimeExecutionError: failed during plugin execution,
    caused by: failed to execute handle for plugin [tensorflow]: found no
    current condition. Conditions: []
h
@polite-ability-4005 this is what you were seeing in MPI as well right? Is updating the task status to
Queued
rather than failing the task the correct functionality here?
g
For me it ramained in running state for 1h 40m and then aborted eventually with below error
Copy code
Workflow[flytesnacks:development:kftensorflow.tf_mnist.mnist_tensorflow_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [tensorflow]: found no current condition. Conditions: []
@tall-lock-23197
p
@hallowed-mouse-14616 yes. But I think he ran into other errors. No current conditions will not abort the task
Would you mind providing more context @gifted-train-81198
g
@polite-ability-4005 I have flyte cluster setup on GCP, I tried to upgrade it to enable tensorflow-operator plugin. I followed this doc for the same. Once upgrade is successful I tried to run one tensorflow workflow which is again mentioned in the the same doc. When I launch this workflow it gives me this error
Copy code
Workflow[flytesnacks:development:kftensorflow.tf_mnist.mnist_tensorflow_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [tensorflow]: found no current condition. Conditions: []
How can I debug this? How can I make sure if my plugin is properly enabled?
p
1. Describe tfjob (
kubectl describe <your tfjob>
) 2. Check tfjob’s pod logs (
kubectl logs <your tfjob pod
) 3. Check propeller pod logs Here is the several methods i will use to debug
172 Views