# ask-the-community
c
Hi everyone. I tried to run TensorFlow jobs on the dev sandbox cluster but ran into an error. These are my commands:
flytectl demo start --dev
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
POD_NAMESPACE=flyte ./flyte start --config ../_common/configs/kubeflow.yaml
And this is my config file.
d
The error suggests that the training operator is unavailable and cannot process the webhook. The webhook is called when a TensorFlow job is created, and the operator then creates the Pod objects for that job. Can you check whether the operator in the kubeflow namespace is running and whether it logs anything?
$ kubectl get pods -n kubeflow
NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-676cc457bc-zs5bv   1/1     Running   0          2d6h
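A minimal sketch of the same check as a reusable bash function, assuming the default `kubectl get pods` column layout shown above (NAME, READY, STATUS, RESTARTS, AGE) and a pod name starting with `training-operator-`. The function names `operator_ready` and `show_operator_status` are hypothetical, and the Deployment name `training-operator` is inferred from the pod name rather than confirmed in the thread.

```shell
#!/usr/bin/env bash

# Reads `kubectl get pods` output on stdin and exits 0 only if a pod whose
# name starts with "training-operator-" has all containers ready and is Running.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
operator_ready() {
  awk '
    NR > 1 && $1 ~ /^training-operator-/ {
      split($2, r, "/")                      # READY looks like "<ready>/<total>"
      if (r[1] == r[2] && $3 == "Running") ok = 1
    }
    END { exit (ok ? 0 : 1) }
  '
}

# Hypothetical wrapper: runs the check against the live cluster, then prints
# recent operator logs. The Deployment name is inferred from the pod name.
show_operator_status() {
  if kubectl get pods -n kubeflow | operator_ready; then
    echo "training operator is running"
  else
    echo "training operator is NOT ready" >&2
  fi
  kubectl logs -n kubeflow deployment/training-operator --tail=50
}
```

Calling `show_operator_status` reports whether the pod passed the check and then shows the last 50 lines of operator logs, so both questions asked above are answered in one step.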
c
Yes, it is running.
d
Could you try deleting the validating webhook and then installing a stable version of the operator?
$ kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validator.training-operator.kubeflow.org
$ kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
Judging from https://github.com/kubeflow/training-operator/issues/2080, there may be problems with the webhook, and it does not exist in older versions of the operator.
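The two commands above can be wrapped into a rerunnable bash sketch. It assumes the webhook and overlay names from the thread; `run`, `reinstall_operator`, `DRY_RUN`, and `OPERATOR_REF` are hypothetical names, the `--ignore-not-found` flag is added so a second run does not fail once the webhook is already gone, and the final `rollout status` step assumes the operator Deployment is called `training-operator`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: with DRY_RUN=1, print each command instead of executing it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Release tag suggested in the thread; override with OPERATOR_REF=... if needed.
OPERATOR_REF="${OPERATOR_REF:-v1.7.0}"
WEBHOOK_NAME="validator.training-operator.kubeflow.org"
OVERLAY="github.com/kubeflow/training-operator/manifests/overlays/standalone"

reinstall_operator() {
  # 1. Remove the validating webhook; --ignore-not-found makes a rerun harmless.
  run kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io \
    "$WEBHOOK_NAME" --ignore-not-found
  # 2. Re-apply the standalone overlay pinned to a release tag.
  run kubectl apply -k "${OVERLAY}?ref=${OPERATOR_REF}"
  # 3. Wait until the operator Deployment (name inferred from the pod name) has rolled out.
  run kubectl rollout status deployment/training-operator -n kubeflow --timeout=120s
}
```

Running `DRY_RUN=1 reinstall_operator` only prints the three commands, which makes it easy to review them before running `reinstall_operator` against the cluster.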
c
@Dennis Keck It works! Thank you very much.
d
不用謝 (You're welcome) 🙂
l
Amazing Chinese!
b
lolwtf