freezing-tailor-85994
09/03/2025, 6:55 PM
pytorchjobs.kubeflow.org is forbidden: User "system:serviceaccount:flyte:flyte-backend-flyte-binary" cannot create resource "pytorchjobs" in API group "kubeflow.org" in the namespace "inference-staging"
Relevant section of the helm chart in thread
freezing-tailor-85994
09/03/2025, 6:56 PM
clusterResourceTemplates:
  inline:
    # This section automates the creation of the project-domain namespaces
    001_namespace.yaml: |
      apiVersion: v1
      kind: Namespace
      metadata:
        name: '{{ namespace }}'
    # This block performs the automated annotation of KSAs across all project-domain namespaces
    002_serviceaccount.yaml: |
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: default
        namespace: '{{ namespace }}'
        annotations:
          eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
    003_torch_role.yaml: |
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: torch-role
        namespace: "{{ namespace }}"
      rules:
        - apiGroups: ["*"]
          resources:
            - pods
          verbs:
            - '*'
        - apiGroups: ["*"]
          resources:
            - services
          verbs:
            - '*'
        - apiGroups: ["*"]
          resources:
            - configmaps
          verbs:
            - '*'
        - apiGroups: ["*"]
          resources:
            - persistentvolumeclaims
          verbs:
            - "*"
        - apiGroups: ["*"]
          resources:
            - pytorchjobs
          verbs:
            - '*'
    004_torch_role_binding.yaml: |
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: torch-role-binding
        namespace: "{{ namespace }}"
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: torch-role
      subjects:
        - kind: ServiceAccount
          name: default
          namespace: "{{ namespace }}"
freezing-tailor-85994
09/03/2025, 7:21 PM
pytorchjobs.kubeflow.org is forbidden: User "system:serviceaccount:flyte:flyte-backend-flyte-binary" cannot create resource "pytorchjobs" in API group "kubeflow.org" in the namespace "inference-staging"
ancient-wolf-19325
09/03/2025, 7:23 PM
The error indicates that the Flyte backend service account (flyte-backend-flyte-binary in the flyte namespace) is not allowed to create pytorchjobs.kubeflow.org resources in the target namespace. It needs permissions for the pytorchjobs resource in the kubeflow.org API group, for example through a ClusterRole bound to that service account. Your current Role/RoleBinding does not provide this: the binding's subject is the default service account in each project namespace, not the Flyte backend service account named in the error.
To resolve this, create a ClusterRole with the appropriate rules and bind it to your Flyte service account. This approach is documented for similar plugins (e.g., Dask, Ray) and is the usual way to grant access to custom resources managed by operators. Would you like more detail or a code example for the necessary ClusterRole and ClusterRoleBinding?
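For reference, a minimal sketch of such a ClusterRole and ClusterRoleBinding might look like the following. The service account name and namespace are taken from the error message above; the object names flyte-pytorchjobs and flyte-pytorchjobs-binding are hypothetical, and the verb list is an assumption that you may want to narrow or extend for your setup.

```yaml
# Sketch only: object names are hypothetical, the verb list is an assumption.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: flyte-pytorchjobs            # hypothetical name
rules:
  - apiGroups: ["kubeflow.org"]      # API group named in the error
    resources: ["pytorchjobs"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flyte-pytorchjobs-binding    # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flyte-pytorchjobs
subjects:
  - kind: ServiceAccount
    name: flyte-backend-flyte-binary # service account from the error message
    namespace: flyte
```

After applying it, something like `kubectl auth can-i create pytorchjobs.kubeflow.org --as=system:serviceaccount:flyte:flyte-backend-flyte-binary -n inference-staging` should report whether the permission is in place. If you would rather keep the grant namespace-scoped, a RoleBinding subject may also reference a service account from another namespace, so adding the Flyte backend service account as a second subject of the existing torch-role-binding is a possible alternative.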
Sources:
- Ray plugin setup guide lacks ClusterRole step