# ask-the-community
Miha Garafolj
Hello Flyte team! We are using Flyte to schedule Spark driver and executor pods. The problem is that the tolerations we use somehow don't get applied to these pods, even though they are applied correctly to all other pods. Example tolerations on a Spark pod (driver and executor are identical):
```
Tolerations:  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
```
Example tolerations on another pod (also created by Flyte):
```
Tolerations:  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
              nvidia.com/gpu=present:NoSchedule
              nvidia.com/gpu:NoSchedule op=Exists
              purpose=compute:NoSchedule
```
We already tried restarting the node pool. Is this expected behavior? Flytekit version: 1.1.0
Dan Rammer (hamersaw)
Hi @Miha Garafolj, are these tolerations specified in the k8s plugin configuration within flytepropeller?
I know there is an issue on this and a related PR whose scope has varied a bit. We have been hoping to wrap this up for some time; if this is indeed the same issue, we will work on pushing it through quickly.
@Fabio Grätz do you know if there is an immediate workaround here? Also, let's revisit the PR.
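For readers hitting the same thing, this is the config section being referred to. A minimal sketch of propeller-level tolerations under the k8s plugin config, assuming the usual `plugins.k8s` keys (key names and shape can differ between Flyte versions, so verify against your version's configuration reference):
```yaml
plugins:
  k8s:
    # Applied to every pod that FlytePropeller creates (illustrative values).
    default-tolerations:
      - key: purpose
        operator: Equal
        value: compute
        effect: NoSchedule
    # Applied only to pods whose tasks request the named resource.
    resource-tolerations:
      nvidia.com/gpu:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
The symptom in this thread is that tolerations like these land on regular task pods but never on the Spark driver/executor pods created via the Spark plugin.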
Miha Garafolj
cc @Emirhan Karagül
k
Aah, @Emirhan Karagül, this was started by @Fabio Grätz. Do you want to help push it over?
Emirhan Karagül
I'd be down if @Fabio Grätz has time. Do you guys already have a design proposal to make this work?
k
As @Dan Rammer (hamersaw) commented, there is a PR.
Dan Rammer (hamersaw)
@Emirhan Karagül so with the above PR we were attempting to upgrade the spark-on-k8s-operator as well (we figured we'd just update everything while changing the configuration). However, the newer version imposed some dependencies, specifically on the k8s version, which, because of the way this is compiled, broke other plugins; this will be fixed soon. So to wrap this up, we just need to scope the PR down to handle only the tolerations (and hopefully the other existing fields) without attempting to upgrade the spark-on-k8s-operator.
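To make the scoped-down fix concrete: the spark-on-k8s-operator's `SparkApplication` CRD already accepts tolerations on both the driver and executor pod specs, so the remaining work is for the Spark plugin to propagate the configured tolerations into those fields. A hand-written sketch of the desired end state (names and values are illustrative, not taken from the PR):
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: flyte-spark-task-example   # hypothetical name
spec:
  type: Python
  mode: cluster
  driver:
    # Tolerations from the Flyte k8s plugin config should end up here...
    tolerations:
      - key: purpose
        operator: Equal
        value: compute
        effect: NoSchedule
  executor:
    instances: 2
    # ...and here, so executor pods can also schedule onto tainted nodes.
    tolerations:
      - key: purpose
        operator: Equal
        value: compute
        effect: NoSchedule
```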