• Fabio Grätz

    Fabio Grätz

    1 month ago
    Hey, is there documentation for the new pod templates feature?
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    1 month ago
    Hey @Fabio Grätz, are you referring to setting a default PodTemplate for k8s Pod configuration? If so, I don't believe we have written anything up (will file an issue for it). Hopefully configuration is pretty simple, you just need to set the default-pod-template-name configuration option on FlytePropeller. When executing, FlytePropeller attempts to use the PodTemplate in the namespace that the Pod will be created in (ex. by default a pod in the project
    flytesnacks
    and domain
    development
    will look for a PodTemplate in the
    flytesnacks-development
    namespace). If that PodTemplate does not exist, it then attempts to find on in the namespace that FlytePropeller runs in.
  • The only thing to note is that PodTemplates are required to have a container set. In the implementation we override this value, because Flyte requires certain containers to be running. So when definined the default PodTemplates you need to do something like:
    apiVersion: v1
    kind: PodTemplate
    metadata:
    name: flyte-default-template
    namespace: flyte
    template:
    metadata:
    spec:
    containers:
    - name: noop
    image: <http://docker.io/rwgrim/docker-noop|docker.io/rwgrim/docker-noop>
    subdomain: "default-subdomain"
    Where in this example I defined a
    noop
    container.
  • cc @Marc Paquette
  • A link to the github issue for visibility.
  • Fabio Grätz

    Fabio Grätz

    1 month ago
    Thank you, I will try this out! The original motivation for looking into this is that I need to start my workflow tasks on a tainted node pool of high-powered machines that e.g. the flyte backend etc. wouldn’t be able to use. I have been using the default tolerations to do this but the Spark tasks don’t use them currently. I fixed this in this PR (Ketan already tagged you I think). Does this fix make sense to you? Spark on K8s currently unfortunately doesn’t use v1.PodTemplateSpec but a custom pod spec (see discussion). So I think that the pod template wouldn’t solve my problem as flytepropeller would have to translate the pod template into sparks custom pod spec. Is this correct?
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    1 month ago
    I think your PR is certainly justified. Looking a bit deeper I wonder if we should take it a step further. It looks like (and sound like in the discussion) that the spark PodSpec is a subset of the k8s PodSpec. Should we set all spark configuration according to the k8s plugin config? I see that NodeSelector, HostNetwork, SchedulerName, Affinity (among others) are in both PodSpecs but not carried over from Flyte k8s configuration. Is there any reason that these values should not be?
  • Fabio Grätz

    Fabio Grätz

    1 month ago
    No, i don’t think there is a good reason against also carrying those over. I will add this to the PR in the next days and ping you again once ready?
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    1 month ago
    And yeah, currently the default PodTemplate is only applied to Pods launched using the Pod plugin (ie. Container and Pod tasks). Maybe it makes sense to apply these values as default to all resources, including Spark. There seem to be a few options: • Spark 3 allows passing default PodTemplates, but require a volume mount I think - so it could be a little messy • We could use the Flyte default PodTemplate to initialize a default Spark PodSpec - similar to the mapping in the opposite direction that Spark 3 seems to do. If we want to transition to using the Flyte default PodTemplate rather than adding one-off configuration in the k8s plugin configuration (as we certainly want to) this may set some precedent. We would have to have a deeper discussion.
  • Sounds great! Thanks for putting this together, I'll be waiting on the changes!
  • Fabio Grätz

    Fabio Grätz

    1 month ago
    👍 Will implement the changes regarding the default config in the k8s plugin (probably on the weekend) and I’m happy to have a deeper discussion about using pod templates also for spark later on.
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    1 month ago
    cc @Ketan (kumare3) @Haytham Abuelfutuh RE: Spark plugin configuration and extending application of Flyte default PodTemplate
  • Fabio Grätz

    Fabio Grätz

    1 month ago
    Hey @Dan Rammer (hamersaw), I continued working on this but converted the PR to draft because I’m not done yet. I could use your input on something 🙂 Flyteplugins currently uses
    <http://github.com/GoogleCloudPlatform/spark-on-k8s-operator|github.com/GoogleCloudPlatform/spark-on-k8s-operator> v0.0.0-20200723154620-6f35a1152625
    . In this version,
    SparkPodSpec
    has:
    // SecurityContenxt specifies the PodSecurityContext to apply.
    	// +optional
    	SecurityContenxt *apiv1.PodSecurityContext
    Notice the
    SecurityContext
    of type
    PodSecurityContext
    . In flyteplugins we set the spark pod's security context to the DefaultPodSecurityContext accordingly:
    SecurityContenxt: config.GetK8sPluginConfig().DefaultPodSecurityContext.DeepCopy(),
    The k8s plugin config has both a
    DefaultPodSecurityContext
    as well as a
    DefaultSecurityContext
    . In the newer spark-on-k8s-operator versions, this has been fixed and there is now both the
    PodSecurityContext
    as well as the
    SecurityContext
    . Do you agree that this should be fixed in flyteplugins by using a newer version of the spark-on-k8s-operator? I tried fixing this but
    go get -v -u <http://github.com/GoogleCloudPlatform/spark-on-k8s-operator@master|github.com/GoogleCloudPlatform/spark-on-k8s-operator@master>
    gives me the following error:
    go: <http://github.com/GoogleCloudPlatform/spark-on-k8s-operator@v0.0.0-20220615230608-94775cd89ca0|github.com/GoogleCloudPlatform/spark-on-k8s-operator@v0.0.0-20220615230608-94775cd89ca0> requires
            <http://k8s.io/kubernetes@v1.19.6|k8s.io/kubernetes@v1.19.6> requires
            <http://k8s.io/api@v0.0.0|k8s.io/api@v0.0.0>: reading <http://k8s.io/api/go.mod|k8s.io/api/go.mod> at revision v0.0.0: unknown revision v0.0.0
    This appears to be a known issue due to the way k8s uses its
    go.mod
    and people have written bash scripts to work around this. I wonder whether you are others within flyteorg have experienced this before and can give me a hint how to handle this (in case you agree that upgrading spark-on-k8s-operator makes sense). Thanks 🙂
  • Ketan (kumare3)

    Ketan (kumare3)

    1 month ago
    We have sadly. This is why eventually we want out of binary backend plugins
  • But solution- cc @Haytham Abuelfutuh and @Yuvraj have seen I think
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    1 month ago
    Pinging to keep this active. I agree, if we are updating the Spark plugin configuration, we might as well try to clean up everything. In reading the issue a bit, it seems this k8s dependency problem is not going away - so if we ever want to update spark-on-k8s-operator we will need a fix here. If I understand correctly, the proposed solution adds
    replace
    commands for all of the k8s internal dependencies because the kubernetes repo is not meant to be a dependency so they declare v0.0.0 for all and use
    replace
    to point to a local version. Is this going to be a fix we can isolate to flyteplugins? Or would this need to be in flytepropeller as well? Not sure how replace cascades in the build. I think that integrating a script to pull k8s dependencies and insert replace statements in the go.mod may be solution? @Haytham Abuelfutuh / @Yuvraj thoughts?
  • Fabio Grätz

    Fabio Grätz

    1 month ago
    I can get back to this on the weekend and try whether adding the
    replace
    via the script in flyteplugins works without cascading to flytepropeller @Dan Rammer (hamersaw).
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    1 month ago
    Thanks, really appreciate your time on this! I would love to get this upgraded fully to set the lastest spark pod spec fields. but if it gets to be too much I think working with the existing spark-on-k8s version is still a good sight better than the current situation.
  • Fabio Grätz

    Fabio Grätz

    1 month ago
    Hey @Dan Rammer (hamersaw), quick update, I haven’t forgotten about this 🙂 in order to use the latest spark-on-k8s-operator, I used this script to add the required
    replace
    instructions for all used k8s packages (to later test whether this workaround can be used in flyteplugins without cascading to flytepropeller). Works smoothly. In order to get the tests green again, I’m now working on fixing
    tasks/pluginmachinery/k8s/client.go
    since
    <http://sigs.k8s.io/controller-runtime|sigs.k8s.io/controller-runtime>
    is updated from
    v0.8.2
    to
    v0.12.2
    and two tags after
    v0.8.2
    , in
    v0.9.0-alpha.0
    , the
    ClientBuilder
    which flyteplugins uses here was deprecated in favor of
    NewClientFunc
    (see commit message). I haven’t figured out how to adapt flyteplugins to that change yet but will continue working on this in the next few days… Might have to get back to you for some guidance 😅🙏
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    3 weeks ago
    Hey @Fabio Grätz, sorry I was out for a little bit so I let this go stale. Lets dive back in a wrap this up! So it looks like there was some previous attempt to bump the spark-on-k8s-operator version in this PR by just hardcoding the k8s version dependencies. In discussing this a bit further, how do you feel about just hardcoding these? it should make wrapping this up pretty quick right?
  • Fabio Grätz

    Fabio Grätz

    3 weeks ago
    Hey Dan, I also had a few busy weeks and couldn’t work on it but will have a few hours waiting at an airport tomorrow. I’ll use these to work on this 🙂
  • Hey @Dan Rammer (hamersaw), hard-coding the k8s requirements (which can be automated nicely using this script) works but it will have to be done in flytepropeller as well and upgrading the k8s version there will require some other minor fixes. Also this leads a bit into dependency hell since when upgrading k8s to newer versions >=1.22 the kubeflow training operators cause problems because flyteplugins doesn’t use the new kubeflow train operators mono-repo yet and the individual pytorch operator, tensorflow operator, … repos haven’t been updated to support k8s versions >=1.22. I’m happy to continue trying to find a good solution for this (first finish the ‘k8sPluginConfig -> spark plugin’ ticket while upgrading to newest spark operator and k8s to 1.21.x and then in a separate PR maybe continue upgrading to up-to-date k8s versions while replacing the separate legacy kubeflow train operator repositories with the new mono-repo.) But I would like to confirm with you first that hard-coding the k8s requirements not only in flyteplugins but also in flytepropeller is ok for you. Do you see another way if one ever wants to upgrade to newer k8s versions?
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    2 weeks ago
    @Fabio Grätz again, very much appreciate all the effort you're dumping into this. So without (hopefully) diving too deep I want to explain the hesitancy. We have been wanted to get out-of-core plugins integrated for some time. This means something like hashicorps grpc plugin or go's dll plugin system - where we no longer need a flyteplugins repo with everything bundled. This should solve the current issue, each plugin maintains it's own dependencies and is not compiled into FlytePropeller, etc. Out-of-core plugins will certainly not be completed by flyte 1.2 (october release), but i'm thinking i'll try to make it my project for 1.3. It sounds like you have this pretty thought out; from our side we just want to ensure that the refactor on out-of-core plugins is not too messy. It sounds like we can upgrade to the latest spark operation with 1.21 upgrade by just setting dependencies in flyteplugins and flytepropeller. Then the refactoring necessary for out-of-core plugins is removing these. Does that sound right? Trying to make sure we are not going so deep that we're adding a ton of work for ourselves later on. Also, is the
    PodSecurityContext
    the only field that we're gaining by an upgrade the latest spark operator?
  • Ketan (kumare3)

    Ketan (kumare3)

    2 weeks ago
    Cc @Yee
  • Fabio Grätz

    Fabio Grätz

    2 weeks ago
    What you are saying makes a ton of sense after having tried to find a way through the dependency hell of the different plugins compiled into flytepropeller for several hours yesterday. It is super difficult to find k8s versions that allow upgrading, in this case, to newer spark-on-k8s versions without requiring other plugins to upgrade as well which then in turn require other things to be adapted… Also with 1.21 I’m not fully through all problems yet. So I absolutely understand the hesitancy and why you want to go for a plugin system where the plugins don’t have to agree on a certain set of dependencies but maintain their own. I see two good options: • I can figure out a k8s version/way through dependency hell and upgrade to a new version of spark-on-k8s by pinning k8s dependencies in flytepropeller and flyteplugins without requiring any other changes that would later complicate the refactoring for out-of-core plugins. We would gain being up to date with all fields but I actually need to look into whether there are even others than
    PodSecurityContext
    . I will time-box this effort. • I will transfer all fields from the k8s plugin config that can be set with the current spark-on-k8s version (which would solve my current problem of using the default tolerations) and I would be happy to upgrade spark-on-k8s and also other plugins later once the plugins have been moved out of core. Does that make sense to you?
  • Dan Rammer (hamersaw)

    Dan Rammer (hamersaw)

    2 weeks ago
    Absolutely, I think the later makes the most sense. Regarding upgrading versions, what would take 10 hours for a hacky solution now should be a simple version bump once out-of-core plugins are implemented. Additionally, it will unblock you guys in the short-term. @Yee what do you think?
  • Fabio Grätz

    Fabio Grätz

    2 weeks ago
    Agreed, will go for 2. I’ll take a look at which of the fields can be updated with the current version and amend the PR. Will ping you then 👍
  • Yee

    Yee

    2 weeks ago
    yes! let’s just add everything that can be added today without wrangling dependencies