# ask-the-community
g
Hi all! I am currently evaluating whether Flyte could be used as a workflow system to handle multiple on-prem compute clusters. A few questions:
• We have some clusters that are running k8s, but some are running SLURM. I know that Flyte handles k8s, but can it schedule jobs on top of SLURM?
• Can Flyte load-balance workflows between multiple compute clusters in different physical locations?
• Can Flyte use a custom data plane that is API-compatible with AWS S3?
k
Hi @Giacomo Dabisias IT, firstly welcome to the community. Sorry for the delay in responding; as you know, folks have been out. I will have a detailed response in a short while. In short: yes, yes, and yes.
Detailed answers:
1. You will have to write a backend plugin that can schedule onto a Slurm cluster, probably using the Slurm REST API. You can write this with the WebAPI backend plugin system using a simple CRUD interface; check out the Databricks, Snowflake, etc. plugins as examples. We are in the process of updating this to make it even better and easier to author, even in Python.
2. Yes, open-source Flyte can load-balance, but the control plane (FlyteAdmin) needs the ability to communicate with each cluster's K8s API server, and FlytePropeller (in the data plane) must be able to send gRPC messages back to the control plane. Refer to Union.ai for a more seamless setup.
3. Absolutely. We call this the extensible data persistence layer; Flyte is compatible with any S3-like blob store. The API is extensible in two places (before that, it helps to understand how Flyte handles data, using this doc):
   a. Extend the backend to allow a new metastore like S3. This is done using the flytestdlib storage API, which already uses the stow system underneath, so if your store is already supported in stow it should work out of the box.
   b. Python data persistence needs access to the metastore and the raw store. This can be extended using a protocol-based system and depends on a library API like fsspec.
Hopefully this answers your questions.
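To make point 1 concrete, here is a minimal sketch of what the "create" step of such a Slurm backend plugin might look like against slurmrestd. The endpoint path, API version, host name, and auth header are assumptions that vary by Slurm deployment; this is not an existing Flyte plugin.

```python
# Sketch of a Slurm-REST-API "create" step for a hypothetical backend plugin.
# SLURMRESTD_URL, API_VERSION, and the token header are placeholders -- check
# your slurmrestd deployment for the real values.
import json
import urllib.request

SLURMRESTD_URL = "http://slurm-head:6820"   # hypothetical address
API_VERSION = "v0.0.38"                     # varies by Slurm release

def build_submit_payload(script: str, partition: str, cpus: int, mem_mb: int) -> dict:
    """Build the JSON body for POST /slurm/<version>/job/submit."""
    return {
        "script": script,
        "job": {
            "partition": partition,
            "tasks": 1,
            "cpus_per_task": cpus,
            "memory_per_node": mem_mb,
            "environment": {"PATH": "/usr/bin:/bin"},
        },
    }

def submit_request(payload: dict, token: str) -> urllib.request.Request:
    """Prepare the submit request; a real plugin would send it, record the
    returned job id, and then poll job state in its 'get' step."""
    return urllib.request.Request(
        f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/submit",
        data=json.dumps(payload).encode(),
        headers={"X-SLURM-USER-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

The same CRUD shape (create / get / delete) is what the Databricks and Snowflake plugins mentioned above follow.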
g
Many thanks @Ketan (kumare3) 🙂
r
@Ketan (kumare3), a follow-up. Where does Flyte run? I'm assuming a small k8s cluster for the control plane, with the jobs on SLURM.
k
@Rahul Parundekar wdym? Flyte runs in k8s and orchestrates everything in k8s
Btw @Giacomo Dabisias IT how’s it going?
We currently do not have a slurm backend plugin
r
That's what I thought; I was just confirming (to rectify my limited understanding). Flyte runs on K8s, and the tasks would run on SLURM with the custom backend plugin.
Thanks!
k
Yup
s
I would also be interested in something like this if someone is interested in collaborating
k
@Simon Byrne, in fact @Sujith Samuel has something like this already
s
Oh, I would be keen to see it: is it public?
k
There is an rfc in progress that should simplify writing external plugins https://hackmd.io/@pingsutw/B1a_Bnfqi
Can you talk about your requirements
s
Sure: I admit I'm still in the exploratory phase, but basically I want a nice front-end for doing our HPC simulations and post-processing of data, using our Slurm cluster.
I want a system that will keep track of code and data in a reproducible manner, and have a nice way to embed visual summaries (plots, tables) in a nice Web GUI front-end.
I guess the key requirements are:
• be able to submit jobs to the Slurm cluster
• track the code and loaded packages used for each job
• track data on both the local parallel file system and S3 buckets
  ◦ ideally it would also manage data retention policies, but that is a later concern
• provide a web view of current status, logs, and visual summaries
r
This is more of a SLURM question @Simon Byrne, but I'm curious how autoscaling would work for SLURM. With Flyte on K8s, I'm assuming you can leverage the K8s stack (Karpenter / Cluster Autoscaler) to get resources provisioned automatically (@Ketan (kumare3), correct me if I'm wrong). If I submit a job with SLURM, will the cluster already be running with a large number of CPUs/GPUs? Or can it autoscale per the job?
Console outputs and debugging failed jobs should also be requirements I think.
s
We would rely on the Slurm scheduler: basically each Flyte task would correspond to a Slurm job (so you would need some way to describe the requested resources in the task metadata).
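One way to picture the "resources in task metadata" idea: a generated batch script could translate a task's resource requests into #SBATCH directives. The field names below are illustrative, not Flyte's actual API; this is a sketch of the mapping, assuming a simple dict of requests.

```python
# Hypothetical mapping from Flyte-style task resource metadata to an sbatch
# header. The keys ("cpu", "mem", "gpu", "time") are illustrative placeholders.

def resources_to_sbatch(requests: dict) -> str:
    """Render a requests dict, e.g. {"cpu": 4, "mem": "16G", "time": "01:00:00"},
    as the #SBATCH header of a generated batch script."""
    lines = ["#!/bin/bash"]
    if "cpu" in requests:
        lines.append(f"#SBATCH --cpus-per-task={requests['cpu']}")
    if "mem" in requests:
        lines.append(f"#SBATCH --mem={requests['mem']}")
    if "gpu" in requests:
        lines.append(f"#SBATCH --gres=gpu:{requests['gpu']}")
    if "time" in requests:
        lines.append(f"#SBATCH --time={requests['time']}")
    return "\n".join(lines)
```

A Slurm backend plugin could prepend such a header to the task's command before submission.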
(I admit I'm not that familiar with Flyte, and my Kubernetes knowledge is pretty basic)
Slurm is a batch scheduler, so the job will be in a queue until the requested resources are available
> If I submit a job with SLURM will the cluster already be running with a large number of CPUs/GPUs? Or it can autoscale as per the job.
Yes, we can typically assume it is already running (powering up/down nodes is handled by the Slurm controller)
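For reference, a Slurm job is typically described by a batch script whose #SBATCH directives request resources; the scheduler holds the job in the queue until those requests can be satisfied. The partition, resource, and time values below are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=flyte-task      # placeholder job name
#SBATCH --partition=compute        # placeholder partition
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00

srun python run_task.py            # hypothetical task entry point
```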
r
Perfect, thanks @Simon Byrne
s
FWIW, we have built something similar for running CI jobs on Slurm via Buildkite: https://github.com/CliMA/slurm-buildkite
r
Another requirement I am thinking of would be re-running failed jobs. With the Flyte TF and PyTorch examples, it seems like there's some checkpointing that allows re-running a job if it fails. I've got a very limited understanding of SLURM, but I think running long training jobs there might be more stable. With spot instances, it might still be needed, though.
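The checkpointing idea boils down to a generic pattern, sketched below in plain Python (this is not Flyte's checkpoint API, just the underlying shape): persist progress after each unit of work, and on restart resume from the last saved step instead of from scratch.

```python
# Generic checkpoint/resume pattern: save progress after every step so that
# a rerun of a failed or preempted job skips already-completed work.
import json
import os

def run_steps(total_steps: int, ckpt_path: str) -> int:
    """Run numbered steps, persisting progress after each one.
    Resumes from ckpt_path if it exists; returns the number of
    steps executed in this invocation."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    executed = 0
    for step in range(start, total_steps):
        # ... real work (e.g. one training epoch) would go here ...
        executed += 1
        with open(ckpt_path, "w") as f:  # persist progress atomicity aside
            json.dump({"step": step + 1}, f)
    return executed
```

On a spot-instance interruption, the rerun would call `run_steps` again with the same checkpoint path and only execute the remaining steps.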
s
That could certainly be useful (on large enough jobs, random hardware failures can happen), but it's less of a concern than when using spot EC2 instances
k
This is a great discussion; I think we should capture it in an issue [flyte-plugin]. Flyte also ships with a built-in batch scheduler and can respond to back pressure, etc. It natively works on k8s, but of course can send jobs to other systems like SLURM. For the community, we would love to understand the advantages of using Slurm vs. pure Flyte-on-k8s abstractions. I know @jeev at Freenome has worked with both and might be able to shed some light.
Also @Simon Byrne, backend extensions already exist in Golang, as described here: https://docs.flyte.org/projects/cookbook/en/latest/auto/core/extend_flyte/backend_plugins.html
s
Thank you both @Ketan (kumare3) and @Rahul Parundekar