# slurm-flyte-wg
c
Hi Pryce, regarding your comments on the PR, I have a couple of questions: 1. By a consistent persistence layer, do you mean that we should implement an independent object store to hold the inputs and outputs of heterogeneous tasks (i.e., Slurm tasks and other task types) composed in a single workflow? 2. As for offloading GPU-accelerated tasks, can I assume that some tasks run on CPU (e.g., data preprocessing) while tasks like LLM finetuning should be offloaded to another cloud service with GPU nodes? If so, we would again need a consistent persistence layer for accessing inputs/outputs. Sorry for the dumb questions. I just want to make sure I fully understand what you said, and I'm wondering what the top priority for the next step will be. Thanks!
e
Not dumb questions at all! I appreciate you clarifying. Any production flyte deployment will have an object store. The original implementation of the slurm agent has a bucket to communicate between slurm and flyte. I'm just saying we should keep building with that in mind, even if we don't implement it immediately. For a V1 of this it's perfectly reasonable to assume workflows will run e2e on slurm and just pass filepaths around between tasks. The real power of this will be seamlessly composing workflows that can use slurm tasks with any other, which will require slurm connecting to the regular object store somehow. Nothing to be done now, I just want us to build with that requirement in mind!
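The V1 idea of passing filepaths between tasks can be sketched roughly as below. This is a minimal illustration in plain Python, not the slurm agent's actual code: the function names, file names, and the shared directory are all hypothetical, and ordinary functions stand in for Slurm tasks that see the same filesystem.

```python
import tempfile
from pathlib import Path


def preprocess(shared_dir: Path) -> str:
    """Hypothetical first task: write output to the shared filesystem, return its path."""
    out_path = shared_dir / "preprocessed.txt"
    out_path.write_text("hello slurm\n")
    return str(out_path)


def count_words(in_path: str, shared_dir: Path) -> str:
    """Hypothetical second task: read the upstream file by path, write its own result."""
    text = Path(in_path).read_text()
    out_path = shared_dir / "word_count.txt"
    out_path.write_text(str(len(text.split())))
    return str(out_path)


def run_pipeline(shared_dir: Path) -> str:
    # Only path strings cross the task boundary, not the data itself.
    preprocessed = preprocess(shared_dir)
    return count_words(preprocessed, shared_dir)


if __name__ == "__main__":
    # A temporary directory stands in for the cluster's shared filesystem (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        result_path = run_pipeline(Path(tmp))
        print(Path(result_path).read_text())
```

This only works while every task can see the same filesystem. Composing Slurm tasks with other task types would need the paths to resolve against the regular object store instead, which is the requirement described above.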
c
Thank you so much for taking time to explain this so clearly! I'll keep this requirement in mind as we move forward. 🙏
d
Hi Pryce, can you give me an example of a "consistent persistence layer"? I think abao's task can define the input/output interface for all 3 types of slurm agent
so 1. do we need to run GPU tasks on the slurm agent now? 2. why do we need to think about the persistence layer? the slurm cluster should be like a databricks cluster, right? you simply run the job you want and let the slurm cluster write the output to your bucket
do you mean we need an interface to know whether a task will use a GPU or not?
so what's the next step to get this slurm agent merged into flytekit? most of the code looks good to me
or can I push this to merge?
e
I just meant we should keep building with the object store in mind so the workflows can be composed of different task types. I haven't tried the latest implementation, but yeah, let's get the review process going in that PR; we can always add things later. Let's not worry about GPU for now. That can be handled by ShellTask, and whether the slurm worker nodes have GPUs or not is up to the user.
d
yes no problem
I just tested 3 task types with @creamy-shampoo-53278
will review more about slurm agent