Flyte #slurm-flyte-wg

creamy-shampoo-53278

01/17/2025, 3:03 PM

Hi Pryce, I have some questions about what you commented in the PR:
1. By "consistent persistence layer," do you mean that we should implement an independent object store to hold the inputs and outputs of heterogeneous tasks (i.e., Slurm tasks and other task types) composed in a single workflow?
2. As for offloading GPU-accelerated tasks, can I assume that some tasks run on CPU (e.g., data preprocessing) while tasks like LLM fine-tuning should be offloaded to another cloud service with GPU nodes? If so, we again need a consistent persistence layer for accessing inputs and outputs.
Sorry for the dumb questions. I just want to check that I fully understood what you said, and I'm wondering what the top priority is for the next step. Thanks!

eager-processor-63090

01/17/2025, 5:12 PM

Not dumb questions at all! I appreciate you clarifying. Any production Flyte deployment will have an object store. The original implementation of the Slurm agent used a bucket to communicate between Slurm and Flyte. I'm just saying we should keep building with that in mind, even if we don't implement it immediately. For a V1 of this, it's perfectly reasonable to assume workflows will run end to end on Slurm and just pass file paths around between tasks. The real power of this will be seamlessly composing workflows that combine Slurm tasks with any other task type, which will require Slurm to connect to the regular object store somehow. Nothing to be done now, I just want us to build with that requirement in mind!
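
A minimal sketch of the V1 idea of passing file paths between steps, assuming every step can see the same shared filesystem (as Slurm nodes typically would). This is plain Python, not the flytekit or Slurm agent API, and the function names `preprocess`, `count_records`, and `pipeline` are hypothetical, chosen only for illustration:

```python
import tempfile
from pathlib import Path


def preprocess(shared_dir: Path) -> str:
    """Hypothetical first step: write output to the shared filesystem, return only its path."""
    out = shared_dir / "preprocessed.txt"
    out.write_text("a\nb\nc\n")
    return str(out)


def count_records(input_path: str) -> int:
    """Hypothetical second step: receive a path (not the data) and read the file directly."""
    return len(Path(input_path).read_text().splitlines())


def pipeline(shared_dir: Path) -> int:
    """Chain the two steps; only the file path crosses the step boundary."""
    path = preprocess(shared_dir)
    return count_records(path)


# A temporary directory stands in for the shared filesystem (an assumption for this demo).
with tempfile.TemporaryDirectory() as tmp:
    result = pipeline(Path(tmp))
```

This only works while every step runs where the path is valid. Composing with other task types would mean replacing the local path with something both sides can reach, such as an object store URI.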

creamy-shampoo-53278

01/18/2025, 1:00 PM

Thank you so much for taking time to explain this so clearly! I'll keep this requirement in mind as we move forward. 🙏

damp-lion-88352

01/22/2025, 4:32 AM

Hi Pryce, can you give me an example of a "consistent persistence layer"? I think abao's task can define the input/output interface for all 3 types of Slurm agent.

damp-lion-88352

01/22/2025, 4:35 AM

So: 1. Do we need to run GPU tasks on the Slurm agent now? 2. Why do we need to think about the persistence layer? A Slurm cluster should be like a Databricks cluster, right? You simply run the job you want and let the Slurm cluster write the output to your bucket.

damp-lion-88352

01/22/2025, 4:36 AM

Do you mean we need an interface to know whether a task will use a GPU or not?

damp-lion-88352

01/22/2025, 4:40 AM

So what's the next step to get this Slurm agent merged into flytekit? Most of the code looks good to me.

damp-lion-88352

01/22/2025, 4:40 AM

or can I push this to merge?

eager-processor-63090

01/22/2025, 4:08 PM

I just meant we should keep building with the object store in mind so that workflows can be composed of different task types. I haven't tried the latest implementation, but yeah, let's get the review process going in that PR; we can always add things later. Let's not worry about GPU for now. That can be handled by ShellTask, and whether the Slurm worker nodes have GPUs or not is up to the user.

damp-lion-88352

01/22/2025, 4:08 PM

yes no problem

damp-lion-88352

01/22/2025, 4:11 PM

I just tested 3 task types with @creamy-shampoo-53278

damp-lion-88352

01/22/2025, 4:11 PM

I'll review the Slurm agent some more.
