# ask-the-community
g
Hi all! I am currently evaluating whether Flyte could be used as a workflow system to handle multiple on-prem compute clusters. A few questions:
• We have some clusters that are running k8s, but some are running SLURM. I know that Flyte handles k8s, but can it schedule jobs on top of SLURM?
• Can Flyte load-balance workflows between multiple compute clusters in different physical locations?
• Can Flyte use a custom data plane that is API-compatible with AWS S3?
k
Hi @Giacomo Dabisias IT, firstly welcome to the community. Sorry for the delay in responding; as you know, folks have been out. I will have a detailed response in a short while. In short: yes, yes, and yes.
Detailed answers:
1. You will have to write a backend plugin that can schedule onto a Slurm cluster, probably using the Slurm REST API. You can write this with the WebAPI backend plugin system using a simple CRUD interface; check out the Databricks, Snowflake, etc. plugins as examples. We are in the process of updating this to make it even better and easier to author, even in Python.
2. Yes, open-source Flyte can load-balance, but the control plane (FlyteAdmin) needs the ability to communicate with each cluster's K8s API server, and FlytePropeller (in the data plane) must be able to send gRPC messages back to the control plane. Refer to Union.ai for a more seamless setup.
3. Absolutely. We call this the extensible data persistence layer; Flyte is compatible with any S3-like blob store. The API is extensible in two places (before that, it helps to understand how Flyte handles data, using this doc):
   a. Extend the backend to allow a new metastore like S3. This is done using the flytestdlib storage API, which already uses the stow system underneath, so if your store is already supported in stow it should work out of the box.
   b. Python data persistence needs access to the metastore and the raw store. This can be extended using a protocol-based system and depends on a library API like fsspec.
Hopefully this answers your questions.
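To make point 1 concrete, here is a minimal sketch of what the "create" step of such a Slurm backend plugin might look like against slurmrestd. The endpoint path, API version, host name, and auth header are assumptions that vary by Slurm deployment; this is not an existing Flyte plugin.

```python
# Sketch of a Slurm-REST-API "create" step for a hypothetical backend plugin.
# SLURMRESTD_URL, API_VERSION, and the token header are placeholders -- check
# your slurmrestd deployment for the real values.
import json
import urllib.request

SLURMRESTD_URL = "http://slurm-head:6820"   # hypothetical address
API_VERSION = "v0.0.38"                     # varies by Slurm release

def build_submit_payload(script: str, partition: str, cpus: int, mem_mb: int) -> dict:
    """Build the JSON body for POST /slurm/<version>/job/submit."""
    return {
        "script": script,
        "job": {
            "partition": partition,
            "tasks": 1,
            "cpus_per_task": cpus,
            "memory_per_node": mem_mb,
            "environment": {"PATH": "/usr/bin:/bin"},
        },
    }

def submit_request(payload: dict, token: str) -> urllib.request.Request:
    """Prepare the submit request; a real plugin would send it, record the
    returned job id, and then poll job state in its 'get' step."""
    return urllib.request.Request(
        f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/submit",
        data=json.dumps(payload).encode(),
        headers={"X-SLURM-USER-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

The same CRUD shape (create / get / delete) is what the Databricks and Snowflake plugins mentioned above follow.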
g
Many thanks @Ketan (kumare3) 🙂
r
@Ketan (kumare3), a follow-up. Where does Flyte run? I'm assuming a small k8s cluster for the control plane, with the jobs on SLURM.
k
@Rahul Parundekar wdym? Flyte runs in k8s and orchestrates everything in k8s
Btw @Giacomo Dabisias IT how’s it going?
We currently do not have a slurm backend plugin
r
That's what I thought; I was just confirming (to rectify my limited understanding). Flyte runs on K8s, and the tasks would run on SLURM with the custom backend plugin.
Thanks!
k
Yup
s
I would also be interested in something like this if someone is interested in collaborating
k
@Simon Byrne, in fact @Sujith Samuel has something like this already
s
Oh, I would be keen to see it: is it public?
k
There is an rfc in progress that should simplify writing external plugins https://hackmd.io/@pingsutw/B1a_Bnfqi
Can you talk about your requirements
s
Sure: I admit I'm still in the exploratory phase, but basically I want a nice front-end for doing our HPC simulations and post-processing of data, using our Slurm cluster.
I want a system that will keep track of code and data in a reproducible manner, and have a nice way to embed visual summaries (plots, tables) in a nice Web GUI front-end.
I guess the key requirements are:
• be able to submit jobs to the Slurm cluster
• track the code and loaded packages used for each job
• track data on both the local parallel file system and S3 buckets
  ◦ ideally it would also manage data retention policies, but that is a later concern
• provide a web view of current status, logs, and visual summaries
r
This is more of a SLURM question @Simon Byrne, but I'm curious how autoscaling would work for SLURM. With Flyte on K8s, I'm assuming you can leverage the K8s stack (Karpenter / Cluster Autoscaler) to get resources provisioned automatically (@Ketan (kumare3), correct me if I'm wrong). If I submit a job with SLURM, will the cluster already be running with a large number of CPUs/GPUs? Or can it autoscale per the job?
Console outputs and debugging failed jobs should also be requirements I think.
s
We would rely on the Slurm scheduler: basically each Flyte task would correspond to a Slurm job (so you would need some way to describe the requested resources in the task metadata).
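One way to picture the "resources in task metadata" idea: a generated batch script could translate a task's resource requests into #SBATCH directives. The field names below are illustrative, not Flyte's actual API; this is a sketch of the mapping, assuming a simple dict of requests.

```python
# Hypothetical mapping from Flyte-style task resource metadata to an sbatch
# header. The keys ("cpu", "mem", "gpu", "time") are illustrative placeholders.

def resources_to_sbatch(requests: dict) -> str:
    """Render a requests dict, e.g. {"cpu": 4, "mem": "16G", "time": "01:00:00"},
    as the #SBATCH header of a generated batch script."""
    lines = ["#!/bin/bash"]
    if "cpu" in requests:
        lines.append(f"#SBATCH --cpus-per-task={requests['cpu']}")
    if "mem" in requests:
        lines.append(f"#SBATCH --mem={requests['mem']}")
    if "gpu" in requests:
        lines.append(f"#SBATCH --gres=gpu:{requests['gpu']}")
    if "time" in requests:
        lines.append(f"#SBATCH --time={requests['time']}")
    return "\n".join(lines)
```

A Slurm backend plugin could prepend such a header to the task's command before submission.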
(I admit I'm not that familiar with Flyte, and my Kubernetes knowledge is pretty basic)
Slurm is a batch scheduler, so the job will be in a queue until the requested resources are available
> If I submit a job with SLURM will the cluster already be running with a large number of CPUs/GPUs? Or it can autoscale as per the job.
Yes, we can typically assume it is already running (powering up/down nodes is handled by the Slurm controller)
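For reference, a Slurm job is typically described by a batch script whose #SBATCH directives request resources; the scheduler holds the job in the queue until those requests can be satisfied. The partition, resource, and time values below are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=flyte-task      # placeholder job name
#SBATCH --partition=compute        # placeholder partition
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00

srun python run_task.py            # hypothetical task entry point
```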
r
Perfect, thanks @Simon Byrne
s
FWIW, we have built something similar for running CI jobs on Slurm via Buildkite: https://github.com/CliMA/slurm-buildkite
r
Another requirement I am thinking of would be re-running failed jobs. With the Flyte TF and PyTorch examples, it seems like there's some checkpointing that allows re-running a job if it fails. I've got a very limited understanding of SLURM, but I think running long training jobs there might be more stable. With spot instances, it might still be needed, though.
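The checkpointing idea boils down to a generic pattern, sketched below in plain Python (this is not Flyte's checkpoint API, just the underlying shape): persist progress after each unit of work, and on restart resume from the last saved step instead of from scratch.

```python
# Generic checkpoint/resume pattern: save progress after every step so that
# a rerun of a failed or preempted job skips already-completed work.
import json
import os

def run_steps(total_steps: int, ckpt_path: str) -> int:
    """Run numbered steps, persisting progress after each one.
    Resumes from ckpt_path if it exists; returns the number of
    steps executed in this invocation."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    executed = 0
    for step in range(start, total_steps):
        # ... real work (e.g. one training epoch) would go here ...
        executed += 1
        with open(ckpt_path, "w") as f:  # persist progress atomicity aside
            json.dump({"step": step + 1}, f)
    return executed
```

On a spot-instance interruption, the rerun would call `run_steps` again with the same checkpoint path and only execute the remaining steps.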
s
That could certainly be useful (on large enough jobs, random hardware failures can happen), but it's less of a concern than when using spot EC2 instances
k
This is a great discussion; I think we should capture it in an issue [flyte-plugin]. Flyte also ships with a built-in batch scheduler and can respond to back pressure, etc. It natively works on k8s, but of course can send jobs to other systems like SLURM. For the community, we would love to understand the advantages of using Slurm vs. pure Flyte-on-k8s abstractions. I know @jeev at Freenome has worked with both and might be able to shed some light.
Also @Simon Byrne, backend extensions already exist in Golang, as described here: https://docs.flyte.org/projects/cookbook/en/latest/auto/core/extend_flyte/backend_plugins.html
s
Thank you both @Ketan (kumare3) and @Rahul Parundekar