Hi! Could someone explain what Ray is and why it makes sense to integrate it with Flyte? I’m struggling a little to understand what the two technologies do together, and in what kind of example you would use a Ray task.
08/08/2023, 2:23 PM
Ray is for distributed computation tasks; it’s similar to Spark.
Essentially, imagine you have a CSV that is 100GB and you want to do some processing on it. You’re probably not going to want to load that into memory on 1 machine.
With Ray, we can split that up and put a fraction of it on each of any number of machines. It’s generally easier to get lots of small machines than it is to get one massive machine.
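A rough sketch of that split-and-process idea, in case it helps. The helper names (`split_into_chunks`, `partial_sum`, `total_with_ray`) and the choice of summing numbers are made up for illustration; only `ray.init`, `ray.remote`, and `ray.get` are actual Ray calls.

```python
from typing import List


def split_into_chunks(rows: List[float], num_chunks: int) -> List[List[float]]:
    """Split rows into roughly equal contiguous chunks, one per worker."""
    size = max(1, -(-len(rows) // num_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum(chunk: List[float]) -> float:
    """The work done on one machine: it only ever sees its own chunk."""
    return sum(chunk)


def total_with_ray(rows: List[float], num_chunks: int = 4) -> float:
    """Hypothetical driver: fan the chunks out as Ray tasks, then combine."""
    import ray  # imported lazily so the helpers above work without Ray installed

    # Starts a local Ray instance when no cluster is configured.
    ray.init(ignore_reinit_error=True)
    remote_sum = ray.remote(partial_sum)  # wrap the plain function as a Ray task
    futures = [remote_sum.remote(chunk) for chunk in split_into_chunks(rows, num_chunks)]
    return sum(ray.get(futures))  # block until every chunk has been processed
```

For a real 100GB CSV you probably would not split the rows in the driver like this. Something like Ray Data’s `read_csv` lets the workers read their own pieces, but the shape of the computation is the same.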
Within Flyte, we can use Ray for individual tasks as part of a larger pipeline. Flyte can automatically spin up a Ray cluster of whatever size we want, run the task on it, pass the output to the next Flyte task, and shut down the Ray cluster once it is no longer needed.
IMO it’s easiest to understand Ray in terms of big datasets, but you can also use it for training jobs etc.
08/08/2023, 2:26 PM
+1. When using Ray, the biggest hurdle is spinning up a cluster. That’s what Flyte solves: it lets you dynamically get a cluster suited to your workload.
It also makes it easy to move data from, let’s say, Snowflake to Ray to distributed ML training to a Python script. Ray is not a hammer for all jobs, and neither is Flyte, but together they probably come close.
08/09/2023, 11:09 AM
Awesome explanation guys, this clears it up a lot. Thank you!