# ask-the-community
t
Hi Community - I'm wondering what data patterns people are using when dealing with large data -- in our case, trained classifiers. We use a standard S3 object store and pass results between tasks in the standard Flyte pattern, which means, for example, that we sometimes train a (RF) classifier that ends up being hundreds of GB in size. It gets uploaded to S3, and in a downstream classification task (or any later classification task) it must be copied from S3 back to the compute node -- which in some cases takes 90 minutes! It may be that we need to optimize the network bandwidth our compute nodes have for these kinds of jobs, but I'm wondering if there are other patterns people use to avoid copying large amounts of data like this via the standard Flyte result-passing mechanism for tasks.
The task in question is running on an r6a.48xlarge AWS node, which claims to have 50 Gbps network bandwidth and 40 Gbps EBS-volume bandwidth. But metrics via Datadog ("receive bytes" for the pod) indicate the write of the large classifier to compute-node disk is occurring at ~50 mebibytes/s -- a couple of orders of magnitude slower than the bandwidths quoted above.
I do realize that we turned down the Annotated "BatchSize" for our FlyteDirectories because of a long-running problem where tasks that just copy data (typically lots of small files) use a huge amount of memory -- but I wouldn't think that plays a large role in the transfer of a single large file, as in the case of the classifier.
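[Editor's note: for readers unfamiliar with the pattern being described, here is a minimal, hedged sketch of Flyte result-passing with a large model artifact and the `BatchSize` annotation mentioned above. Task, type, and path names are illustrative, not the poster's actual code.]
```python
# Minimal sketch (illustrative names) of the result-passing pattern in the thread.
from typing import Annotated

from flytekit import BatchSize, task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile


@task
def train(data: Annotated[FlyteDirectory, BatchSize(100)]) -> FlyteFile:
    # ... fit the RF classifier and pickle it to local disk ...
    # Returning a FlyteFile makes flytekit upload the file to the configured
    # blob store (S3) when the task completes.
    return FlyteFile(path="/tmp/classifier.pkl")


@task
def classify(model: FlyteFile, data: FlyteDirectory) -> FlyteDirectory:
    # Consuming the FlyteFile input pulls the classifier from S3 down to this
    # pod's local disk -- the slow copy discussed in this thread.
    local_model = model.download()
    # ... load the model from local_model and score `data` ...
    return FlyteDirectory(path="/tmp/predictions")


@workflow
def pipeline(train_data: FlyteDirectory, real_data: FlyteDirectory) -> FlyteDirectory:
    model = train(data=train_data)
    return classify(model=model, data=real_data)
```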
h
Hey Thomas 👋 long time 🙂 A few things to look at: https://aws.github.io/aws-eks-best-practices/networking/monitoring/ if you haven't already. At Union.ai (😉) we have an accelerated datasets feature that leverages S3 mountpoint and custom data routing to make those "accelerated outputs" available as "accelerated inputs" to subsequent tasks seamlessly...
t
Hey @Haytham Abuelfutuh 👋 Thanks for this. I've read about mountpoint previously and understand that for some applications the throughput is much faster than alternative technologies. Are you using mountpoint to mount S3 onto the running node so that data files are loaded directly from S3 rather than being first copied locally and then loaded? Or is it just that the underlying file transfer is significantly faster and files are still copied locally first?
A related question is this: is there a way to tell Flyte "please re-use the result from task1 that already exists locally on disk for task2, if it happens that they execute on the same compute node"? I see situations in which we train a larger classifier, spend an hour writing it to S3, then spend another hour (or two!) reading it from S3 back to the same compute node to be used by the "test" task. (!) Of course you don't KNOW that you'll land on the same compute node, but if you do, this optimization would be immensely helpful.
h
There are a few benefits here: data you push to S3 will stay cached on the node... so subsequent tasks that land on the same node will just use the cached version... Faster throughput is a big bonus... There isn't a direct way to do that in Flyte... you may be able to do some of that with a scheduler plugin, task/pod labels, etc. -- haven't tried that, but we're doing other tricks at Union 🙂
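[Editor's note: an untested sketch of the "task/pod labels" idea above -- steering the tasks that share the classifier onto a labeled node group via a pod template, so that (if the group is small) they are more likely to land on the same machine. The label key/value and task signature are hypothetical, and this only raises the odds of co-location; it does not guarantee it.]
```python
# Untested sketch: pin tasks to a labeled node group via a Flyte pod template.
# The node label "workload/classifier-cache" below is made up.
from flytekit import PodTemplate, task
from flytekit.types.file import FlyteFile
from kubernetes.client import (
    V1Affinity,
    V1Container,
    V1NodeAffinity,
    V1NodeSelector,
    V1NodeSelectorRequirement,
    V1NodeSelectorTerm,
    V1PodSpec,
)

classifier_node_group = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary")],  # merged with the task's container
        affinity=V1Affinity(
            node_affinity=V1NodeAffinity(
                required_during_scheduling_ignored_during_execution=V1NodeSelector(
                    node_selector_terms=[
                        V1NodeSelectorTerm(
                            match_expressions=[
                                V1NodeSelectorRequirement(
                                    key="workload/classifier-cache",  # hypothetical node label
                                    operator="In",
                                    values=["true"],
                                )
                            ]
                        )
                    ]
                )
            )
        ),
    )
)


@task(pod_template=classifier_node_group)
def test_classifier(model: FlyteFile) -> float:
    # ... download the model, load it, evaluate on held-out data ...
    return 0.0  # placeholder metric
```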
t
@Haytham Abuelfutuh - just to clarify, is it really the case that one cannot directly tell Flyte to cache the result of data copied from S3 to the EC2 compute node across tasks? I'm not clear from your response which of the benefits you list apply to using mountpoint (and whether mountpoint is what actually provides the caching), and which apply to your reference to "other tricks" at Union -- and most importantly, whether or not it's possible to give Flyte some hint or flag about caching results on the compute node across tasks (or if instead I need to re-architect my many-task workflow into a big monolithic task to avoid all this slow data copying). I have a classification job I've just watched do this:
1. Train a classifier that is 100s of GB in size. The trained classifier is copied from EC2 => S3 (1-hour copy).
2. In the "test" phase, the same classifier is copied back to the same node (1.5-hour copy).
3. Now it's time to classify the "real" input data. We're on the same node, yay! But guess what -- we're copying the classifier down from S3, again (1.5-hour copy).
I think something must be suboptimal in our network/file I/O, because this is way too long even for 100s of GB. But we could save the 3 hours in steps 2 and 3 if only we didn't have to download something (twice!) that was just created on the same compute node. Thanks for any clarification you can provide 🙏
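[Editor's note: the "big monolithic task" alternative mentioned above, sketched out for concreteness. Folding the test and classify phases into one task means the classifier is pulled from S3 once and reused from local disk; the tradeoff is losing per-phase retries, caching, and observability. Names and paths are illustrative.]
```python
# Sketch of the monolithic-task workaround: one task downloads the model once
# and reuses the same local copy for both the test and classify phases.
from flytekit import task
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile


@task
def test_and_classify(
    model: FlyteFile,
    test_data: FlyteDirectory,
    real_data: FlyteDirectory,
) -> FlyteDirectory:
    # One S3 -> node copy instead of two.
    local_model = model.download()
    # ... load local_model and run the "test" phase on test_data ...
    # ... then classify real_data with the already-loaded model ...
    return FlyteDirectory(path="/tmp/predictions")
```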
@Haytham Abuelfutuh - Could you provide a link to the documentation at Union about your accelerated datasets feature referenced above? I'd like to read about it, but can't find anything using these search terms.