# ask-the-community
Thomas Blom
Does anyone have recommendations for dealing with sequential tasks that output and then consume the same large data (FlyteDirectory or FlyteFile) result? Based on a response in a related thread, it seems Flyte is unable to do this efficiently: instead, Flyte will upload a 100 GB result to s3 and then immediately download that same file to the same compute node as input to the next task. Also, I wonder: am I making this situation worse by having caching enabled? I would assume the cache stores a reference to the large result, but then the cache would have to check whether that result still exists, which doesn't seem to happen for local caching: removing cached results from /tmp causes the workflow to fail due to a missing cached result unless you clear the local cache.
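For context, the shape of our pipeline is roughly this (a simplified sketch; the task names and bodies are stand-ins for our real training/classification code):

```python
import os
import tempfile

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory

@task(cache=True, cache_version="1.0")
def train_classifier() -> FlyteDirectory:
    # stand-in for training: write artifacts to a local directory;
    # Flyte uploads the whole directory to s3 when the task returns
    out = tempfile.mkdtemp()
    with open(os.path.join(out, "model.bin"), "wb") as f:
        f.write(b"...large classifier weights...")
    return FlyteDirectory(out)

@task
def classify(model_dir: FlyteDirectory) -> int:
    # Flyte passes a reference; download() pulls the full contents
    # back from s3, even when this task lands on the same node
    local_path = model_dir.download()
    return sum(
        os.path.getsize(os.path.join(local_path, f))
        for f in os.listdir(local_path)
    )

@workflow
def pipeline() -> int:
    return classify(model_dir=train_classifier())
```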
Samhita Alla
Hi @Thomas Blom,
1. The upload and download are inevitable. Have you looked into file/directory streaming? This should speed up the download side: https://github.com/flyteorg/flytekit/pull/1512
2. The reference held by the cache needs to be removed, so if you want to clear the cache locally, you need to use the
pyflyte local-cache clear
command. Caching should be fine in the case of large data because, as you've mentioned, it stores a reference to the data.
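For reference, streaming with that PR looks roughly like this (a minimal sketch, assuming a flytekit version that includes the PR; the chunk size is arbitrary):

```python
from flytekit import task
from flytekit.types.file import FlyteFile

@task
def checksum(ff: FlyteFile) -> int:
    total = 0
    # open() streams from the backing store instead of downloading
    # the whole file to local disk before the task can start work
    with ff.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            total += len(chunk)
    return total
```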
Thomas Blom
Hey @Samhita Alla, thanks for your response.

I looked at the PR you linked for file/directory streaming. It's not clear to me how this will help, since the very large file we download is a classifier: we need the whole thing present in memory to do any work. (Normally I think of streaming as useful when you can request parts of a large file that can be processed independently, and do work without downloading the whole file first.) Or are you suggesting that the underlying mechanism of FlyteFile.read() is significantly faster than FlyteDirectory.download()? The latter is what we currently use to pull data from s3.

I'll admit that we're still learning how to manually intervene in the result load/save process. Flyte does much of this transparently when a task returns a result that is then consumed by a downstream task: Flyte saves the result to s3 as it comes out of task1 and downloads it from s3 as it goes into the downstream task. Our results further contain a FlyteDirectory, and in the downstream task we trigger an additional download via that FlyteDirectory, which holds our larger files. I'm not really clear on the distinction between using FlyteFile.read() and FlyteDirectory.download() like we currently do; if the former is significantly faster, it may help.

Finally, we are working on an alternate solution in which the result automatically downloaded by Flyte contains a path referencing the very large file, which lives elsewhere, in this case in an s3 bucket mounted via aws mountpoint. Our hope is to gain from the additional throughput and caching that mountpoint provides. This is what I wondered whether other people do: use built-in Flyte data management to set/get results in the backend s3 blob store, and inside those intentionally small results store references to large data that lives somewhere faster, mountable, or cacheable, to make copies cheaper and let tasks that land on the same node reuse the same data. This is what @Haytham Abuelfutuh seemed to hint at in the other thread I referred to. Something like the sketch below is what I have in mind.
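A rough sketch of that pattern (the dataclass, bucket name, and mount path are all made up, and the mountpoint setup is assumed to happen at the node level):

```python
from dataclasses import dataclass

from dataclasses_json import dataclass_json
from flytekit import task

# hypothetical: a small, cache-friendly result that only carries a
# reference to the large data, which lives in an s3 bucket mounted
# on the node via aws mountpoint (e.g. at /mnt/models)
@dataclass_json
@dataclass
class ClassifierRef:
    s3_uri: str      # e.g. "s3://our-models/classifier-v3"
    mount_path: str  # e.g. "/mnt/models/classifier-v3"

@task(cache=True, cache_version="1.0")
def publish_classifier() -> ClassifierRef:
    # training writes directly to the mounted bucket, so only this
    # tiny reference flows through Flyte's blob store and cache
    return ClassifierRef(
        s3_uri="s3://our-models/classifier-v3",
        mount_path="/mnt/models/classifier-v3",
    )

@task
def classify(ref: ClassifierRef) -> None:
    # read the large file through the mountpoint cache instead of
    # downloading it from s3 via FlyteDirectory.download()
    with open(f"{ref.mount_path}/model.bin", "rb") as f:
        _ = f.read()
```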
Samhita Alla
> Are you suggesting that the underlying mechanism of FlyteFile.read() is significantly faster than FlyteDirectory.download() -- the latter is what we currently use to pull data from s3.
I thought you might want to use streaming if you don't require the whole data at once; that may not be applicable in your case. I don't think there's a performance difference between FlyteFile and FlyteDirectory.