# ask-the-community
p
I'm curious to get people's opinions here on how to go about building a file-heavy workflow. It's basically a file-based scatter-gather along 2 dimensions. Inputs get split into smaller files, those are processed and then combined based on similar attributes, more processing, and then finally all combined again into a single output file. To make sense of it all, I defined a custom dataclass that has some str attrs and some FlyteFile attrs. However, I discovered that the FlyteFile nested in the custom dataclass doesn't cross task boundaries (the FlyteFile.path attribute of downstream tasks still points to a location upstream where it was defined). Passing the FlyteFile by itself between tasks does not have this issue (the path is correct relative to the downstream pod's context), all else being equal. Then I tried using a tuple with the same attributes as the custom dataclass, but these can't be added to a list (which is what I was using to pass groups of files around). I was getting
Type of Generic List type is not supported, Transformer for type <class 'tuple'> is restricted currently
So now I'm back to using FlyteDirectory but it's.. awkward since there's no metadata about the files and they're given arbitrary names on the backend unless I explicitly name them somehow. I'm wondering how folks might go about dealing with this? Do I handle all the metadata via the FlyteFile names? Pass some sort of metadata object along with every FlyteDirectory? Maybe there's a way around the custom dataclass limitation, as that would be the most elegant solution. Thanks for reading! I appreciate any insight.
k
@Pryce I would love to learn more and help here. Can you type out what you want to do as a dummy example? From what I read, there seems to be a bug?
p
Hey @Ketan (kumare3), thanks for looking into it! Over the course of writing my dummy example I think I understand better what's going on. I believe this is all expected behavior (just not to me)! For example:
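from pathlib import Path

from flytekit import task
from flytekit.types.file import FlyteFile

@task
def get_file_contents(infile: FlyteFile) -> str:
    # infile.path is only the local destination; nothing has been downloaded yet
    local = Path(infile.path)
    print(local.exists())
    content = ''
    with open(infile, 'r') as in_:
        content = in_.read()
    return content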
This task runs without issue in its current form. However, that print statement will say `False`, and if I try to open `local` it will fail with a `File not found` error.
After (re)reading the FlyteFile spec it sounds like content is streamed from the object store when `open` is called. My issue was trying to access `FlyteFile.path` directly and passing it to another function, which would fail saying the file wasn't there. I think I falsely assumed this had to do with my custom dataclass.
I'll do some more RTFMing, but I'm guessing the `path` attribute (e.g. `/tmp/flytetsfn_0lu/local_flytekit/899733d34a9e428d2773b2c5ffd2914d/hello.txt`) specifies where a FlyteFile will be downloaded to if the `download` method is called, not where it actually exists?
s
Yes. `path` is populated after the file is downloaded. Try setting `local` to `infile.download()`.
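For anyone reading later, here is a rough toy model of the behavior discussed in this thread. It does not use flytekit at all: `LazyFile`, `remote_source`, and `demo` are hypothetical names made up for illustration, and the class only mimics the pattern described above (a `path` that names the local destination, with the bytes arriving only on `download()` or when the object is handed to `open`). Treat it as a sketch of the idea, not as how FlyteFile is actually implemented.

```python
import os
import shutil
import tempfile
from pathlib import Path


class LazyFile(os.PathLike):
    """Hypothetical stand-in for a lazily downloaded file (not flytekit code)."""

    def __init__(self, remote_source: str, path: str):
        self.remote_source = remote_source  # where the bytes currently live
        self.path = path                    # where they will live after download()
        self._downloaded = False

    def download(self) -> str:
        # Copy the "remote" bytes to the local destination on first use.
        if not self._downloaded:
            os.makedirs(os.path.dirname(self.path), exist_ok=True)
            shutil.copyfile(self.remote_source, self.path)
            self._downloaded = True
        return self.path

    def __fspath__(self) -> str:
        # open(lazy_file) calls this, so opening the object triggers the download.
        return self.download()


def demo() -> dict:
    workdir = tempfile.mkdtemp()
    remote = os.path.join(workdir, "remote", "hello.txt")
    os.makedirs(os.path.dirname(remote))
    Path(remote).write_text("hello")

    # Case 1: touching .path directly before anything has been downloaded.
    infile = LazyFile(remote, os.path.join(workdir, "local", "hello.txt"))
    exists_before = Path(infile.path).exists()

    # Case 2: download() returns a path that really exists.
    local = Path(infile.download())
    exists_after = local.exists()

    # Case 3: open() on the object itself works without an explicit download().
    other = LazyFile(remote, os.path.join(workdir, "local2", "hello.txt"))
    with open(other, "r") as in_:
        content = in_.read()

    return {
        "exists_before": exists_before,
        "exists_after": exists_after,
        "content": content,
    }


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only the ordering: anything that needs a real path on disk should receive the return value of `download()`, not the bare `path` attribute.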
p
Yep that has solved it! Helpful as always, thank you 🙏