#ask-the-community

Calvin Leather

09/20/2022, 4:53 PM
Hey @krishna Yerramsetty nice presentation... bioinformatics-oriented question... how have you been using FlyteFiles for indexed files? Do you have tabix indexes etc. on your bed or vcf files? This has been something my company has been thinking about/prototyping (using 2 FlyteFiles is a bit annoying because the index doesn’t end up “next to” the file it’s indexing, so in our shell tasks we have to cp the file + its index into a new temp folder, but what if we had something like FlyteDirectory that wrapped a file family with an expected number of files?)

Yee

09/20/2022, 8:07 PM
interesting
can you make an issue for this @Calvin Leather
so you want, if you return file_a, file_b, that they end up close to each other somehow? like possibly in the same prefix in s3?
could you elaborate a bit more on why that’s useful?

Calvin Leather

09/20/2022, 8:08 PM
I haven't thought enough about how this should work. It's tough to do the static checking because some of these "file families" are treated by C code/binaries as though they were one file... e.g., doing file_a/file_b doesn't make sense, rather the binary takes a filestem without the suffix
Useful because:
1. Otherwise, when flytekit downloads the files they don't always end up in the same folder, e.g., for a shell task
2. If you have 2 inputs to a task that really are 2 file families with 3 files each, the inputs/outputs look gross, even though mentally you're just passing in 2 families (not 6 distinct things)
Some examples of these families: the plink bed format (https://www.cog-genomics.org/plink/1.9/formats#bed), and vcf + vcf.tbi (an index of the file that allows fast lookups instead of linear disk scans)
One more reason: there is a wealth of tooling like htslib/pyvcf that could play really nicely with these file families, similar to how the structured dataset plays nicely with pandas, so that you could manipulate the data with python code as soon as you get into the task (rather than having to download/open the file, and then form the pyvcf etc. constructor around the file pointer)
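To make the "file family" idea a bit more concrete, here is a minimal sketch in plain Python (no Flyte involved). The `FileFamily` class and its method names are hypothetical, made up for illustration only; the `.bed`/`.bim`/`.fam` suffixes are the plink family linked above. It models the family as a stem plus a fixed set of suffixes, and treats the family as broken if any member is missing.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass(frozen=True)
class FileFamily:
    """Hypothetical helper: files that share a stem and only make sense together."""

    stem: Path                 # path without suffix, i.e. what the binary is given
    suffixes: Tuple[str, ...]  # e.g. (".bed", ".bim", ".fam")

    def members(self) -> List[Path]:
        # Append each suffix to the stem as text, so dots inside the stem survive.
        return [Path(f"{self.stem}{suffix}") for suffix in self.suffixes]

    def missing(self) -> List[Path]:
        # Members that do not exist on local disk.
        return [p for p in self.members() if not p.exists()]

    def validate(self) -> None:
        # A family with a missing member is treated as unusable.
        missing = self.missing()
        if missing:
            names = ", ".join(p.name for p in missing)
            raise FileNotFoundError(f"incomplete file family {self.stem}: missing {names}")
```

With something like this, a task signature could take 2 families instead of 6 separate files, and the binary would just be handed `family.stem`.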

Yee

09/20/2022, 8:14 PM
got it thank you.

Niels Bantilan

09/20/2022, 9:08 PM
hey Calvin, it seems like FlyteDirectory would be appropriate to use here… how are these “file families” distinct from directories?

Calvin Leather

09/20/2022, 9:09 PM
Yeah good question, maybe FlyteDirectory, or an extension of it, is the right way to go
Some reasons:
1. Sometimes you want to download just the index so you can check whether something exists in the large file it indexes before you download it (weaker, as this is really sugar)
2. Files in a file family typically all interrelate in a way where removing one makes the others meaningless. These semantics usually aren't there for directories of files (e.g., a bunch of parquet partitions in a directory)
I probably should play with FlyteDirectory more on our end and see if the UX is good for these setups (we've been going with 1 FlyteFile for each member of the family, and then dealing with the fact that they don't download into the same temp folder by using an mv/cp in the shell task)
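For reference, the mv/cp workaround looks roughly like the sketch below when written in Python instead of shell. This is only an illustration under assumptions: `stage_together` is a hypothetical helper name, and the input paths stand in for wherever the individual files were downloaded to.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def stage_together(paths: Iterable[Path]) -> Path:
    """Copy files that may live in different folders into one fresh temp folder."""
    workdir = Path(tempfile.mkdtemp(prefix="file_family_"))
    for src in paths:
        dest = workdir / Path(src).name
        if dest.exists():
            # Two inputs with the same file name would silently overwrite each other.
            raise FileExistsError(f"two inputs share the name {dest.name}")
        shutil.copy2(src, dest)
    return workdir
```

The tool is then pointed at the copy inside the returned folder, where the index sits next to the file it indexes. The cost is an extra copy of a potentially large file, which is part of why a first-class type would be nicer.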

Niels Bantilan

09/20/2022, 9:16 PM
“Sometimes you want to download just the index so you can check whether something exists in the large file it indexes before you download it”
interesting… yeah I think if you can write down requirements like this in an issue it would help us figure out how to extend FlyteDirectories https://github.com/flyteorg/flyte/issues/new?assignees=&labels=enhancement%2Cuntriaged&template=feature_request.yaml&title=%5BCore+feature%5D+

Greg Gydush

09/21/2022, 12:21 AM
Hey @Calvin Leather, we had a similar need when dealing with indexed files. The way we ended up solving it was with a custom type that can be extended to handle files with any number of associated indexes.
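import typing
from flytekit.types.file import FlyteFile

# TypeTransformerMeta is our own metaclass (its definition is not shown in this thread)
class FlyteFileWithIndex(FlyteFile, metaclass=TypeTransformerMeta):
    index_extensions: typing.List[str] = []
    index_requirement: typing.Literal["all", "any"] = "any"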
The type transformer looks at the list of index extensions and checks whether “any” or “all” of the index files exist, based on the specified index_requirement (“any” is useful for things like BAM, which can have either a “.bai” or a “.bam.bai” suffix; “all” is useful for strict matching). If the index(es) exist, it downloads both the file and its associated index files into the same directory; otherwise it errors. So for example, this is what a type would look like to handle BAM files, which can be defined inside the workflow or in a library that is used by the workflow:
class BamFile(FlyteFileWithIndex):
    index_extensions = [".bai", ".bam.bai"]
    index_requirement = "any"
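The “any”/“all” check described above can be sketched against the local filesystem as follows. This is not the actual type transformer (which is not shown in this thread); `find_indexes` is a hypothetical name, and the way candidate paths are built (each extension appended to the file name minus its last suffix) is an assumption inferred from the `.bai` / `.bam.bai` example.

```python
from pathlib import Path
from typing import List, Sequence


def find_indexes(
    data_file: Path,
    index_extensions: Sequence[str],
    index_requirement: str = "any",
) -> List[Path]:
    """Return the index files found next to data_file, or raise if the requirement fails."""
    if index_requirement not in ("any", "all"):
        raise ValueError(f"index_requirement must be 'any' or 'all', got {index_requirement!r}")
    # Assumption: sample.bam + ".bai" -> sample.bai, sample.bam + ".bam.bai" -> sample.bam.bai
    stem = Path(data_file).with_suffix("")
    candidates = [Path(f"{stem}{ext}") for ext in index_extensions]
    found = [c for c in candidates if c.exists()]
    if index_requirement == "all":
        satisfied = len(found) == len(candidates)
    else:
        satisfied = bool(found)
    if not satisfied:
        expected = ", ".join(c.name for c in candidates)
        raise FileNotFoundError(f"{Path(data_file).name}: need {index_requirement} of: {expected}")
    return found
```

For a BAM file this would be called as `find_indexes(Path("sample.bam"), [".bai", ".bam.bai"], "any")`. A real transformer would run the same kind of check against remote storage and then download the file and the indexes it found into one directory.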

Calvin Leather

09/21/2022, 2:42 PM
Nice! This is exactly what I was thinking about

Yee

09/21/2022, 4:54 PM
do you think you can help port this to an issue @Calvin Leather?

Calvin Leather

09/21/2022, 4:54 PM
Yes!

Yee

09/21/2022, 4:54 PM
tyty

Calvin Leather

09/22/2022, 1:12 AM