Hey @sparse-addition-80036 nice presentation... bioinformatics-oriented question (Flyte #flyte-support)

shy-holiday-15500

09/20/2022, 4:53 PM

Hey @sparse-addition-80036 nice presentation... bioinformatics-oriented question... how have you been using FlyteFiles for indexed files? Do you have tabix indexes etc. on your bed or vcf files? This has been something my company has been thinking about/prototyping (using 2 flyte files is a bit annoying because the index doesn’t end up “next to” the file it’s indexing, so in our shell tasks we have to cp the file + its index into a new temp folder, but what if we had something like FlyteDirectory that wrapped a file family with an expected number of files)

thankful-minister-83577

09/20/2022, 8:07 PM

interesting

thankful-minister-83577

09/20/2022, 8:07 PM

can you make an issue for this @shy-holiday-15500

thankful-minister-83577

09/20/2022, 8:07 PM

so you want, if you return file_a, file_b, that they end up close to each other somehow? like possibly in the same prefix in s3?

thankful-minister-83577

09/20/2022, 8:08 PM

could you elaborate a bit more on why that’s useful?

shy-holiday-15500

09/20/2022, 8:08 PM

I haven't thought enough about how this should work. It's tough to do the static checking because for some of these "file families", they are treated by C code/binaries as though they were one file... e.g., doing file_a/file_b doesn't make sense, rather the binary takes a filestem without the suffix

shy-holiday-15500

09/20/2022, 8:09 PM

Useful because: 1. Otherwise when flytekit downloads the files they don't always end up in the same folder, e.g., for a shell task 2. If you have 2 inputs to a task that really are 2 file families with 3 files each, the inputs/outputs look gross, even though mentally you're just passing in 2 families (not 6 distinct things)

shy-holiday-15500

09/20/2022, 8:10 PM

Some examples of these families: PLINK bed filesets (https://www.cog-genomics.org/plink/1.9/formats#bed), and vcf + vcf.tbi (an index of the file to allow fast lookups instead of linear disk scans)
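
A minimal sketch of the "file family" idea described above: a shared filestem plus a fixed set of expected suffixes. `FileFamily` and the two suffix constants are hypothetical names used only for illustration and are not part of flytekit. The PLINK 1 binary fileset suffixes (.bed/.bim/.fam) follow the PLINK format page linked above; the tabix example assumes a bgzip-compressed VCF.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass(frozen=True)
class FileFamily:
    """Hypothetical helper: a filestem plus the suffixes that make up the family."""

    stem: Path                 # e.g. /data/cohort (no suffix)
    suffixes: Tuple[str, ...]  # e.g. (".bed", ".bim", ".fam")

    def members(self) -> List[Path]:
        # Append each suffix to the stem's name; Path.with_suffix would
        # mangle multi-part suffixes such as ".vcf.gz.tbi".
        return [self.stem.parent / (self.stem.name + s) for s in self.suffixes]

    def missing(self) -> List[Path]:
        # Members that are expected but not present on local disk.
        return [p for p in self.members() if not p.exists()]

    def is_complete(self) -> bool:
        # A family is only meaningful when every member is present.
        return not self.missing()


# Assumed suffix sets for the two families mentioned in the thread.
PLINK_BED_SUFFIXES = (".bed", ".bim", ".fam")
TABIX_VCF_SUFFIXES = (".vcf.gz", ".vcf.gz.tbi")
```

With this shape, a binary that takes a filestem (as PLINK does) can be handed `family.stem` directly once `family.is_complete()` is true.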

shy-holiday-15500

09/20/2022, 8:12 PM

One more reason: there is a wealth of tooling like htslib/pyvcf that could play really nice with these file families, similar to how the structured dataset plays nicely with pandas, so that you could manipulate the data with python code as soon as you get into the task (rather than having to download/open the file, and then form the pyvcf etc. constructor around the file pointer)

thankful-minister-83577

09/20/2022, 8:14 PM

got it thank you.

broad-monitor-993

09/20/2022, 9:08 PM

hey Calvin, it seems like FlyteDirectory would be appropriate to use here… how are these “file families” distinct from directories?

shy-holiday-15500

09/20/2022, 9:09 PM

Yeah good question, maybe FlyteDirectory, or an extension of it, is the right way to go

shy-holiday-15500

09/20/2022, 9:11 PM

Some reasons:
- Sometimes you want to download just the index so you can check whether something exists in the large file it indexes before you download that file (weaker, as this is really just sugar)
- Files in a file family typically all interrelate in a way where removing one makes the others meaningless. These semantics usually aren't there for directories of files (e.g., a bunch of parquet partitions in a directory)

shy-holiday-15500

09/20/2022, 9:11 PM

I probably should play with FlyteDirectory more on our end and see if the UX is good for these setups (we've been going with 1 flyte file for each member of the family, and then dealing with the fact they don't download into the same temp folder by using an mv/cp in the shell task)
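
The mv/cp workaround described above could be sketched in plain Python roughly as follows. `stage_family` is a hypothetical helper, not a flytekit API; it assumes each family member has already been downloaded to some local path and simply copies them into one fresh temp folder so the index sits next to the file it indexes.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def stage_family(members: Iterable[Path]) -> Path:
    """Copy every member of a file family into one new temp directory.

    Hypothetical helper: assumes the members are local files that may have
    been downloaded into different folders.
    """
    staging_dir = Path(tempfile.mkdtemp(prefix="file_family_"))
    for member in members:
        # Keep the original file name so tools can find the index by suffix.
        shutil.copy2(member, staging_dir / Path(member).name)
    return staging_dir
```

For large files, symlinking instead of copying would avoid duplicating data on disk, at the cost of depending on the original download location staying in place.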

broad-monitor-993

09/20/2022, 9:16 PM

Sometimes you want to download just the index so you can check whether something exists in the large file it indexes before you download it

interesting… yeah I think if you can write down requirements like this in an issue it would help us figure out how to extend FlyteDirectories https://github.com/flyteorg/flyte/issues/new?assignees=&labels=enhancement%2Cuntriaged&template=feature_request.yaml&title=%5BCore+feature%5D+

rich-garden-69988

09/21/2022, 12:21 AM

Hey @shy-holiday-15500 , we had a similar need when dealing with indexed files. The way we ended up solving it was with a custom type that can be extended to handle files with any number of associated indexes.

class FlyteFileWithIndex(FlyteFile, metaclass=TypeTransformerMeta):
    index_extensions: typing.List[str] = []
    index_requirement: typing.Literal["all", "any"] = "any"

The type transformer looks at the list of index extensions and checks whether “any” or “all” of the index files exist, based on the specified index_requirement (“any” is useful for things like BAM that can have either a “.bai” or a “.bam.bai” suffix, “all” is useful for strict matching). If the index(es) exist, it will download both the file and its associated index files into the same directory; otherwise it will error. So for example, this is what a type would look like to handle BAM files, which can be defined inside the workflow or in a library that is used by the workflow:

class BamFile(FlyteFileWithIndex):
    index_extensions = [".bai", ".bam.bai"]
    index_requirement = "any"
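
The transformer itself is not shown in the thread, so the following is only a sketch of the “any”/“all” check it is described as doing, written against the local filesystem rather than remote storage. `find_indexes` is a hypothetical name, and the rule that each index extension replaces the data file's final suffix (sample.bam -> sample.bai or sample.bam.bai) is an assumption inferred from the BamFile example.

```python
from pathlib import Path
from typing import List, Literal


def find_indexes(
    data_file: Path,
    index_extensions: List[str],
    index_requirement: Literal["all", "any"] = "any",
) -> List[Path]:
    """Return the index files found next to data_file, or raise if the
    requirement is not met. Hypothetical sketch, not the actual transformer."""
    # Assumption: drop the final suffix, then append each index extension.
    stem = data_file.with_suffix("")
    candidates = [stem.parent / (stem.name + ext) for ext in index_extensions]
    found = [p for p in candidates if p.exists()]

    if index_requirement == "all" and len(found) != len(candidates):
        missing = [str(p) for p in candidates if p not in found]
        raise FileNotFoundError(f"missing index files: {missing}")
    if index_requirement == "any" and not found:
        raise FileNotFoundError(f"no index found for {data_file}")
    return found
```

With `index_extensions=[".bai", ".bam.bai"]` and `"any"`, a lone sample.bam.bai next to sample.bam satisfies the check, while `"all"` would raise because sample.bai is absent.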

👍 1

shy-holiday-15500

09/21/2022, 2:42 PM

Nice! This is exactly what I was thinking about

thankful-minister-83577

09/21/2022, 4:54 PM

do you think you can help port this to an issue @shy-holiday-15500?

shy-holiday-15500

09/21/2022, 4:54 PM

Yes!

thankful-minister-83577

09/21/2022, 4:54 PM

tyty

shy-holiday-15500

09/22/2022, 1:12 AM

https://github.com/flyteorg/flyte/issues/2910

excellent 1
