# ask-the-community
r
Hi team.. so flytekit needs pyarrow >= 4.0.0, but pyarrow > 2.0.0 (required by awswrangler) has this issue that I’m currently running into when trying to chunk and read a large parquet file.. Has anyone else in the group encountered this, and if yes, how did you get around it?
k
is there any reason you don’t use StructuredDataset? it handles the underlying blob operations (reading/writing parquet files) for you.
r
oh.. I didn’t realize I had that option.. is there some sample code on how to populate it from a file/folder in s3?
uri= ?
is that it?
k
yup
r
what about aws credentials? how do I pass those?
k
sd = StructuredDataset(uri=<s3_path>)
edit flytepropeller config map, update the default env
default-cpus: 100m
default-env-vars:
- FLYTE_AWS_ENDPOINT: http://minio.flyte.svc.cluster.local:9000
- FLYTE_AWS_ACCESS_KEY_ID: minio
- FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
r
that wouldn’t work for us 😞
our aws credentials are coming from vault
to access the s3 buckets that have the files we are interested in processing
k
cc @Yee do you know what’s the best practice for reading credentials from vault in flytekit?
r
We are using Secret in task
the question is how to pass it to the StructuredDataset initializer
k
flytekit downloads the input proto from s3 while running the task, so the pod should already have access to the s3 bucket. if not, how do you download the input file from s3?
r
I was using awswrangler to read the parquet file from s3.. it takes a boto3_session as a parameter which I instantiate with the right credentials
flyte has access to whatever bucket it needs for its blob storage… for all other artifacts I provide credentials (usually boto3)
k
I mean you don’t need to pass credentials to the pods yourself; the pods should already have them. Otherwise, the pod would fail to download the task input (inputs can be any python type: int, str, etc.).
y
are these in different buckets?
so you need another s3 connection to download data that you want? and those credentials are only available in vault?
r
yes.. and available to the pod using “secrets”
flytekit.secrets
y
that is a bit tricky… flytekit only assumes one s3 connection.
whatever credentials are used to download inputs are the same ones the default structured dataset readers and writers will use.
you can of course replace those.
r
argh.. that won’t work for us 😞
unless I use the credentials to download the files locally before use..
y
let me see
r
feature request: StructuredDataset also takes an optional boto3_session parameter and uses that if it’s provided 🙂
y
just out of curiosity… how long have you had boto3 as a pip dependency?
(i only ask because i’ve run into problems with it in the past)
r
a lot of our workflows have this flow… i.e. working with files in s3 that flyte directly doesn’t have access to
I don’t use s3fs so that hasn’t been an issue yet
y
sweet
cc @Ketan (kumare3)
can we get back to you tomorrow on how best to do this?
r
sure.. until then I’ll continue building and testing the workflow with my smaller dataset
thanks @Yee and @Kevin Su
k
hi @Rupsha Chaudhuri are you around for a quick chat?
r
sure
@Kevin Su for a very large parquet dataset (100 million records), is it possible to read the data using StructuredDataset in chunks so that the task does not run out of memory?
or do I need to manually handle the chunking and reading the files 1 by 1?
k
you can use
sd.open(pd.DataFrame).iter()
to get the iterator
@Yee we don’t have any transformer that returns an iterator. I think we should add some defaults for pandas and spark
y
yes
we should.
can we chat about this for five later today?
cc @Niels Bantilan
k
Yes
n
@Rupsha Chaudhuri @Kevin Su here’s an implementation of an iterable reader: https://github.com/flyteorg/flyte-demos/blob/main/flyte_demo/workflows/data_iter.py We should consider adding it to the default implementation
r
oh thank you.. I was trying to figure out how to do this
n
it’s a little bit hacky (see how the partition_col is manually injected here), so we should generalize it so that the partition column is a valid kwarg in the StructuredDataset initializer