# flytekit
m
In older versions of flytekit (pre 1.5) we can configure gsutil parallelism using this configuration. Is it correct to assume that I can set the value by setting the environment variable `FLYTE_GCS_GSUTIL_PARALLELISM` to `true`?
Asking since we see slow `FlyteDirectory` transfers when there are a lot of files (~10k), and I just realized that parallelism is disabled by default. What would be the best approach to enable it globally? cc: @broad-train-34581
f
Ohh now we use fsspec, I would have expected it to be faster. Cc @thankful-minister-83577 / @freezing-boots-56761
f
@most-gold-65483 what flytekit version?
m
I am using 1.2.11, still stuck with < 1.3 due to protobuf version
f
FLYTE_GCP_GSUTIL_PARALLELISM
m
Haven’t tested it yet. Still figuring out how to enable it globally so that all workflows get the benefit. Any suggestions?
f
what about default env var in propeller plugin config?
m
Do you mean this?
👍 1
f
yes!
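A minimal sketch of what such a setting could look like, assuming the `default-env-vars` list of the propeller k8s plugin config and the variable name given earlier in this thread. The file this lives in and the surrounding nesting depend on how Flyte is deployed, so treat the layout as an assumption:

```yaml
# Assumed fragment of the propeller k8s plugin config.
# Every task pod started by propeller would receive this variable.
plugins:
  k8s:
    default-env-vars:
      - FLYTE_GCP_GSUTIL_PARALLELISM: "true"
```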
m
Got it, thanks! Will try and update in this thread!
f
So after flytekit 1.5 you should not need it
m
yeah, hopefully we can reach that point asap 🤞
celebrate 1
f
Also let us know about 1.5 and how it’s working etc
👍 1
m
It works! And it scales with the number of CPUs too. Previously it took ~17 minutes to copy 13k images, now it’s around 5 minutes with 4 CPU cores.
🙌 2
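Scaling with core count is what one would expect when many small files are copied concurrently rather than one at a time. The sketch below is a generic illustration using only the Python standard library. It is not how flytekit or gsutil implement parallel transfers, and `copy_tree_parallel` and its `workers` parameter are hypothetical names.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Optional


def copy_tree_parallel(src: Path, dst: Path, workers: Optional[int] = None) -> int:
    """Copy every file under src into dst with a thread pool; return the file count."""
    # Assumption for illustration: default to one worker per CPU core.
    workers = workers or os.cpu_count() or 1
    files = [p for p in src.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        # Recreate the relative directory layout under dst.
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all copies and re-raises any worker exception.
        list(pool.map(copy_one, files))
    return len(files)
```

Calling `copy_tree_parallel(Path("images"), Path("out"), workers=4)` would copy the tree with four concurrent workers. For remote object stores the gain comes mostly from overlapping per-file request latency, so the speedup is rarely linear in the number of workers.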
f
you can further tune it too if you can mount a boto.cfg into the container: https://medium.com/@duhroach/gcs-read-performance-of-large-files-bd53cfca4410
f
In the new version @most-gold-65483 I would recommend using the streaming api
💯 1
The same method can be accelerated, as it will use very little disk and memory
m
Noted will keep this in mind. Thanks @freezing-boots-56761 this is really useful!
👍 1