In the older version of flytekit (pre 1.5) we can ...
# flytekit
p
In the older version of flytekit (pre 1.5) we can configure gsutil parallelism using this configuration. Is it correct to assume that I can set the value by setting the environment variable FLYTE_GCS_GSUTIL_PARALLELISM to true?
Asking since we see a slow FlyteDirectory transfer when there are a lot of files (~10k), and we just realized that parallelism is disabled by default. What will be the best approach to enable it globally? cc: @Lee Ning Jie Leon
k
Ohh now we use fsspec, I would have expected it to be faster. Cc @Yee / @jeev
j
@Pradithya Aria Pura what flytekit version?
p
I am using 1.2.11, still stuck with < 1.3 due to protobuf version
j
FLYTE_GCP_GSUTIL_PARALLELISM
p
Haven’t tested it yet. Still figuring out how to enable it globally so that all workflows get the benefit. Any suggestions?
j
what about default env var in propeller plugin config?
p
Do you mean this?
j
yes!
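For anyone trying the same thing, here is a minimal sketch of a sanity check you could run inside a task to confirm that the default env var actually reached the container. The helper name and the set of accepted truthy spellings are assumptions made for this check; flytekit's own parsing may differ.

```python
import os
from typing import Mapping, Optional

# Variable name taken from the correction in this thread (GCP, not GCS).
ENV_VAR = "FLYTE_GCP_GSUTIL_PARALLELISM"


def gsutil_parallelism_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: report whether the variable is set to a truthy value.

    The accepted spellings below are an assumption for this check only;
    they are not taken from flytekit.
    """
    env = os.environ if env is None else env
    return env.get(ENV_VAR, "").strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    # Run inside the task container to see what the process actually received.
    print(f"{ENV_VAR} enabled: {gsutil_parallelism_enabled()}")
```

If this reports that the variable is not enabled inside a running task, the default env var probably did not propagate from the plugin config.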
p
Got it, thanks! Will try and update in this thread!
k
So after flytekit 1.5 you should not need it
p
yeah, hopefully we can reach that point asap 🤞
k
Also let us know about 1.5 and how it’s working etc
p
It works! And it scales with the number of CPUs too. Previously it took ~17 minutes to copy 13k images; now it takes ~5 minutes with 4 CPU cores.
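As a rough back-of-the-envelope check on those numbers, the sketch below computes the speedup and per-second throughput. The function name is hypothetical, and the inputs are the approximate figures quoted above, so treat the results as estimates.

```python
def transfer_stats(files: int, before_min: float, after_min: float, cores: int) -> dict:
    """Hypothetical helper: summarize a before/after transfer comparison."""
    speedup = before_min / after_min
    return {
        "speedup": speedup,
        # Files copied per second before and after enabling parallelism.
        "files_per_sec_before": files / (before_min * 60),
        "files_per_sec_after": files / (after_min * 60),
        # Fraction of ideal linear scaling across the available cores.
        "scaling_efficiency": speedup / cores,
    }


# Approximate figures from the message above: 13k images, ~17 min vs ~5 min, 4 cores.
stats = transfer_stats(files=13_000, before_min=17, after_min=5, cores=4)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

That works out to roughly a 3.4x speedup, which is close to linear scaling on 4 cores.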
j
you can further tune it too if you can mount a boto.cfg into the container: https://medium.com/@duhroach/gcs-read-performance-of-large-files-bd53cfca4410
k
In the new version @Pradithya Aria Pura I would recommend using the streaming api
The same method can be accelerated, as it will use very little disk and memory
p
Noted will keep this in mind. Thanks @jeev this is really useful!