#ask-the-community

Mick Jermsurawong

12/11/2023, 11:31 PM
hi folks, do you run Flyte in local mode on a notebook in a transient notebook env (e.g. Databricks, Colab)? In that situation, how do you enable the local cache? the default behavior writes output to
`~/.flyte/local-cache/`
which strongly assumes a durable, persistent local env

Ketan (kumare3)

12/12/2023, 3:07 AM
Wdym
Local cache is local

Mick Jermsurawong

12/12/2023, 3:08 AM
right.. but if i run Flyte code in a databricks environment for example.. the cache just doesn't work across runs
so i'm asking if there are good solutions for such ephemeral environment?

Ketan (kumare3)

12/12/2023, 3:09 AM
Wdym is there no disk
There should be
Ya we can make it store in s3
I love the idea

Mick Jermsurawong

12/12/2023, 3:10 AM
yah i see that `~/.flyte/local-cache/` now has a SQLite artifact

Ketan (kumare3)

12/12/2023, 3:10 AM
Shall we collaborate on this

Mick Jermsurawong

12/12/2023, 3:10 AM
and flytekit likely queries on this local db

Ketan (kumare3)

12/12/2023, 3:10 AM
Ya it does

Mick Jermsurawong

12/12/2023, 3:11 AM
so if we implement s3 path, it's gonna look more like Flyte running on K8S cluster

Ketan (kumare3)

12/12/2023, 3:11 AM
No it won’t
As that needs a db
Here we will have to use lookup

Mick Jermsurawong

12/12/2023, 3:12 AM
just s3 look-up by uri path?

Ketan (kumare3)

12/12/2023, 3:12 AM
But if this is a custom cache then we could simply upload the cache db
That’s the other option
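The "upload the cache db" option can be sketched roughly as below. This is a minimal illustration, not flytekit functionality: `restore_cache` and `persist_cache` are hypothetical helpers, and the "durable" location is assumed to be a mounted directory (for example a persistent volume in the notebook environment) rather than a real S3 client. It also assumes the cache database is not being written to while it is copied.

```python
import shutil
from pathlib import Path


def restore_cache(durable_dir: Path, local_dir: Path) -> bool:
    """Copy a previously saved cache directory into the local cache location.

    Returns True if a saved copy was found and restored, False otherwise.
    """
    if not durable_dir.is_dir():
        return False
    shutil.copytree(durable_dir, local_dir, dirs_exist_ok=True)
    return True


def persist_cache(local_dir: Path, durable_dir: Path) -> None:
    """Copy the whole local cache directory to durable storage."""
    if local_dir.is_dir():
        shutil.copytree(local_dir, durable_dir, dirs_exist_ok=True)
```

The idea would be to call `restore_cache` at the start of a notebook session and `persist_cache` at the end, so the cache files survive the teardown of the ephemeral environment.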

Mick Jermsurawong

12/12/2023, 3:12 AM
ok i'd love to collaborate
databricks is the standard notebooking env we are going with, so we'd like to have caching functionality here

Ketan (kumare3)

12/12/2023, 3:13 AM
Ok I have never used it so would love to understand
Why databricks
Let’s have a chat sometime

Mick Jermsurawong

12/12/2023, 3:14 AM
databricks notebooks have great UX and it's been worth the $$ for the productivity gain for our engs
let me write up something and will share with you to get first round of feedback
we actually want to make local/remote execution more seamless.. often when folks do remote execution, they are hoping they can reuse results from the remote execution to iterate locally as well

Ketan (kumare3)

12/12/2023, 3:16 PM
@Mick Jermsurawong using remote cache locally is dangerous
But you can fetch all the data
Checkout the new Flyte data uri

Mick Jermsurawong

12/12/2023, 3:17 PM
it is dangerous in the sense that you are concerned about data corruption right?
i think read-only secondary cache is sufficient for us
anyways that's a secondary ask.. I think the first ask is just to be able to have external durable storage for local execution, as described in the issue above
pls let me know further thoughts, and will be happy to contribute
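The read-only secondary cache idea above can be sketched as a two-tier lookup. This is only an illustration under stated assumptions: `TieredCache` is a hypothetical class, not part of flytekit, and both tiers are modeled as plain mappings. Reads fall through from the local tier to the remote tier, while writes only ever go to the local tier, so the remote cache cannot be corrupted from a local run.

```python
from typing import Any, Callable, Mapping, MutableMapping

_MISSING = object()  # sentinel so that a cached None still counts as a hit


class TieredCache:
    """Writable primary (local) cache backed by a read-only secondary cache."""

    def __init__(self, primary: MutableMapping[str, Any], secondary: Mapping[str, Any]):
        self.primary = primary
        self.secondary = secondary

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        # 1. Prefer the local, writable tier.
        value = self.primary.get(key, _MISSING)
        if value is not _MISSING:
            return value
        # 2. Fall back to the read-only tier; compute only on a full miss.
        value = self.secondary.get(key, _MISSING)
        if value is _MISSING:
            value = compute()
        # 3. Record the result locally; the secondary tier is never written.
        self.primary[key] = value
        return value
```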

Ketan (kumare3)

12/12/2023, 3:21 PM
Let me discuss today

Mick Jermsurawong

12/13/2023, 3:03 AM
hi ketan! any further thoughts on this?

Ketan (kumare3)

12/13/2023, 6:20 AM
i read it briefly, i have some comments. i think if we set an s3 path you do not even need a prefix / context
but also we won't have time to work on this at the moment

Mick Jermsurawong

12/13/2023, 2:11 PM
if the s3 path can be an env var as well, that would work.. i'm happy to implement the work here, but want to make sure that directionally it's something OSS will accept
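Configuring the cache location through an environment variable could look roughly like this. The variable name `EXAMPLE_LOCAL_CACHE_URI` and both helper functions are hypothetical, chosen only for illustration; they are not flytekit settings. The fallback is the default local cache directory mentioned earlier in the thread.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical variable name, for illustration only; not a flytekit setting.
CACHE_URI_ENV_VAR = "EXAMPLE_LOCAL_CACHE_URI"
DEFAULT_CACHE_DIR = Path.home() / ".flyte" / "local-cache"


def resolve_cache_location(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured cache location, or the default local directory."""
    env = os.environ if env is None else env
    value = env.get(CACHE_URI_ENV_VAR, "").strip()
    return value or str(DEFAULT_CACHE_DIR)


def is_remote(location: str) -> bool:
    """Treat any URI scheme other than file:// as a remote blob store."""
    return "://" in location and not location.startswith("file://")
```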

Ketan (kumare3)

12/14/2023, 5:25 AM
yes i think we should, @Yee is out but he will be back week after (he got married)

Mick Jermsurawong

12/14/2023, 1:15 PM
sounds good. will work with Yee on this then
hi @Eduardo Apolinario (eapolinario)! thanks for the input here https://github.com/flyteorg/flyte/issues/4580#issuecomment-1864541011 also happy to chat here if it's more helpful.

Eduardo Apolinario (eapolinario)

12/22/2023, 3:05 PM
awesome, let's keep chatting here. It wouldn't be too hard to lean on our flytekit's existing infra to support loading/writing to a blob store. I just wanted to separate the two ideas: (1) the local scope, and (2) a remote cache. If you want to throw a PR I'd be more than happy to review.

Mick Jermsurawong

12/22/2023, 8:20 PM
gotcha.. 1/ local scope here simply means making the local cache disk path configurable, right? 2/ remote cache will also reuse that cache path? 2.1/ do you have a preference between simply syncing the whole DB files that python diskcache writes.. or encoding the cache key in individual remote blob store paths (closer to how on-cluster execution works)?

Eduardo Apolinario (eapolinario)

12/26/2023, 1:54 PM
1/ correct. 2/ Yeah, the local scope is optional, its purpose is just to help you segregate local caches. 2.1/ That's a good question. It'd be simpler to sync all DB files, but I fear that this might make the local cache very slow after multiple runs (imagine the case of a few thousand objects of different sizes being stored there). I also dislike the fact that if we go that route the local cache becomes slower and slower... so my vote goes to making each cache entry its own separate entry in the blob store. wdyt?

Mick Jermsurawong

01/05/2024, 10:15 PM
Sorry eduardo for late response. And happy new year! 2.1/ Yup each cache entry can have its own entry in the blob store. That makes sense to me
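The "one blob store entry per cache entry" approach agreed on above might be sketched as follows. This is an assumption-laden illustration, not how flytekit derives cache keys: `entry_path` and `cached_call` are hypothetical helpers, the blob store is modeled as a mapping from path to bytes, inputs are assumed to be JSON-serializable, and results are pickled for simplicity.

```python
import hashlib
import json
import pickle
from typing import Any, Callable, Dict, MutableMapping


def entry_path(prefix: str, task_name: str, cache_version: str, inputs: Dict[str, Any]) -> str:
    """Derive a stable blob path from the task name, cache version, and inputs."""
    # sort_keys makes the digest independent of argument order.
    canonical = json.dumps(inputs, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix.rstrip('/')}/{task_name}/{cache_version}/{digest}"


def cached_call(
    store: MutableMapping[str, bytes],
    prefix: str,
    task_name: str,
    cache_version: str,
    fn: Callable[..., Any],
    inputs: Dict[str, Any],
) -> Any:
    """Look up a single entry by path; run fn and store its result on a miss."""
    path = entry_path(prefix, task_name, cache_version, inputs)
    if path in store:
        return pickle.loads(store[path])
    result = fn(**inputs)
    store[path] = pickle.dumps(result)
    return result
```

Because each lookup touches only one path, the cost of a cache hit does not grow with the total number of stored entries, which is the concern raised about syncing the whole DB.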