/opt/venv/lib/python3.10/site-packages/flytekit...
# ask-the-community
s
/opt/venv/lib/python3.10/site-packages/flytekit/types/schema/types.py:323: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.  (This may have returned Python scalars in past versions.
  _np.bool: SchemaType.SchemaColumn.SchemaColumnType.BOOLEAN,  # type: ignore
Traceback (most recent call last):
  File "/opt/venv/bin/pyflyte", line 5, in <module>
    from flytekit.clis.sdk_in_container.pyflyte import main
  File "/opt/venv/lib/python3.10/site-packages/flytekit/__init__.py", line 195, in <module>
    from flytekit.types import directory, file, numpy, schema
  File "/opt/venv/lib/python3.10/site-packages/flytekit/types/schema/__init__.py", line 1, in <module>
    from .types import (
  File "/opt/venv/lib/python3.10/site-packages/flytekit/types/schema/types.py", line 313, in <module>
    class FlyteSchemaTransformer(TypeTransformer[FlyteSchema]):
  File "/opt/venv/lib/python3.10/site-packages/flytekit/types/schema/types.py", line 323, in FlyteSchemaTransformer
    _np.bool: SchemaType.SchemaColumn.SchemaColumnType.BOOLEAN,  # type: ignore
  File "/opt/venv/lib/python3.10/site-packages/numpy/__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'bool'. Did you mean: 'bool_'?
k
Which version of flytekit are you using?
s
Using
flytekit==1.2.3
@Kevin Su
y
this is a known issue sorry - we were a bit late in keeping up with the numpy deprecation notice.
can you bump to 1.2.7?
s
I remember having issues with 1.2.4 so I had to do:
grpcio-status<1.49.0
flytekit==1.2.3
Can I just do
flytekit==1.2.7
now?
@Yee
y
yes
s
Great I’ll try that
Thanks!
s
@Yee using 1.2.7 works - thanks!
@Yee A quick question - is Flyte used for training only? Should I be using serving tools like BentoML for inference? What if a large amount of data needs to be pre-processed (say via Spark) prior to inference? Where does Flyte fit in (or is it not meant to be used for inference at all, even for data pre-processing)?
Can Flyte replace other data workflow tools like Airflow, Prefect, Dagster, etc?
y
will let @Niels Bantilan answer this one.
n
The short answer is yes, they all pretty much have the same feature set at a high level for many of the core use cases (scheduled batch processing).
Flyte is good for any workflows involving data, not just ML training.
Re: inference, we’d recommend other tools like BentoML or KServe (a BentoML integration is tracked in the issues: https://github.com/flyteorg/flyte/issues/3107).
Flyte works well for batch inference (where latency requirements are tens of minutes or more). For anything faster, inference tools like BentoML work well. You could also use Flyte in event-driven architectures: https://blog.flyte.org/build-an-event-driven-neural-style-transfer-application-using-aws-lambda
For online inference use cases (sub-minute latency) with large data preprocessing requirements, it would make sense to use a feature store (Flyte has a Feast integration: https://docs.flyte.org/projects/cookbook/en/stable/auto/case_studies/feature_engineering/feast_integration/index.html), where Flyte can orchestrate the generation of features to be read into e.g. a BentoML service.
s
This is great - thank you very much @Niels Bantilan. I’ll also check out the articles!
One quick question though - the issue there mentions a `signal` node. Is that just a custom node to use as a flag, or is there something built into Flyte?
n
We’re still working on the signal node, I believe (@Yee), but it’ll be a first-class node in Flyte for human-in-the-loop use cases (e.g. the requirement for a human to approve a model based on some metrics before deploying).
s
Oh I see - ok cool thanks!
s
@seunggs Just to add to this conversation based on my reading over the past couple of days. These are my findings and my understanding of Prefect vs Flyte caching:
• Prefect docs mention that their caching as of today is at the "task" level, and not the workflow level.
  ◦ They also mention that the cache can contain a maximum of 2000 characters, and that you have to enable a parameter to persist the cache in your prefect_storage after a workflow is run.
• Whereas Flyte on the other hand has caching at both workflow and task levels. You can also version the caches if needed. And retrieve those versions when needed.
  ◦ Flyte cache can be based on the hash of the input and output of tasks as well.
I may have got this wrong. @Niels Bantilan @Ketan (kumare3) @Samhita Alla @Yee and anyone else from Flyte Org, please do clarify if I have made mistakes in my research regarding "Flyte"; obviously you don't have to speak for "Prefect". 🙂
Caching is something that could be a pivotal feature for people looking to choose between workflow management tools. And threads like these are really cool to read, where comparisons are done in an open manner.
n
Whereas Flyte on the other hand has caching at both workflow and task levels
Caching works with `@task`s and `@dynamic` workflows. Currently, caching is not supported for static workflows.
You can also version the caches if needed.
Yes, Flyte’s opinion is it’s too complicated trying to figure out if a task’s upstream dependencies have changed (which could potentially live in other modules, etc), so you can use any version string to version the cache.
Flyte cache can be based on the hash of the input and output of tasks as well.
The main use case for a user-defined hash method for inputs is for blob-store-serialized objects like files, directories, dataframes, pickle files, etc. In this case, you need to define a `HashMethod`, which will incur some runtime cost as Flyte computes the hash of, e.g., a dataframe.
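A toy model of the two points above (a user-chosen version string and a user-defined content hash feeding the cache key) might look like the following. It is plain Python, not Flytekit: `content_hash` and `cache_key` are hypothetical helpers invented for illustration, and Flyte's real key derivation may differ.

```python
import hashlib
import json


def content_hash(rows):
    """Hypothetical user-defined hash: digest of the data itself, not its storage path."""
    payload = json.dumps(rows).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cache_key(task_name, cache_version, input_hashes):
    """Toy cache key: task name + user-chosen version string + hashed inputs."""
    parts = [task_name, cache_version, *input_hashes]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


rows = [[1, "a"], [2, "b"]]
same_rows_elsewhere = [[1, "a"], [2, "b"]]  # same content, e.g. written to a different path

key_v1 = cache_key("task2", "1", [content_hash(rows)])
key_v1_again = cache_key("task2", "1", [content_hash(same_rows_elsewhere)])
key_v2 = cache_key("task2", "2", [content_hash(rows)])

print(key_v1 == key_v1_again)  # same content and version: would be a cache hit
print(key_v1 == key_v2)        # bumped version string: would be a cache miss
```

Hashing the content costs time proportional to the data size, which is the runtime cost mentioned above.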
s
Great! Just want to clarify using a scenario; please bear with my long question below, I'm new to Flyte. So the use of `HashMethod` means "both the input parameters to the function, and the output", i.e. DF/files/filepaths/pickles etc. will be hashed and stored in Flyte storage (which in AWS is an S3 bucket). Am I understanding this correctly? This means if I run a large spark based "task1" , and then the next "task2" requires "task1"s output for some operation, using HashMethod and potentially "cache_version" , I can run a workflow multiple times for evaluating "task2" which takes "task1"s cached output right? Basically Im trying to say that "Hashed and versioned tasks" could potentially avoid multiple writes to disk (output_1.csv, output2.csv etc) while a data science/data engineering "task2" is being ideated/refined? I have one more question if my above understanding is correct, so please do clarify.
n
`HashMethod` annotated outputs (e.g. for files, dataframes) will calculate a hash key based on the user-defined hash function, and this key will be used as the cache key. Assume `task1` produces this output: when the output is passed into downstream task `task2`, the hash key will be used to determine whether or not to re-run `task2` or just hit the cache to return the pre-computed value.
This means if I run a large spark based “task1” , and then the next “task2" requires “task1”s output for some operation, using HashMethod and potentially “cache_version” , I can run a workflow multiple times for evaluating “task2” which takes “task1"s cached output right?
correct
Basically Im trying to say that “Hashed and versioned tasks” could potentially avoid multiple writes to disk (output_1.csv, output2.csv etc) while a data science/data engineering “task2" is being ideated/refined?
correct
so to re-state what you’re saying to make sure I understand:
• `task1` is a relatively cheap spark job that produces a parquet file. The output of this task has a `HashMethod` so has a cache key associated with the output.
• `task2` is an expensive data processing spark job that depends on `task1`, and is set to `cache=True` with a `cache_version="1"`
• assuming that `cache_version` stays the same and the output of `task1` produces the same cache key, the first invocation of `task2` will run it, but subsequent invocations will hit the cache. However, since `task1` doesn’t have `cache=True`, `task1` will always run. Now if `task1` is also cached based on some primitive datatype inputs (like `datetime`, `int`, `str`, etc), then `task1` will not be run (avoiding multiple writes to disk) if a cache key for the output already exists.
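The scenario restated above can be simulated with a small in-memory sketch. This is plain Python and only an illustration of the described behaviour, not Flytekit's implementation: `ToyTask`, `make_rows`, and `total` are hypothetical names, and the cache here is just a dict keyed on the version string plus a hash of the input's content.

```python
import hashlib


class ToyTask:
    """Hypothetical stand-in for a task; not Flytekit's implementation."""

    def __init__(self, fn, cache=False, cache_version=""):
        self.fn = fn
        self.cache = cache
        self.cache_version = cache_version
        self.run_count = 0
        self._store = {}  # cache key -> stored output

    def __call__(self, value):
        # Key on the version string plus a hash of the input's content.
        digest = hashlib.sha256(repr(value).encode("utf-8")).hexdigest()
        key = (self.cache_version, digest)
        if self.cache and key in self._store:
            return self._store[key]  # cache hit: the task body is skipped
        self.run_count += 1
        result = self.fn(value)
        if self.cache:
            self._store[key] = result
        return result


def make_rows(day):
    return [(day, n) for n in range(3)]


def total(rows):
    return sum(n for _, n in rows)


# task1 has no cache, so it always runs; task2 is cached with version "1".
task1 = ToyTask(make_rows)
task2 = ToyTask(total, cache=True, cache_version="1")
for _ in range(3):
    result = task2(task1("2023-01-20"))
print(task1.run_count, task2.run_count, result)  # 3 1 3

# If task1 is also cached on its primitive input, it runs only once.
task1_cached = ToyTask(make_rows, cache=True, cache_version="1")
task2_again = ToyTask(total, cache=True, cache_version="1")
for _ in range(3):
    task2_again(task1_cached("2023-01-20"))
print(task1_cached.run_count, task2_again.run_count)  # 1 1
```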
s
Great, I understand now. In my case your description of task1 and task2 is reversed: I meant task1 to be the very expensive task, so avoiding re-running it seemed the better outcome of using the cache. But the analogy, and in turn Flyte, works both ways anyway. Thanks again for the detailed answer. I credit my initial understanding to the 'Caching' section of the Flyte docs. But I think you guys can really sell/market this a lot more! Putting more emphasis on this caching method and on the thought process behind it would be a huge factor for anyone choosing a workflow tool. Prefect, in my research, had very primitive caching capability compared to Flyte. Good and careful caching in Flyte can potentially alleviate some of the need for feature store usage, I think. I will try to complete my installation of Flyte in a dev env in a cloud and try out such a scenario. Thanks for clarifying the thought process @Niels Bantilan. Would be glad to hear more from anyone reading this thread.
k
@Sidharth (Sid) we can sell a lot of things a lot more. We are terrible at that - help us spread the word. The caching in Flyte has evolved through many design discussions, user sessions and a lot of careful planning. Thank you for sharing
n
@Sidharth (Sid) yes our flyte.org website revamp (coming soon!) should feature caching a lot more prominently. Out of curiosity, do you find the current docs on caching clear and understandable?
s
I feel it has everything a person needs to understand caching, but it needs a thorough read, maybe even a few reads for a complete newcomer to workflow tools. Also, terms related to caching such as "task Signature" could be simpler in my opinion, for example "Any changes to task Function". Prefect V1 docs for caching - https://docs-v1.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target It's pretty old, but has similar concepts explained in simple sentences. I'm just sharing it to show what kind of explanation fits a crowd who are new to these tools. But I do wish the above Prefect page were more technical, closer to Flyte's way of explaining. Basically a middle ground would be nice. Also, in the current stable Flyte docs, I would wish for more details on "local cache storage" and "remote cache storage". The above Prefect v1 doc gives some insight into it.
y
thank you for the feedback @Sidharth(Sid)
n
[flyte-docs]
n
@Sidharth (Sid) if you don’t mind, would you fill in a new issue ^^ for improving the caching docs? It’d be super helpful if you can link this slack thread and summarize the main suggestions you have for improving the readability and content.
k
thank you @Sidharth(Sid) this is fantastic feedback
s
@Niels Bantilan sure I will create an issue.
@Ketan (kumare3) feels good to talk to the flyte team directly, hope that we all together push the general direction of Flyte to greater heights.
p
@Sidharth(Sid) I have already created an issue for this (sorry didn't see that Niels asked you to open one). Feel free to comment on it with anything else you want to say: https://github.com/flyteorg/flyte/issues/3249
s
Oh nice, great, will do. Sorry, I'm still working in the office and couldn't find time; I was planning to write it up at home after office, with a beer in hand 😄. Will do tonight (India time)
k
Issues with beer in hand 😂
s
hehe
https://github.com/flyteorg/flyte/issues/3249#issuecomment-1399256660 @Ketan (kumare3) @Peeter Piegaze @Niels Bantilan Done.