GitHub
01/10/2023, 9:39 PM<https://github.com/flyteorg/flyteplugins/tree/master|master>
by hamersaw
<https://github.com/flyteorg/flyteplugins/commit/109224c2a0e65782fee53336b46cbe4bd0d2d189|109224c2>
- added raw-container to registered task types (#305)
flyteorg/flytepluginsGitHub
01/10/2023, 9:41 PMGitHub
01/10/2023, 9:47 PM<https://github.com/flyteorg/flyteidl/tree/master|master>
by katrogan
<https://github.com/flyteorg/flyteidl/commit/9fbac98b2d173fe1b30f18ac0487ff35b9627e3b|9fbac98b>
- Add raw claims to user info response (#357)
flyteorg/flyteidlGitHub
01/10/2023, 9:54 PMGitHub
01/10/2023, 10:08 PMGitHub
01/10/2023, 10:08 PM
map_task
to a partitioned StructuredDataset automatically, so that I can process the partitions in an embarrassingly parallel fashion without too much extra code.
Goal: What should the final outcome look like, ideally?
Suppose we have a task that produces a StructuredDataset
@task
def make_df() -> StructuredDataset:
df = pd.DataFrame.from_records([
{
"id": i,
"partition": (i % 10) + 1,
"name": "".join(
random.choices(string.ascii_uppercase + string.digits, k=10)
)
}
for i in range(1000)
])
return StructuredDataset(dataframe=df, partition_col=["partition"])
Ideally, I should be able to do something like this:
@task
def process_df(dataset: StructuredDataset) -> StructuredDataset:
df = dataset.open(pd.DataFrame).read_partition() # read the partition
... # do stuff
@task
def use_processed_df(dataset: List[StructuredDataset]) -> ...:
...
@workflow
def wf() -> StructuredDataset:
structured_dataset = make_df()
# where structured_dataset.partitions is a list of unpartitioned StructuredDatasets
results: List[StructuredDataset] = map_task(process_df)(dataset=structured_dataset.partitions)
return use_processed_df(dataset=results)
Note that in this example code a few magical things are happening:
1. we pass in structured_dataset.partitions
into the map task, which indicates that we want to apply process_df
to each of the partitions defined in make_df
2. The fact that map_task(process_df)
returns a StructuredDataset
implies that using map tasks with structured datasets does an implicit reduction, i.e. the outputs of map_task(process_df)
are written to the same blob store prefix.
Ideally the solution enables processing of StructuredDataset
without having to manually handle reading in of partitions in the map task, and automatically reduces the results into a StructuredDataset
without having to explicitly write a coalesce/reduction task.
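As an illustration of the semantics proposed above, here is a plain-Python sketch (no Flyte involved; `split_by_partition`, `process_partition`, and `map_then_reduce` are hypothetical names, not Flyte APIs) of splitting records by a partition column, mapping over each partition independently, and reducing the results back into a single dataset:

```python
from collections import defaultdict

def split_by_partition(records, partition_col):
    """Group records into per-partition lists keyed by the partition column."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[partition_col]].append(record)
    return list(partitions.values())

def process_partition(records):
    """Stand-in for process_df: uppercase every name in one partition."""
    return [{**r, "name": r["name"].upper()} for r in records]

def map_then_reduce(records, partition_col):
    # "map_task" step: each partition is independent and could run in parallel
    results = [process_partition(p) for p in split_by_partition(records, partition_col)]
    # implicit reduction step: concatenate the partition outputs back together
    return [r for part in results for r in part]

records = [{"id": i, "partition": (i % 10) + 1, "name": f"row{i}"} for i in range(100)]
out = map_then_reduce(records, "partition")
```

The proposal would have Flyte perform the split and the final concatenation (writing outputs under one blob-store prefix) on the user's behalf.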
Describe alternatives you've considered
Users would have to roll their own way of processing partitions of a structured dataset using dynamic tasks.
Propose: Link/Inline OR Additional context
Slack context: https://flyte-org.slack.com/archives/CP2HDHKE1/p1673380243923279
Related to #3219
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyteGitHub
01/10/2023, 11:00 PM<https://github.com/flyteorg/flyteadmin/tree/master|master>
by katrogan
<https://github.com/flyteorg/flyteadmin/commit/1ccd59c249e2305185bd7e5cd9340c4f60c6e8bd|1ccd59c2>
- Forward all claims in userinfo response (#511)
flyteorg/flyteadminGitHub
01/10/2023, 11:26 PM<https://github.com/flyteorg/flytekit/tree/master|master>
by wild-endeavor
<https://github.com/flyteorg/flytekit/commit/581a5c66b6dec1d105f8f655918c405665ceebf6|581a5c66>
- Update default config to work out-of-the-box with flytectl demo (#1384)
flyteorg/flytekitGitHub
01/10/2023, 11:36 PM
flyte-binary
Helm chart.
• Remove a lot of the sandbox stuff since the new flytectl demo
environment is architected differently.
• Change instructions to use flytectl demo
instead of flytectl sandbox
• Remove ideal_flow.rst
- this is incomplete. We should eventually offer some basic gh workflow examples rather than just talking about it.
Structure Changes
Bundling all Flyte configuration under the deployment moniker felt like a bit of a stretch. I think renaming it to a broader administrator's guide makes more sense.
• Currently the generated component configs are missing, but this will end up changing with the partial mono-repo work anyways.
Deployment Updates
One of the things we want to do is change the deployment journey. We want to make sure users are led comfortably through the various stages that one might expect of something as complex as Flyte.
Basically the steps of the journey should be
1. flytectl demo sandbox
2. a simple cloud (eks/gke) based deployment with nothing tricky - just helm install. users will have to port-forward to see anything.
3. a production ready deployment (ingress, auth, etc) (think stable enough for most companies).
4. a scalable multicluster setup (lots of deployments will never need this level).
The middle steps will replace the aws/gcp guides that we have today. These have been completely deleted.
From there we should link to a revamped version of the ideal_flow.rst
file I think.
#2993
flyteorg/flyte
✅ All checks have passed
11/11 successful checksGitHub
01/10/2023, 11:55 PMGitHub
01/11/2023, 4:47 AM
%{HOST_GATEWAY_IP}%
template variable.
flyteorg/flyte
GitHub Actions: trigger-sandbox-lite-build
GitHub Actions: trigger-single-binary-build
GitHub Actions: compile
GitHub Actions: Functional test
✅ 6 other checks have passed
6/10 successful checksGitHub
01/11/2023, 12:51 PM
SdkBindingData.get()
to get the inner value in the run-task context (java example, scala example); at the same time, these changes allow recovering the attributes by name on the workflow side (java example, scala example).
Tracking Issue
• flyteorg/flyte#3250
• flyteorg/flyte#3251
Follow-up issue
• flyteorg/flyte#3252
flyteorg/flytekit-java
✅ All checks have passed
3/3 successful checksGitHub
01/11/2023, 7:15 PM
GitHub
01/11/2023, 9:21 PMGitHub
01/11/2023, 9:50 PM<https://github.com/flyteorg/flytepropeller/tree/master|master>
by hamersaw
<https://github.com/flyteorg/flytepropeller/commit/ce57dbf15274a77c7037f367bda738e6ce33b453|ce57dbf1>
- use different perm (#517)
flyteorg/flytepropellerGitHub
01/11/2023, 10:11 PM<https://github.com/flyteorg/flyte/tree/master|master>
by eapolinario
<https://github.com/flyteorg/flyte/commit/782fe745452a07294db47f6ff782e5c559ac3c92|782fe745>
- Add dask operator (#3145)
flyteorg/flyteGitHub
01/11/2023, 10:13 PMGitHub
01/11/2023, 10:46 PMGitHub
01/11/2023, 10:48 PM<https://github.com/flyteorg/flyte/tree/master|master>
by jeevb
<https://github.com/flyteorg/flyte/commit/c674bfa31ef0d262112b083950c4425480a18bc3|c674bfa3>
- Add Kubernetes objects for dev mode with single-binary sandbox (#3228)
flyteorg/flyteGitHub
01/11/2023, 10:50 PMGitHub
01/11/2023, 10:55 PM<https://github.com/flyteorg/flyte/tree/master|master>
by eapolinario
<https://github.com/flyteorg/flyte/commit/c19c282df236b4f7917a64f7928eeaa121723b56|c19c282d>
- Changelog for 1.3 (#3209)
flyteorg/flyteGitHub
01/11/2023, 11:12 PMGitHub
01/11/2023, 11:14 PM<https://github.com/flyteorg/flytectl/tree/master|master>
by wild-endeavor
<https://github.com/flyteorg/flytectl/commit/b0d98931bf8780c193f9fc101242309058138795|b0d98931>
- Change extra host to host-gateway (#380)
flyteorg/flytectlGitHub
01/11/2023, 11:33 PM<https://github.com/flyteorg/flyte/tree/master|master>
by eapolinario
<https://github.com/flyteorg/flyte/commit/f69fb09ca189e8bf57e1a6a12db168274f640d15|f69fb09c>
- Update Flyte components (#3230)
flyteorg/flyteGitHub
01/11/2023, 11:37 PMGitHub
01/11/2023, 11:37 PM<https://github.com/flyteorg/homebrew-tap/tree/main|main>
by flyte-bot
<https://github.com/flyteorg/homebrew-tap/commit/6d942c75f7f2a62d6180fb5d16c454d6e28c8d54|6d942c75>
- Brew formula update for flytectl version v0.6.26
flyteorg/homebrew-tapGitHub
01/11/2023, 11:57 PMGitHub
01/11/2023, 11:57 PM
FlyteRemote
as well, though only a limited set of types can be passed in.
Notes
There are a couple things to point out with this release.
Caching on Structured Dataset
Please take a look at the flytekit PR notes for more information. In short: if you haven't bumped Propeller to version v1.1.36 (aka Flyte v1.2) or later, cached tasks that take a dataframe or a structured dataset type as input will trigger a cache miss. If you've upgraded Propeller, they will not.
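For intuition, here is a minimal sketch of why this kind of cache miss happens. This is not Flyte's actual cache-key derivation, just an illustration: if the cache key incorporates a serialized representation of the input type, then changing that representation changes the key, so entries cached under the old form can no longer be found.

```python
import hashlib

def cache_key(task_name, input_type_repr, cache_version="1.0"):
    # Derive a deterministic key from the task name, the serialized input
    # type, and the cache version (all names here are illustrative).
    payload = f"{task_name}|{input_type_repr}|{cache_version}"
    return hashlib.sha256(payload.encode()).hexdigest()

old_key = cache_key("my_task", "structured_dataset")          # pre-upgrade representation
new_key = cache_key("my_task", "structured_dataset+format")   # post-upgrade representation

# The keys differ, so a lookup with the new key misses values stored under the old one.
```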
Flytekit Remote Types
In the FlyteRemote
experience, fetched tasks and workflows will now be based on their respective "spec" classes in the IDL (task/wf) rather than the template. The spec messages are a superset of the template messages, so no information is lost. If you have code that accessed elements of the templates directly, however, it will need to be updated.
Usage Overview
Databricks
Please refer to the documentation for setting up Databricks.
Databricks is a subclass of the Spark task configuration so you'll be able to use the new class in place of the more general Spark
configuration.
from flytekitplugins.spark import Databricks
@task(
task_config=Databricks(
spark_conf={
"spark.driver.memory": "1000M",
"spark.executor.memory": "1000M",
"spark.executor.cores": "1",
"spark.executor.instances": "2",
"spark.driver.cores": "1",
},
databricks_conf={
"run_name": "flytekit databricks plugin example",
"new_cluster": {
"spark_version": "11.0.x-scala2.12",
"node_type_id": "r3.xlarge",
"aws_attributes": {
"availability": "ON_DEMAND",
"instance_profile_arn": "arn:aws:iam::1237657460:instance-profile/databricks-s3-role",
},
"num_workers": 4,
},
"timeout_seconds": 3600,
"max_retries": 1,
}
))
New Deployment Type
A couple of releases ago, we introduced a new Flyte executable that combines all the functionality of Flyte's backend into one command. This simplifies deployment, since only one image needs to run. This approach is now our recommended way for newcomers to install and administer Flyte, and there is a new Helm chart as well. Documentation has been updated accordingly. For new installations of Flyte, i.e. clusters that do not already have the flyte-core or flyte charts installed, users can run:
helm install flyte-server flyteorg/flyte-binary --namespace flyte --values your_values.yaml
New local demo environment
Users may have noticed that the environment provided by flytectl demo start has also been updated to use this new style of deployment, and it now installs the new Helm chart internally. The demo cluster also exposes an internal Docker registry on port 30000. That is, with the new demo cluster up, you can tag and push to localhost:30000/yourimage:tag123 and the image will be accessible to the internal Docker daemon. The web interface is still at localhost:30080; Postgres has moved to port 30001, and the Minio API (not the web server) has moved to port 30002.
Human-in-the-loop Workflows
Users can now insert sleeps, approvals, and input requests into workflows, in the form of gate nodes. Check out one of our earlier issues for background information.
from datetime import timedelta
from flytekit import wait_for_input, approve, sleep
@workflow
def mainwf(a: int):
x = t1(a=a)
s1 = wait_for_input("signal-name", timeout=timedelta(hours=1), expected_type=bool)
s2 = wait_for_input("signal name 2", timeout=timedelta(hours=2), expected_type=int)
z = t1(a=5)
zzz = sleep(timedelta(seconds=10))
y = t2(a=s2)
q = t2(a=approve(y, "approvalfory", timeout=timedelta(hours=2)))
x >> s1
s1 >> z
z >> zzz
...
These also work inside @dynamic
tasks. Interacting with signals from flytekit's remote experience looks like this:
from flytekit.remote.remote import FlyteRemote
from flytekit.configuration import Config
r = FlyteRemote(
Config.auto(config_file="/Users/ytong/.flyte/dev.yaml"),
default_project="flytesnacks",
default_domain="development",
)
r.list_signals("atc526g94gmlg4w65dth")
r.set_signal("signal-name", "execidabc123", True)
Overwritten Cached Values on Execution
Users can now configure a workflow execution to overwrite the cache. Each task in the workflow execution, regardless of previous cache status, will execute and write its cached value, overwriting the previous value if necessary. This allows previously corrupted cache values to be corrected without the tedious process of incrementing the cache_version
and re-registering Flyte workflows / tasks.
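The behavior can be sketched with a plain dict standing in for the cache (`run_cached` and the `overwrite` flag here are illustrative, not Flyte's API):

```python
cache = {}

def run_cached(key, fn, overwrite=False):
    """Return the cached result unless overwrite is set, in which case
    re-execute fn and replace the stored value."""
    if not overwrite and key in cache:
        return cache[key]
    result = fn()
    cache[key] = result  # write (or overwrite) the cached value
    return result

cache["t1"] = "corrupted"                            # a bad previously-cached value
fixed = run_cached("t1", lambda: "good", overwrite=True)  # forces re-execution
later = run_cached("t1", lambda: "other")                 # subsequent runs hit the repaired cache
```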
Support for Dask
Users can now spawn ephemeral Dask clusters as part of their workflows, similar to the existing support for Ray and Spark.
Looking Ahead
In the coming release, we are focusing on...
1. Out-of-core plugins: make backend plugins scalable and easy to author, with no code generation and no tools that ML engineers and data scientists are unaccustomed to using.
2. Performance Observability: we have made great progress on exposing both finer-grained runtime metrics and Flyte's orchestration metrics. This is important for understanding workflow evaluation performance and mitigating inefficiencies.
flyteorg/flyteGitHub
01/11/2023, 11:58 PMGitHub
01/12/2023, 1:12 AMGitHub
01/12/2023, 1:12 AM