GitHub
06/07/2023, 3:05 PM
<https://github.com/flyteorg/flytepropeller/tree/master|master>
by pingsutw
<https://github.com/flyteorg/flytepropeller/commit/2086bb91f4d88d68e4414238a3f777615fb0841e|2086bb91>
- Register gRPC plugin after reading configmap (#564)
flyteorg/flytepropeller
GitHub
06/07/2023, 3:43 PM
GitHub
06/07/2023, 4:12 PM
PytorchJob
as opposed to when doing non-elastic pytorch distributed training.
Flyteplugins, however, still generates a log link for the nonexistent master replica in the case of elastic training. This PR fixes this.
Type
☑︎ Bug Fix
☐ Feature
☐ Plugin
Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☑︎ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue
I built a propeller image and tested that the correct log links are shown both for elastic and the original non-elastic pytorch tasks.
Complete description
When doing "normal" non-elastic training, a flyte task looks like this:
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch

@task(
    task_config=PyTorch(
        num_workers=3,
    )
)
def train():
    ...
The pytorch job that is created from this task definition looks like this:
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      ...
    Worker:
      replicas: 3
      ...
Notice that there is a so-called "master" replica and multiple workers.
In the Flyte console, log links are shown for the master replica and for each of the 3 worker replicas.
When using the new elastic training task (torchrun) ...
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        nnodes=3,
        nproc_per_node=1,
    ),
)
def train():
    ...
... the resulting pytorch job looks like this:
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  elasticPolicy:
    ...
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      ...
Notice that there is no longer a "master" replica.
Even though there is no "master" replica, the Flyte console currently still shows a log link for a master replica that doesn't exist.
This PR fixes this.
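The rule described above can be sketched in a few lines. This is only an illustration in Python, not the actual flyteplugins change (which is made in the Go plugin); the function name `replica_log_targets` and the `master-0` / `worker-N` labels are assumptions made up for this example.

```python
from typing import List


def replica_log_targets(num_workers: int, elastic: bool) -> List[str]:
    """Return the replica labels that should receive a log link.

    Hypothetical helper: a master entry is only added for non-elastic
    jobs, because an elastic PyTorchJob has no Master replica spec.
    """
    targets: List[str] = []
    if not elastic:
        # Non-elastic jobs have exactly one master replica.
        targets.append("master-0")
    # Every job type gets one link per worker replica.
    targets.extend(f"worker-{i}" for i in range(num_workers))
    return targets


if __name__ == "__main__":
    print(replica_log_targets(3, elastic=False))
    print(replica_log_targets(3, elastic=True))
```

With `elastic=True` the list contains only worker entries, which matches the behaviour the PR describes for the console.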
Tracking Issue
NA
Follow-up issue
NA
flyteorg/flyteplugins
✅ All checks have passed
7/7 successful checks
GitHub
06/07/2023, 5:07 PM
GitHub
06/07/2023, 5:53 PM
GitHub
06/07/2023, 5:59 PM
GitHub
06/07/2023, 6:10 PM
GitHub
06/07/2023, 6:27 PM
<https://github.com/flyteorg/flyteplugins/tree/master|master>
by fg91
<https://github.com/flyteorg/flyteplugins/commit/de1d1d345f3bf487701b1f732b2c56f76e81e086|de1d1d34>
- Don't add master replica log link when doing elastic pytorch training (#356)
flyteorg/flyteplugins
GitHub
06/07/2023, 7:59 PM
GitHub
06/07/2023, 9:15 PM
GitHub
06/07/2023, 10:21 PM
<https://github.com/flyteorg/flytekit/tree/master|master>
by eapolinario
<https://github.com/flyteorg/flytekit/commit/ad56f8affeb0cecc2f7d32c8bd55c49165a95429|ad56f8af>
- Set a less strict deadline for hypothesis tests (#1682)
flyteorg/flytekit
GitHub
06/07/2023, 10:24 PM
<https://github.com/flyteorg/flytekit/tree/master|master>
by eapolinario
<https://github.com/flyteorg/flytekit/commit/1952a56ca36b8310ae756f8e2ef2cf31ec409b8f|1952a56c>
- Use protos of new kubeflow.pytorch plugin instead of legacy pytorch plugin (#1678)
flyteorg/flytekit
GitHub
06/08/2023, 6:31 AM
<https://github.com/flyteorg/flytekit/tree/master|master>
by pingsutw
<https://github.com/flyteorg/flytekit/commit/06d95b805d36998b58cce93f3837f0bd0ec0b945|06d95b80>
- More time info for time line deck (#1680)
flyteorg/flytekit
GitHub
06/08/2023, 7:03 AM
dask plugin creates a DaskJob k8s resource, it will initially not have the .Status.JobStatus property set.
So flyteplugins/go/tasks/plugins/k8s/dask/dask.go, Line 316 in </flyteorg/flyteplugins/commit/de1d1d345f3bf487701b1f732b2c56f76e81e086|de1d1d3>, will return an empty string until the DaskOperator sets the status for the first time. While this period is very brief, we've seen it happen a few times.
This PR adds support for the "" state, effectively handling it the same as being queued/initializing.
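A minimal sketch of the idea, written in Python for illustration only (the real plugin is Go code in flyteplugins). The `Phase` enum, the `phase_for` function, and the non-empty status strings are assumptions for this example; the only detail taken from the description above is that the empty string is treated like a queued/initializing job.

```python
from enum import Enum


class Phase(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical status-to-phase table. The "" entry covers the short window
# before the operator has written any status to the resource.
_STATUS_TO_PHASE = {
    "": Phase.QUEUED,
    "JobCreated": Phase.QUEUED,      # assumed status string
    "ClusterCreated": Phase.QUEUED,  # assumed status string
    "Running": Phase.RUNNING,        # assumed status string
    "Successful": Phase.SUCCEEDED,   # assumed status string
    "Failed": Phase.FAILED,          # assumed status string
}


def phase_for(job_status: str) -> Phase:
    """Translate a job status string into a task phase."""
    try:
        return _STATUS_TO_PHASE[job_status]
    except KeyError:
        # Unknown states are surfaced loudly instead of being guessed at.
        raise ValueError(f"unhandled job status: {job_status!r}") from None


if __name__ == "__main__":
    print(phase_for(""))
    print(phase_for("Running"))
```

Without the `""` entry, the lookup would raise during that brief initial period, which is the failure mode the PR description reports.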
Type
☑︎ Bug Fix
☐ Feature
☐ Plugin
Are all requirements met?
☐ Code completed
☐ Smoke tested
☑︎ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue
flyteorg/flyteplugins
Codecov: 62.62% (-0.21%) compared to 06866ee
✅ 6 other checks have passed
6/7 successful checks
GitHub
06/08/2023, 2:31 PM
flyte:propeller:all:free_workers_count
flyte:propeller:all:round:abort_error[5m]
flyte:propeller:all:round:system_error_unlabeled[5m]
flyte:propeller:all:node:plugin:.*_failure_unlabeled
flyte:propeller:all:node:plugin:.*_success_unlabeled
flyte:propeller:all:round:raw_unlabeled_ms[5m]
flyte:propeller:all:round:raw_ms[5m]
flyte:propeller:all:round:panic_unlabeled[5m]
flyte:propeller:all:collector:flyteworkflow
flyte:propeller:all:metastore:cache_hit
flyte:propeller:all:metastore:cache_miss
flyte:propeller:all:metastore:head_failure_unlabeled
We can only see the following in datadog when we search for 'propeller':
flyte_admin_admin_builder_flytepropeller_build_failures.count
flyte_admin_admin_builder_flytepropeller_build_successes.count
flyte_admin_admin_execution_manager_propeller_failures.count
These seem to be flyte admin logs, not propeller logs.
Expected result: All flyte propeller metrics should be exposed via the metrics port.
flyteorg/flyte
GitHub
06/08/2023, 3:58 PM
<https://github.com/flyteorg/flyteplugins/tree/master|master>
by jeevb
<https://github.com/flyteorg/flyteplugins/commit/aa1ed678ec085a78f3295e51e6153fb56e9bf1ed|aa1ed678>
- [Bigquery] Add support for impersonation of GSA bound to task's KSA (#355)
flyteorg/flyteplugins
GitHub
06/08/2023, 3:59 PM
GitHub
06/08/2023, 5:18 PM
AddFlyteCustomizationsToContainer (here and here). This PR updates the BuildRawPod function to call a refactored BuildRawContainer to ensure that AddFlyteCustomizationsToContainer is only called once, so that env vars are injected only a single time.
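The effect of calling a customization step twice can be illustrated with a small Python sketch. Everything here is hypothetical (the function names, the env var name, and the list-of-tuples representation); it is not the flyteplugins Go code, only a model of why a second call duplicates the injected variables.

```python
from typing import Dict, List, Tuple

EnvVar = Tuple[str, str]


def add_customizations(env: List[EnvVar], injected: Dict[str, str]) -> List[EnvVar]:
    """Append the injected variables to a container's env list (hypothetical)."""
    return env + list(injected.items())


def build_container(env: List[EnvVar], injected: Dict[str, str]) -> List[EnvVar]:
    """Build a container env; customizations are applied here, once."""
    return add_customizations(env, injected)


def build_pod_buggy(env: List[EnvVar], injected: Dict[str, str]) -> List[EnvVar]:
    """Model of the bug: the pod builder applies customizations a second time."""
    container_env = build_container(env, injected)
    return add_customizations(container_env, injected)


def build_pod_fixed(env: List[EnvVar], injected: Dict[str, str]) -> List[EnvVar]:
    """Model of the fix: the pod builder relies on the container builder only."""
    return build_container(env, injected)


if __name__ == "__main__":
    base = [("PATH", "/usr/bin")]
    extra = {"FLYTE_EXAMPLE_VAR": "1"}  # made-up variable name
    print(build_pod_buggy(base, extra))
    print(build_pod_fixed(base, extra))
```

Routing the pod builder through a single container builder keeps the customization in one place, so the duplication cannot reappear in only one of the two code paths.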
Type
☑︎ Bug Fix
☐ Feature
☐ Plugin
Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☐ Unit tests added
☑︎ Code documentation added
☐ Any pending items have an associated Issue
Complete description
^^^
Tracking Issue
fixes flyteorg/flyte#3740
Follow-up issue
NA
flyteorg/flyteplugins
✅ All checks have passed
2/2 successful checks
GitHub
06/08/2023, 6:08 PM
GitHub
06/08/2023, 6:26 PM
GitHub
06/08/2023, 7:07 PM
GitHub
06/08/2023, 7:11 PM
<https://github.com/flyteorg/flyteplugins/tree/master|master>
by hamersaw
<https://github.com/flyteorg/flyteplugins/commit/2a7a6f9c6b888bc51352e6997cfd6cdffcc96d42|2a7a6f9c>
- Fix initial dask job state (#357)
flyteorg/flyteplugins
GitHub
06/08/2023, 7:12 PM
GitHub
06/08/2023, 7:38 PM
GitHub
06/08/2023, 9:28 PM
ImageSpec and pyflyte build, specifying packages like transformers[torch,deepspeed]>=4.28.1,<5 will fail during the build step. To fix it, the user needs to wrap the package dep in quotes: 'transformers[torch,deepspeed]>=4.28.1,<5'.
Goal: What should the final outcome look like, ideally?
To improve the ergonomics of this feature, package deps should automatically be wrapped in quotes by flytekit.
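One way such automatic quoting could look, sketched with Python's standard `shlex.quote`. This is an assumption about a possible approach, not a description of how flytekit implements or will implement it; `quote_packages` is a hypothetical helper.

```python
import shlex
from typing import List


def quote_packages(packages: List[str]) -> List[str]:
    """Shell-quote each requirement string.

    shlex.quote leaves strings made only of shell-safe characters alone
    and wraps anything else (here: brackets, '<', '>') in single quotes.
    """
    return [shlex.quote(p) for p in packages]


if __name__ == "__main__":
    deps = ["transformers[torch,deepspeed]>=4.28.1,<5", "numpy"]
    # Joined as it might appear in a shell command such as a pip install line.
    print("pip install " + " ".join(quote_packages(deps)))
```

Quoting matters because an unquoted `>` or `<` is interpreted by the shell as a redirection, so the requirement never reaches the installer intact.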
Describe alternatives you've considered
Keeping the status quo, which would be inconvenient for end users.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyte
GitHub
06/08/2023, 9:29 PM
charts/flyte-core/values.yaml to add a resources block and verified that running make helm updated the generated helm chart:
# Source: flyte/charts/flyte/templates/clusterresourcesync/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: syncresources
namespace: flyte
...
spec:
...
template:
metadata:
annotations:
configChecksum: "475154c41cdb06999025ab796aa1264fa3d235df51ac088a39c89c7ce300408"
labels:
app.kubernetes.io/name: flyteclusterresourcesync
app.kubernetes.io/instance: flyte
helm.sh/chart: flyte-v0.1.10
app.kubernetes.io/managed-by: Helm
spec:
...
serviceAccountName: flyteadmin
resources:
limits:
cpu: 250m
ephemeral-storage: 100Mi
memory: 500Mi
requests:
cpu: 10m
ephemeral-storage: 50Mi
memory: 50Mi
volumes:
...
Check all the applicable boxes
☐ I updated the documentation accordingly.
☐ All new and existing tests passed.
☐ All commits are signed-off.
Screenshots
N/A
Note to reviewers
N/A
flyteorg/flyte
✅ All checks have passed
8/8 successful checks
GitHub
06/08/2023, 9:51 PM
GitHub
06/08/2023, 10:01 PM
GitHub
06/08/2023, 10:47 PM
<https://github.com/flyteorg/flytekit/tree/master|master>
by pingsutw
<https://github.com/flyteorg/flytekit/commit/8437023ad1a4fce367a2e1ac2368276154b5e927|8437023a>
- Rename external plugin to agent (#1666)
flyteorg/flytekit
GitHub
06/08/2023, 10:50 PM
kubeflow.pytorch plugin instead of legacy pytorch plugin by @fg91 in #1678
• More time info for time line deck by @Yicheng-Lu-llll in #1680
• Rename external plugin to agent by @pingsutw in #1666
Full Changelog: v1.6.2b0...v1.6.2b1
flyteorg/flytekit