GitHub
04/20/2023, 8:21 PM
GitHub
04/20/2023, 9:06 PM
GitHub
04/20/2023, 9:45 PM
<https://github.com/flyteorg/flyte/tree/master|master>
by davidmirror-ops
<https://github.com/flyteorg/flyte/commit/478137b1cf476575c25043de70378c06d723d1c0|478137b1>
- Update troubleshooting guide (#3552)
flyteorg/flyte
GitHub
04/20/2023, 9:45 PM
flytectl demo cluster, though the commands should transfer over to nearly any other Flyte cluster.
Here are example topics that would fit in this section:
• Scoped down to the flytectl demo cluster
• Getting error logs from failing pods
• Getting cluster configuration values (e.g. project, domain, task-level resources)
• Building and pushing a custom docker image to a flytectl demo cluster
• Move the contents from the Troubleshooting guide here
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyte
GitHub
04/20/2023, 10:05 PM
GitHub
04/20/2023, 10:19 PM
GitHub
04/20/2023, 10:26 PM
GitHub
04/20/2023, 11:03 PM
GitHub
04/20/2023, 11:44 PM
<https://github.com/flyteorg/flytekit/tree/master|master>
by pingsutw
<https://github.com/flyteorg/flytekit/commit/8c05797feeaf0e47a56075bf1c725c8284157660|8c05797f>
- pyflyte run imperative workflow (#1597)
flyteorg/flytekit
GitHub
04/20/2023, 11:53 PM
is_inside will check whether the current image is built from the image_spec when running remotely. This function always returns true when running locally, which means it will import all the packages in local execution.
from flytekit import task, workflow, Resources
from flytekit.image_spec import ImageSpec

new_flytekit = "git+https://github.com/flyteorg/flytekit@db448b054533de52e3a2d7fdb08be31e7b8a5c7d"

tensorflow_image_spec = ImageSpec(packages=["tensorflow", new_flytekit], apt_packages=["git"], registry="pingsutw")
torch_image_spec = ImageSpec(packages=["torch", new_flytekit], apt_packages=["git"], registry="pingsutw")

# is_inside() gates each import so a task container only imports the packages
# its own image actually contains; locally, both branches run.
if tensorflow_image_spec.is_inside():
    print("import tensorflow")
    import tensorflow

if torch_image_spec.is_inside():
    print("import torch")
    import torch

@task(container_image=tensorflow_image_spec, requests=Resources(mem="900Mi"))
def t2():
    tensorflow.constant(4)
    print("tensorflow")

@task(container_image=torch_image_spec, requests=Resources(mem="900Mi"))
def t1():
    torch.tensor([[1., -1.], [1., -1.]])
    print("torch")

@workflow
def wf():
    t1()
    t2()

if __name__ == '__main__':
    wf()
Type
☐ Bug Fix
☑︎ Feature
☐ Plugin
Are all requirements met?
☐ Code completed
☐ Smoke tested
☐ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue
Complete description
^^^
Tracking Issue
NA
Follow-up issue
NA
flyteorg/flytekit
GitHub Actions: build-plugins (3.8, flytekit-kf-mpi)
GitHub Actions: build-plugins (3.8, flytekit-k8s-pod)
GitHub Actions: build-plugins (3.8, flytekit-hive)
GitHub Actions: build-plugins (3.8, flytekit-greatexpectations)
GitHub Actions: build-plugins (3.8, flytekit-envd)
GitHub Actions: build-plugins (3.8, flytekit-duckdb)
GitHub Actions: build-plugins (3.8, flytekit-dolt)
GitHub Actions: build-plugins (3.8, flytekit-deck-standard)
GitHub Actions: build-plugins (3.8, flytekit-dbt)
GitHub Actions: build-plugins (3.8, flytekit-data-fsspec)
GitHub Actions: build-plugins (3.8, flytekit-dask)
GitHub Actions: build-plugins (3.8, flytekit-bigquery)
GitHub Actions: build-plugins (3.8, flytekit-aws-sagemaker)
GitHub Actions: build-plugins (3.8, flytekit-aws-batch)
GitHub Actions: build-plugins (3.8, flytekit-aws-athena)
GitHub Actions: build (windows-latest, 3.11)
GitHub Actions: build (windows-latest, 3.9)
GitHub Actions: build (windows-latest, 3.8)
GitHub Actions: build (ubuntu-latest, 3.11)
GitHub Actions: build (ubuntu-latest, 3.10)
GitHub Actions: build (ubuntu-latest, 3.9)
GitHub Actions: build (ubuntu-latest, 3.8)
GitHub Actions: lint
GitHub Actions: Docs Warnings
GitHub Actions: docs
✅ 2 other checks have passed
2/27 successful checks
GitHub
04/21/2023, 12:12 AM
(image attachments)
GitHub
04/21/2023, 12:57 AM
GitHub
04/21/2023, 1:18 AM
GitHub
04/21/2023, 12:54 PM
Placing additional CA certificates in ~/.flyte/sandbox/ca-certificates will suffice to add those to the sandbox trust store.
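For instance, a minimal sketch (certificate path and filename are hypothetical) of staging an extra CA bundle in that directory before creating the sandbox:
import shutil
from pathlib import Path

# Hypothetical example: copy a local CA bundle into the sandbox trust-store
# directory so the demo cluster picks it up.
ca_dir = Path.home() / ".flyte" / "sandbox" / "ca-certificates"
ca_dir.mkdir(parents=True, exist_ok=True)
shutil.copy("/path/to/extra-root-ca.pem", ca_dir / "extra-root-ca.pem")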
flyteorg/flyte
GitHub Actions: build-and-push-sandbox-bundled-image
✅ 15 other checks have passed
15/16 successful checks
GitHub
04/21/2023, 12:54 PM
<https://github.com/flyteorg/flyte/tree/master|master>
by jeevb
<https://github.com/flyteorg/flyte/commit/bf6d6350c43c73939d4dab718b34e51c8ec2859e|bf6d6350>
- Use additional ca-certificates in flyte sandbox configuration directory (#3609)
flyteorg/flyte
GitHub
04/21/2023, 1:36 PM
Adds CleanupOnFailure to TaskNodeStatus to track situations where a plugin execution should be cleaned up by Abort even though it is reported as a failure.
Type
☑︎ Bug Fix
☐ Feature
☐ Plugin
Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☐ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue
Complete description
^^^
Tracking Issue
flyteorg/flyte#3239
Follow-up issue
NA
flyteorg/flytepropeller
GitHub Actions: Build & Push Flytepropeller Image
GitHub Actions: Goreleaser
GitHub Actions: Bump Version
✅ 11 other checks have passed
11/14 successful checks
GitHub
04/21/2023, 2:04 PM
Workflows containing tasks with cache_serialize enabled can fail to correctly abort if a task with cache_serialize is processed during abort but has not yet started. This is because the ReleaseCatalogReservation function attempts to read the input values (to compute the cache key). If the node has not yet started, then Flyte has not yet written the input values.
Expected behavior
Aborts over NotYetStarted tasks with cache_serialize should not error.
Additional context to reproduce
This error is reproducible with the following workflow:
import time
import flytekit
from flytekit import workflow, task
from flytekit.types.file import FlyteFile
from pathlib import Path
from typing import Optional

@task(cache=True, cache_serialize=True, cache_version='0.0.0')
def t1() -> FlyteFile:
    out = Path(flytekit.current_context().working_directory) / 'out.txt'
    out.write_text('Hi from t1')
    return FlyteFile(path=str(out))

@task(cache=True, cache_serialize=True, cache_version='0.0.0')
def t2(t1_: FlyteFile, optional: Optional[int] = None) -> FlyteFile:
    out = Path(flytekit.current_context().working_directory) / 'out.txt'
    with open(t1_, 'r') as f:
        out.write_text(f.read() + f'\nHi from t2, optional={optional}')
    return FlyteFile(path=str(out))

@workflow
def inner_wf(t1_: FlyteFile, optional: Optional[int] = None) -> FlyteFile:
    return t2(t1_=t2(t1_=t1_, optional=optional), optional=optional)

@workflow
def wf() -> FlyteFile:
    return inner_wf(t1_=t1())
This is certainly not a minimal reproduction, but the workflow fails because node n0 (subworkflow inner_wf) is unable to resolve the optional value. In attempting to abort the workflow, Flyte attempts to abort task t1. Since t1 has cache_serialize enabled, the abort process fails during ReleaseCatalogReservation.
The failure is reported as:
{
  "json": {
    "exec_id": "myexecid",
    "ns": "myns",
    "res_ver": "445663335",
    "routine": "worker-7",
    "wf": "myproject:myns:mywf.mytask"
  },
  "level": "error",
  "msg": "Error when trying to reconcile workflow. Error [0: failed at Node[n0]. CatalogCallFailed: failed to release reservation, caused by: failed to read inputs when trying to query catalog: [READ_FAILED] failed to read data from dataDir [gs://mybucket/metadata/propeller/myproject-myns-myexecid/n1/data/0/n0/inputs.pb]., caused by: path:gs://mybucket/metadata/propeller/myproject-myns-myexecid/n1/data/0/n0/inputs.pb: not found]. Error Type[errors.ErrorCollection]",
  "ts": "2023-04-20T12:13:39Z"
}
Screenshots
No response
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyte
GitHub
04/21/2023, 2:15 PM
@task(
    pod_template=PodTemplate(
        pod_spec=V1PodSpec(
            service_account_name="trainer-service",
            containers=[],
        )
    )
)
attempts to manually set the service account of the k8s Pod to trainer-service. However, Flyte overrides this value, setting the service account to 'default' (by default).
This seems to be because the PodPlugin always overrides the ServiceAccount in this logic. However, the previous implementation first checked whether the ServiceAccount was empty. The latter seems to be the correct implementation.
Expected behavior
Users should be able to override the k8s Pod service account using the PodTemplate.
Additional context to reproduce
Create a task with a service account, run it, and note the manually set service account is overridden.
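For example, a minimal reproduction sketch (the task and workflow names are placeholders; assumes a flytekit version with PodTemplate support and the kubernetes Python client installed):
from flytekit import task, workflow, PodTemplate
from kubernetes.client import V1PodSpec

@task(
    pod_template=PodTemplate(
        pod_spec=V1PodSpec(
            service_account_name="trainer-service",
            containers=[],
        )
    )
)
def needs_service_account() -> str:
    # When run remotely, the pod's serviceAccountName is expected to be
    # "trainer-service" but currently comes back as "default".
    return "done"

@workflow
def wf() -> str:
    return needs_service_account()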
Screenshots
No response
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyte
GitHub
04/21/2023, 4:59 PM
<https://github.com/flyteorg/flyteconsole/tree/master|master>
by jsonporter
<https://github.com/flyteorg/flyteconsole/commit/e4919df49f5f45f74bc27ee55ff62784d2b26bad|e4919df4>
- TLM add log-message window to left panel (#748)
flyteorg/flyteconsole
GitHub
04/21/2023, 8:41 PM
<https://github.com/flyteorg/flytekit/tree/master|master>
by kumare3
<https://github.com/flyteorg/flytekit/commit/9fafa7800af46e2cb18c08c66bf72bc5d6db1d49|9fafa780>
- Backfill should fill in the right input vars (#1593)
flyteorg/flytekit
GitHub
04/21/2023, 10:04 PM
Screenshot 2023-04-21 at 3 03 37 PM
GitHub
04/21/2023, 11:16 PM
<https://github.com/flyteorg/flyteplugins/tree/master|master>
by pingsutw
<https://github.com/flyteorg/flyteplugins/commit/1f3916383b86c69ccc3b6ef6573f9356bf4cdfb0|1f391638>
- External Plugin Service (grpc) (#330)
flyteorg/flyteplugins
GitHub
04/21/2023, 11:16 PM
GitHub
04/21/2023, 11:34 PM
GitHub
04/22/2023, 1:51 AM
<https://github.com/flyteorg/flytekit/tree/master|master>
by eapolinario
<https://github.com/flyteorg/flytekit/commit/f5c5abe48dd351337724c17f41cec2fa44f8222a|f5c5abe4>
- Pyflyte prettified (#1602)
flyteorg/flytekit
GitHub
04/22/2023, 6:49 PM
GitHub
04/22/2023, 6:50 PM
With torchrun, one can perform distributed training on a single machine with multiple GPUs by starting a local process group in sub-processes. (With the kf-pytorch plugin one currently cannot perform distributed training on a single machine with multiple GPUs.)
In addition, many open source projects and libraries building on top of pytorch now simply assume that one uses torchrun. For instance:
• Finetuning Stanford's new Llama model requires `torchrun`.
• Pytorch ignite assumes that the environment variable `LOCAL_RANK` (which is set by `torchrun`) is set when initializing distributed training, even though this is not required for native pytorch distributed training.
* * *
The kubeflow training operator which is used by Flyte to perform pytorch distributed training in Kubernetes clusters now supports elastic training. Flyte should make use of this to give users the benefits of elastic training.
Goal: What should the final outcome look like, ideally?
Currently, distributed training is configured as follows in `flytekit`:
from flytekitplugins.kfpytorch import PyTorch

@task(
    task_config=PyTorch(
        num_workers=2,
    ),
    ....
)
def train(...):
    ...
Users of torchrun typically start their training e.g. like this:
torchrun .... --nproc-per-node=4 train.py
Flyte users should be able to configure elastic training e.g. like this:
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        replicas=4,
        nproc_per_node=4,
        ...
    ),
    ...
)
def train(...):
    ...
Behind the scenes, a task with such an Elastic task config should use torch.distributed.launcher.api.elastic_launch (which is also used by `torchrun`) to start the task function.
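For illustration only (not the plugin's actual implementation), a minimal sketch of launching a function via elastic_launch, assuming a torch release that provides torch.distributed.launcher.api; the rendezvous settings and function body are placeholders:
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train_fn() -> int:
    # Placeholder body; a real task would build the model and call
    # torch.distributed.init_process_group() here.
    import os
    return int(os.environ["LOCAL_RANK"])

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=4,            # mirrors Elastic(nproc_per_node=4)
    run_id="local-sketch",       # placeholder rendezvous id
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
)
# elastic_launch returns a callable that spawns one worker per local rank and
# returns a dict mapping local rank to the entrypoint's return value.
results = elastic_launch(config, train_fn)()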
Additionally, the values in the Elastic task config should also be used to configure the `ElasticPolicy` of the kubeflow PytorchJob used to perform multi-node distributed training in a Kubernetes cluster.
Describe alternatives you've considered
Propose: Link/Inline OR Additional context
I implemented the required changes:
• `flytekit`: flyteorg/flytekit#1603
• `flyteidl`: flyteorg/flyteidl#394
• `flyteplugins`: flyteorg/flyteplugins#343
• Docs: flyteorg/flytesnacks#987
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyte
GitHub
04/23/2023, 10:55 AM
GitHub
04/23/2023, 11:01 AM
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        replicas=4,
        nproc_per_node=4,
        ...
    ),
    ...
)
def train(...):
    ...
See this issue for motivation and more details.
Type
☐ Bug Fix
☑︎ Feature
☐ Plugin
Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☑︎ Unit tests added
☐ Code documentation added
☑︎ Any pending items have an associated Issue
Complete description
Tracking Issue
Fixes flyteorg/flyte#3614
Follow-up issue
flyteorg/flyteplugins
GitHub Actions: Run tests and lint
✅ 3 other checks have passed
3/4 successful checks
GitHub
04/23/2023, 11:02 AM
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        replicas=4,
        nproc_per_node=4,
        ...
    ),
    ...
)
def train(...):
    ...
Type
☐ Bug Fix
☑︎ Feature
☐ Plugin
Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☑︎ Unit tests added
☑︎ Code documentation added
☑︎ Any pending items have an associated Issue
Complete description
See flyteorg/flyte#3614 for motivation and a more detailed description.
Tracking Issue
flyteorg/flyte#3614
Follow-up issue
flyteorg/flytekit
GitHub Actions: build-plugins (3.11, flytekit-kf-pytorch)
✅ 29 other checks have passed
29/30 successful checks