# contribute
f
https://github.com/flyteorg/flytekit/blob/c8433ea2d2ecf362815b1ee63554549ced9acd75/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L67 @Byron Hsu do you know what the use case is to specify the image in the worker/master spec in the new pytorch task config? Doesn’t this mess with flyte’s way of specifying the image in the task decorator?
b
Not Byron but I’ve implemented something similar for
dask
in case folks want to have a different image in one of the components. Docs (and what takes precedence) here
f
Thanks! @Byron Hsu do you know whether the precedence is handled the same way for the new pytorch config?
b
@Yubo Wang ^
y
if you don’t specify image for the replica group, it will just use the image you set in the decorator
f
Thus if one also doesn’t set one there, it uses the one from
pyflyte run
I guess?
y
yes
this change is backward compatible
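A minimal sketch of that precedence, assuming the new config exposes Worker and Master dataclasses with an optional image field as in the linked task.py (names and fields here are illustrative, not the confirmed API):

from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch
# Assumed per-replica specs; check the plugin source linked above for the actual names.
from flytekitplugins.kfpytorch import Master, Worker


@task(
    task_config=PyTorch(
        # an image set on the replica group wins ...
        worker=Worker(replicas=2, image="ghcr.io/example/train-worker:latest"),
        # ... no image here, so the master falls back to the task-level image ...
        master=Master(replicas=1),
    ),
    # ... which in turn falls back to the default image passed via `pyflyte run --image ...`
    container_image="ghcr.io/example/train:latest",
    requests=Resources(cpu="3", mem="3Gi"),
)
def train():
    print("Trained")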
f
Thanks for the quick reply guys! I need a bit more help 🙏
from flytekit import task, workflow, Resources
from flytekitplugins.kfpytorch import Elastic, PyTorch


@task(
    task_config=PyTorch(
        num_workers=1
    ),
    requests=Resources(cpu="3", mem="3Gi"),
)
def train():
    print("Trained")


@workflow
def wf():
    train()
When I run this workflow with
flytekit==1.6.2
installed locally, it runs; the propeller image is also version 1.6.2. When I install latest master of flytekit, the task stays queued with this error:
E0606 16:05:16.771720       1 workers.go:102] error syncing 'development/f0024b9843ea14ed3b8c': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
I guess to use the new pytorch task config I’ll need a new propeller image, right? Do we already have an image built in CI somewhere?
And will the new propeller image be backwards compatible for users who still have older flytekitplugins-kf-pytorch versions?
y
it should be backward compatible for users who still have older flytekitplugins-kf-pytorch
interesting, I think the number-of-workers check was there even before the change
so you want to be able to set worker to 0?
f
No I’m running this code.
y
oh you set it to 1
f
But I’m registering from flytekit latest master installed against flytepropeller 1.6.2
This doesn’t work right?
y
I don’t think flytepropeller has a version of 1.6.2. I might have made a mistake in the readme. Can we use the release version instead?
what is the tag of your latest commit
btw, yes, if flytekit is upgraded and flytepropeller is not, it will cause issues
f
❯ k -n flyte get pod flytepropeller-668549d8bd-srqk2 -o yaml | grep image
    image: cr.flyte.org/flyteorg/flytepropeller-release:v1.6.2
But I think this is related to the tags in the
flyte
repo not in propeller.
y
can you help me test using the new config?
I think the 1.6.2 should already include the changes I made
f
Yes 🙂 I’ll end the day soon but tomorrow I’ll build a propeller image from latest master and run it in our staging cluster.
Let’s see whether I can then register the dummy workflow from above.
y
sure, I will do some testing on my end as well
f
I think the 1.6.2 should already include the changes I made
If this is the case (didn’t check), then something is broken, right?
Can you try to run my dummy workflow please, registering from flytekit master or some version that includes the new pytorch config?
y
will do today
f
Thanks
e
What Yubo said. We need a new flytepropeller image. Yubo's change is only present in release 1.6.2b0 of flytekit. We're preparing 1.7.0, which should be out tomorrow and should include Yubo's change.
y
I tested it locally and it seems the issue is there; let me do a fix tomorrow when I wake up
@Fabio Grätz if you are in a rush, you can help fix this: https://github.com/flyteorg/flytekit/blob/c8433ea2d2ecf362815b1ee63554549ced9acd75/plugins/flytekit-kf-tensorflow/flytekitplugins/kftensorflow/task.py#L151 Add the same code as I have put in tensorflow; I forgot to set the task version to 1 in PyTorch
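For reference, a rough sketch of what this fix might look like, assuming it amounts to passing task_type_version=1 to the base task constructor in the pytorch plugin, the same way the linked tensorflow task.py does (a sketch under that assumption, not the actual patch):

from flytekit.core.python_function_task import PythonFunctionTask
from flytekitplugins.kfpytorch import PyTorch


class PyTorchFunctionTask(PythonFunctionTask[PyTorch]):
    _PYTORCH_TASK_TYPE = "pytorch"

    def __init__(self, task_config: PyTorch, task_function, **kwargs):
        super().__init__(
            task_config=task_config,
            task_type=self._PYTORCH_TASK_TYPE,
            task_function=task_function,
            # assumed fix: bump the task type version so propeller knows to parse
            # the new distributed pytorch training proto instead of the legacy one
            task_type_version=1,
            **kwargs,
        )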
f
thanks!
I have one more question about specifying the master or worker images to be different from the flyte task image: Do you guys know what the difference between the master and the worker is? I never found anything useful about this in the kubeflow docs and I was never in a situation where I needed the master to be different than the workers. Pytorch distributed (I mean the python library, not kubeflow pytorch job) itself doesn’t seem to have this differentiation at all. And in elastic training kubeflow pytorchjob doesn’t use master anymore.
Basically I’m asking: do any of us have a good use case for different images in master and workers, or for more than 1 master replica?
y
In my opinion, there is no use case that requires different images, but there definitely are use cases that require different resources. I think the image option is there more to stay consistent with other plugins like mpijob
Do you think we should remove the image options? I can remove them. Looks like they’re confusing you
f
I don’t have super strong opinions and don’t want to be annoying, so please excuse me upfront for starting this discussion haha. Kubeflow unfortunately doesn’t have much (any?) documentation about when one would want to customize certain aspects of their Job types. For instance, one can use more than 1 master replica, but I don’t know what this would mean or when one would do this. Since Flyte is an abstraction layer on top of K8s that hides away the details that ML engineers and data scientists don’t want to deal with, I think it would be nice to shield users from the options that even we as power users don’t know any use case for. (I can well imagine that not everything in the kubeflow training operators is well thought through, given how the rest of the project looks.) And I’m wondering whether images are a candidate for this in PytorchJob, totally seeing that a power user would want to customize the images for e.g. Spark, …
• I’m curious what your use case is for different resources in master and worker (not because I’m arguing against it but because I actually don’t know what the difference between master and worker is and would love to know. Have never seen good docs about it)
• Do you know if using more than 1 master actually makes any sense?
• Can we as power users/someone who has used raw kubeflow PytorchJob before come up with a use case for different images in PytorchJob? If not, should we expose this to the user? What would we write in the docstring for when to do this?
y
So actually, I agree with your thoughts. When I was working on this, our use cases were:
1. TFJob, where ps/chief replicas require different resources than worker replicas (see the sketch after this message)
2. MPIJob, where launcher replicas have a different command than worker replicas
At that time my thinking was: Flyte does not allow users to set specific details on the kubeflow jobs, and I wanted to expose those fields for users. However, now I agree with your thoughts. Maybe we should hide the complexity from users. So as of now, I can either:
1. revert the pytorch plugin to its original state, or
2. make basic usage more intuitive, so that users are not aware of the other fields that can be set.
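The first of those use cases would look roughly like this with the new kftensorflow config, assuming it exposes Chief/PS/Worker dataclasses with per-replica requests (illustrative names, not the confirmed API):

from flytekit import Resources, task
from flytekitplugins.kftensorflow import TfJob
# Assumed per-replica specs; check the kftensorflow plugin source for the actual names.
from flytekitplugins.kftensorflow import PS, Chief, Worker


@task(
    task_config=TfJob(
        # workers do the heavy lifting and get the large nodes ...
        worker=Worker(replicas=4, requests=Resources(cpu="8", mem="32Gi")),
        # ... while the chief and parameter servers only coordinate and can stay small
        chief=Chief(replicas=1, requests=Resources(cpu="1", mem="2Gi")),
        ps=PS(replicas=2, requests=Resources(cpu="2", mem="8Gi")),
    ),
)
def train():
    print("Trained")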
f
Are you joining the contributors’ sync today? 🙂 Maybe we can discuss there
y
sounds like a good idea. I will attend. I think we need a collective decision. I am not a pytorch expert; my changes are based on my presumptions from the other kubeflow plugins. Thanks for bringing this up though!
f
Thanks for coming to the meeting to discuss this @Yubo Wang. I find the two arguments convincing: 1) with the other plugins we strive for parity with the underlying CRDs, and 2) it’s better to expose one parameter too many, even if it’s not of much use to most people, than to omit one that could have been useful to somebody.