# contribute
f
https://github.com/flyteorg/flytekit/blob/c8433ea2d2ecf362815b1ee63554549ced9acd75/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L67 @Byron Hsu do you know what the use case is to specify the image in the worker/master spec in the new pytorch task config? Doesn’t this mess with flyte’s way of specifying the image in the task decorator?
b
Not Byron but I’ve implemented something similar for
dask
in case folks want to have a different image in one of the components. Docs (and what takes precedence) here
f
Thanks! @Byron Hsu do you know whether the precedence is handled the same way for the new pytorch config?
b
@Yubo Wang ^
y
if you don’t specify image for the replica group, it will just use the image you set in the decorator
f
Thus if one also doesn’t set one there, it uses the one from
pyflyte run
I guess?
y
yes
this change is backward compatible
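A minimal sketch of that precedence, assuming the new config exposes Worker and Master dataclasses with an optional image field as in the linked task.py (names and fields here are illustrative, not the confirmed API):

from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch
# Assumed per-replica specs; check the plugin source linked above for the actual names.
from flytekitplugins.kfpytorch import Master, Worker


@task(
    task_config=PyTorch(
        # an image set on the replica group wins ...
        worker=Worker(replicas=2, image="ghcr.io/example/train-worker:latest"),
        # ... no image here, so the master falls back to the task-level image ...
        master=Master(replicas=1),
    ),
    # ... which in turn falls back to the default image passed via `pyflyte run --image ...`
    container_image="ghcr.io/example/train:latest",
    requests=Resources(cpu="3", mem="3Gi"),
)
def train():
    print("Trained")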
f
Thanks for the quick reply guys! I need a bit more help 🙏
from flytekit import task, workflow, Resources
from flytekitplugins.kfpytorch import Elastic, PyTorch


@task(
    task_config=PyTorch(
        num_workers=1
    ),
    requests=Resources(cpu="3", mem="3Gi"),
)
def train():
    print("Trained")


@workflow
def wf():
    train()
When I run this workflow with
flytekit==1.6.2
installed locally, it runs; the propeller image is also version 1.6.2. When I install latest master of flytekit, the task stays queued with this error:
E0606 16:05:16.771720       1 workers.go:102] error syncing 'development/f0024b9843ea14ed3b8c': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
I guess to use the new pytorch task config I’ll need a new propeller image, right? Do we already have an image built in CI somewhere?
And will the new propeller image be backwards compatible for users who still have older flytekitplugins-kf-pytorch versions?
y
it should be backward compatible for users who still have older flytekitplugins-kf-pytorch
interesting, I think the number-of-workers check was there even before the change
so you want to be able to set worker to 0?
f
No I’m running this code.
y
oh you set it to 1
f
But I’m registering from flytekit latest master installed against flytepropeller 1.6.2
This doesn’t work right?
y
I don’t think flytepropeller has a version of 1.6.2. I might have made a mistake in the readme. Can we use the release version instead?
what is the tag of your latest commit
btw, yes, if flytekit is upgraded and flytepropeller is not, it will cause issues
f
❯ k -n flyte get pod flytepropeller-668549d8bd-srqk2 -o yaml | grep image
    image: cr.flyte.org/flyteorg/flytepropeller-release:v1.6.2
But I think this is related to the tags in the
flyte
repo not in propeller.
y
can you help me test using the new config?
I think the 1.6.2 should already include the changes I made
f
Yes 🙂 I’ll end the day soon but tomorrow I’ll build a propeller image from latest master and run it in our staging cluster.
Let’s see whether I can then register the dummy workflow from above.
y
sure, I will do some testing on my end as well
f
I think the 1.6.2 should already include the changes I made
If this is the case (didn’t check), then something is broken, right?
Can you try to run my dummy workflow please, registering from flytekit master or some version that includes the new pytorch config?
y
will do today
f
Thanks
e
What Yubo said. We need a new flytepropeller image. Yubo's change is only present in release 1.6.2b0 of flytekit. We're preparing 1.7.0, which should be out tomorrow and should include Yubo's change.
y
I tested it locally and it seems the issue is there; let me do a fix tomorrow when I wake up
@Fabio Grätz if you are in a rush, you can help fix this: https://github.com/flyteorg/flytekit/blob/c8433ea2d2ecf362815b1ee63554549ced9acd75/plugins/flytekit-kf-tensorflow/flytekitplugins/kftensorflow/task.py#L151 Add the same code as I have put in tensorflow; I forgot to set the task version to 1 in PyTorch
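For reference, a rough sketch of what this fix might look like, assuming it amounts to passing task_type_version=1 to the base task constructor in the pytorch plugin, the same way the linked tensorflow task.py does (a sketch under that assumption, not the actual patch):

from flytekit.core.python_function_task import PythonFunctionTask
from flytekitplugins.kfpytorch import PyTorch


class PyTorchFunctionTask(PythonFunctionTask[PyTorch]):
    _PYTORCH_TASK_TYPE = "pytorch"

    def __init__(self, task_config: PyTorch, task_function, **kwargs):
        super().__init__(
            task_config=task_config,
            task_type=self._PYTORCH_TASK_TYPE,
            task_function=task_function,
            # assumed fix: bump the task type version so propeller knows to parse
            # the new distributed pytorch training proto instead of the legacy one
            task_type_version=1,
            **kwargs,
        )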
f
thanks!
I have one more question about specifying the master or worker images to be different from the flyte task image: Do you guys know what the difference between the master and the worker is? I never found anything useful about this in the kubeflow docs and I was never in a situation where I needed the master to be different than the workers. Pytorch distributed (I mean the python library, not kubeflow pytorch job) itself doesn’t seem to have this differentiation at all. And in elastic training kubeflow pytorchjob doesn’t use master anymore.
Basically I’m asking: do any of us have a good use case for different images in master and workers, or for more than 1 master replica?
y
In my opinion, there is no use case that requires different images, but there definitely are use cases that require different resources. I think the image option is there more to stay consistent with other plugins like mpijob
Do you think we should remove the image options? I can remove them. Looks like they’re confusing you
f
I don’t have super strong opinions and don’t want to be annoying, so please excuse me upfront for starting this discussion haha. Kubeflow unfortunately doesn’t have much (any?) documentation about when one would want to customize certain aspects of their Job types. For instance, one can use more than 1 master replica, but I don’t know what this would mean or when one would do this. Since Flyte is an abstraction layer on top of K8s that hides away the details that ML engineers and data scientists don’t want to deal with, I think it would be nice to shield users from the options that even we as power users don’t know any use case for. (I can well imagine that not everything in the kubeflow training operators is well thought through, given how the rest of the project looks.) And I’m wondering whether images are a candidate for this in PytorchJob, totally seeing that a power user would want to customize the images for e.g. Spark, …
• I’m curious what your use case is for different resources in master and worker (not because I’m arguing against it but because I actually don’t know what the difference between master and worker is and would love to know. Have never seen good docs about it)
• Do you know if using more than 1 master actually makes any sense?
• Can we as power users/someone who has used raw kubeflow PytorchJob before come up with a use case for different images in PytorchJob? If not, should we expose this to the user? What would we write in the docstring for when to do this?
y
So actually, I agree with your thoughts. When I was working on this, our use cases were:
1. TFJob, where ps/chief replicas require different resources than worker replicas (see the sketch after this message)
2. MPIJob, where launcher replicas have a different command than worker replicas
At that time my thinking was: Flyte does not allow users to set specific details on the kubeflow jobs, and I wanted to expose those fields for users. However, now I agree with your thoughts. Maybe we should hide the complexity from users. So as of now, I can either:
1. revert the pytorch plugin to its original state, or
2. make basic usage more intuitive, so that users are not aware of the other fields that can be set.
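The first of those use cases would look roughly like this with the new kftensorflow config, assuming it exposes Chief/PS/Worker dataclasses with per-replica requests (illustrative names, not the confirmed API):

from flytekit import Resources, task
from flytekitplugins.kftensorflow import TfJob
# Assumed per-replica specs; check the kftensorflow plugin source for the actual names.
from flytekitplugins.kftensorflow import PS, Chief, Worker


@task(
    task_config=TfJob(
        # workers do the heavy lifting and get the large nodes ...
        worker=Worker(replicas=4, requests=Resources(cpu="8", mem="32Gi")),
        # ... while the chief and parameter servers only coordinate and can stay small
        chief=Chief(replicas=1, requests=Resources(cpu="1", mem="2Gi")),
        ps=PS(replicas=2, requests=Resources(cpu="2", mem="8Gi")),
    ),
)
def train():
    print("Trained")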
f
Are you joining the contributors’ sync today? 🙂 Maybe we can discuss there
y
sounds like a good idea. I will attend. I think we need a collective decision. I am not a pytorch expert; my changes are based on my presumptions from the other kubeflow plugins. Thanks for bringing this up though!
f
Thanks for coming to the meeting to discuss this @Yubo Wang. I find the two arguments convincing: 1) with the other plugins we strive for parity with the underlying CRDs, and 2) it’s better to expose one parameter too many, even if it’s not of much use to most people, than to omit one that could have been useful to somebody.