I have the following issue with FlyteFile + single...
# flyte-support
f
I have the following issue with FlyteFile + the single-node/multi-GPU PyTorch plugin:
1. When the task is decorated only with @task(), the following code works just fine:
from pathlib import Path

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task()
def my_task(dataset: FlyteFile):
    path = dataset.download()
    # works as intended
    assert Path(path).is_file()

@workflow
def my_workflow():
    file = FlyteFile(
        path="s3://path/to/my/file.csv"
    )

    outputs = my_task(dataset=file)
2. When the task is decorated with @task(task_config=Elastic(nnodes=1, nproc_per_node=4)), it breaks:
from pathlib import Path

from flytekit import task, workflow
from flytekit.types.file import FlyteFile
from flytekitplugins.kfpytorch import Elastic


@task(task_config=Elastic(nnodes=1, nproc_per_node=4))
def my_task(dataset: FlyteFile):
    path = dataset.download()
    # .download() returns immediately and the file is not there
    assert Path(path).is_file()  # this will raise

@workflow
def my_workflow():
    file = FlyteFile(
        path="s3://path/to/my/file.csv"
    )

    outputs = my_task(dataset=file)
There's also a warning raised:
.venv/lib/python3.12/site-packages/flytekit/types/file/file.py:356: RuntimeWarning: coroutine 'FileAccessProvider.async_get_data' was never awaited
It seems like FlyteFile does not play well with the underlying multiprocessing spawn. This happens on Flyte 1.15.3.
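For reference, one possible interim workaround is to skip FlyteFile.download() inside the Elastic task and fetch the object directly. This is only a sketch: it assumes fsspec (with s3fs) is installed in the image and that the remote URI is still exposed on dataset.remote_source, which may not hold in every setup.

import os
from pathlib import Path

import fsspec  # assumed to be installed together with s3fs
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekitplugins.kfpytorch import Elastic


@task(task_config=Elastic(nnodes=1, nproc_per_node=4))
def my_task(dataset: FlyteFile):
    # Assumption: the remote URI survives on `remote_source` after deserialization.
    uri = dataset.remote_source
    local_path = os.path.join(os.getcwd(), os.path.basename(uri))
    fs, _, paths = fsspec.get_fs_token_paths(uri)  # resolves s3:// via s3fs
    fs.get(paths[0], local_path)  # blocking download, no asyncio involved
    assert Path(local_path).is_file()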
f
Good catch. It should not stay open in the spawn
In v2 we have changed the architecture of FlyteFile; this should not happen there, but let’s look into this for v1 too
f
"It should not stay open in the spawn"
WDYM?
Hello?
f
For multi-GPU training, torchrun spawns many processes
f
I was asking about "It should not stay open in the spawn"
Can you provide a solution/workaround for that @freezing-airport-6809?
f
But this is just a warning
Should not impact anything
It’s a coroutine handle that was shared
Not run
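To illustrate what that warning means in general (hypothetical names, not flytekit code): a coroutine object gets created but nothing ever runs it, so Python only warns when the object is garbage-collected and the actual work simply never happens.

async def async_get_data():
    # stand-in for the real async download
    return "downloaded"


def child_entrypoint():
    coro = async_get_data()  # coroutine object is created here...
    # ...but nothing awaits or schedules it (e.g. the event loop it was meant
    # to run on does not exist in this spawned process), so at garbage
    # collection Python emits:
    #   RuntimeWarning: coroutine 'async_get_data' was never awaited
    del coro


child_entrypoint()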
f
No, the warning is "additional" - the main reason I wrote this is that it actually breaks
f
Ok I need to reproduce
🤞 1
I won't get to it for a bit, but will do. cc @echoing-account-76888 if you happen to look into this?
🙏 1
🫡 1
e
Sure! I'll look into this
🙏 1
Just a quick update, I can reproduce this on my side. The error occurs on both local and remote. I’m working on a fix. Thank you for your patience! 🙏
🤞 2
Fixed in https://github.com/flyteorg/flytekit/pull/3313. I think it's an edge case using loop_manager.synced + spawn. I left more context in the PR description, feel free to have a look and leave comments!
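For anyone reading along, here is a rough illustration of the pattern involved (an illustrative sketch only, not flytekit's actual loop_manager implementation): a "synced" wrapper runs coroutines on a background event loop so callers get a blocking API, and a child created with the "spawn" start method does not inherit that running loop/thread, so the wrapped call can end up creating a coroutine that nothing ever drives.

import asyncio
import threading


class LoopManagerSketch:
    """Illustrative only: run coroutines on a background event loop so that
    synchronous callers can block on the result."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def synced(self, async_fn):
        def wrapper(*args, **kwargs):
            # Submit the coroutine to the background loop and block on the result.
            fut = asyncio.run_coroutine_threadsafe(async_fn(*args, **kwargs), self._loop)
            return fut.result()

        return wrapper


loop_manager = LoopManagerSketch()


async def async_get_data(uri: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real async download
    return f"local copy of {uri}"


get_data = loop_manager.synced(async_get_data)
print(get_data("s3://bucket/key"))  # works here; a spawned child has no such background loop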
🔥 1
f
Cc @thankful-minister-83577 fyi for v2? Shouldn’t need it
f
Awesome @echoing-account-76888, thanks for fixing it! 🎖️ @freezing-airport-6809 will v2 still support PyTorch Elastic (via decorator, same as in v1)?
f
Absolutely
We are working on it
You can contribute too
We already have Ray, Spark, and Dask
Do you have any improvement suggestions?
f
I was quietly hoping for a decorator for PyTorch that wouldn't require installing 3rd-party software in K8s besides Flyte itself 🤞
f
WDYM?
For distributed training?
f
Yes - right now, if you want to do distributed training on > 1 node, you have to install either the KF training operator or Ray on the cluster, which adds complexity. PyTorch is THE FRAMEWORK for neural networks right now, so having first-party support for it in Flyte itself would be 🎖️
f
@flat-waiter-82487 we have started working on native support
I don’t think it’s a lot of work, but we are working through how the experience should be
You are right; we concluded that in Flyte 2 we will ship OOB with JobSet-based multi-node training
💜 1
It will be available in Flyte 1 too
💜 1
f
Aw yeah! 🦜
f
will take some time
@flat-waiter-82487 what would you like to improve in it?
f
Functionally I would like the following things:
1. Ability to decorate a task to make it distributed (so at least 2 params: nnodes, nproc_per_node - same as now?)
2. Support for TorchElastic and fault tolerance during training with auto-recovery
3. No requirement to install 3rd-party extensions on the cluster besides Flyte and its CRDs (CRDs coming from Flyte itself would be fine)
4. Ability to specify nnodes/nproc_per_node dynamically (e.g. via a task param or something) - right now, as far as I know, it has to be "baked in" into the task definition = somehow hardcoded. This is a pain point during development, because in development we usually use small machines with 2-4 GPUs and then go full throttle on larger machines with 8 GPUs, and we have to re-configure the code manually instead of just swapping --nproc_per_node=8 or something (see the sketch after this list for a registration-time workaround).
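On point 4, a possible registration-time workaround (just a sketch, using hypothetical environment variable names and assuming the flytekitplugins-kfpytorch Elastic plugin) is to read the topology from the environment when the workflow is registered. The values still cannot change at run time, but at least they are not hardcoded in the source:

import os

from flytekit import task
from flytekit.types.file import FlyteFile
from flytekitplugins.kfpytorch import Elastic


# Hypothetical env var names; set them before registering/running the workflow.
@task(task_config=Elastic(
    nnodes=int(os.environ.get("TRAIN_NNODES", "1")),
    nproc_per_node=int(os.environ.get("TRAIN_NPROC_PER_NODE", "4")),
))
def train(dataset: FlyteFile):
    path = dataset.download()
    ...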