# ask-the-community
a
I have been having issues running the llama tutorials referenced in the documentation https://github.com/unionai-oss/llm-fine-tuning/issues/9
k
Cc @Niels Bantilan
n
Hi @Alex Beach I’m unable to repro this on my side:
Can you share more info on your environment?
pip list
would be useful
also, can you share the image being used in your execution?
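Even just the flyte-related bits would help, something like:
pip list | grep -iE 'flyte'    # flytekit / flytekitplugins versions
python --version               # interpreter used in the environment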
a
Screenshot 2024-02-01 at 11.57.32 AM.png
this is pip list
Screenshot 2024-02-01 at 12.16.26 PM.png
The images currently deployed to my cluster are 1.10.6: cr.flyte.org/flyteorg/flytepropeller-release:v1.10.6
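One way to confirm which propeller image a cluster is actually running (assuming a standard install, with the flytepropeller deployment in the flyte namespace):
kubectl -n flyte get deployment flytepropeller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'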
n
Can you download that image locally and do
docker run …
and take a look at
/root
to see if the
flyte_llama
directory is inside it?
I suspect the issue is with the built image
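Something along these lines should work (the image reference below is just a placeholder for whatever ImageSpec pushed to your registry):
docker pull <registry>/<repo>:<tag>
docker run --rm -it <registry>/<repo>:<tag> ls -la /root   # look for a flyte_llama directory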
a
I don't see any flyte_llama directories under /root
I only see a micromamba directory
I had a globally installed version of pyflyte, which was the latest version. Maybe that could be it? The version pinned in the flyte_llama repo is not the latest
that repo has everything pinned to 1.10.0, which is really old
n
The version pinning shouldn’t be the problem (feel free to try updating those deps, but I doubt it’ll fix the issue).
• How are you building/pushing the images? What registry are you using?
• Are you using the ImageSpec as specified in the workflow.py file, or are you modifying it?
a
I have not modified the code in the repo, aside from updating the registry value
I'm just running pyflyte --remote and letting it build and push the images
I'm using GCP's registry
I am using the ImageSpec and following all the steps in the README
k
this is really odd
a
OK, so I am fairly certain it was because my pyflyte version was higher than 1.10.0
I had globally installed the latest pyflyte version
After uninstalling it completely and using the binary in the virtual env, the task succeeded
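Roughly what that looked like (the requirements file name is an assumption; use whatever the repo pins its dependencies in):
pip uninstall -y flytekit           # drop the globally installed CLI
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt     # installs the repo's pinned flytekit (1.10.0)
which pyflyte                       # should now resolve inside .venv/bin
pip show flytekit                   # confirm the version the venv sees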
I do have a new issue though... The second task of the workflow is stuck in a Queued state...
So I think I found a bug with flytepropeller though. From the flytepropeller logs:
│ {"json":{"exec_id":"fb3bf107c6ddd4889a74","ns":"flytesnacks-development","res_ver":"8113362","routine":"worker-30","wf":"flytesnacks:development:flyte_llama.workflows.train_workflow"},"level":"error","msg":"Error when trying to reco │
│ ncile workflow. Error [failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Invalid] failed to create resource, caused by: Pod \"fb3bf107c6ddd4889a74 │
│ -n1-0\" is invalid: [spec.volumes[1].name: Invalid value: \"mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy\": must be no more than 63 characters, spec.containers[0].volumeMounts[1].name: Not  │
│ found: \"mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy\"]]. Error Type[*errors.NodeErrorWithCause]","ts":"2024-02-02T00:07:55Z"}                                                               │
│ E0202 00:07:55.870691       1 workers.go:103] error syncing 'flytesnacks-development/fb3bf107c6ddd4889a74': failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [c │
│ ontainer]: [Invalid] failed to create resource, caused by: Pod "fb3bf107c6ddd4889a74-n1-0" is invalid: [spec.volumes[1].name: Invalid value: "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy":  │
│ must be no more than 63 characters, spec.containers[0].volumeMounts[1].name: Not found: "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy"]
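The generated volume name there is longer than Kubernetes allows; a quick way to see it (the name is copied straight from the error above):
echo -n "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy" | wc -c   # prints a length well over 63
Kubernetes rejects Pod specs whose volume names exceed 63 characters, so the pod never gets created and the task stays queued.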
k
Ohh man, this seems like a weird regression. Did it show this in the UI?
Cc @Haytham Abuelfutuh / @Dan Rammer (hamersaw) / @Paul Dittamo. This should be bubbled up.
p
@Alex Beach thank you for pointing this out
@Dan Rammer (hamersaw) I'll handle this as part of this sprint