#ask-the-community

Dan Farrell

10/12/2023, 12:37 PM
When running flytectl demo, is there a way to increase the allowed total ~G~CPU/Memory per task? My mem="10Gi" requests are received by the Flyte server and silently truncated to "1Gi" -_-
OK, I'm pretty sure I need to change projectQuotaMemory somehow... Following https://docs.flyte.org/en/latest/deployment/configuration/general.html#cluster-resources, I run
flytectl get cluster-resource-attribute -p flytesnacks -d development
but I get:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x70 pc=0x147f37c]

goroutine 1 [running]:
github.com/flyteorg/flytectl/cmd/get.FetchAndUnDecorateMatchableAttr({0x257b478?, 0xc0000560a8?}, {0x7ffc1eb6a9db?, 0xc00099f998?}, {0x7ffc1eb6a9ea?, 0xc000ae5090?}, {0x0?, 0xc000afc701?}, {0x0, 0x0}, ...)
        /home/runner/work/flytectl/flytectl/cmd/get/matchable_attribute_util.go:32 +0xbc
github.com/flyteorg/flytectl/cmd/get.getClusterResourceAttributes({0x257b478, 0xc0000560a8}, {0xc000315d80, 0x0, 0x257b120?}, {0x0, {0x0, 0x0}, {0x0, 0x0}, ...})
        /home/runner/work/flytectl/flytectl/cmd/get/matchable_cluster_resource_attribute.go:78 +0x2a6
github.com/flyteorg/flytectl/cmd/core.generateCommandFunc.func1(0xc0009fbb80?, {0xc000315d80, 0x0, 0x4})
        /home/runner/work/flytectl/flytectl/cmd/core/cmd.go:70 +0x93d
github.com/spf13/cobra.(*Command).execute(0xc0009fbb80, {0xc000315d40, 0x4, 0x4})
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0009fb400)
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/flyteorg/flytectl/cmd.ExecuteCmd()
        /home/runner/work/flytectl/flytectl/cmd/root.go:137 +0x1e
main.main()
        /home/runner/work/flytectl/flytectl/main.go:12 +0x1d
trying to follow the instructions for updating the project quotas:
flytectl update cluster-resource-attribute --attrFile cra.yaml
cra.yaml:
attributes:
    projectQuotaCpu: "1000"
    projectQuotaMemory: 5Ti
domain: development
project: flytesnacks
yields:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1aeed82]

goroutine 1 [running]:
github.com/flyteorg/flytectl/cmd/update.DecorateAndUpdateMatchableAttr({0x257b478, 0xc0000560a8}, {0xc000a9cb00, 0xb}, {0xc000a9caf0, 0xb}, {0x0, 0x0}, {0x0, 0x0}, ...)
        /home/runner/work/flytectl/flytectl/cmd/update/matchable_attribute_util.go:37 +0x2e2
github.com/flyteorg/flytectl/cmd/update.updateClusterResourceAttributesFunc({0x257b478, 0xc0000560a8}, {0xc000a1fb40?, 0x100000049c5e5?, 0x257b120?}, {0x0, {0x0, 0x0}, {0x0, 0x0}, ...})
        /home/runner/work/flytectl/flytectl/cmd/update/matchable_cluster_resource_attribute.go:75 +0x206
github.com/flyteorg/flytectl/cmd/core.generateCommandFunc.func1(0xc00083b680?, {0xc000660760, 0x0, 0x2})
        /home/runner/work/flytectl/flytectl/cmd/core/cmd.go:70 +0x93d
github.com/spf13/cobra.(*Command).execute(0xc00083b680, {0xc000660740, 0x2, 0x2})
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0003b3b80)
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/flyteorg/flytectl/cmd.ExecuteCmd()
        /home/runner/work/flytectl/flytectl/cmd/root.go:137 +0x1e
main.main()
        /home/runner/work/flytectl/flytectl/main.go:12 +0x1d
ah.. getting closer:
$ flytectl update cluster-resource-attribute --attrFile cra.yaml --config ~/.flyte/config-sandbox.yaml
Updated attributes from flytesnacks project and domain development
Seems to work!! My jobs aren't OOMing!! Strangely, though, if I grab the pod definition from the k8s dashboard, I see the pod has these resources:
resources:
        limits:
          cpu: '2'
          memory: 1Gi
        requests:
          cpu: '2'
          memory: 1Gi
while my Task has defined:
requests=Resources(mem="6500Mi"))
So I feel like I'm not doing this correctly...
Edit: my task was still eventually OOMKilled...
Ah, finally I think I got it... In addition to the cra.yaml above, I needed to update the task-resource-attribute as well. tra.yaml:
defaults:
    cpu: "1"
    memory: 1Gi
limits:
    cpu: "1000"
    memory: 5Ti
project: flytesnacks
domain: development
cmd:
flytectl update task-resource-attribute --attrFile tra.yaml --config ~/.flyte/config-sandbox.yaml
now my pod resource correctly reflects my settings:
resources:
        limits:
          cpu: '1'
          memory: 6500Mi
        requests:
          cpu: '1'
          memory: 6500Mi
https://github.com/flyteorg/flytesnacks/pull/1183 added some of the above to the documentation.
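A minimal sketch of the behaviour seen in this thread, assuming that a memory request above the task-resource limit simply ends up at the limit. This is a toy model of what the pod specs showed, not Flyte's actual implementation, and the names parse_memory and effective_memory are hypothetical:

```python
# Binary (power-of-two) suffixes used by Kubernetes memory quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "6500Mi" or "1Gi" to a number of bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain number of bytes


def effective_memory(requested: str, limit: str) -> str:
    """Return the request, capped at the limit (illustrative model only)."""
    if parse_memory(requested) <= parse_memory(limit):
        return requested
    return limit


print(effective_memory("10Gi", "1Gi"))    # request above a 1Gi limit -> 1Gi
print(effective_memory("6500Mi", "5Ti"))  # request below the raised limit -> 6500Mi
```

Under this model, raising only the project quota in cra.yaml would not change the result, which matches what happened before the limits in tra.yaml were raised.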

Kevin Su

10/12/2023, 8:09 PM
flytectl demo can’t use GPU for now. cc @L godlike

Dan Farrell

10/12/2023, 8:29 PM
that was a typo @Kevin Su sorry, supposed to be CPU. Although, looking forward to GPU usage as well!

Kevin Su

10/12/2023, 11:44 PM
Yes, we are working on enabling GPUs in the sandbox

L godlike

10/13/2023, 12:21 AM
I am working on it! I will mention you when I have finished!

Ryuu

10/26/2023, 8:01 AM
Thanks for your help ❤️

L godlike

10/30/2023, 7:08 AM
@Ryuu Hi, my Ubuntu currently runs on the WSL kernel, which might cause unexpected behavior when using a GPU in the sandbox. If you are using Ubuntu, can you help me test this step by step? Here is a guide on how to run it: https://github.com/flyteorg/flyte/pull/3256#issuecomment-1784590139
@Dan Farrell If you can help, I would really appreciate it, too.

Dan Farrell

10/30/2023, 3:29 PM
@L godlike the command:
flytectl demo start --image futureoutlier/flyte-sandbox:gpu-v2 --disable-agent --force
does not seem to work on my ubuntu gpu workstation, it just exits with no logs.

L godlike

10/30/2023, 3:30 PM
Thanks a lot, maybe the dockerfile is not correct, I will try to create my own and mention you to try it
d

Dan Farrell

10/30/2023, 3:30 PM
docker run --rm -it futureoutlier/flyte-sandbox:gpu-v2
also gives no stdout and fails immediately
k3d entrypoint logs:
[2023-10-30T15:31:24+00:00] Running k3d entrypoints...
[2023-10-30T15:31:24+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-10-30T15:31:24+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
2023/10/30 15:31:24 failed to apply transformations: lookup host.docker.internal on 205.171.3.26:53: no such host

Ryuu

10/30/2023, 3:32 PM
Ok, i will try it now
d

Dan Farrell

10/30/2023, 3:35 PM
Those logs are from running via docker run, so maybe it is not expected to work that way. Also @L godlike, I must need more coffee. The command:
flytectl demo start --image futureoutlier/flyte-sandbox:gpu-v2 --disable-agent --force
does not exit with no logs; it gives me the following output:
{"status":"Status: Downloaded newer image for futureoutlier/flyte-sandbox:gpu-v2"}
🧑‍🏭 booting Flyte-sandbox container
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
Waiting for cluster to come up...
I meant to say that the flyte-sandbox container that it spawns fails and has no logs:
73a7896f3379   futureoutlier/flyte-sandbox:gpu-v2                                                                "/bin/k3d-entrypoint…"   17 minutes ago   Exited (1) 16 minutes ago             flyte-sandbox

L godlike

10/30/2023, 3:38 PM
OK, I will try to make a new Dockerfile with GPU support tomorrow, and if I run into a problem I can't solve, I will come here for help.
Thank you both, really appreciated 🙏🏻

Dan Farrell

10/30/2023, 4:16 PM
Maybe this is my problem:
$ docker run --gpus=all --rm -it --entrypoint /bin/bash futureoutlier/flyte-sandbox:gpu-v2
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown.
My cuda is 11.8
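A minimal sketch of the comparison behind that error, assuming plain dotted version numbers. It mirrors the cuda>=12.1 condition from the message above but is not how nvidia-container-cli implements the check; parse_version and satisfies_minimum are hypothetical names:

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "11.8" into a comparable tuple (11, 8)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(available: str, required_minimum: str) -> bool:
    """True when the available version is at least the required minimum."""
    return parse_version(available) >= parse_version(required_minimum)


# CUDA 11.8 on the host versus an image that requires cuda>=12.1.
print(satisfies_minimum("11.8", "12.1"))  # False
```

Comparing tuples of integers avoids the pitfalls of comparing version strings or floats directly.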

L godlike

10/31/2023, 3:06 AM
I will add instructions on how to specify the CUDA version. @Ryuu Can you share your test results?
@Dan Farrell Are you willing to follow this guide from scratch and build your own image? I can't test it because of my WSL kernel. If the guide doesn't work, I will try to produce another Docker GPU image. https://github.com/flyteorg/flyte/pull/3256#issuecomment-1784590139

Dan Farrell

10/31/2023, 3:14 AM
ok it's building now

L godlike

10/31/2023, 3:15 AM
Thanks a lot, if there's any problem, please let me know

Dan Farrell

10/31/2023, 4:34 AM
@L godlike with the locally built docker image I can run it with GPU access, but I still get the k3s "no such host" error. Is flyte-sandbox-bootstrap doing something wild here, or is this a problem with my env?
@L godlike finally getting somewhere....
root@153de33f2937:/# cat /var/log/k3d-entrypoints_231031053514.log
[2023-10-31T05:35:14+00:00] Running k3d entrypoints...
[2023-10-31T05:35:14+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-10-31T05:35:14+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
[2023-10-31T05:35:14+00:00] Running /bin/k3d-entrypoint-gpu-check.sh
/bin/k3d-entrypoint.sh: 14: /bin/k3d-entrypoint-gpu-check.sh: Permission denied
root@153de33f2937:/# chmod +x /bin/k3d-entrypoint-gpu-check.sh
this needs to be executable

L godlike

10/31/2023, 5:37 AM
Hi
Looking!
So there are 2 main questions to figure out now:
1. how flyte-sandbox-bootstrap works
2. how to make k3s get the host
I will study it today and reply to you.

Dan Farrell

10/31/2023, 5:42 AM
well
All you need to do is:
chmod +x ./docker/sandbox-bundled/bin/k3d-entrypoint-gpu-check.sh
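To catch this kind of problem before building the image, a small check along these lines could help. It is only a sketch: the glob pattern k3d-entrypoint-*.sh is an assumption based on the script names in the log above, and make_scripts_executable is a hypothetical helper, not part of the Flyte repository:

```python
import os
import stat
from pathlib import Path
from typing import List


def make_scripts_executable(directory: str, pattern: str = "k3d-entrypoint-*.sh") -> List[str]:
    """Add execute bits to matching scripts that lack them; return the names fixed."""
    fixed = []
    for script in sorted(Path(directory).glob(pattern)):
        mode = stat.S_IMODE(script.stat().st_mode)  # permission bits only
        if not mode & stat.S_IXUSR:
            # Roughly what `chmod +x` does: set execute for user, group and others.
            script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            fixed.append(script.name)
    return fixed


# Only touch the directory if it exists, e.g. when run from the repository root.
bin_dir = "docker/sandbox-bundled/bin"
if os.path.isdir(bin_dir):
    print(make_scripts_executable(bin_dir))
```

Note that a change made only on a local checkout does not fix the image for others; the executable bit would also need to be committed to the repository.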

L godlike

10/31/2023, 5:44 AM
So the image is correct and it works?
Can you allocate GPU to the node? Can you use the GPU in the task?
If possible, can you also post a screenshot in a comment on the GitHub PR?
you can use
kubectl describe node | grep -i gpu

Ryuu

10/31/2023, 5:45 AM
@L godlike sorry, I have some work of my own right now. I will respond as soon as possible.

L godlike

10/31/2023, 5:46 AM
It's OK, good luck with your work! I appreciate your help!
@Dan Farrell If possible, please add the screenshots under this PR description, thank you very much https://github.com/flyteorg/flyte/pull/3256