# flyte-support
w
quick question, do either of the Flyte Sandbox ( https://docs.flyte.org/en/latest/deployment/deployment/sandbox.html ) or the local demo cluster ( https://docs.flyte.org/en/latest/user_guide/environment_setup.html ) offer parallel task execution? E.g. suppose I have 2 GPUs and 16 cores, and I have jobs that each only require 1 GPU / 8 cores. Will either of these setups allow the jobs to run concurrently? Or does that require a real Kubernetes install, i.e. one where k8s orchestrates / queues the jobs?
a
@wooden-fish-36386 I think both terms refer to the same thing. In this case, it's a Kubernetes-in-Docker type of substrate for Flyte, and while it will let you access any of the underlying compute resources, it's not really designed to accommodate big workloads or multiple integrations.
Flyte supports parallel execution (either natively or via integrations) at different levels:
1. You can request GPU resources directly from the task decorator, and different tasks can have different resource requests (or none at all). This alone doesn't make your executions run in parallel, though, so
2. You can use ArrayNode to get concurrent executions that use a single node (single Pod) while still supporting most of the main Flyte features (see the sketch below).
From your question I understand this is a single machine? There are also examples that use PyTorch for this use case. Let me know if any of this answers your question
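For reference, a minimal sketch of what per-task resource requests plus mapping over inputs could look like in flytekit. The task/workflow names and the task body are illustrative, and the resource figures just mirror the question (1 GPU / 8 cores per job):
```python
from typing import List

from flytekit import Resources, map_task, task, workflow


# Illustrative task: each mapped execution asks for 1 GPU and 8 cores.
@task(requests=Resources(gpu="1", cpu="8"), limits=Resources(gpu="1", cpu="8"))
def train_one(seed: int) -> float:
    # ...single-job training code would go here...
    return float(seed)


@workflow
def train_all(seeds: List[int]) -> List[float]:
    # map_task fans the task out over the list; each element becomes its own
    # task execution, so two seeds means two executions that can run
    # concurrently if the cluster has room for both.
    return map_task(train_one)(seed=seeds)
```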
w
@average-finland-92144 wow, thanks for the pointer to ArrayNode!! This is definitely a bit heavier / more complicated a plug-in than I expected. Yeah, this is also an interesting set of examples: https://github.com/flyteorg/flytesnacks/tree/master/examples/advanced_composition
So suppose I take the single-node multi-GPU example ( https://docs.flyte.org/en/latest/flytesnacks/examples/mnist_classifier/pytorch_single_node_multi_gpu.html#single-node-multi-gpu-training ), put a `Resources(gpu=1)` request / limit in the task, and then run that in the sandbox with `map_task()` on a list of two inputs (i.e. two jobs). So Flyte will run the two tasks concurrently, even in the sandbox? And I guess since the tasks are effectively running in the same container (?) they are implicitly allocated different GPUs, but the actual job code can probably see both GPU IDs? (Unlike in, say, k8s, where the tasks would run in separate containers and not see each other's GPUs.) Do I have this conceptually right? I'll try to dig a little deeper into the `map_task` implementation to understand its limitations. If I'm right, it's really cool that Flyte can do some amount of parallelism without k8s!
a
> So Flyte will run the two tasks concurrently, even in the sandbox?
I'm fairly sure yes: compute resources provided, the sandbox can execute all Flyte features, including map tasks.
> And I guess since the tasks are effectively running in the same container (?)
This is true when using `map_task`
> they are implicitly allocated different GPUs
I'm not sure I follow. GPUs are surfaced to K8s as extended resources via a device plugin, which advertises the GPU accelerators and makes them available for Pods to use. You can use taints/tolerations to surface different devices to different Pods, but that's typically for multi-node K8s clusters. Flyte lets you request specific NVIDIA devices or even partitions, but even that plays with taints/tolerations and nodeSelectors to tweak K8s scheduling, as in the sketch below.
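A hedged sketch of what requesting a specific device looks like; it assumes a recent flytekit that exposes the `accelerator` task argument (via `flytekit.extras.accelerators`) and a cluster whose nodes actually carry and advertise that device:
```python
from flytekit import Resources, task
from flytekit.extras.accelerators import T4


# Illustrative only: Flyte translates the accelerator request into node
# selectors / tolerations, so K8s scheduling still decides which node
# (and therefore which physical GPU) the Pod lands on.
@task(requests=Resources(gpu="1"), limits=Resources(gpu="1"), accelerator=T4)
def train_on_t4() -> None:
    ...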
> but the actual job code can probably see both GPU IDs
With default K8s scheduling mechanisms, this should be true. The sandbox is still K8s; it's just that the runtime isn't a VM or server but a Docker container. I hope some of this makes sense for you
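If you want to verify what a task actually sees, a quick sketch like the following could help; the task name is made up and it assumes torch is present in the task image:
```python
import os

import torch  # assumes torch is available in the task image
from flytekit import Resources, task


@task(requests=Resources(gpu="1"), limits=Resources(gpu="1"))
def check_gpus() -> str:
    # CUDA_VISIBLE_DEVICES reflects what the device plugin handed to this
    # container; torch.cuda.device_count() shows what the job code can enumerate.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    return f"CUDA_VISIBLE_DEVICES={visible}, torch sees {torch.cuda.device_count()} device(s)"
```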
w
@average-finland-92144 ahh, I had missed that the sandbox actually contains k3s. I poked around a bit to confirm: https://github.com/flyteorg/flyte/tree/master/docker So the sandbox is actually pretty high fidelity! And your comments here make a ton of sense now. Yeah, the start-up scripts and stuff in here are pretty nice too! A lot of batteries included, which I know can be too much for some people, but it's a very nice reference implementation. Thank you, this totally clears things up for me, and the extra comments and links really help too! 🎉
a
cool, if you have any other questions please let us know!