Arthur Lindoulsi
03/30/2023, 8:39 AM
if training_args.force_a100_gpus:
    return train(input_args).with_overrides(
        pod_template=PodTemplate(
            pod_spec=V1PodSpec(affinity=V1Affinity(
                node_affinity=V1NodeAffinity(
                    required_during_scheduling_ignored_during_execution=V1NodeSelector(
                        node_selector_terms=[
                            V1NodeSelectorTerm(match_expressions=[
                                V1NodeSelectorRequirement(
                                    key="cloud.google.com/gke-accelerator", operator="In",
                                    values=["nvidia-tesla-a100"])])]
                    ))),
                restart_policy='never',
                containers=[V1Container(name='primary',
                    image='{{.image.imagename.fqn}}:{{.image.imagename.version}}',
                    resources=V1ResourceRequirements(limits={"nvidia.com/gpu": '1'},
                        requests={"memory": "...",
                                  "cpu": "..."})
                )])
        ),
        container_image=None,
        requests=None,
    )
Container name for the train task was "<executionID>-<workflowID>-0-dn2-0"
Eli Bixby
03/30/2023, 10:13 AM
container_image argument doesn't match the image for the primary container. I think this might be a problem? Ketan (kumare3)
Felix Ruess
03/30/2023, 1:25 PM
Arthur Lindoulsi
03/30/2023, 1:52 PM
Felix Ruess
03/30/2023, 2:21 PM
containers=[]
Dan Rammer (hamersaw)
03/30/2023, 2:50 PM
pod_template argument in the @task decorator is applied statically, meaning it is built into the task definition. So I don't suspect it will work with with_overrides.
Arthur Lindoulsi
03/30/2023, 2:51 PM
Dan Rammer (hamersaw)
03/30/2023, 2:53 PM
def gpu_training(...):
    # omitted

@task(pod_template=PodTemplate(
    # omitted
))
def train_on_a100():
    gpu_training()

@task()
def train_on_normal_gpu():
    gpu_training()

@workflow
def wf(force_a100_gpus: bool):
    (
        conditional("gpu")
        .if_(force_a100_gpus.is_true())
        .then(train_on_a100())
        .else_()
        .then(train_on_normal_gpu())
    )
Arthur Lindoulsi
03/30/2023, 3:09 PM