Hi everyone how can i add GPU to this locals { g...
# ask-the-community
e
Hi everyone how can i add GPU to this
locals {
  gke_subnetwork          = module.network.subnets_names[0]
  gke_pods_range_name     = module.network.subnets_secondary_ranges[0][0].range_name
  gke_services_range_name = module.network.subnets_secondary_ranges[0][1].range_name
}

module "gke" {
  source                   = "terraform-google-modules/kubernetes-engine/google"
  project_id               = local.project_id
  region                   = local.region
  name                     = local.name_prefix
  regional                 = true
  release_channel          = "STABLE"
  network                  = module.network.network_name
  subnetwork               = local.gke_subnetwork
  ip_range_pods            = local.gke_pods_range_name
  ip_range_services        = local.gke_services_range_name
  create_service_account   = true
  identity_namespace       = "enabled"
  remove_default_node_pool = true

  node_pools = [
    {
      name         = "default"
      machine_type = "n1-standard-8"
      min_count    = 0
      max_count    = 3
      # Set to true if you want to enable Image Streaming. Learn more: https://cloud.google.com/kubernetes-engine/docs/how-to/image-streaming to speed up pulling of images
      enable_gcfs = false
    }
  ]

  depends_on = [google_project_service.project]
}

output gke_cluster_name {
  value = module.gke.name
}
e
Hi
Thanks let me try it
c
👍 Specifically, accelerator_count and accelerator_type will allow you to set GPUs. The types you can find here: https://cloud.google.com/compute/docs/gpus#nvidia_gpus_for_compute_workloads
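For illustration, here is a minimal sketch of how those two fields might be added to the "default" pool from the question. The GPU type "nvidia-tesla-t4", the count of 1, and the local name gke_node_pools are assumptions, not values from this thread; check the linked list for types available in your region and compatible with your machine type.

```hcl
# Sketch only: a GPU-enabled entry for the module's node_pools list.
locals {
  gke_node_pools = [
    {
      name         = "default"
      machine_type = "n1-standard-8"
      min_count    = 0
      max_count    = 3
      enable_gcfs  = false

      # Assumed values: pick a GPU type offered in your region/zones.
      accelerator_count = 1
      accelerator_type  = "nvidia-tesla-t4"
    }
  ]
}

# In the module "gke" block this would then be wired in as:
#   node_pools = local.gke_node_pools
```

Keeping the pool definition in a local is only a convenience here; the same map can be written inline in node_pools.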
e
Thanks, this helped me, but I have one last query: when I push my workflow to remote I get this: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
c
What's the GKE/k8s version of the cluster?
e
1.27.8-gke.1067004
c
You're probably missing the drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
Seems like there should be a setting that you could add to terraform for automatic driver installation
e
Yeah, I have, and it's working fine, thanks 👍
c
I think if you add gpu_driver_version = "LATEST" to the terraform definition of your gpu node pool, then the driver will be installed automatically next time
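As a rough sketch of that suggestion, a dedicated GPU pool entry could look like the following. The pool name "gpu", the local name gpu_node_pool, and the accelerator values are hypothetical, and whether gpu_driver_version is honored depends on the module version in use.

```hcl
# Sketch only: one node pool map with automatic driver installation requested.
locals {
  gpu_node_pool = {
    name         = "gpu" # hypothetical pool name
    machine_type = "n1-standard-8"
    min_count    = 0
    max_count    = 3

    # Assumed accelerator values, not taken from the thread.
    accelerator_count = 1
    accelerator_type  = "nvidia-tesla-t4"

    # The setting suggested above: ask GKE to install the NVIDIA driver.
    gpu_driver_version = "LATEST"
  }
}

# It could then be added to the module's list, for example:
#   node_pools = [local.gpu_node_pool]
```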
e
Hi @Cornelis Boon, I get this error:
[1/1] currentAttempt done. Last Error: USER::
[f2975714e0c1349cf849-n0-0] terminated with exit code (1). Reason [Error]. Message:
│ ❱  760 │   │   │   │   return __callback(*args, **kwargs)                    │
│                                                                              │
│ /usr/local/lib/python3.8/site-packages/flytekit/bin/entrypoint.py:508 in     │
│ fast_execute_task_cmd                                                        │
│                                                                              │
│ ❱ 508 │   subprocess.run(cmd, check=True)                                    │
│                                                                              │
│ /usr/local/lib/python3.8/subprocess.py:516 in run                            │
│                                                                              │
│ ❱  516 │   │   │   raise CalledProcessError(retcode, process.args,           │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['pyflyte-execute', '--inputs',
'gs://flyte-gcp-data-174222667125/metadata/propeller/flytesnacks-development-f29
75714e0c1349cf849/n0/data/inputs.pb', '--output-prefix',
'gs://flyte-gcp-data-174222667125/metadata/propeller/flytesnacks-development-f29
75714e0c1349cf849/n0/data/0', '--raw-output-data-prefix',
'gs://flyte-gcp-data-174222667125/0r/f2975714e0c1349cf849-n0-0',
'--checkpoint-path',
'gs://flyte-gcp-data-174222667125/0r/f2975714e0c1349cf849-n0-0/_flytecheckpoints
', '--prev-checkpoint', '""', '--dynamic-addl-distro',
'gs://flyte-gcp-data-174222667125/flytesnacks/development/VTSSZ7BWFXUL52RFOVN2A2
2UCA======/script_mode.tar.gz', '--dynamic-dest-dir', '.', '--resolver',
'flytekit.core.python_auto_container.default_task_resolver', '--',
'task-module', 'experiments.workflows.workflow', 'task-name',
'preprocess_data']' returned non-zero exit status 1.
when I run this command:
poetry run pyflyte run --remote -d development experiments/workflows/workflow.py train_workflow
for a poetry project with this code structure:
├── README.md
├── config.yaml
├── experiments
│   ├── __init__.py
│   ├── configs
│   │   ├── __init__.py
│   │   ├── config.py
│   │   └── training.yaml
│   ├── utils
│   │   ├── __init__.py
│   │   ├── loading.py
│   │   └── training.py
│   └── workflows
│       ├── __init__.py
│       └── workflow.py
├── gcp-artifact-reader.key
├── gcp-artifact-writer.key
├── pyproject.toml
├── tests
│   ├── __init__.py
│   └── test_workflow.py
└── tox.ini
c
No idea, sorry
e
Thanks