# slurm-flyte-wg
d
[need help] I'm trying to set up a GPU Slurm cluster. These are the last two lines of my `/etc/slurm/slurm.conf`:
NodeName=localhost Gres=gpu:1 CPUs=4 RealMemory=15006 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
This is my `/etc/slurm/gres.conf`:
AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla  File=/dev/nvidia0 COREs=0
After changing the config, I restarted my Slurm cluster and ran `slurmd -C`, but it doesn't show that I have a GPU. CC @rich-application-44533 @red-school-96573 @fierce-oil-47448
r
Hi @damp-lion-88352, can you add the following line to your `slurm.conf`:
GresTypes=gpu
See https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes
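For illustration, a sketch of how the tail of the `slurm.conf` from the first message might look with that line added. The two existing lines are copied unchanged from the original post. Placing `GresTypes` directly above the node definition is only a convention assumed here.

```
# GRES types this cluster manages
GresTypes=gpu
NodeName=localhost Gres=gpu:1 CPUs=4 RealMemory=15006 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```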
d
Yes, I already did that.
r
On our GPU cluster, the GRES info is not listed by `slurmd -C`. What do you see when you run `slurmd -G`?
d
(base) ubuntu@ip-10-0-0-4:~$ sudo slurmd -G
slurmd: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults
slurmd: A line in gres.conf for GRES gpu:tesla has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
slurmd: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_t4`. Setting system GRES type to NULL
slurmd: error: This GPU specified in [slurm|gres].conf has mismatching Cores or Links from the device found on the system. Ignoring it.
slurmd: error: [slurm|gres].conf:
slurmd: error:     GRES[gpu] Type:(null) Count:1 Cores(4):0  Links:(null) Flags:HAS_FILE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:/dev/nvidia0 UniqueId:(null)
slurmd: error: system:
slurmd: error:     GRES[gpu] Type:(null) Count:1 Cores(4):0-1  Links:-1 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia0 UniqueId:(null)
slurmd: The following autodetected GPUs are being ignored:
slurmd:     GRES[gpu] Type:(null) Count:1 Cores(4):0-1  Links:-1 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia0 UniqueId:(null)
r
I believe the issue is `COREs=0` in `gres.conf`. Could you please remove it?
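For reference, the `slurmd -G` output above reports the configured record with `Cores(4):0` and the autodetected device with `Cores(4):0-1`, which is the mismatch the error complains about. Below is a sketch of the `gres.conf` from the first message with only the `COREs=0` field dropped. Everything else is copied unchanged from the original post. Whether `AutoDetect=nvml` alone would also be enough is not tested in this thread.

```
AutoDetect=nvml
# No Cores= field, so the core affinity reported by NVML is used as detected
NodeName=localhost Name=gpu Type=tesla File=/dev/nvidia0
```

After editing the file, restart the Slurm daemons and rerun `sudo slurmd -G` to confirm the GPU is no longer listed as ignored.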
d
!!!!!!!!!!!!!!!!
IT WORKS
THANK YOU SO MUCH!!!
r
You're welcome. I'm glad to help.