# slurm-flyte-wg
c
Hello everyone, We're happy to share that the Slurm agent v1 (with `PythonFunctionTask`) has been implemented. It supports the following three core methods:
1. `create`: Use `srun` to run a Slurm job which executes the Flyte entrypoints, `pyflyte-fast-execute` and `pyflyte-execute`
2. `get`: Use `scontrol show job <job_id>` to monitor the Slurm job state
3. `delete`: Use `scancel <job_id>` to cancel the Slurm job (this method is still under test)
We set up an environment to test it locally without running the agent gRPC server. The setup is divided into three components: a client (localhost), a tiny remote Slurm cluster, and an Amazon S3 bucket that facilitates communication between the two. The attached figure illustrates the interaction between the client and the remote Slurm cluster.
This guide explains how to set up a local test environment, including all three components mentioned above.
overview_v2.png
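For readers following along, the three methods above can be sketched as plain command builders. This is only an illustration, not the agent's actual code: the helper names (`build_srun_command`, `build_get_command`, `build_delete_command`, `parse_job_state`) are hypothetical, and it assumes options are rendered as `--key=value` flags and that the job state is read from the `JobState=` field printed by `scontrol show job`.

```python
import re
import shlex
from typing import Dict, Optional, Sequence


def build_srun_command(srun_conf: Dict[str, str], entrypoint: Sequence[str]) -> str:
    """create: render srun options as --key=value flags, then the entrypoint."""
    flags = [f"--{key}={shlex.quote(str(value))}" for key, value in srun_conf.items()]
    return " ".join(["srun", *flags, *(shlex.quote(arg) for arg in entrypoint)])


def build_get_command(job_id: int) -> str:
    """get: command used to inspect the job."""
    return f"scontrol show job {int(job_id)}"


def build_delete_command(job_id: int) -> str:
    """delete: command used to cancel the job."""
    return f"scancel {int(job_id)}"


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the JobState field from `scontrol show job` output, if present."""
    match = re.search(r"\bJobState=(\w+)", scontrol_output)
    return match.group(1) if match else None
```

The resulting strings would then be sent to the cluster over whatever remote channel the agent uses; that part is left out here on purpose.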
f
@creamy-shampoo-53278 great job! But @damp-lion-88352, we also wanted a simple agent that does not do pyflyte execute and is more like a script runner?
Assuming all code is on Slurm
@creamy-shampoo-53278 how do you configure the remote?
c
Hi Ketan, For faster development and testing on a remote Slurm cluster, I set up the controller and compute node on the same machine, following the Slurm-101 instructions. As for the configuration, I used the official Slurm configurator to generate the config file!
If I understand it correctly, for the script runner, we could further support `sbatch` combined with the user-defined script (e.g., setting env vars, loading modules, nested srun, etc.).
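A rough sketch of what the script-runner idea could look like: the user's script is prefixed with `#SBATCH` directives and handed to `sbatch`. The helper name `render_sbatch_script` and the `sbatch_conf` dictionary are assumptions made for illustration, not part of the agent.

```python
from typing import Dict


def render_sbatch_script(sbatch_conf: Dict[str, str], script_body: str) -> str:
    """Prefix a user-defined shell script with #SBATCH directives."""
    lines = ["#!/bin/bash"]
    # Each option becomes one directive line, e.g. "#SBATCH --partition=debug".
    lines += [f"#SBATCH --{key}={value}" for key, value in sbatch_conf.items()]
    lines.append("")
    # The user script is kept verbatim (env vars, module loads, nested srun, ...).
    lines.append(script_body.strip())
    return "\n".join(lines) + "\n"


user_script = """
export OMP_NUM_THREADS=4
srun hostname
"""

print(render_sbatch_script({"partition": "debug", "job-name": "demo-slurm"}, user_script))
```

The rendered text could be written to a file on the cluster and submitted with `sbatch <file>`.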
f
When I say remote, can the user add the remote Slurm host in the config?
c
I see. Currently, I hard-code the SSH config in `agent.py`. I’ll add another `ssh_config` field in `task_config` to let users set SSH-related info!
f
That’s what we want to avoid
We should end up sending that in
c
@task(
    task_config=Slurm(
        ssh_conf={
            "host": "<ssh_host>",
            "port": "<ssh_port>",
            "username": "<ssh_username>",
            "password": "<ssh_password>",
        },
        srun_conf={
            "partition": "debug",
            "job-name": "demo-slurm",
            # Remote working directory
            "chdir": "<your_remote_working_dir>"
        }
    )
)
def plus_one(x: int) -> int:
    return x + 1
It's now possible to define a task by passing in `ssh_conf`!
Would it be better to use Secrets in this case?
d
yes
we should use secret
coming!
finished my stuff
let me take a look at the code first
and will talk to you later
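On the Secrets suggestion above, a minimal stand-in sketch of the idea, using an environment variable in place of Flyte's secret mechanism: the task config carries only the name of the secret, and the password is resolved at run time. The `password_env` key, the `SLURM_SSH_PASSWORD` variable, and the `resolve_ssh_conf` helper are all hypothetical.

```python
import os
from typing import Dict, Mapping


def resolve_ssh_conf(ssh_conf: Dict[str, str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return a copy of ssh_conf with the password looked up from `env`.

    The config stores only the *name* of the secret under the hypothetical
    "password_env" key, so the plaintext never appears in the task definition.
    """
    resolved = dict(ssh_conf)
    secret_name = resolved.pop("password_env", None)
    if secret_name is not None:
        if secret_name not in env:
            raise KeyError(f"secret {secret_name!r} is not available")
        resolved["password"] = env[secret_name]
    return resolved


# Example with an explicit mapping standing in for the real environment.
conf = resolve_ssh_conf(
    {"host": "<ssh_host>", "username": "<ssh_username>", "password_env": "SLURM_SSH_PASSWORD"},
    env={"SLURM_SSH_PASSWORD": "example-only"},
)
```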