Slurm

Terminology

Tasks

A task is a single process that can be run either alone or in parallel with other identical tasks. Depending on the program or script that the task runs, running tasks in parallel can be faster than running them one after another.
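The difference between running identical tasks one after another and running them in parallel can be sketched in plain bash, without Slurm. This is only an illustration: the task function is a hypothetical stand-in for a real program, and background processes stand in for the tasks that Slurm would launch.

```shell
#!/bin/bash
# Hypothetical stand-in for a real task: each call does one second of "work".
task() {
    sleep 1
    echo "task $1 done"
}

# One after the other: four tasks take about four seconds.
SECONDS=0
sequential_output=$(for i in 1 2 3 4; do task "$i"; done)
echo "sequential run took about ${SECONDS}s"

# In parallel: start all four in the background, then wait for them all.
SECONDS=0
parallel_output=$(for i in 1 2 3 4; do task "$i" & done; wait)
echo "parallel run took about ${SECONDS}s"
```

The parallel run finishes sooner here only because the tasks are independent of each other; a program whose steps depend on one another may not benefit in the same way.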

Partitions

Partitions are an organizational structure within Slurm that allows nodes to be grouped together with certain options and restrictions placed on them. These are the partitions on Buddy:

Tip

This information can be found by running the sinfo command in the terminal.

Name            Time limit   Description
-------------   ----------   -----------
general         5 days       Used for most jobs
long            30 days      Used for jobs that are expected to run for a long time
high-mem        5 days       Used for jobs that are expected to have high memory usage
high-mem-long   30 days      Used for long jobs that are expected to have high memory usage
gpu             5 days       Used for GPU jobs
gpu-long        30 days      Used for GPU jobs that are expected to run for a long time
testing         2 days       Reserved for our internal testing
gpu-test        2 days       Reserved for our internal testing

We recommend that you use the partition that is most appropriate to your application.
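As a rough guide to that choice, the partition names and time limits listed above can be turned into a small lookup. The choose_partition function below is a hypothetical helper written for this page, not part of Slurm; it assumes you know the kind of job (general, high-mem, or gpu) and its expected run time in whole days.

```shell
#!/bin/bash
# choose_partition KIND DAYS
#   KIND: general, high-mem, or gpu
#   DAYS: expected run time in whole days
# Hypothetical helper; names and limits are copied from the partition list.
choose_partition() {
    local kind=$1 days=$2
    local short long
    case "$kind" in
        general)  short=general;  long=long ;;
        high-mem) short=high-mem; long=high-mem-long ;;
        gpu)      short=gpu;      long=gpu-long ;;
        *) echo "unknown job kind: $kind" >&2; return 1 ;;
    esac
    if [ "$days" -le 5 ]; then
        echo "$short"            # fits within the 5-day partitions
    elif [ "$days" -le 30 ]; then
        echo "$long"             # needs one of the 30-day partitions
    else
        echo "no partition allows $days days" >&2
        return 1
    fi
}

choose_partition general 2     # prints: general
choose_partition gpu 12        # prints: gpu-long
```

The testing and gpu-test partitions are left out on purpose, since they are reserved for internal testing.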

Cores

Each compute node has 16 cores. When setting up scripts with sbatch (discussed below), the product of --ntasks-per-node and --cpus-per-task should not exceed 16.
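The rule can be checked with a line of shell arithmetic before submitting a job. The fits_on_node function below is a hypothetical helper for illustration; its two arguments are the values you intend to pass to --ntasks-per-node and --cpus-per-task.

```shell
#!/bin/bash
CORES_PER_NODE=16   # cores per compute node, as stated above

# fits_on_node TASKS_PER_NODE CPUS_PER_TASK
# Succeeds when the requested tasks and CPUs fit on a single node.
fits_on_node() {
    local tasks_per_node=$1 cpus_per_task=$2
    [ $(( tasks_per_node * cpus_per_task )) -le "$CORES_PER_NODE" ]
}

if fits_on_node 4 4; then
    echo "4 tasks x 4 CPUs = 16 cores: fits on one node"
fi
if ! fits_on_node 8 4; then
    echo "8 tasks x 4 CPUs = 32 cores: too many for one node"
fi
```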

Commands

Name     Description
------   -----------
sbatch   Allocates resources and runs the given script using Slurm
srun     Runs a command as a parallel task, either within an sbatch script or in an interactive session
smap     Displays the jobs currently running on the cluster
sinfo    Displays node status (idle, down, allocated, …) and partition information

Sbatch Parameters

Sbatch parameters control how jobs are submitted and run on Buddy.

Common sbatch parameters

Name                  Environment variables   Default                                         Description
-------------------   ---------------------   ---------------------------------------------   -----------
-J, --job-name        SLURM_JOB_NAME          script name or "sbatch"                         The name of your job
-o, --output          N/A                     "slurm-%j.out"                                  File to which the program's standard output is written
-e, --error           N/A                     "slurm-%j.out"                                  File to which the program's standard error is written
-n, --ntasks          N/A                     1 unless --cpus-per-task is set                 The maximum number of tasks sbatch should allocate resources for
-N, --nodes           SLURM_JOB_NUM_NODES     enough nodes to satisfy the -n and -c options   The number of nodes to allocate. A minimum and maximum can also be set, for example --nodes=10-12
-c, --cpus-per-task   SLURM_CPUS_PER_TASK     one processor per task                          The number of CPUs to allocate for each task
--ntasks-per-node     SLURM_TASKS_PER_NODE    ?                                               The number of tasks to allocate on each node
-p, --partition       SBATCH_PARTITION        general                                         The partition to run the job in
-t, --time            SBATCH_TIMELIMIT        max time for partition                          The maximum amount of time the job is allowed to run
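A minimal sketch of how these parameters might be combined in a batch script follows. The file name example_job.sh, the job name, and the requested resources are all made-up values for illustration. The snippet writes the script to disk and submits it only if the sbatch command is available on the current machine.

```shell
#!/bin/bash
# Write an illustrative batch script. sbatch reads the "#SBATCH" comment
# lines at submission time; bash itself treats them as ordinary comments.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example-%j.out
#SBATCH --error=example-%j.err
#SBATCH --partition=general
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

# Run one copy of hostname per task (2 nodes x 4 tasks = 8 copies).
srun hostname
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch example_job.sh
else
    echo "sbatch not found; example_job.sh was written but not submitted"
fi
```

With 4 tasks per node and 4 CPUs per task, the request uses exactly the 16 cores of each node. The %j in the output and error file names is replaced by the job ID, as in the default "slurm-%j.out".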

Interactive Jobs with salloc

An interactive job is a job that returns a command line prompt when it runs, instead of running a script. Interactive jobs are useful for debugging or interacting with an application, and for developing Slurm batch scripts that will later run non-interactively. The salloc command is used to submit an interactive job to Slurm. When the job starts, a command line prompt appears on one of the compute nodes assigned to the job. From here, commands can be executed using the resources allocated on the local node.

The following example job is assigned 2 nodes with 4 CPUs and 4GB of memory each (4 tasks per node × 1 CPU per task × 1GB per CPU):

[rmaher@ssh1 ~]$ salloc --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1
salloc: Granted job allocation 5382
[rmaher@node-106 ~]$ srun hostname
node-106.hpc.uco.edu
node-106.hpc.uco.edu
node-106.hpc.uco.edu
node-106.hpc.uco.edu
node-107.hpc.uco.edu
node-107.hpc.uco.edu
node-107.hpc.uco.edu
node-107.hpc.uco.edu

In the above example, srun is used within the job, from the first compute node, to run a command once for every task in the job on the assigned resources. srun can also be used to run a command on only a subset of the resources assigned to the job.
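Running on a subset might look like the sketch below, which asks srun for 2 tasks on 1 node rather than the full allocation. The task and node counts are arbitrary choices for illustration. The check on SLURM_JOB_ID, which Slurm sets inside a job, keeps the snippet harmless when it is run outside of an allocation.

```shell
#!/bin/bash
# Run hostname as 2 tasks on 1 node instead of on the whole allocation.
subset_cmd=(srun --nodes=1 --ntasks=2 hostname)

if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Inside a job allocation: run the command on the requested subset.
    "${subset_cmd[@]}"
else
    echo "not inside a Slurm job; would run: ${subset_cmd[*]}"
fi
```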