Run Calculations

To run calculations on the cluster, users need to have

  • the code they plan to run

  • any input it requires

  • a way to submit the code using a batch scheduler

Managing Jobs

HPC uses SLURM to manage the jobs that users submit to the various queues on the system. Each queue (also called a partition) represents a group of resources with attributes suited to that queue's jobs. You can see the list of queues on HPC by typing sinfo; stdmemq is the default partition/queue.
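For example, a job submitted without an explicit partition runs in the default stdmemq queue; another queue can be selected with sbatch's standard -p/--partition option (the script name below is a placeholder, and job submission is covered in more detail later on this page):

$user@host[~]:   sbatch your_script.slurm                        # goes to the default stdmemq queue
$user@host[~]:   sbatch --partition=bigmemq your_script.slurm    # goes to the bigmemq queue instead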

Common Commands

The table below gives a short description of the most commonly used SLURM commands.

Command    Description
sbatch     Submit a batch job script to the scheduler
squeue     List jobs currently in the queues
scancel    Delete (cancel) a job
sinfo      Show the partitions/queues and the status of their nodes
sacct      Show accounting details for past and current jobs

Note: Do not run computations on the login nodes. Any jobs run directly on those nodes will be terminated without notice.

Listing jobs

To list all jobs:

$user@host[~]:   squeue
 JOBID  PARTITION  NAME      USER   STATE    TIME        TIME_LIMI   CPUS  NODES  NODELIST(REASON)
4340   gpuq       testjob1  user1  RUNNING  2-03:06:55  4-00:00:00  2     1      gpu1
4349   stdmemq    testjob2  user2  RUNNING  1:36:09     2-00:00:00  2     1      compute1
4347   bigmemq    testjob3  user2  RUNNING  18:34:07    2-00:00:00  40    1      bigmem1

To list your jobs:

$user@host[~]:   squeue -u $USER

To obtain the status of a particular job, run the following command with the job's ID number (provided at the time of submission).

$user@host[~]:   squeue -j JOB_ID

You can also use checkjob JOB_ID to show the current status of a job.
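If you have many jobs in the queues, squeue's standard -t/--states option can filter the listing by job state, for example:

$user@host[~]:   squeue -u $USER -t RUNNING     # only your running jobs
$user@host[~]:   squeue -u $USER -t PENDING     # only your jobs still waiting to start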

Submitting a job

To submit a job, use the sbatch command followed by the name of your submission file. A job ID will be returned; you may want to make a note of it for later use.

$user@host[~]:   sbatch your_script.slurm
Submitted batch job 4359
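
The submission file itself is a shell script whose #SBATCH directives request resources. The following is a minimal sketch, assuming the default stdmemq partition described above; the job name, resource values, and program/input names (my_program, input.dat) are placeholders to adapt to your own calculation:

#!/bin/bash
#SBATCH --job-name=testjob        # name shown in the squeue listing
#SBATCH --partition=stdmemq       # queue/partition to run in (stdmemq is the default)
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks=2                # number of tasks/cores
#SBATCH --time=1-00:00:00         # walltime limit; must fit within the partition's limit

cd $SLURM_SUBMIT_DIR              # run from the directory the job was submitted from
./my_program input.dat > output.log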

Deleting a job

Note: Deleting a job cannot be undone. Double-check the job ID before deleting a job.

Users can delete their jobs by typing the following command.

$user@host[~]:   scancel JOB_ID

To delete all the jobs of a user:

$user@host[~]:   scancel -u $USER
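
scancel can also target jobs by attributes rather than a single ID; -t/--state and -n/--name are standard scancel options (testjob1 below is a placeholder job name):

$user@host[~]:   scancel -u $USER -t PENDING     # delete only your pending jobs
$user@host[~]:   scancel -n testjob1             # delete your jobs with a given name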

Overview of resources

The sinfo command gives an overview of the resources in each partition/queue and their current status. Use it to decide how to structure your jobs and which partition to submit them to.

$user@host[~]:   sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
stdmemq*        up 2-00:00:00      1    mix bigmem1
stdmemq*        up 2-00:00:00     10    mix compute[1-8],gpu[1-2]
stdmemq-long    up 4-00:00:00      1    mix bigmem1
bigmemq         up 2-00:00:00      1    mix bigmem1
gpuq            up 4-00:00:00      1    mix gpuv1001
gpuq            up 4-00:00:00      1   idle gpuv1002
debugq          up    2:00:00      1    mix gpuv1001
debugq          up    2:00:00      3   idle gpu[1-2],gpuv1002
scavengeq       up 1-00:00:00      2    mix bigmem0,gpuv1001

You can format that output in a more concise form; the NODES(A/I/O/T) column counts allocated, idle, other, and total nodes:

$user@host[~]:   sinfo -o "%20P %5a %.10l %16F"
PARTITION            AVAIL  TIMELIMIT NODES(A/I/O/T)
stdmemq*             up    2-00:00:00 1/10/0/11
stdmemq-long         up    4-00:00:00 1/2/0/3
bigmemq              up    2-00:00:00 1/0/0/1
gpuq                 up    4-00:00:00 1/1/0/2
debugq               up       2:00:00 1/3/0/4
scavengeq            up    1-00:00:00 2/11/0/13
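
sinfo can also report per-node hardware, which helps when sizing resource requests; %N, %c, %m, and %G are standard sinfo format fields for node name, CPU count, memory (in MB), and generic resources such as GPUs:

$user@host[~]:   sinfo -N -o "%12N %6c %10m %20G"     # node name, CPUs, memory, and GRES per node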

Status of past and current jobs

The sacct command gives some accounting details on past and current jobs.

$user@host[~]:   sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4359         jredo-0.x+    bigmemq     (null)          0  COMPLETED      0:0
4360         jredo-0.x+    bigmemq     (null)          0  CANCELLED      0:0
...

You can format that output in a more detailed form:

$user@host[~]:   sacct --format=jobid,user,jobname,partition,end,Elapsed,State
4359          user jredo-0.x+    bigmemq 2019-08-07T15:03:15   00:00:10  COMPLETED
4360          user jredo-0.x+    bigmemq 2019-08-07T15:03:42   00:00:05  CANCELLED
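
To see what a particular job actually used, you can query it by ID; -j, --format, Elapsed, and MaxRSS are standard sacct options/fields (the job ID below is taken from the example above):

$user@host[~]:   sacct -j 4359 --format=JobID,JobName,Elapsed,MaxRSS,State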

SLURM environment variables

When a SLURM job is scheduled to run, relevant information about the job, such as the names of the nodes it is running on, the number of cores, and the working directory, is saved in environment variables. Users can reference these environment variables in their job submission scripts.

Below is a list of the most common SLURM environment variables with a brief description of each; a fuller list is available on UMD's HPC page.

  • SLURM_JOB_ID - the ID number of the job
  • SLURM_JOB_NAME - the name given to the job
  • SLURM_JOB_NODELIST - the list of nodes allocated to the job
  • SLURM_JOB_NUM_NODES - the number of nodes allocated to the job
  • SLURM_NTASKS - the number of tasks (cores) requested
  • SLURM_CPUS_PER_TASK - the number of CPUs requested per task
  • SLURM_SUBMIT_DIR - the directory from which the job was submitted
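
As an illustration, a submission script might use these variables to report where it ran and to label its output; this is a minimal sketch, and the program and input names are placeholders:

#!/bin/bash
#SBATCH --partition=stdmemq
#SBATCH --ntasks=2
#SBATCH --time=01:00:00

cd $SLURM_SUBMIT_DIR                                         # directory the job was submitted from
echo "Job $SLURM_JOB_ID is running on: $SLURM_JOB_NODELIST"  # record the job ID and allocated nodes
./my_program input.dat > output.$SLURM_JOB_ID.log            # tag the output file with the job ID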
