Jobs on the cluster

Introduction

NOTE

Jobs on Arun-HPCC are managed by the Slurm Workload Manager. To learn more about the workload manager, please visit Slurm's official website.

Jobs on Arun-HPCC do not run the way they would on a personal computer, because the cluster's resources are shared among many users. If one user were allowed to take all of the cluster's resources for a week, no other job could run during that time, yet the goal of the HPCC is to provide high-performance computing power to everyone. To operate that way, HPCCs use a workload manager, or job scheduler, to manage jobs. The scheduler shares resources among users by queuing jobs and follows the fairshare principle: fairshare is a mechanism that allows resource-utilization information to be incorporated into job feasibility and priority decisions. Jobs on the HPCC are prioritized by considering several key factors (an example of how to inspect them follows this list):

  • Job duration: Short jobs don't tie up resources for long periods of time, so they have higher priority than long jobs.

  • Queue age: The longer a job sits in a queue waiting to run, the higher its priority.

  • Job size: Large jobs (i.e. those which need more nodes) are much harder to fit into the schedule than small jobs, so they get higher priority. Smaller jobs are then backfilled around the large ones.

  • Resource usage: The scheduler keeps track of per-user resource utilization. Users who have consumed few resources recently have priority over those who have consumed a lot.

  • User priority: The system operator gets higher priority, on the premise that timely execution of system housekeeping functions is crucial to smooth operation of the system.

  • Queue priority: The scheduler automatically sorts jobs into queues, and queues are prioritized based on factors such as runtime and project priority.
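
You can inspect how these factors combine for a pending job with Slurm's sprio command, and view per-user fairshare usage with sshare; the exact weights and columns depend on how the cluster is configured:

[user@master]$ sprio -l
[user@master]$ sshare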

The Slurm Workload Manager

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job, for example an MPI job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

SLURM Commands

Jobs on Arun-HPCC can be managed with the various Slurm commands. To understand how each Slurm command works, try it with the --help argument. The basic commands for managing your jobs are listed below, followed by a short example workflow:


  • srun command to submit a job for execution or initiate job steps in real time

  • sbatch command to submit a job

  • squeue command to check job status

  • scancel command to cancel a submitted job
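
A typical workflow with these commands might look like the following sketch; job.sh and the job ID 1234 are placeholders:

[user@master]$ sbatch job.sh
[user@master]$ squeue
[user@master]$ scancel 1234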

To learn about the sbatch command, use the --help argument:

[user@master]$ sbatch --help

To learn about the cluster's partitions, which nodes they include, and the general system state, use the sinfo command. A partition name followed by an asterisk (for example debug*) indicates the default partition for submitted jobs.
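
For example:

[user@master]$ sinfo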

To see the jobs currently on the system, use the squeue command:

[user@master]$ squeue
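
To list only your own jobs, squeue accepts a user filter (replace <username> with your account name):

[user@master]$ squeue -u <username>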

The scontrol command can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.

To learn about the Arun-HPCC partitions, show the partition details with scontrol:
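
[user@master]$ scontrol show partition

Append a partition name to limit the output to that partition.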

To find information about a specific node, show its details with scontrol:
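
[user@master]$ scontrol show node <nodename>

Replace <nodename> with the node's name, for example one listed by sinfo.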

To learn more about the SLURM commands, visit Harvard's Convenient SLURM Commands. Jobs on HPCCs are submitted with job scripts, which are written as shell scripts. To learn about job submission, threaded jobs, MPI jobs, array jobs, and GPU jobs, check Sample SLURM Scripts.
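
A minimal job script might look like the following sketch; the job name, partition (debug), resource requests, and time limit are illustrative placeholders that you should adjust for your own work:

#!/bin/bash
#SBATCH --job-name=test_job          # name shown by squeue
#SBATCH --partition=debug            # partition to submit to (placeholder)
#SBATCH --ntasks=1                   # number of tasks (processes)
#SBATCH --time=00:10:00              # wall-clock time limit, HH:MM:SS
#SBATCH --output=test_job_%j.out     # output file; %j expands to the job ID

# The commands the job will run
srun hostname

Save it as, for example, test_job.sh and submit it with sbatch test_job.sh.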