Jobs on cluster
Jobs on Arun-HPCC are managed by the Slurm Workload Manager. To learn more about the workload manager, visit Slurm's official website.
Jobs on Arun-HPCC do not run the way they do on a personal computer, because the HPCC's resources are meant to be shared among users. If one user could claim all of the cluster's resources for a week, no other job could run during that period, yet the goal of an HPCC is to provide high-performance computing power to everyone. To operate that way, HPCCs use a workload manager, or job scheduler, to manage jobs. A job scheduler shares resources among users by queuing jobs, and it works on the fairshare principle: a mechanism that incorporates resource-utilization information into job feasibility and priority decisions. Jobs on the HPCC are prioritized by considering several key factors:
Job duration: Short jobs don't tie up resources for long periods of time, so they have higher priority than long jobs.
Queue age: The longer a job sits in a queue waiting to run, the higher its priority.
Job size: Large jobs (i.e. those that need more nodes) are much harder to fit into the schedule than small jobs, so they get higher priority. Smaller jobs are then backfilled around the large ones.
Resource usage: The scheduler keeps track of per-user resource utilization. Users who have consumed few resources recently have priority over those who have consumed a lot.
User priority: The system operator gets higher priority, on the premise that timely execution of system housekeeping functions is crucial to smooth operation of the system.
Queue priority: The scheduler automatically sorts jobs into queues, and queues are prioritized based on factors such as runtime and project priority.
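On Slurm-based systems, the contribution of factors such as queue age, fairshare, and job size to a pending job's priority can be inspected with the sprio command (a sketch; the exact columns depend on the site's priority configuration):

```shell
# List the priority components (age, fairshare, job size, partition, QOS)
# of all pending jobs; -l selects the long output format
sprio -l

# Show the weight this cluster assigns to each priority factor
sprio -w
```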
The Slurm Workload Manager
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job, for example an MPI job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
The jobs on Arun-HPCC can be managed with the various Slurm commands. To understand how each Slurm command works, try running it with the --help argument. The basic commands for managing your jobs are given below:
srun: submit a job for execution or initiate job steps in real time
sbatch: submit a job script for batch execution
squeue: check job status
scancel: cancel a submitted job
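Typical invocations of these commands look like the following (the script name job.sh and the job ID 1234 are placeholders):

```shell
# Run a command interactively on one allocated node
srun -N 1 hostname

# Submit a batch job script (job.sh is a placeholder name)
sbatch job.sh

# Show your own queued and running jobs
squeue -u $USER

# Cancel a job by its ID (1234 is a placeholder)
scancel 1234
```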
To learn about the sbatch command, use:
[user@master]$ sbatch --help
To learn about the cluster's partitions, which nodes they include, and the general system state, use the sinfo command. A partition name followed by an asterisk (e.g. cluster*) indicates the default partition for submitted jobs.
To determine the current jobs on the system, use the squeue command. The scontrol command can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.
scontrol can also be used to learn about Arun-HPCC partitions or to find information about a specific node.
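For example (the node name arun01 is a placeholder; use a node name from the sinfo output):

```shell
# Show all partitions defined on the cluster
scontrol show partition

# Show detailed information about a specific node
# ("arun01" is a hypothetical node name)
scontrol show node arun01
```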
To learn more about the SLURM commands, visit Harvard's Convenient SLURM Commands. Jobs on HPCCs are submitted via job scripts, which are written as shell scripts. To learn about job submission, threaded jobs, MPI jobs, array jobs, and GPU jobs, check Sample SLURM Scripts.
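As a minimal sketch, a batch job script for Slurm might look like this. The partition name cluster and the resource values are assumptions for illustration; adjust them for Arun-HPCC:

```shell
#!/bin/bash
#SBATCH --job-name=hello          # job name shown by squeue
#SBATCH --partition=cluster       # partition to run in (assumed name)
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --time=00:10:00           # wall-clock limit, hh:mm:ss
#SBATCH --output=hello_%j.out     # stdout file; %j expands to the job ID

echo "Running on $(hostname)"
```

Saved as hello.sh, the script would be submitted with sbatch hello.sh and monitored with squeue -u $USER. The #SBATCH lines are comments to the shell but are read by sbatch as resource requests.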