|
|
[[_TOC_]]
|
|
|
# Running jobs through Slurm
|
|
|
|
|
|
Our clusters are accessed through a login node[^1]. The login node can be used for low-CPU work such as software development (testing, compiling, porting, small tests), job management, and data transfer.
|
|
|
|
|
|
However, all heavy-duty computing must go through the Slurm manager; non-complying processes may be killed without prior notice.
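As a minimal sketch, a batch job is described by a script and handed to Slurm with `sbatch` (the job name and time limit below are illustrative; pick a partition from the list further down):

```shell
#!/bin/bash
#SBATCH --job-name=hello        # name shown by squeue
#SBATCH --partition=seq         # a sequential partition (cluster dependent)
#SBATCH --ntasks=1              # a single task
#SBATCH --time=00:10:00         # wall-clock limit, HH:MM:SS

# Everything below runs on the allocated compute node, not the login node.
hostname
```

Submit it with `sbatch myjob.sh`; Slurm replies with the job ID.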
|
|
|
|
|
|
Usage examples are provided in the [cluster-specific](Home) pages of this wiki, but all facilities share a common architecture.
|
|
|
|
|
|
## Useful links
|
|
|
* [The Slurm cheat sheet](https://slurm.schedmd.com/pdfs/summary.pdf): the most usual commands and options.
|
|
|
* [The Rosetta page](https://slurm.schedmd.com/rosetta.pdf): if you're used to a popular job scheduler, you will find translations of the most common commands.
|
|
|
|
|
|
# Submitting different types of jobs
|
|
|
All the examples are [provided in this repository](/home#examples).
|
|
|
|
|
|
* [SLURM basics/sequential jobs](Running-the-jobs/SLURM-basics)
|
|
|
How to get a simple job done on the cluster, how to monitor your jobs, etc.
|
|
|
* [Multi threaded jobs/shared memory parallelism](Running-the-jobs/Multi-threaded-jobs)
|
|
|
How to run a multi-threaded job on a node.
|
|
|
* [MPI jobs/multi node parallelism](Running-the-jobs/MPI-jobs)
|
|
|
How to run and place MPI jobs.
|
|
|
* [Interactive sessions](Running-the-jobs/Interactive-sessions)
|
|
|
How to connect to a node for an interactive session.
|
|
|
|
|
|
## Partition usage
|
|
|
There are different partitions adapted to different kinds of jobs. The time limit for each partition can be obtained with the [sinfo](https://slurm.schedmd.com/sinfo.html) command.
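For instance, a format string can restrict `sinfo` to the columns of interest (the exact output depends on the cluster):

```shell
# Partition name, time limit, availability, node count and node state
sinfo -o "%P %l %a %D %t"
```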
|
|
|
### Licallo
|
|
|
|
|
|
* *seq:* for sequential/multi-thread jobs.
|
|
|
* *short-seq:* for short sequential/multi-thread jobs.
|
|
|
* *x40:* for parallel jobs; each node has 40 cores.
|
|
|
* *short:* for short parallel jobs.
|
|
|
* *fdr:* should be avoided.
|
|
|
* *besteffort:* for idempotent jobs. This partition has few constraints, but its jobs will be interrupted whenever their resources are needed (advanced).
|
|
|
* *bash:* for interactive jobs.
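A partition is selected at submission time; as an illustrative sketch:

```shell
# Batch job on the sequential partition
sbatch --partition=seq myjob.sh

# Interactive shell on the bash partition (--pty attaches a pseudo-terminal)
srun --partition=bash --pty bash
```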
|
|
|
|
|
|
|
|
|
# Controlling/querying jobs
|
|
|
* [When is my job going to start?](Running-the-jobs/When is my job going to start)
|
|
|
|
|
|
# Tips and tricks
|
|
|
* [Pack arrays of sequential jobs](Running-the-jobs/Pack-arrays-of-sequential-jobs) to efficiently handle job arrays
|
|
|
* [Connect on a running job's node](Running-the-jobs/Connect-on-a-running-job's-node) for debugging purposes
|
|
|
* [Use local copies for IO intensive work](Running-the-jobs/Use-local-copies-for-IO-intensive-work)
|
|
|
How to use local file access when the shared disk space cannot handle the load.
|
|
|
* [Job dependencies](Running-the-jobs/Job-Dependencies) to handle restarts and multi-stage jobs
|
|
|
* [Moving data around](Running-the-jobs/Moving-Data-Around)
|
|
|
* [Using a specific account](Running-the-jobs/Using-a-specific-account)
|
|
|
* [Special tools](Running-the-jobs/Special-Tools)
|
|
|
Running Mathematica, etc.
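The local-copy pattern from the tips above can be sketched as follows (the paths, the local scratch directory, and the `my_solver` program are placeholders; check the cluster-specific pages for the actual local storage location):

```shell
#!/bin/bash
#SBATCH --partition=seq
#SBATCH --time=01:00:00

# Copy the input from shared storage to node-local scratch
# ($TMPDIR is an assumption; your cluster may use another variable or path).
cp "$HOME/data/input.dat" "$TMPDIR/"

cd "$TMPDIR"
./my_solver input.dat > output.dat   # hypothetical program

# Copy the results back to shared storage before the job ends,
# otherwise they are lost when the local scratch is cleaned.
cp output.dat "$HOME/results/"
```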
|
|
|
|
|
|
[^1]: There may be more than one physical machine behind it, to ensure load balancing and fault tolerance, but you only need to remember the logical name, such as *licallo.oca.eu* for the HPC cluster.
|
|
|
|
|
|
|
|
|
# Maximum job size
|
|
|
In general, the fastest way (though by trial and error) to check this kind of thing is to submit the job and then run `squeue -u alainm`.
|
|
|
|
|
|
If the job is held pending (state PD) by a limit, the output will indicate which one.
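A sketch of that check, with an explicit format string (`%t` is the job state, `%R` the reason a pending job is waiting; the username is an example):

```shell
# Job ID, partition, name, state, elapsed time and pending reason
squeue -u alainm -o "%.10i %.9P %.8j %.2t %.10M %R"
```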
|
|
|
|
|
|
Otherwise, the maximum number of cores is generally given by the MaxTRES property of the QOS (Quality of Service) qosoca-par; at the moment it is 600:
|
|
|
|
|
|
```
15:48:58 [alainm@pollux rel]# sacctmgr show QOS qosoca-par format=name,MaxTRES
      Name       MaxTRES
---------- -------------
qosoca-par       cpu=600
15:49:16 [alainm@pollux rel]#
```
|
|
|
|
|
|
But this can change, for example depending on the load.
|
|
|
TRES stands for Trackable RESources, which on our clusters are the cores (because it is important to stay intuitive).
|
|
|