Welcome to the GPU Cluster introduction page of the TU-Wien Datalab.
This page provides a short overview of the cluster, the login methods, and a short introduction to Slurm. Further information about Slurm is available in the official documentation (see the Links section below).
You can join our Slack channel: https://join.slack.com/t/gpuclu-tuw/shared_invite/zt-2g2qzpz3b-O1pmYVXO3kkhP_HhPMKKHQ
Creating a User
To access the Slurm cluster you first need a Datalab user account at https://login.datalab.tuwien.ac.at, where you can configure your SSH public key. You can create this account yourself if you have TU Wien credentials; otherwise you need to contact us.
Once you have created your account, write to us on the #support Slack channel or create a Jira request so we can grant you the access and permissions needed to log in.
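If you do not yet have an SSH key pair, a minimal sketch for creating one looks like this (the key type and comment are only examples); the contents of the resulting `.pub` file is what you paste into the login portal:

```bash
# Generate an SSH key pair; the public half ends up in ~/.ssh/id_ed25519.pub
ssh-keygen -t ed25519 -C "your-tuwien-username"

# Print the public key so you can copy it into https://login.datalab.tuwien.ac.at
cat ~/.ssh/id_ed25519.pub
```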
Login to the Cluster
SSH (Secure Shell):
Afterwards you can log in remotely to the head node of the cluster, from which you submit your jobs:
```bash
$ ssh cluster.datalab.tuwien.ac.at -l username
```
Jupyter Notebook:
If you want to access the cluster via a Jupyter notebook, you can instead open https://jupyter.datalab.tuwien.ac.at in your browser. For this you need to have set a password in the previous step when creating your user.
Slurm Cluster
Cluster Nodes:
Hostname | CPU Type | CPUs | Cores/CPU | Threads/Core | GPU Type | GPU Bus | GPU Count | GPU Mem | InfiniBand |
---|---|---|---|---|---|---|---|---|---|
a-a100-o-1 | AMD | 2 | 32 | 2 | a100 | SXM4 | 8 | 80GB | YES |
a-a100-o-2 | AMD | 2 | 64 | 2 | a100 | SXM4 | 8 | 80GB | YES |
a-a100-os-3 | AMD | 2 | 64 | 2 | a100s | SXM4 | 8 | 40GB | YES |
a-a100-os-4 | AMD | 2 | 64 | 2 | a100s | SXM4 | 8 | 40GB | YES |
a-a100-q-5 | AMD | 2 | 64 | 2 | a100 | SXM4 | 4 | 80GB | YES |
a-a100-q-6 | AMD | 2 | 32 | 2 | a100 | SXM4 | 4 | 80GB | YES |
a-a100-qs-7 | AMD | 2 | 64 | 2 | a100s | SXM4 | 4 | 40GB | YES |
a-a100-qs-8 | AMD | 2 | 64 | 2 | a100s | SXM4 | 4 | 40GB | YES |
i-v100-o-1 | Intel | 2 | 12 | 2 | v100 | PCIe 3 | 8 | 32GB | YES |
i-v100-h-2 | Intel | 2 | 28 | 2 | v100 | PCIe 3 | 16 | 32GB | YES |
i-v100-q-3 | Intel | 2 | 8 | 2 | v100 | SXM | 8 | 32GB | YES |
i-v100-q-4 | Intel | 2 | 8 | 2 | v100 | SXM | 8 | 32GB | YES |
a-a40-o-1 | AMD | 2 | 24 | 2 | a40 | PCIe 4 | 8 | 48GB | NO |
DGXs | | | | | | | | | |
dgx-h100-1 | Intel | 2 | 56 | 2 | h100 | SXM5 | 8 | 80GB | YES |
dgx-h100-2 | Intel | 2 | 56 | 2 | h100 | SXM5 | 8 | 80GB | YES |
dgx-h100-3 | Intel | 2 | 56 | 2 | h100 | SXM5 | 8 | 80GB | YES |
VMs | | | | | | | | | |
ivm-a40-q-2 | Intel | 2 | 10 | 2 | a40 | PCIe 4 | 4 | 48GB | NO |
ivm-a40-q-3 | Intel | 2 | 10 | 2 | a40 | PCIe 4 | 4 | 48GB | NO |
avm-v100-d-5 | AMD | 2 | 8 | 1 | v100 | PCIe 4 | 2 | 32GB | NO |
avm-a100-qs-9 | AMD | 2 | 34 | 1 | a100s | PCIe 4 | 4 | 40GB | NO |
avm-a100-d-10 | AMD | 2 | 18 | 1 | a100 | PCIe 4 | 2 | 80GB | NO |
Storage:
The main storage is Ceph, a network file system that provides `/home` and `/share` on all servers:
- `/home` holds the users' home directories.
- `/share` holds group directories for collaborations (please reach out to us for special groups and permissions).
- `/scratch` is local NVMe storage (RAID 0); it exists only on the GPU nodes and is also mounted on the head nodes via NFS.
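As an illustration only (the dataset and result paths below are hypothetical), a job might stage its data onto the node-local `/scratch` and copy results back to the Ceph-backed `/home` at the end:

```bash
#!/bin/bash
#SBATCH --partition=GPU-a100
#SBATCH --gres=gpu:a100:1
#SBATCH --time=02:00:00

# Hypothetical staging pattern: work on fast node-local NVMe, persist to Ceph.
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp -r "$HOME/datasets/my-dataset" "$WORKDIR/"      # hypothetical input data

# ... run your computation here, reading and writing under $WORKDIR ...

cp -r "$WORKDIR/results" "$HOME/results-$SLURM_JOB_ID"
rm -rf "$WORKDIR"
```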
Partitions (queues):
- GPU-v100: i-v100-o-1,i-v100-h-2,i-v100-q-[3-4]
- GPU-a40: a-a40-o-1
- GPU-a100: a-a100-o-[1-2],a-a100-q-[5-6]
- GPU-a100s: a-a100-h-9,a-a100-os-[3-4],a-a100-qs-[7-8]
Default parameters:
There are default parameters set for each partition, and they differ between partitions. For example:
- Maximum nodes per job: 2.
- Maximum job time: 7 days.
- The default time for any job is 24 hours; to override this value, set the time limit option `--time` in your script, up to a maximum of 168 hours.
- Default memory per CPU, default CPUs per GPU, and default memory per GPU vary between partitions, but you can override them in your script.
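For example, a script that overrides the default time limit and the per-CPU memory might start like this (the values are only placeholders; the available options and limits depend on the partition):

```bash
#!/bin/bash
#SBATCH --partition=GPU-a100
#SBATCH --time=72:00:00        # override the 24 h default, up to the 168 h maximum
#SBATCH --mem-per-cpu=4G       # override the partition's default memory per CPU
#SBATCH --cpus-per-gpu=8       # override the partition's default CPUs per GPU
```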
To make it easier to put together the parameters for your Slurm bash script, you may use this code generator platform: https://code-gen.datalab.tuwien.ac.at/ and also check the `sbatch` man pages for more information.
Using GPUs in your job:
To use GPUs, you should request them in your script, as in the following example:
```bash
#SBATCH --gres=gpu:a100s:2   # recommended way: selects the right GPU type on servers with
                             # multiple GPU types that may appear in multiple partitions
```
or
```bash
#SBATCH --gres=gpu:2
```
When specifying the GPU type (which is recommended), it must correspond to the selected partition, as in the following:
```bash
#SBATCH --partition=GPU-a100s
#SBATCH --gres=gpu:a100s:2
```
or
```bash
#SBATCH --partition=GPU-a6000
#SBATCH --gres=gpu:a6000:2
```
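Putting the two directives together, a minimal GPU batch script might look like the following sketch (partition, GPU type, and counts are only illustrative; `nvidia-smi` is used just to display the allocated GPUs):

```bash
#!/bin/bash
#SBATCH --partition=GPU-a100s     # partition must match the GPU type below
#SBATCH --gres=gpu:a100s:2        # request 2 GPUs of type a100s
#SBATCH --time=00:10:00

# Show which GPUs Slurm allocated to this job
nvidia-smi
```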
Submitting a batch job:
Processing computational tasks requires submitting jobs to the Slurm scheduler. Slurm offers two commands to submit jobs: `sbatch` and `srun`. Always use `sbatch` to submit jobs to the scheduler, unless you need an interactive terminal. Apart from that, use `srun` only within an `sbatch` script, to launch job steps inside that script. The `sbatch` command accepts script files as input.
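As a sketch of `srun` being used only for job steps inside an `sbatch` script (the program name `my_app` is a placeholder):

```bash
#!/bin/bash
#SBATCH --partition=GPU-a100
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00

# Each srun call launches a job step inside the allocation created by sbatch;
# my_app stands in for your own program.
srun ./my_app --phase preprocess
srun ./my_app --phase train
```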
Scripts should be written in bash, and should include the appropriate Slurm directives at the top of the script telling the scheduler the requested resources. Read on to learn more about how to use Slurm effectively.
Interactive session
To start an interactive session, use `srun` as in the following example:
```bash
$ srun --pty bash
```
There are more options for the `srun` command; please check its man pages.
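For example, you might ask for a specific number of CPUs, memory, and a time limit for the interactive shell (the values below are only placeholders):

```bash
$ srun --cpus-per-task=4 --mem=8G --time=02:00:00 --pty bash
```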
Job submission
You simply create your bash script and submit it with the `sbatch` command, as in the following example.
A small example job submission script called `hello-world.sh`:
```bash
#!/bin/bash
#SBATCH --partition=GPU-a100     # select a partition, i.e. "GPU-a100"
#SBATCH --nodes=2                # select number of nodes
#SBATCH --ntasks-per-node=8      # select number of tasks per node
#SBATCH --time=00:15:00          # request 15 min of wall clock time
#SBATCH --mem=2GB                # memory size required per node

echo "hello, world"
```
In the part between the shebang and your script, you define the resources needed by your job by adding `#SBATCH [option]` lines. These are tunable; please check the `sbatch` man pages for other options.
To submit the job:
```bash
$ sbatch hello-world.sh
```
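By default, `sbatch` writes the job's output to a file named `slurm-<jobid>.out` in the directory you submitted from; if you prefer a custom name, you can add an `--output` directive, for example:

```bash
#SBATCH --output=%x-%j.out    # %x = job name, %j = job ID
```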
Basic Slurm commands
- `sinfo` shows which partitions (queues) are available for job submission.
- `scontrol` is used to view the Slurm configuration, including job, job step, node, partition, reservation, and overall system configuration. Without a command on the command line, `scontrol` operates in interactive mode and prompts for input; with a command, it executes that command and terminates.
- `scontrol show job 32` shows information on the job with number 32.
- `scontrol show partition` shows information on the available partitions.
- `squeue` shows the current list of submitted jobs, their state, and their resource allocation. The `squeue` man page describes the most important job reason codes it reports.
- `scancel 32` cancels your job number 32.
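A typical sequence while working with a job might look like this (job ID 32 is just an example):

```bash
$ sinfo                       # list partitions and node states
$ sbatch hello-world.sh       # submit the job
$ squeue -u $USER             # show only your own jobs
$ scontrol show job 32        # inspect a specific job in detail
$ scancel 32                  # cancel it if it is no longer needed
```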
Links
- The official Slurm documentation: https://slurm.schedmd.com/documentation.html
- The Slurm documentation of the Vienna Scientific Cluster: https://wiki.vsc.ac.at/doku.php?id=doku:slurm
FAQ:
Q: I am scheduling an interactive session via `srun --pty bash` but I don't see the GPUs.
A: Scheduling an interactive session is like scheduling any other job in Slurm. To get an interactive session with GPUs, you need to specify additional parameters for the partition and the GPU resources, for example:

```bash
$ srun -p GPU-a100 --gres=gpu:a100:2 -w a-a100-o-1 --pty bash
```

Here you are scheduling on the partition `GPU-a100`, asking for 2 GPUs of type `a100`, and optionally requesting that the session runs on the node `a-a100-o-1`.