
Welcome to the GPU Cluster introduction page of the TU-Wien Datalab. 
This page provides a short overview of the cluster, the available login methods, and a brief introduction to Slurm. Further information about Slurm is available in the official documentation.

You can join our Slack channel: https://join.slack.com/t/gpuclu-tuw/shared_invite/zt-2g2qzpz3b-O1pmYVXO3kkhP_HhPMKKHQ. If you have any issue, you can also reach us by email at datalab@tuwien.ac.at.

Creating a User

To access the Slurm cluster you first need a datalab user account at https://login.datalab.tuwien.ac.at, where you can configure your SSH public key. You can create this account yourself if you have TU Wien credentials; otherwise, please contact us.

Once you have created your account, contact datalab@tuwien.ac.at so that we can grant you permission to log in.
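If you do not already have an SSH key pair, you can generate one on your local machine and paste the public key into your account settings at https://login.datalab.tuwien.ac.at. A minimal sketch (the ed25519 key type and the file paths are just the OpenSSH defaults, not a cluster requirement):

$ ssh-keygen -t ed25519          # creates ~/.ssh/id_ed25519 (private) and ~/.ssh/id_ed25519.pub (public)
$ cat ~/.ssh/id_ed25519.pub      # copy this public key into your datalab account settings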


Login to the Cluster

SSH (Secure Shell):

Once your account has been granted access, you can log in remotely to the head node of the cluster, where you submit your jobs:

$ ssh cluster.datalab.tuwien.ac.at
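Optionally, you can add a host entry to your local ~/.ssh/config so you do not have to type the full hostname each time. This is only a convenience sketch; the alias, username and key path below are placeholders:

# ~/.ssh/config on your local machine
Host datalab
    HostName cluster.datalab.tuwien.ac.at
    User your-username                  # placeholder: your datalab username
    IdentityFile ~/.ssh/id_ed25519      # placeholder: the key you configured above

Afterwards, $ ssh datalab is enough to reach the head node.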

Jupyter Notebook:

If you want to access the cluster via a Jupyter notebook, open https://jupyter.datalab.tuwien.ac.at in your browser instead. For this you need the password you set when creating your user account in the previous step.

Slurm Cluster

Cluster Nodes:

Hostname      CPU Type   CPUs   Cores   Threads   GPU Type   Count   GPU Mem
a-a100-o-1    AMD        2      32      2         A100       8       80GB
a-a100-o-2    AMD        2      64      2         A100       8       80GB
a-a100-os-3   AMD        2      64      2         A100       8       40GB
a-a100-os-4   AMD        2      64      2         A100       8       40GB
a-a100-q-5    AMD        2      64      2         A100       4       80GB
a-a100-q-6    AMD        2      32      2         A100       4       80GB
a-a100-qs-7   AMD        2      64      2         A100       4       40GB
a-a100-qs-8   AMD        2      64      2         A100       4       40GB
i-v100-o-1    Intel      2      12      2         V100       8       32GB
i-v100-h-2    Intel      2      28      2         V100       16      32GB
i-v100-q-3    Intel      2      8       2         V100       8       32GB
i-v100-q-4    Intel      2      8       2         V100       8       32GB

Storage:

The main storage is Ceph, a network file system that provides /home and /share on all servers:

  • /home holds the users' home directories
  • /share holds group directories for collaborations (please contact us for special groups and permissions)
  • /scratch is local NVMe storage (RAID 0) that exists only on the GPU nodes; it is also mounted on the head nodes via NFS. A typical staging pattern is shown in the example below.
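Because /scratch is fast local NVMe, a common pattern is to stage data there at the start of a job and copy the results back to /home at the end. The sketch below assumes a /scratch/$USER layout and uses a placeholder training script; adjust both to your own setup:

#!/bin/bash
#SBATCH --partition=GPU-a100
#SBATCH --time=01:00:00

# Assumption: per-user directories may be created under /scratch.
SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"

cp -r "$HOME/my-dataset" "$SCRATCH_DIR/"          # stage input data onto local NVMe
python train.py --data "$SCRATCH_DIR/my-dataset" --out "$SCRATCH_DIR/results"   # placeholder script

cp -r "$SCRATCH_DIR/results" "$HOME/"             # copy results back to Ceph-backed /home
rm -rf "$SCRATCH_DIR"                             # clean up local scratch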

Partitions (queues):

  • GPU-a100: a-a100-o-[1-2],a-a100-q-[5-6]
  • GPU-a100s: a-a100-os-[3-4],a-a100-qs-[7-8]
  • GPU-v100: i-v100-o-1,i-v100-h-2,i-v100-q-[3-4]
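Before submitting, you can check the current state of these partitions and their nodes with sinfo, for example:

$ sinfo -p GPU-a100,GPU-a100s,GPU-v100      # node states summarised per partition
$ sinfo -p GPU-a100 -N -l                   # long, per-node view of a single partition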

Basic Slurm commands

  • sinfo shows which partitions (queues) are available for job submission.
  • scontrol is used to view the Slurm configuration, including job, job step, node, partition, reservation, and overall system settings. Without a command on the command line, scontrol operates in interactive mode and prompts for input; with a command, it executes that command and terminates.
  • scontrol show job 32 shows information on the job with number 32.
  • scontrol show partition shows information on available partitions.
  • squeue shows the current list of submitted jobs, their state and their resource allocation. The squeue man page describes the most important job reason codes it reports.
  • scancel 32 to cancel your job number 32.
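A typical inspection workflow with these commands could look like this (job ID 32 is just an example):

$ sinfo                        # which partitions are up and how many nodes are idle
$ sbatch my-job.sh             # submit a batch script (see the example below); prints the job ID
$ squeue -u $USER              # show only your own jobs and their state
$ scontrol show job 32         # detailed view of job 32 (resources, reason code, node list)
$ scancel 32                   # cancel job 32 if it is no longer needed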

Submitting a batch job:

Processing computational tasks requires submitting jobs to the Slurm scheduler. Slurm offers two commands to submit work: sbatch and srun. Always use sbatch to submit jobs to the scheduler unless you need an interactive terminal; on its own, srun should only be used for interactive sessions, and otherwise within an sbatch script to launch job steps.

The command sbatch accepts script files as input. Scripts should be written in bash and include the appropriate Slurm directives at the top, telling the scheduler which resources the job requests. Read on to learn how to use Slurm effectively.

For an interactive session, use srun directly, as in the following example:

$ srun --pty bash
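To get an interactive shell with a GPU attached, request the resources directly in the srun call. The --gres syntax below is standard Slurm; the exact GRES name configured on this cluster is an assumption, so adjust it if needed:

$ srun --partition=GPU-a100 --gres=gpu:1 --cpus-per-task=8 --mem=16G --time=01:00:00 --pty bash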

For batch jobs, you simply create your bash script and submit it with the sbatch command, as in the following example:


A small example job submission script called hello-world.sh 

#!/bin/bash

#SBATCH --partition=GPU-a100 		# select a partition i.e. "GPU-a100"
#SBATCH --nodes=2 					# select number of nodes
#SBATCH --ntasks-per-node=32 		# select number of tasks per node
#SBATCH --time=00:15:00 			# request 15 min of wall clock time
#SBATCH --mem=2GB 					# memory size required per node

echo "hello, world"

In the part between the shebang and your actual script, each #SBATCH [option] line defines resources required by your job. These directives are tunable; please check the sbatch man pages for further options.


To submit the job:

$ sbatch hello-world.sh
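For GPU jobs you also have to request GPUs explicitly. A minimal sketch of a single-node GPU job follows; the --gres value and the train.py script are assumptions about your environment, not cluster defaults:

#!/bin/bash

#SBATCH --partition=GPU-a100        # one of the GPU partitions listed above
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2                # request 2 GPUs on the node (GRES name assumed to be "gpu")
#SBATCH --mem=64GB
#SBATCH --time=04:00:00
#SBATCH --output=job-%j.log         # %j expands to the job ID

nvidia-smi                          # show the GPUs allocated to this job
srun python train.py                # launch the job step; train.py is a placeholder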

