# KISLURM KI-Cluster quickstart
## Logging in to the cluster
The cluster is not directly accessible from outside the university network, so to access it from private machines you first need to log in to a machine inside the university network. You can use either
- "lmblogin" `lmblogin.informatik.uni-freiburg.de`
- the technical faculty's login server `login.informatik.uni-freiburg.de`
From there you can access the cluster via one of the following login nodes (it does not matter which one you choose):
- kislogin1.rz.ki.privat
- kislogin2.rz.ki.privat
- kislogin3.rz.ki.privat
- account:
- use your TF account (**NOT** the LMB account)
- if your TF account works (see TF account access above) but you still don't have access to the cluster, create a ticket here: https://osticket.informatik.uni-freiburg.de/open.php
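Assuming standard OpenSSH on your private machine, the two hops can be combined with a jump host (a sketch; replace `username` with your TF account name):
```bash
# one-hop login from outside the university network,
# using the TF login server as jump host
ssh -J username@login.informatik.uni-freiburg.de username@kislogin1.rz.ki.privat
```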
## Get cluster and job info
- get info about nodes and partitions
```bash
sfree
```
- get info about partitions you are allowed to access
```bash
sinfo
```
- get info about **all** jobs (queued or running)
```bash
squeue
```
- get info about jobs of specific **user**
```bash
squeue -u username
```
- get info about **your** jobs
```bash
squeue --me
```
## Create and use storage (workspaces)
- Create a workspace
```bash
ws_allocate workspacename 100
```
- allocates a workspace named `workspacename`
- requests a duration of 100 days (the maximum on dlclarge)
- it can be extended! (up to 5 times)
- Info about your workspaces
```bash
ws_list
```
- Configure an e-mail reminder before the workspace expires
- notify 14 days before expiration
- notify email@cs.uni-freiburg.de
- example: while creating the workspace
```bash
ws_allocate -r 14 -m email@cs.uni-freiburg.de workspacename 100
```
- example: add a reminder to an existing workspace (use the extend option (-x) with duration 0)
```bash
ws_allocate -x -r 14 -m email@cs.uni-freiburg.de workspacename 0
```
- Extend a workspace (reset the remaining duration to 100 days)
```bash
ws_allocate -x workspacename 100
```
- the extend option (-x) with a duration of 0 updates an existing workspace's settings without changing its duration
### How to access your workspaces
- mounted on
```bash
/work/dlclarge1/username-workspacename/
```
- or
```bash
/work/dlclarge2/username-workspacename/
```
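If the workspace tools on the cluster include `ws_find` (an assumption: it is part of the same hpc-workspace suite as `ws_allocate` and `ws_list`), you can resolve the path directly instead of guessing the volume:
```bash
# prints the full mount path of the given workspace,
# e.g. /work/dlclarge1/username-workspacename
ws_find workspacename
```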
### Examples to get data to the SLURM cluster
- You are on the SLURM cluster and want to copy data from LMB storage
```bash
rsync -avz -e "ssh -p 2122" dienertj@lmblogin.informatik.uni-freiburg.de:/misc/lmbraid21/dienertj/data/cfg-gan-cifar10 /work/dlclarge1/dienertj-cfggan/data/
```
- You are on your personal computer and want to copy data to the SLURM cluster
- SLURM login nodes are not directly accessible from the internet
- we use the jump host login.informatik.uni-freiburg.de
```bash
rsync -avz -e "ssh -J dienertj@login.informatik.uni-freiburg.de:22 -oPort=22" /tmp/testfolder dienertj@kislogin1.rz.ki.privat:/home/dienertj/
```
### Access your TF/LMB home
- your LMB home directory will be mounted on /ihome/username
- **you can cd there, even if /ihome/ is empty!**
```bash
cd /ihome/username
```
- **warning**: the connection to ihome is slow! Don't train directly on data stored on ihome; stage it into a workspace first (see the sketch below).
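A minimal staging sketch (the dataset path is a placeholder; the paths follow the examples above):
```bash
# copy once from the slow ihome mount to fast workspace storage,
# then point your training script at the workspace copy
rsync -av /ihome/username/datasets/mydataset /work/dlclarge1/username-workspacename/data/
```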
## Running jobs
- for all jobs you should
- specify the partition (the equivalent of a queue in torque)
- specify a time limit
- specify the memory requirement
- specify the CPU core requirement (a sketch pinning all four follows the interactive examples below)
### Job submission examples
**Note**: Run `sinfo` to find out which queues you are allowed to access.
#### Interactive sessions (gets you a bash shell on a node)
- interactive session on a node from partition lmbhiwi_gpu-rtx2080
```bash
srun -p lmbhiwi_gpu-rtx2080 --pty bash
```
- interactive session with 2 GPUs, 40GB system memory, and a 23:59:59 time limit
```bash
srun -p lmbhiwi_gpu-rtx2080 --gpus-per-node=2 --mem=40G --time=23:59:59 --pty bash
```
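As referenced above, a sketch that pins all four resources explicitly; `--cpus-per-task` is the standard SLURM flag for CPU cores:
```bash
# interactive session with 1 GPU, 4 CPU cores, 20GB memory, 8h time limit
srun -p lmbhiwi_gpu-rtx2080 --gpus-per-node=1 --cpus-per-task=4 --mem=20G --time=08:00:00 --pty bash
```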
#### Batch jobs
- submit a job script, requesting 8 GPUs
```bash
sbatch -p lmbhiwi_gpu-rtx2080 --nodes=1 --gpus-per-node=8 --mem=40G /home/your-username/myproject/train_experiment42.sh
```
- the job script has to be written in bash and must contain all the commands needed to run your training.
- the following is a minimal example that activates your virtual environment and starts your python script:
```bash
#!/bin/bash
source /home/your-username/venvs/myenv/bin/activate
python3 /home/your-username/myproject/train.py
```
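Alternatively, SLURM reads `#SBATCH` directives from the script header, so the resource requests can live in the script instead of on the command line; a sketch with the same resources as above (standard SLURM syntax):
```bash
#!/bin/bash
#SBATCH -p lmbhiwi_gpu-rtx2080
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --mem=40G
source /home/your-username/venvs/myenv/bin/activate
python3 /home/your-username/myproject/train.py
```
With the directives in place, the submission reduces to `sbatch /home/your-username/myproject/train_experiment42.sh`.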
#### Array jobs
- if you have multiple experiments to run, you can use array jobs instead of creating a separate job script for each experiment
- you can define how many tasks run in parallel
- example: submit the same script as ten array tasks, at most two running at once (a minimal sketch; `%N` limits concurrency in standard SLURM, and the script path is a placeholder)
```bash
sbatch --array=0-9%2 -p lmbhiwi_gpu-rtx2080 --gpus-per-node=1 --mem=40G /home/your-username/myproject/run_array.sh
```
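Inside the script, SLURM exposes the task index as `$SLURM_ARRAY_TASK_ID` (standard SLURM; the config file layout is a hypothetical example):
```bash
#!/bin/bash
# each array task selects its own experiment config by its index
source /home/your-username/venvs/myenv/bin/activate
python3 /home/your-username/myproject/train.py --config configs/experiment_${SLURM_ARRAY_TASK_ID}.yaml
```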
### Interact with running jobs
- with SLURM you can execute commands within running jobs
- example: we want to check the GPU utilization of a job we submitted
- use squeue to get the jobid
```bash
> squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3999690 lmb_gpu-r myocardi dienertj R 2:59:32 1 dagobert
```
- execute nvidia-smi within the job
```bash
srun --jobid=3999690 nvidia-smi
```
- we can even get an interactive bash session:
```bash
srun --jobid=3999690 --pty bash
```
### Connect to a running job
Useful e.g. for connecting VSCode, running a debugger, or building an SSH tunnel to connect Jupyter notebooks.
- Start an interactive job
- Then log in to the assigned node via ssh
For example:
```bash
# start a job:
srun -p lmb_gpu-1080ti --gpus-per-node=1 --mem=10G --pty bash
# then (for example from your local lmb workstation):
ssh -p 22 USERNAME@mario
# (assuming that mario is the node where your job was started)
```
So, after starting a job, students can connect their VSCode via SSH to the node to edit and debug code there. Port forwarding is also possible, e.g. to start tensorboard/jupyter/etc. on the node and access it from the local workstation (see the tunnel sketch below).
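A tunnel sketch for the port-forwarding case, assuming a TensorBoard on its default port 6006 inside a job on `mario`:
```bash
# run on your local workstation; then open http://localhost:6006 in a browser
ssh -L 6006:localhost:6006 USERNAME@mario
```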
The advantage of this setup: no GPU conflicts, as the GPU is assigned to a specific user via the job.
The disadvantages:
- a little extra effort compared to just connecting via ssh
- only one user per GPU, even if that user spends 90% of the time just editing code without using the GPU
- the "developing session" ends after the walltime (e.g. 24 hours)
## See also
You can find additional info about the KISLURM cluster in our [ticket system](https://osticket.informatik.uni-freiburg.de/open.php) under knowledgebase.
# LMB cluster quickstart
*Note: the [queue status page](https://lmb.informatik.uni-freiburg.de/interna/web-qstat/) requires the internal "lmb" password.*
If you are an MSc student, you should work only on the KISLURM cluster unless you have a good reason not to. If you do work on the LMB cluster, do not use the 3090 machines.
## Usage guidelines
- Do not waste resources.
Measure or estimate the resources your program needs in advance, on a local machine or on a development server (**dacky**).
- Do not use more than two GPUs at the same time! Use the **student** queue to automatically enforce this.
- See the [serverload page](https://lmb.informatik.uni-freiburg.de/interna/serverload/)
## Example job submission
We will submit the following example job script `myjob.sh`:
```bash
#!/bin/bash
echo `hostname`
sleep 30 # wait 30 seconds
```
1. Log in to `lmbtorque` with `ssh`.
2. Use the `qsub` command to submit the job script:
```bash
qsub -l nodes=1:ppn=1,mem=100mb -q student myjob.sh
```
The command above reserves 1 CPU core and 100 MB of memory.
On success, a job identifier is written to stdout.
3. Check the state of your job with the `qstat` command.
`qstat` returns information about the state of all submitted jobs.
The second to last column shows the current job state, which is one of
- **C** job is completed after having run.
- **H** job is held.
- **Q** job is queued, eligible to run or routed.
- **R** job is running.
4. Check the program output after the job is completed.
torque will create two files named `myjob.sh.oXXXXXX` and `myjob.sh.eXXXXXX` containing the stdout and stderr output of your program.
For more information, see the manuals and help texts of `qsub` and `qstat` and the torque manual: http://docs.adaptivecomputing.com/8-1-0/basic/help.htm
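torque also supports interactive jobs via `qsub -I`, which drops you into a shell on the assigned node once the job starts; a sketch with the same resources as above:
```bash
qsub -I -l nodes=1:ppn=1,mem=100mb -q student
```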
## Example job script
All parameters to `qsub` can be set within the script file as shown below:
```bash
#PBS -N Jobname
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1,mem=1gb,walltime=24:00:00
#PBS -m ae
#PBS -j oe
#PBS -q student
echo `hostname`
sleep 30 # wait 30 seconds
```
The script above can then be submitted with `qsub SCRIPT`.
## Getting around the 24h walltime limit
The maximum walltime for all jobs is 24 hours.
To get around this, a job must save its intermediate results to the hard disk such that a follow up job can continue.
There are two techniques for submitting multiple jobs that run one after the other: job arrays and job chaining.
#### Job arrays
```bash
qsub -t V-W[%S]
```
V: start number, W: end number, S: number of simultaneous jobs
Arrays are used to start W-V+1 jobs that use the same script. Torque will set the environment variable `PBS_ARRAYID` to provide the array index to the instance of the script. This index can then be translated to parameter sets, input datasets, etc.
Example (HelloWorldArray.sh):
```bash
#!/bin/bash
#PBS -N HelloWorldArray
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1,mem=10mb,nice=10,walltime=00:01:00
#PBS -j oe
#PBS -q student
echo "Hello World! from array index $PBS_ARRAYID"
exit 0
```
When submitting the script with
```bash
qsub -t 1-10%3 HelloWorldArray.sh
```
ten instances jobid[1]@lmbtorque - jobid[10]@lmbtorque will be enqueued, of which at most three run simultaneously at any time. Ten output files HelloWorldArray.ojobid-1 - HelloWorldArray.ojobid-10 will be generated, containing
```bash
Hello World! from array index 1
.
.
.
Hello World! from array index 10
```
To run only one job at a time, use S=1.
#### Job chaining
The job identifier returned by `qsub` can be used to create a chain of jobs, where each job depends on its predecessor.
```bash
job1=`qsub chain_ok.sh`
job2=`qsub -W depend=afterok:$job1 chain_ok.sh`
job3=`qsub -W depend=afterok:$job2 chain_not_ok.sh`
job4=`qsub -W depend=afterok:$job3 chain_ok.sh`
```
`chain_ok.sh` returns 0 as exit code.
`chain_not_ok.sh` returns 1 as exit code.
job4 will be deleted by the scheduler because job3 ends with a non-zero exit code.
```bash
#PBS -N chain_ok
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1,mem=10mb,walltime=1:00:00
#PBS -q student
echo `hostname`
sleep 60
echo success
exit 0
```
```bash
#PBS -N chain_not_ok
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1,mem=10mb,walltime=1:00:00
#PBS -q student
echo `hostname`
sleep 60
echo failure
exit 1
```
## How can I measure the memory peak of my program?
You can measure the required memory with `/usr/bin/time -v` (not to be confused with bash's built-in `time` command).
The `Maximum resident set size (kbytes)` field gives the amount of memory required to run your program.
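A sketch (`train.py` is a placeholder; `/usr/bin/time -v` writes its report to stderr):
```bash
# run the program under /usr/bin/time and extract the peak-memory line
/usr/bin/time -v python3 train.py 2>&1 | grep "Maximum resident set size"
```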
## How can I measure the GPU resources of my program?
Add the following two lines to your script to print the resource usage measured on the GPU.
```bash
echo "pid, gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]"
nvidia-smi --query-accounted-apps="pid,gpu_util,mem_util,max_memory_usage,time" --format=csv | tail -n1
```
In case the output is empty, check whether accounting mode is enabled on the GPU used:
```bash
nvidia-smi -q | grep Accounting
```
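If it reports `Disabled`, an administrator has to enable it; `-am` is nvidia-smi's accounting-mode switch and requires root:
```bash
sudo nvidia-smi -am 1
```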
## Connecting to servers that are running in a job
### Jupyter notebook
Since you cannot build an SSH tunnel to the compute nodes, access the notebook like this instead:
```bash
jupyter notebook --generate-config
# config file is in .jupyter/jupyter_notebook_config.py
# change this config line (around line 351):
c.NotebookApp.ip = '*'
# start interactive job and run the notebook
jupyter notebook
# now you should be able to access it via browser as servername:port
```
### Other server tools
```bash
# start interactive job, assuming you get server chip
# ping it from e.g. lmbtorque to get the local ip
ping chip
# start your server and tell it to host on the local ip instead of localhost
start-my-server --host IP --port PORT
# now you should be able to access it via browser as IP:PORT
```
## Connecting to jupyter running on a development server
Let's say you are running jupyter lab on dacky on port 8888, and you want to open it in your browser at home.
Set up your `.ssh/config` file on your home computer like this (assuming you have already set up ssh keys):
```
Host lmblogin
HostName lmblogin.informatik.uni-freiburg.de
Port 2122
User USERNAME
Host dacky
Port 2122
User USERNAME
ProxyCommand ssh lmblogin -W %h:%p
```
Now you should be able to directly connect to dacky using `ssh dacky` from home.
Next, open the tunnel and leave it open:
```
ssh -v -N -L 8888:localhost:8888 dacky
```
Now you should be able to access the jupyter lab running on dacky, on your own pc at home.
## See also
You can find additional info about our cluster and internal servers in the [LMB Wiki](https://lmb.informatik.uni-freiburg.de/interna/lmbwiki/).