# KISLURM KI-Cluster quickstart
## Logging in to the cluster
The cluster is not directly accessible from outside the university network. To access it from a private machine, first log in to either the "lmblogin" server (see SSH access above) or the TF login server `login.informatik.uni-freiburg.de`, and then log in to the cluster from there.
- login nodes:
  - kislogin1.rz.ki.privat
  - kislogin2.rz.ki.privat
  - kislogin3.rz.ki.privat
- account:
  - use your TF account (**NOT** the LMB account)
  - if your TF account works (see TF account access above) but you still don't have access, create a ticket here: https://osticket.informatik.uni-freiburg.de/open.php
## Get cluster and job info
- get info about nodes and partitions
```bash
sfree
```
- get info about partitions you are allowed to access
```bash
sinfo
```
- get info about **all** jobs (queued or running)
```bash
squeue
```
- get info about jobs of specific **user**
```bash
squeue -u username
```
- get info about **your** jobs
```bash
squeue --me
```
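The default `squeue` output can be customized with the `-o`/`--format` option. For example, a compact view of your own jobs using standard format codes (`%i` job ID, `%j` job name, `%T` state, `%M` elapsed time, `%R` nodelist/reason):

```shell
# compact custom view of your jobs: id, name, state, elapsed time, node/reason
squeue --me -o "%.10i %.20j %.8T %.10M %R"
```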
## Create and use storage (workspaces)
- Create a workspace
```bash
ws_allocate workspacename 100
```
  - allocates a workspace named `workspacename`
  - requests a duration of 100 days (the maximum on dlclarge)
  - the duration can be extended (up to 5 times)
- Info about your workspaces
```bash
ws_list
```
- Configure an e-mail reminder before the workspace expires
  - notify 14 days before expiration
  - notify email@cs.uni-freiburg.de
- example: while creating the workspace
```bash
ws_allocate -r 14 -m email@cs.uni-freiburg.de workspacename 100
```
- example: add a reminder to an existing workspace (use the extend option `-x` with duration 0)
```bash
ws_allocate -x -r 14 -m email@cs.uni-freiburg.de workspacename 0
```
- Extend a workspace (resets the remaining duration to 100 days)
```bash
ws_allocate -x workspacename 100
```
- Use the extend option (`-x`) with a duration of 0 to update an existing workspace without resetting its duration **TODO test this**
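When a workspace is no longer needed, the workspace tool suite that provides `ws_allocate` and `ws_list` typically also includes a release command. A sketch, assuming the standard workspace tools are installed on this cluster:

```shell
# release a workspace you no longer need
# (the data is typically removed after a site-configured grace period)
ws_release workspacename
```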
### How to access your workspaces
- mounted on
```bash
/work/dlclarge1/username-workspacename/
```
- or
```bash
/work/dlclarge2/username-workspacename/
```
### Example: Copy data from LMB storage
```bash
rsync -avz -e "ssh -p 2122" dienertj@lmblogin.informatik.uni-freiburg.de:/misc/lmbraid21/dienertj/data/cfg-gan-cifar10 /work/dlclarge1/dienertj-cfggan/data/
```
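Before copying large datasets, it can help to preview what `rsync` would transfer. Adding the `-n`/`--dry-run` flag lists the files without actually copying them:

```shell
# preview the transfer first: -n performs a dry run, no files are copied
rsync -avzn -e "ssh -p 2122" dienertj@lmblogin.informatik.uni-freiburg.de:/misc/lmbraid21/dienertj/data/cfg-gan-cifar10 /work/dlclarge1/dienertj-cfggan/data/
```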
### Access your TF/LMB home
- your LMB home directory will be mounted at /ihome/username
- **you can cd there even if /ihome/ appears empty!**
```bash
cd /ihome/username
```
- **warning**: the connection to ihome is slow! Do not train directly on data stored on ihome.
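If your data currently lives on ihome, copy it to a fast workspace once and train from there. A minimal sketch (all paths are illustrative):

```shell
# one-time copy from the slow ihome mount to a fast workspace
rsync -av /ihome/username/data/mydataset /work/dlclarge1/username-workspacename/data/
```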
## Running jobs
- for every job you should
  - specify the partition (called a queue in Torque)
  - specify a time limit
  - specify the memory requirement
  - specify the CPU core requirements
### Job submission examples
**Note**: Run `sinfo` to find out which partitions you are allowed to access. If you are not permitted to use the partition in the commands below, they will fail with
`srun: error: Unable to allocate resources: User's group not permitted to use this partition`
- interactive session with one GPU on a node from partition lmb_gpu-rtx2080
```bash
srun -p lmb_gpu-rtx2080 --pty bash
```
- interactive session with 2 GPUs, 10 GB memory, a 23:59:59 time limit, and your home as the working directory
```bash
srun -p lmb_gpu-rtx2080 --gpus=2 --mem=10G --time=23:59:59 --chdir=/home/dienertj --pty bash
```
- submit a job script, requesting 8 GPUs and allowing 8 MPI tasks
```bash
sbatch -p lmb_gpu-rtx2080 --nodes=1 --gpus=8 --mem=40G --time=23:59:59 --ntasks-per-node=8 nanoscopic-dynamics-diffusion/bash_scripts/myocardials-slurm-continue.sh
```
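Instead of passing all options on the command line, the same flags can be written into the job script itself as `#SBATCH` directives, so the job can be submitted with a plain `sbatch jobscript.sh`. A minimal sketch (partition, resource values, and script contents are illustrative):

```shell
#!/bin/bash
#SBATCH -p lmb_gpu-rtx2080      # partition
#SBATCH --gpus=1                # number of GPUs
#SBATCH --mem=10G               # memory limit
#SBATCH --time=23:59:59         # time limit
#SBATCH -o %x.%j.out            # stdout file: jobname.jobid.out

echo "Running on $(hostname)"
nvidia-smi
```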
- **TODO** array-jobs
### interact with running jobs
- with SLURM you can execute commands inside running jobs
- example: we want to check the GPU utilization of a job we submitted
- use `squeue` to get the job ID
```bash
> squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3999690 lmb_gpu-r myocardi dienertj R 2:59:32 1 dagobert
```
- execute nvidia-smi within the job
```bash
srun --jobid=3999690 nvidia-smi
```
- we can even get an interactive bash session:
```bash
srun --jobid=3999690 --pty bash
```
### Connect to a running job
Useful e.g. for connecting VSCode, running a debugger, building an SSH tunnel to connect jupyter notebooks.
- Start an interactive job
- Then log in to the assigned node via ssh
For example:
```bash
# start a job:
srun -p lmb_gpu-1080ti --nodes=1 --gpus=1 --mem=10G --time=23:59:59 --chdir=/home/myusername --ntasks-per-node=1 --pty bash
# then (for example from your local lmb workstation):
ssh -p 22 USERNAME@mario
# (assuming that mario is the node where your job was started)
```
So, after starting a job, students can connect their VSCode via SSH to the node and edit and debug code there. Port forwarding is also possible, e.g. to start tensorboard/jupyter/etc. on the node and access it from the local workstation.
The advantage of this setup: no GPU conflicts, as the GPU is assigned to a specific user via the job.
The disadvantages:
- slightly more effort than just connecting via ssh
- only one user per GPU, even if that user spends 90% of the time editing code and not using the GPU
- the development session ends when the walltime expires (e.g. after 24 hours)
## See also
You can find additional info about the KISLURM cluster in our [ticket system](https://osticket.informatik.uni-freiburg.de/open.php) under knowledgebase.
# LMB cluster quickstart
*Removed link to [queue status page](https://lmb.informatik.uni-freiburg.de/interna/web-qstat/) as it requires the "lmb" password*
If you are an MSc student, you should work only on the KISLURM cluster unless you have a good reason not to. If you do work on the LMB cluster, do not use the 3090 machines.
## Usage guidelines
- Do not waste resources.
Measure or estimate the resources your program needs in advance on a local machine or on a development server (**dacky**).
- Do not use more than two GPUs at the same time! Use the **student** queue to automatically enforce this.
- See [serverload page](https://lmb.informatik.uni-freiburg.de/interna/serverload/)
## Example job submission
We will submit the following example job script `myjob.sh`:
```bash
#!/bin/bash
echo `hostname`
sleep 30 # wait 30 seconds
```
1. Log in to ```lmbtorque``` with ```ssh```
2. Use the ```qsub``` command to submit the job script
```bash
qsub -l nodes=1:ppn=1,mem=100mb -q student myjob.sh
```
The command above reserves 1 CPU core and 100 MB of memory.
On success, a job identifier is written to stdout.
3. Check the state of your job with the ```qstat``` command.
```qstat``` returns information about the state of all submitted jobs.
The second-to-last column shows the current job state, which is one of:
- **C**: job is completed after having run.
- **H**: job is held.
- **Q**: job is queued, eligible to run or routed.
- **R**: job is running.
4. Check the program output after the job is completed.
Torque will create two files named ```myjob.sh.oXXXXXX``` and ```myjob.sh.eXXXXXX``` containing the stdout and stderr output of your program.
For more information see manuals and help texts of ```qsub``` and ```qstat``` and the manual of torque http://docs.adaptivecomputing.com/8-1-0/basic/help.htm
## Example job script
All parameters to ```qsub``` can also be set within the script file, as shown below:
```bash
#PBS -N Jobname
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1,mem=1gb,walltime=24:00:00
#PBS -m ae
#PBS -j oe
#PBS -q student
echo `hostname`
sleep 30 # wait 30 seconds
```
The script above can then be submitted with `qsub SCRIPT`.
## Getting around the 24h walltime limit
The maximum walltime for all jobs is 24 hours.
To get around this, a job must save its intermediate results to the hard disk such that a follow up job can continue.
There are two techniques for submitting multiple jobs that run one after the other: job arrays and job chaining.
#### Job arrays
```bash
qsub -t V-W[%S]
```
V: Startnumber, W: End number, S: Number of simultaneous jobs
Arrays are used to start W-V+1 jobs that use the same script. Torque will set the environment variable `PBS_ARRAYID` to provide the array index to the instance of the script. This index can then be translated to parameter sets, input datasets, etc.
Example (HelloWorldArray.sh):
```bash
#!/bin/bash
#PBS -N HelloWorldArray
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1,mem=10mb,nice=10,walltime=00:01:00
#PBS -j oe
#PBS -q student
echo "Hello World! from array index $PBS_ARRAYID"
exit 0
```
When submitting the script with
```bash
qsub -t 1-10%3 HelloWorldArray.sh
```
ten instances (jobid[1]@lmbtorque through jobid[10]@lmbtorque) will be enqueued, of which at most three run simultaneously at any time. Ten output files (HelloWorldArray.ojobid-1 through HelloWorldArray.ojobid-10) will be generated, containing
```bash
Hello World! from array index 1
.
.
.
Hello World! from array index 10
```
To run only one job at a time, use S=1.
#### Job chaining
The job identifier returned by `qsub` can be used to create a chain of jobs, where each job depends on its predecessor.
```bash
job1=`qsub chain_ok.sh`
job2=`qsub -W depend=afterok:$job1 chain_ok.sh`
job3=`qsub -W depend=afterok:$job2 chain_not_ok.sh`
job4=`qsub -W depend=afterok:$job3 chain_ok.sh`
```
`chain_ok.sh` returns 0 as exit code.
`chain_not_ok.sh` returns 1 as exit code.
job4 will be deleted by the scheduler because job3 ends with a non-zero exit code.
```bash
#!/bin/bash
#PBS -N chain_ok
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1,mem=10mb,walltime=1:00:00
#PBS -q student
echo `hostname`
sleep 60
echo success
exit 0
```
```bash
#!/bin/bash
#PBS -N chain_not_ok
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1,mem=10mb,walltime=1:00:00
#PBS -q student
echo `hostname`
sleep 60
echo failure
exit 1
```
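Torque also supports other dependency types besides `afterok`; for example, `afterany` starts the follow-up job regardless of the predecessor's exit code, which is useful for cleanup jobs. A sketch (the script names are illustrative):

```shell
job1=`qsub train.sh`
# cleanup.sh runs whether or not train.sh succeeded
qsub -W depend=afterany:$job1 cleanup.sh
```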
## How can I measure the memory peak of my program?
You can measure the required memory with `/usr/bin/time -v` (not to be confused with bash's built-in `time` command).
The "Maximum resident set size (kbytes)" line gives the peak amount of memory your program used.
## How can I measure the gpu resources of my program?
Add the following two lines to your script to print the resource usage measured on the gpu.
```bash
echo "pid, gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]"
nvidia-smi --query-accounted-apps="pid,gpu_util,mem_util,max_memory_usage,time" --format=csv | tail -n1
```
In case the output is empty, check if accounting mode is enabled on the used gpu.
```bash
nvidia-smi -q | grep Accounting
```
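If accounting mode is disabled, it can be enabled by an administrator (this requires root privileges, so ask the cluster admins rather than running it yourself):

```shell
# enable per-process accounting on the GPUs (admin only)
sudo nvidia-smi --accounting-mode=1
```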
## Connecting to servers that are running in a job
### Jupyter notebook
Since you cannot build an SSH tunnel, access it like this instead:
```bash
jupyter notebook --generate-config
# the config file is in ~/.jupyter/jupyter_notebook_config.py
# edit it and set the following option (around line 351):
#   c.NotebookApp.ip = '*'
# then start an interactive job and run the notebook
jupyter notebook
# now you should be able to access it via browser as servername:port
```
### Other server tools
```bash
# start interactive job, assuming you get server chip
# ping it from e.g. lmbtorque to get the local ip
ping chip
# start your server and tell it to host on the local ip instead of localhost
start-my-server --host IP --port PORT
# now you should be able to access it via browser as IP:PORT
```
## See also
You can find additional info about our cluster and internal servers in the [LMB Wiki](https://lmb.informatik.uni-freiburg.de/interna/lmbwiki/)