First steps
Requirements
Get an account at the computing center (Rechenzentrum). If you are a member (e.g. a student or a staff member) of the University of Freiburg, you should already have one, which you can configure on the myAccount page. Otherwise please contact the user support.
Activate your account on the ATLAS-BFG by filling in the Application form. Please note that your working group has to be a member of the ATLAS-BFG community to work on the ATLAS-BFG; otherwise, please contact us.
You also need some basic knowledge of Linux systems to work on the cluster.
The User Interfaces (UI)
The User Interfaces are the computers that give access to the cluster. Almost every interaction with the cluster takes place on a UI. To log onto a UI, just type:
ssh -l <UID> ui.bfg.uni-freiburg.de
You will be automatically directed to the UI with the lowest load. Don't forget to replace <UID> with your user ID.
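For example, with a (hypothetical) user ID like jdoe01, the login command would be:
ssh -l jdoe01 ui.bfg.uni-freiburg.de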
The first job
You have to provide the system with some information to execute a program on the cluster. Most important are the code to execute and the parameters for the batch system. If you need a program that is not available on the platform, you have to transfer it to the cluster yourself. Furthermore, there are many settings that can be specified: the number of worker nodes, the amount of memory, the estimated running time, etc.
SLURM
The code, the parameters and the settings altogether constitute the so-called job. Usually a job is put together in a file. The following shows a simple example of a job description saved in the file hello-cluster.slm:
#!/bin/sh
#
#SBATCH -p express

/bin/echo Hello Cluster!
- The first line is a shebang, used to specify the interpreter (in this case the Bourne shell).
- The third line tells the cluster to put the job in the express queue, where the computation usually starts shortly after submission; however, the running time may not exceed 20 minutes. An overview of the available queues can be found here.
- The fifth line executes the program echo and prints the message Hello Cluster! to standard output. The program echo is a standard program available on every Linux system.
Now we should send the job description to the cluster. For this we call the program sbatch, which is installed on each UI:
sbatch hello-cluster.slm
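If the submission succeeds, sbatch reports the assigned job ID; the output looks similar to:
Submitted batch job 12345678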
This 8-digit ID identifies the job. The job is put in the express queue and waits to be executed. Use the program squeue to display the status of the job:
squeue -u <UID>
where <UID> stands for your user account. The output should look like one of the following:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345678 express hello-cl <UID> PD 0:00 1 (Resources)
or
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345678 express hello-cl <UID> R 0:10 1 n123456
The first column shows the job-ID. The job can be deleted with the following command:
scancel 12345678
The column ST shows the status of the job. Most of the time, one of the following two states is shown:
Status | Description
---|---
PD | The job is still in the queue and waiting to be executed.
R | The job is running.
The column TIME shows the elapsed running time (in hours, minutes, and seconds) if the job has started.
The column NODELIST(REASON) shows on which node the job is running or the reason if the job is still pending. Typical reasons are Resources (not enough resources available) or Priority (there are other jobs with higher priority).
The job is done when it no longer shows up in the list. You will find a file of the form slurm-<job-id>.out in the so-called working directory of the job, usually the directory from which you submitted the job.
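For the example job above, the log file should simply contain the message Hello Cluster! printed by echo. It can be inspected with, for instance:
cat slurm-12345678.out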
An alternative path for the logfile can be specified with the following directive:
#SBATCH -o ./logs/my_log_file.log
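For illustration only, several such directives can be combined in one job description. The following sketch requests the express queue, the custom log path, and some of the resource settings mentioned earlier (memory, running time); the option names are standard sbatch options, but the values are placeholders and must respect the limits of the chosen queue:
#!/bin/sh
#
# queue and log path (the directory ./logs has to exist before submission)
#SBATCH -p express
#SBATCH -o ./logs/my_log_file.log
# hypothetical resource requests: memory and estimated running time (hh:mm:ss)
#SBATCH --mem=2G
#SBATCH --time=00:10:00

/bin/echo Hello Cluster!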
Job description
In a shell script, the pound sign (#) at the beginning of a line usually marks a comment. However, the string #SBATCH is interpreted as a directive for the batch system (see the third line of the job hello-cluster.slm above).
Use the man page to have a look at the directives you may send to the batch system:
man sbatch
More information can be found elsewhere in this documentation or in the official Slurm documentation.
Every sbatch option can also be specified in its own #SBATCH line in the job description. Conversely, we can drop the line #SBATCH -p express from the job description and instead call sbatch with the option -p on the command line:
sbatch -p express hello-cluster.slm
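Several options can be combined on the command line in the same way, for example the queue and the log path from above:
sbatch -p express -o ./logs/my_log_file.log hello-cluster.slm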
Temporary data
A job can produce a lot of data that is no longer useful after its completion. This data can be stored in /tmp/slurm_<job-id>. This path is also available in the environment variable $TMPDIR. Unlike the home directory of a user, the tmp directory is a local directory (located on the worker node where the job is executed), thus ensuring much better I/O performance.
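A minimal sketch of how $TMPDIR might be used in a job script, assuming that any output worth keeping is copied back to the (shared) home directory before the job ends (the file names are placeholders):
#!/bin/sh
#
#SBATCH -p express

# write temporary data to the fast, node-local directory
cd $TMPDIR
/bin/echo Hello Cluster! > intermediate.txt

# copy results worth keeping back to the home directory
cp intermediate.txt $HOME/results.txt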
Other documentation resources
- Overview of the ATLAS-BFG hardware
- Overview of the ATLAS-BFG software
- Storage with the Lustre file system
- Overview of the ATLAS-BFG queues
- Slurm overview
Help!
If you need help, please contact the BFG team of the computing center.