First steps
Requirements
Get an account at the computing center (Rechenzentrum). If you are a member (e.g. a student or a staff member) of the University of Freiburg, you should already have one, which you can configure on the myAccount page. Otherwise please contact the user support.
Activate your account on the ATLAS-BFG by filling in the Application form. Please note that your working group has to be a member of the ATLAS-BFG community to work on the ATLAS-BFG; otherwise, please contact us.
You also need some basic knowledge of Linux systems to work on the cluster.
The User Interfaces (UI)
The User Interfaces are the computers that give access to the cluster. Almost every interaction with the cluster takes place on a UI. To log onto a UI, just type:
ssh -l <UID> ui.bfg.uni-freiburg.de
You will be automatically directed to the UI with the lowest load. Don't forget to replace <UID> with your user ID.
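For example, with a (hypothetical) user ID like jdoe01, the login command would be:
ssh -l jdoe01 ui.bfg.uni-freiburg.de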
The first job
You have to provide the system with some information to execute a program on the cluster. Most important are the code to execute and the parameters for the batch system. If you need a program that is not available on the platform, you have to transfer it to the cluster yourself. Furthermore, there are many settings that can be specified: the number of worker nodes, the amount of memory, the estimated running time, etc.
SLURM
The code, the parameters and the settings altogether constitute the so-called job. Usually a job is put together in a file. The following shows a simple example of a job description saved in the file hello-cluster.slm:
#!/bin/sh
#
#SBATCH -p express

/bin/echo Hello Cluster!
- The first line is a shebang, used to specify the interpreter (in this case the Bourne shell).
- The third line tells the cluster to put the job in the express queue, where the computation usually starts shortly after submission; however, the running time may not exceed 20 minutes. An overview of the available queues can be found here.
- The fifth line executes the program echo and prints the message Hello Cluster! to standard output. The program echo is a standard program available on every Linux system.
Now we should send the job description to the cluster. For this we call the program sbatch, which is installed on each UI:
sbatch hello-cluster.slm
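If the submission succeeds, sbatch reports the assigned job ID; the output looks similar to:
Submitted batch job 12345678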
This 8-digit ID identifies the job. The job is put in the express queue and waits to be executed. Use the program squeue to display the status of the job:
squeue -u <UID>
where <UID> stands for your user account. The output should look like one of the following:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345678 express hello-cl <UID> PD 0:00 1 (Resources)
or
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345678 express hello-cl <UID> R 0:10 1 n123456
The first column shows the job-ID. The job can be deleted with the following command:
scancel 12345678
The column ST shows the status of the job. Most of the time, one of the following two states is shown:
Status | Description
---|---
PD | The job is still in the queue and waiting to be executed.
R | The job is running.
The column TIME shows the elapsed running time (in hours, minutes, and seconds) if the job has started.
The column NODELIST(REASON) shows on which node the job is running or the reason if the job is still pending. Typical reasons are Resources (not enough resources available) or Priority (there are other jobs with higher priority).
The job is done when it no longer shows up in the list. You will find a file of the form slurm-<job-id>.out in the so-called working directory of the job, usually the directory from which you submitted the job.
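For the example job above, the log file should simply contain the message Hello Cluster! printed by echo. It can be inspected with, for instance:
cat slurm-12345678.out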
An alternative path for the logfile can be specified with the following directive:
#SBATCH -o ./logs/my_log_file.log
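For illustration only, several such directives can be combined in one job description. The following sketch requests the express queue, the custom log path, and some of the resource settings mentioned earlier (memory, running time); the option names are standard sbatch options, but the values are placeholders and must respect the limits of the chosen queue:
#!/bin/sh
#
# queue and log path (the directory ./logs has to exist before submission)
#SBATCH -p express
#SBATCH -o ./logs/my_log_file.log
# hypothetical resource requests: memory and estimated running time (hh:mm:ss)
#SBATCH --mem=2G
#SBATCH --time=00:10:00

/bin/echo Hello Cluster!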
Job description
In a shell script, the pound sign (#) at the beginning of a line usually marks a comment. However, the string #SBATCH is interpreted as a directive for the batch system (see the third line of the job hello-cluster.slm above).
Use the man page to have a look at the directives you may send to the batch system:
man sbatch
More information can be found elsewhere in this documentation or in the official Slurm documentation.
Every sbatch option can also be specified in its own #SBATCH line in the job description. Conversely, we can drop the line #SBATCH -p express from the job description and instead call sbatch with the option -p on the command line:
sbatch -p express hello-cluster.slm
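Several options can be combined on the command line in the same way, for example the queue and the log path from above:
sbatch -p express -o ./logs/my_log_file.log hello-cluster.slm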
Temporary data
A job can produce a lot of data that is no longer useful after its completion. This data can be stored in /tmp/slurm_<job-id>. This path is also available in the environment variable $TMPDIR. Unlike the home directory of a user, the tmp directory is a local directory (located on the worker node where the job is executed), thus ensuring much better I/O performance.
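A minimal sketch of how $TMPDIR might be used in a job script, assuming that any output worth keeping is copied back to the (shared) home directory before the job ends (the file names are placeholders):
#!/bin/sh
#
#SBATCH -p express

# write temporary data to the fast, node-local directory
cd $TMPDIR
/bin/echo Hello Cluster! > intermediate.txt

# copy results worth keeping back to the home directory
cp intermediate.txt $HOME/results.txt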
Other documentation resources
- Overview of the ATLAS-BFG hardware
- Overview of the ATLAS-BFG software
- Storage with the Lustre file system
- Overview of the ATLAS-BFG queues
- Slurm overview
Help!
If you need help, please contact the BFG team of the computing center.