The Australian
National University
Mathematical Sciences Institute (MSI)
Advanced Computation and Modelling Program

Bogong Cluster

The cluster consists of four nodes connected with PathScale InfiniPath HTX.
Each node has two AMD Opteron 275 (Dual-core 2.2 GHz Processors) and 4 GB of main memory.
The system achieves 60 Gigaflops with the Linpack benchmark, which is about 85% of peak performance.
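
As a rough check of the 85% figure, the sketch below computes the theoretical peak from the hardware listed above. It assumes that each Opteron core completes 2 double-precision floating-point operations per clock cycle, a figure that is not stated on this page.

```shell
# Hardware figures from the description above.
nodes=4
sockets_per_node=2
cores_per_socket=2
ghz=2.2
flops_per_cycle=2      # assumption: 2 double-precision flops per cycle per core
linpack_gflops=60

# Peak = nodes * sockets * cores * clock (GHz) * flops per cycle.
peak=$(awk -v n="$nodes" -v s="$sockets_per_node" -v c="$cores_per_socket" \
           -v g="$ghz" -v f="$flops_per_cycle" \
           'BEGIN { printf "%.1f", n * s * c * g * f }')

# Efficiency = measured Linpack result as a percentage of peak.
efficiency=$(awk -v m="$linpack_gflops" -v p="$peak" \
                 'BEGIN { printf "%.0f", 100 * m / p }')

echo "peak: $peak Gigaflops, Linpack efficiency: $efficiency%"
# prints: peak: 70.4 Gigaflops, Linpack efficiency: 85%
```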

The Alexander Technologies Administrator/User guide provides a useful overview of the Bogong cluster.

Compiler

gcc 4.0.* is installed on bogong. The -march=opteron option turns on any Opteron-specific features the compiler supports.
We tried the PathScale compiler for a short while but did not get much out of it. It should not be a problem to get another trial license if anybody is interested.

blas / lapack

We have two versions of blas/lapack installed. One is in

/opt/cluster/acml/gnu64/lib/

The other is

/opt/lib64/libgoto_opteron64p-r1.00.so

MPI with ssh

In order for mpirun to communicate with the various hosts without prompting you for a password for each connection, you _must_ configure your personal ssh2 settings to have a NULL passphrase for public key authentication.

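ssh-keygen -t dsa    # press Enter at both passphrase prompts to leave it empty
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys    # append, so existing keys are kept
chmod 600 ~/.ssh/authorized_keys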

will do that.

Using TORQUE 

TORQUE (Terascale Open-Source Resource and Queue manager) is a cluster resource manager for providing control over batch jobs, and is derived from OpenPBS.
Its features include:
  • Run serial and parallel batch jobs remotely (create, route, execute, modify, delete) 
  • Define resource policies to determine resources for jobs 
  • Manage node availability 

How TORQUE works

There are three essential components to TORQUE:
  • pbs_server 
  • pbs_mom 
  • pbs_sched 
The SERVER daemon runs on the head node (Bogong server) and handles all TORQUE commands such as qsub, qstat etc.
The MOM daemons (known as the MOMs) run on each compute node and on the head node. They monitor the nodes' health, restrict resources on nodes for job execution, and handle the jobs for the server.
The SCHEDULER runs on the head node for now and handles the order of job execution for jobs submitted to all the PBS queues.

Checking Node Health 

To check the health and status of nodes, use the pbsnodes command for node query and control. Common uses of the pbsnodes command are as follows.

To diagnose one node and report its health information, use:

pbsnodes -d nodeXX  (currently not supported on the bogong cluster)

To query all nodes and their attributes, use:

pbsnodes -ap

where the flag "p" forces a ping of all nodes to update the pbsnodes record, followed by:

pbsnodes -a

which then lists all the nodes and their attributes.
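
The full listing can be condensed to one line per node. The sketch below is an illustration only: node_states is a hypothetical helper, not a TORQUE command, and it assumes the usual pbsnodes -a layout of an unindented node name followed by indented "attribute = value" lines.

```shell
# Read `pbsnodes -a` output on stdin and print "<node> <state>" per node.
node_states() {
    awk '
        /^[^ \t]/     { node = $1 }        # unindented line: a node name
        $1 == "state" { print node, $3 }   # indented line: state = <value>
    '
}
```

On the head node it would be used as: pbsnodes -a | node_states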

Bogong Queues

To submit a job you need to be logged into the cluster server (head node bogong) and use the TORQUE qsub command. There is a default queue called batch which is used if no other queue is specified. Currently only these queues are available:

batch     (default queue - walltime = 04:00:00)
lowpri    (low priority queue - walltime = 01:00:00)

Submitting Jobs to a Queue

As noted above, jobs are submitted from the cluster server (head node bogong) with the TORQUE qsub command. This command takes a number of command line arguments and integrates them into the specified PBS command file. The PBS command file is specified as a filename on the qsub command line.
  • The PBS command file does not need to be executable.
  • The PBS command file may be piped into qsub (i.e., 'cat pbs.cmd | qsub')
  • In the case of parallel jobs, the PBS command file is staged to, and executed on, the first allocated compute node only (use pbsdsh to run actions on multiple nodes).
  • The command script is executed from the user's home directory in all cases (the script may determine the submission directory by using the $PBS_O_WORKDIR environment variable).
  • The command script will be executed using the default set of user environment variables unless the -V or -v flags are specified to include aspects of the job submission environment.
For example, to submit a simple serial job:

> qsub -l nodes=1 job_to_run.sh

To use a different queue (to come), use the -q flag:

> qsub -l nodes=1 -q lowpri job_to_run.sh

In the above example the job script only contains the commands to run the job. You cannot use a binary file as the job script.
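
Because the script starts in the home directory, a serial job script usually begins by changing to the submission directory. The following is a minimal sketch of what a file such as job_to_run.sh might contain. The walltime value and the echo line are placeholders, and the fallback to the current directory is an assumption that only matters when the script is run by hand outside TORQUE.

```shell
#!/bin/bash
#PBS -l nodes=1
#PBS -l walltime=00:10:00

# TORQUE starts the script in the home directory, so move to the directory
# the job was submitted from (falls back to "." when PBS_O_WORKDIR is unset).
cd "${PBS_O_WORKDIR:-.}" || exit 1

workdir=$(pwd)
echo "job ${PBS_JOBNAME:-interactive} running in $workdir on $(uname -n)"

# The commands that do the actual work would follow here.
```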

You will generally pass options to TORQUE from within a job script, as in this example:


#!/bin/bash
#PBS -l nodes=4
#PBS -l walltime=4:00:00
mpirun -machinefile $PBS_NODEFILE -np 16 -ppn 2 ~/jobs/mpi_job
exit


Any line that begins with #PBS passes options to TORQUE. Lines that begin with #PBS -l specify resource requests.

NOTE: Only use the bash shell in your scripts, as other shells may not work properly.

TORQUE writes output to the directory that was current when you submitted your job. It writes the stderr of the job to the file:

myscript.sh.eJOBID

and writes the stdout of the job to the file:

myscript.sh.oJOBID

We use epilogue scripts with TORQUE, which append details such as the following to myscript.sh.oJOBID:

Job ID: 604.cluster

Job Name: myscript.sh
Resource List: neednodes=2,nodes=2,walltime=01:00:00
Resources Used: cput=00:00:00,mem=596kb,vmem=4992kb,walltime=00:00:10

You can find out which nodes you have been assigned by TORQUE using the command:

qstat -n

on the head node, which outputs the list of your host nodes.

For best performance of the cluster, it is important that walltimes are set as accurately as possible, to ensure jobs are scheduled in the right queues.
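
One way to judge a walltime request is to compare it with the walltime reported in the epilogue of an earlier run. The sketch below converts a time written as [[HH:]MM:]SS into seconds so that the two values can be compared. The function name hms_to_seconds is hypothetical and not part of TORQUE.

```shell
# Convert [[HH:]MM:]SS to seconds, e.g. 01:00:00 -> 3600.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{
        t = 0
        for (i = 1; i <= NF; i++)   # each further field is a factor of 60 smaller
            t = t * 60 + $i
        print t
    }'
}
```

For the epilogue sample above, hms_to_seconds 01:00:00 gives 3600 and hms_to_seconds 00:00:10 gives 10, so that job used only a small fraction of the walltime it requested.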

Requesting Resources 

Various resources can be requested at the time of job submission. A job can request a particular node, a particular node attribute, or even a number of nodes with particular attributes. The native TORQUE resources are listed below, each with its format and a description:

arch
    Format: string
    Specifies the administrator defined system architecture required. This defaults to whatever the PBS_MACH string is set to in “local.mk”.

cput
    Format: seconds, or [[HH:]MM:]SS
    Maximum amount of CPU time used by all processes in the job.

file
    Format: size*
    The amount of total disk requested for the job.

host
    Format: string
    Name of the host on which the job should be run. This resource is provided for use by the site’s scheduling policy. The allowable values and effect on job placement are site dependent.

mem
    Format: size*
    Maximum amount of physical memory used by the job.

nice
    Format: integer between -20 (highest priority) and 19 (lowest priority)
    Adjust the process’ execution priority.

nodes
    Format: {<node_count> | <hostname>}[:ppn=<ppn>][:<property>[:<property>]...][+ ...]
    Number and/or type of nodes to be reserved for exclusive use by the job. The value is one or more node_specs joined with the ‘+’ character, “node_spec[+node_spec...]”. Each node_spec is a number of nodes required of the type declared in the node_spec and a name or one or more properties desired for the nodes. The number, the name, and each property in the node_spec are separated by a colon ‘:’. If no number is specified, one (1) is assumed.

    The name of a node is its hostname. The properties of nodes are:
    * ppn=# - specifying the number of processors per node requested. Defaults to 1.
    * property - a string assigned by the system administrator specifying a node’s features. Check with your administrator as to the node names and properties available to you.

    See Example 1 (-l nodes) for examples.

other
    Format: string
    Allows a user to specify site specific information. This resource is provided for use by the site’s scheduling policy. The allowable values and effect on job placement are site dependent.

pcput
    Format: seconds, or [[HH:]MM:]SS
    Maximum amount of CPU time used by any single process in the job.

pmem
    Format: size*
    Maximum amount of physical memory used by any single process of the job.

pvmem
    Format: size*
    Maximum amount of virtual memory used by any single process in the job.

software
    Format: string
    Allows a user to specify software required by the job. This is useful if certain software packages are only available on certain systems in the site. This resource is provided for use by the site’s scheduling policy. The allowable values and effect on job placement are site dependent. (see Scheduler License Management)

vmem
    Format: size*
    Maximum amount of virtual memory used by all concurrent processes in the job.

walltime
    Format: seconds, or [[HH:]MM:]SS
    Maximum amount of real time during which the job can be in the running state.

*size format: The size format specifies the maximum amount in terms of bytes or words. It is expressed in the form integer[suffix]. The suffix is a multiplier defined in the following table (‘b’ means bytes (the default) and ‘w’ means words). The size of a word is calculated on the execution server as its word size.

Suffix      Multiplier
b,  w       1
kb, kw      1,024
mb, mw      1,048,576
gb, gw      1,073,741,824
tb, tw      1,099,511,627,776
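
The size format can be turned into a byte count with a few lines of shell. The following is a minimal sketch and not part of TORQUE: the function name size_to_bytes is hypothetical, and because the word size depends on the execution server, the 8-byte default used for the ‘w’ suffixes is an assumption.

```shell
# Convert a size string of the form integer[suffix] into bytes.
# Usage: size_to_bytes SIZE [WORD_SIZE_IN_BYTES]
size_to_bytes() (
    spec=$1
    word_size=${2:-8}            # assumption: 8-byte words
    num=${spec%%[!0-9]*}         # leading digits
    suffix=${spec#"$num"}        # whatever follows the digits
    if [ -z "$num" ]; then
        echo "size_to_bytes: no integer in '$spec'" >&2
        return 1
    fi
    # Drop leading zeros so the shell does not read the number as octal.
    while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
        num=${num#0}
    done
    case $suffix in
        ''|b|w) mult=1 ;;
        kb|kw)  mult=1024 ;;
        mb|mw)  mult=$((1024 * 1024)) ;;
        gb|gw)  mult=$((1024 * 1024 * 1024)) ;;
        tb|tw)  mult=$((1024 * 1024 * 1024 * 1024)) ;;
        *)      echo "size_to_bytes: unknown suffix '$suffix'" >&2; return 1 ;;
    esac
    case $suffix in
        *w) unit=$word_size ;;   # word suffixes scale by the word size
        *)  unit=1 ;;
    esac
    echo $((num * mult * unit))
)
```

For example, size_to_bytes 4gb prints 4294967296, which corresponds to the 4 GB of main memory in each bogong node.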

Example 1 (-l nodes)

  • To ask for 12 nodes of any type: -l nodes=12
  • To ask for 2 “server” nodes and 14 other nodes (a total of 16): -l nodes=2:server+14
  • The above consists of two node_specs, “2:server” and “14”.
  • To ask for (a) 1 node that is a “server” and has a “hippi” interface, (b) 10 nodes that are not servers, and (c) 3 nodes that have a large amount of memory and have hippi: -l nodes=server:hippi+10:noserver+3:bigmem:hippi
  • To ask for three nodes by name: -l nodes=b2005+b1803+b1813
  • To ask for 2 processors on each of four nodes: -l nodes=4:ppn=2
  • To ask for 4 processors on one node: -l nodes=1:ppn=4
  • To ask for 2 processors on each of two blue nodes and three processors on one red node: -l nodes=2:blue:ppn=2+red:ppn=3
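
To see how a node specification adds up, the sketch below counts the processors a -l nodes value asks for, following the rules given earlier: node_specs are joined with ‘+’, fields are separated by ‘:’, a missing node count means 1, and ppn defaults to 1. The function name count_procs is hypothetical, and this is an illustration of the syntax rather than the parser TORQUE itself uses.

```shell
# Print the total number of processors requested by a "-l nodes=" value.
count_procs() (
    set -f                      # no filename expansion while splitting
    total=0
    IFS=+                       # node_specs are joined with '+'
    for spec in $1; do
        IFS=:                   # fields inside a node_spec are joined with ':'
        set -- $spec
        count=1
        ppn=1
        # A leading all-digit field is the node count; anything else is a
        # hostname or property, which stands for a single node.
        case $1 in
            ''|*[!0-9]*) ;;
            *) count=$1 ;;
        esac
        for field in "$@"; do
            case $field in
                ppn=*) ppn=${field#ppn=} ;;
            esac
        done
        total=$((total + count * ppn))
    done
    echo "$total"
)
```

For example, count_procs 4:ppn=2 prints 8, and count_procs 2:blue:ppn=2+red:ppn=3 prints 7.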

Other Resources

When a batch job is started, a number of variables are introduced into the job’s environment which can be used by the batch script in making decisions, creating output files, etc. These variables are listed in the table below:

Variable          Description
PBS_JOBNAME       user specified jobname
PBS_O_WORKDIR     job’s submission directory
PBS_ENVIRONMENT   N/A
PBS_TASKNUM       number of tasks requested
PBS_O_HOME        home directory of submitting user
PBS_MOMPORT       active port for mom daemon
PBS_O_LOGNAME     name of submitting user
PBS_O_LANG        language variable for job
PBS_JOBCOOKIE     job cookie
PBS_NODENUM       node offset number
PBS_O_SHELL       script shell
PBS_O_JOBID       unique pbs job id
PBS_O_HOST        host on which job script is currently running
PBS_QUEUE         job queue
PBS_NODEFILE      file containing line delimited list of nodes allocated to the job
PBS_O_PATH        path variable used to locate executables within job script
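
As an illustration of using these variables, the sketch below summarises the file named by PBS_NODEFILE as one line per host with the number of times it is listed. It relies on the assumption that TORQUE lists a host once for each processor allocated on it. The function name procs_per_node is hypothetical.

```shell
# Print "<host> <count>" for each distinct host in a node file.
# With no argument, the file named by $PBS_NODEFILE is used.
procs_per_node() {
    sort "${1:-$PBS_NODEFILE}" | uniq -c | awk '{ print $2, $1 }'
}
```

Inside a job script, calling procs_per_node with no argument reports the hosts assigned to that job.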

Deleting Jobs with qdel

Simply use qdel with the job number as argument:

qdel 112

Deleting Jobs when qdel does not work

If you need to delete stale jobs from the queue in TORQUE and the qdel command doesn't work, ask the administrator for help.

More Information

The webpage for Torque can be found here.
There is also a wiki where part of this information is taken from.