Mathematical Sciences Institute (MSI)
Advanced Computation and Modelling Program
Bogong Cluster

The cluster consists of four nodes connected with PathScale InfiniPath HTX. Each node has two AMD Opteron 275 processors (dual-core, 2.2 GHz) and 4 GB of main memory. The system achieves 60 Gigaflops on the Linpack benchmark, which is about 85% of peak performance.

The Alexander Technologies Administrator/User guide provides a useful overview of the Bogong cluster.

Compiler

We have gcc 4.0.* installed on bogong. The -march=opteron option turns on any Opteron-specific features the compiler supports.

We tried the PathScale compiler for a short while but did not get much out of it. It should not be a problem to get another trial license if somebody is interested.

blas / lapack

We have two versions of blas/lapack installed. One is in
/opt/cluster/acml/gnu64/lib/
The other is
/opt/lib64/libgoto_opteron64p-r1.00.so

MPI with ssh

For mpirun to communicate with the various hosts without prompting you for a password on each connection, you must configure your personal ssh2 settings to use public key authentication with an empty (NULL) passphrase.
    ssh-keygen -t dsa

Using TORQUE

TORQUE (Terascale Open-Source Resource and Queue manager) is a cluster resource manager that provides control over batch jobs. It is derived from OpenPBS. Its features include:

How TORQUE works

There are three essential components to TORQUE: the server, the MOM daemons, and the scheduler.
The MOM daemons (known as the MOMs) run on each compute node and on the head node. They monitor the nodes' health, restrict resources on nodes for job execution, and handle the jobs for the server. The SCHEDULER currently runs on the head node and decides the order of execution for jobs submitted to all the PBS queues.

Checking Node Health

To check the health and status of nodes, use the pbsnodes command, which provides node query and control. Common uses of pbsnodes are as follows.

To diagnose one node and report its health information, use:

    pbsnodes -d nodeXX

(currently not supported on the bogong cluster)

To query all nodes and their attributes, use:

    pbsnodes -ap

where the flag "p" forces a ping of all nodes to update the pbsnodes record, followed by:

    pbsnodes -a

which then lists all the nodes and their attributes.

Bogong Queues

There is a default queue called batch, which is used if no other queue is specified. Currently only these queues are available:

    batch (default queue - walltime = 04:00:00)
    lowpri (low priority queue - walltime = 01:00:00)

Submitting Jobs to a Queue

To submit a job you need to be logged into the cluster server (head node bogong) and use the TORQUE qsub command. This command takes a number of command line arguments and combines them with the options in the specified PBS command file. The PBS command file is given as a filename on the qsub command line.

    > qsub -l nodes=1 job_to_run.sh

To use a different queue, such as lowpri, use the -q flag:

    > qsub -l nodes=1 -q lowpri job_to_run.sh

In the above examples the job script contains only the commands to run the job. You cannot use a binary file as the job script. You will generally pass options to TORQUE from within the job script, as in this example:

    #!/bin/bash
    #PBS -l nodes=4
    #PBS -l walltime=4:00:00
    mpirun -machinefile $PBS_NODEFILE -np 16 -ppn 2 ~/jobs/mpi_job
    exit

Any line that begins with #PBS passes options to TORQUE; here the #PBS -l lines set the resources requested.

NOTE: Only use the bash shell in your scripts, as other shells may not work properly.

TORQUE writes its output to the directory that was current when you submitted the job. The job's stderr goes to the file

    myscript.sh.eJOBID

and its stdout to the file

    myscript.sh.oJOBID

We use epilogue scripts with TORQUE, which append details like the following to myscript.sh.oJOBID:

    Job ID: 604.cluster
    Job Name: myscript.sh
    Resource List: neednodes=2,nodes=2,walltime=01:00:00
    Resources Used: cput=00:00:00,mem=596kb,vmem=4992kb,walltime=00:00:10

You can find out which nodes TORQUE has assigned to you by running

    qstat -n

on the head node, which lists the host nodes of your jobs.

For best performance of the cluster, it is important to set walltimes as accurately as possible, so that jobs are scheduled in the right queues.

Requesting Resources

Various resources can be requested at the time of job submission. A job can request a particular node, a particular node attribute, or even a number of nodes with particular attributes. The native TORQUE resources are listed in the table below:

*size format: The size format specifies a maximum amount in bytes or words. It is expressed in the form integer[suffix]. The suffix is a multiplier defined in the following table ('b' means bytes (the default) and 'w' means words). The size of a word is the word size on the execution server.
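
As an illustration of the integer[suffix] form, the sketch below converts a size string to bytes. It assumes the usual TORQUE multipliers (kb = 1024 bytes, mb = 1024 kb, and so on, with kw, mw, ... the same multiples of a word) and a word size of 8 bytes, which is only a guess for a 64-bit Opteron node. The function name size_to_bytes and the WORD_BYTES variable are hypothetical helpers, not part of TORQUE.

```shell
#!/bin/bash
# Convert a size string of the form integer[suffix] to bytes.
# Assumptions: binary multipliers, and a word size of 8 bytes unless
# WORD_BYTES is set (the real word size depends on the execution server).
size_to_bytes() {
    local spec digits suffix mult
    local word_bytes="${WORD_BYTES:-8}"
    spec=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    digits="${spec%%[!0-9]*}"        # leading run of digits
    suffix="${spec#"$digits"}"       # whatever follows the digits
    if [ -z "$digits" ]; then
        echo "invalid size: $1" >&2
        return 1
    fi
    # Drop leading zeros so the shell does not read the number as octal.
    digits=$(printf '%s' "$digits" | sed 's/^0*\([0-9]\)/\1/')
    case "$suffix" in
        ''|b|w) mult=1 ;;
        kb|kw)  mult=1024 ;;
        mb|mw)  mult=$((1024 * 1024)) ;;
        gb|gw)  mult=$((1024 * 1024 * 1024)) ;;
        tb|tw)  mult=$((1024 * 1024 * 1024 * 1024)) ;;
        *)      echo "invalid size: $1" >&2; return 1 ;;
    esac
    # Sizes given in words are scaled by the word size.
    case "$suffix" in
        *w) mult=$((mult * word_bytes)) ;;
    esac
    echo $((digits * mult))
}
```

For example, `size_to_bytes 4gb` prints 4294967296, and `size_to_bytes 2kw` prints 16384 under the 8-byte word assumption.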
Example 1 (-l nodes)
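
A minimal sketch of building a -l nodes request follows. The nodes=N:ppn=M form is standard TORQUE syntax, but whether ppn is configured on bogong is an assumption. The limits of 4 nodes and 4 cores per node are taken from the hardware description of the cluster, and nodes_request is a hypothetical helper.

```shell
#!/bin/bash
# Build a TORQUE node request string such as "nodes=2:ppn=4".
# Assumed limits: 4 nodes, each with two dual-core CPUs (4 cores).
nodes_request() {
    local nodes="$1" ppn="${2:-1}"
    local max_nodes=4 max_ppn=4
    # Accept only positive integers without leading zeros.
    case "$nodes" in
        ''|*[!0-9]*|0*) echo "bad node count: $nodes" >&2; return 1 ;;
    esac
    case "$ppn" in
        ''|*[!0-9]*|0*) echo "bad ppn: $ppn" >&2; return 1 ;;
    esac
    if [ "$nodes" -gt "$max_nodes" ] || [ "$ppn" -gt "$max_ppn" ]; then
        echo "request exceeds ${max_nodes} nodes x ${max_ppn} cores" >&2
        return 1
    fi
    echo "nodes=${nodes}:ppn=${ppn}"
}
```

A submission could then look like `qsub -l "$(nodes_request 2 4)" job_to_run.sh`, which passes nodes=2:ppn=4 to qsub.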

Other Resources

When a batch job is started, a number of variables are introduced into the job's environment which can be used by the batch script in making decisions, creating output files, etc. These variables are listed in the table below:
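
As a sketch of how a job script might use such variables: PBS_JOBID, PBS_NODEFILE and PBS_O_WORKDIR are standard TORQUE variables, while the function name describe_allocation and the output file name in the usage note are hypothetical. The function assumes the node file holds one hostname per allocated processor slot.

```shell
#!/bin/bash
# Summarise the allocation described by a TORQUE node file.
# Reads the file given as $1, or $PBS_NODEFILE if no argument is passed.
describe_allocation() {
    local nodefile="${1:-${PBS_NODEFILE:-}}"
    local slots hosts
    if [ ! -r "$nodefile" ]; then
        echo "cannot read node file: '$nodefile'" >&2
        return 1
    fi
    # Count non-empty lines (slots) and distinct hostnames (hosts).
    slots=$(awk 'NF { n++ } END { print n + 0 }' "$nodefile")
    hosts=$(sort -u "$nodefile" | awk 'NF { n++ } END { print n + 0 }')
    echo "job=${PBS_JOBID:-unknown} hosts=${hosts} slots=${slots}"
}
```

Inside a job script this could be used as `cd "$PBS_O_WORKDIR"` followed by `describe_allocation > "alloc.${PBS_JOBID}.txt"`, so that each job records its allocation in the directory it was submitted from.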

Deleting Jobs with qdel

Simply use qdel with the job number as argument:

    qdel 112

Deleting Jobs when qdel does not work

If you need to delete stale jobs from the queue in TORQUE and the qdel command doesn't work, ask the administrator for help.

More Information

The webpage for Torque can be found here. There is also a wiki, from which part of this information is taken.

Page last updated: 24 November, 2005
Please direct all enquiries to: MSI webmaster
Page authorised by: Dean, MSI
The Australian National University - CRICOS Provider Number 00120C