Summary: comprehensive information for PSK cluster users; topics include hardware, usage rules, programming tools, and the queueing system.

 

PSK cluster user's guide

Overview

 

Hardware configuration

The PSK cluster is built of:

  • frontend psk: 24-core Opteron CPUs, 32 GB RAM
  • new nodes psk-{1-10}: 16-core Xeon CPUs, 64 GB RAM; nodes 1-4 are equipped with Nvidia Titan GPUs
  • old nodes psk-{20-26}: various Opteron CPUs
  • infiniband network: all components are connected with a fast, low-latency Infiniband network >20 Gbps (new nodes >40 Gbps)
  • lustre filesystem: a fast cluster filesystem built of 3 OSS servers and one MDS server (see the storage section)

The current hardware configuration can be displayed with the commands 'pbsnodes' or 'mdiag -n'.

 

Access rules

To use the cluster one has to ssh to its master node (front-end): psk. It is accessible to all users of the CAMK network, but local disk space is assigned upon request (see the next chapter). The front-end should be used to develop codes and to prepare and submit jobs to the queueing system. Production runs should always go through the queueing system. Interactive use of the front-end is subject to certain limits which are announced in the message displayed after login (e.g. number of shell sessions, maximum cpu time of a single process...). Short runs (aimed at code debugging) are acceptable.

 

Local disk space

/work/psk is a high-performance lustre filesystem associated directly with the cluster. It is accessible from all linux workstations, but only the cluster (frontend and nodes) has a dedicated connection to it for fast access. Its most important parameters are:

  • 140 TB size
  • around 2.5 GB/s aggregated throughput
  • a backup of the previous day's state exists
  • space is allocated upon request (psk@camk.edu.pl) with a default quota of 500 GB (can be increased)
  • acl enabled
  • performance is high for reads/writes of larger chunks of data (say >1 MB), but can be very poor for small writes; please do not flush when writing text files (allow buffering)





Programming environment

Users of the PSK cluster should remember that it has a 64-bit architecture (while some of the workstations in CAMK are 32-bit). For programmers this means that the default length of data types may differ from what they are used to (e.g. default integers may be 64 bits long instead of 32). In general, data type length depends not only on the processor architecture but also on the compiler - please refer to the compiler documentation.
Despite the 64-bit architecture, the PSK processors are able to run 32-bit code as well. In principle many 32-bit codes compiled on other CAMK computers should run on PSK, but you are likely to encounter library incompatibilities. It is also possible to compile code for the 64-bit architecture on 32-bit computers - some compilers (like GNU or PGI) offer an option to choose the target architecture. In most cases there is no difference in performance between the 32-bit and 64-bit versions of the code.
If you are not ready to solve problems with incompatible libraries, it is strongly advised that you recompile your codes on the frontend for the 64-bit architecture (this is the default for all compilers).
It is also advised to link codes statically (this is especially important if you are using your own, special libraries which are not installed on the computing nodes). While the libraries from psk should be present on the computing nodes, it is still safer to have them linked statically (for example, during system upgrades library versions may be temporarily inconsistent).
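For illustration, a 64-bit, statically linked build could look like this (mycode.F90 is a hypothetical file name; -static and -Bstatic request static linking with the GNU and PGI compilers respectively - check your compiler manual for the exact flags):

>gfortran -m64 -O3 -static -o mycode mycode.F90    #GNU Fortran: 64-bit target, statically linked
>pgf90 -tp k8-64 -fast -Bstatic -o mycode mycode.F90    #PGI Fortran: 64-bit Opteron target, statically linked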

 

Compilers, optimization, debugging

GNU compilers

The popular GNU compilers are available: call gcc, g++ or gfortran. The debugger is called gdb. Read the manuals in the system man pages. Additionally there is another Fortran 95 implementation, g95, installed in the /opt/g95{*} directories.

 

PGI Compilers and Tools Suite

Links: Local documentation, Vendor pages
C/C++: for C call pgcc, for C++ call pgCC. Read manual here.
Fortran: depending on the standard call pgf77, pgf90 or pgf95; in fact pgf95 implements parts of the Fortran 2003 standard. Read manual here.
Debugger & profiler: nice graphical tools for C/C++ and Fortran; call pgdbg or pgprof respectively. Both tools support OpenMP and MPI programs. Read manual here.
Libraries: PGI includes the AMD Core Math Library (ACML), a set of numerical routines tuned specifically for AMD64 platform processors. The routines, available via both FORTRAN 77 and C interfaces, include: BLAS, LAPACK, FFT and random number generators. Read manual here.
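As a minimal linking sketch (mycode.f90 is a hypothetical file name), ACML can typically be pulled in with -lacml; please check the ACML manual for the exact link line, in particular when calling the routines from C:

>pgf90 -fast -o mycode mycode.f90 -lacml    #link Fortran code against ACML (BLAS, LAPACK, FFT, RNG)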


Intel Compilers and Tools

Links: Local documentation, Vendor pages
C/C++: call icc. Read manual here.
Fortran: call ifort. Implements Fortran 95 and parts of Fortran 2003. Read manual here.
Debugger: call idb. Read manual here.
Libraries: the Intel Math Kernel Library is installed. It contains optimized routines from BLAS, LAPACK, ScaLAPACK, SparseSolver, the Vector Math Library, conventional and cluster DFTs, and Partial Differential Equations support. Read manual here.
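A minimal linking sketch (mycode.f90 is a hypothetical file name); recent Intel compilers accept the convenience flag -mkl, otherwise an explicit library list is needed (the second line shows the usual sequential variant) - consult the MKL manual for your installed version:

>ifort -O3 -o mycode mycode.f90 -mkl    #convenience flag, if supported by the installed compiler version
>ifort -O3 -o mycode mycode.f90 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread    #explicit sequential link line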



The table below summarizes some useful options to the compilers installed on the cluster (please read the specific compiler manual before use):

Compiler            Standards                                          Optimization options     Math precision options              >2GB code options
gcc/g++             ANSI C, C99, C++98                                 -O3 -march=k8 -m3dnow    -mieee-fp                           -mcmodel=medium
gfortran            Fortran 77, Fortran 95, parts of Fortran 2003      -O3 -march=k8 -m3dnow    -mieee-fp                           -mcmodel=medium
g95                 Fortran 95, parts of Fortran 2003                  -                        -                                   -
pgcc/pgCC           ANSI C, C99, C++98                                 -fast -Mipa=fast         -Kieee                              -mcmodel=medium -Mlarge_arrays
pgf77/pgf90/pgf95   Fortran 77, Fortran 90/95, parts of Fortran 2003   -fast -Mipa=fast         -Kieee                              -mcmodel=medium -Mlarge_arrays
icc                 ANSI C, C99, C++98                                 -fast                    -fp-model strict, -fltconsistency   -mcmodel=medium
ifort               Fortran 77, Fortran 95, parts of Fortran 2003      -fast                    -fp-model strict, -fltconsistency   -mcmodel=medium
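
As a usage sketch of the options above (mycode.f90 and mycode.c are hypothetical file names):

>gfortran -O3 -march=k8 -mieee-fp -mcmodel=medium -o mycode mycode.f90
>pgf90 -fast -Mipa=fast -Kieee -mcmodel=medium -Mlarge_arrays -o mycode mycode.f90
>icc -fast -fp-model strict -mcmodel=medium -o mycode mycode.c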

 

Remarks on programming and compilers

Avoid using non-standard extensions to languages. The code should compile with different compilers. It is really worth trying different compiler options - sometimes the speedup may be surprising. On the other hand, high optimization may cause errors - this is why you should try different compilers and options before a production run. And for this you should avoid using non-standard extensions :)
In general, the Intel and PGI compilers offer better code optimization than GNU, while the GNU compilers are thought to conform better to the standards.
If you have to read/write a lot of data from the disk, please do it intelligently, since I/O will be the major bottleneck in the code and no compiler optimization can deal with it - it is usually an algorithmic problem. Read data in large, sequential chunks; avoid re-reading and random reads; for large files use binary format instead of ASCII.

 

Parallelization

There are two main models of code parallelization: shared memory and distributed memory. In the shared memory model, code threads have access to shared memory which is used for communication. Such codes are limited to a single multi-processor machine. The most popular standard for this model is OpenMP. In the distributed memory model, code processes communicate via messages sent over the network. Such codes can run on many nodes of the cluster but need a fast network. The most popular standard for distributed memory is MPI.
The PSK cluster is well suited for both forms of parallel codes. For OpenMP codes there are multi-core nodes (from 2 to 32 cores). For MPI codes there is a fast, low-latency Infiniband network.

 

Using OpenMP

GNU compilers: use the option -fopenmp.
PGI compilers: use the option -mp; the User's Guide contains a nice introduction to programming in OpenMP.
Intel compilers: use the option -openmp.

To run an OpenMP code in the queueing system, set the environment variable OMP_NUM_THREADS and run it as a normal code (check the example scripts).
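
A minimal sketch of an OpenMP job script (the queue, limits and the omp_code binary are placeholders - adjust them to your needs and see the example scripts page for the locally recommended form):

#PBS -q ibpara
#PBS -l nodes=1:ppn=8          #one node, 8 cores for the threads
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR              #the script starts in the home directory
export OMP_NUM_THREADS=8       #must match the number of requested cores
./omp_code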

 

Using MPI

MPI is in principle a set of libraries which can be used with different compilers. On PSK there are two MPI-2 implementations installed which use the Infiniband network interface: MVAPICH2 and OpenMPI. Both are compiled with all available compilers: GNU, PGI and Intel. The user has to select the preferred combination of MPI version and compiler using the modules facility. The most important commands are:

  • module avail - lists available modules
  • module list - lists loaded modules
  • module load <module_name> - loads the given module

There is no default setting, so you may want to add the 'module load <module_name>' command to your shell's init scripts. After loading a module the usual wrapper commands become available: mpif77, mpif90, mpicc and mpicxx (these are 'wrapper' scripts which set paths, call the chosen compiler and link the MPI libraries); there is no need to set any additional environment variables.

Please note that the old mpi-selector-menu is not working any more.

Please check the example scripts page for instructions about running MPI codes.
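
A minimal sketch of an MPI job script; the module name, queue, limits and the mpi_code binary are placeholders (list the real module names with 'module avail'), and the exact launch command (mpirun/mpiexec and its options) depends on the chosen implementation:

#PBS -q ibpara
#PBS -l nodes=2:ppn=8                #2 nodes, 8 processes per node
#PBS -l walltime=24:00:00
module load mvapich2-pgi             #placeholder name - use one listed by 'module avail'
cd $PBS_O_WORKDIR
NP=$(wc -l < $PBS_NODEFILE)          #number of processor slots allocated to the job
mpirun -np $NP ./mpi_code            #the launch command may differ between MVAPICH2 and OpenMPI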

 

Libraries

Popular scientific libraries are installed on the cluster, including: fftw3, LAPACK, HDF4, HDF5.
If you need other libraries please ask psk@camk.edu.pl. In general, standard libraries available as distribution packages can be installed without problems.
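
For example, linking against these libraries usually amounts to the standard -l flags (assuming default system paths; file names are hypothetical):

>gfortran -O3 -o mycode mycode.f90 -lfftw3 -llapack -lblas    #FFTW3 together with LAPACK/BLAS
>h5fc -O3 -o mycode mycode_hdf5.f90    #HDF5 installations usually ship compiler wrappers such as h5fc/h5cc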




Queueing system

Introduction

The cluster consists of a number of computational nodes with associated resources such as the number of processors, memory size and processor clock speed. Running jobs on nodes selected arbitrarily by the user would lead to overall imbalance and inefficient usage of resources. The batch system takes on the duty of finding free resources and manages them in order to:

  • optimize the overall system efficiency by weighting user requirements against system load,
  • make sure that all users get equal access to resources by enforcing scheduling policy.

In practice, the user submits a job to the proper queue and the batch system runs it when sufficient resources are available, taking into account the priorities of other waiting jobs. Of course, the user should specify the requested resources - the more precise the request, the more efficient the job scheduling. Coarse-grained specification is implemented in the batch queues, which predefine different maximum resource limits. These limits can be further refined (lowered) by the user.

The scheduling policy configured on PSK has two main components:

  • fair-share. The priority of a job is assigned on the basis of historical records of the time allocated to the user (the last 7 days are taken into account, with weights decreasing for older records). This means that the system assigns lower priority to jobs of users who requested more computational time recently. Thus, it is in the users' interest to tighten the time limits as much as possible.
  • service time. The priority of a job increases with the time spent in the idle state.


The priority calculated from these components affects only the execution order of idle jobs; once a job is running it gets exclusive access to its processors. Fair-share records can be displayed with the command 'mdiag -f'.

People who contributed to the cluster, and users indicated by them, can optionally use the Quality of Service feature to increase their jobs' priority. QoS can be granted only up to the limits of the contributed resources (for example: 2 processors and 4 GB of memory in total over all jobs of the donor's group). QoS effectively moves the job to the first place on the wait list of the given queue. If there are more QoS jobs on the wait list, the general scheduling policy rules apply. Check the Job submission section for instructions on how to submit a QoS job.

 

Queues configuration

The table below lists the configured queues. The most up-to-date configuration can be displayed with the command 'qstat -q'.

 

Queue    Wall time (max)   Memory (max/def)      NCPUS (max/def)   NNODES (max/def)   NRUN   Notes
short    3h                3GB/480MB             1/1               1/1                -      Execution queue for serial jobs
medium   2d                3GB/1GB               1/1               1/1                -      Execution queue for serial jobs
long     7d                3GB/1GB               1/1               1/1                64     Execution queue for serial jobs
bigmem   14d               31GB/3GB (min. 3GB)   1/1               1/1                5      Execution queue for serial jobs with high memory requirements
ibpara   7d                -/-                   96/-              6/1 (128 cpus)     -      Execution queue for parallel jobs
admin    -                 -                     -                 -                  -      Special execution queue for admins only

Legend:

  • max/def - maximum and default value
  • Wall time - limit on the job runtime as measured by the clock on the wall (as opposed to cpu time); after this period the job is killed (more precisely, SIGTERM is sent and after 30 sec SIGKILL follows)
  • Memory - how much memory can be allocated by a single job
  • NCPUS - how many processors can be used by a single job
  • NNODES - how many nodes can be used by a single job
  • NRUN - maximum number of jobs running in the queue

 

The detailed list of queue parameters and resources can be displayed by the command 'qstat -Qf'.

In case none of the available queues meets your needs, please contact the admins at psk@camk.edu.pl.
If your code runs longer than one week you should seriously consider implementing a restart feature (save all needed data to a file which can be used later to restart the code from the saved state). In special cases admins may create dedicated queues with custom limits for one-time use.

 

Job submission

The command to submit a job is 'qsub job_script', where the job_script file contains a sequence of commands to execute. Requested resources can be specified as qsub options; however, it is advised to put them in the script file. Each line in the job_script file which begins with the string #PBS is processed as a command line option to qsub. Please also remember that the job script starts in your home directory, so you will probably need to cd to the proper location. For details, check man qsub.
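
A minimal sketch of a serial job script (queue, limits and paths are placeholders; see also the example scripts linked below):

#PBS -q medium                     #target queue
#PBS -l walltime=24:00:00          #tighten the limits as much as possible
#PBS -l mem=1gb
#PBS -N myjob                      #job name
cd /work/psk/my_dir/run1           #the script starts in $HOME, so cd to the run directory
./mycode > output.log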

 

The QoS account of cluster donors and privileged users can be selected with the option '-W x=QOS:qname', where qname is the account name as listed by the 'mdiag -q' command.


There are examples of job scripts for most popular activities.

Interactive jobs

It is possible to run interactive jobs, which simply means that you get a shell session on a computational node (it is like ssh, but the queueing system decides where). This feature is useful for debugging or for running interactive programs like Mathematica. Please do not overuse it - an interactive session usually does not use much cpu, while the cpu still has to be reserved for the job and is not available to other users. In particular, do not leave interactive sessions running when you are away or overnight. To run an interactive job type 'qsub -I'. It is also possible to combine command line options and a script file: 'qsub -I -q ibpara -l nodes=2:ppn=4 job_script'. If you need a graphical display add the option -X to qsub (you also have to be logged in to psk with ssh -X). For more examples please refer to the Guide on using Mathematica at CAMK.

 

Job management

The current status of submitted jobs can be checked with the qstat command. By default, all running and queued jobs are displayed. The output format is mostly self-explanatory - only column "S" needs explanation: it shows the job status, and the most common values are "R" (running) and "Q" (queued).
Details of the job's resources can be displayed with the command 'qstat -f job_id', where job_id is the id printed by qstat.

For more options check 'man qstat'.

Scheduling information about the job can be displayed with command 'checkjob job_id'. Among other things it prints job priority and a message from scheduler (for example why the job is not running).

A job can be removed from the queue with 'qdel job_id' command. It will be removed regardless of its status.

 

Typical session

This is an example of a typical session on the cluster:

 

>ssh psk #log in
>cd /work/psk/my_code_src #go to the local filesystem
>pgf90 -o mycode mycode.F90 #compile the code
>cd /work/psk/my_dir/run1 #change to the run directory
>cp /work/psk/my_code_src/mycode . #copy the code to the run directory
>vi job_script #create the job script (e.g. use one of the examples, decide which queue, time limits...)
>qsub job_script #submit the job
>qstat #check the job status



Summary of useful commands

Check man pages for details.

  • qalter - alter a batch job
  • qdel - delete a batch job
  • qhold - place a hold on a batch job
  • qmove - move a job to a different queue
  • qorder - exchange the FIFO ordering of two jobs in a queue
  • qrerun - terminate a job and return it to the queue
  • qrls - release a hold on a job
  • qselect - list all jobs meeting certain criteria
  • qsig - send a signal to a job
  • qstat - list all batch jobs in the system
  • qsub - submit a new batch job

 

Advanced topics

  • Details of the batch system configuration

    The batch system on PSK consists of the TORQUE resource manager and the MOAB scheduler. The complete configuration of the scheduler can be displayed by the command 'showconfig'.

  • Priority determination

    PRIORITY = FAIRSHARE.USER + SERVICE.QUEUETIME + CREDENTIAL.QOS

  • Scheduling algorithm

    MOAB deals with almost everything it has to schedule around (jobs, system downtime, etc.) in terms of reservations. A reservation has three components: a start time, a duration and the associated resources. There are two types of reservations: normal (single-time events like jobs) and standing (continuous or periodic events, like a reservation of specific resources for a given user every day at certain hours).

    During every scheduling iteration MOAB does the following:
    • query the resource manager (TORQUE) for the current state of all jobs
    • create a reservation for every running job in the Active Jobs list, starting at its start time and lasting until the end of its wallclock limit; delete reservations for jobs which have terminated since the last scheduling iteration
    • sort all non-running jobs into the Idle Jobs list, in order of descending priority
    • move to the Blocked Jobs list any jobs which violate a configured scheduling policy (e.g. max. CPU count per user) or are held
    • if there are sufficient resources available to run the highest priority job on the Idle Jobs list, given the current set of jobs and reservations, then run it; continue to do this until the next highest priority idle job cannot be run
    • create a reservation at the earliest time in the future when there are sufficient resources to run the highest priority idle job
    • examine the rest of the idle jobs to see if any of them can be run immediately without causing the highest priority waiting job to be postponed; if so, run them (this is known as backfill)
  • Useful scheduler commands
    • showq - show queued jobs
    • showstart - show estimates of when job can/will start
    • mdiag - provide diagnostic report for various aspects of resources, workload, and scheduling
    • checkjob - provide detailed status report for specified job
    • showbf - show backfill window - show resources available for immediate use
    • showres - show existing reservations
    • showstate - show current state of resources
    • showstats - show usage statistics
  • Environment variables set by TORQUE

    The variables below are always set by TORQUE. To import other variables from your environment use the -v option of qsub (or -V to import all of them). A short usage sketch follows the list below.

    • PBS_JOBNAME - the value of the -N argument to qsub
    • PBS_O_HOME - the HOME environment variable of the submitter
    • PBS_O_SHELL - the SHELL environment variable of the submitter
    • PBS_O_QUEUE - the name of the queue to which you submitted your request
    • PBS_O_HOST - the name of the host from which you submitted your request
    • PBS_ENVIRONMENT - equal to PBS_INTERACTIVE or PBS_BATCH
    • PBS_O_LOGNAME - the LOGNAME environment variable of the submitter
    • PBS_O_PATH - the PATH environment variable of the submitter
    • PBS_O_WORKDIR - the path name of the directory from which you submitted your request
    • PBS_O_MAIL - the MAIL environment variable of the submitter
    • PBS_JOBID - the PBS job ID assigned to the job
    • PBS_TASKNUM - number of tasks requested
    • PBS_NODENUM - node offset number
    • PBS_NODEFILE - file containing a line-delimited list of nodes allocated to the job
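
    A sketch of how these variables are typically used inside a job script (the binary name is hypothetical):

      cd $PBS_O_WORKDIR                      #return to the directory the job was submitted from
      echo "job $PBS_JOBID started in queue $PBS_O_QUEUE on $(hostname)"
      cat $PBS_NODEFILE                      #nodes allocated to this job
      ./mycode > output.$PBS_JOBID           #tag the output file with the job id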


