Storage - Lustre filesystem

The main filesystem for cluster users is mounted under /work/psk. It is a highly scalable cluster filesystem called Lustre. To use Lustre efficiently, users should understand its architecture. The filesystem is served by a dedicated network of nodes (not accessible to users). Data are striped across Object Storage Targets (OSTs), which in practice are disk arrays attached to a set of Object Storage Servers (OSSes). There can be many OSTs, and the total filesystem bandwidth is roughly equal to the sum of their bandwidths (hence the high scalability). OSTs store the actual data, while all the metadata, such as filenames and the locations of particular data stripes, are kept on the Metadata Server (MDS). All these components are coordinated by a special subsystem using dedicated protocols over an Infiniband network. Such a filesystem can be mounted directly on multiple clients (in our case the frontend and the computing nodes). This architecture allows high-performance storage to scale up to petabytes with bandwidths of terabytes per second.
At CAMK we have a very basic Lustre configuration: one MDS and six OSTs spread over three OSSes. The total size of the filesystem is 140 TB.
Important facts/features of Lustre at CAMK:

  • Lustre is mounted directly (over Infiniband) under /work/psk across the whole cluster; other computers access /work/psk over NFS, which is much slower
  • throughput of a single thread to/from Lustre depends on the node and is in the range of 500-700 MB/s
  • aggregated throughput from all nodes reaches 2500 MB/s
  • by default, data are striped in 1 MB chunks across all OSTs
  • Lustre is designed to store large binary files; performance can be much lower for small text files (see notes below)


Lustre-specific commands:

The lfs utility provides some useful commands for users. To list them with their descriptions, just type 'lfs' and then 'help'. Particularly useful commands:

  • getstripe/setstripe - control how files are striped across OSTs; striping can be set per file or per directory
  • df - reports filesystem disk space usage of the MDS and all OSTs (similar to the standard Linux df)
  • quota - reports user and group quotas on the Lustre filesystem (e.g. 'quota -u pci /lustre')
  • find - a replacement for the standard Linux find (which is very slow on a Lustre filesystem)
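For example, striping is typically tuned per directory so that new files inherit the layout. The sketch below uses an illustrative path and illustrative stripe parameters, and only calls lfs when it is actually present (i.e. on a Lustre client):

```shell
# Illustrative only: the directory name and the stripe parameters
# are examples, not a recommendation for any particular workload.
if command -v lfs >/dev/null 2>&1; then
    mkdir -p /work/psk/$USER/bigdata
    # new files in this directory will be striped across 4 OSTs in 4 MB chunks
    lfs setstripe -c 4 -S 4M /work/psk/$USER/bigdata
    # inspect the resulting layout
    lfs getstripe /work/psk/$USER/bigdata
else
    echo "lfs not found - run this on a Lustre client"
fi
```

A stripe count of -1 stripes across all available OSTs, which is the usual choice for very large files read or written by many processes at once.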


Using ACLs:

The /work/psk filesystem has ACLs (Access Control Lists) enabled. In short, ACLs allow fine-grained file permissions. This is the standard POSIX implementation and works as on any other filesystem. The relevant commands are getfacl and setfacl (please read their man pages).

Usage examples:

  • setfacl -m user:pci:rwx test - grants user pci full rwx permissions on the file test
  • getfacl test - prints all ACLs set on the file test
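A complete round trip might look like the sketch below. It uses a temporary file and the current user so it runs anywhere; on the cluster you would substitute a /work/psk path and a collaborator's username:

```shell
# Sketch: set, show, and remove an ACL entry.
# Uses $USER and a temp directory purely for illustration.
dir=$(mktemp -d)
touch "$dir/test"
if setfacl -m "user:$USER:rw-" "$dir/test" 2>/dev/null; then
    getfacl "$dir/test"                  # shows the extra user: entry
    setfacl -x "user:$USER" "$dir/test"  # remove the entry again
else
    echo "ACLs not supported on this filesystem"
fi
rm -rf "$dir"
```

On directories, setfacl -d -m ... sets a default ACL, so files created there afterwards inherit the permissions automatically.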


Performance tips:

  • if possible, store data in large binary files
  • when writing text files, avoid flushing and use buffering:
    - in C, fopen/fprintf buffer automatically (do not use the low-level system calls open/write)
    - in Fortran there is usually a compiler switch or environment variable to enable buffering; PGI compilers buffer by default; Intel compilers need the environment variable FORT_BUFFERED=1 (already set across the cluster)
  • if you cannot control flushing (e.g. in Mathematica):
    - use a different filesystem (e.g. /tmp for small files; remember to clean up before the job ends)
    - use a wrapper library that intercepts all calls to the flush function and discards them; to use it, add the line
    export LD_PRELOAD=/opt/lib/libwrap_flush.so (in bash) or
    setenv LD_PRELOAD /opt/lib/libwrap_flush.so (in tcsh);
    the wrapper is rather a dirty hack and there is no guarantee it will not break applications that depend on flushing; in case of problems or requests please contact its author: pci@camk.edu.pl