22.03.2024

  • python3-matplotlib installed on the cluster
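A quick way to check that the system-wide package is visible to your Python:

python3 -c 'import matplotlib; print(matplotlib.__version__)'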

 

18.03.2024 - Post-upgrade highlights

 

  • Rocky 9 OS
  • gcc 11 as the default; gcc 12 and 13 in toolsets (see `scl -l`; a usage sketch follows this list)
  • CUDA 12.4
  • new frontend and new nodes with faster CPUs and more memory
  • 300 TB more on /work/chuck (+100 TB in the coming months)
  • new limits on the chuck frontend: 2x more memory, 4x more processes
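To use one of the newer compilers, enable the corresponding toolset for your shell. A minimal sketch, assuming the collections carry the usual Rocky 9 names gcc-toolset-12 and gcc-toolset-13 (check the output of `scl -l` on chuck for the exact names):

# list the available collections
scl -l
# start a shell with gcc 12 active, then verify
scl enable gcc-toolset-12 bash
gcc --version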

 

In short: new libraries, compilers, and packages. You will have to recompile everything. If you use Python virtual environments, you should usually recreate them on the new system. The default compiler is gcc 11, and a lot changed after versions 9 and 10, so you may encounter many new warnings, or even errors, when compiling with gcc 11.
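Recreating a virtual environment usually follows the recipe below; a minimal sketch, assuming your environment lives in ~/venv and that you can still produce a requirements.txt (both names are examples, adjust to your setup):

# if the old environment still runs, dump its package list first
~/venv/bin/pip freeze > requirements.txt
# remove the old environment and rebuild it against the new system python
rm -rf ~/venv
python3 -m venv ~/venv
~/venv/bin/pip install -r requirements.txt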

 

How to access chuck?

The new system raises the security standards, and old SSH key types are deprecated. If you can't access chuck with your old RSA keys, please generate new ED25519 keys; this will not affect the old RSA keys. If you don't have ED25519 keys yet, execute on any Linux machine:

# generate a new ED25519 key pair (accept the default paths and set a non-empty passphrase)
ssh-keygen -t ed25519
# authorize the new key for logins to your account
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys

The second change concerns the host keys. When logging in to chuck you may be asked to remove an old, conflicting key from your known_hosts file. The error message will tell you how to do this; please follow the instructions.
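In most cases removing the stale entry is a single command; a sketch, assuming the hostname reported in the error message is chuck (use the exact hostname it prints):

# drop the outdated host key from ~/.ssh/known_hosts
ssh-keygen -R chuck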

 

11-17.03.2024

 

New hardware: an additional 350 TB for /work/chuck, 4 new nodes with 40 cores each, and a new GPU node with 40 cores and 2x NVIDIA A100 accelerators

Pre-upgrade

  • automatic installer prepared based on Rocky 9 OS
  • new chuck and backup servers configured
  • installer tests on new nodes

 

Monday

  • cluster shutdown, /work/chuck stopped
  • /work/chuck was checked for errors (only 6 lost files over 6 years is a good result; such losses can happen during a system crash)
  • backups started

 

Tuesday

  • two backups of /work/chuck are ready
  • one of the backups is served read-only as /work/chuck_backup to the whole CAMK network
  • the actual upgrade starts with the most critical part: the /work/chuck servers (effectively a separate cluster of disk arrays running a cluster filesystem); along with the OS, the filesystem itself has to be upgraded
  • installer improved
  • the metadata server is upgraded (it took a few hours just to boot it from the network)

 

Wednesday

  • good news! /work/chuck is upgraded and operational (I can see the files!), but it will not be accessible to users until the whole cluster is ready
  • time to clean up after the installation of the core servers; the next step is to install the nodes

 

Thursday

  • started migration of data from one of the /work/chuck servers to other servers (for disk replacement)
  • final improvements and tests of the node installer
  • testing Slurm, MPI and CUDA on a few nodes (a smoke-test sketch follows this list)
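Such a check can be as simple as the commands below; a sketch, assuming Slurm defaults (the GPU request syntax and node selection on chuck may differ):

# do the nodes answer and does Slurm schedule on them?
sinfo
srun -N 2 hostname
# does the GPU node see its accelerators?
srun --gres=gpu:1 nvidia-smi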

 

Friday

  • all the nodes are installed
  • the old chuck was reinstalled under a different name and will serve /work/chuck; this allows less strict limits for users on the new chuck frontend
  • verifying that everything works

 

Weekend

  • tests of everything :)
  • Slurm tests with MVAPICH2 and CUDA 12 (a sample batch job is sketched below)
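A minimal batch job of the kind used in these tests might look as follows; a sketch only, with example module and option names (chuck's actual configuration may differ):

#!/bin/bash
#SBATCH --job-name=mpi-cuda-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:1        # one GPU per node (example request)

module load mvapich2        # module name is an example
srun hostname               # one line per MPI rank
srun nvidia-smi -L          # list the GPUs visible to the job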