22.03.2024

  • python3-matplotlib installed on the cluster
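A quick way to check that the system-wide package is visible to your Python:

python3 -c 'import matplotlib; print(matplotlib.__version__)'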

 

18.03.2024 - Post-upgrade highlights

 

  • Rocky 9 OS
  • gcc 11 as the default; gcc 12 and 13 in toolsets (see `scl -l`; a usage sketch follows this list)
  • CUDA 12.4
  • new frontend and new nodes with faster CPUs and more memory
  • 300 TB more on /work/chuck (+100 TB in the coming months)
  • new limits on the chuck frontend: 2x more memory, 4x more processes
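To use one of the newer compilers, enable the corresponding toolset for your shell. A minimal sketch, assuming the collections carry the usual Rocky 9 names gcc-toolset-12 and gcc-toolset-13 (check the output of `scl -l` on chuck for the exact names):

# list the available collections
scl -l
# start a shell with gcc 12 active, then verify
scl enable gcc-toolset-12 bash
gcc --version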

 

In short: new libraries, compilers, and packages. You will have to recompile everything. If you use Python virtual environments, you should usually recreate them on the new system. The default compiler is gcc 11, and a lot changed after versions 9 and 10, so you may encounter many new warnings, or even errors, when compiling with gcc 11.
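Recreating a virtual environment usually follows the recipe below; a minimal sketch, assuming your environment lives in ~/venv and that you can still produce a requirements.txt (both names are examples, adjust to your setup):

# if the old environment still runs, dump its package list first
~/venv/bin/pip freeze > requirements.txt
# remove the old environment and rebuild it against the new system python
rm -rf ~/venv
python3 -m venv ~/venv
~/venv/bin/pip install -r requirements.txt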

 

How to access chuck?

The new system raises the security standards, and old SSH key types are deprecated. If you can't access chuck with your old RSA keys, please generate new ED25519 keys; this will not affect the old RSA keys. If you don't have ED25519 keys yet, execute on any Linux machine:

# generate a new ED25519 key pair (accept the default paths and set a non-empty passphrase)
ssh-keygen -t ed25519
# authorize the new key for logins to your account
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys

The second change concerns the host keys. When logging in to chuck you may be asked to remove an old, conflicting key from your known_hosts file. The error message will tell you how to do this; please follow the instructions.
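In most cases removing the stale entry is a single command; a sketch, assuming the hostname reported in the error message is chuck (use the exact hostname it prints):

# drop the outdated host key from ~/.ssh/known_hosts
ssh-keygen -R chuck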

 

11-17.03.2024

 

New hardware: an additional 350 TB for /work/chuck, 4 new nodes with 40 cores each, and a new GPU node with 40 cores and 2x NVIDIA A100 accelerators

Pre-upgrade

  • automatic installer prepared based on Rocky 9 OS
  • new chuck and backup servers configured
  • installer tests on new nodes

 

Monday

  • cluster shutdown, /work/chuck stopped
  • /work/chuck was checked for errors (only 6 lost files over 6 years is a good result; such losses can happen during a system crash)
  • backups started

 

Tuesday

  • two backups of /work/chuck are ready
  • one of the backups is served read-only as /work/chuck_backup to the whole CAMK network
  • the actual upgrade starts with the most critical part: the /work/chuck servers (effectively a separate cluster of disk arrays running a cluster filesystem); along with the OS, the filesystem itself has to be upgraded
  • installer improved
  • the metadata server is upgraded (it took a few hours just to boot it from the network)

 

Wednesday

  • good news! /work/chuck is upgraded and operational (I can see the files!), but it will not be accessible to users until the whole cluster is ready
  • time to clean up after the installation of the core servers; the next step is to install the nodes

 

Thursday

  • started migration of data from one of the /work/chuck servers to other servers (for disk replacement)
  • final improvements and tests of the node installer
  • testing Slurm, MPI and CUDA on a few nodes (a smoke-test sketch follows this list)
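Such a check can be as simple as the commands below; a sketch, assuming Slurm defaults (the GPU request syntax and node selection on chuck may differ):

# do the nodes answer and does Slurm schedule on them?
sinfo
srun -N 2 hostname
# does the GPU node see its accelerators?
srun --gres=gpu:1 nvidia-smi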

 

Friday

  • all the nodes are installed
  • the old chuck was reinstalled under a different name and will serve /work/chuck; this allows less strict limits for users on the new chuck frontend
  • verifying that everything works

 

Weekend

  • tests of everything :)
  • Slurm tests with MVAPICH2 and CUDA 12 (a sample batch job is sketched below)
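A minimal batch job of the kind used in these tests might look as follows; a sketch only, with example module and option names (chuck's actual configuration may differ):

#!/bin/bash
#SBATCH --job-name=mpi-cuda-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:1        # one GPU per node (example request)

module load mvapich2        # module name is an example
srun hostname               # one line per MPI rank
srun nvidia-smi -L          # list the GPUs visible to the job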