
Running Cthulhu on a Cluster

For real-world applications, one typically needs cross sections computed across a grid of temperatures and pressures. Such computations are handled much more efficiently on a computing cluster, where each (P, T) pair can be assigned to a separate ‘job’.

This tutorial describes how to run Cthulhu on a cluster, focusing on clusters managed by the common Slurm scheduling system.

Note:

Every computing cluster is special and has its own unique architecture. We strongly recommend reading the documentation for your local cluster before proceeding and adapting the code below accordingly.

Running on a cluster involves two separate files:

A Python file calling Cthulhu (similar to those you’ve seen in the previous tutorials).
A batch script or sbatch file that submits specific combinations of pressure-temperature points to different cores.

Option 1: Python script in cluster mode (e.g. many atoms)

Let’s first create a Python file to calculate cross sections for \(\mathrm{Fe}\), \(\mathrm{Ti}\), \(\mathrm{Mg}\), and \(\mathrm{Fe^{+}}\) all at once. When a user places Cthulhu in cluster mode (via cluster_run = True), the code uses a single core for each pressure and temperature pair. This is ideal for small line lists that are quick to compute, such as those for atoms.

In the example here, a core will first compute the \(\mathrm{Fe}\) cross section at a given (P, T), then compute \(\mathrm{Ti}\) for the same (P, T) pair, and so on. Each core therefore calculates four cross sections at a single (P, T) point, and we will use a Slurm shell script to request enough cores to cover all the (P, T) points.

Copy the code below into a .py file… how about many_atoms_on_my_powerful_cluster.py

[3]:

%%script echo skipping     # <---- REMOVE THIS LINE (it is just for the documentation to skip running this cell)

#***** Example script to batch-run Cthulhu on a cluster *****#

from Cthulhu.core import compute_cross_section

species_neutral = ['Fe', 'Ti', 'Mg']   # Fe, Ti, and Mg (neutral atoms)
species_ions = ['Fe']                  # Fe + (the ionization state is entered below)

database = 'VALD'

input_directory = '/PATH_TO_YOUR_LINE_LISTS/input/'      # Change this to point to your line list input folder

P = [1.0e-6, 1.0e-5, 1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1, 1.0e0, 1.0e1, 1.0e2]    # Pressure (bar)
T = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0,                 # Temperature (K)
     900.0, 1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 2000.0,
     2500.0, 3000.0, 3500.0]

# Create cross sections for the atoms (ionization_state = 1 is the neutral atom)
for species in species_neutral:
    compute_cross_section(database = database, species = species,
                          pressure = P, temperature = T, S_cut = 1.0e-100,
                          input_dir = input_directory, ionization_state = 1,
                          nu_out_min = 200, nu_out_max = 40000, dnu_out = 0.01,
                          verbose = False, N_cores = 1, cluster_run = True)        # cluster_run must be True for a cluster run!

# Create cross sections for the ions (ionization_state = 2 is the singly ionized atom, e.g. Fe+)
for species in species_ions:
    compute_cross_section(database = database, species = species,
                          pressure = P, temperature = T, S_cut = 1.0e-100,
                          input_dir = input_directory, ionization_state = 2,
                          nu_out_min = 200, nu_out_max = 40000, dnu_out = 0.01,
                          verbose = False, N_cores = 1, cluster_run = True)        # cluster_run must be True for a cluster run!

skipping # <---- This is just for the documentation to skip this cell, please remove before running

Now we need to create a shell script to assign a core to each (P, T) pair.

From looking at the Python script above, we have 9 pressures and 18 temperatures, for a total of 162 (P, T) pairs. So we will create a shell script to submit 162 jobs, one for each (P, T) point, with a single core being assigned to each job. Each job will run Cthulhu from the terminal automatically via the following commands:

python -u many_atoms_on_my_powerful_cluster.py 0
python -u many_atoms_on_my_powerful_cluster.py 1
python -u many_atoms_on_my_powerful_cluster.py 2
.
.
.
python -u many_atoms_on_my_powerful_cluster.py 161

Here, ‘0’ denotes the first (P, T) pair (P[0], T[0]) and ‘161’ denotes the final (P, T) pair (P[8], T[17]).
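The mapping from a job index to a (P, T) pair can be sketched in Python. This is a hypothetical illustration, not Cthulhu's internal code: the helper name job_index_to_PT and the ordering (pressure as the slow index, temperature as the fast index) are assumptions. The two endpoints quoted above are consistent with either ordering, so check the Cthulhu source if the exact order of the intermediate indices matters to you.

```python
# Pressure and temperature grids copied from the Python script above
P = [1.0e-6, 1.0e-5, 1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1, 1.0e0, 1.0e1, 1.0e2]    # Pressure (bar)
T = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0,                 # Temperature (K)
     900.0, 1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 2000.0,
     2500.0, 3000.0, 3500.0]


def job_index_to_PT(index, pressures, temperatures):
    """Hypothetical helper: convert a flat job index into a (P, T) pair.

    ASSUMPTION: pressure is the slow index and temperature the fast index.
    Cthulhu's own ordering may differ for the intermediate indices.
    """
    n_jobs = len(pressures) * len(temperatures)
    if not 0 <= index < n_jobs:
        raise IndexError(f"Job index {index} is outside 0..{n_jobs - 1}")
    i_P, i_T = divmod(index, len(temperatures))
    return pressures[i_P], temperatures[i_T]


N_jobs = len(P) * len(T)

print(N_jobs)                        # 162
print(job_index_to_PT(0, P, T))      # (1e-06, 100.0)
print(job_index_to_PT(161, P, T))    # (100.0, 3500.0)
```

Computing N_jobs this way is also a quick check that the loop range in the shell script (0 to N_jobs - 1) matches the grid defined in the Python file.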

To accomplish this, copy the code below into a shell script (.sh). We’ll call it my_ultimate_shell_script.sh.

You should also make a folder called logs in the same folder to store the terminal output from each job.

[4]:

%%script echo skipping     # <---- REMOVE THIS LINE (it is just for the documentation to skip running this cell)
#!/bin/bash

# Job name for the group
JOB_NAME="atoms"

# srun options used below (a comment cannot follow a line-continuation backslash):
#   --account        Your user account on your cluster (or which account will be charged)
#   --partition      Your cluster may have a different name for the default partition
#   --cpus-per-task  We only need one core per job, since this is a cluster run
#   --mem            Reserving 100 GB of RAM (less is probably fine)
#   --time           Max runtime of 1 hour (atoms will take seconds)
#   --output         In a directory called 'logs' (make one!), write the terminal output
#   --error          Write any error messages into a separate file
# The final line of the srun command is the path to the Python file above.

for i in {0..161}; do                     # Loops over the 162 jobs
    srun \
        --account=YOUR_USER_ACCOUNT \
        --partition=standard \
        --nodes=1 \
        --cpus-per-task=1 \
        --tasks-per-node=1 \
        --mem=100G \
        --time=0-01:00:00 \
        --output=./logs/${JOB_NAME}.$i.out \
        --error=./logs/${JOB_NAME}.$i.err \
        --job-name=$JOB_NAME \
        python -u /PATH/TO/YOUR/CODE/many_atoms_on_my_powerful_cluster.py $i &
done

skipping # <---- This is just for the documentation to skip this cell, please remove before running

Make the shell script executable (chmod +x my_ultimate_shell_script.sh), then run it with:

./my_ultimate_shell_script.sh

Congratulations, you have just unlocked the power of calculating cross sections in parallel on 162 cores! 🎉
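With 162 jobs in flight, it can be worth checking that every job actually produced output. The sketch below scans the logs folder for missing .out files and non-empty .err files. The function name find_problem_jobs is hypothetical, and the file naming (atoms.N.out and atoms.N.err) is taken from the shell script above. Note that a non-empty .err file does not necessarily mean a job failed, since warnings are also written there.

```python
from pathlib import Path


def find_problem_jobs(log_dir, job_name="atoms", n_jobs=162):
    """Hypothetical helper: list job indices with missing or suspicious logs.

    Returns two lists of job indices:
      missing - jobs with no .out file (probably never started)
      errored - jobs whose .err file is non-empty (worth inspecting by hand)
    """
    log_dir = Path(log_dir)
    missing, errored = [], []

    for i in range(n_jobs):
        out_file = log_dir / f"{job_name}.{i}.out"
        err_file = log_dir / f"{job_name}.{i}.err"

        if not out_file.exists():
            missing.append(i)
        elif err_file.exists() and err_file.stat().st_size > 0:
            errored.append(i)

    return missing, errored


if __name__ == "__main__":
    missing, errored = find_problem_jobs("./logs")
    print(f"Jobs with no output file: {missing}")
    print(f"Jobs with a non-empty error file: {errored}")
```

Any index reported here can be rerun on its own by passing that number to the Python script, in the same way the shell script does.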

Option 2: Multiple nodes in parallel (e.g. molecules with large line lists)

For very large molecular line lists, you’ll usually want to throw many cores at each (P, T) point while also computing several (P, T) points in parallel to minimise runtime.

Here we’ll look at an example of how to efficiently calculate a cross section for the ExoMol \(\mathrm{SO_2}\) line list (1.3 billion transitions). We’ll break the temperature grid into 6 segments, each covering all the pressures, and throw 36 cores at each segment.

Copy the code below into a .py file, e.g. SO2_cluster.py

[ ]:

import numpy as np
from Cthulhu.core import compute_cross_section

species = 'SO2'

database = 'ExoMol'
linelist = 'ExoAmes'

broadening = 'H2-He'

input_directory = '/PATH_TO_YOUR_LINE_LISTS/input/'      # Change this to point to your line list input folder

#***** Each node will work on all the pressures for a subset of temperatures *****#

P = np.array([1.0e-6, 1.0e-5, 1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1, 1.0e0, 1.0e1, 1.0e2])

#***** Selectively uncomment the temperature range you want this job to work on, then submit the job (see below) *****#

T = np.array([100.0, 200.0, 300.0, 400.0])           # Only have one of these lines uncommented at a time
#T = np.array([500.0, 600.0, 700.0, 800.0])
#T = np.array([900.0, 1000.0, 1200.0])
#T = np.array([1400.0, 1600.0, 1800.0])
#T = np.array([2000.0, 2500.0])
#T = np.array([3000.0, 3500.0])

# Create cross section
compute_cross_section(database = database, species = species, linelist = linelist,
                      pressure = P, temperature = T, S_cut = 0.0,
                      input_dir = input_directory, broad_type = broadening,
                      nu_out_min = 200, nu_out_max = 25000, dnu_out = 0.01,
                      verbose = True, N_cores = 36)   # 36 cores in parallel

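Now let’s create the Slurm batch script to submit this job. Copy the code below into a file called SO2.sbatch. As before, make sure a folder called logs exists in the directory you submit from, since the output and error files are written there.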

[ ]:

%%script echo skipping     # <---- REMOVE THIS LINE (it is just for the documentation to skip running this cell)
#!/bin/bash
#SBATCH --job-name=SO2
#SBATCH --account=YOUR_USER_ACCOUNT
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --cpus-per-task=36
#SBATCH --tasks-per-node=1
#SBATCH --mem=100G
#SBATCH --time=1-00:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=YOUREMAIL@YOURUNIVERSITY.edu
#SBATCH -o ./logs/SO2.%j.out
#SBATCH -e ./logs/SO2.%j.err
#SBATCH --get-user-env

declare -xr WDIR="/home/YOUR_PATH_HERE/Cthulhu/REST_OF_YOUR_PATH/"

declare PATH=${PATH}:${WDIR}

time_start=$(date '+%T %d_%h_%Y')     # Time, day, month, and year when the job starts

echo ------------------------------------------------------
echo SBATCH: job name is $SLURM_JOB_NAME
echo SBATCH: job identifier is $SLURM_JOBID
echo SBATCH: sbatch is running on $SLURM_SUBMIT_HOST
echo SBATCH: executing queue is $SLURM_JOB_PARTITION
echo SBATCH: working directory is $SLURM_SUBMIT_DIR
echo SBATCH: node file is $SLURM_JOB_NODELIST
echo ------------------------------------------------------

cd ${WDIR}

python -u SO2_cluster.py

time_end=$(date '+%T %d_%h_%Y')       # Time, day, month, and year when the job ends
echo Started at: $time_start
echo Ended at: $time_end
echo ------------------------------------------------------
echo Job ends

On your Slurm-managed cluster, you can submit this job in a terminal via the command

sbatch SO2.sbatch

Check that the job has started running with

squeue -u $USER

(Some clusters also provide a site-specific shorthand for this, such as sq.)

Once the first job is running, you can see the terminal output in ./logs/SO2.JOB_ID.out (where JOB_ID will be a number assigned by your cluster). If you need to cancel the job, run

scancel JOB_ID

replacing JOB_ID with the specific number reported by squeue.

Then you can edit the Python file to uncomment the next range of temperatures you want to compute (the pressures stay the same for every job), save the .py file, submit the next job, and so on.
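As an alternative to editing the file by hand between submissions, the temperature segment could be chosen from the SLURM_ARRAY_TASK_ID environment variable, which Slurm sets for each task of a job array. The sketch below is an assumption about how you might restructure SO2_cluster.py, not something Cthulhu requires: the names T_SEGMENTS and select_temperatures are hypothetical, and the returned list can be wrapped in np.array before being passed to compute_cross_section.

```python
import os

# The six temperature segments from SO2_cluster.py, in the same order
T_SEGMENTS = [
    [100.0, 200.0, 300.0, 400.0],
    [500.0, 600.0, 700.0, 800.0],
    [900.0, 1000.0, 1200.0],
    [1400.0, 1600.0, 1800.0],
    [2000.0, 2500.0],
    [3000.0, 3500.0],
]


def select_temperatures(environ=os.environ):
    """Hypothetical helper: pick this job's temperature segment.

    Reads the Slurm job array index from SLURM_ARRAY_TASK_ID and
    falls back to the first segment when the variable is not set
    (e.g. when testing on a laptop).
    """
    index = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= index < len(T_SEGMENTS):
        raise IndexError(f"Array index {index} is outside 0..{len(T_SEGMENTS) - 1}")
    return T_SEGMENTS[index]


T = select_temperatures()
print(f"This job will compute temperatures: {T}")
```

With this in place, all six segments could be submitted at once with sbatch --array=0-5 SO2.sbatch, subject to the job limits on your cluster.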

Enjoy the power of Cthulhu! 🐙