Ch 4.1: Digital Research Alliance of Canada Considerations
There are a number of considerations when running AI algorithms on DRAC resources. The DRAC wiki provides a comprehensive guide on how to use the infrastructure. This section provides a brief overview of some of the key considerations:
- Loading an interpreter
- Memory considerations
- Submitting jobs
- Interactive jobs
- Checking the status of your jobs
- The output of the job
- GPU jobs
Loading an interpreter
Packages might require a specific Python version. For example, at the time this material was written, TensorFlow (v2.16.1) and PyTorch (v2.6) require a Python version between 3.9 and 3.12, and the NumPy (v2.2.2) package requires a Python version >= 3.10. To check the modules available on the Digital Research Alliance of Canada infrastructure, you can use the following command:
#!/bin/bash
module avail python
Which will produce an output similar to:
--------------------------------------------------- Core Modules ----------------------------------------------------
ipython-kernel/3.10        ipython-kernel/3.13          python-build-bundle/2025a (D)    python/3.12.4  (t)
ipython-kernel/3.11 (D)    python-build-bundle/2023b    python/3.10.13 (t,3.10)          python/3.13.2  (t)
ipython-kernel/3.12        python-build-bundle/2024a    python/3.11.5  (t,D:3.11)
Where:
t: Tools for development / Outils de développement
Aliases: Aliases exist: foo/1.2.3 (1.2) means that "module load foo/1.2" will load foo/1.2.3
D: Default Module
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
To load a specific Python version (e.g. 3.11.5), you can use the following command:
module load python/3.11.5
Once the Python module is loaded, you can check the version of Python using the following command:
python --version
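A common follow-up, based on the Alliance's documented workflow, is to create a virtual environment on top of the loaded module and install packages from the cluster's pre-built wheels. The sketch below is illustrative; the environment path and package name are examples to adapt:
module load python/3.11.5
virtualenv --no-download ~/env_ai      # create an isolated environment (path is an example)
source ~/env_ai/bin/activate
pip install --no-index --upgrade pip   # --no-index installs from the wheels provided on the cluster
pip install --no-index numpy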
Memory considerations
It is important to consider the location of the data you intend to use. There are shared filesystems in the infrastructure (home, scratch and project), and there is node-local storage ($SLURM_TMPDIR). As much as possible, use the local storage for the data your jobs read. When using a given dataset, download it to one of the shared filesystems, then copy it to the local storage while running your algorithms. The infrastructure transfers a single large file over the network faster than many small files.
There are several storage types available in the Digital Research Alliance of Canada infrastructure:
- Home ($HOME): Persistent storage for personal files and software. Limited in size and not intended for large datasets.
- Scratch ($SCRATCH): Temporary storage for active computations. Data is not backed up and may be purged after a certain period.
- Project: Shared storage for project-related data. Suitable for collaborative work and larger datasets.
- Local Storage ($SLURM_TMPDIR): Node-local storage for job-specific data. Fast access but data is deleted after the job completes.
It is possible to create a disk usage report with the diskusage_report command, which will produce an output similar to the following:
Description Space # of files
/home (user someuser) 6752M/53G 96k/500k
/scratch (user someuser) 0/1099G 2/1000k
/project (group def-someuser) 522M/1000G 17/500k
--
On some clusters, a break down per user may be available by adding the option '--per_user'.
You can easily access these directories using the cd command. For example:
cd $HOME
Notice that $SLURM_TMPDIR is accessible from the sbatch script while a job is running. If you try to access it outside of a job, the variable will be empty.
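As a sketch of the staging pattern described above (anticipating the sbatch scripts introduced in the next section), the job script below copies a dataset archive to node-local storage before using it. The archive path, script name and --data-dir flag are hypothetical examples:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --account=def-someuser

# Copy one large archive instead of many small files (hypothetical path)
cp ~/projects/def-someuser/data.tar $SLURM_TMPDIR/

# Extract and process the local copy
cd $SLURM_TMPDIR
tar -xf data.tar
python ~/my_script.py --data-dir "$SLURM_TMPDIR/data"   # hypothetical script and flag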
Submitting jobs
Once you have your script ready to be executed, you need to submit a job using sbatch. A minimal sbatch script contains the amount of time required to run the script (--time), the account name (--account) and the commands to be executed. Additional options can be found on this page. The sample below shows a simple script (sbatch_sample.sh) that will simply print a message and sleep for 30 seconds.
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-someuser
echo 'Hello, world!'
sleep 30
You can also supply these options on the command line when launching your job, for example:
sbatch --time=00:30:00 \
--account=def-someuser \
sbatch_sample.sh
Interactive jobs
Sometimes it is useful to run interactive jobs. To do so, it is possible to request the resources using the salloc command.
salloc --time=01:00:00 --account=def-someuser
Once the resources are allocated, you can run your script interactively. For example, you can run a Python script (assuming you have the script file my_script.py) with the following command:
python my_script.py
At this stage, it is possible to install missing packages with the pip install --no-index <package> command. You can also run Python directly in the terminal with the python command. Once you are done with the interactive session, you can exit using the exit command.
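Putting these steps together, a complete interactive session might look like the sketch below; the memory and CPU requests are illustrative assumptions:
salloc --time=01:00:00 --mem=4G --cpus-per-task=2 --account=def-someuser
module load python/3.11.5
pip install --no-index numpy   # install a missing package from the cluster wheels
python my_script.py
exit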
Checking the status of your jobs
Once the jobs are launched, it is possible to verify their status using the squeue or sq command. The squeue command will list all jobs in the system, whereas the sq command will list only your jobs.
The sample output below was created with the sq command, and it shows two jobs in the system: one running and one pending.
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS GRES MIN_MEM NODELIST (REASON)
123456 smithj def-smithj simple_j R 0:03 1 1 (null) 4G cdr234 (None)
123457 smithj def-smithj bigger_j PD 2-00:00:00 1 16 (null) 16G (Priority)
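If the list is long, the standard squeue filters can narrow it down. For example, the following (using squeue's -u and -t options) lists only your running jobs:
squeue -u $USER -t RUNNING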
The output of the job
The output of the job will be saved in a file called slurm-<job_id>.out, where job_id is the ID of the job. The output file will contain the output of the script, as well as any error messages, and it will be saved in the directory from which the job was submitted. You can get more information about the output of the job in this link.
You can change the name of the output file with the --output option. To include information about the job in the filename, you can use the filename patterns described in this link. For example, to add the job name (%x) next to the job ID (%j), you can use the following command:
sbatch --output="slurm-%j-%x.out" <bash-script.sh>
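With this pattern, a job with ID 123456 and the name my_job (a hypothetical example) would write its output to slurm-123456-my_job.out.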
GPU jobs
To run a job on a GPU, you need to request GPU resources using the --gpus-per-node option. The sample command below shows how to request 1 GPU for a job:
sbatch --gpus-per-node=1 ...
You can also request a specific GPU model with the same option by prefixing the count with the model name. For example, to request 1 NVIDIA V100 GPU, you can use the following command:
sbatch --gpus-per-node=v100:1
For more information about how to request GPU resources, you can check this link, where you will find the resources available on the different clusters and the available GPU types.
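To close the section, here is a minimal sketch of a GPU job script combining the options above; the resource amounts, module version and script name are assumptions to adapt to your workload:
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --account=def-someuser
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load python/3.11.5
nvidia-smi             # report the GPU allocated to the job
python train_model.py  # hypothetical training script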