Ch 4.1: Digital Research Alliance of Canada Considerations
There are a number of considerations when running AI algorithms on DRAC resources. The DRAC wiki provides a comprehensive guide on how to use the infrastructure. This section provides a brief overview of some of the key considerations:
- Loading an interpreter
- Memory considerations
- Submitting jobs
- Interactive jobs
- Checking the status of your jobs
- The output of the job
- GPU jobs
Loading an interpreter
Packages might require a specific Python version. For example, at the time this material was written, TensorFlow (v2.16.1) and PyTorch (v2.6) require a Python version between 3.9 and 3.12, and the NumPy (v2.2.2) package requires a Python version >= 3.10. To check the modules available on the Digital Research Alliance of Canada infrastructure, you can use the following command:
#!/bin/bash
module avail python
Which will produce an output similar to:
--------------------------------------------------- Core Modules ----------------------------------------------------
ipython-kernel/3.10        ipython-kernel/3.13          python-build-bundle/2025a (D)    python/3.12.4  (t)
ipython-kernel/3.11 (D)    python-build-bundle/2023b    python/3.10.13 (t,3.10)          python/3.13.2  (t)
ipython-kernel/3.12        python-build-bundle/2024a    python/3.11.5  (t,D:3.11)
Where:
t: Tools for development / Outils de développement
Aliases: Aliases exist: foo/1.2.3 (1.2) means that "module load foo/1.2" will load foo/1.2.3
D: Default Module
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
To load a specific Python version (e.g. 3.11.5), you can use the following command:
module load python/3.11.5
Once the Python module is loaded, you can check the version of Python using the following command:
python --version
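A common follow-up, based on the Alliance's documented workflow, is to create a virtual environment on top of the loaded module and install packages from the cluster's pre-built wheels. The sketch below is illustrative; the environment path and package name are examples to adapt:
module load python/3.11.5
virtualenv --no-download ~/env_ai      # create an isolated environment (path is an example)
source ~/env_ai/bin/activate
pip install --no-index --upgrade pip   # --no-index installs from the wheels provided on the cluster
pip install --no-index numpy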
Memory considerations
It is important to consider the location of the data you intend to use. There are shared filesystems in the infrastructure (home, scratch and project), and there is node-local storage ($SLURM_TMPDIR). As much as possible, use the local storage for the data your jobs read. When using a given dataset, download it to one of the shared filesystems, then copy it to the local storage while running your algorithms. The infrastructure transfers a single large file over the network faster than many small files.
There are several storage types available in the Digital Research Alliance of Canada infrastructure:
- Home ($HOME): Persistent storage for personal files and software. Limited in size and not intended for large datasets.
- Scratch ($SCRATCH): Temporary storage for active computations. Data is not backed up and may be purged after a certain period.
- Project: Shared storage for project-related data. Suitable for collaborative work and larger datasets.
- Local Storage ($SLURM_TMPDIR): Node-local storage for job-specific data. Fast access but data is deleted after the job completes.
It is possible to create a disk usage report with the diskusage_report command, which will produce an output similar to the following:
Description Space # of files
/home (user someuser) 6752M/53G 96k/500k
/scratch (user someuser) 0/1099G 2/1000k
/project (group def-someuser) 522M/1000G 17/500k
--
On some clusters, a break down per user may be available by adding the option '--per_user'.
You can easily access these directories using the cd command. For example:
cd $HOME
Notice that $SLURM_TMPDIR is accessible from the sbatch script while a job is running. If you try to access it outside of a job, the variable will be empty.
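As a sketch of the staging pattern described above (anticipating the sbatch scripts introduced in the next section), the job script below copies a dataset archive to node-local storage before using it. The archive path, script name and --data-dir flag are hypothetical examples:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --account=def-someuser

# Copy one large archive instead of many small files (hypothetical path)
cp ~/projects/def-someuser/data.tar $SLURM_TMPDIR/

# Extract and process the local copy
cd $SLURM_TMPDIR
tar -xf data.tar
python ~/my_script.py --data-dir "$SLURM_TMPDIR/data"   # hypothetical script and flag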
Submitting jobs
Once you have your script ready to be executed, you need to submit a job using sbatch. A minimal sbatch script contains the amount of time required to run the script (--time), the account name (--account) and the commands to be executed. Additional options can be found on this page. The sample below shows a simple script (sbatch_sample.sh) that will simply print a message and sleep for 30 seconds.
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-someuser
echo 'Hello, world!'
sleep 30
You can also supply these options on the command line when launching your job, for example:
sbatch --time=00:30:00 \
--account=def-someuser \
sbatch_sample.sh
Interactive jobs
Sometimes it is useful to run interactive jobs. To do so, it is possible to request the resources using the salloc command.
salloc --time=01:00:00 --account=def-someuser
Once the resources are allocated, you can run your script interactively. For example, you can run a Python script (assuming you have the script file my_script.py) with the following command:
python my_script.py
At this stage, it is possible to install missing packages with the pip install --no-index <package> command. You can also run Python directly in the terminal with the python command. Once you are done with the interactive session, you can exit using the exit command.
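Putting these steps together, a complete interactive session might look like the sketch below; the memory and CPU requests are illustrative assumptions:
salloc --time=01:00:00 --mem=4G --cpus-per-task=2 --account=def-someuser
module load python/3.11.5
pip install --no-index numpy   # install a missing package from the cluster wheels
python my_script.py
exit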
Checking the status of your jobs
Once the jobs are launched, it is possible to verify their status using the squeue or sq command. The squeue command will list all jobs in the system, whereas the sq command will list only your jobs.
The sample output below was created with the sq command, and it shows two jobs in the system: one running and one pending.
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS GRES MIN_MEM NODELIST (REASON)
123456 smithj def-smithj simple_j R 0:03 1 1 (null) 4G cdr234 (None)
123457 smithj def-smithj bigger_j PD 2-00:00:00 1 16 (null) 16G (Priority)
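If the list is long, the standard squeue filters can narrow it down. For example, the following (using squeue's -u and -t options) lists only your running jobs:
squeue -u $USER -t RUNNING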
The output of the job
The output of the job will be saved in a file called slurm-<job_id>.out, where job_id is the ID of the job. The output file will contain the output of the script, as well as any error messages, and it will be saved in the directory from which the job was submitted. You can get more information about the output of the job in this link.
You can change the name of the output file with the --output option. To include information about the job in the filename, you can use the filename patterns described in this link. For example, to add the job name (%x) next to the job ID (%j), you can use the following command:
sbatch --output="slurm-%j-%x.out" <bash-script.sh>
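With this pattern, a job with ID 123456 and the name my_job (a hypothetical example) would write its output to slurm-123456-my_job.out.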
GPU jobs
To run a job on a GPU, you need to request GPU resources using the --gpus-per-node option. The sample command below shows how to request 1 GPU for a job:
sbatch --gpus-per-node=1 ...
You can also request a specific GPU model with the same option by prefixing the count with the model name. For example, to request 1 NVIDIA V100 GPU, you can use the following command:
sbatch --gpus-per-node=v100:1
For more information about how to request GPU resources, you can check this link, where you will find the resources available on the different clusters and the available GPU types.
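To close the section, here is a minimal sketch of a GPU job script combining the options above; the resource amounts, module version and script name are assumptions to adapt to your workload:
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --account=def-someuser
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load python/3.11.5
nvidia-smi             # report the GPU allocated to the job
python train_model.py  # hypothetical training script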