Ch. 4.2: Handwritten Digit Recognition Using the MNIST Dataset
The MNIST dataset is a large database of handwritten digits that is commonly used for training various image processing systems. It was created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges and has become one of the most popular datasets in the field of machine learning and computer vision. The dataset contains 60,000 training images and 10,000 testing images, each of which is a 28×28 grayscale image of a single handwritten digit (0-9). The images are normalized and centered, making it easier for machine learning algorithms to process and learn from them. Using the MNIST dataset, we can train models to recognize handwritten digits with high accuracy. This is a fundamental task in the field of computer vision and serves as a benchmark for evaluating the performance of various algorithms and models.
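If you want to get a feel for the data before training anything, a minimal sketch such as the one below (our own illustration, assuming torchvision is installed and the dataset can be downloaded into a local data folder) loads the training split and inspects one sample:
from torchvision import datasets, transforms

# Download (if needed) and load the MNIST training split
train_set = datasets.MNIST('data', train=True, download=True,
                           transform=transforms.ToTensor())
image, label = train_set[0]
print(len(train_set))  # 60000 training images
print(image.shape)     # torch.Size([1, 28, 28]): a single-channel 28x28 image
print(label)           # the digit this image represents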
In this section, we will explore how to build a simple handwritten digit recognition system using the MNIST dataset and the PyTorch library. We will train a convolutional neural network (CNN) on the MNIST dataset and evaluate its performance on a test set of images. We will also discuss some considerations for running the code on the DRAC resources. We will explore three different ways of running the MNIST handwritten digit recognition code: running it locally, running it interactively on the DRAC resources, and submitting it as a sbatch job.
Running Locally
In this section, we will demonstrate how to run a handwritten digit recognition system using the MNIST dataset on your local machine. We will guide you through setting up a virtual environment, installing the necessary dependencies, and running the code. This will help you understand the process of building and testing machine learning models locally before deploying them on more powerful resources like the DRAC infrastructure.
Creating the (local) virtual environment
To create a virtual environment, you can use the following script:
#!/bin/bash
# This script is intended to be run locally to set up the environment
# for the handwritten digit recognition project. It installs all the
# necessary dependencies and tools required for the project.
# Check if the .venv directory exists, if not, create a virtual environment using Python 3.11
[ -d .venv ] || virtualenv --python=python3.11 .venv
# Activate the virtual environment
source .venv/bin/activate
# Install required packages using pip
pip install torch torchvision
The final MNIST training script
The original code used in this section can be found in the PyTorch GitHub repository; it has been modified slightly for this lesson. The most up-to-date code for the MNIST handwritten digit recognition system can be downloaded from this link. The code at the time of writing is shown below:
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
import os
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--no-mps', action='store_true', default=False,
                        help='disables macOS GPU training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    use_mps = not args.no_mps and torch.backends.mps.is_available()

    torch.manual_seed(args.seed)

    if use_cuda:
        device = torch.device("cuda")
    elif use_mps:
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    train_kwargs = {'batch_size': args.batch_size}
    test_kwargs = {'batch_size': args.test_batch_size}
    if use_cuda:
        cuda_kwargs = {'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    datapath = os.path.join(
        os.getenv('SLURM_TMPDIR', os.getenv('HOME')),
        'data'
    )
    dataset1 = datasets.MNIST(datapath, train=True,
                              transform=transform)
    dataset2 = datasets.MNIST(datapath, train=False,
                              transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")

if __name__ == '__main__':
    main()
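As a quick sanity check on the architecture (this shape calculation is our own annotation, not part of the original example), you can trace how the 28×28 input shrinks to the 9216 features expected by fc1:
# Tracing the tensor shape through Net.forward:
h = w = 28
h, w = h - 2, w - 2    # conv1: 3x3 kernel, stride 1, no padding -> 26x26
h, w = h - 2, w - 2    # conv2: 3x3 kernel, stride 1, no padding -> 24x24
h, w = h // 2, w // 2  # F.max_pool2d(x, 2)                      -> 12x12
print(64 * h * w)      # 64 output channels * 12 * 12 = 9216, the input size of fc1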
Using the script above, create a file called mnist.py, which you can then run following the instructions in Running the script locally.
Running the script locally
To run the script locally, you can use the following command:
#!/bin/bash
python mnist.py
The script has several options that you can use to customize its behavior:
- --epochs: Number of epochs to train the model. Default is 14.
- --batch-size: Batch size for training. Default is 64.
- --lr: Learning rate for the optimizer. Default is 1.0.
- --save-model: Save the trained model to a file. Default is False.
For example, to run the script with 20 epochs and a batch size of 128, you can use the following command:
#!/bin/bash
python mnist.py --epochs 20 --batch-size 128
If you run into problems downloading the dataset, you can download it manually from a terminal:
#!/bin/bash
wget www.di.ens.fr/~lelarge/MNIST.tar.gz -P $HOME/data
tar -zxvf $HOME/data/MNIST.tar.gz -C $HOME/data
If you run into problems with the version of numpy, you can install the version required by the script:
#!/bin/bash
pip uninstall numpy
pip install numpy==1.26.4
Questions
These questions were formulated to explore the Python code and the options available for running it locally. The sample output below gives an idea of what the script produces when run as: python mnist.py --epochs 10 --batch-size 128 --lr 1.0
mnist.py output
Train Epoch: 1 [0/60000 (0%)] Loss: 2.302122
Train Epoch: 1 [1280/60000 (2%)] Loss: 2.306612
Train Epoch: 1 [2560/60000 (4%)] Loss: 2.296520
Train Epoch: 1 [3840/60000 (6%)] Loss: 2.299252
Train Epoch: 1 [5120/60000 (9%)] Loss: 2.297876
Train Epoch: 1 [6400/60000 (11%)] Loss: 2.302602
Train Epoch: 1 [7680/60000 (13%)] Loss: 2.297840
Train Epoch: 1 [8960/60000 (15%)] Loss: 2.299316
Train Epoch: 1 [10240/60000 (17%)] Loss: 2.297020
Train Epoch: 1 [11520/60000 (19%)] Loss: 2.303276
Train Epoch: 1 [12800/60000 (21%)] Loss: 2.302283
Train Epoch: 1 [14080/60000 (23%)] Loss: 2.308701
Train Epoch: 1 [15360/60000 (26%)] Loss: 2.302152
Train Epoch: 1 [16640/60000 (28%)] Loss: 2.295946
Train Epoch: 1 [17920/60000 (30%)] Loss: 2.296917
Train Epoch: 1 [19200/60000 (32%)] Loss: 2.299727
Train Epoch: 1 [20480/60000 (34%)] Loss: 2.309196
Train Epoch: 1 [21760/60000 (36%)] Loss: 2.305005
Train Epoch: 1 [23040/60000 (38%)] Loss: 2.303269
Train Epoch: 1 [24320/60000 (41%)] Loss: 2.300710
Train Epoch: 1 [25600/60000 (43%)] Loss: 2.303122
Train Epoch: 1 [26880/60000 (45%)] Loss: 2.295654
Train Epoch: 1 [28160/60000 (47%)] Loss: 2.300717
Train Epoch: 1 [29440/60000 (49%)] Loss: 2.303221
Train Epoch: 1 [30720/60000 (51%)] Loss: 2.302185
Train Epoch: 1 [32000/60000 (53%)] Loss: 2.306206
Train Epoch: 1 [33280/60000 (55%)] Loss: 2.294770
Train Epoch: 1 [34560/60000 (58%)] Loss: 2.293919
Train Epoch: 1 [35840/60000 (60%)] Loss: 2.302590
Train Epoch: 1 [37120/60000 (62%)] Loss: 2.296479
Train Epoch: 1 [38400/60000 (64%)] Loss: 2.293786
Train Epoch: 1 [39680/60000 (66%)] Loss: 2.302353
Train Epoch: 1 [40960/60000 (68%)] Loss: 2.304022
Train Epoch: 1 [42240/60000 (70%)] Loss: 2.301171
Train Epoch: 1 [43520/60000 (72%)] Loss: 2.306725
Train Epoch: 1 [44800/60000 (75%)] Loss: 2.306609
Train Epoch: 1 [46080/60000 (77%)] Loss: 2.293973
Train Epoch: 1 [47360/60000 (79%)] Loss: 2.299634
Train Epoch: 1 [48640/60000 (81%)] Loss: 2.301875
Train Epoch: 1 [49920/60000 (83%)] Loss: 2.306968
Train Epoch: 1 [51200/60000 (85%)] Loss: 2.305866
Train Epoch: 1 [52480/60000 (87%)] Loss: 2.300992
Train Epoch: 1 [53760/60000 (90%)] Loss: 2.300236
Train Epoch: 1 [55040/60000 (92%)] Loss: 2.301974
Train Epoch: 1 [56320/60000 (94%)] Loss: 2.301962
Train Epoch: 1 [57600/60000 (96%)] Loss: 2.295600
Train Epoch: 1 [58880/60000 (98%)] Loss: 2.297477
Test set: Average loss: 2.3013, Accuracy: 1134/10000 (11%)
Train Epoch: 2 [0/60000 (0%)] Loss: 2.294314
Train Epoch: 2 [1280/60000 (2%)] Loss: 2.304370
Train Epoch: 2 [2560/60000 (4%)] Loss: 2.300747
Train Epoch: 2 [3840/60000 (6%)] Loss: 2.296614
Train Epoch: 2 [5120/60000 (9%)] Loss: 2.300035
Train Epoch: 2 [6400/60000 (11%)] Loss: 2.304614
Train Epoch: 2 [7680/60000 (13%)] Loss: 2.296954
Train Epoch: 2 [8960/60000 (15%)] Loss: 2.298748
Train Epoch: 2 [10240/60000 (17%)] Loss: 2.297321
Train Epoch: 2 [11520/60000 (19%)] Loss: 2.301153
Train Epoch: 2 [12800/60000 (21%)] Loss: 2.303508
Train Epoch: 2 [14080/60000 (23%)] Loss: 2.309472
Train Epoch: 2 [15360/60000 (26%)] Loss: 2.302835
Train Epoch: 2 [16640/60000 (28%)] Loss: 2.294156
Train Epoch: 2 [17920/60000 (30%)] Loss: 2.297649
Train Epoch: 2 [19200/60000 (32%)] Loss: 2.298760
Train Epoch: 2 [20480/60000 (34%)] Loss: 2.307563
Train Epoch: 2 [21760/60000 (36%)] Loss: 2.307542
Train Epoch: 2 [23040/60000 (38%)] Loss: 2.301838
Train Epoch: 2 [24320/60000 (41%)] Loss: 2.299103
Train Epoch: 2 [25600/60000 (43%)] Loss: 2.303837
Train Epoch: 2 [26880/60000 (45%)] Loss: 2.295858
Train Epoch: 2 [28160/60000 (47%)] Loss: 2.301188
Train Epoch: 2 [29440/60000 (49%)] Loss: 2.300500
Train Epoch: 2 [30720/60000 (51%)] Loss: 2.302749
Train Epoch: 2 [32000/60000 (53%)] Loss: 2.306937
Train Epoch: 2 [33280/60000 (55%)] Loss: 2.295369
Train Epoch: 2 [34560/60000 (58%)] Loss: 2.293380
Train Epoch: 2 [35840/60000 (60%)] Loss: 2.302872
Train Epoch: 2 [37120/60000 (62%)] Loss: 2.296003
Train Epoch: 2 [38400/60000 (64%)] Loss: 2.293757
Train Epoch: 2 [39680/60000 (66%)] Loss: 2.302584
Train Epoch: 2 [40960/60000 (68%)] Loss: 2.304035
Train Epoch: 2 [42240/60000 (70%)] Loss: 2.302217
Train Epoch: 2 [43520/60000 (72%)] Loss: 2.306696
Train Epoch: 2 [44800/60000 (75%)] Loss: 2.305884
Train Epoch: 2 [46080/60000 (77%)] Loss: 2.293440
Train Epoch: 2 [47360/60000 (79%)] Loss: 2.299376
Train Epoch: 2 [48640/60000 (81%)] Loss: 2.302816
Train Epoch: 2 [49920/60000 (83%)] Loss: 2.307003
Train Epoch: 2 [51200/60000 (85%)] Loss: 2.305448
Train Epoch: 2 [52480/60000 (87%)] Loss: 2.300000
Train Epoch: 2 [53760/60000 (90%)] Loss: 2.300042
Train Epoch: 2 [55040/60000 (92%)] Loss: 2.302535
Train Epoch: 2 [56320/60000 (94%)] Loss: 2.302467
Train Epoch: 2 [57600/60000 (96%)] Loss: 2.295814
Train Epoch: 2 [58880/60000 (98%)] Loss: 2.297883
Test set: Average loss: 2.3012, Accuracy: 1135/10000 (11%)
Train Epoch: 3 [0/60000 (0%)] Loss: 2.294128
Train Epoch: 3 [1280/60000 (2%)] Loss: 2.304410
Train Epoch: 3 [2560/60000 (4%)] Loss: 2.300996
Train Epoch: 3 [3840/60000 (6%)] Loss: 2.296478
Train Epoch: 3 [5120/60000 (9%)] Loss: 2.298928
Train Epoch: 3 [6400/60000 (11%)] Loss: 2.304500
Train Epoch: 3 [7680/60000 (13%)] Loss: 2.296688
Train Epoch: 3 [8960/60000 (15%)] Loss: 2.298723
Train Epoch: 3 [10240/60000 (17%)] Loss: 2.297010
Train Epoch: 3 [11520/60000 (19%)] Loss: 2.300348
Train Epoch: 3 [12800/60000 (21%)] Loss: 2.303568
Train Epoch: 3 [14080/60000 (23%)] Loss: 2.310058
Train Epoch: 3 [15360/60000 (26%)] Loss: 2.303400
Train Epoch: 3 [16640/60000 (28%)] Loss: 2.294910
Train Epoch: 3 [17920/60000 (30%)] Loss: 2.297825
Train Epoch: 3 [19200/60000 (32%)] Loss: 2.299802
Train Epoch: 3 [20480/60000 (34%)] Loss: 2.307095
Train Epoch: 3 [21760/60000 (36%)] Loss: 2.307942
Train Epoch: 3 [23040/60000 (38%)] Loss: 2.300907
Train Epoch: 3 [24320/60000 (41%)] Loss: 2.298472
Train Epoch: 3 [25600/60000 (43%)] Loss: 2.305239
Train Epoch: 3 [26880/60000 (45%)] Loss: 2.296292
Train Epoch: 3 [28160/60000 (47%)] Loss: 2.302102
Train Epoch: 3 [29440/60000 (49%)] Loss: 2.300763
Train Epoch: 3 [30720/60000 (51%)] Loss: 2.302742
Train Epoch: 3 [32000/60000 (53%)] Loss: 2.307274
Train Epoch: 3 [33280/60000 (55%)] Loss: 2.294868
Train Epoch: 3 [34560/60000 (58%)] Loss: 2.292994
Train Epoch: 3 [35840/60000 (60%)] Loss: 2.302315
Train Epoch: 3 [37120/60000 (62%)] Loss: 2.296483
Train Epoch: 3 [38400/60000 (64%)] Loss: 2.293768
Train Epoch: 3 [39680/60000 (66%)] Loss: 2.301819
Train Epoch: 3 [40960/60000 (68%)] Loss: 2.305091
Train Epoch: 3 [42240/60000 (70%)] Loss: 2.301944
Train Epoch: 3 [43520/60000 (72%)] Loss: 2.306972
Train Epoch: 3 [44800/60000 (75%)] Loss: 2.306590
Train Epoch: 3 [46080/60000 (77%)] Loss: 2.294029
Train Epoch: 3 [47360/60000 (79%)] Loss: 2.299841
Train Epoch: 3 [48640/60000 (81%)] Loss: 2.303495
Train Epoch: 3 [49920/60000 (83%)] Loss: 2.307017
Train Epoch: 3 [51200/60000 (85%)] Loss: 2.305472
Train Epoch: 3 [52480/60000 (87%)] Loss: 2.300445
Train Epoch: 3 [53760/60000 (90%)] Loss: 2.299820
Train Epoch: 3 [55040/60000 (92%)] Loss: 2.302832
Train Epoch: 3 [56320/60000 (94%)] Loss: 2.301771
Train Epoch: 3 [57600/60000 (96%)] Loss: 2.295650
Train Epoch: 3 [58880/60000 (98%)] Loss: 2.297921
Test set: Average loss: 2.3011, Accuracy: 1135/10000 (11%)
Train Epoch: 4 [0/60000 (0%)] Loss: 2.293454
Train Epoch: 4 [1280/60000 (2%)] Loss: 2.304861
Train Epoch: 4 [2560/60000 (4%)] Loss: 2.300608
Train Epoch: 4 [3840/60000 (6%)] Loss: 2.296399
Train Epoch: 4 [5120/60000 (9%)] Loss: 2.298568
Train Epoch: 4 [6400/60000 (11%)] Loss: 2.305020
Train Epoch: 4 [7680/60000 (13%)] Loss: 2.296818
Train Epoch: 4 [8960/60000 (15%)] Loss: 2.298520
Train Epoch: 4 [10240/60000 (17%)] Loss: 2.297254
Train Epoch: 4 [11520/60000 (19%)] Loss: 2.301020
Train Epoch: 4 [12800/60000 (21%)] Loss: 2.302528
Train Epoch: 4 [14080/60000 (23%)] Loss: 2.309703
Train Epoch: 4 [15360/60000 (26%)] Loss: 2.302992
Train Epoch: 4 [16640/60000 (28%)] Loss: 2.294273
Train Epoch: 4 [17920/60000 (30%)] Loss: 2.297176
Train Epoch: 4 [19200/60000 (32%)] Loss: 2.299407
Train Epoch: 4 [20480/60000 (34%)] Loss: 2.307258
Train Epoch: 4 [21760/60000 (36%)] Loss: 2.307357
Train Epoch: 4 [23040/60000 (38%)] Loss: 2.301389
Train Epoch: 4 [24320/60000 (41%)] Loss: 2.298829
Train Epoch: 4 [25600/60000 (43%)] Loss: 2.304840
Train Epoch: 4 [26880/60000 (45%)] Loss: 2.296466
Train Epoch: 4 [28160/60000 (47%)] Loss: 2.301988
Train Epoch: 4 [29440/60000 (49%)] Loss: 2.300320
Train Epoch: 4 [30720/60000 (51%)] Loss: 2.302351
Train Epoch: 4 [32000/60000 (53%)] Loss: 2.306984
Train Epoch: 4 [33280/60000 (55%)] Loss: 2.295238
Train Epoch: 4 [34560/60000 (58%)] Loss: 2.292788
Train Epoch: 4 [35840/60000 (60%)] Loss: 2.301318
Train Epoch: 4 [37120/60000 (62%)] Loss: 2.296157
Train Epoch: 4 [38400/60000 (64%)] Loss: 2.293370
Train Epoch: 4 [39680/60000 (66%)] Loss: 2.300601
Train Epoch: 4 [40960/60000 (68%)] Loss: 2.281593
Train Epoch: 4 [42240/60000 (70%)] Loss: 2.300054
Train Epoch: 4 [43520/60000 (72%)] Loss: 2.299062
Train Epoch: 4 [44800/60000 (75%)] Loss: 2.306527
Train Epoch: 4 [46080/60000 (77%)] Loss: 2.292625
Train Epoch: 4 [47360/60000 (79%)] Loss: 2.289883
Train Epoch: 4 [48640/60000 (81%)] Loss: 2.295331
Train Epoch: 4 [49920/60000 (83%)] Loss: 2.302386
Train Epoch: 4 [51200/60000 (85%)] Loss: 2.302122
Train Epoch: 4 [52480/60000 (87%)] Loss: 2.306089
Train Epoch: 4 [53760/60000 (90%)] Loss: 2.265780
Train Epoch: 4 [55040/60000 (92%)] Loss: 2.274360
Train Epoch: 4 [56320/60000 (94%)] Loss: 2.267254
Train Epoch: 4 [57600/60000 (96%)] Loss: 2.287991
Train Epoch: 4 [58880/60000 (98%)] Loss: 2.292989
Test set: Average loss: 2.2537, Accuracy: 1741/10000 (17%)
Train Epoch: 5 [0/60000 (0%)] Loss: 2.268983
Train Epoch: 5 [1280/60000 (2%)] Loss: 2.305036
Train Epoch: 5 [2560/60000 (4%)] Loss: 2.246490
Train Epoch: 5 [3840/60000 (6%)] Loss: 2.272037
Train Epoch: 5 [5120/60000 (9%)] Loss: 2.291932
Train Epoch: 5 [6400/60000 (11%)] Loss: 2.291811
Train Epoch: 5 [7680/60000 (13%)] Loss: 2.273647
Train Epoch: 5 [8960/60000 (15%)] Loss: 2.233832
Train Epoch: 5 [10240/60000 (17%)] Loss: 2.274340
Train Epoch: 5 [11520/60000 (19%)] Loss: 2.256846
Train Epoch: 5 [12800/60000 (21%)] Loss: 2.270750
Train Epoch: 5 [14080/60000 (23%)] Loss: 2.275166
Train Epoch: 5 [15360/60000 (26%)] Loss: 2.253652
Train Epoch: 5 [16640/60000 (28%)] Loss: 2.249290
Train Epoch: 5 [17920/60000 (30%)] Loss: 2.270487
Train Epoch: 5 [19200/60000 (32%)] Loss: 2.224527
Train Epoch: 5 [20480/60000 (34%)] Loss: 2.271551
Train Epoch: 5 [21760/60000 (36%)] Loss: 2.276472
Train Epoch: 5 [23040/60000 (38%)] Loss: 2.285954
Train Epoch: 5 [24320/60000 (41%)] Loss: 2.228102
Train Epoch: 5 [25600/60000 (43%)] Loss: 2.269305
Train Epoch: 5 [26880/60000 (45%)] Loss: 2.254030
Train Epoch: 5 [28160/60000 (47%)] Loss: 2.248622
Train Epoch: 5 [29440/60000 (49%)] Loss: 2.246872
Train Epoch: 5 [30720/60000 (51%)] Loss: 2.257539
Train Epoch: 5 [32000/60000 (53%)] Loss: 2.276368
Train Epoch: 5 [33280/60000 (55%)] Loss: 2.244905
Train Epoch: 5 [34560/60000 (58%)] Loss: 2.229979
Train Epoch: 5 [35840/60000 (60%)] Loss: 2.275868
Train Epoch: 5 [37120/60000 (62%)] Loss: 2.244531
Train Epoch: 5 [38400/60000 (64%)] Loss: 2.240904
Train Epoch: 5 [39680/60000 (66%)] Loss: 2.285148
Train Epoch: 5 [40960/60000 (68%)] Loss: 2.211104
Train Epoch: 5 [42240/60000 (70%)] Loss: 2.250556
Train Epoch: 5 [43520/60000 (72%)] Loss: 2.235474
Train Epoch: 5 [44800/60000 (75%)] Loss: 2.283053
Train Epoch: 5 [46080/60000 (77%)] Loss: 2.213004
Train Epoch: 5 [47360/60000 (79%)] Loss: 2.242662
Train Epoch: 5 [48640/60000 (81%)] Loss: 2.293972
Train Epoch: 5 [49920/60000 (83%)] Loss: 2.305880
Train Epoch: 5 [51200/60000 (85%)] Loss: 2.225832
Train Epoch: 5 [52480/60000 (87%)] Loss: 2.252605
Train Epoch: 5 [53760/60000 (90%)] Loss: 2.202726
Train Epoch: 5 [55040/60000 (92%)] Loss: 2.202342
Train Epoch: 5 [56320/60000 (94%)] Loss: 2.230510
Train Epoch: 5 [57600/60000 (96%)] Loss: 2.251595
Train Epoch: 5 [58880/60000 (98%)] Loss: 2.203939
Test set: Average loss: 2.1806, Accuracy: 1903/10000 (19%)
Train Epoch: 6 [0/60000 (0%)] Loss: 2.180199
Train Epoch: 6 [1280/60000 (2%)] Loss: 2.264313
Train Epoch: 6 [2560/60000 (4%)] Loss: 2.207158
Train Epoch: 6 [3840/60000 (6%)] Loss: 2.188053
Train Epoch: 6 [5120/60000 (9%)] Loss: 2.157844
Train Epoch: 6 [6400/60000 (11%)] Loss: 2.186026
Train Epoch: 6 [7680/60000 (13%)] Loss: 2.184875
Train Epoch: 6 [8960/60000 (15%)] Loss: 2.174940
Train Epoch: 6 [10240/60000 (17%)] Loss: 2.219181
Train Epoch: 6 [11520/60000 (19%)] Loss: 2.217708
Train Epoch: 6 [12800/60000 (21%)] Loss: 2.192343
Train Epoch: 6 [14080/60000 (23%)] Loss: 2.224646
Train Epoch: 6 [15360/60000 (26%)] Loss: 2.179729
Train Epoch: 6 [16640/60000 (28%)] Loss: 2.170908
Train Epoch: 6 [17920/60000 (30%)] Loss: 2.244955
Train Epoch: 6 [19200/60000 (32%)] Loss: 2.189663
Train Epoch: 6 [20480/60000 (34%)] Loss: 2.214189
Train Epoch: 6 [21760/60000 (36%)] Loss: 2.219710
Train Epoch: 6 [23040/60000 (38%)] Loss: 2.272699
Train Epoch: 6 [24320/60000 (41%)] Loss: 2.178215
Train Epoch: 6 [25600/60000 (43%)] Loss: 2.262794
Train Epoch: 6 [26880/60000 (45%)] Loss: 2.152906
Train Epoch: 6 [28160/60000 (47%)] Loss: 2.215632
Train Epoch: 6 [29440/60000 (49%)] Loss: 2.165996
Train Epoch: 6 [30720/60000 (51%)] Loss: 2.225581
Train Epoch: 6 [32000/60000 (53%)] Loss: 2.215235
Train Epoch: 6 [33280/60000 (55%)] Loss: 2.155055
Train Epoch: 6 [34560/60000 (58%)] Loss: 2.188354
Train Epoch: 6 [35840/60000 (60%)] Loss: 2.124085
Train Epoch: 6 [37120/60000 (62%)] Loss: 2.204495
Train Epoch: 6 [38400/60000 (64%)] Loss: 2.119761
Train Epoch: 6 [39680/60000 (66%)] Loss: 2.107212
Train Epoch: 6 [40960/60000 (68%)] Loss: 2.061167
Train Epoch: 6 [42240/60000 (70%)] Loss: 2.165369
Train Epoch: 6 [43520/60000 (72%)] Loss: 2.123863
Train Epoch: 6 [44800/60000 (75%)] Loss: 2.134081
Train Epoch: 6 [46080/60000 (77%)] Loss: 2.058522
Train Epoch: 6 [47360/60000 (79%)] Loss: 2.100058
Train Epoch: 6 [48640/60000 (81%)] Loss: 2.156971
Train Epoch: 6 [49920/60000 (83%)] Loss: 2.043832
Train Epoch: 6 [51200/60000 (85%)] Loss: 2.051765
Train Epoch: 6 [52480/60000 (87%)] Loss: 2.059500
Train Epoch: 6 [53760/60000 (90%)] Loss: 2.068943
Train Epoch: 6 [55040/60000 (92%)] Loss: 2.075705
Train Epoch: 6 [56320/60000 (94%)] Loss: 2.040582
Train Epoch: 6 [57600/60000 (96%)] Loss: 2.083462
Train Epoch: 6 [58880/60000 (98%)] Loss: 1.952212
Test set: Average loss: 1.9597, Accuracy: 2331/10000 (23%)
Train Epoch: 7 [0/60000 (0%)] Loss: 2.090027
Train Epoch: 7 [1280/60000 (2%)] Loss: 2.026432
Train Epoch: 7 [2560/60000 (4%)] Loss: 1.999249
Train Epoch: 7 [3840/60000 (6%)] Loss: 2.018092
Train Epoch: 7 [5120/60000 (9%)] Loss: 2.016849
Train Epoch: 7 [6400/60000 (11%)] Loss: 1.943704
Train Epoch: 7 [7680/60000 (13%)] Loss: 1.967701
Train Epoch: 7 [8960/60000 (15%)] Loss: 1.945909
Train Epoch: 7 [10240/60000 (17%)] Loss: 1.980906
Train Epoch: 7 [11520/60000 (19%)] Loss: 1.967740
Train Epoch: 7 [12800/60000 (21%)] Loss: 1.914054
Train Epoch: 7 [14080/60000 (23%)] Loss: 1.976711
Train Epoch: 7 [15360/60000 (26%)] Loss: 2.003543
Train Epoch: 7 [16640/60000 (28%)] Loss: 1.861521
Train Epoch: 7 [17920/60000 (30%)] Loss: 1.978965
Train Epoch: 7 [19200/60000 (32%)] Loss: 1.968214
Train Epoch: 7 [20480/60000 (34%)] Loss: 1.992969
Train Epoch: 7 [21760/60000 (36%)] Loss: 1.949283
Train Epoch: 7 [23040/60000 (38%)] Loss: 1.971691
Train Epoch: 7 [24320/60000 (41%)] Loss: 1.908140
Train Epoch: 7 [25600/60000 (43%)] Loss: 1.902956
Train Epoch: 7 [26880/60000 (45%)] Loss: 1.833011
Train Epoch: 7 [28160/60000 (47%)] Loss: 1.901236
Train Epoch: 7 [29440/60000 (49%)] Loss: 1.940375
Train Epoch: 7 [30720/60000 (51%)] Loss: 1.964390
Train Epoch: 7 [32000/60000 (53%)] Loss: 1.984134
Train Epoch: 7 [33280/60000 (55%)] Loss: 1.919427
Train Epoch: 7 [34560/60000 (58%)] Loss: 1.977266
Train Epoch: 7 [35840/60000 (60%)] Loss: 1.923299
Train Epoch: 7 [37120/60000 (62%)] Loss: 1.912774
Train Epoch: 7 [38400/60000 (64%)] Loss: 1.873731
Train Epoch: 7 [39680/60000 (66%)] Loss: 1.855098
Train Epoch: 7 [40960/60000 (68%)] Loss: 1.903855
Train Epoch: 7 [42240/60000 (70%)] Loss: 1.881510
Train Epoch: 7 [43520/60000 (72%)] Loss: 1.946360
Train Epoch: 7 [44800/60000 (75%)] Loss: 1.942104
Train Epoch: 7 [46080/60000 (77%)] Loss: 1.800483
Train Epoch: 7 [47360/60000 (79%)] Loss: 1.918265
Train Epoch: 7 [48640/60000 (81%)] Loss: 1.852508
Train Epoch: 7 [49920/60000 (83%)] Loss: 1.835838
Train Epoch: 7 [51200/60000 (85%)] Loss: 1.891507
Train Epoch: 7 [52480/60000 (87%)] Loss: 1.811868
Train Epoch: 7 [53760/60000 (90%)] Loss: 1.864721
Train Epoch: 7 [55040/60000 (92%)] Loss: 1.866868
Train Epoch: 7 [56320/60000 (94%)] Loss: 1.865902
Train Epoch: 7 [57600/60000 (96%)] Loss: 1.950668
Train Epoch: 7 [58880/60000 (98%)] Loss: 1.795573
Test set: Average loss: 1.7925, Accuracy: 2907/10000 (29%)
Train Epoch: 8 [0/60000 (0%)] Loss: 1.929929
Train Epoch: 8 [1280/60000 (2%)] Loss: 2.024763
Train Epoch: 8 [2560/60000 (4%)] Loss: 1.858009
Train Epoch: 8 [3840/60000 (6%)] Loss: 1.884810
Train Epoch: 8 [5120/60000 (9%)] Loss: 1.897856
Train Epoch: 8 [6400/60000 (11%)] Loss: 1.841636
Train Epoch: 8 [7680/60000 (13%)] Loss: 1.838920
Train Epoch: 8 [8960/60000 (15%)] Loss: 1.750095
Train Epoch: 8 [10240/60000 (17%)] Loss: 1.856052
Train Epoch: 8 [11520/60000 (19%)] Loss: 1.850692
Train Epoch: 8 [12800/60000 (21%)] Loss: 1.784122
Train Epoch: 8 [14080/60000 (23%)] Loss: 1.867140
Train Epoch: 8 [15360/60000 (26%)] Loss: 1.900107
Train Epoch: 8 [16640/60000 (28%)] Loss: 1.702633
Train Epoch: 8 [17920/60000 (30%)] Loss: 1.842601
Train Epoch: 8 [19200/60000 (32%)] Loss: 1.889099
Train Epoch: 8 [20480/60000 (34%)] Loss: 1.823684
Train Epoch: 8 [21760/60000 (36%)] Loss: 1.803875
Train Epoch: 8 [23040/60000 (38%)] Loss: 1.811653
Train Epoch: 8 [24320/60000 (41%)] Loss: 1.852651
Train Epoch: 8 [25600/60000 (43%)] Loss: 1.790305
Train Epoch: 8 [26880/60000 (45%)] Loss: 1.703516
Train Epoch: 8 [28160/60000 (47%)] Loss: 1.900053
Train Epoch: 8 [29440/60000 (49%)] Loss: 1.840437
Train Epoch: 8 [30720/60000 (51%)] Loss: 1.813079
Train Epoch: 8 [32000/60000 (53%)] Loss: 2.095883
Train Epoch: 8 [33280/60000 (55%)] Loss: 1.870054
Train Epoch: 8 [34560/60000 (58%)] Loss: 1.883030
Train Epoch: 8 [35840/60000 (60%)] Loss: 1.844843
Train Epoch: 8 [37120/60000 (62%)] Loss: 1.875854
Train Epoch: 8 [38400/60000 (64%)] Loss: 1.733647
Train Epoch: 8 [39680/60000 (66%)] Loss: 1.751632
Train Epoch: 8 [40960/60000 (68%)] Loss: 1.726646
Train Epoch: 8 [42240/60000 (70%)] Loss: 1.843401
Train Epoch: 8 [43520/60000 (72%)] Loss: 1.857497
Train Epoch: 8 [44800/60000 (75%)] Loss: 1.807124
Train Epoch: 8 [46080/60000 (77%)] Loss: 1.757513
Train Epoch: 8 [47360/60000 (79%)] Loss: 1.819974
Train Epoch: 8 [48640/60000 (81%)] Loss: 1.798023
Train Epoch: 8 [49920/60000 (83%)] Loss: 1.750340
Train Epoch: 8 [51200/60000 (85%)] Loss: 1.900196
Train Epoch: 8 [52480/60000 (87%)] Loss: 1.776257
Train Epoch: 8 [53760/60000 (90%)] Loss: 1.759554
Train Epoch: 8 [55040/60000 (92%)] Loss: 1.777745
Train Epoch: 8 [56320/60000 (94%)] Loss: 1.745965
Train Epoch: 8 [57600/60000 (96%)] Loss: 1.943156
Train Epoch: 8 [58880/60000 (98%)] Loss: 1.713140
Test set: Average loss: 1.7038, Accuracy: 3443/10000 (34%)
Train Epoch: 9 [0/60000 (0%)] Loss: 1.832337
Train Epoch: 9 [1280/60000 (2%)] Loss: 1.934987
Train Epoch: 9 [2560/60000 (4%)] Loss: 1.776938
Train Epoch: 9 [3840/60000 (6%)] Loss: 1.777745
Train Epoch: 9 [5120/60000 (9%)] Loss: 1.826795
Train Epoch: 9 [6400/60000 (11%)] Loss: 1.691161
Train Epoch: 9 [7680/60000 (13%)] Loss: 1.770931
Train Epoch: 9 [8960/60000 (15%)] Loss: 1.693899
Train Epoch: 9 [10240/60000 (17%)] Loss: 1.749530
Train Epoch: 9 [11520/60000 (19%)] Loss: 1.759076
Train Epoch: 9 [12800/60000 (21%)] Loss: 1.757854
Train Epoch: 9 [14080/60000 (23%)] Loss: 1.890249
Train Epoch: 9 [15360/60000 (26%)] Loss: 1.774607
Train Epoch: 9 [16640/60000 (28%)] Loss: 1.621988
Train Epoch: 9 [17920/60000 (30%)] Loss: 1.856874
Train Epoch: 9 [19200/60000 (32%)] Loss: 1.741723
Train Epoch: 9 [20480/60000 (34%)] Loss: 1.798806
Train Epoch: 9 [21760/60000 (36%)] Loss: 1.680563
Train Epoch: 9 [23040/60000 (38%)] Loss: 1.815430
Train Epoch: 9 [24320/60000 (41%)] Loss: 1.828471
Train Epoch: 9 [25600/60000 (43%)] Loss: 1.709843
Train Epoch: 9 [26880/60000 (45%)] Loss: 1.701019
Train Epoch: 9 [28160/60000 (47%)] Loss: 1.854627
Train Epoch: 9 [29440/60000 (49%)] Loss: 1.753559
Train Epoch: 9 [30720/60000 (51%)] Loss: 1.737419
Train Epoch: 9 [32000/60000 (53%)] Loss: 2.170925
Train Epoch: 9 [33280/60000 (55%)] Loss: 1.822912
Train Epoch: 9 [34560/60000 (58%)] Loss: 1.811824
Train Epoch: 9 [35840/60000 (60%)] Loss: 1.734667
Train Epoch: 9 [37120/60000 (62%)] Loss: 1.755970
Train Epoch: 9 [38400/60000 (64%)] Loss: 1.713913
Train Epoch: 9 [39680/60000 (66%)] Loss: 1.651311
Train Epoch: 9 [40960/60000 (68%)] Loss: 1.711650
Train Epoch: 9 [42240/60000 (70%)] Loss: 1.684796
Train Epoch: 9 [43520/60000 (72%)] Loss: 1.717655
Train Epoch: 9 [44800/60000 (75%)] Loss: 1.795536
Train Epoch: 9 [46080/60000 (77%)] Loss: 1.712079
Train Epoch: 9 [47360/60000 (79%)] Loss: 1.770020
Train Epoch: 9 [48640/60000 (81%)] Loss: 1.713666
Train Epoch: 9 [49920/60000 (83%)] Loss: 1.722717
Train Epoch: 9 [51200/60000 (85%)] Loss: 1.841270
Train Epoch: 9 [52480/60000 (87%)] Loss: 1.722087
Train Epoch: 9 [53760/60000 (90%)] Loss: 1.730507
Train Epoch: 9 [55040/60000 (92%)] Loss: 1.745938
Train Epoch: 9 [56320/60000 (94%)] Loss: 1.650783
Train Epoch: 9 [57600/60000 (96%)] Loss: 1.865296
Train Epoch: 9 [58880/60000 (98%)] Loss: 1.655864
Test set: Average loss: 1.6531, Accuracy: 3591/10000 (36%)
Train Epoch: 10 [0/60000 (0%)] Loss: 1.807748
Train Epoch: 10 [1280/60000 (2%)] Loss: 1.920595
Train Epoch: 10 [2560/60000 (4%)] Loss: 1.754044
Train Epoch: 10 [3840/60000 (6%)] Loss: 1.765863
Train Epoch: 10 [5120/60000 (9%)] Loss: 1.765213
Train Epoch: 10 [6400/60000 (11%)] Loss: 1.674707
Train Epoch: 10 [7680/60000 (13%)] Loss: 1.697510
Train Epoch: 10 [8960/60000 (15%)] Loss: 1.686634
Train Epoch: 10 [10240/60000 (17%)] Loss: 1.717700
Train Epoch: 10 [11520/60000 (19%)] Loss: 1.788128
Train Epoch: 10 [12800/60000 (21%)] Loss: 1.727557
Train Epoch: 10 [14080/60000 (23%)] Loss: 1.813594
Train Epoch: 10 [15360/60000 (26%)] Loss: 1.765687
Train Epoch: 10 [16640/60000 (28%)] Loss: 1.595699
Train Epoch: 10 [17920/60000 (30%)] Loss: 1.731486
Train Epoch: 10 [19200/60000 (32%)] Loss: 1.788696
Train Epoch: 10 [20480/60000 (34%)] Loss: 1.761674
Train Epoch: 10 [21760/60000 (36%)] Loss: 1.691835
Train Epoch: 10 [23040/60000 (38%)] Loss: 1.745392
Train Epoch: 10 [24320/60000 (41%)] Loss: 1.944455
Train Epoch: 10 [25600/60000 (43%)] Loss: 1.613760
Train Epoch: 10 [26880/60000 (45%)] Loss: 1.605459
Train Epoch: 10 [28160/60000 (47%)] Loss: 1.832196
Train Epoch: 10 [29440/60000 (49%)] Loss: 1.692135
Train Epoch: 10 [30720/60000 (51%)] Loss: 1.743709
Train Epoch: 10 [32000/60000 (53%)] Loss: 2.153744
Train Epoch: 10 [33280/60000 (55%)] Loss: 1.711511
Train Epoch: 10 [34560/60000 (58%)] Loss: 1.777244
Train Epoch: 10 [35840/60000 (60%)] Loss: 1.685776
Train Epoch: 10 [37120/60000 (62%)] Loss: 1.783445
Train Epoch: 10 [38400/60000 (64%)] Loss: 1.682252
Train Epoch: 10 [39680/60000 (66%)] Loss: 1.643781
Train Epoch: 10 [40960/60000 (68%)] Loss: 1.742568
Train Epoch: 10 [42240/60000 (70%)] Loss: 1.736070
Train Epoch: 10 [43520/60000 (72%)] Loss: 1.701879
Train Epoch: 10 [44800/60000 (75%)] Loss: 1.681437
Train Epoch: 10 [46080/60000 (77%)] Loss: 1.672221
Train Epoch: 10 [47360/60000 (79%)] Loss: 1.758867
Train Epoch: 10 [48640/60000 (81%)] Loss: 1.683032
Train Epoch: 10 [49920/60000 (83%)] Loss: 1.650964
Train Epoch: 10 [51200/60000 (85%)] Loss: 1.787226
Train Epoch: 10 [52480/60000 (87%)] Loss: 1.647728
Train Epoch: 10 [53760/60000 (90%)] Loss: 1.676250
Train Epoch: 10 [55040/60000 (92%)] Loss: 1.719211
Train Epoch: 10 [56320/60000 (94%)] Loss: 1.689755
Train Epoch: 10 [57600/60000 (96%)] Loss: 1.848105
Train Epoch: 10 [58880/60000 (98%)] Loss: 1.669609
Test set: Average loss: 1.6174, Accuracy: 3749/10000 (37%)
What is the accuracy of the model on the test set?
Solution
The accuracy of the model on the test set can be found in the output of the script. With 10 epochs and the arguments set in the example, the accuracy was around 37%. It will vary not only with the number of epochs, batch size, and learning rate, but also with how the model is initialized.
How does the model perform with different hyperparameters?
Solution
The model’s performance can vary significantly with different hyperparameters such as the number of epochs, batch size, and learning rate. For example, increasing the number of epochs generally leads to better accuracy, but it may also lead to overfitting if the model is trained for too long. Similarly, adjusting the batch size can affect the convergence speed and stability of the training process. The learning rate is crucial for determining how quickly the model learns; a learning rate that is too high may cause the model to diverge, while one that is too low may result in slow convergence. You can explore how these parameters affect the model’s performance by running the script with different values and observing the changes in accuracy and loss.
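If you want to explore this systematically, a small sketch like the one below (hypothetical, not part of the lesson scripts) runs mnist.py several times with different learning rates and batch sizes; it assumes mnist.py sits in the current directory:
# Hypothetical hyperparameter sweep; adjust the values and number of epochs as needed.
import subprocess

for lr in ["0.1", "0.5", "1.0"]:
    for batch_size in ["64", "128"]:
        print(f"--- lr={lr}, batch size={batch_size} ---")
        subprocess.run(["python", "mnist.py", "--epochs", "5",
                        "--batch-size", batch_size, "--lr", lr], check=True)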
What are some ways to improve the model’s performance?
Solution
There are several ways to improve the model’s performance, including:
- Increasing the number of epochs to allow the model to learn more from the training data.
- Adjusting the learning rate to find a balance between convergence speed and stability.
- Using data augmentation techniques to increase the diversity of the training data (a sketch follows this list).
- Implementing regularization techniques such as dropout or weight decay to prevent overfitting.
- Experimenting with different architectures or adding more layers to the neural network.
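For instance, data augmentation can be added by extending the training transform. The sketch below is illustrative only; the specific transforms (RandomRotation, RandomAffine) are our own choice and are not part of the original mnist.py:
# Illustrative augmented training transform (the test set should keep the original transform).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(10),                     # rotate by up to +/- 10 degrees
    transforms.RandomAffine(0, translate=(0.1, 0.1)),  # small random shifts
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])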
How long does it take to run the script?
Solution
The time it takes to run the script can vary depending on the hardware used, the number of epochs, and the batch size. In the example provided, it took approximately 15 minutes to run the script with 10 epochs, a batch size of 128, and a learning rate of 1.0 on CPU-only resources. You can measure the time taken by the script with the time command in the terminal, for example: time python mnist.py --epochs 10 --batch-size 128 --lr 1.0.
This time can be a reference for how long it takes to run the script with different hyperparameters. For example, if you increase the number of epochs to 20, you can expect the script to take approximately twice as long to run, assuming other parameters remain constant. If you have access to a GPU, the time taken to run the script can be significantly reduced, as GPUs are optimized for parallel processing and can handle large matrix operations more efficiently than CPUs.
Running on the DRAC resources
To run Python code on the DRAC resources, you need to schedule a job using the sbatch command. A series of steps is required to set up the environment, install the required packages, make data available efficiently, and run the code. If your code is self-contained and does not require additional packages, a simple sbatch script is enough to run it. In this example, however, both packages and data need to be available in the environment before we can run the code. Rather than simply providing a set of ready-made scripts, this section gives a general overview of the steps required to run the code on the DRAC resources, along with additional commands and steps that will be useful in your own work. To follow along, please work through all the subsections in order.
Downloading the dataset
We need to download the dataset to DRAC storage. We already downloaded it in the local environment, so you can either copy the contents of your local data folder or download it again from the source. In this example, we will download it into the data folder. To avoid downloading it twice, we check whether the archive already exists and download it only if it does not:
[ -e $HOME/data/MNIST.tar.gz ] || wget -P $HOME/data/ www.di.ens.fr/~lelarge/MNIST.tar.gz
Notice that the MNIST.tar.gz file is downloaded to the $HOME/data folder. We can check the contents by listing the files in the folder:
ls $HOME/data/
For this example, we could extract the contents of the archive file inside the $HOME/data folder (which lives on the Home storage) and read it directly from the compute nodes. However, it is good practice to copy large data files to the node-local storage and then read individual files from there. This speeds up reading the data, as the local storage is faster than the shared filesystem, and avoids slowing down the DRAC network. This is even more important when the whole dataset is not loaded entirely in memory: files are constantly read from storage as they are required. We will do these steps in the sbatch script.
Test existing script interactively
Before submitting a long-running job, it is a good idea to test the script interactively. This will allow you to see whether the script works as expected and whether there are any issues with the environment. To do so, you first need to log in to one of the DRAC clusters. Once logged in, you will have access to a terminal. Since we are using Python, we will need to load a python module.
Allocate the resources
To allocate the resources and run the script interactively, we can use the salloc command. This command will allocate the requested resources and provide you with a terminal in which to run your script interactively. The command below will allocate the resources for 1 hour, using your account def-$USER. For this example, let’s request 16 CPUs with the --cpus-per-task option.
salloc --time=01:00:00 --account=def-$USER --cpus-per-task=16
Loading required modules
Since we are running a python script, we need to load the python module. To list the available python modules, you can use the following command:
module avail python
For this example, we will load the python/3.11.5 module. Make sure that the module you intend to use is available on the system and supports the required packages. To load the module, you can use the following command:
module load python/3.11.5
Questions
How to list the python modules available in the DRAC resources?
Solution
To list the available python modules, you can use the module avail python command. This will display a list of available python modules that you can load and use in your scripts.
How to load a specific python module?
Solution
To load a specific python module, you can use the module load python/<version> command. Replace <version> with the version of the python module you want to load. This will make the specified python version available for use in your scripts.
Exercises
Load the cuda and cudnn modules (they will be used later when running on a GPU).
module load cuda cudnn
Prepare the virtual environment
We will create a virtual environment to install the packages and run our script. To create a virtual environment, you can use the command virtualenv --no-download <env_name>. Let’s create a virtual environment called .venv:
virtualenv --no-download $SLURM_TMPDIR/.venv
This will create a virtual environment inside $SLURM_TMPDIR (the node-local storage). To activate the virtual environment, you can use the following command:
source $SLURM_TMPDIR/.venv/bin/activate
This will activate the virtual environment and allow you to install packages without affecting the system-wide installation. To list the installed packages, you can use the command pip list, which will produce something similar to:
Package Version
---------- --------------------
pip 25.0.1
setuptools 68.0.0
wheel 0.45.1+computecanada
For this code, we also need to install the torch and torchvision packages. You can check whether a package is available using the avail_wheels <package> command. For example, to check if the torch package is available, you can use the following command:
avail_wheels torch
Which will produce an output similar to:
name version python arch
------ --------- -------- ---------
torch 2.6.0 cp311 x86-64-v3
It is also possible to verify if a package is available for a specific python version. For example, to check if the torchvision package is available for python3.11, you can use the following command:
avail_wheels -p 3.11 -n torchvision
To install the torch package, you can use the pip install --no-index <package> command:
pip install --no-index torch
Questions
List the packages in your virtual environment. What is the difference compared with the earlier listing? Is torch available now?
Solution
pip list
Possible output:
Package Version
----------------- ----------------------
filelock 3.18.0+computecanada
fsspec 2025.3.0+computecanada
jinja2 3.1.6+computecanada
MarkupSafe 2.1.5+computecanada
mpmath 1.3.0+computecanada
networkx 3.4.2+computecanada
pip 25.0.1
setuptools 68.0.0
sympy 1.13.1+computecanada
torch 2.6.0+computecanada
typing_extensions 4.12.2+computecanada
wheel 0.45.1+computecanada
How to check if a specific package is available in the DRAC resources?
Solution
To check if a specific package is available in the DRAC resources, you can use the avail_wheels <package> command. Replace <package> with the name of the package you want to check. This will display information about the package, including its name, version, and compatibility with different python versions.
How can you check if the torchvision package is available for python3.11?
Solution
To check if the torchvision package is available for python3.11, you can use the avail_wheels -p 3.11 -n torchvision command. This will display information about the torchvision package, including its name, version, and compatibility with the requested python version.
Exercises
Install the torchvision package.
Solution
pip install --no-index torchvision
Running the script with allocated resources
To run the Python script, we require at least the torch and torchvision packages. Make sure you did the exercise in the previous section so that your environment is ready to run the code. With all required packages installed in the newly created environment on the allocated resources, we can transfer the data to the local storage:
#create the directory
mkdir $SLURM_TMPDIR/data
# copy the data to the local storage
cp $HOME/data/MNIST.tar.gz $SLURM_TMPDIR/data/
# extract the data
tar -zxvf $SLURM_TMPDIR/data/MNIST.tar.gz -C $SLURM_TMPDIR/data
You can verify that the MNIST folder is now available in the $SLURM_TMPDIR/data folder with:
ls $SLURM_TMPDIR/data
At this stage, you can install missing packages with the pip install --no-index <package> command. You can also run python directly in the terminal with the command python. Once you are done with the interactive session, you can exit using the exit command.
Now that the data is available in the local storage, we can run the mnist script. To do so, let’s set the number of --epochs to 1:
python mnist.py --epochs 1
You can also monitor the resources while the code above is running. To do so, attach a new terminal to the running job using the following command:
srun --jobid=<the-job-id> --pty bash
In the new bash terminal, you can monitor the resources using the htop command. This will show you the resources used by the job, including the CPU and memory usage. You can also use the top command.
Exercises
Run the mnist.py script interactively and monitor the resources while the script is running.
Solution
To run the MNIST script, you can use the command below. At this point, you might want to limit the script’s running time by setting the number of epochs to 1, and test different batch sizes to see what fits in memory during training.
python mnist.py
or with extra options:
python mnist.py --epochs 1 --batch-size 8
To monitor the resources while running the code above, you can attach a new terminal to the running job using the srun command. This will allow you to run commands in the new terminal while the job keeps running in the background.
srun --jobid=<the-job-id> --pty bash
In the new bash terminal, you can monitor the resources using the htop command, which shows the CPU and memory usage of the job. You can also use the top command.
Questions
How to allocate resources and run a script interactively on the DRAC resources?
Solution
To allocate resources and run a script interactively on the DRAC resources, you can use the salloc command. This command will allocate the requested resources and provide you with a terminal to run your script interactively. Once the resources are allocated, you can run your script using the python command. When you are done, you can exit the interactive session using the exit command.
If you had a limited amount of memory to run the MNIST script and could use only 32 CPUs, what would be the optimum batch size?
Solution
To find the optimum batch size, you can run the MNIST script with different batch sizes and monitor the memory usage. You can use the --batch-size option to specify the batch size when running the script. For example, to run the script with a batch size of 16, you can use the following command:
python mnist.py --batch-size 16
For this specific example, the available memory can be large, and it might even be possible to fit all samples in a single batch. However, this is not necessarily useful, as the model may perform worse than if it were trained with a smaller batch size. There is a tradeoff between the batch size and the number of epochs: a smaller batch size typically requires more epochs to train the model but can allow the model to learn better, while a larger batch size requires fewer epochs but may make the model less accurate. The optimum batch size will depend on the specific problem you are trying to solve and the resources available.
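One practical way to probe this (a rough sketch of our own, not part of the lesson scripts) is to push a single fake MNIST-sized batch through the network at a few candidate batch sizes while watching the memory usage of the process, for example with htop in another terminal. It assumes mnist.py, which defines Net, is importable from the current directory:
# Rough batch-size probe: one forward/backward pass per candidate batch size.
import torch
import torch.nn.functional as F
from mnist import Net  # hypothetical import; requires mnist.py in the current directory

for batch_size in (16, 128, 1024):
    model = Net()
    x = torch.randn(batch_size, 1, 28, 28)        # fake MNIST-shaped batch
    target = torch.randint(0, 10, (batch_size,))
    loss = F.nll_loss(model(x), target)
    loss.backward()                               # activation memory grows with the batch size
    print(f"batch size {batch_size}: loss {loss.item():.4f}")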
Running MNIST as a sbatch script
In the previous section, we explored how to run the MNIST handwritten digit recognition system on DRAC resources interactively. The process of running the script interactively is useful for testing and debugging the code. However, for long-running jobs or jobs that require specific resources, it is more efficient to submit a job using the sbatch command. In this section, we will demonstrate how to run the MNIST handwritten digit recognition system as a sbatch script on the DRAC resources.
Running the MNIST handwritten digit recognition system as a sbatch script involves creating a script that specifies the required job configuration. The script is submitted with the sbatch command, which schedules the job on the DRAC resources. Execution is managed by the SLURM scheduler, which allocates the requested resources and monitors the job’s progress. As mentioned in previous sections, a few configurations are required to run the job on the DRAC resources, such as the account name and the number of CPUs. In addition, you need to prepare the environment and install the required packages. Finally, you add the command that runs your Python script to the sbatch script and submit the job using the sbatch command. The following sections will guide you through this process.
The updated MNIST script
The Python script used in the previous section loaded the data from a fixed location. It is good practice to copy the data to the compute nodes ($SLURM_TMPDIR), preferably as a few large files instead of many small files. In other words, if thousands of images are to be constantly read from disk, you should copy the files to $SLURM_TMPDIR and adapt your script to read them from there. While MNIST is not a very large dataset and could be read entirely into memory, we will copy it to the local storage to follow good practices and to have an example of how to properly copy files to the compute nodes.
The updated Python script includes an additional input argument (--data-dir), which defaults to data. When using the local storage, we could run the script with python drac-mnist.py --data-dir $SLURM_TMPDIR/data, assuming the saved script is named drac-mnist.py.
The updated script can be seen below, and can be found here. To use this script in the following sections, make sure to save it as drac-mnist.py.
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--no-mps', action='store_true', default=False,
                        help='disables macOS GPU training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--data-dir', type=str, default='data',
                        help='directory that contains the data')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    use_mps = not args.no_mps and torch.backends.mps.is_available()

    torch.manual_seed(args.seed)

    if use_cuda:
        device = torch.device("cuda")
    elif use_mps:
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    train_kwargs = {'batch_size': args.batch_size}
    test_kwargs = {'batch_size': args.test_batch_size}
    if use_cuda:
        cuda_kwargs = {'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset1 = datasets.MNIST(args.data_dir, train=True,
                              transform=transform)
    dataset2 = datasets.MNIST(args.data_dir, train=False,
                              transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")

if __name__ == '__main__':
    main()
The SBATCH script to submit the job
The script in this section is an example of a sbatch script that can be used to run the MNIST handwritten digit recognition system on the DRAC resources. The script specifies the configuration required for the job, such as the number of CPUs (--cpus-per-task), the memory required (--mem), and the time requested for the job (--time). It also loads the required modules and sets up the environment to run the script. The script then runs the MNIST handwritten digit recognition system using the python command. You can customize the script to fit your specific requirements and configurations.
Note that the account was not provided in the sbatch script. If you have more than one account, you must provide the account name when submitting the job with the sbatch command, or add it to the script as a directive (#SBATCH --account=def-<YOUR-USERNAME>). You can find more information about it in this link. Also note that the sbatch script accepts a number of keyword arguments, such as --epochs, --batch-size, --data-source-dir, and --script-source. These arguments are used to customize the behavior of the script and can be modified to fit your specific requirements.
The updated sbatch script can be seen below, and can be found here. To use this script in the following sections, make sure to save it as mnist-sbatch.sh. Notice that at the end of the script, the python script is given the argument --save-model, which will save the trained model in the directory from which the job was launched.
#!/bin/bash
#SBATCH --cpus-per-task=32 # Refer to cluster's documentation for the right CPU/GPU ratio
#SBATCH --mem=4800M # Memory proportional to GPUs: 32000 Cedar, 47000 Béluga, 64000 Graham.
#SBATCH --time=0-0:30 # DD-HH:MM:SS
set -ex
epochs=1
batch_size=8
datasourcedir=$HOME/data
script_source=$(pwd)/drac-mnist.py
# Parse command-line arguments
while [[ "$#" -gt 0 ]]; do
case $1 in
--epochs) epochs="$2"; shift ;;
--batch-size) batch_size="$2"; shift ;;
--data-source-dir|-d) datasourcedir="$2"; shift ;;
--script-source|-s) script_source="$2"; shift ;;
esac
shift
done
echo "Epochs: $epochs"
echo "Batch size: $batch_size"
echo "Data source directory: $datasourcedir"
echo "Script source: $script_source"
# Check if MNIST.tar.gz exists in $datasourcedir
if [ ! -e $datasourcedir/MNIST.tar.gz ]; then
    echo "MNIST.tar.gz not found in $datasourcedir. Please download the dataset with the following command:"
    echo "wget -P $datasourcedir www.di.ens.fr/~lelarge/MNIST.tar.gz"
    exit 1
fi
# Required for GPU use on Graham
module load StdEnv/2020 gcc/11.3.0
# Load Python 3.11.5 and CUDA
module load python cuda cudnn
# Prepare virtualenv
virtualenv --no-download $SLURM_TMPDIR/.env
# Activate virtualenv and install dependencies
source $SLURM_TMPDIR/.env/bin/activate
pip install --no-index --upgrade pip
pip install --no-index torch torchvision
# Prepare data directory in $SLURM_TMPDIR
[ -d $SLURM_TMPDIR/data ] || mkdir $SLURM_TMPDIR/data
# Copy MNIST.tar.gz to $SLURM_TMPDIR/data
cp $datasourcedir/MNIST.tar.gz $SLURM_TMPDIR/data/
# Extract MNIST.tar.gz
tar -zxvf $SLURM_TMPDIR/data/MNIST.tar.gz -C $SLURM_TMPDIR/data/
# Start training
python $script_source --data-dir $SLURM_TMPDIR/data --epochs $epochs --batch-size $batch_size --save-model
Running and monitoring the MNIST script using the sbatch script
To run the MNIST script using the sbatch script, you first need to download the dataset. Let’s download it to the $HOME/data folder:
wget -P $HOME/data www.di.ens.fr/~lelarge/MNIST.tar.gz
Once the dataset is downloaded, a minimal command to run the script (using the default values from the mnist-sbatch.sh script) would be:
sbatch --account=def-$USER mnist-sbatch.sh
The previous command will run the script using the default values. If you want to customize the behavior of the script, you can use the arguments provided in the sbatch script. For example, to run the script with 2 epochs and a batch size of 128, you can use the following command:
sbatch --account=def-$USER \
mnist-sbatch.sh \
--epochs 2 \
--batch-size 128
You can explore the sbatch options using the command sbatch --help. This will provide you with a list of available options that you can use to customize the behavior of the sbatch script. One useful option is --job-name, which names the job and helps you identify it in the queue and monitor its progress.
Let’s launch 4 jobs using the mnist-sbatch.sh script, providing a different number of CPUs (16, 32) and different batch sizes (128, 512), and giving each job a different name:
sbatch --account=def-$USER --cpus-per-task=16 --job-name="cpu-16-batch-128" mnist-sbatch.sh --epochs 1 --batch-size 128
sbatch --account=def-$USER --cpus-per-task=16 --job-name="cpu-16-batch-512" mnist-sbatch.sh --epochs 1 --batch-size 512
sbatch --account=def-$USER --cpus-per-task=32 --job-name="cpu-32-batch-128" mnist-sbatch.sh --epochs 1 --batch-size 128
sbatch --account=def-$USER --cpus-per-task=32 --job-name="cpu-32-batch-512" mnist-sbatch.sh --epochs 1 --batch-size 512
The previous commands will submit four jobs to the queue, each with a different configuration. You can monitor the progress of the jobs using the sq command, which displays a list of jobs in the queue, including their status, name, and other information. A typical output of the sq command is shown below:
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
28237737 someuser def-someuser_cpu cpu-16-batch-1 R 28:24 1 16 N/A 4800M gra82 (None)
28237738 someuser def-someuser_cpu cpu-16-batch-5 R 28:24 1 16 N/A 4800M gra119 (None)
28237739 someuser def-someuser_cpu cpu-32-batch-1 PD 30:00 1 32 N/A 4800M (Priority)
28237740 someuser def-someuser_cpu cpu-32-batch-5 PD 30:00 1 32 N/A 4800M (Priority)
Notice that the job states are either R (Running) or PD (Pending). You can also see the names of the jobs that you provided with the --job-name option.
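If the queue is busy, it can help to narrow the listing down. A minimal sketch using the standard Slurm squeue command (which sq is a convenience wrapper around on the DRAC clusters) is shown below:
# List only your own jobs
squeue -u $USER
# List only your pending jobs
squeue -u $USER --states=PD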
You can monitor the efficiency of the code using seff <job-id>. The efficiency will not be available until the job is completed. The seff command provides information about the job, including the CPU efficiency, memory efficiency, and other metrics. An example of the command and output for a job that is still pending can be seen below:
seff 28237740
Job ID: 28237740
Cluster: graham
User/Group: someuser/someuser
State: PENDING
Cores: 1
Efficiency not available for jobs in the PENDING state.
Once the job is completed, you can use the seff command to get the efficiency of the job. An example of the command and output for a completed job can be seen below. Notice that the efficiency is calculated from the resources used by the job: 3.04 GB of memory was used in this case, out of the 4.69 GB requested per node (the 4800M in the sbatch script), and the CPU efficiency was 10.77%. Keep in mind, however, that time is spent copying the data, setting up the environment, and other tasks before training starts. Changing the number of epochs from 1 to 3 increased the efficiency from 10.77% to 42.55%, and it would increase further with more epochs. For long training runs, the seff command can be quite useful for estimating the optimal resource needs. Remember that wait time depends on resource availability and other rules set by DRAC, so requesting only the resources you actually need can reduce the time your jobs spend waiting in the queue.
seff 28237737
Job ID: 28237737
Cluster: graham
User/Group: someuser/someuser
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:11:50
CPU Efficiency: 10.77% of 01:49:52 core-walltime
Job Wall-clock time: 00:06:52
Memory Utilized: 3.04 GB
Memory Efficiency: 64.78% of 4.69 GB (4.69 GB/node)
It is a good exercise to train for more epochs so that the training time dominates the fixed setup overhead. Try running the script with 10 epochs and monitor the efficiency of the job. You can also try different batch sizes and numbers of CPUs to see how the efficiency changes. Remember that the efficiency is calculated from the resources actually used by the job, and it can help you optimize the resource allocation for your scripts. What efficiency would you expect for a job that runs for 10 epochs?
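To reason about this, it helps to remember how seff computes CPU efficiency: CPU time divided by core-walltime (cores × wall-clock time). The small Python sketch below reproduces the 10.77% figure from the seff output above; the numbers are the ones reported for the 1-epoch job, and you can plug in your own. Since the setup steps (copying the data, building the virtualenv) take roughly the same wall-clock time regardless of the epoch count, training for more epochs raises the fraction of time spent computing and therefore the reported efficiency.
# Reproduce the seff CPU-efficiency figure for the 1-epoch job shown above.
def cpu_efficiency(cpu_seconds, cores, wall_seconds):
    """CPU efficiency as reported by seff: CPU time / (cores * wall-clock time)."""
    return cpu_seconds / (cores * wall_seconds)

cpu_utilized = 11 * 60 + 50   # 00:11:50 CPU Utilized
wall_clock = 6 * 60 + 52      # 00:06:52 Job Wall-clock time
print(f"{cpu_efficiency(cpu_utilized, 16, wall_clock):.2%}")  # -> 10.77%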
Investigating the output
The sbatch command captures the script output in a file, which is useful for tracing the script’s progress. By default, the output is stored in a file named slurm-<job-id>.out. You can check the output of the script using the cat command. For example, to check the output of a job with the job id 28237740, you can use the following command from the folder where you launched the job:
cat slurm-28237740.out
For new jobs, you can modify the name of the output file by providing the --output option to sbatch. For documentation on the available filename patterns, look at the Slurm documentation in this link. To give the file a more descriptive name, you can include the job name using the %x pattern. The command below launches a job whose output file contains both the job id and the job name:
sbatch \
--output=slurm-%j-%x.out \
--account=def-$USER \
--job-name="cpu-32-batch-8" \
mnist-sbatch.sh
The output file for this job would then be named “slurm-<job-id>-cpu-32-batch-8.out”, which makes it easier to identify the job and monitor its progress by inspecting the output.
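To follow the output of a running job in real time rather than reading it afterwards, a simple sketch (using the default file name pattern and the job id from the example above) is:
# Follow the job output as it is written
tail -f slurm-28237740.out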
Running the MNIST script on a GPU
To run the MNIST script on a GPU, you just need to add the required GPU configuration to the sbatch script. There are a number of GPU-related options available for the sbatch command; you can filter them with sbatch --help | grep gpu. One useful option is --gpus-per-node, which specifies the number of GPUs required to run the script. You can customize the script to fit your specific requirements and configurations, either by adding the configuration to the bash script (as seen below) or by adding it to the command line when submitting the job.
#SBATCH --gpus-per-node=1
After adding this line to the sbatch script, you can run the script as in the previous examples. For example, to run a minimal command with the default values from the mnist-sbatch.sh script, you can use the following command (you can remove the --gpus-per-node option if you already added it to the sbatch script):
sbatch --account=def-$USER --gpus-per-node=1 mnist-sbatch.sh
Keep in mind that GPUs can process large matrices faster than CPUs and can be used to speed up the training process. However, GPU efficiency depends on the size of the matrices and the number of operations required; for small matrices, the CPU can be more efficient than the GPU. Increasing the batch size can significantly improve GPU performance by making better use of the hardware. The size of the matrix operations also depends on the model size, the input dimensions, and other factors. It is a good exercise to test the efficiency of the GPU using the seff command. You can run the script with different batch sizes and epoch counts to see how the efficiency changes, and even to determine whether a given model will fit in memory. Remember that the efficiency is calculated from the resources actually used by the job, and it can help you optimize the resource allocation for your scripts.
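Before investigating performance, it is also worth confirming that PyTorch actually sees the allocated GPU inside the job. A minimal diagnostic sketch, run after the virtualenv is activated in the sbatch script (or in an interactive session), could look like this:
# Diagnostic: confirm that PyTorch can see the allocated GPU
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"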
Monitoring GPU resources
To monitor the GPU resources used by the job, you can use the nvidia-smi command. This command displays information about the GPU, including its utilization, memory usage, and other metrics. You can run it in a separate terminal while the job is running to monitor the GPU resources in real time. The command below will continuously display the GPU resources used by the job, updating every second:
srun --jobid <the-job-id> --pty watch -n 1 nvidia-smi
Notice that for the command above to work, the job must currently be running. If you need to estimate resource requirements, you can instead request an interactive session, prepare the environment, and run the desired script while watching the output of nvidia-smi at the same time. The example below shows the output of the nvidia-smi command while running the MNIST script. The output shows the GPU utilization, memory usage, and other metrics, which you can use to monitor the GPU resources used by the job and optimize the resource allocation for your scripts.
Fri Mar 28 09:48:45 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:C1:00.0 Off | 0 |
| 30% 44C P2 79W / 230W | 2164MiB / 23028MiB | 13% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 63341 C python 2161MiB |
+-----------------------------------------------------------------------------+
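As mentioned above, an interactive session is a convenient way to estimate resource requirements before writing the final sbatch script. A minimal sketch is shown below; the account, time, CPU, and memory values are illustrative and should be adjusted to your needs.
# Request a short interactive session with one GPU
salloc --account=def-$USER --gpus-per-node=1 --cpus-per-task=4 --mem=8000M --time=0-00:30
# Inside the session: load the modules, build the virtualenv, and run the
# training script as in mnist-sbatch.sh, then watch GPU usage from a second
# terminal with:
#   srun --jobid <the-job-id> --pty watch -n 1 nvidia-smi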