Training Tips
Optimize training performance
- DataLoader Setup
- cuDNN Auto-Tuner
- Learning Rate Scheduler
- Mixed Precision Training
- Gradient Clipping
Logging
- TensorBoard
- Checkpointing
DataLoader Setup
num_workers
- Allows data to be loaded/preprocessed in parallel, ensuring the GPU is continuously fed with new batches.
- Rule of thumb: num_workers = num_cpu_cores (can be higher or lower depending on your system).
shuffle=True
- Reshuffles the data before each epoch so the model does not pick up patterns from the sample order.
pin_memory=True
- Loads batches into pinned (page-locked) host memory, which speeds up host-to-GPU transfers.
drop_last=True
- Ensures consistent batch size for uniform computation graphs.
persistent_workers=True
- Keeps workers active across epochs instead of reinitializing them every time.
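A minimal sketch of a DataLoader combining the options above; train_dataset and the batch size are placeholders for whatever your training setup uses:

import os
from torch.utils.data import DataLoader

# train_dataset is assumed to be an existing torch.utils.data.Dataset
train_loader = DataLoader(
    train_dataset,
    batch_size=12,               # placeholder value
    shuffle=True,                # reshuffle the data every epoch
    num_workers=os.cpu_count(),  # rule of thumb; tune for your system
    pin_memory=True,             # pinned host memory for faster host-to-GPU copies
    drop_last=True,              # keep batch sizes uniform
    persistent_workers=True,     # keep workers alive across epochs
)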
cuDNN Auto-Tuner
torch.backends.cudnn.benchmark = True
- The auto-tuner tests different implementations of algorithms (e.g., for convolution, pooling) and selects the fastest one that can run on the current hardware.
- Note: The input size should be constant (which is usually the case). If input sizes vary, cuDNN re-benchmarks for every new shape, which can make training slower instead of faster.
Learning Rate Scheduler: ReduceLROnPlateau
ReduceLROnPlateau(mode='min', factor, patience, min_lr)
- Dynamic Learning Rate Adjustment: The learning rate is dynamically adjusted during training.
- Mechanism: If the loss stagnates or does not decrease further for a number of validation iterations specified by patience, the learning rate is multiplied by factor until the optional minimum min_lr is reached.
- Recommendation: Use moderate values like factor=0.9 to avoid "stalling" in the training process.
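A minimal sketch of wiring the scheduler into a validation loop; model, train_one_epoch(), validate(), and num_epochs are placeholders, and the factor/patience values mirror the ones used elsewhere in this document:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',     # lower validation loss is better
    factor=0.9,     # multiply the learning rate by 0.9 on a plateau
    patience=10,    # wait 10 validation steps without improvement
    min_lr=1e-6,    # placeholder lower bound
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # placeholder training step
    val_loss = validate(model)         # placeholder validation step
    scheduler.step(val_loss)           # scheduler reacts to the validation loss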
AMP: Automatic Mixed Precision
Run the model inside autocast() and scale the loss with GradScaler().
- Mixed Precision Arithmetic: Combines 16-bit (half-precision) and 32-bit (single-precision) floating-point arithmetic.
- Efficiency Improvement: Enhances training efficiency without compromising model accuracy.
- Increased Batch Size: Allows for nearly double the batch size, since only critical operations like summations are performed with 32-bit precision.
Note:
- Works effectively with "simple" loss functions (e.g., MAE, MSE).
- Not compatible with GAN and feature-based loss functions.
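A minimal sketch of one AMP training step using the torch.cuda.amp API; model, optimizer, criterion, and train_loader are placeholders:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in train_loader:           # placeholder loader
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()

    with autocast():                           # forward pass in mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)     # e.g. an L1/MSE loss

    scaler.scale(loss).backward()              # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                     # unscales gradients, then optimizer.step()
    scaler.update()                            # adjust the scale factor for the next step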
Gradient Clipping
- Preventing Large Gradients: Ensures that gradients do not become too large, which could destabilize the training process.
- Reducing the Risk of Exploding Gradients: Minimizes the risk of gradients "exploding" during backpropagation.
- Controlled Model Updates: The gradient vector is rescaled so its norm stays bounded while its direction is preserved, keeping parameter updates controlled.
Motivation:
- The GAN loss often collapsed to NaN after extended training.
- The value of max_norm=0.4 was empirically determined to address this issue.
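A minimal sketch of clipping the gradient norm before the optimizer step; max_norm=0.4 is the empirically chosen value mentioned above, and model/optimizer/loss are placeholders:

import torch

loss.backward()
# Rescale the gradient vector so its total norm does not exceed 0.4,
# while keeping its direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.4)
optimizer.step()

When combined with AMP, call scaler.unscale_(optimizer) before clipping so the norm is computed on the unscaled gradients.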
Logging with TensorBoard
Usage with SummaryWriter:
- Purpose: Visualization of training and validation losses, as well as output images.
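A minimal sketch of logging scalars and images with SummaryWriter; the log directory matches the tensorboard_logs folder referenced below, and the tag names and variables are placeholders:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="tensorboard_logs")

# Inside the training loop:
writer.add_scalar("loss/train", train_loss, global_step=iteration)
writer.add_scalar("loss/val", val_loss, global_step=iteration)
writer.add_image("output/example", output_image, global_step=iteration)  # CHW tensor in [0, 1]

writer.close()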
Opening TensorBoard in VS Code:
- Open the Command Palette: Ctrl+Shift+P
- Search for TensorBoard: type "Python: Launch TensorBoard" and press Enter
- Select the logs folder: choose "Select another folder" and navigate to tensorboard_logs
- View in browser: open http://localhost:6006/
Logging training info with .json
A .json file with the training info is automatically written to /model_zoo next to each trained model and looks something like this:
{
"in_res": 256,
"n_channels": 4,
"epoch": 30,
"total_epochs": 30,
"iteration": 224970,
"total_iterations": 224970,
"duration": 43.3434,
"pretrained_model": "model_zoo\\2024_08_13_08_15_32\\model_96750.pth",
"model_architecture": "SCUNet",
"use_sigmoid": true,
"config": "[4,4,4,4,4,4,4]",
"batch_size": 12,
"loss_info": {
"Pixel_Loss": "L1",
"Pixel_Loss_Weight": 10,
"GAN_Loss": "wgan",
"GAN_Loss_Weight": 1,
"Feature_Loss": "lpips_vgg",
"Feature_Loss_Weight": 1
},
"optimizer": "Adam",
"initial_learning_rate": 0.0001,
"last_learning_rate": 5.904900000000002e-05,
"decay_factor": 0.9,
"decay_step": "auto_plateau",
"lr_patience": 10,
"number_params": 17947224,
"n_imgs_train": 89991,
"n_imgs_test": 250,
"training_path": [
"training_data\\LSDIR\\train",
"training_data\\custom"
],
"testing_path": [
"training_data\\LSDIR\\val\\HR"
],
"transforms_train": {
"0": {
"class_name": "Downsample",
"downsampling_factor": 2,
"sigma": 1
},
"1": {
"class_name": "RandomCrop",
"h": 256,
"w": 256
},
"2": {
"class_name": "RgbToRawTransform",
"iso": 12500,
"noise_model": "dng",
"wb_gains_mode": "normal"
},
"3": {
"class_name": "ToTensor2"
}
},
"transforms_test": {
"0": {
"class_name": "Downsample",
"downsampling_factor": 2,
"sigma": 1
},
"1": {
"class_name": "RandomCrop",
"h": 256,
"w": 256
},
"2": {
"class_name": "RgbToRawTransform",
"iso": 12500,
"noise_model": "dng",
"wb_gains_mode": "normal"
},
"3": {
"class_name": "ToTensor2"
}
},
"final_test_loss": 0.054455384612083435,
"final_val_loss": 0.028505839593708514
}
Bonus
If you ever need to keep Windows Update from automatically restarting your computer during a long training run, this batch script sets the "Active Hours" to the next 18 hours:
@echo off
setlocal enabledelayedexpansion
rem Determine the current hour
set "currentHour=%TIME:~0,2%"
set "currentHour=!currentHour: =!" rem remove leading space
rem Compute the start and end time for the next 18 hours
set /a "startHour=!currentHour!"
set /a "endHour=(startHour + 18) %% 24"
rem Update the "Active Hours" in the registry
reg add "HKLM\SOFTWARE\Microsoft\WindowsUpdate\UX\Settings" /v "ActiveHoursStart" /t REG_DWORD /d !startHour! /f
reg add "HKLM\SOFTWARE\Microsoft\WindowsUpdate\UX\Settings" /v "ActiveHoursEnd" /t REG_DWORD /d !endHour! /f
echo The "Active Hours" have been set from !startHour! to !endHour! o'clock.