Training Tips
Optimize training performance
- DataLoader Setup
- cuDNN Auto-Tuner
- Learning Rate Scheduler
- Mixed Precision Training
- Gradient Clipping
Logging
- TensorBoard
- Checkpointing
DataLoader Setup
num_workers
- Allows data to be loaded/preprocessed in parallel, ensuring the GPU is continuously fed with new batches.
- Rule of thumb: num_workers = num_cpu_cores (can be higher or lower depending on your system).
shuffle=True
- Reshuffles the data before each epoch so the model does not pick up patterns from the sample order.
pin_memory=True
- Loads batches into pinned (page-locked) host memory, which speeds up host-to-GPU transfers.
drop_last=True
- Ensures consistent batch size for uniform computation graphs.
persistent_workers=True
- Keeps workers active across epochs instead of reinitializing them every time.
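A minimal sketch of a DataLoader combining the options above; train_dataset and the batch size are placeholders for whatever your training setup uses:

import os
from torch.utils.data import DataLoader

# train_dataset is assumed to be an existing torch.utils.data.Dataset
train_loader = DataLoader(
    train_dataset,
    batch_size=12,               # placeholder value
    shuffle=True,                # reshuffle the data every epoch
    num_workers=os.cpu_count(),  # rule of thumb; tune for your system
    pin_memory=True,             # pinned host memory for faster host-to-GPU copies
    drop_last=True,              # keep batch sizes uniform
    persistent_workers=True,     # keep workers alive across epochs
)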
cuDNN Auto-Tuner
torch.backends.cudnn.benchmark = True
- The auto-tuner tests different implementations of algorithms (e.g., for convolution, pooling) and selects the fastest one that can run on the current hardware.
- Note: The input size should be constant (which is usually the case). If input sizes vary, cuDNN re-benchmarks for every new shape, which can make training slower instead of faster.
Learning Rate Scheduler: ReduceLROnPlateau
ReduceLROnPlateau(mode='min', factor, patience, min_lr)
- Dynamic Learning Rate Adjustment: The learning rate is dynamically adjusted during training.
- Mechanism: If the loss stagnates or does not decrease further for a number of validation iterations specified by patience, the learning rate is multiplied by factor until the optional minimum min_lr is reached.
- Recommendation: Use moderate values like factor=0.9 to avoid "stalling" in the training process.
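A minimal sketch of wiring the scheduler into a validation loop; model, train_one_epoch(), validate(), and num_epochs are placeholders, and the factor/patience values mirror the ones used elsewhere in this document:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',     # lower validation loss is better
    factor=0.9,     # multiply the learning rate by 0.9 on a plateau
    patience=10,    # wait 10 validation steps without improvement
    min_lr=1e-6,    # placeholder lower bound
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # placeholder training step
    val_loss = validate(model)         # placeholder validation step
    scheduler.step(val_loss)           # scheduler reacts to the validation loss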
AMP: Automatic Mixed Precision
Run the model inside autocast() and scale the loss with GradScaler().
- Mixed Precision Arithmetic: Combines 16-bit (half-precision) and 32-bit (single-precision) floating-point arithmetic.
- Efficiency Improvement: Enhances training efficiency without compromising model accuracy.
- Increased Batch Size: Allows for nearly double the batch size, since only critical operations like summations are performed with 32-bit precision.
Note:
- Works effectively with "simple" loss functions (e.g., MAE, MSE).
- Not compatible with GAN and feature-based loss functions.
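A minimal sketch of one AMP training step using the torch.cuda.amp API; model, optimizer, criterion, and train_loader are placeholders:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in train_loader:           # placeholder loader
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()

    with autocast():                           # forward pass in mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)     # e.g. an L1/MSE loss

    scaler.scale(loss).backward()              # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                     # unscales gradients, then optimizer.step()
    scaler.update()                            # adjust the scale factor for the next step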
Gradient Clipping
- Preventing Large Gradients: Ensures that gradients do not become too large, which could destabilize the training process.
- Reducing the Risk of Exploding Gradients: Minimizes the risk of gradients "exploding" during backpropagation.
- Controlled Model Updates: The gradient vector is rescaled so its norm stays bounded while its direction is preserved, keeping parameter updates controlled.
Motivation:
- The GAN loss often collapsed to NaN after extended training.
- The value of max_norm=0.4 was empirically determined to address this issue.
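A minimal sketch of clipping the gradient norm before the optimizer step; max_norm=0.4 is the empirically chosen value mentioned above, and model/optimizer/loss are placeholders:

import torch

loss.backward()
# Rescale the gradient vector so its total norm does not exceed 0.4,
# while keeping its direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.4)
optimizer.step()

When combined with AMP, call scaler.unscale_(optimizer) before clipping so the norm is computed on the unscaled gradients.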
Logging with TensorBoard
Usage with SummaryWriter:
- Purpose: Visualization of training and validation losses, as well as output images.
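A minimal sketch of logging scalars and images with SummaryWriter; the log directory matches the tensorboard_logs folder referenced below, and the tag names and variables are placeholders:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="tensorboard_logs")

# Inside the training loop:
writer.add_scalar("loss/train", train_loss, global_step=iteration)
writer.add_scalar("loss/val", val_loss, global_step=iteration)
writer.add_image("output/example", output_image, global_step=iteration)  # CHW tensor in [0, 1]

writer.close()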
Opening TensorBoard in VS Code:
- Open the Command Palette: Ctrl+Shift+P
- Search for TensorBoard: type "Python: Launch TensorBoard" and press Enter
- Select the logs folder: choose "Select another folder" and navigate to tensorboard_logs
- View in browser: open http://localhost:6006/
Logging training info with .json
A .json file with the training info is automatically written to /model_zoo next to each trained model and looks something like this:
{
"in_res": 256,
"n_channels": 4,
"epoch": 30,
"total_epochs": 30,
"iteration": 224970,
"total_iterations": 224970,
"duration": 43.3434,
"pretrained_model": "model_zoo\\2024_08_13_08_15_32\\model_96750.pth",
"model_architecture": "SCUNet",
"use_sigmoid": true,
"config": "[4,4,4,4,4,4,4]",
"batch_size": 12,
"loss_info": {
"Pixel_Loss": "L1",
"Pixel_Loss_Weight": 10,
"GAN_Loss": "wgan",
"GAN_Loss_Weight": 1,
"Feature_Loss": "lpips_vgg",
"Feature_Loss_Weight": 1
},
"optimizer": "Adam",
"initial_learning_rate": 0.0001,
"last_learning_rate": 5.904900000000002e-05,
"decay_factor": 0.9,
"decay_step": "auto_plateau",
"lr_patience": 10,
"number_params": 17947224,
"n_imgs_train": 89991,
"n_imgs_test": 250,
"training_path": [
"training_data\\LSDIR\\train",
"training_data\\custom"
],
"testing_path": [
"training_data\\LSDIR\\val\\HR"
],
"transforms_train": {
"0": {
"class_name": "Downsample",
"downsampling_factor": 2,
"sigma": 1
},
"1": {
"class_name": "RandomCrop",
"h": 256,
"w": 256
},
"2": {
"class_name": "RgbToRawTransform",
"iso": 12500,
"noise_model": "dng",
"wb_gains_mode": "normal"
},
"3": {
"class_name": "ToTensor2"
}
},
"transforms_test": {
"0": {
"class_name": "Downsample",
"downsampling_factor": 2,
"sigma": 1
},
"1": {
"class_name": "RandomCrop",
"h": 256,
"w": 256
},
"2": {
"class_name": "RgbToRawTransform",
"iso": 12500,
"noise_model": "dng",
"wb_gains_mode": "normal"
},
"3": {
"class_name": "ToTensor2"
}
},
"final_test_loss": 0.054455384612083435,
"final_val_loss": 0.028505839593708514
}
Bonus
If you ever need to keep Windows Update from automatically restarting your computer during a long training run, this batch script sets the "Active Hours" to the next 18 hours:
@echo off
setlocal enabledelayedexpansion
rem Determine the current hour
set "currentHour=%TIME:~0,2%"
set "currentHour=!currentHour: =!" rem remove leading space
rem Compute the start and end time for the next 18 hours
set /a "startHour=!currentHour!"
set /a "endHour=(startHour + 18) %% 24"
rem Update the "Active Hours" in the registry
reg add "HKLM\SOFTWARE\Microsoft\WindowsUpdate\UX\Settings" /v "ActiveHoursStart" /t REG_DWORD /d !startHour! /f
reg add "HKLM\SOFTWARE\Microsoft\WindowsUpdate\UX\Settings" /v "ActiveHoursEnd" /t REG_DWORD /d !endHour! /f
echo The "Active Hours" have been set from !startHour! to !endHour! o'clock.