Optimizers

Riemann provides a variety of optimization algorithms through the riemann.optim package. These optimizers are used to update the parameters of neural networks during training.

Optimizer Basics

This section covers the fundamental concepts and usage patterns for optimizers in Riemann.

All optimizers in Riemann inherit from the optim.Optimizer class. To use an optimizer, you need to:

  1. Create an optimizer instance with the parameters to optimize

  2. Define a loss function

  3. Zero the gradients

  4. Compute the loss and call backward()

  5. Call the optimizer’s step() method

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create a simple model
model = nn.Linear(10, 1)

# Create an optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define loss function
loss_fn = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    # Zero gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Backward pass
    loss.backward()

    # Update parameters
    optimizer.step()

GD (Gradient Descent)

GD is the most basic optimization algorithm that updates parameters in the direction of the negative gradient.

Basic GD

import riemann as rm
import riemann.optim as optim

# Create model
model = nn.Linear(10, 1)

# Create GD optimizer
optimizer = optim.GD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

GD with Weight Decay

Weight decay (L2 regularization) can be added to prevent overfitting.

import riemann as rm
import riemann.optim as optim

# Create GD optimizer with weight decay
optimizer = optim.GD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD (Stochastic Gradient Descent)

SGD is a variant of gradient descent that updates parameters using a subset of data at each iteration. It is one of the most fundamental optimization algorithms in machine learning.

Algorithm Explanation:

  1. Compute gradients on a mini-batch: ∇θL(θ)

  2. Apply weight decay: θ = θ * (1 - η * weight_decay)

  3. Update parameters: θ = θ - η * ∇θL(θ)

  4. Optional: Apply momentum and/or Nesterov momentum

Advantages:

  • Simple and computationally efficient

  • Straightforward implementation

  • No hyperparameters to tune except learning rate

  • Effective for large datasets when used with momentum

Suitable Scenarios:

  • Training large models with large datasets

  • When computational resources are limited

  • As a baseline for comparing other optimization algorithms

  • When fine-tuning models with careful learning rate scheduling

Basic SGD

import riemann as rm
import riemann.optim as optim

# Create model
model = nn.Linear(10, 1)

# Create SGD optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD with Momentum

Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

import riemann as rm
import riemann.optim as optim

# Create SGD optimizer with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD with Nesterov Momentum

Nesterov momentum is a variant of momentum that can provide better convergence by looking ahead in the direction of the momentum vector.

import riemann as rm
import riemann.optim as optim

# Create SGD optimizer with Nesterov momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD with Closure Parameter

The SGD optimizer supports a closure parameter, which is a function that reevaluates the model and returns the loss. This is useful for algorithms that need to compute the loss multiple times per optimization step, or for custom optimization logic.

import riemann as rm
import riemann.optim as optim

# Create SGD optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define closure function
def closure():
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    return loss

# Training loop using closure
for epoch in range(num_epochs):
    loss = optimizer.step(closure)
    print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

When to Use Closure with SGD

The closure parameter is particularly useful in scenarios such as:

  • Implementing custom optimization logic that requires multiple forward/backward passes

  • Using SGD with algorithms that need to compute loss gradients multiple times

  • Creating consistent interfaces across different optimizers (e.g., LBFGS always requires a closure)

  • Debugging optimization issues by inspecting loss values at different stages

Adam (Adaptive Moment Estimation)

Adam combines the best properties of AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. It maintains both a moving average of the gradients (momentum) and a moving average of the squared gradients (adaptive learning rate).

Algorithm Explanation:

  1. Compute first moment estimate: m = β1*m + (1-β1)*∇θL(θ)

  2. Compute second moment estimate: v = β2*v + (1-β2)*(∇θL(θ))²

  3. Apply bias correction: m̂ = m/(1-β1^t), v̂ = v/(1-β2^t)

  4. Apply weight decay: θ = θ * (1 - η * weight_decay)

  5. Update parameters: θ = θ - η*m̂/(√v̂ + ε)

Advantages:

  • Adaptive learning rate per parameter

  • Combines benefits of momentum and adaptive learning rates

  • Well-suited for sparse gradients and noisy problems

  • Generally works well without extensive hyperparameter tuning

Suitable Scenarios:

  • Training deep neural networks with many parameters

  • Problems with sparse gradients (e.g., NLP tasks)

  • Large-scale machine learning problems

  • When quick convergence is desired

  • As a default optimizer for most deep learning tasks

Basic Adam

import riemann as rm
import riemann.optim as optim

# Create Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Adam with Weight Decay

Weight decay (L2 regularization) can be added to prevent overfitting.

import riemann as rm
import riemann.optim as optim

# Create Adam optimizer with weight decay
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Adagrad

AdaGrad adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features. It maintains a per-parameter learning rate that is scaled based on the historical sum of squared gradients.

Algorithm Explanation:

  1. Compute gradients: ∇θL(θ)

  2. Accumulate squared gradients: G = G + (∇θL(θ))²

  3. Apply weight decay: θ = θ * (1 - η * weight_decay)

  4. Update parameters: θ = θ - η*∇θL(θ)/(√G + ε)

  5. Optional: Apply learning rate decay

Advantages:

  • Adaptive learning rate per parameter

  • Automatically scales learning rates for different features

  • Well-suited for sparse data where some features are infrequent

  • No need to manually tune learning rate for different parameters

Suitable Scenarios:

  • Training on sparse datasets (e.g., text classification)

  • Problems with features of varying frequencies

  • When you want learning rates to automatically adapt to feature importance

  • As a baseline for adaptive learning rate methods

Basic Adagrad

import riemann as rm
import riemann.optim as optim

# Create AdaGrad optimizer
optimizer = optim.Adagrad(model.parameters(), lr=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

LBFGS

LBFGS is a quasi-Newton method that approximates the BFGS algorithm using a limited amount of memory. It maintains a history of recent gradient evaluations to approximate the Hessian matrix, allowing for more efficient optimization than full Newton methods.

Algorithm Explanation:

  1. Compute gradients using the closure function

  2. Maintain a history of recent (position, gradient) pairs

  3. Approximate the inverse Hessian matrix using this history

  4. Compute search direction based on the approximate inverse Hessian

  5. Perform line search to find optimal step size

  6. Update parameters and repeat

Advantages:

  • Memory efficient compared to full BFGS

  • Faster convergence than first-order methods for smooth objectives

  • Uses second-order information without explicitly computing the Hessian

  • Well-suited for small to medium-sized problems

Suitable Scenarios:

  • Training small to medium-sized models

  • Problems with smooth objective functions

  • When you need faster convergence than first-order methods

  • For batch optimization where you can compute the full gradient

  • When memory is limited but you still want second-order optimization benefits

LBFGS Parameters Detailed Explanation:

  • lr: Learning rate (default: 1.0). Unlike other optimizers, LBFGS uses this as a step size multiplier during line search.

  • max_iter: Maximum number of iterations per optimization step (default: 20). Controls how many times the algorithm will attempt to find an optimal step.

  • max_eval: Maximum number of function evaluations per optimization step (default: None). If None, defaults to max_iter * 1.25.

  • tolerance_grad: Gradient tolerance for convergence (default: 1e-05). The optimization stops when the gradient norm falls below this threshold.

  • tolerance_change: Parameter change tolerance for convergence (default: 1e-09). The optimization stops when parameter updates fall below this threshold.

  • history_size: Update history size (default: 100). Controls how much memory is used to approximate the Hessian matrix.

  • line_search_fn: Line search function (default: None). Custom line search function to use instead of the default implementation.

Basic LBFGS

import riemann as rm
import riemann.optim as optim

# Create LBFGS optimizer
optimizer = optim.LBFGS(model.parameters(), lr=1.0)

# Define closure function for LBFGS
def closure():
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    return loss

# Training loop
for epoch in range(num_epochs):
    loss = optimizer.step(closure)
    print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

LBFGS with Custom Parameters

You can customize LBFGS parameters to better suit your specific optimization problem:

import riemann as rm
import riemann.optim as optim

# Create LBFGS optimizer with custom parameters
optimizer = optim.LBFGS(
    model.parameters(),
    lr=1.0,               # Learning rate (step size multiplier)
    max_iter=50,          # Maximum iterations per step
    max_eval=None,        # Maximum function evaluations (None = max_iter * 1.25)
    tolerance_grad=1e-06,  # Gradient convergence threshold
    tolerance_change=1e-09, # Parameter change convergence threshold
    history_size=100,      # Hessian approximation memory
    line_search_fn=None    # Use default line search
)

# Define closure function
def closure():
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    return loss

# Training loop
for epoch in range(num_epochs):
    loss = optimizer.step(closure)
    print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

Parameter Tuning Tips for LBFGS

  • lr: Start with 1.0 and adjust based on convergence. For ill-conditioned problems, smaller values (0.1-0.5) may be more stable.

  • max_iter: Increase for more complex optimization problems, but be aware of increased computation time.

  • history_size: Larger values may improve convergence but increase memory usage. For small models, 100-200 is typically sufficient.

  • tolerance_grad and tolerance_change: Adjust based on desired precision. Smaller values lead to more precise optimization but may require more iterations.

  • line_search_fn: Only customize if you have specific line search requirements. The default implementation works well for most problems.

AdamW (Adam with Weight Decay)

AdamW is an improved version of Adam that treats weight decay as a separate regularization term instead of modifying the gradients. This allows weight decay to more effectively act as L2 regularization, avoiding the weight decay side effects present in Adam.

Algorithm Explanation:

  1. Compute first moment estimate: m = β1*m + (1-β1)*∇θL(θ)

  2. Compute second moment estimate: v = β2*v + (1-β2)*(∇θL(θ))²

  3. Apply bias correction: m̂ = m/(1-β1^t), v̂ = v/(1-β2^t)

  4. Apply weight decay: θ = θ * (1 - η * weight_decay)

  5. Update parameters: θ = θ - η*m̂/(√v̂ + ε)

Advantages:

  • More effective weight decay as a regularizer

  • Better generalization performance

  • Avoids Adam’s weight decay issues

  • Well-suited for modern deep learning architectures

Suitable Scenarios:

  • Training deep neural networks

  • Transfer learning and fine-tuning

  • Models with many parameters prone to overfitting

  • Any scenario where Adam is used, especially with weight decay

Basic AdamW

import riemann as rm
import riemann.optim as optim

# Create AdamW optimizer
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

AdamW with Custom Betas

import riemann as rm
import riemann.optim as optim

# Create AdamW optimizer with custom betas
optimizer = optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

RMSprop (Root Mean Square Propagation)

RMSprop is an adaptive learning rate optimization algorithm that maintains a moving average of the squared gradients for each parameter. It is particularly well-suited for recurrent neural networks (RNNs) and other sequence models.

Algorithm Explanation:

  1. Compute moving average of squared gradients: v = α*v + (1-α)*(∇θL(θ))²

  2. Update parameters: θ = θ - η*∇θL(θ)/(√v + ε)

  3. Optional: Apply momentum and/or use centered version

Advantages:

  • Adaptive learning rate per parameter

  • Effective for non-stationary objectives

  • Well-suited for RNNs and sequence models

  • Helps with the vanishing/exploding gradient problem

Suitable Scenarios:

  • Training recurrent neural networks (RNNs)

  • Training long short-term memory (LSTM) networks

  • Training gated recurrent units (GRUs)

  • Any model with sequence data

  • Models where gradients can vary significantly over time

Basic RMSprop

import riemann as rm
import riemann.optim as optim

# Create RMSprop optimizer
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

RMSprop with Momentum

import riemann as rm
import riemann.optim as optim

# Create RMSprop optimizer with momentum
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

RMSprop with Centered Gradients

import riemann as rm
import riemann.optim as optim

# Create RMSprop optimizer with centered gradients
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, centered=True)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Optimizer Methods

Zero Gradients

import riemann as rm
import riemann.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Zero gradients
optimizer.zero_grad()

Step

import riemann as rm
import riemann.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Perform a single optimization step
optimizer.step()

State Dict

import riemann as rm
import riemann.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Get optimizer state
state_dict = optimizer.state_dict()

# Load optimizer state
optimizer.load_state_dict(state_dict)

Parameter Groups

Parameter groups are a powerful feature of Riemann optimizers that allow you to configure different hyperparameters for different sets of parameters in your model. This is particularly useful in scenarios such as:

  • Setting different learning rates for different layers of your model

  • Applying different hyperparameters to weight and bias parameters

  • Using different learning rates when fine-tuning pre-trained models

Basic Structure of Parameter Groups

Parameter groups are defined through a list of dictionaries, where each dictionary contains:

  • params:The set of parameters to optimize

  • Other key-value pairs:Hyperparameters specific to this parameter group (e.g., lr, weight_decay)

Basic Usage Example

import riemann as rm
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

# Create parameter groups
optimizer = optim.SGD([
    {'params': model[0].parameters(), 'lr': 0.01},
    {'params': model[2].parameters(), 'lr': 0.001}
], momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

In the example above: - The first parameter group contains all parameters from the first layer (model[0]) with a learning rate of 0.01 - The second parameter group contains all parameters from the third layer (model[2]) with a learning rate of 0.001 - Both parameter groups share the momentum=0.9 hyperparameter

Setting Different Hyperparameters for Weights and Biases

import riemann as rm
import riemann.optim as optim

# Create model
model = nn.Linear(10, 1)

# Separate weight and bias parameters
weight_params = [p for name, p in model.named_parameters() if 'weight' in name]
bias_params = [p for name, p in model.named_parameters() if 'bias' in name]

# Create parameter groups
optimizer = optim.SGD([
    {'params': weight_params, 'lr': 0.01, 'weight_decay': 1e-4},
    {'params': bias_params, 'lr': 0.02, 'weight_decay': 0}
])

In this example: - Weight parameters use a smaller learning rate (0.01) and weight decay (1e-4) - Bias parameters use a larger learning rate (0.02) and no weight decay

Using Parameter Groups in Pre-trained Models

import riemann as rm
import riemann.optim as optim

# Create a pre-trained model (simplified example)
class PretrainedModel(nn.Module):
    def __init__(self):
        super(PretrainedModel, self).__init__()
        # Assume this part is a pre-trained feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        # Assume this part is a newly added classifier
        self.classifier = nn.Linear(64, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model = PretrainedModel()

# Create parameter groups
optimizer = optim.SGD([
    {'params': model.features.parameters(), 'lr': 0.001},  # Lower LR for pre-trained features
    {'params': model.classifier.parameters(), 'lr': 0.01}  # Higher LR for new classifier
], momentum=0.9)

In this example: - The pre-trained feature extractor uses a smaller learning rate (0.001) to avoid disrupting learned features - The newly added classifier uses a larger learning rate (0.01) to speed up its convergence

How Parameter Groups Work

When you create an optimizer with parameter groups:

  1. The optimizer maintains separate states for each parameter group

  2. During each step() call, the optimizer updates parameters according to the hyperparameters of their respective groups

  3. If a hyperparameter is not specified in a parameter group, the optimizer uses the default value provided in the constructor

Best Practices

  1. Clear Naming:Use named_parameters() to create parameter groups based on parameter names for better code readability

  2. Logical Grouping:Group parameters based on their nature and importance, such as: - Parameters from different layers - Weight and bias parameters - Pre-trained and newly added parameters

  3. Learning Rate Scheduling:Parameter group learning rates can be used with learning rate schedulers, which will adjust them based on their initial learning rates

  4. Hyperparameter Search:Parameter groups enable more flexible hyperparameter searches by allowing different configurations for different parts of your model

Gradient Clipping

Gradient clipping is a technique to prevent exploding gradients in deep networks by limiting the size of gradients to ensure training stability. During the training of deep neural networks, gradients can become extremely large, leading to excessive parameter updates and divergence of the training process.

Benefits of Gradient Clipping

  • Prevents Exploding Gradients: Limits the maximum gradient value, avoiding excessive parameter updates

  • Prevents Exploding Gradients: Limits the maximum gradient value, avoiding excessive parameter updates

  • Improves Training Stability: Makes the training process more stable and reduces training fluctuations

  • Speeds Up Convergence: Can help the model converge faster in some cases

  • Allows Larger Learning Rates: By limiting gradients, larger initial learning rates can be used

Clip by Norm

Clipping by norm works by calculating the L2 norm of the gradients and limiting it to a maximum norm. This method preserves the direction of the gradients while adjusting their magnitude.

Parameter Description: - parameters: The parameter collection whose gradients to clip - max_norm: The maximum norm of the gradients - norm_type: The type of norm to use, default is 2 (L2 norm) - error_if_nonfinite: Whether to throw an error if gradients contain non-finite values (like NaN or inf), default is False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Clip gradients by norm
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Use Cases: - Suitable for most deep neural network training - Especially useful when using RNNs or LSTMs - When you observe NaN or inf in training loss

Practical Application Example:

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Create optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()

        # Clip gradients before optimizer step
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()

Clip by Value

Clipping by value works by limiting each element of the gradients to a specified range. This method directly truncates extreme values in the gradients.

Parameter Description: - parameters: The parameter collection whose gradients to clip - clip_value: The maximum absolute value of gradients - error_if_nonfinite: Whether to throw an error if gradients contain non-finite values (like NaN or inf), default is False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Clip gradients by value
nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

Use Cases: - When you want to directly control the maximum absolute value of gradients - When there are extreme outliers in the gradients - For specific network architectures like discriminators in GANs

Practical Application Example:

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()

        # Clip gradients before optimizer step
        nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

        optimizer.step()

Best Practices for Gradient Clipping

  1. Choose the Right Clipping Method: - Clip by norm (clip_grad_norm_): Suitable for most cases, especially RNNs - Clip by value (clip_grad_value_): Suitable for cases with extreme gradient values

  2. Set Reasonable Clipping Thresholds: - Clip by norm: max_norm is typically set between 0.5 and 5.0 - Clip by value: clip_value is typically set between 0.1 and 1.0

  3. Clipping Timing: - Must be executed after loss.backward() and before optimizer.step() - Should be applied to every batch

  4. Combining with Other Techniques: - Use with learning rate schedulers - Use with parameter groups to apply different clipping strategies to different layers

  5. Monitoring Effectiveness: - Observe if training loss becomes more stable - Check if exploding gradients still occur - Adjust clipping thresholds for optimal results

Examples

Training a Neural Network with Adam

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(50):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()

        # Gradient clipping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    print(f'Epoch {epoch}, Loss: {total_loss/len(train_loader):.4f}')

Training with Different Learning Rates for Different Layers

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create a pre-trained model (simplified example)
class PretrainedModel(nn.Module):
    def __init__(self):
        super(PretrainedModel, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.classifier = nn.Linear(64, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model = PretrainedModel()

# Create parameter groups with different learning rates
optimizer = optim.SGD([
    {'params': model.features.parameters(), 'lr': 0.001},  # Lower LR for pretrained features
    {'params': model.classifier.parameters(), 'lr': 0.01}  # Higher LR for new classifier
], momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Custom Optimizer

You can create custom optimizers by inheriting from optim.Optimizer.

import riemann as rm
import riemann.optim as optim

class CustomSGD(optim.Optimizer):
    def __init__(self, params, lr=0.01, momentum=0):
        defaults = dict(lr=lr, momentum=momentum)
        super(CustomSGD, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue

                d_p = p.grad.data
                if group['momentum'] != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = rm.clone(d_p).detach()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(group['momentum']).add_(d_p)
                        d_p = buf

                p.data.add_(-group['lr'], d_p)

        return loss

# Use custom optimizer
optimizer = CustomSGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Learning Rate Schedulers

Learning rate schedulers are used to dynamically adjust the learning rate during training, which is crucial for model convergence and performance optimization. Riemann provides several learning rate schedulers, each with its specific adjustment strategy.

Types of Learning Rate Schedulers

StepLR

Function: Adjusts the learning rate by a fixed step size and decay factor.

Parameters:

  • optimizer: The optimizer whose learning rate is to be adjusted

  • step_size: The step size for learning rate decay

  • gamma: The learning rate decay factor, default is 0.1

  • last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where learning rate needs to be reduced at fixed intervals.

MultiStepLR

Function: Adjusts the learning rate at specified milestones.

Parameters:

  • optimizer: The optimizer whose learning rate is to be adjusted

  • milestones: A list of milestones for learning rate decay

  • gamma: The learning rate decay factor, default is 0.1

  • last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where learning rate needs to be reduced at specific epochs.

ExponentialLR

Function: Adjusts the learning rate with exponential decay.

Parameters:

  • optimizer: The optimizer whose learning rate is to be adjusted

  • gamma: The learning rate decay factor

  • last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where a continuous smooth decay of learning rate is needed.

CosineAnnealingLR

Function: Adjusts the learning rate according to the shape of a cosine function.

Parameters:

  • optimizer: The optimizer whose learning rate is to be adjusted

  • T_max: The period of cosine annealing

  • eta_min: The minimum learning rate, default is 0

  • last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where learning rate needs to first decrease and then increase, helping to escape local optima.

ReduceLROnPlateau

Function: Adjusts the learning rate when a metric stops improving.

Parameters:

  • optimizer: The optimizer whose learning rate is to be adjusted

  • mode: Mode, ‘min’ or ‘max’, default is ‘min’

  • factor: The learning rate decay factor, default is 0.1

  • patience: Number of epochs with no improvement after which learning rate will be reduced, default is 10

  • threshold: Threshold for measuring new best, default is 1e-4

  • threshold_mode: Threshold mode, ‘rel’ or ‘abs’, default is ‘rel’

  • cooldown: Number of epochs to wait before resuming normal operation after lr has been reduced, default is 0

  • min_lr: Minimum learning rate, default is 0

  • eps: Minimum change in learning rate to qualify as an improvement, default is 1e-8

Use Case: Suitable for scenarios where learning rate needs to be dynamically adjusted based on validation metrics.

Using Learning Rate Schedulers

The basic usage flow of learning rate schedulers is as follows:

  1. Create an optimizer

  2. Create a learning rate scheduler, passing in the optimizer and relevant parameters

  3. In the training loop, first call the optimizer’s step() method to update parameters

  4. Then call the scheduler’s step() method to update the learning rate

Scheduler and Optimizer Interaction

  • Order: It is recommended to call optimizer.step() first, then scheduler.step()

  • Parameter Groups: The scheduler will adjust each parameter group based on its initial learning rate

  • State Saving: The scheduler’s state can be saved and loaded using state_dict() and load_state_dict() methods

  • Special Case: The ReduceLROnPlateau scheduler requires a validation metric to be passed in the step() method

Complete Example Code

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim
from riemann.optim.lr_scheduler import StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau

# Create model
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

# Create optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Create learning rate scheduler (choose one)
# 1. StepLR example
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# 2. MultiStepLR example
# scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# 3. ExponentialLR example
# scheduler = ExponentialLR(optimizer, gamma=0.99)

# 4. CosineAnnealingLR example
# scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0.001)

# 5. ReduceLROnPlateau example (needs validation loss in step())
# scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

# Define loss function
loss_fn = nn.MSELoss()

# Generate example data
inputs = rm.randn(100, 10)
targets = rm.randn(100, 1)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Update learning rate
    scheduler.step()  # For ReduceLROnPlateau, use scheduler.step(loss.item())

    # Print information
    if (epoch + 1) % 10 == 0:
        current_lr = scheduler.get_lr()[0] if hasattr(scheduler, 'get_lr') else optimizer.param_groups[0]['lr']
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, LR: {current_lr:.6f}')

Best Practices for Learning Rate Schedulers

  1. Choose the Right Scheduler: Select the appropriate learning rate scheduling strategy based on task characteristics

  2. Set a Reasonable Initial Learning Rate: The initial learning rate should not be too large or too small

  3. Combine with Validation Set: When using ReduceLROnPlateau, it should be based on validation metrics rather than training metrics

  4. Learning Rate Lower Bound: Set a reasonable min_lr for ReduceLROnPlateau to prevent training stagnation due to extremely small learning rates

  5. Warm-up Phase: For large models, consider using a smaller learning rate for warm-up in the early training stages

  6. Parameter Group Compatibility: When used with parameter groups, ensure that each parameter group has a reasonable initial learning rate